GI-Edition Norbert Ritter, Andreas Henrich, Wolfgang Lehner, Andreas Thor, Steffen Friedrich, Wolfram Wingerath (Hrsg.): BTW 2015 – Workshopband Lecture Notes in Informatics 242 Norbert Ritter, Andreas Henrich, Wolfgang Lehner, Andreas Thor, Steffen Friedrich, Wolfram Wingerath (Hrsg.) Datenbanksysteme für Business, Technologie und Web (BTW 2015) – Workshopband 02. – 03. März 2015 Hamburg Proceedings Norbert Ritter, Andreas Henrich, Wolfgang Lehner, Andreas Thor, Steffen Friedrich, Wolfram Wingerath (Hrsg.) Datenbanksysteme für Business, Technologie und Web (BTW 2015) Workshopband 02. – 03.03.2015 in Hamburg, Germany Gesellschaft für Informatik e.V. (GI) Lecture Notes in Informatics (LNI) - Proceedings Series of the Gesellschaft für Informatik (GI) Volume P-242 ISBN 978-3-88579-636-7 ISSN 1617-5468 Volume Editors Norbert Ritter Universität Hamburg Fachbereich Informatik Datenbanken und Informationssysteme 22527 Hamburg, Germany E-Mail: [email protected] Andreas Henrich Otto-Friedrich-Universität Bamberg Fakultät Wirtschaftsinformatik und Angewandte Informatik Lehrstuhl für Medieninformatik 96047 Bamberg, Germany E-Mail: [email protected] Wolfgang Lehner Technische Universität Dresden Fakultät Informatik Institut für Systemarchitektur 01062 Dresden, Germany Email: [email protected] Andreas Thor Deutsche Telekom Hochschule für Telekommunikation Leipzig Gustav-Freytag-Str. 43-45 04277 Leipzig, Germany E-Mail: [email protected] Steffen Friedrich Universität Hamburg Fachbereich Informatik Datenbanken und Informationssysteme 22527 Hamburg, Germany E-Mail: [email protected] Wolfram Wingerath Universität Hamburg Fachbereich Informatik Datenbanken und Informationssysteme 22527 Hamburg, Germany E-Mail: [email protected] Series Editorial Board Heinrich C. Mayr, Alpen-Adria-Universität Klagenfurt, Austria (Chairman, [email protected]) Dieter Fellner, Technische Universität Darmstadt, Germany Ulrich Flegel, Hochschule für Technik, Stuttgart, Germany Ulrich Frank, Universität Duisburg-Essen, Germany Johann-Christoph Freytag, Humboldt-Universität zu Berlin, Germany Michael Goedicke, Universität Duisburg-Essen, Germany Ralf Hofestädt, Universität Bielefeld, Germany Michael Koch, Universität der Bundeswehr München, Germany Axel Lehmann, Universität der Bundeswehr München, Germany Peter Sanders, Karlsruher Institut für Technologie (KIT), Germany Sigrid Schubert, Universität Siegen, Germany Ingo Timm, Universität Trier, Germany Karin Vosseberg, Hochschule Bremerhaven, Germany Maria Wimmer, Universität Koblenz-Landau, Germany Dissertations Steffen Hölldobler, Technische Universität Dresden, Germany Seminars Reinhard Wilhelm, Universität des Saarlandes, Germany Thematics Andreas Oberweis, Karlsruher Institut für Technologie (KIT), Germany Gesellschaft für Informatik, Bonn 2015 printed by Köllen Druck+Verlag GmbH, Bonn Vorwort In den letzten Jahren hat es auf dem Gebiet des Datenmanagements große Veränderungen gegeben. Dabei muss sich die Datenbankforschungsgemeinschaft insbesondere den Herausforderungen von „Big Data“ stellen, welche die Analyse von riesigen Datenmengen unterschiedlicher Struktur mit kurzen Antwortzeiten im Fokus haben. Neben klassisch strukturierten Daten müssen moderne Datenbanksysteme und Anwendungen semistrukturierte, textuelle und andere multimodale Daten sowie Datenströme in völlig neuen Größenordnungen verwalten. Gleichzeitig müssen die Verarbeitungssysteme die Korrektheit und Konsistenz der Daten sicherstellen. 
Die jüngsten Fortschritte bei Hardware und Rechnerarchitektur ermöglichen neuartige Datenmanagementtechniken, die von neuen Index- und Anfrageverarbeitungsparadigmen (In-Memory, SIMD, Multicore) bis zu neuartigen Speichertechniken (Flash, Remote Memory) reichen. Diese Entwicklungen spiegeln sich in aktuell relevanten Themen wie Informationsextraktion, Informationsintegration, Data Analytics, Web Data Management, Service-Oriented Architectures, Cloud Computing oder Virtualisierung wider. Wie auf jeder BTW-Konferenz gruppieren sich um die Tagung eine Reihe von Workshops, die spezielle Themen in kleinen Gruppen aufgreifen und diskutieren. Im Rahmen der BTW 2015 finden folgende Workshops statt: • Databases in Biometrics, Forensics and Security Applications: DBforBFS • Data Streams and Event Processing: DSEP • Data Management for Science: DMS Dabei fasst der letztgenannte Workshop DMS als Joint Workshop die beiden Initiativen Big Data in Science (BigDS) und Data Management for Life Sciences (DMforLS) zusammen. Mit seinen Schwerpunkten reflektiert das Workshopprogramm aktuelle Forschungsgebiete von hoher praktischer Relevanz. Zusätzlich präsentieren Studenten im Rahmen des Studierendenprogramms die Ergebnisse ihrer aktuellen Abschlussarbeiten im Bereich Datenmanagement. Für jeden Geschmack sollte sich somit ein Betätigungsfeld finden lassen! Die Materialien zur BTW 2015 werden auch über die Tagung hinaus unter http://www.btw-2015.de zur Verfügung stehen. VII Die Organisation einer so großen Tagung wie der BTW mit ihren angeschlossenen Veranstaltungen ist nicht ohne zahlreiche Partner und Unterstützer möglich. Sie sind auf den folgenden Seiten aufgeführt. Ihnen gilt unser besonderer Dank ebenso wie den Sponsoren der Tagung und der GI-Geschäftsstelle. Hamburg, Bamberg, Dresden, Leipzig, im Januar 2015 Norbert Ritter, Tagungsleitung und Vorsitzender des Organisationskomitees Andreas Henrich und Wolfgang Lehner, Leitung Workshopkomitee Andreas Thor, Leitung Studierendenprogramm Wolfram Wingerath, Steffen Friedrich, Tagungsband und Organisationskomitee VIII Tagungsleitung Norbert Ritter, Universität Hamburg Organisationskomitee Felix Gessert Fabian Panse Volker Nötzold Norbert Ritter Anne Hansen-Awizen Steffen Friedrich Wolfram Wingerath Studierendenprogramm Andreas Thor, HfT Leipzig Koordination Workshops Andreas Henrich, Univ. Bamberg Wolfgang Lehner, TU Dresden Tutorienprogramm Norbert Ritter, Univ. Hamburg Thomas Seidl, RWTH Aachen Andreas Henrich, Univ. Bamberg Wolfgang Lehner, TU Dresden Second Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS) Vorsitz: Jana Dittmann, Univ. Magdeburg; Veit Köppen, Univ. Magdeburg; Gunter Saake, Univ. Magdeburg; Claus Vielhauer, FH Brandenburg Ruediger Grimm, Univ. Koblenz Dominic Heutelbeck, FTK Stefan Katzenbeisser, TU Darmstadt Claus-Peter Klas, GESIS Günther Pernul, Univ. Regensburg Ingo Schmitt, BTU Cottbus Claus Vielhauer, FH Brandenburg Sviatoslav Voloshynovskiy, UNIGE, CH Edgar R. Weippl, SBA Research, Austria Data Streams and Event Processing (DSEP) Vorsitz: Marco Grawunder, Univ. Oldenburg, Daniela Nicklas Univ. Bamberg Andreas Behrend, Univ. Bonn Klemens Boehm, KIT Peter Fischer, Univ. Freiburg Dieter Gawlick, Oracle Boris Koldehofe, TU Darmstadt Wolfgang Lehner, TU Dresden Richard Lenz, Univ. Erlangen-Nürnberg Klaus Meyer-Wegener, Univ. ErlangenNürnberg Gero Mühl, Univ. 
Rostock Kai-Uwe Sattler, TU Ilmenau Thorsten Schöler, HS Augsburg IX Joint Workshop on Data Management for Science (DMS) Workshop on Big Data in Science (BigDS) Vorsitz: Birgitta König-Ries, Univ. Jena; Erhard Rahm, Univ. Leipzig; Bernhard Seeger, Univ. Marburg Jens Kattge, MPI für Biogeochemie Alfons Kemper, TU München Meike Klettke, Univ. Rostock Alex Markowetz, Univ. Bonn Thomas Nauss, Univ. Marburg Jens Nieschulze, Univ. Göttingen Kai-Uwe Sattler, TU Ilmenau Stefanie Scherzinger, OTH Regensburg Myro Spiliopoulou, Univ. Magdeburg Uta Störl, Hochschule Darmstadt Alsayed Algergawy, Univ. Jena Peter Baumann, Jacobs Univ. Matthias Bräger, CERN Thomas Brinkhoff, FH Oldenburg Michael Diepenbroeck, AWI Christoph Freytag, HU Berlin Michael Gertz, Univ. Heidelberg Frank-Oliver Glöckner, MPI-MM Anton Güntsch, BGBM Berlin-Dahlem Thomas Heinis, IC, London Thomas Hickler, Senckenberg Workshop on Data Management for Life Sciences (DMforLS) Vorsitz: Sebastian Dorok, Bayer Pharma AG; Matthias Lange, IPK Gatersleben; Gunter Saake, Univ. Magdeburg Matthias Lange, IPK Gatersleben Ulf Leser, HU Berlin Wolfgang Müller, HITS GmbH Erhard Rahm, Univ. Leipzig Gunter Saake, Univ. Magdeburg Uwe Scholz, IPK Gatersleben Can Türker, ETH Zürich Sebastian Breß, TU Dortmund Sebastian Dorok, Bayer Pharma AG Mourad Elloumi, UTM Tunisia Ralf Hofestädt, Univ. Bielefeld Andreas Keller, Saarland Univ. Jacob Köhler, DOW AgroSciences Horstfried Läpple, Bayer HealthCare X Inhaltsverzeichnis Workshopprogramm Second Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS) Jana Dittmann, Veit Köppen, Gunter Saake, Claus Vielhauer Second Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS)....................................................................................................19 Veit Köppen, Mario Hildebrandt, Martin Schäler On Performance Optimization Potentials Regarding Data Classification in Forensics.....21 Maik Schott, Claus Vielhauer, Christian Krätzer Using Different Encryption Schemes for Secure Deletion While Supporting Queries........37 Data Streams and Event Processing (DSEP) Marco Grawunder, Daniela Nicklas Data Streams and Event Processing (DSEP)......................................................................49 Timo Michelsen, Michael Brand, H.-Jürgen Appelrath Modulares Verteilungskonzept für Datenstrommanagementsysteme..................................51 Niko Pollner, Christian Steudtner, Klaus Meyer-Wegener Placement-Safe Operator-Graph Changes in Distributed Heterogeneous Data Stream Systems...........................................................................................................61 Michael Brand, Tobias Brandt, Carsten Cordes, Marc Wilken, Timo Michelsen Herakles: A System for Sensor-Based Live Sport Analysis using Private Peer-to-Peer Networks........................................................................................................71 Christian Kuka, Daniela Nicklas Bestimmung von Datenunsicherheit in einem probabilistischen Datenstrommanagementsystem...........................................................................................81 Cornelius A. 
Ludmann, Marco Grawunder, Timo Michelsen, H.-Jürgen Appelrath Kontinuierliche Evaluation von kollaborativen Recommender-Systeme in Datenstrommanagementsystemen.......................................................................................91 Sebastian Herbst, Johannes Tenschert, Klaus Meyer-Wegener Using Data-Stream and Complex-Event Processing to Identify Activities of Bats .............93 Peter M. Fischer, Io Taxidou Streaming Analysis of Information Diffusion......................................................................95 XI Henrik Surm, Daniela Nicklas Towards a Framework for Sensor-based Research and Development Platform for Critical, Socio-technical Systems........................................................................................97 Felix Beier, Kai-Uwe Sattler, Christoph Dinh, Daniel Baumgarten Dataflow Programming for Big Engineering Data...........................................................101 Joint Workshop on Data Management for Science (DMS) Sebastian Dorok, Birgitta König-Ries, Matthias Lange, Erhard Rahm, Gunter Saake, Bernhard Seeger Joint Workshop on Data Management for Science (DMS) ...............................................105 Alexandr Uciteli, Toralf Kirsten Ontology-based Retrieval of Scientific Data in LIFE .......................................................109 Christian Colmsee, Jinbo Chen, Kerstin Schneider, Uwe Scholz, Matthias Lange Improving Search Results in Life Science by Recommendations based on Semantic Information........................................................................................................115 Marc Schäfer, Johannes Schildgen, Stefan Deßloch Sampling with Incremental MapReduce ...........................................................................121 Andreas Heuer METIS in PArADISE Provenance Management bei der Auswertung von Sensordatenmengen für die Entwicklung von Assistenzsystemen .....................................131 Martin Scharm, Dagmar Waltemath Extracting reproducible simulation studies from model repositories using the CombineArchive Toolkit ...................................................................................................137 Robin Cijvat, Stefan Manegold, Martin Kersten, Gunnar W. 
Klau, Alexander Schönhuth, Tobias Marschall, Ying Zhang Genome sequence analysis with MonetDB: a case study on Ebola virus diversity...........143 Ahmet Bulut RightInsight: Open Source Architecture for Data Science ...............................................151 Christian Authmann, Christian Beilschmidt, Johannes Drönner, Michael Mattig, Bernhard Seeger Rethinking Spatial Processing in Data-Intensive Science ................................................161 Studierendenprogramm Marc Büngener CBIR gestütztes Gemälde-Browsing .................................................................................173 David Englmeier, Nina Hubig, Sebastian Goebl, Christian Böhm Musical Similarity Analysis based on Chroma Features and Text Retrieval Methods .....183 XII Alexander Askinadze Vergleich von Distanzen und Kernel für Klassifikatoren zur Optimierung der Annotation von Bildern .....................................................................................................193 Matthias Liebeck Aspekte einer automatischen Meinungsbildungsanalyse von Online-Diskussionen .........203 Martin Winter, Sebastian Goebl, Nina Hubig, Christopher Pleines, Christian Böhm Development and Evaluation of a Facebook-based Product Advisor for Online Dating Sites.......................................................................................................................213 Daniel Töws, Marwan Hassani, Christian Beecks, Thomas Seidl Optimizing Sequential Pattern Mining Within Multiple Streams......................................223 Marcus Pinnecke Konzept und prototypische Implementierung eines föderativen Complex Event Processing Systeme mit Operatorverteilung...........................................................233 Monika Walter, Axel Hahn Unterstützung von datengetriebenen Prozessschritten in Simulationsstudien durch Verwendung multidimensionaler Datenmodelle.....................................................243 Niklas Wilcke DduP – Towards a Deduplication Framework utilising Apache Spark............................253 Tutorienprogramm Christian Beecks, Merih Uysal, Thomas Seidl Distance-based Multimedia Indexing ...............................................................................265 Kai-Uwe Sattler, Jens Teubner, Felix Beier, Sebastian Breß Many-Core-Architekturen zur Datenbankbeschleunigung ...............................................269 Felix Gessert, Norbert Ritter Skalierbare NoSQL- und Cloud-Datenbanken in Forschung und Praxis .........................271 Jens Albrecht, Uta Störl Big-Data-Anwendungsentwicklung mit SQL und NoSQL .................................................275 XIII Workshopprogramm Second Workshop on Databases in Biometrics, Forensics and Security Applications Second Workshop on Databases in Biometrics, Forensics and Security Applications Jana Dittmann1 , [email protected] Veit Köppen1 , [email protected] Gunter Saake1 , [email protected] Claus Vielhauer2 , [email protected] 1 2 Otto-von-Guericke-University Magdeburg Brandenburg University of Applied Science The 1st Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS) was held as satellite workshop of the BTW 2013. The workshop series is intended for disseminating knowledge in the areas of databases in the focus for biometrics, forensics, and security complementing the regular conference program by providing a place for in-depth discussions of this specialized topic. 
The workshop will consist of two parts: First, presentation of accepted workshop papers and second, a discussion round. In the discussion round, the participants will derive research questions and goals to address important issues in the domain databases and security. We expect the workshop to facilitate cross-fertilization of ideas among key stakeholders from academia, industry, practitioners and government agencies. Theoretical and practical coverage of the topics will be considered. We also welcome software and hardware demos. Full and short papers are solicited. Motivated by today’s challenges from both disciplines several topics include but are not limited to: • approaches increasing the search speed in databases for biometrics, forensics and security, • database validation procedures for integrity verification of digital stored content • design aspects to support multimodal biometric evidence and its combination with other forensic evidence • interoperability methodologies and exchange protocols of data of large-scale operational (multimodal) databases of identities and biometric data for forensic case assessment and interpretation, forensic intelligence and forensic ID management • database security evaluation and benchmarks for forensics and biometric applications • the role of databases in emerging applications in Biometrics and Forensics • privacy, policy, legal issues, and technologies in databases of biometric, forensic and security data. 19 1 Workshop Organizers Jana Dittmann (Otto-von-Guericke-University Magdeburg) Veit Köppen (Otto-von-Guericke-University Magdeburg) Gunter Saake (Otto von Guericke University Magdeburg) Claus Vielhauer (Brandenburg University of Applied Science) 2 Program Committee Ruediger Grimm (University of Koblenz, DE) Dominic Heutelbeck (FTK, DE) Stefan Katzenbeisser (Technical University Darmstadt, DE) Claus-Peter Klas (GESIS, DE) Günther Pernul (Universität Regensburg, DE) Ingo Schmitt (Brandenburg University of Technology, DE) Claus Vielhauer (Brandenburg University of Applied Science, DE) Sviatoslav Voloshynovskiy (unige, CH) Edgar R. Weippl (sba-research, Austria) 20 On Performance Optimization Potentials Regarding Data Classification in Forensics Veit Köppen, Mario Hildebrandt, Martin Schäler Faculty of Computer Science Otto-von-Guericke-University Magdeburg Universitätsplatz 2 39106 Magdeburg [email protected] [email protected] [email protected] Abstract: Classification of given data sets according to a training set is one of the essentials bread and butter tools in machine learning. There are several application scenarios, reaching from the detection of spam and non-spam mails to recognition of malicious behavior, or other forensic use cases. To this end, there are several approaches that can be used to train such classifiers. Often, scientists use machine learning suites, such as WEKA, ELKI, or RapidMiner in order to try different classifiers that deliver best results. The basic purpose of these suites is their easy application and extension with new approaches. This, however, results in the property that the implementation of the classifier is and cannot be optimized with respect to response time. This is due to the different focus of these suites. However, we argue that especially in basic research, systematic testing of different promising approaches is the default approach. Thus, optimization for response time should be taken into consideration as well, especially for large scale data sets as they are common for forensic use cases. 
To this end, we discuss in this paper, in how far well-known approaches from databases can be applied and in how far they affect the classification result of a real-world forensic use case. The results of our analyses are points and respective approaches where such performance optimizations are most promising. As a first step, we evaluate computation times and model quality in a case study on separating latent fingerprint patterns. 1 Motivation Data are drastically increased in a given time period. This is not only true for the number of data sets (comparable to new data entries), but also with respect to dimensionality. To get into control of this information overload, data mining techniques are used to identify patterns within the data. Different application domains require for similar techniques and therefore, can be improved as the general method is enhanced. In our application scenario, we are interested in the identification of patterns in data that are acquired from latent fingerprints. Within the acquired scanned data a two-class classification is of interest, to identify the fingerprint trace and the background noise. As point 21 of origin, experts classify different cases. This supervised approach is used to learn a classification and thus, to support experts in their daily work. With a small number of scanned data sets that the expert has to check and classify, a high number of further data sets can be automatically classified. Currently, the system works in a semi-automatic process and several manual steps have to be performed. Within this paper, we investigate the influence on system response and model quality, in terms of accuracy and precision, in the context of integrating the data and corresponding processes in a holistic system. Although a complete integration is feasible, different tools are currently used, which do not fully cooperate. Therefore, the efficiency or optimization regarding computation or response time are not in the focus of this work. With this paper, we step forward to create a cooperating and integrated environment that performs efficient with respect to model quality. This paper is structured as follows: In the next section, we briefly present some background regarding classification and database technologies for accessing multi-dimensional data. In Section 3, we describe the case study that is the motivation for our analysis. Within Section 4, we present our evaluation on the case study data regarding optimization due to feature and data space reduction. Finally, we conclude our work in Section 5. 2 Background In this section, we give background on classification algorithms in general. Then, we explain one of these algorithms that we apply in the remainder of this paper in more details. Finally, we introduce promising optimization approaches known from databases. We use these approaches in the remainder to discuss their optimization potential with respect to classification. 2.1 Classification Algorithms In the context of our case study in Section 3, several classification algorithms can be utilized, see, e.g., [MKH+ 13]. Each of those algorithms is used for supervised learning. Such type of learning consists of a model generation based on training data, which are labeled according to a ground-truth. The utilized classification algorithms in [MKH+ 13] partition the feature space to resemble the distribution of each instance (data point) in this space. 
Afterward, the quality of the model can be evaluated using an independent set of labeled test data by comparing the decision of the classifier with the assigned label. The utilized classification schemes from the WEKA data mining software [HFH+ 09] in [MKH+ 13] include support vector machines, multilayer perceptrons, rule based classifiers, decision trees, and ensemble classifiers. The latter ones combine multiple models in their decision process. 22 C4.5 decision tree In this paper, we use the classifier J48, WEKA’s [HFH+ 09] implementation of the fast C4.5 decision tree [Qui93], which is an improvement of the ID 3 algorithm [Qui86] and one of the most widely known decision tree classifiers for such problems. The advantage of decision trees is their comprehensiveness: the classifier’s decision is a leaf reached by a path of single feature thresholds. The size of the tree is reduced by a pruning algorithm which replaces subtrees. Furthermore, this particular implementation is able to deal with missing values. In order to do that, the distribution of the available values for this particular feature is taken into account. 1 build_tree ( Data (R:{r_1,..,,r_n},C) R: non-categorical_attributes r_1 to r_n, 2 3 C: categorical attribute, 4 S: training set in same schema as Data) 5 returning decision_tree; 6 7 begin 8 -- begin exceptions 9 If is_empty(S) 10 return FAILURE; If only_one_category(DATA) 11 return single_node_tree(value(C)); 12 If is_empty(R) 13 return single_node_tree(most_frequent_value(C)); 14 15 -- end excpetions 16 Attribute r_d (elem of R) := largest_Gain(R,S); 17 {d_i| i=1,2, .., m} := values_of_attribute(r_d); 18 {S_i| i=1,2, .., k} := subsets(S) where in each subset value(r_d) = d_i holds; decision_tree := tree(r_d) with nodes { d_1, d_2, .., d_m} pointing to trees 19 call ID3((R-{r_d}, C), S1), ID3((R-{r_d}), C, S2), .., ID3((R-{r_d}, C), S_k); 20 return decision_tree; 21 22 end build_tree; Figure 1: Algorithm to build a C4.5 decision tree, adapted from [Qui86] In Figure 1, we depict the general algorithm to build a C4.5 decision tree. The argument for the algorithm is a training set consisting of: (1) n non-categorical attributes R reaching from r1 to rn , (2) the categorical attribute (e.g., spam or not spam), and (3) a training set with the same schema. In Lines 8 to 15, the exception handling is depicted, for instance if there are only spam mails (Line 11). The actual algorithm tries to find the best attribute rd and distributes the remaining tuples in S according to their value in rd . For each subtree that is created in that way the algorithm is called recursively. 2.2 Approaches for Efficient Data Access Data within databases have to be organized in such a way that they are efficiently accessed. In the case of multi-dimensional data, an intuitive order does not exist. This is even more apparent for the identification of unknown patterns, where an ordering in a multi-dimensional space always dominates some dimensions. For these reasons, differ- 23 ent approaches have been proposed. They can be differentiated into storage and index structures. Typical storage improvements within the domain of Data Warehousing [KSS14a] are column-oriented storage [AMH08], Iceberg-Cube [FSGM+ 98], and Data Dwarf [SDRK02]. Whereas the Iceberg-Cube reduces computational effort, column-oriented storage improves the I/O with respect to the application scenario, where operations are performed in a column-oriented way. 
The Data Dwarf heavily reduces the stored data volume without loss of information. It combines computational effort and I/O cost for improving efficiency. Furthermore, there exist many different index structures for specialized purposes [GBS+ 12]. Very well-known index structures for multi-dimensional purposes are the kd-Tree [Ben75] and R-Tree [Gut84]. Both mentioned indexes are candidates, which suffer especially from the curse of dimensionality. The curse of dimensionality is a property of large and sparsely populated high-dimensional spaces, which results in the effect that for tree-based indexes often large parts have to be taken into consideration for a query (e.g., because of node overlaps). To this end, several index structures, as the Pyramid technique [BBK98] or improved sequential scans, such as the VA-File [WB97] are proposed. In the following, we briefly explain some well-known indexes that, according to prior evaluations [SGS+ 13], result in a significant performance increase. A broader overview on index structures can be found in [GG98] or [BBK01]. Challenges regarding parameterization of index structures as well as implementation issues are discussed in [AKZ08, KSS14b, SGS+ 13]. 2.2.1 Column vs. Row Stores Traditionally, database systems store their data row-wise. That means that each tuple with all its attributes is stored and then the next tuple follows. By contrast, columnar storage means that all values of a column are stored sequentially and then the next column follows. Dependent on the access pattern of a classification algorithm, the traditional row-based storage should be replaced if, for instance, one dimension (column) is analyzed to find an optimal split in this dimension. In this case, we expect a significant performance benefit. 2.2.2 Data Dwarf The basic idea of the Data Dwarf storage structure is to use prefix and suffix redundancies for multi-dimensional points to compress the data. For instance, the three dimensional points A(1, 2, 3) and B(1, 2, 4) share the same pre-fix (1, 2, ). As a result, the Dwarf has two interesting effects that are able to speed-up classifications. Firstly, due to the compression, we achieve an increased caching performance. Secondly, the access path is stable, which means that we require exactly the number of dimension look-ups to find a point (e.g., three look-ups for three dimensional points). 24 2.2.3 kd-Tree A kd-Tree index is a multi-dimensional adaption of the well-known B-Tree cycling through the available dimensions. Per tree level, this index distributes the remaining points in the current subtree into two groups. One group in the left subtree where the points have a value smaller or equal than the separator value in the current dimension, while the remaining points belong to the right sub tree. The basic goal is to achieve logarithmic effort for exact match queries. In summary, this index structure can be used to efficiently access and analyze single dimensions in order to separate two classes. 2.2.4 VA-File Tree-based index structures suffer from the curse of dimensionality. This may result in the effect that they are slower than a sequential scan. To this end, improvements of the sequential scan are proposed. The basic idea of the Vector Approximation File is to use a compressed approximation of the existing data set that fits into the main memory (or caches). On this compressed data an initial filter step is performed in order to minimize actual point look-ups. 
In how far this technique can be applied to speed up classifications is currently unknown.

3 Case Study

As described in [HKDV14], the classification of contact-less scans of latent fingerprints is performed using a block-based approach. The following subsections summarize the application scenario, the data gathering process, and a description of the feature space. We depict this process in Fig. 2 and describe the steps in the following in more detail.

[Figure 2: Data acquisition process, processing, and classification — a contactless CWL sensor scan of the substrate is filtered (first- and second-order Sobel operators in X/Y combined, Sobel in X and Y separately, unsharp masking), segmented into blocks, and passed through feature extraction (statistical, structural, and semantic features, fingerprint ridge orientation semantics), feature selection, and classification into fingerprint (FP) and background (BG) blocks.]

3.1 Application scenario

The application scenario for this case study is the contact-less, non-invasive acquisition of latent fingerprints. The primary challenge of this technique is the inevitable acquisition of the substrate characteristics superimposing the fingerprint pattern. Depending on the substrate, the fingerprint can be rendered invisible. In order to allow for a forensic analysis of the fingerprint, it is necessary to differentiate between areas of the surface without fingerprint residue and others covered with fingerprint residue (fingerprint segmentation). For this first evaluation, we solely rely on a white furniture surface, because it provides a rather large difference between the substrate and the fingerprint. The classification accuracy achieved in a two-fold cross-validation based on 10 fingerprint samples is 93.1% for the J48 decision tree in [HKDV14]. The number of 10 fingerprints is sufficient for our evaluation, because we do not perform a biometric analysis. Due to the block-based classification, 1,003,000 feature vectors are extracted. For our extended 600-dimensional feature space (see Section 3.3), we achieve a classification accuracy of 90.008% based on 501,500 data sets for each of the two classes "fingerprint" and "substrate".

3.2 Data Gathering Process

The data gathering process utilizes an FRT CWL600 [Fri14] sensor mounted to an FRT MicroProf200 surface measurement device. This particular sensor exploits the effect of chromatic aberration of lenses to measure the distance and the intensity of the reflected light simultaneously. Due to this effect, the focal length differs between wavelengths. Thus, only one wavelength from the source of white light is focused at a time. This particular wavelength yields the highest intensity in the reflected light. Hence, it can be easily detected using a spectrometer by locating the maximum within the spectrum. The intensity value is derived from the amplitude of this peak within the value range [1; 4,095]. The wavelength of the peak can be translated into a distance between the sensor and the measured object using a calibration table. The achieved resolution for this distance is 20 nm. The data themselves are stored within a 16-bit integer array which can afterward be converted to a floating point distance value. The CWL600 is a point sensor which acquires the sample point-by-point while the sample is moved underneath. Thus, it is possible to select arbitrary lateral resolutions for the acquisition of the sample.
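For orientation, a chosen lateral dot distance translates into an equivalent scan resolution via the length of an inch (25,400 µm); the 10 µm spacing used in this case study (see below) therefore corresponds to roughly 2,540 ppi:

\text{resolution [ppi]} = \frac{25{,}400\ \mu\text{m/inch}}{\text{lateral dot distance}\ [\mu\text{m}]}, \qquad \frac{25{,}400\ \mu\text{m/inch}}{10\ \mu\text{m}} = 2{,}540\ \text{ppi} \approx 5 \times 500\ \text{ppi}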
In our case study, we use a lateral dot distance of 10 µm, which results in a resolution five times as high as the commonly used resolution of 500 ppi in biometric systems.

3.3 Data Description

The feature space in [HKDV14] contains statistical, structural, and fingerprint semantic features. The final feature space is extracted from the intensity and topography data (see Section 3.2) and preprocessed versions of these data sets. Table 1 summarizes the 50 features which are extracted from each data set.

Table 1: Overview of the extracted features
  Statistical Features: minimum value; maximum value; span; mean value; median value; variance; skewness; kurtosis; mean squared error; entropy; globally and locally normalized values of absolute min, max, median; globally and locally normalized values of relative min, max, span, median; globally normalized absolute and relative mean value of B
  Structural Features: covariance of upper and lower half of a block B; covariance of left and right half of the block B; line variance of a block B; column variance of a block B; most significant digit frequency derived from Benford's Law [Ben38] (9 features); Hu moments [Hu62] (7 features)
  Fingerprint Semantic Features: maximum standard deviation in BM after Gabor filtering; mean value of the block B for the highest Gabor response

All features are extracted from blocks with a size of 5×5 pixels, with the exception of the fingerprint semantic feature of the maximum standard deviation in BM after Gabor filtering. The fingerprint semantic features are motivated by fingerprint enhancement, e.g. [HWJ98], which utilizes Gabor filters for emphasizing the fingerprint pattern after determining the local ridge orientation and frequency. Since this filtering relies on a ridge-valley pattern, it requires larger blocks. In particular, we use a block size of 1.55 by 1.55 mm (155×155 pixels) as suggested in [HKDV14]. The features are extracted from the original and pre-processed data. In particular, the intensity and topography data are pre-processed using Sobel operators of first and second order in X and Y direction combined, Sobel operators of first order in X as well as Y direction separately, and unsharp masking (subtraction of a blurred version of the data). As a result, we get a 600-dimensional feature space. However, some of the features cannot be determined, e.g., due to a division by zero in case of the relative statistical features. Thus, either the classifier must be able to deal with missing values, or those features need to be excluded. To this end, we apply the J48 classifier, because it handles missing data.

4 Evaluation

In this section, we present the evaluation of the classification according to the J48 algorithm. We restrict this study to performance measurements of the computation time for building the model and for the evaluation of the model. According to Section 2.2, we identify the cardinality of the dimensions and the involved features as influences on the performance. Therefore, we investigate the model quality with respect to precision and recall. First, we present our evaluation setup. This is followed by the result presentation. Finally, we discuss our findings.

Table 2: Contingency table
                        Test outcome positive   Test outcome negative
  Condition positive    True Positive (TP)      False Negative (FN)
  Condition negative    False Positive (FP)     True Negative (TN)

4.1 Setup

Our data are preprocessed as described in Section 3. We use the implementation of C4.5 [Qui93] in WEKA, which is called J48.
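As a rough illustration of this setup, the following sketch trains a decision tree on one half of the labelled feature vectors and tests it on the other half. It is only an analogous stand-in: scikit-learn's CART implementation is used instead of WEKA's J48, and the synthetic data merely mimics the shape of the 600-dimensional block feature space; unlike J48, this estimator may require imputing missing feature values first.

# Analogous sketch of the evaluation setup; scikit-learn's CART tree stands in
# for WEKA's J48, and the synthetic data only imitates the 600-dimensional,
# two-class ("fingerprint" vs. "substrate") block feature space.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, n_features=600,
                           n_informative=50, random_state=0)

# Two equally sized halves, mirroring the two-fold cross-validation style.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy")   # C4.5-like impurity measure
tree.fit(X_train, y_train)
print("accuracy:", tree.score(X_test, y_test))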
To identify relationships between the included feature dimensions and feature cardinalities on the one hand and model build and evaluation times on the other, we use different performance measures for the model. We briefly describe these model performance measures in the following. In classification, the candidates can be classified correctly or incorrectly. Compared to the test population, four cases are possible, as presented in Table 2. In the following, we define measures that can be derived from the contingency table.

The recall (also called sensitivity or true positive rate) relates the correctly identified positive elements to all actually positive elements. This measure is defined as:

\text{Recall} = \frac{TP}{TP + FN}    (1)

Accuracy describes all correctly classified positive and negative elements compared to all elements. This measure assumes a non-skewed distribution of classes within the training as well as the test data. It is defined as:

\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}    (2)

Precision is also called positive prediction rate and relates the correctly identified positives to all elements classified as positive. It is defined as:

\text{Precision} = \frac{TP}{TP + FP}    (3)

Specificity is also called true negative rate and is a ratio comparing the correctly classified negative elements to all actually negative elements. It is defined as:

\text{Specificity} = \frac{TN}{FP + TN}    (4)

The measure Balanced Accuracy is applied in the case that the classes are not equally distributed and thus takes non-symmetric distributions into account. The balance is achieved by computing the arithmetic mean of Recall and Specificity, and it is defined as:

\text{Balanced Accuracy} = \frac{\text{Recall} + \text{Specificity}}{2} = \frac{1}{2} \cdot \left( \frac{TP}{TP + FN} + \frac{TN}{FP + TN} \right)    (5)

The F-Measure is the harmonic mean of precision and recall to deal with both interacting indicators at the same time. This results in:

\text{F-Measure} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}    (6)

Depending on the application scenario, a performance measure can be used for optimization. In Fig. 3, we depict all of the above performance measures for different filterings of our data set.

[Figure 3: Performance Measures for different filters of the test case — F-Measure, Accuracy, Balanced Accuracy, Precision, Recall, and Specificity plotted over the number of dimensions/features (300 to 600); quality values lie roughly between 0.85 and 0.90.]

In our evaluation, we investigate two different performance influences. On the one hand, we are interested in filtering out correlated data columns. On the other hand, we measure performance for a restricted data space domain. This is applied by a data discretization. Evaluation is based on three important aspects:
• building the model in terms of computation time,
• testing the model in terms of computation time, and
• quality of the model measured in model performance indicators.

From Fig. 3, it can be seen that the computed models have a higher specificity than recall. This also results in a lower F-Measure. Furthermore, it can be seen that the training data are not imbalanced and accuracy is very close to balanced accuracy. However, all values are close and in an acceptable range. Therefore, we use the F-Measure as the model performance measure for the remainder of our result presentation.

To reduce the data space, we secondly discretize each feature. This is computed in such a way that the variance within a feature is retained as well as possible.
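The variance-retaining discretization is not spelled out in more detail here; as one plausible reading, the following sketch restricts every feature to a fixed cardinality using quantile-based binning (the binning strategy is our assumption, not the paper's method).

# One possible discretization of each feature column to a fixed cardinality.
# Quantile-based binning via scikit-learn's KBinsDiscretizer is used here as
# an assumption; the paper's exact variance-retaining scheme may differ.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

def discretize(X: np.ndarray, cardinality: int) -> np.ndarray:
    """Map every feature to at most `cardinality` ordinal values."""
    binner = KBinsDiscretizer(n_bins=cardinality, encode="ordinal",
                              strategy="quantile")
    return binner.fit_transform(X)

# Example: restrict a 600-dimensional feature space to 32 values per feature.
X_discrete = discretize(np.random.rand(1000, 600), cardinality=32)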
Currently, there are no data structures within the WEKA environment that use restricted data spaces efficiently. Therefore, we assume that model creation and model evaluation times are not significantly influenced. However, as a database system can be used in the future, the question arises which quality influence on model performance is achieved by discretization. Therefore, we conduct an evaluation series with discretized feature dimensions, where all feature dimensions are restricted to the following cardinalities:
• 8 values,
• 16 values,
• 32 values,
• 64 values,
• 128 values,
• 256 values,
• 512 values,
• 1,024 values,
• 2,048 values, and
• full cardinality.

4.2 Result Presentation

As a first investigation of our evaluation scenario, we present results regarding the elimination of features. For the feature elimination, we choose a statistical approach in which correlated data columns are eliminated from the data set (see the code sketch below). In Fig. 4, we present the dimensions that remain in the data set. On the x-axis, we show the correlation criterion that is used for elimination. For instance, a correlation criterion of 99% means that all data columns are eliminated from the data set that have a correlation of 0.99 to another feature within the data set. Note that we compare every feature column with every other column and, at an elimination decision, we keep the first one in the data set. Therefore, we prefer the earlier data columns within the data set. Furthermore, we also tested the feature reduction for discretized data sets. With a small cardinality, the feature reduction due to correlation is lower, which means that the dimensional space is higher compared to the others.

[Figure 4: Feature Reduction by Correlation — number of remaining dimensions (roughly 300 to 600) over the correlation criterion (0.5 to 1.0) for the full-cardinality data set and the discretized data sets (8 to 2,048 values).]

In the following, we evaluate the reduction of the feature space in terms of computational effort. We differentiate two cases for this effort: On the one hand, the model building time represents the computational performance for creating (learning) the model. As the amount of data tuples for learning the model, we use 501,500 elements. As a second measurement, we present evaluation times where 501,500 further elements are used in a testing phase of the model. This additionally leads to the quality indicators of the model presented in Section 4.1. We present this information afterward.

In Fig. 5, we present the model creation times for the different data sets. With a decrease of the feature space, the computation time reduces, too. However, some jumps are identifiable. These are related to the fact that the algorithm has a dynamic model complexity. This means that the number of nodes within the model is not restricted and therefore, smaller models can have a faster generation time. Nevertheless, we do not focus on optimization for our case study, but we derive a general relationship. From our data, we can derive that the decrease levels off for data that are not more than 85% correlated. This leads to a slower reduction in computation time. However, with this elimination of 85%-correlated values, the computational effort is reduced to approximately one third. An important result from the model generation: a restrictive discretization (cases 8 and 16) negatively influences the model building time.
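As referenced above, a minimal sketch of this correlation-based elimination rule, assuming the feature columns live in a pandas DataFrame whose column order encodes the preference for earlier features:

# Scan columns left to right and drop any column whose absolute Pearson
# correlation with an already kept column reaches the threshold, so the
# earlier (kept) column is preferred. Names and threshold are illustrative.
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return df[kept]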
Note that we do not use an optimized data structure in our evaluation, which would have a significant influence on the computational performance; see also Section 2.2. Although the underlying data structure is generic, a restriction of the feature cardinalities improves model building times for the cases of cardinality 32 and higher. For the evaluation times of the model, a similar behavior is identifiable. In Fig. 6, we present the evaluation times for the same data sets. Two major differences can easily be seen: On the one hand, the difference between the test cases is smaller and the slopes are smoother. On the other hand, the reduction of the evaluation time is optimal for cardinalities of 32 and 64. An increase of the cardinalities leads to a higher computational effort. This is attributed to the fact that the sequential searches within the data are quite important for the testing phase of a model. The usage of efficient data structures should therefore be in the focus of future studies.

Both evaluations presented above focus only on computation time. However, we have to respect the quality of the model at the same time. Within classification applications, an increased information usage (in terms of data attributes) can increase the model quality. A reduction of the information space might lead to a lower model quality.

[Figure 5: Model Build Time and Figure 6: Model Evaluation Time — time in seconds (roughly 2,000 to 10,000) over the erased correlations (0.5 to 1.0) for the full-cardinality data set and the discretized data sets (8 to 2,048 values).]

In Fig. 7, we show the relationship between erased correlations and the F-Measure. Note that an increase in the F-Measure is also possible for a reduced data space (e.g., in the case of full cardinality). With a reduced cardinality in the information space, a lower F-Measure is achieved. This is especially true for low cardinalities (e.g., 8 or 16). However, when the correlation criterion is reduced from 0.95 to 0.9, a higher decrease in the F-Measure is identifiable. A second significant reduction of the F-Measure occurs at the 0.7 correlation elimination level. In Fig. 8, we present the relationship between model build times and the model quality. Although a negative dependency is assumed, this trend is only applicable to some parts of the evaluation space. Regarding the joint optimization of model quality and computation time, the first large decrease in model quality occurs at an elimination of 0.95-correlated values. Further eliminations do not reduce the model build times to a similar degree. Overall, we have to state that our reduction of the data space is quite high compared to the reduction of the model quality in terms of the F-Measure. Note that the other model performance measures behave quite similarly.

4.3 Discussion

With our evaluation, we focus on the influences of the data space on model performance in terms of quality and computation times. Therefore, we reduce the information space in two ways. On the one hand, we restrict dimensionality by applying a feature reduction by correlation. This is also called canonical correlation analysis.
It can be computed in a very efficient way and is therefore much faster than other feature reduction techniques, e.g., principal component analysis or linear discriminant analysis. Furthermore, we restrict the cardinality of the feature spaces, too. We discretize the feature space and are interested in the influence on model quality. An influence on the model build times is not assumed, due to the fact that the underlying data structures are not optimized. We will focus on this in future work, cf. [BDKM14]. Due to the column-wise data processing of the classifiers, we assume that a change in the underlying storage structure, e.g., column stores or Data Dwarfs, leads to a significant computational performance increase. First analyses of the WEKA implementation reveal a high integration effort. However, the benefits are very promising.

[Figure 7: Model Quality and Reduction — F-Measure (roughly 0.82 to 0.90) over the erased correlations (0.5 to 1.0). Figure 8: Model Quality and Build Times — F-Measure over the model build time in seconds (4,000 to 10,000); both for the full-cardinality data set and the discretized data sets (8 to 2,048 values).]

5 Conclusion

We present some ideas on improving model quality and computational performance for a classification problem. This work is a starting point for enhancing the process with respect to optimizing computation times in a biometric scenario. Additional use cases, e.g., indicator simulation [KL05], other data mining techniques [HK00], or operations in a privacy-secure environment [DKK+ 14], can be applied to our main idea and have to be considered for filtering and reduction techniques. With our evaluation study, we show that performance with respect to computation times as well as model quality can be optimized. However, a trade-off between both targets has to be achieved due to their inter-dependencies. In future work, we want to improve the process by integrating and optimizing the different steps. We assume that an efficient data access structure is beneficial for model computation times and therefore broadens the application scenario. However, this computational improvement relies on the information space, especially on the dimensional cardinality and the number of involved dimensions. With an easy-to-apply algorithm, such data processing enables a fast transformation of the feature space and smooths the way for more efficient data mining in forensic scenarios.

Acknowledgment

The work in this paper has been funded in part by the German Federal Ministry of Education and Research (BMBF) through the Research Program "DigiDak+ SicherheitsForschungskolleg Digitale Formspuren" under Contract No. FKZ: 13N10818.

References

[AKZ08] Elke Achtert, Hans-Peter Kriegel, and Arthur Zimek. ELKI: A Software System for Evaluation of Subspace Clustering Algorithms. In SSDBM, LNCS (5069), pages 580–585. Springer, 2008.
[AMH08] Daniel J. Abadi, Samuel Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How different are they really? In Proceedings of the International Conference on Management of Data (SIGMOD), pages 967–980, Vancouver, BC, Canada, 2008.
[BBK98] Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel. The Pyramid-technique: Towards Breaking the Curse of Dimensionality. SIGMOD Rec., 27(2):142–153, 1998.
[BBK01] Christian Böhm, Stefan Berchtold, and Daniel A. Keim.
Searching in Highdimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases. ACM Comput. Surv., 33(3):322–373, 2001. [BDKM14] David Broneske, Sebastian Dorok, Veit Köppen, and Andreas Meister. Software Design Approaches for Mastering Variability in Database Systems. In GvDB, 2014. [Ben38] Frank Benford. The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78(4):551–572, 1938. [Ben75] Jon Louis Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM, 18(9):509–517, 1975. [DKK+ 14] Jana Dittmann, Veit Köppen, Christian Krätzer, Martin Leuckert, Gunter Saake, and Claus Vielhauer. Performance Impacts in Database Privacy-Preserving Biometric Authentication. In Rainer Falk and Carlos Becker Westphall, editors, SECURWARE 2014: The Eighth International Conference on Emerging Security Information, Systems and Technologies, pages 111–117. IARA, 2014. [Fri14] Fries Research & Technology GmbH. Chromatic White Light Sensor CWL, 2014. http://www.frt-gmbh.com/en/chromatic-white-light-sensor-frt-cwl.aspx. [FSGM+ 98] Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing Iceberg Queries Efficiently. In Proceedings of the 24rd International Conference on Very Large Data Bases, VLDB ’98, pages 299–310, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. [GBS+ 12] Alexander Grebhahn, David Broneske, Martin Schäler, Reimar Schröter, Veit Köppen, and Gunter Saake. Challenges in finding an appropriate multi-dimensional index structure with respect to specific use cases. In Ingo Schmitt, Sascha Saretz, and Marcel Zierenberg, editors, Proceedings of the 24th GI-Workshop ”Grundlagen von Datenbanken 2012”, pages 77–82. CEUR-WS, 2012. urn:nbn:de:0074-850-4. 34 [GG98] Volker Gaede and Oliver Günther. Multidimensional Access Methods. ACM Comput. Surv., 30:170–231, 1998. [Gut84] Antonin Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Rec., 14(2):47–57, 1984. [HFH+ 09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1):10 – 18, 2009. [HK00] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2000. [HKDV14] Mario Hildebrandt, Stefan Kiltz, Jana Dittmann, and Claus Vielhauer. An enhanced feature set for pattern recognition based contrast enhancement of contact-less captured latent fingerprints in digitized crime scene forensics. In Adnan M. Alattar, Nasir D. Memon, and Chad D. Heitzenrater, editors, SPIE Proceedings: Media Watermarking, Security, and Forensics, volume 9028, pages 08/01–08/15, 2014. [Hu62] Ming-Kuei Hu. Visual pattern recognition by moment invariants. Information Theory, IRE Transactions on, 8(2):179–187, 1962. [HWJ98] Lin Hong, Yifei Wan, and A. Jain. Fingerprint image enhancement: algorithm and performance evaluation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):777 –789, aug 1998. [KL05] Veit Köppen and Hans-J. Lenz. Simulation of Non-linear Stochastic Equation Systems. In S.M. Ermakov, V.B. Melas, and A.N. Pepelyshev, editors, Proceeding of the Fifth Workshop on Simulation, pages 373–378, St. Petersburg, Russia, July 2005. NII Chemistry Saint Petersburg University Publishers. [KSS14a] Veit Köppen, Gunter Saake, and Kai-Uwe Sattler. Data Warehouse Technologien. MITP, 2 edition, Mai 2014. 
[KSS14b] Veit Köppen, Martin Schäler, and Reimar Schröter. Toward Variability Management to Tailor High Dimensional Index Implementations. In RCIS, pages 452–457. IEEE, 2014. [MKH+ 13] Andrey Makrushin, Tobias Kiertscher, Mario Hildebrandt, Jana Dittmann, and Claus Vielhauer. Visibility enhancement and validation of segmented latent fingerprints in crime scene forensics. In SPIE Proceedings: Media Watermarking, Security, and Forensics, volume 8665, 2013. [Qui86] John Ross Quinlan. Induction of Decision Trees. Mach. Learn., 1(1):81–106, 1986. [Qui93] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. [SDRK02] Yannis Sismanis, Antonios Deligiannakis, Nick Roussopoulos, and Yannis Kotidis. Dwarf: Shrinking the PetaCube. In SIGMOD, pages 464–475. ACM, 2002. [SGS+ 13] Martin Schäler, Alexander Grebhahn, Reimar Schröter, Sandro Schulze, Veit Köppen, and Gunter Saake. QuEval: Beyond high-dimensional indexing à la carte. PVLDB, 6(14):1654–1665, 2013. [WB97] Roger Weber and Stephen Blott. An Approximation-Based Data Structure for Similarity Search. Technical Report ESPRIT project, no. 9141, ETH Zürich, 1997. 35 Using Different Encryption Schemes for Secure Deletion While Supporting Queries Maik Schott, Claus Vielhauer, Christian Krätzer Department Informatics and Media Brandenburg University of Applied Sciences Magdeburger Str. 50 14770 Brandenburg an der Havel, Germany [email protected] [email protected] Department of Computer Science Otto-von-Guericke-University Magdeburg Universitaetsplatz 2 39106 Magdeburg, Germany [email protected] Abstract: As more and more private and confidential data is stored in databases and in the wake of cloud computing services hosted by third parties, the privacyaware and secure handling of such sensitive data is important. The security of such data needs not only be guaranteed during the actual life, but also at the point where they should be deleted. However, current common database management systems to not provide the means for secure deletion. As a consequence, in this paper we propose several means to tackle this challenge by means of encryption and how to handle the resulting shortcomings with regards to still allowing queries on encrypted data. We discuss a general approach on how to combine homomorphic encryption, order preserving encryption and partial encryption as means of depersonalization, as well as their use on client-side or server-side as system extensions. 1 Introduction and state of the art With the increase of data in general stored in databases especially its outsourcing into cloud services, privacy-related informations are also becoming more and more prevalent. Therefore there is an increasing need of maintaining the privacy, confidentiality, and in general security of such data. Additionally privacy is required by several national laws, like the Family Educational Rights and Privacy Act and the Health Insurance Portability and Accountability Act of the United States, the Federal Data Protection Act (Bundesdatenschutzgesetz) of Germany, or the Data Protection Directive (Directive 95/46/EC) of the European Union. All these legal regulations require the timely and guaranteed – in the sense that it is impossible to reconstruct – removal of private information. Such removal is called forensic secure deletion. 37 Aside from the regular challenges of this issue, e.g. 
the behavior of magnetic media to partially retain the state of their previous magnetization – leaving traces of data that is later overwritten with other data – the wear-levelling techniques of solid-state memory media [Gu96], RAM or swap memory copies, and remote backups, database systems add further complexity due to their nature of providing efficient and fast access to data by introducing several redundancies. A given piece of information is not only stored within its respective database table, but also in other locations, like indexes, logs, result caches, temporary relations or materialized views [SML07]. Deleted rows are often just flagged as deleted, without touching the actually stored data. Additionally, due to page-based storage mechanisms, changes to records that require a change to the layout of a page do not necessarily update this very page; instead, a new copy with the updated data is created in the unallocated parts of the file system, while the original page is left behind, flagged as unallocated space. The same applies to any kind of deletion operation. Essentially, old data is marked as deleted but remains present and is not immediately or intentionally destroyed. The latter happens only occasionally, when the unallocated space is later overwritten by a new page. An extensive study of this issue was done by Stahlberg et al. [SML07], who forensically investigated five different database storage engines – IBM DB2, InnoDB, MyISAM (both MySQL), PostgreSQL, and SQLite – with regard to traces of deleted data left within the table storage, transaction log, and indexes. They found that even after applying 25,000 operations and vacuuming, a large amount of deleted records could still be found. Furthermore, they investigated the cost of overwriting or encrypting (albeit using highly insecure algorithms) log entries for the InnoDB engine. Grebhahn et al. [GSKS13] especially focused on index structures and on what kind and amount of traces of deleted records can be reconstructed from the structure of indexes. Although they investigated not a real database system but a mock-up designed to thoroughly evaluate high-dimensional indexes, they achieved recovery rates from R-trees of up to 60% in single cases. As shown, although forensically secure deletion is required in many cases, actually removing data once it has been ingested into a database is still a difficult or even unsolved challenge. Therefore, the solution must be sought at an earlier point: the time the data first enters the database. As the deletion of data basically means rendering this data unreadable/illegible, it shares similarities with encrypting data without having knowledge of the proper key, as introduced by [BL96] for backup systems. Encryption can therefore be seen as a "preventive deletion" scheme in a forensically secure way. At the same time, the illegibility of encrypted data also hinders its widespread use in database systems, as most operations, and therefore queries, that are possible on plaintext data are not possible on the ciphertext. Therefore, in this paper we discuss a general approach on how to use several encryption schemes to provide additional security while still maintaining some of the advantageous properties of unencrypted data. A similar concept has been introduced in the context of CryptDB, e.g. in [PRZ+11], and more recently in [GHH+14].
2 Approach

Encryption schemes can generally be classified as symmetric or asymmetric, depending on whether the encryption and decryption processes use the same shared key or different keys (public and private key). As the sender and receiver of a message use different keys in asymmetric schemes, less trust is required between both parties with respect to key handling and secure storage; in this respect, asymmetric schemes are more secure. However, this severely affects performance; conversely, symmetric schemes are several orders of magnitude faster than asymmetric algorithms. In "traditional" (strong) cryptography, the goal is that the ciphertext does not reveal anything about the plaintext, i.e. for any two similar yet different plaintexts m1 and m2 the resulting ciphertexts are dissimilar and appear random. As such, any operation that makes use of a property of the plaintext is not applicable to the ciphertext. This is expressed by the cryptographic property of ciphertext indistinguishability, introduced by [GM84] as polynomial security: given n plaintexts and their n ciphertexts, determining which ciphertexts refer to which plaintexts has the same probability of success as random guessing. Similar is the property of non-malleability, introduced by [DDN00], which states that, given a ciphertext, an attacker must not be able to modify it in a way that yields a related plaintext; i.e., a malleable scheme would fulfil E(m ⊕ x) = E(m) ⊗ x', with E being the encryption operation, m the plaintext, x a value that changes the plaintext using the operation ⊕, and x'/⊗ their counterparts in the encrypted domain. However, there exist encryption schemes which intentionally give up these security properties in exchange for additional benefits, i.e. computational properties for mathematical operations in the encrypted domain. The first of these schemes is Homomorphic Encryption (HE), with its basic concept introduced by Rivest et al. [RAD78]: an encryption scheme allowing certain binary operations on the encrypted plaintexts to be carried out in the ciphertext domain by a related homomorphic operation, without any knowledge of the actual plaintext, i.e. E(m1 ∘ m2) = E(m1) ⊚ E(m2). However, most of the early homomorphic encryption schemes only allowed one operation (multiplication or addition), until Gentry [Ge09] introduced the first Fully Homomorphic Encryption (FHE) scheme, which allows additive as well as multiplicative arithmetic operations at the same time. As can easily be seen, homomorphic encryption does not fulfil the non-malleability criterion, since any adversary can combine two ciphertexts to create a valid new encrypted plaintext. The main drawback of this encryption scheme is that, although Gentry's approach generated a lot of interest in the scientific community, the scheme is very demanding with regard to computational time and space, with a plaintext-to-ciphertext expansion factor of thousands to millions [LN14], and the computation of complex/chained operations may take several seconds. The second encryption scheme is Order-Preserving Encryption (OPE), introduced by Agrawal et al. [AKSX04] and cryptographically proven by Boldyreva et al. [BCLO09], who also provided a cryptographic model based on the hypergeometric distribution. The idea of order-preserving encryption is to provide an encryption scheme that maintains the order of the plaintexts within their encrypted counterparts, i.e. ∀ m1 ≥ m2 : E(m1) ≥ E(m2).
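To make the additive homomorphism concrete, the following minimal Python sketch implements a textbook version of the Paillier cryptosystem [Pa99], which is also referred to in Section 3.3. It is a didactic illustration only: the primes are far too small for any real security, the helper names are ours rather than part of the paper's implementation, and the modular-inverse form pow(x, -1, n) (Python 3.8+) is assumed.

    import random
    from math import gcd

    def lcm(a, b):
        return a * b // gcd(a, b)

    def keygen():
        p, q = 293, 433                      # toy primes; real keys use large primes
        n = p * q
        g = n + 1                            # common simplification for the generator
        lam = lcm(p - 1, q - 1)
        # mu = (L(g^lam mod n^2))^-1 mod n, with L(x) = (x - 1) / n
        mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
        return (n, g), (lam, mu)

    def encrypt(pub, m):
        n, g = pub
        r = random.randrange(1, n)
        while gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

    def decrypt(pub, priv, c):
        n, _ = pub
        lam, mu = priv
        return ((pow(c, lam, n * n) - 1) // n * mu) % n

    def add_encrypted(pub, c1, c2):
        # Homomorphic addition: multiplying ciphertexts adds the plaintexts mod n.
        n, _ = pub
        return (c1 * c2) % (n * n)

    pub, priv = keygen()
    c1, c2 = encrypt(pub, 42), encrypt(pub, 58)
    assert decrypt(pub, priv, add_encrypted(pub, c1, c2)) == 100

Decrypting the product of the two ciphertexts yields the sum of the plaintexts, which is exactly the property exploited by the ADDHE function used later in Section 3.3.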
It can be seen that the OPE scheme does not fulfil the ciphertext indistinguishability criterion, as, by ordering the plaintexts and the ciphertexts, there is a high probability of knowing which ciphertext maps to which plaintext, except for equal plaintexts. A third encryption scheme is Partial Encryption (PE), which differs from the aforementioned schemes in that it is not a new encryption approach but a different application of existing common encryption schemes. The basic idea, as already discussed, e.g., in [MKD+11] and [SSM+11], is that in many complex data items only some parts are confidential. However, encrypting the complete data item may hinder the use of the data, as even persons or processes who only want to access the non-confidential parts would need to get the data item decrypted in some way, e.g. by being granted a higher security clearance, even though they should not need it. Using partial encryption, only the confidential parts of the data item are encrypted, and metadata is associated with the item specifying which parts are encrypted. As the data is complex, and thus HE and OPE schemes would not be usable, partial encryption would use strong cryptography. Combining these, our concept consists of two steps: The first step is to classify each data type with regard to its security level, in the sense of the maximum possible harm if the data gets misused or disclosed. This segmentation basically defines the amount of data to be protected as well as the type of security means needed to protect this data. Since the actual evaluation depends on the regulations of each organization, legislation, use cases and so on, it is not the focus of this paper and thus will not be described in detail. The second step is to evaluate what kind of operations are commonly applied to the data, in the sense of whether the queries mostly consist of arithmetic operations, comparison operations, or other kinds of operations. If data is mainly used for arithmetic operations, HE can be used to enable these kinds of operations on encrypted data. The same applies to data mainly used for comparison queries and OPE. If complex data types are present where only parts are sensitive and other parts contain useful information too, partial encryption can be used. It has to be stated here that this two-step concept is an extension of the basic approach discussed in [MKD+11]. However, as stated before, the HE and OPE schemes are weaker than strong encryption schemes. Therefore, based on the security level evaluated in step 1, highly confidential data that satisfies the criteria for either HE or OPE should still be encrypted using strong cryptography.

3 Realization in database systems

In this section we describe the implications of using our approach on the database management system by discussing whether the encryption, or parts thereof, should be provided server-side or client-side, as well as the necessary changes to queries. For two of the introduced schemes (HE and OPE) the integration into a query language is discussed only conceptually on the query level, while for the third scheme (PE) a realization of the required query language extension is presented. Our test MySQL database consists of actual forensic data acquired during the Digi-Dak project (http://omen.cs.uni-magdeburg.de/digi-dak/), consisting of fiber scans, synthetic fingerprints and metadata in the form of statistical, spectral and gradient features for fingerprints [MHF+12] in 3.4 million tuples, as well as diameter, length, perimeter, area, height and color as fiber features [AKV12].
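As a rough illustration of the two-step concept from Section 2, the following Python sketch encodes one possible selection routine. The security-level labels, operation categories and the default branch are illustrative assumptions of ours and are not prescribed by the approach above.

    def select_scheme(security_level: str, dominant_operation: str, complex_type: bool) -> str:
        """Sketch of the two-step scheme selection (illustrative labels only)."""
        # Step 1: highly confidential data is always strongly encrypted,
        # even if HE or OPE would be applicable in principle.
        if security_level == "high":
            return "strong encryption"
        # Step 2: choose according to the operations that dominate the workload.
        if complex_type:
            return "partial encryption"              # only the sensitive parts are encrypted
        if dominant_operation == "arithmetic":
            return "homomorphic encryption (HE)"
        if dominant_operation == "comparison":
            return "order-preserving encryption (OPE)"
        return "strong encryption"                   # no query support on ciphertext needed

    print(select_scheme("medium", "comparison", complex_type=False))   # -> OPE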
3.1 Server-side vs client-side

An important question is whether the encryption function E and the decryption function D should be provided client-side or server-side. On the client side, each client application would be required to implement both functions if it wants to create proper queries and interpret the results. However, this may be infeasible for heterogeneous environments with many different types of client software, especially if future updates to the employed encryption are considered. The advantage of this approach is that the keys never leave the client. If the crypto functions are centrally server-based, every client can make use of the encrypted data with only minor changes to the queries, as described in the following sections. An important issue here is the acquisition of the keys. The keys may either be provided by the client or by the server. For client-based key provision, they would be part of the query and thus transmitted over the server-client connection. In this case, the connection must be secured, e.g. by using SSL/TLS. For server-based key provision, the keys may either be part of the encryption or decryption functions themselves or be provided as part of a view, as described in Section 3.4. The disadvantage of server-based key provision using views is that the password needs to be stated in the view definition, and as such this encryption approach has at most the security level of the database access controls. It can thus protect confidential data against malicious clients or clients vulnerable to SQL injections and similar attacks, but not against attackers who have low-level (OS) access to the database. Therefore, client-based key provision has a higher security level, but as stated in Section 2, server-based key provision may also be feasible. Approaches like [GSK13] propose an extension of the SQL syntax to permit special forensic tables that automatically handle secure deletion, and [SML07] proposes the use of an internal stream cipher to automatically encrypt all data. Both approaches require extending or modifying the database system at the source code level, which is in most cases not applicable to database systems in a production environment. Therefore, our approach realizes secure deletion/encryption by making use of means already provided by the database system, in our case external User-Defined Functions (UDFs) – also called call specifications in Oracle or CLR functions in MSSQL – from external libraries, as complex cryptographic implementations are infeasible with stored procedures in SQL/PSM or related languages.

3.2 Order Preserving Encryption

As for the other encryption schemes, the data needs to be encrypted before it is stored in the database. This would be done by the client, who transforms an input item m to m' = EOPE(m, keyE), which is then INSERTed. On a SELECT, the returned m' needs to be transformed back as m = DOPE(m', keyD). As described in the previous section, EOPE() and DOPE() may be provided either server- or client-side, and the keys may be provided either explicitly or implicitly.
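To give an impression of what an EOPE-style function could look like, the following Python toy maps each non-negative integer plaintext to the cumulative sum of keyed pseudorandom gaps. It preserves order (and, being deterministic, equality), but it is a didactic stand-in under our own assumptions, not the scheme of [BCLO09] and not part of the implementation described here.

    import hashlib
    import hmac

    def keyed_gap(key: bytes, i: int) -> int:
        # Deterministic, keyed pseudorandom gap in [1, 1000].
        digest = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
        return int.from_bytes(digest[:4], "big") % 1000 + 1

    def ope_encrypt(key: bytes, m: int) -> int:
        # Ciphertext = sum of the gaps 0..m, hence strictly monotone in the plaintext.
        return sum(keyed_gap(key, i) for i in range(m + 1))

    key = b"demo key"
    plaintexts = [3, 17, 17, 42]
    ciphertexts = [ope_encrypt(key, m) for m in plaintexts]
    # Order (and, since the mapping is deterministic, equality) is preserved:
    assert all((a <= b) == (ca <= cb)
               for a, ca in zip(plaintexts, ciphertexts)
               for b, cb in zip(plaintexts, ciphertexts))

Because the mapping is monotone, a comparison predicate evaluated on the ciphertexts yields the same result as on the plaintexts, which is what the rewritten WHERE clauses below rely on.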
Assuming the columns of a tuple where c2 contains confidential values mainly used for relational queries (for example with a fixed value v1), a query of the form

Query 1a: SELECT c1, c2 FROM t1 WHERE c2 < v1;

would be transformed into the following for OPE data:

Query 1b: SELECT c1, DOPE(c2, keyD) FROM t1 WHERE c2 < EOPE(v1, keyE);

In case the query result does not contain OPE data, the only overhead compared to non-encrypted data would be the encryption in the WHERE clause. Therefore, the time overhead depends on the number and complexity of OPE expressions in the SELECT clause, multiplied by the result count, and on the number and complexity of OPE expressions in the WHERE clause.

3.3 Homomorphic Encryption

For homomorphic encryption basically the same applies. Assuming the columns of a tuple where c1 and c2 contain confidential values mainly used for arithmetic operations, queries to retrieve a value for a client or to generate a value within the database:

Query 2a: SELECT c1 + c2 FROM t1 WHERE …;
Query 3a: INSERT INTO t2 (c3) SELECT c1 * c2 FROM t1 WHERE …;

would be transformed into the following for HE data:

Query 2b: SELECT DHE(ADDHE(c1, c2), keyD) FROM t1 WHERE …;
Query 3b: INSERT INTO t2 (c3) SELECT MULHE(c1, c2) FROM t1 WHERE …;

However, depending on the homomorphic operations and on whether the query result is stored in the database or returned to the client, the call to an additional function may not be needed. For example, in the Paillier cryptosystem [Pa99] the addition of two plaintexts is expressed by a multiplication of the ciphertexts: m1 + m2 mod n corresponds to m1' * m2' mod n^2.

3.4 Partial Encryption

In our implementation the actual library and user-defined functions were written in C#, as its runtime library provides a large amount of conveniently usable image processing and cryptography functions. As C# libraries use different export signatures, the Unmanaged Exports tool (MIT license) by Robert Giesecke (https://sites.google.com/site/robertgiesecke/Home/uploads/unmanagedexports) is used to automatically create proper C-style exports and unmarshalling. The pseudo-code of the decrypt UDF called DPE is as follows:

INPUT: blob, password
OUTPUT: image

(image, regions, cryptalg, cryptparams, blocksize) ← unpack(blob)
if password = ∅ then
    return image
endif
buffer ← ARRAY BYTE[1..blocksize], coord ← ARRAY POINT[1..blocksize]
b ← 0, h ← height(image), w ← width(image)
for y = 1 to h do
    for x = 1 to w do
        if (x, y) in regions then
            buffer[b] ← image[x, y], coord[b] ← (x, y)
            b ← b + 1
        endif
        if b = blocksize OR (y = h AND x = w) then
            buffer' ← decrypt(buffer, cryptalg, cryptparams, password)
            for i = 1 to blocksize do
                image[coord[i]] ← buffer'[i]
            endfor
            b ← 0
        endif
    endfor
endfor
return image

In our scenario, partial encryption is used for fingerprint scans. At a crime scene there are cases where latent fingerprints – sensitive data – may be superimposed with other, non-sensitive evidence like fiber traces. Depending on the investigation goal (fingerprint or fiber analysis) it may thus become necessary to make areas containing sensitive data inaccessible. Therefore, in our approach only the fingerprint parts of the scans are encrypted using AES and packed into a ZIP container along with the encryption metadata (algorithm, bit size, initialization vectors, password salt) and the outline of the partially encrypted regions, as shown in Figure 1.

Figure 1: (Synthetic) original and partially encrypted fingerprint. The example is taken from the Public Printed Fingerprint Data Set – Chromatic White Light Sensor – Basic Set V1.0.
The image was acquired using sensors from the Digi-Dak research project (http://omen.cs.uni-magdeburg.de/digi-dak/, 2013) sponsored by the German Federal Ministry of Education and Research; see Hildebrandt, M., Sturm, J., Dittmann, J., and Vielhauer, C.: Creation of a Public Corpus of Contact-Less Acquired Latent Fingerprints without Privacy Implications. Proc. CMS 2013, Springer, LNCS 8099, 2013, pp. 204–206. It uses privacy-implication-free fingerprint patterns generated with SFinGe, published in Cappelli, R.: Synthetic fingerprint generation. In: Maltoni, D., Maio, D., Jain, A.K., and Prabhakar, S. (eds.): Handbook of Fingerprint Recognition, 2nd edn., Springer London, 2009.

The database itself only stores the container. Additionally, for server-based key provision using views, there would be two views to provide standardized means of access at the database query language level: a) an "anonymized view" for general users who do not have the proper security clearance to access unencrypted fingerprints, returning the encrypted fingerprint scan from the container (Figure 1, right side):

Query 4: CREATE VIEW fingerprints_anon AS SELECT id, filename, DPE(scan) AS scan FROM fingerprints;

and b) a "deanonymized view" for privileged users like forensic fingerprint experts who do have the access rights to the actual fingerprints; it not only unpacks the image but also tries to decrypt it with the provided key (Figure 1, left side):

Query 5: CREATE VIEW fingerprints_deanon AS SELECT id, filename, DPE(scan, key) AS scan FROM fingerprints;

Regarding the performance on our test system (Intel Core i7-4610QM @ 2.3 GHz, 8 GB RAM), we observe in first experiments that querying 1000 tuples of containers takes 0.607 s, unpacking the encrypted scans (Query 4) 29.110 s, and returning the deanonymized scans (Query 5) 99.016 s. However, it should be noted that for the Query 5 task the computation time for image parsing is included in the given figure and makes up the main part of the execution time for this query.

4 Conclusion and future work

In this paper we presented a general concept for using different encryption schemes for the sake of secure deletion and de-personalization, also taking into account that the data at least partially remains usable for queries. The concept includes the classification of data by its security level and its later main usage, which determines the appropriate encryption scheme. We also showed how these encryption schemes can be used in common database systems without the need to directly modify the system, but rather by using the existing capabilities of a selected database system. In future work, more thorough research on key management for server-based key provision needs to be done, as well as on the applicability to other database systems. Furthermore, the actual performance impact of this general concept on practical systems has to be evaluated with large-scale experiments for relevant application scenarios, e.g. larger forensic databases and/or biometric authentication systems, where such a scheme could prevent information leakage as well as inter-system traceability.

5 Acknowledgements

The work in this paper has been funded by the German Federal Ministry of Education and Research (BMBF) within the Digi-Dak project under contract no. FKZ 13N10816.

References

[AKSX04] Agrawal, R.; Kiernan, J.; Srikant, R.; Xu, Y.: Order-preserving encryption for numeric data. In: SIGMOD, 2004; pages 563–574.
[AKV12] Arndt, C.; Kraetzer C.; Vielhauer, C.: First approach for a computer-aided textile fiber type determination based on template matching using a 3D laser scanning microscope. In: Proc. 14th ACM Workshop on Multimedia and Security, 2012. [BCLO09] Boldyreva, A., Chenette, N., Lee, Y., O’Neill, A.: Order-preserving symmetric encryption. In: EUROCRYPT, 2009, pages 224–241. [BL96] Boneh, D.; Lipton, R. J.: A revocable backup system. In: USENIX Security Symposium, 1996; pages 91–96. [DDN00] Dolev, D.; Dwork, C.; Naor, M.: Nonmalleable Cryptography. In: SIAM Journal on Computing 30 (2), 2000; pages 391–437. [Ge09] Gentry, C: Fully homomorphic encryption using ideal lattices. In: ACM symposium on Theory of computing, STOC ’09, New York, 2009; pages 169–178. [GM84] Goldwasser, S.; Micali, S.: Probabilistic encryption. In: Journal of Computer and System Sciences. 28 (2), 1984; pages 270–299. [GSK13] Grebhahn, A.; Schäler, M.; Köppen, V.: Secure Deletion: Towards Tailor-Made Privacy in Database Systems. In: Workshop on Databases in Biometrics, Forensics and Security Applications (DBforBFS), BTW, Köllen-Verlag, 2013; pages 99–113. [GSKS13] Grebhahn, A.; Schäler, M.; Köppen, V.; Saake, G.: Privacy-Aware Multidimensional Indexing. In: BTW 2013; pages 133–147. [GHH+14] Grofig, P.; Hang, I.; Härterich, M.; Kerschbaum, F.; Kohler, M.; Schaad, A.; Schröpfer, A.; Tighzert, W.: Privacy by Encrypted Databases. Preneel, B.; Ikonomou, D. (eds.): Privacy Technologies and Policy. Springer International Publishing, Lecture Notes in Computer Science, 8450, ISBN: 978-3-319-06748-3, pp. 56-69, 2014. [Gu96] Gutmann, P.: Secure Deletion of Data from Magnetic and Solid-State Memory. In: USENIX Security Symposium, 1996. [MHF+12] Makrushin, A.; Hildebrandt, M.; Fischer, R.; Kiertscher, T.; Dittmann, J.; Vielhauer, C.: Advanced techniques for latent fingerprint detection and validation using a CWL device. In: Proc. SPIE 8436, 2012; pages 84360V. [MKD+11] Merkel, R.; Kraetzer, C.; Dittmann, J.; Vielhauer, C.: Reversible watermarking with digital signature chaining for privacy protection of optical contactless captured biometric fingerprints - a capacity study for forensic approaches. Proc. 17th International Conference on Digital Signal Processing (DSP), 2011. [LN14] Lepoint, T.; Nehring, A.: A Comparison of the Homomorphic Encryption Schemes FV and YASHE. Africacrypt 2014. [Pa99] Paillier, P.: Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In: Eurocrypt 99, Springer Verlag, 1999; pages 223–238. [PRZ+11] Popa, R. A.; Redfield, C. M. S.; Zeldovich, N.; Balakrishnan, H.: CryptDB: Protecting Confidentiality with Encrypted Query Processing. Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), 2011. [RAD78] Rivest, R. L.; Adleman, L.; Dertouzos, M. L.: On data banks and privacy homomorphisms. In: Foundations of Secure Computation, 1978. [SSM+11] Schäler, M.; Schulze, S.; Merkel, R.; Saake, G.; Dittmann, J.: Reliable Provenance Information for Multimedia Data Using Invertible Fragile Watermarks. In Fernandes, A. A. A.; Gray, A. J. G.; Belhajjame, K. (eds.): Advances in Databases. Springer Berlin Heidelberg, Lecture Notes in Computer Science, 7051, pp. 3-17, ISBN: 978-3642-24576-3, 2011. [SML07] Stahlberg, P.; Miklau, G.; Levine, B.N.: Threats to Privacy in the Forensic Analysis of Database Systems. In: SIGMOD, New York 2007, ACM; pages 91–102. 
Data Streams and Event Processing

Marco Grawunder ([email protected]), Universität Oldenburg
Daniela Nicklas ([email protected]), Universität Bamberg

The processing of continuous data sources has become an important paradigm of modern data processing and management, covering many applications and domains such as monitoring and controlling networks or complex production systems, as well as complex event processing in medicine, finance or compliance. Topics of interest include:

• Data streams
• Event processing
• Case Studies and Real-Life Usage
• Foundations
  – Semantics of Stream Models and Languages
  – Maintenance and Life Cycle
  – Metadata
  – Optimization
• Applications and Models
  – Statistical and Probabilistic Approaches
  – Quality of Service
  – Stream Mining
  – Provenance
• Platforms for event and stream processing, in particular
  – CEP Engines
  – DSMS
  – "Conventional" DBMS
  – Main memory databases
  – Sensor Networks
• Scalability
  – Hardware acceleration (GPU, FPGA, ...)
  – Cloud Computing
• Standardisation

In addition to regular workshop papers, we invite extended abstracts to cover hot topics, ongoing research and ideas that are ready to share and discuss, but maybe not ready to publish yet.

1 Workshop co-chairs

Marco Grawunder (Universität Oldenburg)
Daniela Nicklas (Universität Bamberg)

2 Program Committee

Andreas Behrend (Universität Bonn)
Klemens Boehm (Karlsruher Institut für Technologie)
Peter Fischer (Universität Freiburg)
Dieter Gawlick (Oracle)
Boris Koldehofe (Technische Universität Darmstadt)
Wolfgang Lehner (TU Dresden)
Richard Lenz (Universität Erlangen-Nürnberg)
Klaus Meyer-Wegener (Universität Erlangen)
Gero Mühl (Universität Rostock)
Kai-Uwe Sattler (Technische Universität Ilmenau)
Thorsten Schöler (Hochschule Augsburg)

Modulares Verteilungskonzept für Datenstrommanagementsysteme

Timo Michelsen, Michael Brand, H.-Jürgen Appelrath
Universität Oldenburg, Department für Informatik
Escherweg 2, 26129 Oldenburg
{timo.michelsen, michael.brand, appelrath}@uni-oldenburg.de

Abstract: Für die Verteilung kontinuierlicher Anfragen in verteilten Datenstrommanagementsystemen (DSMS) gibt es je nach Netzwerk-Architektur und Anwendungsfall unterschiedliche Strategien. Die Festlegung auf eine Strategie ist u. U. nachteilig, besonders wenn sich Netzwerk-Architektur oder Anwendungsfall ändern. In dieser Arbeit wird ein Ansatz für eine flexible und erweiterbare Anfrageverteilung in verteilten DSMSs vorgestellt. Der Ansatz umfasst drei Schritte: (1) Partitionierung, (2) Modifikation und (3) Allokation. Bei der Partitionierung wird eine kontinuierliche Anfrage in disjunkte Teilanfragen zerlegt. Die optionale Modifikation erlaubt es, Mechanismen wie Fragmentierung oder Replikation zu verwenden. Bei der Allokation werden die einzelnen Teilanfragen schließlich Knoten im Netzwerk zugewiesen, um dort ausgeführt zu werden. Für jeden der drei Schritte können unabhängige Strategien verwendet werden. Dieser modulare Aufbau ermöglicht zum einen eine individuelle Anfrageverteilung. Zum anderen können bereits vorhandene Strategien aus anderen Arbeiten und Systemen (z.B. eine Allokationsstrategie) integriert werden. In dieser Arbeit werden für jeden der drei Teilschritte beispielhafte Strategien vorgestellt. Außerdem zeigen zwei Anwendungsbeispiele die Vorteile des vorgestellten, modularen Ansatzes gegenüber einer festen Verteilungsstrategie.
1 Einleitung In verarbeitenden Systemen ist es häufig notwendig, den persistenten Teil des Systems mehrfach und verteilt vorzuhalten, um Ausfälle kompensieren zu können. In verteilten Datenbankmanagementsystemen (DBMS) werden persistente Daten disjunkt (Fragmentierung) oder redundant (Replikation) auf verschiedene Knoten des Netzwerks verteilt. Anfragen werden einmalig gestellt und greifen ausschließlich auf die Knoten zu, die die betreffenden Daten besitzen. Betrachtet man allerdings verteilte Datenstrommanagementsysteme (DSMS), so ist eine Verteilung der Datenstromelemente allein nicht zielführend, da diese flüchtig sind. Hier sind die Anfragen persistent, da sie theoretisch kontinuierlich ausgeführt werden. Eine solche, auf mehrere Knoten verteilte, kontinuierliche Anfrage wird verteilte, kontinuierliche Anfrage genannt. Eine verteilte, kontinuierliche Anfrage befindet sich auf mehreren Knoten, indem die einzelnen Operationen den Knoten zugeordnet und dort ausgeführt 51 werden (in sogenannten Teilanfragen). Zwischenergebnisse der Teilanfragen werden von Knoten zu Knoten gesendet, bis schließlich die Ergebnisse an den Nutzer gesendet werden. Für eine Zerlegung einer kontinuierlichen Anfrage in Teilanfragen gibt es allerdings viele Möglichkeiten (z.B. eine Teilanfrage für die gesamte kontinuierliche Anfrage oder je eine Teilanfrage pro Operation). Die Zerlegung in Teilanfragen ist nur ein Aspekt, der bereits aufzeigt, dass unterschiedliche Strategien für eine Anfrageverteilung in verteilten DSMSs existieren. Ein weiterer Aspekt ist die Netzwerk-Architektur (bspw. homogene Cluster, Client/Server-Architekturen und Peer-To-Peer (P2P)-Netzwerke). In einem homogenen Cluster verfügen alle Knoten i.d.R. über die gleichen Mengen an Systemressourcen. Daher muss bei einer Zuweisung von (Teil-) Anfragen an einen Knoten in homogenen Clustern keinerlei Heterogenität berücksichtigt werden. Kommt hingegen ein heterogenes P2P-Netzwerk zum Einsatz, kann es vorkommen, dass nicht jeder Knoten jede (Teil-) Anfrage ausführen kann (bspw. aufgrund komplexer Operatoren). In vielen verteilten DSMSs ist die Anfrageverteilung system-intern auf eine bestimmte Netzwerk-Architektur ausgelegt (z.B. Client/Server). Somit wird es aufwendig, die zugrunde liegende Netzwerk-Architektur nachträglich zu wechseln. Ein Grund für eine solche Einschränkung ist die Tatsache, dass keine optimale Verteilungsstrategie existiert, die für alle Netzwerk-Architekturen und Knoten eingesetzt werden kann. Neben unterschiedlichen Netzwerk-Architekturen können ebenso anwendungsspezifische Kenntnisse des Nutzers die Anfrageverteilung optimieren (z.B. die Identifikation von Operatoren, die eine hohe Systemlast erzeugen). Eine solche manuelle Optimierung ist allerdings nur möglich, wenn der Nutzer die Anfrageverteilung für eine konkrete kontinuierliche Anfrage konfigurieren kann. Diese Arbeit stellt ein Konzept für eine flexible und erweiterbare Anfrageverteilung in einem verteilten DSMS vor. Dazu wird die Verteilung in drei Phasen unterteilt. Jede Phase bietet eine eindeutige Schnittstelle mit Ein- und Ausgaben, die die Umsetzung mehrerer Strategien ermöglicht. Die Anfrageverteilung kann individuell für die zugrunde liegende Netzwerk-Architektur und den konkreten Anwendungsfall konfiguriert und ggfs. optimiert werden. Außerdem ist es möglich, bereits existierende Strategien aus anderen Quellen zu übernehmen. Es entsteht eine Sammlung an Strategien, die je nach Anwendungsfall individuell kombiniert werden können. 
Der Rest der Arbeit ist wie folgt gegliedert: Abschnitt 2 gibt einen Überblick über andere Systeme, die ebenfalls kontinuierliche Anfragen verteilen. Das modulare Verteilungskonzept wird in Abschnitt 3 vorgestellt. Dabei liegt der Fokus auf der Erläuterung des Konzeptes. Auf eine vollständige Übersicht aller zum jetzigen Zeitpunkt verfügbarer Strategien wird in dieser Arbeit verzichtet. Abschnitt 4 beinhaltet den aktuellen Stand der Implementierung und beispielhafte Anwendungsszenarien. In Abschnitt 5 wird die Arbeit abschließend zusammengefasst. 52 2 Verwandte Arbeiten In den DSMSs Borealis [CBB+ 03] und StreamGlobe [KSKR05] werden neue kontinuierliche Anfragen zunächst grob zerlegt und im Netzwerk verteilt. Häufig werden einzelne Operatoren Knoten zugeordnet. Nach der anfänglichen Verteilung werden Teile der kontinuierlichen Anfrage während der Verarbeitung verschoben, d.h., sie werden von Knoten zu Knoten übertragen. Dadurch können Kommunikationskosten gespart werden, indem bspw. Teilanfragen verschiedener Knoten zusammengefasst werden. Das DSMS Stormy [LHKK12] zerlegt keine kontinuierlichen Anfragen, sondern führt sie stets vollständig auf einem Knoten aus. Die Auswahl des Knotens geschieht mittels eines Hashwertes, welcher aus der kontinuierlichen Anfrage gebildet wird: Jeder Knoten übernimmt einen Teil des Wertebereichs, sodass kontinuierliche Anfragen mit bestimmten Hashwerten bestimmten Knoten zugeordnet werden. StreamCloud [GJPPM+ 12], Storm [TTS+ 14] und Stratosphere [WK09] nutzen CloudInfrastrukturen, um bei Bedarf verarbeitende, virtuelle Knoten in der Cloud zu erzeugen und Teilanfragen zuzuordnen. In Storm kann der Nutzer zusätzlich angeben, wie viele Instanzen für eine Operation erstellt werden sollen (bspw. für Fragmentierung oder Replikation). StreamCloud bietet unterschiedliche Zerlegungsstrategien, die später in dieser Arbeit aufgegriffen werden. In SPADE, einer Anfragesprache von System S [GAW09], werden kontinuierliche Anfragen nach einem Greedy-Algorithmus zerlegt. Die Zuweisung der so entstandenen Teilanfragen wird mittels eines Clustering-Ansatzes durchgeführt. Dadurch sollen die Teilanfragen möglichst wenigen Knoten zugeordnet werden. Daum et al. [DLBMW11] haben ein Verfahren für eine automatisierte, kostenoptimale Anfrageverteilung vorgestellt. Sie verfolgen damit andere Ziele als das in dieser Arbeit vorgestellte Konzept, bei dem es viel mehr um Flexibilität und Modularität als um Automatisierung geht. Jedes hier vorgestellte DSMS besitzt seine eigene Vorgehensweise (mit eigenen Vor- und Nachteilen), kontinuierliche Anfragen im Netzwerk zu verteilen. Jedoch bieten sie – im Gegensatz zu dem hier vorgestellten Konzept – nicht die Flexibilität und Modularität, um Verteilungsstrategien bei Bedarf zu wechseln. Sie erlauben häufig nur mit großem Aufwand neue, evtl. domänenspezifische Strategien zu implementieren und einzusetzen. 3 Konzept Häufig werden kontinuierliche Anfragen vom Nutzer deklarativ gestellt. Anschließend wird der Anfragetext i.d.R. in eine sprachenunabhängige Struktur übersetzt, die sich an Anfragepläne von DBMSs orientiert: die sogenannten logischen Operatorgraphen. Sie können als gerichtete, azyklische Graphen interpretiert werden, wobei die Knoten die Operatoren repräsentieren. Die Kanten stellen die Datenströme zwischen den Operatoren dar. Ein logischer Operator beschreibt, welche Operation auf einen Datenstrom ausgeführt 53 werden soll (bspw. Selektion, Projektion, Join). Sie beinhaltet jedoch nicht die konkrete Implementierung. 
Diese wird erst bei der tatsächlichen Ausführung der (Teil-) Anfragen auf einem Knoten eingesetzt. Aufgrund der Unabhängigkeit von der Sprache und von der Implementierung basiert das hier vorgestellte Konzept auf logischen Operatorgraphen. Es beschreibt also, wie die Verteilung eines logischen Operatorgraphen flexibel und erweiterbar durchgeführt werden kann. Wie bereits erwähnt, ist eine einzige, fest programmierte Vorgehensweise zur Anfrageverteilung häufig nicht praktikabel. Dementsprechend wird in dieser Arbeit eine mehrschrittige Vorgehensweise verfolgt: (1) Partitionierung, (2) Modifikation und (3) Allokation. Abbildung 1 soll den Zusammenhang der Phasen zueinander verdeutlichen. Für jede Phase werden mehrere Strategien zur Verfügung gestellt, aus denen der Nutzer für jede kontinuierliche Anfrage wählen kann. Als Alternative ist eine automatisierte Auswahl vorstellbar, diese wird jedoch in dieser Arbeit nicht weiter verfolgt.

Abbildung 1: Phasen der Verteilung kontinuierlicher Anfragen (logischer Operatorgraph, Partitionierung, Modifikation, Allokation, verteilte Teilgraphen auf den Knoten A–D).

Ausgehend von einem logischen Operatorgraphen wird in der ersten Phase, der Partitionierung, der Graph in Teilgraphen zerlegt, wobei keinerlei Änderungen am Graphen vorgenommen werden (z.B. neue Operatoren). Diese Teilgraphen werden dann modifiziert, um weitere Eigenschaften im Graphen sicherzustellen. Dabei können die Teilgraphen ergänzt, verändert oder auch entfernt werden. Beispielsweise wurde in der Abbildung der mittlere Teilgraph repliziert. Anschließend wird in der Allokation eine Zuordnung zwischen Teilgraph und ausführenden Knoten hergestellt und der Teilgraph schließlich übermittelt. In dieser Phase werden die Teilgraphen nicht mehr verändert.

1 #NODE_PARTITION <Partitionierungsstrategie> <Parameter>
2 #NODE_MODIFICATION <Modifikationsstrategie> <Parameter>
3 #NODE_ALLOCATE <Allokationsstrategie> <Parameter>
4
5 kontinuierliche Anfrage

Listing 1: Selektion der Strategien zur Verteilung einer kontinuierlichen Anfrage.

Die Auswahl der Strategien wird hier dem Nutzer überlassen. Ein Beispiel, wie der Nutzer eine kontinuierliche Anfrage im DSMS Odysseus [AGG+12] verteilen und entsprechende Strategien auswählen kann, ist in Listing 1 zu sehen. Zu Beginn gibt der Nutzer die Strategie für jede Phase an (Zeilen 1 bis 3). Mittels Parameter können die Strategien weiter verfeinert werden (bspw. Replikationsgrad). Anschließend wird die eigentliche Anfrage formuliert, welche anhand der gewählten Kombination der Strategien verteilt wird. Der Nutzer muss den Anfragetext nicht speziell anpassen, um diese verteilen zu können. Es müssen lediglich zuvor die eingesetzten Strategien spezifiziert werden. Für eine andere Anfrage können wiederum andere Strategien gewählt werden. In den folgenden Abschnitten werden die Phasen zur Verteilung genauer beschrieben und beispielhafte Strategien kurz erläutert.

3.1 Partitionierung

Aufgrund der Tatsache, dass eine gegebene kontinuierliche Anfrage in einem Netzwerk verteilt werden soll, werden einzelne Knoten häufig Teile der Anfrage erhalten. Dementsprechend muss zuvor entschieden werden, wie die Anfrage zerlegt werden soll. Das bedeutet in diesem Fall, dass der logische Operatorgraph in mehrere Teilgraphen zerlegt werden muss. Teilgraphen beschreiben somit, welche logischen Operatoren zusammen auf einem Knoten ausgeführt werden sollen.
Es ist dabei wichtig, dass die Teilgraphen zueinander disjunkt sind, d.h., jeder logischer Operator der kontinuierlichen Anfrage befindet sich in genau einem Teilgraphen. Die logischen Operatoren müssen innerhalb eines Teilgraphen jedoch nicht zusammenhängend sein. Das Finden einer geeigneten Zerlegung eines Graphen ist NP-hart [BMS+ 13]. Dementsprechend wird vorgeschlagen, mehrere Strategien zur Partitionierung anzubieten. Eine Partitionierungsstrategie erhält einen logischen Operatorgraphen und liefert eine Menge an disjunkten Teilgraphen. Einige Partitionierungsstrategien sind die folgenden: QueryCloud Der logische Operatorgraph wird als ein Teilgraph behandelt. Das bedeutet, dass keine Zerlegung durchgeführt wird. Dies ist nützlich, wenn es sich um eine Anfrage mit einer geringen Zahl an logischen Operatoren handelt und eine Zerlegung nicht praktikabel erscheint. OperatorCloud Jeder logischer Operator ist ein Teilgraph. Dies repräsentiert die maximale Zerlegung des Operatorgraphen. Diese Strategie ist für kontinuierliche Anfragen interessant, die wenige, jedoch sehr komplexe logische Operatoren beinhalten. OperatorSetCloud In vielen kontinuierlichen Anfragen verursachen die zustandsbehafteten Operatoren die meiste Systemlast, wie bspw. Aggregationen [GJPPM+ 12]. Die Partitionierungsstrategie OperatorSetCloud zerlegt den logischen Operatorgraphen, sodass jeder Teilgraph maximal einen solchen Operator enthält. Dadurch ist es möglich, die zustandsbehafteten Operatoren verschiedenen Knoten zuzuteilen, sodass die Systemlast besser im Netzwerk verteilt werden kann. Nutzerbasiert Falls es die Anfragesprache erlaubt, kann der Nutzer direkt angeben, welche Operatoren zusammen auf einem Knoten ausgeführt werden sollen. Diese Strategie ist besonders für Evaluationen praktisch, da damit bestimmte Szenarien der Verteilung nachgestellt und reproduziert werden können. 55 Auf Details wird im Rahmen dieser Arbeit verzichtet. Die ersten drei Strategien wurden dem Vorbild des DSMS StreamCloud nachempfunden und werden in [GJPPM+ 12] genauer erläutert. Es ist möglich, dass Entwickler weitere Strategien konzipieren und einsetzen. Dadurch können in konkreten Anwendungsszenarien bspw. spezielle Eigenschaften der Knoten und des Netzwerks ausgenutzt werden. 3.2 Modifikation In der zweiten Phase wird der logische Operatorgraph modifiziert, um weitere Eigenschaften in der kontinuierlichen Anfrage sicherzustellen. Dieser Schritt ist optional und kann übersprungen werden, wenn keine Änderungen am logischen Operatorgraphen notwendig sind. Sind jedoch mehrere Modifikationen notwendig, kann diese Phase mehrfach durchgeführt werden. An dieser Stelle sind ebenfalls unterschiedliche Möglichkeiten vorstellbar. Jede Modifikationsstrategien erhält als Eingabe die Menge an Teilgraphen aus der ersten Phase. Die Ausgabe beinhaltet eine modifizierte Menge an Teilgraphen. In dieser Phase ist ebenfalls vorgesehen, dass Entwickler weitere Strategien konzipieren und einsetzen. Jedoch wurden in der vorliegenden Arbeit folgende Modifikationsstrategien betrachtet: Replikation Jeder Teilgraph wird (u. U. mehrfach) repliziert. Dadurch kann jeder Teilgraph auf mehreren Knoten ausgeführt werden. Solange mindestens ein Knoten den Teilgraphen ausführt, können (Zwischen-) Ergebnisse berechnet und gesendet werden. Horizontale Fragmentierung Ähnlich zur Replikation wird jeder Teilgraph repliziert, jedoch empfängt jede Kopie nur einen Teil des Datenstroms. Die Ergebnisse werden am Ende wieder zusammengefasst. 
Damit kann die Verarbeitung parallelisiert werden, was bei besonders hohen Datenraten oder komplexen Berechnungen sinnvoll ist. Eine Illustration beider Strategien ist in Abbildung 2 zu sehen.

Abbildung 2: Verwendung der Replikations- (links) und der horizontalen Fragmentierungsstrategie (rechts).

Links ist der Einsatz der Replikation als Modifikationsstrategie zu sehen: Der Teilgraph wird kopiert, und die replizierten (Teil-) Ergebnisse werden mittels eines speziellen Merge-Operators vereinigt. Der Merge-Operator erkennt und entfernt Duplikate in den Datenströmen, sodass die Replikation die Verarbeitungsergebnisse nicht unnötig vervielfacht. Im Rahmen der horizontalen Fragmentierung werden Teilgraphen ebenfalls kopiert (in der Abbildung rechts). Der vorgelagerte Fragment-Operator zerlegt den eintreffenden Datenstrom in disjunkte Fragmente, die parallel verarbeitet werden (bspw. mittels Hashwerten der Datenstromelemente). Der Union-Operator vereinigt die Teilmengen schließlich zu einem Datenstrom. In der Modifikationsphase ist ebenfalls vorgesehen, dass Entwickler eigene Strategien entwickeln und einsetzen (z.B. vertikale Fragmentierung).

3.3 Allokation

Die dritte Phase – die Allokation – beinhaltet die Aufgabe, Teilanfragen den Knoten im Netzwerk zur Ausführung zuzuordnen. Auch hier sind in Abhängigkeit zum vorliegenden Netzwerk verschiedene Vorgehensweisen vorstellbar, sodass im Rahmen dieser Arbeit mehrere Strategien betrachtet werden. Jede Allokationsstrategie erhält als Eingabe die Menge an (ggfs. replizierten und/oder fragmentierten) Teilgraphen. Die Ausgabe umfasst eine 1:n-Zuordnung zwischen ausführenden Knoten und Teilanfragen. Das bedeutet, dass ein Knoten mehrere Teilgraphen erhalten kann, jedoch wird jeder Teilgraph genau einem Knoten zugeordnet. Folgende Allokationsstrategien wurden bisher verfolgt:

Nutzerbasiert: Der Nutzer gibt die Zuordnung vor (bspw. über eine grafische Oberfläche oder durch Annotationen im Anfragetext). Der Nutzer kann über Spezialwissen verfügen, das es ermöglicht, eine optimale Zuordnung anzugeben.

Round-Robin: Die Teilgraphen werden der Reihe nach an die Knoten verteilt.

Lastorientiert: Die Teilgraphen werden an die Knoten verteilt, welche aktuell die geringste Auslastung aufweisen. Durch diese Vorgehensweise kann die Systemlast im Netzwerk verteilt werden.

Contract-Net: Für jeden Teilgraphen wird eine Auktion ausgeschrieben, und jeder Knoten kann bei Interesse ein Gebot abgeben. Ein Gebot beschreibt die Bereitschaft, den Teilgraphen zu übernehmen. Dabei kann ein Gebot aus verschiedenen Faktoren zusammengesetzt werden. Beispielsweise spielt die Menge an verfügbaren Systemressourcen eine Rolle: Je mehr Ressourcen frei sind, desto besser kann der Teilgraph ausgeführt werden. Der Knoten mit dem höchsten Gebot erhält schlussendlich den Teilgraphen.

Ist das Netzwerk bekannt und sind alle Knoten homogen, kann Round-Robin für eine schnelle und einfache Verteilung der Anfrage genutzt werden. Sollen Auslastungen der Knoten berücksichtigt werden, ist die lastorientierte Strategie vorzuziehen. Sie kann auch eingesetzt werden, wenn die Knoten über unterschiedliche Leistungskapazitäten verfügen. Contract-Net sollte benutzt werden, wenn die Autonomie der Knoten zu berücksichtigen ist (d.h., die Knoten entscheiden selbstständig, welche Teilgraphen sie ausführen wollen).
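Zur Veranschaulichung skizziert der folgende Python-Ausschnitt zwei der genannten Allokationsstrategien (Round-Robin und lastorientiert). Die Datenstrukturen und Bezeichner sind frei gewählt und nicht Teil der Odysseus-Implementierung; die Skizze dient lediglich der Illustration der 1:n-Zuordnung zwischen Knoten und Teilgraphen.

    from itertools import cycle

    # Vereinfachte Skizze: Jeder Teilgraph wird genau einem Knoten zugeordnet,
    # ein Knoten kann mehrere Teilgraphen erhalten (1:n-Zuordnung).

    def allocate_round_robin(subgraphs, nodes):
        """Teilgraphen der Reihe nach auf die Knoten verteilen."""
        node_iter = cycle(nodes)
        return {sg: next(node_iter) for sg in subgraphs}

    def allocate_load_based(subgraphs, node_load):
        """Jeden Teilgraphen dem aktuell am wenigsten ausgelasteten Knoten zuweisen."""
        load = dict(node_load)                  # aktuelle Auslastung pro Knoten
        assignment = {}
        for sg, cost in subgraphs.items():      # cost: geschätzte Zusatzlast des Teilgraphen
            node = min(load, key=load.get)
            assignment[sg] = node
            load[node] += cost                  # Auslastung fortschreiben
        return assignment

    teilgraphen = {"TG1": 0.2, "TG2": 0.5, "TG3": 0.1}
    print(allocate_round_robin(teilgraphen, ["KnotenA", "KnotenB"]))
    print(allocate_load_based(teilgraphen, {"KnotenA": 0.3, "KnotenB": 0.1}))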
Das in dieser Arbeit vorgestellte Konzept sieht vor, dass Entwickler eigene Allokationsstrategien implementieren können, um bspw. domänenspezifisches Wissen einzusetzen oder spezielle Netzwerkstrukturen zu berücksichtigen.

4 Aktueller Stand

Das oben beschriebene Konzept wurde in Odysseus [AGG+12] als zusätzliche Komponente implementiert. Jede oben genannte Strategie ist verfügbar und kann in kontinuierlichen Anfragen eingesetzt werden. Dadurch kann Odysseus in unterschiedlichen Netzwerk-Architekturen eingesetzt werden, ohne dass umfangreiche Änderungen an der Verteilung durchgeführt werden müssen (es muss lediglich die Strategieauswahl angepasst werden). Im Folgenden werden zwei Anwendungsbeispiele von Odysseus vorgestellt, die aufzeigen sollen, wie das oben genannte Konzept flexibel eingesetzt werden kann.

Anwendungsfall 1: In einem Anwendungsfall wird Odysseus auf mehreren Knoten in einem heterogenen und autonomen P2P-Netzwerk eingesetzt, um Sportereignisse in Echtzeit auszuwerten. Die Daten werden mit Hilfe von aktiven Sensoren aufgenommen und an das Netzwerk gesendet. Die Sensoren sind dabei an spielrelevanten Entitäten wie den Spielern und dem Ball angebracht. Die Analyse geschieht mit zuvor verteilten kontinuierlichen Anfragen. Da zum einen ein solches Sensornetzwerk mehrere tausend Datenstromelemente pro Sekunde erzeugen kann und zum anderen die Analyse dieser Daten teuer ist, bietet sich eine Anfrageverteilung an, die in Listing 2 dargestellt ist.

1 #NODE_PARTITION operatorsetcloud
2 #NODE_MODIFICATION fragmentation horizontal hash n
3 #NODE_ALLOCATE contractnet
4
5 originale Analyse-Anfrage

Listing 2: Beispielhafte Verwendung der Anfrageverteilung für eine Sportanalyse.

Konkret sollen Operatoren, die viel Last erzeugen, auf unterschiedlichen Knoten ausgeführt werden (OperatorSetCloud-Partitionierungsstrategie). Dadurch werden unterschiedliche Sportanalysen von verschiedenen Knoten des Netzwerks übernommen. Zusätzlich soll der Datenstrom fragmentiert werden, um die Systemlast für einzelne Knoten zu verringern (Modifikationsstrategie hash-basierte horizontale Fragmentierung mit n Fragmenten). Da die Knoten heterogen und autonom sind, wird in diesem Fall die Contract-Net-Allokationsstrategie eingesetzt.

Anwendungsfall 2: In einem Windpark liefert jedes Windrad kontinuierlich Statusinformationen, die als Datenstrom interpretiert werden (z.B. die aktuell erzeugte Energie sowie Windrichtung und -geschwindigkeit). Diese Datenströme werden zur Überwachung und Kontrolle der Windräder benötigt und an ein homogenes Cluster aus Odysseus-Instanzen gesendet. Es können spezielle Datenstromelemente versendet werden, die Alarmmeldungen oder Störungen signalisieren. Aus diesem Grund ist es wichtig, dass jedes Datenstromelement (jede Alarmmeldung oder Störung) verarbeitet wird, auch wenn ein Knoten in dem verarbeitenden Cluster ausfällt. Listing 3 zeigt eine mögliche Anfrageverteilung für dieses Szenario unter der Verwendung eines homogenen Clusters.

1 #NODE_PARTITION querycloud
2 #NODE_MODIFICATION replication n
3 #NODE_ALLOCATE roundrobin
4
5 originale Überwachungs-Anfrage

Listing 3: Beispielhafte Verwendung der Anfrageverteilung für eine Windpark-Überwachung.

Konkret sollen alle Operatoren der Anfrage auf einem Knoten ausgeführt werden, da die Cluster-Knoten mit ausreichend Ressourcen ausgestattet sind (QueryCloud-Partitionierungsstrategie).
Zusätzlich wird der Datenstrom repliziert, um die Ausfallsicherheit zu erhöhen und um Alarmmeldungen nicht zu verlieren (Modifikationsstrategie Replikation mit n Replikaten). Als Allokator kommt die Round-Robin-Strategie zum Einsatz, da es sich um ein homogenes Cluster mit identischen Knoten handelt. 5 Zusammenfassung Für eine Anfrageverteilung in verteilten DSMSs gibt es je nach Netzwerk-Architektur und Anwendungsfall unterschiedliche Strategien. Viele Systeme haben sich auf eine Verteilungsstrategie spezialisiert, wodurch sie nur mit großem Aufwand an eine Änderung der Netzwerk-Architektur angepasst werden können. Außerdem ist die Anfrageverteilung in vielen Systemen nicht durch den Nutzer konfigurierbar, was es unmöglich macht anwendungsspezifische Kenntnisse einzubringen. Aus diesem Grund wurde in dieser Arbeit ein modularer Konzept für eine flexible und erweiterbare Anfrageverteilung in verteilten DSMSs vorgestellt. Das Konzept sieht dabei einen logischen Operatorgraphen als Eingabe vor und liefert verteilte Teilgraphen als Ausgabe. Strukturell umfasst er drei Schritte: (1) Partitionierung, (2) Modifikation und (3) Allokation. Bei der Partitionierung wird der logische Operatorgraph in disjunkte Teilgraphen zerlegt, um die Operatoren zu identifizieren, die gemeinsam auf einem Knoten im Netzwerk ausgeführt werden sollen. Die optionale Modifikation erlaubt es Mechanismen wie Fragmentierung oder Replikation zu verwenden, indem die Teilgraphen modifiziert werden. In der Allokationsphase werden die einzelnen (modifizierten) Teilgraphen Knoten im Netzwerk zugewiesen. Für jeden der drei Schritte gibt es Schnittstellen, wodurch unabhängige Strategien miteinander kombiniert werden können. Dieser modulare Aufbau ermöglicht zum einen eine individuelle Anfrageverteilung. Zum anderen können bereits vorhandene Strategien aus anderen Arbeiten und Systemen (z.B. eine Allokationsstrategie) integriert werden. In dieser Arbeit wurden für jeden der drei Teilschritte beispielhafte Strategien vorgestellt. Das Konzept wurde im DSMS Odysseus als zusätzliche Komponente implementiert und erfolgreich in verschiedenen Anwendungsszenarien eingesetzt. Zwei Anwendungsszena- 59 rien wurden in dieser Arbeit kurz vorgestellt: (1) Die Sportanalyse in Echtzeit mittels einem P2P-Netzwerk aus heterogenen Knoten und (2) die Überwachung eines Windparks. Odysseus musste in beiden Anwendungsfällen lediglich bei der Strategieauswahl angepasst werden. Dies zeigt, dass das oben beschriebene Konzept zur Verteilung kontinuierlicher Anfragen flexibel und erweiterbar ist. Literatur [AGG+ 12] H.-Jürgen Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: a highly customizable framework for creating efficient event stream management systems. DEBS ’12, Seiten 367–368. ACM, 2012. [BMS+ 13] Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders und Christian Schulz. Recent Advances in Graph Partitioning. CoRR, abs/1311.3144, 2013. [CBB+ 03] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Cetintemel, Ying Xing und Stan Zdonik. Scalable Distributed Stream Processing. In CIDR 2003 - First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, January 2003. [DLBMW11] Michael Daum, Frank Lauterwald, Philipp Baumgärtel und Klaus Meyer-Wegener. Kalibrierung von Kostenmodellen für föderierte DSMS. In BTW Workshops, Seiten 13–22, 2011. [GAW09] Buğra Gedik, Henrique Andrade und Kun-Lung Wu. 
A Code Generation Approach to Optimizing High-performance Distributed Data Stream Processing. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, Seiten 847–856, New York, NY, USA, 2009. ACM. [GJPPM+ 12] Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patino-Martinez, Claudio Soriente und Patrick Valduriez. StreamCloud: An Elastic and Scalable Data Streaming System. IEEE Transactions on Parallel and Distributed Systems, 23(12):2351–2365, 2012. [KSKR05] Richard Kuntschke, Bernhard Stegmaier, Alfons Kemper und Angelika Reiser. Streamglobe: Processing and sharing data streams in grid-based p2p infrastructures. In Proceedings of the 31st international conference on Very large data bases, Seiten 1259–1262. VLDB Endowment, 2005. [LHKK12] Simon Loesing, Martin Hentschel, Tim Kraska und Donald Kossmann. Stormy: an elastic and highly available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, EDBT-ICDT ’12, Seiten 55–60, New York, NY, USA, 2012. ACM. [TTS+ 14] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal und Dmitriy V. Ryaboy. Storm@twitter. In SIGMOD Conference, Seiten 147–156, 2014. [WK09] Daniel Warneke und Odej Kao. Nephele: Efficient Parallel Data Processing in the Cloud. In Proceedings of the 2Nd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS ’09, Seiten 8:1–8:10, New York, NY, USA, 2009. ACM. 60 Placement-Safe Operator-Graph Changes in Distributed Heterogeneous Data Stream Systems Niko Pollner, Christian Steudtner, Klaus Meyer-Wegener Computer Science 6 (Data Management) Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) [email protected], [email protected], [email protected] Abstract: Data stream processing systems enable querying continuous data without first storing it. Data stream queries may combine data from distributed data sources like different sensors in an environmental sensing application. This suggests distributed query processing. Thus the amount of transferred data can be reduced and more processing resources are available. However, distributed query processing on probably heterogeneous platforms complicates query optimization. This article investigates query optimization through operator graph changes and its interaction with operator placement on heterogeneous distributed systems. Pre-distribution operator graph changes may prevent certain operator placements. Thereby the resource consumption of the query execution may unexpectedly increase. Based on the operator placement problem modeled as a task assignment problem (TAP), we prove that it is NP-hard to decide in general whether an arbitrary operator graph change may negatively influence the best possible TAP solution. We present conditions for several specific operator graph changes that guarantee to preserve the best possible TAP solution. 1 Introduction Data stream processing is a well suited technique for efficient analysis of streaming data. Possible application scenarios include queries on business data, the (pre-)processing of measurements gathered by environmental sensors or by logging computer network usage or online-services usage. In such scenarios data often originate from distributed sources. Systems based on different software and hardware platforms acquire the data. 
With distributed data acquisition, it is feasible to distribute query processing as well, instead of first sending all data to a central place. Some query operators can be placed directly on or near the data acquisition systems. This omits unnecessary transfer of data that are not needed to answer the queries, and partitions the processing effort. Thus querying high frequency or high volume data becomes possible that would otherwise require expensive hardware or could not be processed at all. Also data acquisition devices like wireless sensor nodes profit from early operator execution. They can save energy if data is filtered directly at the source. 61 Problem Statement Optimization of data stream queries for a distributed heterogeneous execution environment poses several challenges: The optimizer must decide for each operator on which processor it should be placed. Here and in the remainder of this article, the term processor stands for a system that is capable to execute operators on a data stream. A cost model is a generic base for the operator placement decision. It can be adapted to represent the requirements of specific application scenarios, so that minimal cost represents the best possible operator distribution. Resource restrictions on the processors and network links between them must also be considered. In a heterogeneous environment costs and capacities will vary among the available processors. The optimizer can optimize the query graph before and after the placement decision. Pre-placement changes of the query graph may however foil certain placements and a specific placement limits the possible post-placement algebraic optimization. For example, a change of the order of two operators can increase costs if the first operator of the original query was available directly on the data source and the now first operator in the changed query is not. This must be considered when using common rules and heuristics for query graph optimization. Contribution We investigate the influence of common algebraic optimization techniques onto a following operator placement that is modeled as a task assignment problem (TAP). We prove that the general decision whether a certain change of the query graph worsens the best possible TAP solution is NP-hard. We then present analysis of different common operator graph changes and state the conditions under which they guarantee not to harm the best possible placement. We do not study any special operator placement algorithm, but focus on preconditions for graph changes. Article Organization The following section gives an overview on related work from both the fields of classical database query optimization and data stream query optimization. Sect. 3 introduces the TAP model for the operator placement. It is the basis for the following sections. We prove the NP-hardness of the query-graph-change influence in Sect. 4 and present the preconditions for special graph changes in Sect. 5. The next section shows how to use the preconditions with an exemplary cost model for a realistic query. In the last section we conclude and present some ideas for further research. 2 Related Work This section presents related work on operator graph optimization from the domains of data base systems (DBS) and data stream systems (DSS). Due to space limitations, we are unfortunately only able to give a very rough overview. Query optimization in central [JK84] as well as in distributed [Kos00] DBS is a well studied field. 
Basic ideas like operator reordering are also applicable to DSS. Some operators, especially blocking operators, however, have different semantics. Other techniques like the optimization of data access have no direct match in DSS. Strict resource restrictions are also rarely considered with distributed DBS because they are not expected to run on highly restricted systems.

The authors of [HSS+ 14] present a catalog of data stream query optimizations. For each optimization, realistic examples, preconditions, its profitability, and dynamic variants are listed. Besides operator graph changes the article also presents other optimizations, like load shedding, state sharing, operator placement and more. They pose the question for future research in which order different optimizations should be performed. In the paper at hand, we take a first step in this direction by studying the influence of operator graph changes on subsequent placement decisions. We detail the impact of all five operator graph changes from [HSS+ 14]. We think that these changes cover the common query graph optimizations.

The articles [TD03] and [NWL+ 13] present different approaches to dynamic query optimization. The basic idea is that the order in which tuples visit operators is dynamically changed at runtime. The concept of distributed Eddies from [TD03] decides this on a per-tuple basis. It does not take the placement of operators into account. Query Mesh [NWL+ 13] precreates different routing plans and decides at runtime which plan to use for a set of tuples. It does not consider distributed query processing.

3 Operator Placement as Task Assignment Problem

The operator placement can be modeled as a TAP. Operators are represented as individual tasks. We use the following TAP definition, based on the definition in [DLB+ 11]. P is the set of all query processors. L is the set of all communication channels. A single communication channel l ∈ L is defined as l ⊆ {P × P}. A communication channel subsumes the communication between processors that share a common medium. T is the set of all operators. The data rate (in Byte) between two operators is given by rt1t2, t1, t2 ∈ T. The operators and rates represent the query graph. ctp are the processing costs of operator t on processor p. kp1p2 gives the cost of sending one Byte of data between processor p1 and processor p2. The costs are based on some cost model according to the optimization goal. Since cost models are highly system- and application-specific, we do not assume a specific cost model for our study of query-graph-change effects. Sect. 6 shows how to apply our findings to an exemplary cost model. [Dau11, 98–121] presents methods for the estimation of operator costs and data rates.

The distribution algorithm tries to minimize the overall cost. It does this by minimizing term (1), considering the constraints (2)–(5). The sought variables are xtp. xtp = 1 means that task t is executed on processor p. The first sum in equation (1) is the overall processing cost. The second sum is the total communication cost.

min  Σ_{t∈T} Σ_{p∈P} ctp · xtp  +  Σ_{t1∈T} Σ_{p1∈P} Σ_{t2∈T} Σ_{p2∈P} kp1p2 · rt1t2 · xt1p1 · xt2p2   (1)

subject to

Σ_{t∈T} ctp · xtp ≤ b(p),   ∀p ∈ P   (2)

Σ_{t1∈T} Σ_{t2∈T} Σ_{(p1,p2)∈l} rt1t2 · xt1p1 · xt2p2 ≤ d(l),   ∀l ∈ L   (3)

Σ_{p∈P} xtp = 1,   ∀t ∈ T   (4)

xtp ∈ {0, 1},   ∀p ∈ P, ∀t ∈ T   (5)

Constraint (2) limits the tuple processing cost of the operators on one processor to its capacity b(p).
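To make the formulation concrete before the remaining constraints are discussed, the following small sketch enumerates all placements of a toy query over two processors and picks the cheapest assignment that satisfies constraints (2) and (3); constraints (4) and (5) hold by construction of the enumeration. It only illustrates the TAP model and is not part of the paper or of any particular distribution algorithm; all operator names, rates and capacities are invented.

from itertools import product

T = ["source", "filter", "aggregate"]                     # operators (tasks)
P = ["sensor", "server"]                                  # processors
c = {("source", "sensor"): 1.0, ("source", "server"): 2.0,
     ("filter", "sensor"): 2.0, ("filter", "server"): 1.0,
     ("aggregate", "sensor"): 5.0, ("aggregate", "server"): 1.0}   # c_tp
r = {("source", "filter"): 100.0, ("filter", "aggregate"): 10.0}   # r_t1t2 in Byte
k = {("sensor", "server"): 0.01, ("server", "sensor"): 0.01,
     ("sensor", "sensor"): 0.0, ("server", "server"): 0.0}         # k_p1p2 per Byte
b = {"sensor": 4.0, "server": 10.0}                       # processing capacity b(p)
channels = [{("sensor", "server"), ("server", "sensor")}] # L, one shared channel
d = [150.0]                                               # channel capacity d(l) in Byte

def total_cost(x):
    # objective (1): processing costs plus communication costs
    processing = sum(c[t, x[t]] for t in T)
    communication = sum(k[x[t1], x[t2]] * rate for (t1, t2), rate in r.items())
    return processing + communication

def is_feasible(x):
    # constraint (2): processing load per processor must not exceed b(p)
    if any(sum(c[t, p] for t in T if x[t] == p) > b[p] for p in P):
        return False
    # constraint (3): data rate per communication channel must not exceed d(l)
    for l, cap in zip(channels, d):
        if sum(rate for (t1, t2), rate in r.items() if (x[t1], x[t2]) in l) > cap:
            return False
    return True   # constraints (4) and (5) hold by construction

placements = [dict(zip(T, choice)) for choice in product(P, repeat=len(T))]
best = min((x for x in placements if is_feasible(x)), key=total_cost)
print(best, total_cost(best))

Real distribution algorithms do not enumerate placements like this, of course; the sketch only shows how the objective and the constraints interact.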
Constraint (3) limits the communication rate on one communication channel to its capacity d(l). Constraints (4) and (5) make sure that each task is distributed to exactly one single processor. Our findings are solely based on the objective function together with the constraints. We do not assume any knowledge about the actual distribution algorithm. There exist different heuristics for solving a TAP. See e.g. [DLB+ 11] and [Lo88]. 4 Generic Operator-Graph-Change Influence Decision A single algebraic query transformation changes the TAP in numerous ways. For example an operator reordering changes multiple data rates, which are part of multiple equations inside the TAP. When some of those factors increase, it is hard to tell how it affects a following operator placement. The transformed query might even become impossible to execute. One way to determine the usefulness of a given transformation is to compare the minimum costs of both the original and the transformed query graph. If the transformed query graph has lower or equal cost for the optimal operator placement, i.e. lower or equal minimum cost, than the original query, the transformation has a non-negative effect. U (Q) denotes the query graph that results from applying a change U to the original query Q. Since the operator placement needs to solve a TAP, an NP-complete problem, it is not efficient to compute the placement for each possible transformation. A function CompareQuery(Q, U (Q)), that compares two queries and returns true iff U (Q) has smaller or equal minimal costs than Q would solve the problem. Sentence. CompareQuery(Q, U (Q)) is NP-hard. Definition. Utp is a transformation that allows task t only to be performed by processor p. All other aspects of Utp (Q) are identical to Q. Both Q and Utp (Q) have equal costs when operators are placed in the same way, i.e. as long as t is placed on p. Proof. Given CompareQuery(Q, U (Q)) and transformations Utp it is possible to compute the optimal distribution. For each task t it is possible to compare Q and Utp (Q) for 64 each processor p. If CompareQuery(Q, Utp (Q)) returns true Utp (Q) has the same minimum cost as Q. Thus the optimal placement of t is p. The algorithm in pseudo code: ComputeDistribution(Q) { foreach (t in Tasks) { foreach (p in Processors) { if (CompareQuery(Q, U_tp(Q)) == true) { DistributeTaskProcessor(t, p); // makes sure t will be distributed to p break; // needed if multiple distributions exist } } } } ComputeDistribution(Q) calls CompareQuery(Q, U (Q)) at most |T | · |P | times. This is a polynomial time reduction of ComputeDistribution(Q). To compute the optimal distribution it is necessary to solve the TAP, an NP-complete problem. This proves that CompareQuery(Q, U (Q)) is NP-hard. 5 Specific Query Graph Changes While the general determination of a query graph change’s impact is NP-hard, it can easily be determined for specific cases. If a transformation neither increases variables used for the TAP nor adds new variables to the TAP, it is trivial to see that all valid operator placement schemes are still valid after the transformation. For the transformed query exist operator placements with lower or equal costs than the original query’s costs: the original optimal placement is still valid and has lesser or equal costs. We establish preconditions for all the five operator graph changes from [HSS+ 14]. If the preconditions are met the transformation is safe. 
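As an aside, the reduction sketched in the proof of Sect. 4 can be transcribed into executable form. The following sketch assumes that CompareQuery is available as an oracle function compare_query(Q, U(Q)) and that u_tp(Q, t, p) builds the transformation Utp(Q); both are placeholder names for this illustration, not an implementation from the paper.

def compute_distribution(query, tasks, processors, compare_query, u_tp):
    # compare_query(q1, q2): assumed oracle, True iff q2 has minimal costs
    # smaller than or equal to those of q1 (the CompareQuery function above).
    # u_tp(q, t, p): assumed helper building U_tp(Q), which allows task t
    # to run only on processor p.
    placement = {}
    for t in tasks:
        for p in processors:
            pinned = u_tp(query, t, p)
            if compare_query(query, pinned):
                placement[t] = p   # equal minimum cost, so placing t on p is optimal
                break              # needed if multiple optimal distributions exist
    return placement               # uses at most |T| * |P| oracle calls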
That means for each valid operator placement scheme of the original query exists a valid scheme for the transformed query with equal costs. So the preconditions especially guarantee that the minimum costs do not rise. However, if heuristic algorithms are used for solving the TAP, they may fail to find an equally good solution for the transformed query as they did for the original query and vice versa, because local minima may change. Table 1 shows all preconditions at a glance. We now justify why these preconditions hold. Notation Most of the used notation directly follows from the TAP, especially ctp and rt1 t2 . The cost cAp of an operator A on the processor p depends on the input stream of A and thus on the overall query executed before that operator. Query graph changes affect the input streams of operators an thus also change the costs needed to execute those operators. In order to distinguish between the original and the changed query we use U (A) to indicate the operator A with the applied query change. U (A) and A behave in the same way, but may have different cost, since they work on different input streams. The costs cU (A)p are needed to execute U (A) on p and the following operator t receives an input stream with 65 the data rate rU (A)t . In addition rI denotes the input data stream and rO denotes the output data stream. Operator Reordering Operator reordering switches the order of two consecutive operators. The operator sequence A → B is transformed to U (B) → U (A). In the original query, operator A is placed on processor pA and operator B on pB . It is possible that pA is the same processor as pB , but it is not known whether both operators are on the same processor, so this cannot be assumed. pA = pB would result in a set of preconditions that are easier to fulfill than the preconditions we present. The transformed query can place the operators U (B) and U (A) on any of the processors pA and pB . Case 1: U (B) is placed on pA and U (A) is placed on pB . To insure the validity of all distributions, the transformed operators’ cost must not exceed the cost of the other original operator, which results in equations (6) and (7). Since the reordering affects the data rates between operators, precondition (8) must hold. Case 2: Both operators are placed on pA . This adds an internal communication inside pA to the operator graph. Equation (9) ensures that internal communication is not factored into the TAP constraints and cost function. The sum of the cost for both transformed operators must be smaller or equal than the cost of A, which is described by equation (10). The changed data rates are reflected in equation (11). Case 3: Both operators are placed on pB . This case is similar to case 2 and can be fulfilled with the preconditions given by equations (9), (12) and (13). Case 4: The remaining option, U (B) is placed on pB and U (A) is placed on pA , can be viewed as changed routing. Since the remaining distribution of the query is unknown, the changed routing can be problematic and this option is inherently not safe. It is possible that pA processes the operator that sends the input to A and that pB has an operator that processes the output stream of B. In this situation the changed routing causes increased communication cost, since the tuples must be send from pA to pB (applying B) to pA (applying A) to pB instead of only sending them once from pA to pB . If one of the operators has more than one input stream not all cases can be used. 
Even if the stream does not need to be duplicated, if A has additional input streams only case 2 is valid. The other cases are not safe anymore, because the transformation changes the routing of the second stream from destination pA to destination pB. Similarly, if B has additional input streams only case 3 is safe.

Redundancy Elimination This query change eliminates a redundant operator: the query graph has an operator A at two different positions processing the same input stream, duplicated by another operator. This change works by removing one of the instances of A and duplicating its output. The original query consists of three operators. Operator D (Dup Split in [HSS+ 14]) is placed on pD, while an instance of A is placed both on p1 and p2. The transformed query consists of the operators U(A) and U(D), with U(D) duplicating the output instead of the input. The only possibility to place the transformed query without changing routing is to place both U(A) and U(D) on pD. The additional internal communication, due to the additional operator on pD, again forces equation (9). To ensure that any processor can perform the transformed operators, equation (14) is necessary. In some situations (when pD is the same processor as p1 or p2) the change is safe as long as A does not increase the data rate. But since it is unknown how the operators will be placed, this requirement is not sufficient.

Transformation | Case | Preconditions (∀p ∈ P)
Operator reordering | Case 1: U(B) on pA, U(A) on pB | cAp ≥ cU(B)p (6); cBp ≥ cU(A)p (7); rAB ≥ rU(B)U(A) (8)
Operator reordering | Case 2: U(B) on pA, U(A) on pA | kpp = 0 ∧ ∀l ∈ L: (p, p) ∉ l (9); cAp ≥ cU(B)p + cU(A)p (10); rAB ≥ rO (11)
Operator reordering | Case 3: U(B) on pB, U(A) on pB | (9); cBp ≥ cU(B)p + cU(A)p (12); rAB ≥ rI (13)
Redundancy elimination | – | (9); cDp ≥ cU(A)p + cU(D)p (14)
Operator separation | – | (9); cAp ≥ cA1p + cA2p (15)
Fusion | Case 1: all on pA | cAp ≥ cCp (16); rAB ≥ rO (17)
Fusion | Case 2: all on pB | cBp ≥ cCp (18); rAB ≥ rI (19)
Fission | – | (9); cAp ≥ cSp + cMp + Σ_{U(A)} cU(A)p (20)

Table 1: Preconditions for safe query graph changes that must be fulfilled for all processors. If an operator is not available on some processors, the preconditions can be assumed fulfilled for these processors. It is sufficient that the preconditions of one case are fulfilled.

Operator Separation The operator separation splits an operator A into the two operators A1 → A2. Additional internal communication results in precondition (9). Equation (15) ensures that the separated operators' costs are together less than or equal to A's cost.

Fusion Fusion is the opposite transformation to operator separation. The two operators A → B are combined into the single operator C (a superbox in [HSS+ 14]). For the original query A is placed on pA and B on pB. The combined operator can be placed on either pA or pB. The cost of C must not exceed the cost of A on pA or of B on pB, respectively. In addition, the data rates are affected and thus also add preconditions. So either the fulfillment of equations (16) and (17) (if C is placed on pA) or of (18) and (19) (if C is placed on pB) guarantees the safety of this change. A special case of fusion is the elimination of an unneeded operator, i.e. removing the operator does not change the query result. Since the redundant operator can change the data rate of a stream (e.g.
a filter applied before a more restrictive filter) it still needs to fulfill the preconditions to be safe.

Fission The original query is only the single operator A. Fission replaces A by a partitioned version of it, by applying a split operator S, multiple versions of U(A), which can potentially be distributed across different processors, and finally a merge operator M to unify the streams again. Since it is unknown whether other processors exist that can share the workload profitably, the transformed operators must be placed on the processor that executed the original A. This is safe when preconditions (9) and (20) hold. These equations demand that the combined costs of the split, the merge and all parallel versions of U(A) can be executed by every processor with smaller or equal cost than the original A.

6 Application

Given a query and a DSS it is now possible to test whether a specific change is safe. Using an exemplary cost model we examine a simple example query.

Cost Model [Dau11, 91–98] presents a cost model that will be used for the following example. We use a filter and a map operator, which have the following costs:

CFilter = λi · CFil + λo · CAppendOut   (21)
CMap = λi · Cproj + λo · CAppendOut   (22)

CFil and Cproj are the costs associated with filtering respectively projecting an input tuple arriving at the operator. CAppendOut represents the cost of appending one tuple to the output stream. λi is the input stream tuple rate, while λo is the output stream tuple rate. For these operators λo is proportional to λi, and the equations (21) and (22) can be simplified to λi · fOp, where fOp is the cost factor of operator O on processor p for one tuple. Using these simplified equations and the assumptions that the tuple rate is proportional to the data rate and that costs and selectivities are non-zero, equations (6) to (8) can be rewritten as:

λI · fAp ≥ λI · fBp  ⇔  fAp / fBp ≥ 1   (23)
σA · λI · fBp ≥ σU(B) · λI · fAp  ⇔  (σA / σU(B)) · (fBp / fAp) ≥ 1   (24)
σA · λI ≥ σU(B) · λI  ⇔  σA / σU(B) ≥ 1   (25)

The equations for the other two cases shown in Table 1 can be rewritten similarly. Equations (23) to (25) show that there are relatively few values to compare: we need the ratio of the operator selectivities and, for each processor, the ratio of the operator costs.

Example We examine the simple query of a map operator M followed by a filter F applied to a stream containing image data monitoring conveyor belts transporting freshly produced items. The query supports judging the quality of the current production run. Operator M classifies each tuple (and thus each observed produced item) into one of several quality classes and is rather expensive. F filters the stream for one conveyor belt, because different conveyor belts transport different items and are observed by different queries. M does not change the data rate of the stream. It simply replaces the value unclassified already stored inside the input stream for each tuple with the correct classification and thus has a selectivity of 1. There are multiple types of processors available inside the production hall. Depending on the processor type the ratio fMp / fFp differs quite a bit, but overall M is more expensive: this ratio fluctuates between 2 and 10. Equations (23) to (25) show that the selection push-down is always safe if σU(F) is smaller than or equal to 0.1: in this case it is always possible that the two operators switch their places without violating additional constraints of the TAP. If σU(F) is greater than 0.1 this change is not necessarily safe.
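The decision rule derived above can be written down directly. The following sketch checks the rewritten preconditions (23)–(25) for the map/filter example; it is only an illustration, and the concrete cost-factor ratios are assumed values within the range of 2 to 10 mentioned in the text.

def reordering_safe_case1(f_A, f_B, sigma_A, sigma_UB):
    # preconditions (23)-(25) for one processor under the simplified cost model
    eq23 = f_A / f_B >= 1.0                            # from (6): cAp >= cU(B)p
    eq24 = (sigma_A / sigma_UB) * (f_B / f_A) >= 1.0   # from (7): cBp >= cU(A)p
    eq25 = sigma_A / sigma_UB >= 1.0                   # from (8): rAB >= rU(B)U(A)
    return eq23 and eq24 and eq25

# Example from the text: map M (selectivity 1) before filter F; the ratio
# fMp / fFp varies between 2 and 10 depending on the processor type.
sigma_M, sigma_UF = 1.0, 0.1
assumed_ratios = [2.0, 5.0, 10.0]                      # assumed fMp / fFp values
safe_on_all_processors = all(
    reordering_safe_case1(f_A=ratio, f_B=1.0, sigma_A=sigma_M, sigma_UB=sigma_UF)
    for ratio in assumed_ratios)
print(safe_on_all_processors)   # True as long as sigma_U(F) <= 0.1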
It is possible that the preconditions of one of the other two cases (both operators on the same processor) are fulfilled or another good distribution is possible, but the latter cannot be tested in a reasonable time as we discussed in Sect. 4. 69 7 Conclusion We presented our findings on the interaction between optimization through query graph changes and the placement of operators on different heterogeneous processing systems. We first motivated our research and defined the problem. Existing work on query optimization through operator graph changes in the context of DMS and DSS was presented, none of which studied the interaction with operator placement. The next section presented the TAP model of the distribution problem. We showed that it is NP-hard to decide in general if an arbitrary query graph change can negatively influence the best possible operator placement scheme. Based on a selection of common query graph changes from the literature, we deduced preconditions under which operator placement does not mind the changes. The last section showed the application of our findings with an exemplary cost model for a realistic query. The preconditions for safe operator graph changes are quite restrictive. They severely limit the possible changes if followed strictly. As with general query optimization, development of heuristics to loosen certain preconditions seems promising. The preconditions presented in this article are the basis for such future work. Another interesting field is the direct integration of query graph optimization in the usually heuristic distribution algorithms. Distribution algorithms could be extended to consider query graph changes in addition to the operator placement. We plan to investigate these ideas in our future research. References [Dau11] M. Daum. Verteilung globaler Anfragen auf heterogene Stromverarbeitungssysteme. PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), 2011. [DLB+ 11] M. Daum, F. Lauterwald, P. Baumgärtel, N. Pollner, and K. Meyer-Wegener. Efficient and Cost-aware Operator Placement in Heterogeneous Stream-Processing Environments. In Proceedings of the 5th ACM International Conference on Distributed Event-Based Systems (DEBS), pages 393–394, New York, NY, USA, 2011. ACM. [HSS+ 14] M. Hirzel, R. Soulé, S. Schneider, B. Gedik, and R. Grimm. A Catalog of Stream Processing Optimizations. ACM Comput. Surv., 46(4):1–34, 2014. [JK84] M. Jarke and J. Koch. Query Optimization in Database Systems. ACM Comput. Surv., 16(2):111–152, 1984. [Kos00] D. Kossmann. The State of the Art in Distributed Query Processing. ACM Comput. Surv., 32(4):422–469, 2000. [Lo88] V. M. Lo. Heuristic Algorithms for Task Assignment in Distributed Systems. IEEE Transactions on Computers, 37(11):1384–1397, 1988. [NWL+ 13] R. V. Nehme, K. Works, C. Lei, E. A. Rundensteiner, and E. Bertino. Multi-route Query Processing and Optimization. J. Comput. System Sci., 79(3):312–329, 2013. [TD03] F. Tian and D. J. DeWitt. Tuple Routing Strategies for Distributed Eddies. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 333–344. VLDB Endowment, 2003. 70 Herakles: A System for Sensor-Based Live Sport Analysis using Private Peer-to-Peer Networks Michael Brand, Tobias Brandt, Carsten Cordes, Marc Wilken, Timo Michelsen University of Oldenburg, Dept. 
of Computer Science Escherweg 2, 26129 Oldenburg, Germany {michael.brand,tobias.brandt,carsten.cordes,marc.wilken,timo.michelsen}@uni-ol.de Abstract: Tactical decisions characterize team sports like soccer or basketball profoundly. Analyses of training sessions and matches (e.g., mileage or pass completion rate of a player) form more and more a crucial base for those tactical decisions. Most of the analyses are video-based, resulting in high operating expenses. Additionally, a highly specialized system with a huge amount of system resources like processors and memory is needed. Typically, analysts present the results of the video data in time-outs (e.g., in the half-time break of a soccer match). Therefore, coaches are not able to view statistics during the match. In this paper we propose the concepts and current state of Herakles, a system for live sport analysis which uses streaming sensor data and a Peer-to-Peer network of conventional and low-cost private machines. Since sensor data is typically of high volume and velocity, we use a distributed data stream management system (DSMS). The results of the data stream processing are intended for coaches. Therefore, the front-end of Herakles is an application for mobile devices (like smartphones or tablets). Each device is connected with the distributed DSMS, retrieves updates of the results and presents them in real-time. Therefore, Herakles enables the coach to analyze his team during the match and to react immediately with tactical decisions. 1 Introduction Sport is a highly discussed topic around the world. Fans, the media, teams and athletes discuss about performance, tactical decisions and mistakes. Statistics about sports are an important part of these discussions. Some statistics are calculated manually by people outside the playing field (e.g., counting ball contacts). Other statistics are retrieved automatically by computer-based analysis systems. However, such systems require expensive computers to calculate the statistics due to the huge amount of data to process. Not every team is able to buy and maintain them. Additionally, many systems need too much time to calculate the statistics, making it impossible to deliver them during the game. Many already existing sport analysis systems are using several cameras placed around the playing field. This increases the costs for buying and maintaining the system even further. In this paper, we propose Herakles, a live sport analysis system based solely on conventional low-cost hardware and software we are currently working on. Instead of using costly cameras, Herakles uses a network of sensors attached to all game-relevant objects, e.g., the players and the ball. Since sensor systems are usually not allowed during games and 71 opponents won’t wear the sensors anyway, Herakles focuses on the use for tryouts. The active sensors regularly send their actual position, acceleration and other values as data streams. Therefore, Herakles uses a distributed data stream management system (DSMS). With such a DSMS, it is possible to compute relevant sport statistics in real-time. In our work, real-time means ”near to the actual event”(rather than in timeouts or after a match). But using a single machine with a DSMS installed invokes the possibility of single-pointof-failures: (1) a single DSMS instance would run the risk of overload and (2) the live analysis stops completely if the single DSMS instance fails. 
Therefore, Herakles uses multiple machines to share the processing load and to increase the reliability of the entire system. Currently, many already existing distributed DSMSs are using computer clusters, grids, etc. (e.g., [vdGFW+ 11]). Since Herakles focuses on using a collection of smaller low-cost private machines like notebooks, a private Peer-to-Peer (P2P) network is more feasible. In our work, each peer is considered to be heterogeneous and autonomous. This makes it possible to use these types of private machines for distributed data stream processing of sensor data for live sport analysis. Although the peers are low-cost private machines, the system costs depend on the used sensor network. To show the calculated sport statistics in a convenient way for different users, Herakles focuses on mobile devices for the presentation (e.g., smartphones or tablets). With this, the statistics can be directly shown in real-time. Additionally, Herakles abstracts from specific sensors as well as from a specific sport. Statistics for different sports can be calculated using different sensors but based on the same system. Subsequently, these statistics have to be presented in a way that the technological background is encapsulated from the user, e.g., the user does not need to configure the P2P network. The remainder of this paper is structured as follows: Section 2 gives an overview of related sport analysis systems. The concept of Herakles is explained in Section 3. In Section 4 we give an overview on the current state of our implementation. Section 5 gives a prospect of future work and Section 6 concludes this paper. 2 Related Work There are many existing sport analysis systems developed and used. However, the majority of them is built on cameras. Many of them are commercial products like Synergy Sports Technology1 or Keemotion2 . Other projects are scientific. For instance, Baum [Bau00] provides a video-based system for performance analysis of players and objects. It uses high-speed video cameras and was tested in the context of baseball. As mentioned earlier, there are many problems related to video-based systems, especially the high acquisition and maintenance costs. Leo et al. [LMS+ 08] refer to a video-based soccer analysis of players and objects. In their approach, the ball and players are tracked by six video cameras. Then, the video data is 1 http://corp.synergysportstech.com/ 2 http://www.keemotion.com/ 72 received by six machines and processed by a central supervisor. The arising problems of this approach are the same as for Baum [Bau00]. Smeaton et al. [SDK+ 08] propose an analysis of the overall health of the sportsperson in the context of football games. Their sensor-based approach includes body sensors in combination with video recordings. The location is tracked by GPS, but the data can only be reviewed after the game and their approach currently handles only one person at a time. Von der Grün [vdGFW+ 11] proposes a sensor-based approach called RedFIR. It uses expensive hardware for the real-time analysis (e.g., SAP HANA for a german soccer team3 ), which leads to high acquisition and maintenance costs like in [Bau00]. Additionally, their data is processed in an in-memory database instead of a DSMS. To our best knowledge, there is no sport analysis system that uses a network of conventional, low-cost machines for processing streaming sensor data and that shows the results on a mobile device in real-time. 3 Concept In this section, we give an overview of Herakles. 
Its architecture is shown in figure 1 and is divided into three components: (1) sensor network, (2) P2P network of DSMSs and (3) mobile device.

Figure 1: Architecture of Herakles with sensors, a P2P network and a mobile device (sensor data streams flow from the position sensors via the position receivers into the P2P network of DSMSs; queries and sport statistics are exchanged with the mobile device).

We equip important game entities with sensors to form the sensor network. Each of these sensors sends the position and other information like speed and acceleration of the respective entity. This information is captured by receivers, which are placed around the playing field. These receivers are connected to the next component, the P2P network of conventional (private) machines with DSMSs installed. The DSMSs receive the sensor data stream. To be independent of a specific sport and sensor technology, Herakles first separates the incoming raw sensor data into multiple intermediate schemata (data abstraction). The DSMSs use continuous queries built on top of these intermediate schemata to continuously analyze the data and to calculate the sport statistics. With this, we can replace the sensors without changing the continuous queries.

3 http://tinyurl.com/o82mub5

The query results are streamed to the mobile device, the last component in the architecture of Herakles. The mobile device visualizes the statistics, and the user can decide at any time which statistics should be shown at the moment. Inside Herakles, each statistic is mapped to a continuous query, which is distributed, installed and executed in the P2P network. The distribution is necessary to make sure that no single peer processes a complete query (avoiding overloads).

In the following sections, we give a more detailed view of the components of Herakles. In section 3.1, we describe our data abstraction. The P2P network used and the distributed DSMS with its features are explained in section 3.2. Finally, the presentation on mobile devices is explained in section 3.3.

3.1 Data Abstraction

To improve reusability and flexibility, Herakles separates the sensor data into different layers. In doing so, we can avoid direct dependencies between calculated statistics and a sensor's technology and data format. An overview of Herakles' data abstraction is shown in figure 2.

Figure 2: Data abstraction layers ranging from the sensor schema up to sport-specific schemata (Raw Sensor Data, Intermediate Schema, Generic Movement Data, and sport-specific Events such as basketball- or soccer-specific events).

The sensor network sends its data as raw sensor data. Inside the P2P network, this data is converted into a common intermediate schema for standardization. This includes, for example, a unit conversion. The intermediate schema improves the interchangeability in Herakles: while the sensor's data format can change, the intermediate schema stays unchanged. When different sensors are used, only the adapter between raw sensor data and the intermediate schema has to be rewritten. Other parts of Herakles, including the continuous queries for the sport statistics, stay unchanged. In our opinion, generic movement information is interesting for more than one sport and should be reused. Therefore, on top of the intermediate schema, we differentiate between generic movement data (e.g., the player's position and running speed) and sport-specific events (e.g., shots on goal for soccer).
Sport-specific events need to be created for each sport individually (e.g., basketball and soccer). By separating generic movement data from sport-specific events, we can reuse parts of continuous queries in different sports. In ge- 74 neral, Herakles uses continuous queries built on each other to transform the data: one continuous query receives the raw sensor data, transforms it into the intermediate schema and provides the results as an artificial data source. This source can be reused from other continuous queries to determine the movement data, identifying sport-specific events etc. Another aspect of the data abstraction is the storage and the access to static values. For instance, the positions of the soccer-goals are needed in many statistics (e.g., for counting shots on goal or identifying goals itself). Another example is the identification, which sensor is attached to which player or ball. Since Herakles uses a decentralized and dynamic P2P network, a central server is not applicable. Additionally, it is not feasible to define the static values on each peer individually. Therefore, Herakles needs a decentralized way to share these static information across the network. The data abstraction of Herakles uses a so-called distributed data container (DDC). It reads information (e.g., from a file) and distributes them automatically to other peers in the network without further configuration. Then, each peer can use these values inside its own DSMS. Changes at one peer in the network are propagated to the other peers automatically. With this, the user has to define the static values only at one peer. 3.2 Data Stream Management System Herakles uses a distributed DSMS in a P2P network for processing the sensor data. Since the peers are private machines (e.g., notebooks or mobile devices), the peers are considered to be autonomous and heterogeneous [VLO09]. To use a distributed DSMS in such a P2P network with enough performance, reliability, availability and scalability for live sports analysis, multiple mechanisms must be in place. Query distribution makes it possible to use more than one peer for query processing (to share system loads). With fragmentation, the data streams can be split and processed in parallel on multiple peers. Replication increases the reliability of the system by executing a query on multiple peers at the same time. Therefore, results are available, even if a peer failure occurs [Mic14]. Recovery also reacts to peer failures in order to restore the previous state of the distributed system at run-time. Dynamic load balancing monitors the resource usage of the peers in the network at run-time and shifts continuous queries from one peer to another. In our opinion, a P2P network of DSMS instances with all features mentioned above can handle the challenges of processing and analyzing data from active sensors in the context of live sport analysis. 3.3 Presentation The coach wants an agile and lightweight device to access the data. Therefore, we chose mobile devices like smartphones or tables as front-end. These devices are connected to 75 the P2P network (e.g., via Wi-Fi) and receive the result streams of the continuous queries. The presentation of the calculated sport statistics should be fast and easy to support the coaches’ decisions during the game. Therefore, the results have to be aggregated and semantically linked to provide a simple and understandable access to information. We interviewed different coaches to identify the needed information. 
We classified this information into (1) player statistics, (2) team statistics and (3) global statistics. Player statistics provide information about a specific player (e.g., ball contacts or shots on goal). This view can be used to track the behavior of a single player in significant situations. Team statistics provide information about an entire team (e.g., ball possession or pass completion rate). An overview of the current game is provided by the global statistics view. All of these views are updated in real-time since the mobile device continuously receives the statistic values.

4 Implementation status

Currently, Herakles is under development and many functions and mechanisms are already in place. Therefore, no evaluation results are available yet. However, we tested Herakles with the analysis of a self-organized basketball game. In this section, we give a brief overview of the current state of our implementation. At first, we give a description of our sensor network and data abstraction (section 4.1). The distributed DSMS used and its relevant features are explained in section 4.2. Section 4.3 contains a description of the sport statistics used. Section 4.4 describes the presentation application, which is used by the coach to see the statistics in real-time.

4.1 Sensor networks and data abstraction

We already tested a GPS sensor network for outdoor and a Wi-Fi sensor network for indoor sports. In both sensor networks, position data is provided by mobile applications using sensors of the mobile device. That makes both sensor networks low-priced. To test the approach, we equipped 12 players and the ball with a device running the sensor application and let them compete in a short basketball game while recording the resulting data.

The DDC and the different layers of the data abstraction component are implemented. The DDC can be filled in two ways: (1) by reading a file or (2) by messages from other peers. The DDC generates messages to be sent to other DDC instances on other peers, resulting in a consistent state throughout the P2P network.

The intermediate schema for the sensor data is shown in listing 1. x, y and z have to be sent by the sensors (but can be converted if they are in a different unit, e.g., in centimeters), whereas ts can be calculated by the DSMS, as can v and a (using two subsequent elements sent by the same sensor). In our opinion, measuring the positions in millimeters and the time in microseconds is sufficient for most sport statistics. Currently, we have implemented adapters to wrap the data streams from the testing sensor networks mentioned above into our intermediate schema.

Listing 1: Intermediate schema.
sid - unique ID
ts  - timestamp [microseconds]
x   - x-position [mm]
y   - y-position [mm]
z   - z-position [mm]
v   - current absolute velocity [mm/s]
a   - current absolute acceleration [mm/s²]

4.2 Distributed Data Stream Management System

For our distributed DSMS, we use Odysseus, a highly customizable framework for creating DSMSs. Its architecture consists of easily extensible bundles, each of them encapsulating specific functions [AGG+ 12]. Odysseus provides all basic functions for data stream processing and has already been extended for distributed execution in a P2P network of heterogeneous and autonomous peers. Furthermore, mechanisms for query distribution, fragmentation and replication are already available [Mic14]. Because of its current functionality and its extensibility, we decided to use Odysseus for Herakles.
However, Odysseus did not have all the features we need to implement a reliable real-time sport analysis system. It did not support dynamic load balancing (for shifting continuous queries at run-time) and recovery. Therefore, we extended Odysseus with these mechanisms. For the dynamic load balancing, we implemented a communication protocol and a simple load balancing strategy. For recovery, we implemented a combination of active standby and upstream backup. Active standby means that a continuous query is executed multiple times on different peers (similar to replication). However, only the streaming results of one peer are used. But if that peer fails (or leaves the network on purpose), another peer with a copy of the query replaces it. With upstream backup, a peer saves the stream elements which have been sent to its subsequent peer until that peer indicates that it has processed said stream elements. If the subsequent peer fails, another peer can install the lost continuous query again and the processing can be redone with the previously saved stream elements.

4.3 Live Statistics

Analyzing the data and calculating useful statistics in terms of continuous queries is an essential part of Herakles. Typically, queries are described by a declarative query language. To avoid complex query declarations on the mobile devices, we implemented our own query language called SportsQL. It is a compact language especially designed for sport statistic queries in Herakles. If the user selects a statistic to show, the mobile application of Herakles generates a corresponding SportsQL query, which is sent to an Odysseus instance. The translation of the SportsQL query into an executable continuous query is done inside Odysseus. With this decision, the continuous queries and the mobile application are decoupled from each other: we can create and improve the complex continuous queries in Odysseus without changing the mobile application. Furthermore, any other device can also use SportsQL without changing Odysseus. An example of SportsQL is shown in listing 2.

Listing 2: Example of SportsQL for the shots on goal of a specific player with the id 8.
{
    "statisticType": "player",
    "gameType": "soccer",
    "entityId": 8,
    "name": "shotsongoal"
}

In this query, a mobile device requests a statistic about the current number of shots on goal for a specific player. The attribute name identifies the statistic to generate and the attribute gameType identifies the analyzed game (in this case soccer). The attribute statisticType differentiates between the team statistics, player statistics and global statistics mentioned in section 3.1. In this example, a player statistic is specified. Therefore, an entityId refers to the player for which this statistic should be created. It is possible to send further parameters within a SportsQL query, such as time and space parameters. This can be used to limit the query results, e.g., to a specific range of time or to a specific part of the game field.

4.4 Presentation

We implemented an extensible Android application intended for tablets and smartphones. Android has been chosen because it is common and allows the application to be run on different kinds of devices. Figure 3 shows a screenshot of our current application.

5 Future Work

In our current implementation of Herakles, we focus on soccer-specific statistics to show the proof of concept. But there are a lot of possibilities for more advanced statistics and views. Depending on the sport, we want to support additional statistics.
Consequently, we plan to enhance our mobile application, the SportsQL query language and the corresponding continuous queries in the P2P network. Herakles works with radio-based sensors reducing the costs compared to video-based systems. Therefore, it focuses on training sessions or friendly games. Nevertheless, the probability to get a license to use such a system within an official, professional match is low 78 Figure 3: Mobile application for the coach with statistics and a game topview. whereas video-based systems are more accepted. An extension could be to use a videobased system to get the player positions. Currently, we do not consider inaccuracies in sensor data streams. For example, some sensors send inaccurate data or no data at all for a certain time. Therefore, there can be anomalies in sensor data [ACFM14] and sport statistics, which should be considered in Herakles in the future. Additionally, we do not face the problems of security, which rise with the use of private P2P networks. Currently, we expect that each peer is cooperative and does not want to damage the system on purpose. But this assumption must be weakened in the future: sensor data streams have to be encrypted, peers need to be checked (e.g., web of trust [GS00]) and a distributed user management should be in place. An open issue is the lack of evaluation. Currently, we are implementing the last steps. We made a few tests with a GPS sensor network and are about to begin with the evaluation. We plan to measure how much data Herakles can process in real-time, how fast it can react to sport-events (like interruptions), how accurate the sport statistics are and how much reliability and availability the P2P network really provides. 6 Summary In professional sports, complex and expensive computer-based analysis systems are used to collect data, to setup statistics, and to compare players. Most of them are video-based and not every team has the opportunity to buy and maintain them. With Herakles, we proposed an alternative sport analysis system, which uses a decentralized and dynamic P2P network of conventional private computers. A sensor network placed on the playing field (e.g., sensors attached to the players and balls) is continuously sending position data to this P2P network. On each peer, a DSMS is installed. This collection of DSMSs is used to cooperatively process the streaming sensor data in real-time. To be independent 79 from specific sensors, we designed a data abstraction layer, which separates the sensors from our continuous queries generating the sport statistics. Herakles presents the statistics on a mobile device, where the user immediately sees those statistics during the game. Users can choose other statistics at any time and the P2P network adapts to these changes automatically. In the P2P network, we use Odysseus, a component-based framework for developing DSMSs. Despite the data stream processing, it supports continuous query distribution, replication and fragmentation. We added dynamic load balancing and recovery mechanisms to fulfill our requirements. Finally, Herakles uses Android-based mobile devices to show the calculated statistics in real-time. There are still many tasks to do: primarily, extensions to other sports and statistics. We are about to begin with the evaluation to measure performance of efficiency of Herakles. But we are confident: With Herakles, we show that it is possible to analyze sport events in real-time with commodity hardware like notebooks. 
References [ACFM14] Annalisa Appice, Anna Ciampi, Fabio Fumarola und Donato Malerba. Data Mining Techniques in Sensor Networks - Summarization, Interpolation and Surveillance. Springer Briefs in Computer Science. Springer, 2014. [AGG+ 12] H.-Jürgen Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: a highly customizable framework for creating efficient event stream management systems. DEBS ’12, Seiten 367–368. ACM, 2012. [Bau00] C.S. Baum. Sports analysis and testing system, 2000. US Patent 6,042,492. [GS00] T. Grandison und M. Sloman. A survey of trust in internet applications. Communications Surveys Tutorials, IEEE, 3(4):2–16, Fourth 2000. [LMS+ 08] Marco Leo, Nicola Mosca, Paolo Spagnolo, Pier Luigi Mazzeo, Tiziana D’Orazio und Arcangelo Distante. Real-time multiview analysis of soccer matches for understanding interactions between ball and players. In Proceedings of the 2008 international conference on Content-based image and video retrieval, Seiten 525–534. ACM, 2008. [Mic14] Timo Michelsen. Data stream processing in dynamic and decentralized peer-to-peer networks. In Proceedings of the 2014 SIGMOD PhD symposium, Seiten 1–5. ACM, 2014. [SDK+ 08] Alan F. Smeaton, Dermot Diamond, Philip Kelly, Kieran Moran, King-Tong Lau, Deirdre Morris, Niall Moyna, Noel E O’Connor und Ke Zhang. Aggregating multiple body sensors for analysis in sports. 2008. [vdGFW+ 11] Thomas von der Grün, Norbert Franke, Daniel Wolf, Nicolas Witt und Andreas Eidloth. A real-time tracking system for football match and training analysis. In Microelectronic Systems, Seiten 199–212. Springer, 2011. [VLO09] Quang Hieu Vu, Mihai Lupu und Beng Chin Ooi. Peer-to-Peer Computing: Principles and Applications. Springer, 2009. 80 Bestimmung von Datenunsicherheit in einem probabilistischen Datenstrommanagementsystem Christian Kuka SCARE-Graduiertenkolleg Universität Oldenburg D-26129 Oldenburg [email protected] Daniela Nicklas Universität Bamberg D-96047 Bamberg [email protected] Abstract: Für die kontinuierliche Verarbeitung von unsicherheitsbehafteten Daten in einem Datenstrommanagementsystem ist es notwendig das zugrunde liegende stochastische Modell der Daten zu kennen. Zu diesem Zweck existieren mehrere Ansätze, wie etwas das Erwartungswertmaximierungsverfahren oder die Kerndichteschätzung. In dieser Arbeit wird aufgezeigt, wie die genannten Verfahren in ein Datenstrommanagementsystem verwendet werden können, umso eine probabilistische Datenstromverarbeitung zu ermöglichen und wie sich die Bestimmung des stochastischen Modells auf die Latenz der Verarbeitung auswirkt. Zudem wird die Qualität der ermittelten stochastischen Modelle verglichen und aufgezeigt, welches Verfahren unter welchen Bedienungen bei der kontinuierlichen Verarbeitung von unsicherheitsbehafteten Daten am effektivsten ist. 1 Einführung Für die qualitätssensitive Verarbeitung von Sensordaten ist es notwendig die aktuelle Qualität der Daten zu kennen. Eine der hierbei häufig verwendeten Qualitätsdimensionen ist der statistische Fehler von Sensormessungen. In vielen Fällen wird hierbei die aus dem Datenblatt stammende Kennzahl für die Standardabweichung herangezogen um das stochastische Modell im Sinne einer Normalverteilung zu verwenden. Jedoch kann das Rauschen eines Sensors von vielen Kriterien abhängen und sich vor allem auch dynamisch zur Laufzeit ändern. Eine Form der Qualitätsbestimmung besteht darin, direkt das zugrunde liegende stochastische Modell der Sensormessungen kontinuierlich neu zu ermitteln. 
Vor allem im Bereich der kontinuierlichen Verarbeitung von hochfrequenten Sensordaten ist es hierbei notwendig die Speicherkapazitäten des Systems zu beachten und die Daten so schnell wie möglich zu verarbeiten. Für diese Form der Verarbeitung existiert mittlerweile eine Vielzahl von Systemen, welche unter dem Begriff Datenstrommanagementsystem zusammengefasst werden können. Im Rahmen von Datenstrommanagementsystemen hat sich für die Verarbeitung von Unsicherheiten der Begriff der probabilistischen Datenstromverarbeitung [TPD+ 12, JM07, KD09] etabliert. Ziel der Verarbeitung ist es nicht nur den reinen Messwert, sondern die zugrunde liegende Unsicherheit innerhalb der Verarbeitung in einem Datenstrommanagementsystem zu repräsentieren und zu verarbeiten, so dass der entstehende kontinuierliche 81 Ausgabestrom einer Anfrage auch immer die aktuelle Ergebnisunsicherheit enthält. Bei der Verarbeitung von Unsicherheiten kann dabei zwischen zwei Klassen unterschieden werden, der Verarbeitung von diskreten Wahrscheinlichkeitsverteilungen und der Verarbeitung von kontinuierlichen Wahrscheinlichkeitsverteilungen. Diskrete Verteilungen werden häufig dazu genutzt die Existenzunsicherheit von möglichen Welten darzustellen. Kontinuierliche Wahrscheinlichkeitsverteilungen dienen dagegen dazu, Unsicherheiten in der Sensorwahrnehmung, welche etwa durch das Messverfahren an sich oder Umwelteinflüsse induziert werden, zu beschreiben. Im Folgenden liegt der Fokus daher auf der Bestimmung von kontinuierlichen stochastischen Modellen. Die Bestimmung von stochastischen Modellen auf Basis von Datenströmen bei Filteroperationen wurde unter anderem in [ZCWQ03] behandelt. Hierbei war allerdings das Ziel, das stochastische Modell zu verwenden, um das Rauschen um einen Selektionsbereich innerhalb der Verarbeitung zu bestimmen. Ziel dieser Arbeit ist es aber das mehrdimensionale stochastische Modell der Daten selbst zu bestimmen, um eine probabilistische Verarbeitung der Daten, wie sie in [TPD+ 12] mit dem Mischtyp-Modell eingeführt wurde, zu ermöglichen. Das Modell hat den Vorteil, dass es sowohl die Unsicherheit über die Existenz einzelner Attribute, sowie auch die Unsicherheit über die Existenz ganzer Tupel repräsentieren kann. Zur Evaluation von verschiedenen Verfahren zur Bestimmung und Verarbeitung der mehrdimensionalen stochastischen Modelle wurde diese probabilistische Verarbeitung mit den Konzepten der deterministischen Verarbeitung mit Zeitintervallen aus [Krä07] kombiniert und in dem Datenstrommanagementsystem Odysseus [AGG+ 12] implementiert. 2 Verfahren zur Bestimmung von stochastischen Modellen Für die Bestimmung von mehrdimensionalen stochastischen Modellen, wie sie bei der probabilitischen Datenstromverarbeitung verwendet werden, existieren prinzipiell mehrere Möglichkeiten. Zu diesen Verfahren zählen etwa das Erwartungswertmaximierungsverfahren und die Kerndichteschätzung, welche im Folgenden näher erläutert werden. 2.1 Erwartungsmaximierungsverfahren Das Erwartungswertmaximierungsverfahren [DLR77] dient dazu die Parameter eines stochastischen Modells durch mehrere Iterationen an die Verteilung von Daten anzunähern. Hierzu wird versucht die Log-Likelihood L zwischen den zu bestimmenden Parametern und den zur Verfügung stehenden Daten in jeder Iteration t des Algorithmus zu maximieren. Als Parameter bieten sich hierfür die Parameter einer multivariaten Mischverteilung aus Gauß-Verteilungen mit Parameter θ = {wi , µi , Σi }m i=1 an. 
Eine multivariate Mischverteilung aus Gauß-Verteilungen über eine kontinuierliche Zufallsvariable X ist eine Menge von m gewichteten Gauß-Verteilungen X1, X2, . . . , Xm, wobei X die Wahrscheinlichkeitsdichtefunktion

fX(x) = Σ_{i=1}^{m} wi · fXi(x)   mit   fXi(x) = 1 / ((2π)^(k/2) · |Σi|^(1/2)) · exp(−(1/2) · (x − µi)^T Σi^(−1) (x − µi))

besitzt. Dabei gilt, dass 0 ≤ wi ≤ 1 und Σ_{i=1}^{m} wi = 1, k die Größe des Zufallsvektors ist und jede Mischverteilungskomponente Xi eine k-variate Gauß-Verteilung mit Erwartungswert µi und Kovarianz-Matrix Σi ist.

Zur Annäherung einer Gauß-Mischverteilung wird zunächst ein initiales stochastisches Modell mit m Gauß-Verteilungen bestimmt. Auf Basis des aktuellen Modells werden nun im E-Schritt die Erwartungswerte bestimmt, also die Wahrscheinlichkeiten, dass die aktuellen Werte aus dem aktuellen stochastischen Modell generiert wurden:

τij^(t) = wj^(t) · fXj(xi; θj^(t)) / Σ_{l=1}^{m} wl^(t) · fXl(xi; θl^(t)),   i = 1, . . . , n,  j = 1, . . . , m

γj^(t) = Σ_{i=1}^{n} τij^(t),   j = 1, . . . , m

Während des M-Schrittes werden die neuen Parameter für θ anhand der Ergebnisse aus dem E-Schritt bestimmt:

wj^(t+1) = γj^(t) / n,   j = 1, . . . , m

µj^(t+1) = (1 / γj^(t)) · Σ_{i=1}^{n} τij^(t) · xi,   j = 1, . . . , m

Σj^(t+1) = (1 / γj^(t)) · Σ_{i=1}^{n} τij^(t) · (xi − µj^(t+1))(xi − µj^(t+1))^T,   j = 1, . . . , m

Nach jedem EM-Schritt wird die Log-Likelihood berechnet und mit einem gegebenen Schwellwert verglichen. Ist die Differenz kleiner als der gegebene Schwellwert oder überschreitet die Anzahl der Iterationen die maximale Anzahl, werden die bestimmten Parameter für die Gewichte (w), den Erwartungswert (µ) sowie die Kovarianz-Matrix (Σ) der Mischverteilung zurückgeliefert.

2.2 Kerndichteschätzung

Im Gegensatz zum EM-Verfahren wird bei der Kerndichteschätzung (KDE) für jeden Messwert eine Komponente in einer Mischverteilung erstellt und eine Bandbreite bestimmt. Die Bandbreite dient dazu, eine Varianz-/Kovarianz-Matrix für alle Komponenten der Mischverteilung zu bilden und so das eigentliche zugrunde liegende Modell möglichst gut wiederzugeben. Zur Bestimmung der Bandbreite B haben sich mehrere Verfahren etabliert, wie etwa die Scott-Regel [Sco92]. Die Parameter der Komponenten der Mischverteilung lassen sich somit wie folgt berechnen:

wj = 1/n,   µj = xj,   Σj = Σ(x) · B

wobei Σ(x) die Varianz/Kovarianz der zugrunde liegenden Daten repräsentiert. Man sieht bereits, dass die KDE ohne mehrmalige Iterationen über die zugrunde liegenden Daten auskommt, da sowohl der Erwartungswert wie auch die Varianz/Kovarianz inkrementell bestimmt werden können.

Da bei der Kerndichteschätzung die Anzahl an Komponenten der Mischverteilung linear mit der Zahl der Messwerte steigt und das Ergebnis somit generell ungeeignet für eine Verarbeitung in einem Datenstrommanagementsystem ist, wird ein Verfahren zur Reduktion der Komponenten benötigt. In [ZCWQ03] stellen die Autoren ein Verfahren vor, welches das KDE-Verfahren auf einen eindimensionalen Strom anwendet und die dabei resultierende Mischverteilung durch ein Kompressionsverfahren auf eine geringere Anzahl von Verteilungen reduziert. Dieses Verfahren ist allerdings nicht für multivariate Verteilungen anwendbar. In [CHM12] wurden Selbstorganisierende Merkmalskarten (SOM) verwendet, um Cluster zu bilden und diese Cluster durch eine Verteilung darzustellen. SOMs haben allerdings allgemein den Nachteil, dass die Gefahr einer Überanpassung der Gewichtsvektoren besteht. Eine weitere Möglichkeit zur Reduktion der Komponenten besteht in dem Bregman Hard Clustering Verfahren [BMDG05].
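Bevor auf das Bregman Hard Clustering eingegangen wird, veranschaulicht die folgende Skizze (kein Bestandteil der Arbeit) eine minimale Kerndichteschätzung über einem Datenfenster; dabei wird angenommen, dass die Scott-Regel als skalarer Faktor B = n^(−2/(d+4)) auf die Kovarianz der Fensterdaten angewendet wird.

import numpy as np

def kde_mischverteilung(fenster):
    # fenster: Array der Form (n, d) mit den Messwerten eines Datenfensters
    x = np.asarray(fenster, dtype=float)
    n, d = x.shape
    B = n ** (-2.0 / (d + 4))              # Scott-Regel als skalarer Faktor (Annahme)
    sigma = np.cov(x, rowvar=False) * B    # Σj = Σ(x) · B für alle Komponenten
    gewichte = np.full(n, 1.0 / n)         # wj = 1/n
    mittelwerte = x.copy()                 # µj = xj
    return gewichte, mittelwerte, sigma

# Beispiel: 100 zweidimensionale Messwerte aus einem simulierten Datenfenster
w, mu, S = kde_mischverteilung(np.random.default_rng(0).normal(size=(100, 2)))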
Bei dem Bregman Hard Clustering Verfahren wird versucht, ähnliche Verteilungen innerhalb einer Mischverteilung durch die Bildung von Clustern zu vereinfachen. Hierbei werden zunächst Cluster mit je einem Repräsentanten gebildet und anschließend wird für jedes Cluster eine Minimierung ausgeführt mit dem Ziel, den Informationsverlust zwischen den Clusterzentren und den Komponenten zu minimieren. Das Verfahren kann als eine Generalisierung des Euklidischen k-Means-Verfahrens angesehen werden, wobei die Kullback-Leibler-Divergenz als Minimierungsziel verwendet wird. Um allerdings die Bestimmung des Integrals innerhalb der Kullback-Leibler-Divergenz zu umgehen, wird die Kullback-Leibler-Divergenz in eine Bregman-Divergenz umgewandelt. Die Bregman-Divergenz ist dabei definiert als:

DF(θj || θi) = F(θj) − F(θi) − ⟨θj − θi, ∇F(θi)⟩   (1)

Hierbei wird die Dichtefunktion einer Normalverteilung in die kanonische Dekomposition der jeweiligen Exponentialfamilie wie folgt umgeschrieben:

N(x; µ, σ²) = exp{⟨θ, t(x)⟩ − F(θ) + C(x)}   (2)

wobei θ = (θ1 = µ/σ², θ2 = −1/(2σ²)) die natürlichen Parameter, t(x) = (x, x²) die notwendige Statistik und F(θ) = −θ1²/(4θ2) + (1/2)·log(−π/θ2) die Log-Normalisierung für eine Normalverteilung darstellen. Unter der Bedingung, dass beide Verteilungen von der gleichen Exponentialfamilie stammen, lässt sich die Kullback-Leibler-Divergenz in die Bregman-Divergenz umformen:

KL(N(x; µi, σi²) || N(x; µj, σj²)) = DF(θj || θi)   (3)

so dass nun direkt die Bregman-Divergenz als Distanz innerhalb des k-Means-Verfahrens zur Clusterbildung angewendet werden kann.

3 Evaluation der Verfahren

Im Folgenden werden die Verfahren zur Bestimmung des stochastischen Modells der Daten eines Datenstroms hinsichtlich ihrer Latenz, aber auch hinsichtlich der Güte des stochastischen Modells evaluiert. Zu diesem Zweck wurden die Verfahren als Verarbeitungsoperatoren innerhalb des Odysseus-DSMS realisiert. Die Evaluation wurde dabei sowohl auf synthetischen Daten wie auch auf Daten aus einem Ultrabreitband-Positionierungssystem [WJKvC12] durchgeführt. Hierzu wurden 10.000 Messwerte aus einer Normalverteilung sowie aus einer logarithmischen Normalverteilung generiert, um einen Datenstrom aus Messwerten zu simulieren. Die Evaluation der Latenz und der Güte des Modells betrachtet dabei drei Szenarien mit Datenfenstern der Größe 10, 100 und 1000. Das Datenfenster definiert dabei die Anzahl an Messwerten, auf denen die Operatoren das stochastische Modell bestimmen sollen. Die Güte des Modells betrachtet das aktuell bestimmte stochastische Modell im Hinblick auf alle 10.000 Datensätze. Als Qualitätskriterium wird hierzu das Akaike-Informationskriterium (AIC) verwendet. Das AIC ist ein Maß für die relative Qualität eines stochastischen Modells für eine gegebene Datenmenge und ist definiert als:

AIC = 2k − 2 ln(L)   (4)

Der Parameter k repräsentiert hierbei die Anzahl der freien Parameter in dem stochastischen Modell und der Parameter L gibt die Likelihood zwischen dem stochastischen Modell und der gegebenen Datenmenge wieder. Dieses Informationskriterium ist für die Evaluation deshalb gut geeignet, da es sowohl die Nähe der generierten Mischverteilung aus den drei Verfahren zu den tatsächlichen Daten bewertet als auch die Anzahl der Komponenten innerhalb der Mischverteilungen in die Bewertung mit einfließen lässt.
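Die Berechnung des AIC nach Gleichung (4) lässt sich für eine Gauß-Mischverteilung etwa wie folgt skizzieren. Dies dient nur der Illustration und ist nicht der in der Arbeit verwendete Code; insbesondere ist die Zählung der freien Parameter eine Annahme für vollbesetzte Kovarianzmatrizen.

import numpy as np
from scipy.stats import multivariate_normal

def aic_mischverteilung(gewichte, mittelwerte, kovarianzen, daten):
    # AIC = 2k - 2 ln(L) nach Gleichung (4); L ist hier die Likelihood der Daten
    daten = np.asarray(daten, dtype=float)
    dichte = sum(w * multivariate_normal(mu, cov).pdf(daten)
                 for w, mu, cov in zip(gewichte, mittelwerte, kovarianzen))
    log_likelihood = np.sum(np.log(dichte))
    n, d = daten.shape
    m = len(gewichte)
    # freie Parameter (Annahme: vollbesetzte Kovarianzmatrizen):
    # (m - 1) Gewichte + m*d Erwartungswerte + m*d*(d+1)/2 Kovarianzeinträge
    k = (m - 1) + m * d + m * d * (d + 1) // 2
    return 2 * k - 2 * log_likelihood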
Die Nähe zu den tatsächlichen Daten ist wichtig für die Qualität der Verarbeitungsergebnisse, und die Komponentenanzahl der Mischverteilung hat eine Auswirkung auf die Latenz der Verarbeitung, da jede Komponente innerhalb einer Mischverteilung bei Operationen wie der Selektion oder dem Verbund mit einem Selektionskriterium bei einer probabilistischen Verarbeitung integriert werden muss. Um mögliche Ausreißer zu minimieren, wurde jede Evaluation 10-mal wiederholt. Als Testsystem diente ein Lenovo Thinkpad X240 mit Intel Core i7 und 8 GB RAM. Die verwendete Java-Laufzeitumgebung war ein OpenJDK Runtime Environment (IcedTea 2.5.2) (7u65-2.5.2-2) mit einer OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode). Bei dem Betriebssystem handelte es sich um ein Debian GNU/Linux mit einem 3.14 Kernel.

Das EM-Verfahren versucht, ein stochastisches Modell an die eingehenden Daten anzupassen. Dabei spielen neben der Datenfenstergröße die Anzahl der Iterationen, der Konvergenzschwellwert für die Veränderung der Log-Likelihood in jeder Iteration sowie die Komponentenanzahl der Mischverteilungen eine Rolle für die Latenz dieses Operators. Für die Evaluation wurde der Konvergenzschwellwert auf 0.001 gesetzt, die Anzahl an Iterationen auf 30 und die Zahl der Komponenten auf 2. Die gleiche Anzahl an Iterationen wird ebenfalls in der von V. Garcia bereitgestellten Java-Bibliothek jMEF (http://vincentfpgarcia.github.io/jMEF/) verwendet.

Abbildung 1: Latenz der Operatoren bei einem Datenfenster der Größe 100 für Daten aus einer logarithmischen Normalverteilung (Achsen: Elemente gegen Zeit in ms; Kurven: EM, Bregman, KDE)

Das KDE-Verfahren bestimmt für jeden Datenwert eine eigene Komponente in der resultierenden Mischverteilung. Der entwickelte Operator verwendet hierzu die Scott-Regel zur Bestimmung der Bandbreite der Kovarianzmatrix der Komponenten. Das Bregman Hard Clustering, welches in einem weiteren Schritt verwendet wird, um die Anzahl an Komponenten auf die gewünschte Zahl zu reduzieren, wurde mit einer maximalen Anzahl von 30 Iterationen konfiguriert. Um die Resultate vergleichbar zu halten, wurde der Operator so konfiguriert, dass er ebenfalls eine 2-komponentige Mischverteilung ermittelt, also zwei Cluster bildet. Der hier verwendete Konvergenzschwellwert für das Erwartungsmaximierungsverfahren liegt oberhalb des in der verwendeten Apache Commons Math3 Bibliothek (http://commons.apache.org/proper/commons-math) als Standardwert festgelegten Wertes von 0.00001, da sich in den Versuchen zeigte, dass bereits ein höherer Konvergenzschwellwert ausreichte, um die Verfahren hinsichtlich der Güte des stochastischen Modells und der gemessenen Latenz miteinander zu vergleichen.

3.1 Synthetische Sensordaten

Das Latenzverhalten der einzelnen Verfahren ist in Abb. 1 für Daten aus einer logarithmischen Normalverteilung für ein Datenfenster der Größe 100 dargestellt. Das EM-Verfahren weist hierbei eine gleichbleibend stabile Latenz von durchschnittlich ca. 200 Millisekunden auf.
Dies ist der mehrmaligen Iteration über die aktuell gültigen Daten zur Bestimmung der Log-Likelihood zwischen dem jeweils temporären stochastischen Modell und den Daten geschuldet. Im Gegensatz zum EM-Verfahren kann die Bandbreite bei der Kerndichteschätzung kontinuierlich bestimmt werden. Allerdings fällt auf, dass trotz mehrmaliger Wiederholung der Messung das Verfahren zum Bregman Hard Clustering eine deutlich höhere Latenz aufweist. Dieses Verhalten ist dabei unabhängig von der Art der Verteilung. Dies ist vor allem auf die Tatsache zurückzuführen, dass das Bregman Hard Clustering Verfahren in jeder Iteration die Bregman Divergenz zwischen den Clusterzentren und den einzelnen Komponenten bestimmen muss und zusätzlich noch den Zentroiden aus jedem Cluster in jeder Iteration neu ermitteln muss. Beim Vergleich der durchschnittlichen Latenz bei unterschiedlichen Größen von Datenfenstern zeigt sich, dass die Latenz des EM-Verfahrens konstant bleibt, während die Latenz des Bregman Hard Clusterings stark ansteigt.

Abbildung 2: Vergleich des AIC zwischen EM-Verfahren und KDE mit Bregman Hard Clustering bei unterschiedlichen Datensatzfenstergrößen (10, 100, 1000) für Werte aus einer Normalverteilung und einer logarithmischen Normalverteilung

Bei der Qualitätsbetrachtung des ermittelten stochastischen Modells fällt auf, dass das EM-Verfahren im Sinne des AIC bei Werten aus einer logarithmischen Normalverteilung deutlich besser abschneidet als das KDE-Verfahren in Kombination mit dem Bregman Hard Clustering. Bei Werten aus einer Normalverteilung dagegen unterscheidet sich der AIC-Wert der beiden Verfahren nur geringfügig. Ein gleiches Verhalten lässt sich auch bei Datensatzfenstern der Größe 1.000 beobachten. Ist allerdings die Anzahl an Datensätzen gering, ändert sich dieses Verhalten. Bei einem Datensatzfenster der Größe 10 zeigt sich unabhängig von dem zugrunde liegenden stochastischen Modell der Daten, dass die Kombination aus KDE und Bregman Hard Clustering das bessere Modell liefert. Zudem unterscheiden sich die Latenzen der beiden Verfahren bei dieser Datenmenge nur geringfügig.

3.2 Reale Sensordaten

Abbildung 3: Messwerte der Positionsbestimmung für die Positionen 1–8 (X-/Y-Position in mm)

Um zu zeigen, dass die Verfahren auch stochastische Modelle von echten Sensordaten erstellen können, wurden die Operatoren auf Sensordatenaufzeichnungen eines Ultrabreitband-Positionierungssystems [WJKvC12] angewendet. Insgesamt wurden 8 Positionen (vgl. Abbildung 3) bestimmt, von denen im Folgenden die Positionen 6 und 7 als repräsentative Positionen näher betrachtet werden. Hierbei wurde das stochastische Modell jeder Position mit dem EM-Verfahren und der Kombination aus KDE und Bregman Hard Clustering auf einem Datensatzfenster der Größe 10 und einem Datensatzfenster der Größe 100 bestimmt. Bei der Betrachtung der zeitlichen Bestimmung des stochastischen Modells in Abb. 4 fallen zunächst für die Position 6 anfängliche Ausreißer bei der Nähe zum Modell auf. Dies deutet auf eine anfängliche Anpassung der Positionierungsknoten der Anwendung hin.
In den darauf folgenden Messungen bleiben sowohl die Modellqualität des EM-Verfahrens, wie auch das resultierende Modell des Bregman Hard Clustering stabil. Wie bereits bei den synthetischen Daten ist auch bei realen Sensordaten das Phänomen erkennbar, dass die Kombination aus KDE mit Bregman Hard Clustering bei kleinen Datensatzfenstern im Vergleich zum EM-Verfahren bessere stochastische Modelle ermittelt. Dagegen ist bei größeren Datensatzfenstern das EM-Verfahren besser geeignet, um gute stochastische Modelle im Sinne des AIC zu bestimmen.

Abbildung 4: Qualität des stochastischen Modells über die Zeit bei einem Datensatzfenster der Größe 100 von Position 6 und 7 ((a) Position 6, (b) Position 7; Achsen: Messungen gegen AIC; Kurven: EM, Bregman)

4 Zusammenfassung und Ausblick

In dieser Arbeit wurden Verfahren zur kontinuierlichen Bestimmung des zugrunde liegenden mehrdimensionalen stochastischen Modells von Messwerten aus aktiven Datenquellen vorgestellt. Ziel ist es, diese mehrdimensionalen stochastischen Modelle in einem probabilistischen Datenstrommanagementsystem zu verarbeiten. Bei den Verfahren handelt es sich um das Erwartungsmaximierungsverfahren und die Kerndichteschätzung in Kombination mit dem Bregman Hard Clustering Ansatz. Zunächst wurden die Grundlagen der jeweiligen Verfahren aufgezeigt. Zur Repräsentation der Unsicherheiten wurde das in [Krä07] entwickelte Modell durch das Mischtyp-Modell [TPD+12] erweitert und in dem Odysseus DSMS realisiert. Bei der Evaluation der Verfahren wurde zunächst auf Basis von synthetischen Daten die Latenz der einzelnen Verfahren ermittelt. Hierbei zeigte sich, dass die Kombination aus Kerndichteschätzung und Bregman Hard Clustering aufgrund der mehrmaligen Iterationen über die Komponenten einer Mischverteilung eine wesentlich höhere Latenz als das Erwartungsmaximierungsverfahren aufweist. Zudem sind die resultierenden stochastischen Modelle im Sinne des Akaike-Informationskriteriums in den meisten Fällen schlechter als die angenäherten Modelle des Erwartungsmaximierungsverfahrens. Aus Sicht der Latenzoptimierung und angesichts der Qualität der bestimmten Modelle sollte daher das Erwartungsmaximierungsverfahren bei der Datenstromverarbeitung bevorzugt werden. Einzige Ausnahme sind Anwendungen, in denen nur geringe Mengen an Daten zur Verfügung stehen. Hier konnte die Kombination aus Kerndichteschätzung und Bregman Hard Clustering die besseren stochastischen Modelle bestimmen. Eine Evaluation auf Basis von Sensoraufzeichnungen von Ultrabreitband-Lokalisierungssensoren bestätigte die Resultate aus der Evaluation mit synthetischen Daten.

Danksagung

Die Autoren möchten Herrn Prof. Huibiao Zhu von der East China Normal University für seine Unterstützung danken. Diese Arbeit wurde durch die Deutsche Forschungsgemeinschaft im Rahmen des Graduiertenkollegs (DFG GRK 1765) SCARE (www.scare.uni-oldenburg.de) gefördert.

Literatur

[AGG+12] H.-J. Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: a highly customizable framework for creating efficient event stream management systems. In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, DEBS '12, Seiten 367–368, New York, NY, USA, 2012. ACM Press.
[BMDG05] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon und Joydeep Ghosh. Clustering with Bregman Divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[CHM12] Yuan Cao, Haibo He und Hong Man. SOMKE: Kernel density estimation over data streams by sequences of self-organizing maps. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1254–1268, 2012.
[DLR77] Arthur P. Dempster, Nan M. Laird und Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.
[JM07] T. S. Jayram und S. Muthukrishnan. Estimating statistical aggregates on probabilistic data streams. In ACM Symposium on Principles of Database Systems, Seiten 243–252, New York, NY, USA, 2007. ACM Press.
[KD09] Bhargav Kanagal und Amol Deshpande. Efficient query evaluation over temporally correlated probabilistic streams. In International Conference on Data Engineering, 2009.
[Krä07] Jürgen Krämer. Continuous Queries over Data Streams – Semantics and Implementation. Dissertation, Philipps-Universität Marburg, 2007.
[Sco92] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. 1992.
[TPD+12] Thanh T. L. Tran, Liping Peng, Yanlei Diao, Andrew McGregor und Anna Liu. CLARO: modeling and processing uncertain data streams. The VLDB Journal, 21(5):651–676, Oktober 2012.
[WJKvC12] Thorsten Wehs, Manuel Janssen, Carsten Koch und Gerd von Cölln. System architecture for data communication and localization under harsh environmental conditions in maritime automation. In Proceedings of the 10th IEEE International Conference on Industrial Informatics (INDIN), Seiten 1252–1257, Los Alamitos, CA, USA, 2012. IEEE Computer Society.
[ZCWQ03] Aoying Zhou, Zhiyuan Cai, Li Wei und Weining Qian. M-kernel merging: Towards density estimation over data streams. In 8th International Conference on Database Systems for Advanced Applications, Seiten 285–292. IEEE, 2003.

Kontinuierliche Evaluation von kollaborativen Recommender-Systemen in Datenstrommanagementsystemen – Extended Abstract –

Cornelius A. Ludmann, Marco Grawunder, Timo Michelsen, H.-Jürgen Appelrath
University of Oldenburg, Department of Computer Science
Escherweg 2, 26121 Oldenburg, Germany
{cornelius.ludmann, marco.grawunder, timo.michelsen, appelrath}@uni-oldenburg.de

Recommender-Systeme (RecSys) findet man in vielen Informationssystemen. Das Ziel eines RecSys ist es, das Interesse eines Benutzers an bestimmten Objekten (engl. item) vorherzusagen, um aus einer großen Menge an Objekten diejenigen dem Benutzer zu empfehlen, für die das vorhergesagte Interesse des Benutzers am größten ist. Die zu empfehlenden Objekte können zum Beispiel Produkte, Filme/Videos, Musikstücke, Dokumente, Points of Interest etc. sein. Das Interesse eines Benutzers an einem Objekt wird durch eine Bewertung (engl. rating) quantifiziert. Die Bewertung kann explizit durch den Benutzer angegeben (der Benutzer wird dazu aufgefordert, ein bestimmtes Objekt zu bewerten) oder implizit vom Verhalten des Benutzers abgeleitet werden (im einfachsten Fall durch eine binäre Bewertung: Objekt genutzt vs. nicht genutzt). Mit Methoden des maschinellen Lernens wird aus bekannten Bewertungen ein Modell trainiert, das unbekannte Bewertungen vorhersagen kann. Für die Bestimmung der Empfehlungsmenge werden die Bewertungen aller unbewerteten Objekte für einen Benutzer vorhergesagt und die bestbewerteten Objekte empfohlen. Im realen Einsatz eines RecSys entstehen kontinuierlich neue Bewertungen, die bei der Integration in das Modell die Vorhersagen für alle Benutzer verbessern können.
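Wie ein solches Modell bei kontinuierlich eintreffenden Bewertungen fortgeschrieben werden könnte, deutet die folgende, stark vereinfachte Skizze an (Annahmen: Matrixfaktorisierung mit stochastischem Gradientenabstieg; Faktordimension, Lernrate und Regularisierung sind frei gewählte Beispielwerte und nicht Teil des im Folgenden beschriebenen Aufbaus).

import numpy as np
from collections import defaultdict

RANK, LR, REG = 10, 0.01, 0.02                            # Beispielwerte (Annahmen)
rng = np.random.default_rng(0)
user_f = defaultdict(lambda: rng.normal(0, 0.1, RANK))    # latente Benutzerfaktoren
item_f = defaultdict(lambda: rng.normal(0, 0.1, RANK))    # latente Objektfaktoren

def predict(u, i):
    # Vorhergesagte Bewertung als Skalarprodukt der latenten Faktoren
    return float(user_f[u] @ item_f[i])

def update(u, i, r):
    # Ein SGD-Schritt pro eintreffendem Bewertungstupel (u, i, r)
    err = r - predict(u, i)
    p, q = user_f[u], item_f[i]
    user_f[u] = p + LR * (err * q - REG * p)
    item_f[i] = q + LR * (err * p - REG * q)

# Ein Strom aus Bewertungstupeln aktualisiert das Modell fortlaufend.
for u, i, r in [("u1", "film_a", 4.0), ("u1", "film_b", 2.0), ("u2", "film_a", 5.0)]:
    update(u, i, r)
print(predict("u2", "film_b"))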
Betrachtet man die Bewertungs- und Kontextdaten nicht als statische Lerndaten, sondern als kontinuierlich auftretende, zeitannotierte Events eines potenziell unendlichen Datenstroms, entspricht das eher der Situation, in der ein RecSys produktiv eingesetzt wird. Zur Umsetzung eines RecSys schlagen wir die Erweiterung eines Datenstrommanagementsystems (DSMS) vor, welches mit Hilfe von Datenstrom-Operatoren und kontinuierlichen Abfrageplänen Datenströme mit Lerndaten zum kontinuierlichen Lernen eines Modells nutzt. Die Anwendung des Modells zur Bestimmung der Empfehlungsmenge wird ebenso durch Events eines Datenstroms ausgelöst (zum Beispiel durch die Anzeige in einer Benutzerapplikation).

Integriert man ein RecSys in ein DSMS, so stellt sich die Frage, wie verschiedene Ansätze bzw. Implementierungen evaluiert und verglichen werden können. Bei der Evaluation werden die Daten i. d. R. in Lern- und Testdaten aufgeteilt. Die Evaluation von RecSys, bei denen der zeitliche Zusammenhang der Daten eine Rolle spielt, hat besondere Anforderungen: Die zeitliche Reihenfolge der Lerndaten muss erhalten bleiben und es dürfen keine Testdaten zur Evaluation genutzt werden, die zeitlich vor den genutzten Lerndaten liegen. Um den zeitlichen Verlauf zu berücksichtigen, wird ein Datenstrommodell genutzt, welches die Nutzdaten mit Zeitstempeln annotiert.

Zur Evaluierung eines DSMS-basierten RecSys schlagen wir den in Abbildung 1 dargestellten Aufbau vor. Als Eingabe erhält das DSMS als Bewertungsdaten die Tupel (u, i, r)_t mit der Bewertung r des Benutzers u für das Objekt i zum Zeitpunkt t sowie Anfragen für Empfehlungen für den Benutzer u zum Zeitpunkt t. Der Evaluationsaufbau gliedert sich grob in drei Teile: Continuous Learning nutzt die Lerndaten zum maschinellen Lernen des RecSys-Modells. Continuous Recommending wendet zu jeder Empfehlungsanfrage das Modell an, um für den entsprechenden Benutzer eine Empfehlungsmenge auszugeben. Continuous Evaluation teilt die Bewertungsdaten in Lern- und Testdaten auf und nutzt das gelernte Modell zur Evaluation. Dazu wird die Bewertung für ein Testtupel vorhergesagt und die Vorhersage mit der wahren Bewertung verglichen.

Abbildung 1: Aufbau eines RecSys mit kontinuierlicher Evaluation (Teilbereiche Continuous Learning, Continuous Recommending und Continuous Evaluating mit Operatoren wie route, window, train_recsys_model, get_unrated_items, predict_rating, test_prediction und recommend sowie einer Feedback-Schleife)

Dieser Aufbau wurde mit dem DSMS Odysseus [AGG+12] prototypisch umgesetzt und die Evaluation mit dem MovieLens-Datensatz (http://grouplens.org/datasets/movielens/) durchgeführt. Als nächste Schritte sollen weitere Evaluationsmethoden mit diesem Aufbau umgesetzt sowie Algorithmen zum maschinellen Lernen für RecSys für den Einsatz in einem DSMS optimiert werden.

Literatur

[AGG+12] H.-Jürgen Appelrath, Dennis Geesen, Marco Grawunder, Timo Michelsen und Daniela Nicklas. Odysseus: A Highly Customizable Framework for Creating Efficient Event Stream Management Systems. In DEBS'12, Seiten 367–368. ACM, 2012.

Using Data-Stream and Complex-Event Processing to Identify Activities of Bats
Extended Abstract

Sebastian Herbst, Johannes Tenschert, Klaus Meyer-Wegener
Data Management Group, FAU Erlangen-Nürnberg, Erlangen, Germany
E-Mail <firstname>.<lastname>@fau.de

1 Background and Motivation

Traditional tracking of bats uses telemetry [ADMW09]. This is very laborious for the biologists: At least two of them must run through the forest to get a good triangulation.
And, only one bat at a time can be tracked with this method. The Collaborative Research Center 1508 of the German Science Foundation (DFG) has been established to develop more sophisticated sensor nodes for the bats to carry (the sensor nodes are glued to the neck of the bats and fall off after two to four weeks). These mobile nodes must not be heavier than the telemetry senders used so far, but they offer much more processing capacity in addition to sending a beacon with an ID. Ground nodes receive the signals transmitted by the mobile nodes. They are also sensor nodes that run on batteries. Currently, all their detections are forwarded to a central base station, where a localization method integrates them into a position estimation for each bat [NKD+15]. The base station is a standard computer with sufficient power and energy. In the future, some parts of the localization may already be done on the ground nodes to reduce the data transmission and thus save some energy.

Output of the localization method is a position stream. Each element of this stream contains a timestamp, a bat ID, and x and y coordinates. While the biologists would like to have a z coordinate as well, it cannot be provided at the moment because of technological restrictions. Plans are to include it in the future. While filtering and smoothing have already been done, the precision of the coordinate values depends on the localization method used. It may only be in the order of meters to tens of meters, or (with much more effort) in the range of decimeters. Furthermore, some positions may be missing in the stream. A bat may be temporarily out of the range of the ground nodes, or its mobile node may be switched off to save energy. Hence, the subsequent processing must be robust with respect to imprecise position values.

2 Goals and Challenges

The goal of this work is to investigate the use of data-stream processing (DSP) and complex-event processing (CEP) to extract information meaningful for biologists from the position stream described. Biologists are interested in patterns of bat behavior, which are known to some extent, but have not been observed over longer periods of time, and have not been correlated in time for a group of bats. The elementary parts of the patterns can be expressed in terms of flight trajectories, which are a special case of semantic trajectories [PSR+13]. Biologists, however, are more interested in bat activities indicated by sequences of trajectories. Our idea is to identify these activities with the near-real-time processing that a combination of DSP and CEP can provide. This gives information on current activities earlier to the biologists, so they could go out and check by themselves what is happening. Also, they can experiment with the bats by providing extra food or emitting sounds. Furthermore, the mobile nodes on the bats can be configured to some extent, as can be the ground nodes. So when a particular behavior is reported to the biologists, they can adjust the localization method, e. g. switch to higher precision, even if that costs more energy.

In order to reach that goal, a first set of DSP queries and CEP rules has been defined. Activities are managed as objects for each bat. A change of activity is triggered by events and is recorded as an update of this object. Current activities as well as the previous activities of each bat are displayed for the biologists.
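As an illustration of the kind of rule involved, the following sketch derives a coarse activity from consecutive position-stream elements via a simple speed threshold. The threshold value, the activity labels, and the stream representation are simplifying assumptions for illustration only; they are not the actual DSP queries or CEP rules of the project.

from dataclasses import dataclass

SPEED_THRESHOLD = 2.0          # m/s, assumed example value

@dataclass
class Position:
    ts: float                  # timestamp in seconds
    bat_id: str
    x: float                   # coordinates in metres
    y: float

last_seen: dict = {}           # bat id -> last Position
activity: dict = {}            # bat id -> current activity label

def on_position(p: Position) -> None:
    # Update the per-bat activity object when a new position element arrives.
    prev = last_seen.get(p.bat_id)
    last_seen[p.bat_id] = p
    if prev is None or p.ts <= prev.ts:
        return                 # first observation or out-of-order element
    speed = ((p.x - prev.x) ** 2 + (p.y - prev.y) ** 2) ** 0.5 / (p.ts - prev.ts)
    new_act = "commuting" if speed > SPEED_THRESHOLD else "foraging"
    if activity.get(p.bat_id) != new_act:
        activity[p.bat_id] = new_act          # activity change triggered by the event
        print(p.ts, p.bat_id, "->", new_act)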
Ongoing work evaluates the activity detection by comparing the output with the known activities of bats in a simulation tool. It has already helped to adjust the substantial number of parameter values used in the DSP queries and CEP rules.

Acknowledgments. This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) under the grant of FOR 1508 for sub-project no. 3.

References

[ADMW09] Sybill K. Amelon, David C. Dalton, Joshua J. Millspaugh, and Sandy A. Wolf. Radiotelemetry; techniques and analysis. In Thomas H. Kunz and Stuart Parsons, editors, Ecological and behavioral methods for the study of bats, pages 57–77. Johns Hopkins University Press, Baltimore, 2009.
[NKD+15] Thorsten Nowak, Alexander Koelpin, Falko Dressler, Markus Hartmann, Lucila Patino, and Joern Thielecke. Combined Localization and Data Transmission in Energy-Constrained Wireless Sensor Networks. In Wireless Sensors and Sensor Networks (WiSNet), 2015 IEEE Topical Conference on, Jan 2015.
[PSR+13] Christine Parent, Stefano Spaccapietra, Chiara Renso, Gennady Andrienko, Natalia Andrienko, Vania Bogorny, Maria Luisa Damiani, Aris Gkoulalas-Divanis, Jose Macedo, Nikos Pelekis, Yannis Theodoridis, and Zhixian Yan. Semantic Trajectories Modeling and Analysis. ACM Computing Surveys, 45(4), August 2013. Article 42.

Streaming Analysis of Information Diffusion
Extended Abstract

Peter M. Fischer, Io Taxidou
Univ. of Freiburg, CS Department, 79110 Freiburg, Germany
{peter.fischer,taxidou}@informatik.uni-freiburg.de

1 Background and Motivation

Modern social media like Twitter or Facebook encompass a significant and growing share of the population, which actively uses them to create, share and exchange messages. This has a particularly profound effect on the way news and events are spreading. Given the relevance of social media both as a sensor of the real world (e.g., news detection) and its impact on the real world (e.g., shitstorms), there has been significant work on fast, scalable and thorough analyses, with a special emphasis on trend detection, event detection and sentiment analysis. To understand the relevance and trustworthiness of social media messages, deeper insights into Information Diffusion are needed: where and by whom a particular piece of information has been created, how it has been propagated and whom it may have influenced. Information diffusion has been a very active field of research, as recently described in a SIGMOD Record survey [GHFZ13]. The focus has been on developing models of information diffusion and targeted empirical studies. Given the complexity of most of these models, nearly all of the investigations have been performed on relatively small data sets, offline and in ad-hoc setups. Despite all this work, there is little effort to tackle the problem of real-time evaluation of information diffusion, which is needed to assess the relevance. These analyses need to deal with Volume and Velocity on both messages and social graphs. The combination of message streams and social graphs is scarcely investigated, while incomplete data and complex models make reliable results hard to achieve. Existing systems do not handle the challenges: Graph computation systems neither address the fast change rates nor the real-time interaction between the graph and other processing, while data stream systems fall short on the combination of streams and complex graphs.
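A minimal sketch of the stream-graph correlation referred to above: each incoming retweet is attributed to the most recent earlier poster of the same message whom the retweeting user follows. The stream format and the tiny in-memory follower sets are simplifying assumptions; this is not the reconstruction algorithm of [TF14].

from collections import defaultdict

followees_of = {                       # assumed toy social-graph fragment: user -> followees
    "bob": {"alice"},
    "carol": {"alice", "bob"},
}
recent_posters = defaultdict(list)     # message id -> [(timestamp, user), ...]

def on_message(ts, user, msg_id, is_retweet):
    # Correlate one stream element with the graph to estimate an influence edge.
    influencer = None
    if is_retweet:
        followees = followees_of.get(user, set())
        for prev_ts, prev_user in reversed(recent_posters[msg_id]):
            if prev_ts < ts and prev_user in followees:
                influencer = prev_user        # most recent earlier poster the user follows
                break
    recent_posters[msg_id].append((ts, user))
    return influencer

print(on_message(1.0, "alice", "m1", False))  # original post -> None
print(on_message(2.0, "bob", "m1", True))     # attributed to alice
print(on_message(3.0, "carol", "m1", True))   # attributed to bob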
2 Goals and Challenges

The goal of our research is to develop algorithms and systems to trace the spreading of information in social media that produce large-scale, rapid data. We identified three crucial building blocks for such a real-time tracing system:

1) Algorithms and systems to perform the tracing and influence assignment, in order to deliver the paths along which information most likely propagated
2) Classification of user roles, to provide support for assessing their impact on the information diffusion process
3) Predictions on the spreading rate, in order to allow estimations of information diffusion lifetime and of how representative evaluations on the current state will be.

Our first task is to design, implement and evaluate algorithms and systems that can trace information spreading and assign influence at global scale while producing the results in real-time, matching the volumes and rates of social media. This requires a correlation between the message stream and the social graph. While we already showed that real-time reconstruction of retweets is feasible when the social graph fragment is locally accessible [TF14], real-life social graphs contain hundreds of millions of users, which requires distributed storage and operation. Our approach keeps track of the (past) interactions and drives the partitioning on the communities that exist in this interaction graph. Additionally, since the information available for reconstruction is incomplete, either from lack of social graph information or from API limitations, we aim to develop and evaluate methods that infer missing path information in a low-overhead manner. In contrast to existing, model-based approaches, we rely on a lightweight, neighborhood-based approach.

Access to diffusion paths enables a broad range of analyses of the information cascades. Given this broad range, we are specifically focusing on features that provide the baselines for supporting relevance and trustworthiness, namely the identification of prominent user roles such as opinion leaders or bridges. An important aspect includes the interactions and connections among users that lead towards identifying prominent user roles. Our approach will rely on stream-aware clustering instead of fixed roles over limited data.

The process of information spreading varies significantly in speed and duration: most cascades end after a short period, others are quickly spreading for a short time, while yet other groups see multiple peaks of activity or stay active for longer periods of time. Understanding how long such a diffusion continues provides important insights on how relevant a piece of information is and how complete its observation is. Virality predictions from the start of a cascade are hard to achieve, while incremental, lightweight forecasts are more feasible. New observations can then be used to update and extend this forecast, incorporating temporal as well as structural features.

References

[GHFZ13] Adrien Guille, Hakim Hacid, Cécile Favre, and Djamel A. Zighed. Information diffusion in online social networks: a survey. SIGMOD Record, 42(2):17–28, 2013.
[TF14] Io Taxidou and Peter M. Fischer. Online Analysis of Information Diffusion in Twitter. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '14 Companion, 2014.
Towards a Framework for Sensor-based Research and Development Platform for Critical, Socio-technical Systems

Henrik Surm, Daniela Nicklas
Faculty of Information Systems and Applied Computer Sciences, University of Bamberg, Bamberg, [email protected]
Department for Computing Science, University Oldenburg, Oldenburg, [email protected]

1 Motivation

The complexity of critical systems, in which failures can either endanger human life or cause drastic economic losses, has dramatically increased over the last decades. More and more critical systems evolve into so-called "socio-technical systems": humans are integrated by providing and assessing information and by making decisions in otherwise semi-autonomous systems. Such systems depend heavily on situational awareness, which is obtained by processing data from multiple, heterogeneous sensors: raw sensor data is cleansed and filtered to obtain features or events, which are combined, enriched and interpreted to reach higher semantic levels, relevant for system decisions. Existing systems which use sensor data fusion often require an a-priori configuration which cannot be changed while the system is running, or require a human intervention to adapt to changed sensor sources, and provide no run-time extensibility for new sensor types [HT08]. In addition, analysis of data quality or query plan reliability is often not possible, and management of recorded data frequently is done by hand. Our goal is to support the research, development, evaluation, and demonstration of such systems. Thus, in the proposed talk we analyze requirements and challenges for the data management of sensor-based research environments, and we propose a data-stream-based architecture which fulfills these requirements.

2 Challenges and Requirements

To support the research, development, test, and demonstration of sensor-based, critical applications, we plan to address the following challenges and requirements:

Changing sensor configurations: during research and development, the sensor configuration might change often. In addition, when sensor data is delivered from moving objects, the available sensor sources may change during runtime.
Information quality: since sensors do not deliver exact data, quality aspects need to be considered at all levels of processing.
Validation of data management: since the sensor data processing is a vital part of the system, it should be validatable and deterministic.
Reproducibility of experiments and tests, intelligent archiving: in the process of research, development and test, application sensor data and additional data for ground truth (e.g., video streams) need to be archived and replayable.
Integration with simulation environments: before such systems are deployed in the real world, they need to be modeled and analyzed in various simulation tools. The transition from pure simulation to pure real-world execution should be easy.

In the project, we plan to address these challenges by a modular, extensible and comprehensive framework. While these challenges are similar to other data analysis and integration scenarios, they have to be addressed under the specific constraints of limited maritime communication channels and data standards. The main contribution will be a combined solution that addresses these challenges in a unified framework. This is why we will base our work on the Odysseus [AGGM12] framework: it has a clear formal foundation based on [K07] and is designed for extensibility, using the OSGi service platform.
Odysseus offers bundles for data integration, mining, storage and data quality representation, and allows installation and updates of bundles without restarting the system, which leads to a high flexibility and run-time adaptability. The framework will offer a unified sensor access and data fusion approach which allows a flexible use of the platform for the transition from simulation to real-world applications, reducing the expenditure of time for research and development.

3 Outlook

The framework will be used in two scenarios: a research port for the analysis of sensor-based support for port navigation, and driver state observation for cooperative, semi-autonomous driving applications. Further, we plan to explore that approach in other projects, covering cooperative e-navigation and smart city applications.

References

[AGGM12] APPELRATH, H.-JÜRGEN ; GEESEN, DENNIS ; GRAWUNDER, MARCO ; MICHELSEN, TIMO ; NICKLAS, DANIELA: Odysseus: a highly customizable framework for creating efficient event stream management systems. In: Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, DEBS '12. New York, NY, USA : ACM, 2012 — ISBN 978-1-4503-1315-5, S. 367–368
[HT08] HE, YINGJIE ; TULLY, ALAN: Query processing for mobile wireless sensor networks: State-of-the-art and research challenges. In: Third International Symposium on Wireless Pervasive Computing (ISWPC '08), S. 518–523. IEEE Computer Society 2008 — ISBN 978-1-4244-1652-3
[K07] KRÄMER, JÜRGEN: Continuous Queries over Data Streams – Semantics and Implementation. Dissertation, Universität Marburg, 2007

Dataflow Programming for Big Engineering Data
– extended abstract –

Felix Beier, Kai-Uwe Sattler, Christoph Dinh, Daniel Baumgarten
Technische Universität Ilmenau, Germany
{first.last}@tu-ilmenau.de

Nowadays, advanced sensing technologies are used in many scientific and engineering disciplines, e. g., in medical or industrial applications, enabling the usage of data-driven techniques to derive models. Measurements are collected, filtered, aggregated, and processed in a complex analytic pipeline, joining them with static models to perform high-level tasks like machine learning. Final results are usually visualized for gaining insights directly from the data, which in turn can be used to adapt the processes and their analyses iteratively to refine knowledge further. This task is supported by tools like R or MATLAB, which allow analytic pipelines to be developed quickly. However, they offer limited capabilities for processing very large data sets that require data management and processing in distributed environments – tasks that have been analyzed extensively in the context of database and data stream management systems. Although the latter provide very good abstraction layers for data storage, processing, and underlying hardware, they require a complex setup, provide only limited extensibility, and hence are hardly used in scientific or engineering applications [ABB+12]. As a consequence, many tools are developed, comprising optimized algorithms for specialized tasks, but burdening developers with the implementation of low-level data management code, usually in a language that is not common in their community. In this context, we analyzed the source localization problem for EEG/MEG signals (which can be used, e. g., to develop therapies for stroke patients) in order to develop an approach for bridging this gap between engineering applications and large-scale data management systems.
The source localization problem is challenging, since the problem is ill-posed and the signal-to-noise ratio (SNR) is very low. Another challenging problem is the computational complexity of inverse algorithms. While large data volumes (brain models and high sampling rates) need to be processed, low latency constraints must be met because interactions with the probands are necessary. The analytic processing chain is illustrated in Fig. 1.

Figure 1: Overview Source Localization Processing Chain

The Recursively Applied and Projected Multiple Signal Classification (RAP-MUSIC) algorithm is used for locating neural sources, i. e., activity inside a brain corresponding to a specific input. To this end, 366 MEG/EEG sensors are placed above the head which are continuously sampled at rates of 600–1250 Hz. The forward solution of the boundary element model (BEM) of the brain at uniformly distributed locations on the white matter surface is passed as second input. It is constructed once from a magnetic resonance imaging (MRI) scan and, depending on the requested accuracy, comprises tens of thousands of vertices representing different locations on the surface. RAP-MUSIC recursively identifies active neural regions with a complex pipeline for preprocessing signal measures and correlating them with the BEM. To meet the latency constraints, the RAP-MUSIC algorithm has been parallelized for GPUs, and a C++ library has been created in an analysis tool called MNE-CPP [DLS+13], including parsers for data formats used by vendors of medical sensing equipment, signal filter operators, transformation routines, etc. Although this library can be used to create larger analysis pipelines, implementing and evaluating new algorithms still requires a lot of low-level boilerplate code to be written, leading to significant development overheads. The latter can be avoided by using domain-specific languages which are specialized for signal processing, natively working on vectors or matrices as first-class data types, like MATLAB. But they offer less control over memory management and parallelization for custom algorithms, which are crucial for meeting the latency constraints. When large-scale data sets are to be processed, even a specialized tool quickly runs into performance problems, as distributed processing is mostly not supported because of large development overheads for cluster-scale algorithms.

To handle these problems, we propose to apply dataflow programming here. We implemented a multi-layered framework which allows analytic programs to be defined by an abstract flow of data, independent from its actual execution. This enables quick prototyping while letting the framework handle data management and parallelization issues. Similar to Pig for batch-oriented MapReduce jobs, a scripting language for stream-oriented processing, called PipeFlow, is provided as front-end. In the current version, PipeFlow allows primitives to be injected for partitioning dataflows, executing sub-flows in parallel on cluster nodes leveraging multi-core CPUs, and merging partial results. We plan to automatically parallelize flows in the future using static code analysis and a rule-based framework for exploiting domain-specific knowledge about data processing operators. The dataflow programs are optimized by applying graph rewriting rules, and code for an underlying execution backend is generated.
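The partition/parallel/merge primitives mentioned above can be pictured with a small, generic sketch in plain Python. This is deliberately not PipeFlow syntax; the operator functions and the toy sensor matrix are illustrative assumptions only.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def partition(samples, n_parts):
    # Split an incoming block of sensor samples into disjoint partitions.
    return np.array_split(samples, n_parts)

def sub_flow(block):
    # Per-partition sub-flow: naive baseline correction plus a partial aggregate.
    centered = block - block.mean(axis=0)
    return centered.T @ centered

def merge(partials):
    # Merge the partial results produced by the parallel sub-flows.
    return sum(partials)

if __name__ == "__main__":
    samples = np.random.default_rng(0).normal(size=(1200, 8))   # toy sensor matrix
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(sub_flow, partition(samples, 4)))
    print(merge(partials).shape)                                 # combined (8, 8) result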
As execution backend, the framework provides an engine called PipeFabric which offers a large C++ library of operator implementations with a focus on low-latency processing. One key aspect of PipeFabric is its extensibility for complex user-defined types and operations. Simple wrappers are sufficient to embed already existing domain-specific libraries. For our use case, processing of large matrices is required. Therefore, we used the Eigen library and are currently porting functions from MNE-CPP. We are also working on code generators for other back-ends like Spark, which will be useful for comparing capabilities of different frameworks for common analytic workloads – which, to the best of our knowledge, has not been done yet.

References

[ABB+12] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: efficient query execution on raw data files. In ACM SIGMOD, 2012.
[DLS+13] C. Dinh, M. Luessi, L. Sun, J. Haueisen, and M. S. Hämäläinen. MNE-X: MEG/EEG Real-Time Acquisition, Real-Time Processing, and Real-Time Source Localization Framework. Biomedical Engineering/Biomedizinische Technik, 2013.

Joint Workshop on Data Management for Science

Sebastian Dorok (1,5), Birgitta König-Ries (2), Matthias Lange (3), Erhard Rahm (4), Gunter Saake (5), Bernhard Seeger (6)
1 Bayer Pharma AG
2 Friedrich Schiller University Jena
3 Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben
4 University of Leipzig
5 Otto von Guericke University Magdeburg
6 Philipps University Marburg

Message from the chairs

The Workshop on Data Management for Science (DMS) is a joint workshop consisting of the two workshops Data Management for Life Sciences (DMforLS) and Big Data in Science (BigDS). BigDS focuses on addressing big data challenges in various scientific disciplines. In this context, DMforLS focuses especially on life sciences. In the following, we give short excerpts of the call for papers of both workshops:

Data Management for Life Sciences
In life sciences, scientists collect an increasing amount of data that must be stored, integrated, processed, and analyzed efficiently to make effective use of them. Thereby, not only the huge volume of available data raises challenges regarding storage space and analysis throughput, but also data quality issues, incomplete semantic annotation, long-term preservation, data access, and compliance issues, such as data provenance, make it hard to handle life science data. To address these challenges, advanced data management techniques and standards are required. Otherwise, the use of life science data will be limited. Thereby, one question is whether general-purpose techniques and methods for data management are suitable for life science use cases or whether specialized solutions tailored to life science applications must be developed.

Big Data in Science
The volume and diversity of available data has dramatically increased in almost all scientific disciplines over the last decade, e.g. in meteorology, genomics, complex physics simulations and biological and environmental research. This development is due to great advances in data acquisition (e.g. improvements in remote sensing) and data accessibility. On the one hand, the availability of such data masses leads to a rethinking in scientific disciplines on how to extract useful information and on how to foster research. On the other hand, researchers feel lost in the data masses because appropriate data management tools have not been available so far.
However, this is starting to change with the recent development of big data technologies that seem to be not only useful in business, but also offer great opportunities in science. 105 The joint workshop DMS brings together database researchers with scientists from various disciplines especially life sciences to discuss current findings, challenges, and opportunities of applying data management techniques and methods in data-intensive sciences. The joint workshop is held for the first time in conjunction with the 16th Conference on Database Systems, Technology, and Web (BTW 2015) at the University of Hamburg on March 03, 2015. The contributions were reviewed by three to four members of the respective program committee. Based on the reviews, we selected eight contributions for presentation at the joint workshop. We assigned each contribution to one of three different sessions covering different main topics. The first session comprises contributions related to information retrieval. The contribution Ontology-based retrieval of scientific data in LIFE by Uciteli and Kirsten presents an approach that utilizes ontologies to facilitate query formulation. Colmsee et al. make also use of ontologies, but use them for improving search results. In Improving search results in life science by recommendations based on semantic information, they describe and evaluate their approach that uses document similarities based on semantic information. To improve performance of sampling analyses using MapReduce, Schäfer et al. present an incremental approach. In Sampling with incremental MapReduce, the authors describe a way to limit data processing to updated data. In the next session, we consolidate contributions dealing with data provenance. In his position paper METIS in PArADISE, Heuer examines the importance of data provenance in the evaluation of sensor data, especially in assistance systems. In their contribution Extracting reproducible simulation studies from model repositories using the COMBINE archive toolkit, Scharm and Waltemath deal with reproducible simulation studies. The last session covers the topic data analysis. In Genome sequence analysis with MonetDB: a case study on Ebola virus diversity, Cijvat et al. present a case study on genome analysis using a relational main-memory database system as platform. In RightInsight: Open source architecture for data science, Bulut presents an approach based on Apache Spark to conduct general data analyses. In contrast, Authmann et al. focus on challenges in spatial applications and suggest an architecture to address them in their paper Rethinking spatial processing in data-intensive science. We are deeply grateful to everyone who made this workshop possible – the authors, the reviewers, the BTW team, and all participants. 
Program chairs Data Management for Life Sciences Gunter Saake (Otto von Guericke University Magdeburg) Uwe Scholz (IPK Gatersleben) Big Data in Science Birgitta König-Ries (Friedrich Schiller University Jena) Erhard Rahm (University of Leipzig) Bernhard Seeger (Philipps University Marburg) 106 Program committee Data Management for Life Sciences Sebastian Breß (TU Dortmund) Sebastian Dorok (Otto von Guericke University Magdeburg) Mourad Elloumi (University of Tunis El Manar, Tunisia) Ralf Hofestädt (Bielefeld University) Andreas Keller (Saarland University, University Hospital) Jacob Köhler (DOW AgroSciences, USA) Matthias Lange (IPK Gatersleben) Horstfried Läpple (Bayer HealthCare AG) Ulf Leser (Humboldt-Universität zu Berlin) Wolfgang Müller (HITS GmbH) Erhard Rahm (University of Leipzig) Can Türker (ETH Zürich, Switzerland) Big Data in Science Alsayed Algergawy (Friedrich Schiller University Jena) Peter Baumann (Jacobs Universität) Matthias Bräger (CERN) Thomas Brinkhoff (FH Oldenburg) Michael Diepenbroeck (Alfred-Wegner-Institut) Christoph Freytag (Humboldt Universität) Michael Gertz (Uni Heidelberg) Frank-Oliver Glöckner (MPI für Marine Mikrobiologie) Anton Güntsch (Botanischer Garten und Botanisches Museum, Berlin-Dahlem) Thomas Heinis (Imperial College, London) Thomas Hickler (Senckenberg) Jens Kattge (MPI für Biogeochemie) Alfons Kemper (TU München) Meike Klettke (Uni Rostock) Alex Markowetz (Uni Bonn) Thomas Nauss (Uni Marburg) Jens Nieschulze (Forschungsreferat für Datenmanagement der Uni Göttingen) Kai-Uwe Sattler (TU Ilmenau) Stefanie Scherzinger (OTH Regensburg) Myra Spiliopoulou (Uni Magdeburg) Uta Störl (HS Darmstadt) 107 Ontology-based Retrieval of Scientific Data in LIFE Alexandr Uciteli1,2, Toralf Kirsten2,3 1 Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig 2 LIFE Research Centre for Civilization Diseases, University of Leipzig 3 Interdisciplinary Centre for Bioinformatics, University of Leipzig Abstract: LIFE is an epidemiological study determining thousands of Leipzig inhabitants with a wide spectrum of interviews, questionnaires, and medical investigations. The heterogeneous data are centrally integrated into a research database and are analyzed by specific analysis projects. To semantically describe the large set of data, we have developed an ontological framework. Applicants of analysis projects and other interested people can use the LIFE Investigation Ontology (LIO) as central part of the framework to get insights, which kind of data is collected in LIFE. Moreover, we use the framework to generate queries over the collected scientific data in order to retrieve data as requested by each analysis project. A query generator transforms the ontological specifications using LIO to database queries which are implemented as project-specific database views. Since the requested data is typically complex, a manual query specification would be very timeconsuming, error-prone, and is, therefore, unsuitable in this large project. We present the approach, overview LIO and show query formulation and transformation. Our approach runs in production mode for two years in LIFE. 1 Introduction Epidemiological projects study the distribution, the causes and the consequences of health-related states and events in defined populations. The goal of such projects is to identify risk factors of (selected) diseases in order to establish and to optimize a preventive healthcare. 
LIFE is an epidemiological and multi-cohort study in the described context at the Leipzig Research Centre for Civilization Diseases (Univ. of Leipzig). The goal of LIFE is to determine the prevalence and causes of common civilization diseases including adiposity, depression, and dementia by examining thousands of Leipzig (Germany) inhabitants of different ages. Participants include pregnant women and children from 0 to 18 years as well as adults in separate cohorts. All participants are examined in a program of possibly several days with a selection out of currently more than 700 assessments. The assessments range from interviews and self-completed questionnaires to physical examinations, such as anthropometry, EKG and MRT, and laboratory analyses of taken specimens. Data is acquired for each assessment depending on the participant's investigation program using specific input systems and prepared input forms. All collected data is integrated and, thus, harmonized in a central research database. This database consists of data tables referring to assessments (i.e., investigations). Their complexity ranges from tables with a small number of columns to tables with a very high column number. For example, the table referring to the Structured Clinical Interview (SCID) consists of more than 900 columns, i.e., questions and sub-questions of the interview input form.

The collected data are analyzed in an increasing number of analysis projects; currently, there are more than 170 projects active. Each project is initially specified by a proposal documenting the analysis goal, plan and the required data. However, there are two key aspects that are challenging. Firstly, the applicant needs to find assessments (research database tables) of interest to specify the requested data in the project proposal. This process can be very difficult and time-consuming, in particular, when the scientist is looking for specific data items (columns), such as weight and height, without knowing the corresponding assessment. Secondly, current project proposals typically request data from up to 50 assessments which are then organized in project-specific views (according to the data requests). These views can be very complex. Usually, they combine data from multiple research database tables, several selection expressions and a multitude of projected columns out of the data tables. A manual specification of database queries to create such views for each analysis project would be a very error-prone and time-consuming process and is, therefore, nearly impossible. Hence, we make the following contributions.

- We developed an ontological framework. The framework utilizes the LIFE Investigation Ontology (LIO) which classifies and describes assessments, relations between them, and their items.
- We implemented ontology-based tools using LIO to generate database queries which are stored as project-specific analysis views within the central research database. The views allow scientists and us to easily access and to export the requested data of an analysis project.

Both LIO and the ontology-based tools have been running in production mode for two years. The rest of the paper is organized as follows. Section 2 describes the ontological framework and especially LIO. Section 3 deals with ontology-based query formulation and transformation, while Section 4 describes some implementation aspects. Section 5 concludes the paper.
2 Framework

The goal of the ontological framework is to semantically describe all integrated data of biomedical investigations in LIFE using an ontology. The ontology is utilized on the one hand by scientists to search for data items or complete investigations of interest or simply to browse the ontology to get information about the captured data of the investigations. On the other hand, the ontology helps to query and retrieve data of the research database by formulating queries on a much higher level than SQL.

The ontological framework consists of three interrelated layers (Fig. 1). The integrated data layer comprises all data elements (instance data) of the central research database providing data of several source systems in an integrated, preprocessed and cleaned fashion. The metadata layer describes all instance data of the research database on a very technical level. To these metadata belong the used table and column names, corresponding data types, but also the original question or measurement text and the code list when a predefined answer set has been originally associated to the data item. This metadata is stored in a dedicated metadata repository (MDR) and is inherently interrelated with the instance data.

Figure 1: Framework overview
Figure 2: Selection of the LIFE Investigation Ontology

Finally, the ontology layer is represented by the developed LIFE Investigation Ontology (LIO) [KK10] and its mapping to the collected metadata in the MDR. LIO utilizes the General Formal Ontology (GFO) [He10] as a top-level ontology and, thus, reuses defined fundamental categories of GFO, such as Category, Presential and Process. Fig. 2a gives a high-level overview over LIO. Subcategories of GFO:Presential refer to collected scientific data, participants and specimens. Scientific data is structured on a technical level by categories within the sub-tree of LIO:Data. Instances of these categories are concrete data files and database tables, e.g., of the research database. They are used to locate instance data for later querying. Subcategories of GFO:Process refer to processes of two different types, data acquisition and material analysis processes. Instances of these categories are documented, e.g., by specific states and conditions of the examination, the examiner conducting the investigation etc. This process documentation is specified in addition to the scientific data that these processes possibly generate. The documentation can be used for downstream analysis of the scientific data, to evaluate the process quality and for an impact analysis on the measurement process. Finally, subcategories of GFO:Category are utilized to semantically classify biomedical investigations in LIFE. Fig. 2b shows an overview of the main categories. Fundamentally, we differentiate between items (e.g., questions of a questionnaire) and item sets. This separation allows us to ontologically distinguish between data tables (item set) of the research database and their columns (item). Moreover, we are able to classify both investigation forms, as predefined and rather static item sets, and project-specific analysis views, which potentially include items from multiple investigation forms. The latter can be dynamically defined and provided by a user group. All biomedical investigations together with their contained items (i.e., questions and measurements) are associated to LIO categories using the instance_of relationship type.
Fig. 3 shows an example: the interview socio-demography consists of several questions, including country of birth, marital status and graduation. Both the interview and its items are associated to LIO:Interview and LIO:Item, respectively. Special relationships of type has_item represent the internal structure of the interview. The semantic classification, i.e., the association to LIO subclasses of LIO:Metadata, is manually specified by the investigator in an operational software application from which it is imported into the overall metadata repository (MDR). Moreover, the MDR captures and manages the structure of each investigation form, and, thus, its items, and their representation in the central research database. By reusing both kinds of specifications in LIO, the mappings between collected metadata in the MDR and LIO categories are inherently generated. This makes it easy to describe and classify new investigations (assessments) in LIO; it necessitates only an initial manual semantic classification and import of corresponding metadata into the MDR.

Figure 3: Utilization of item set and item categories to describe investigations

3 Ontology-based Query Formulation and Transformation

Scientific data of the central research database are analyzed in specific analysis projects. Each of them is specified by a project proposal, i.e., the applicant describes the analysis goal, the analysis plan and the data she requests. In the simplest case, the data request consists of a list of assessments. This is extended, in some cases, by defined inclusion or exclusion criteria. In complex scenarios, the applicant is interested in specific items instead of all items of an assessment. In all scenarios, data can be queried per single assessment. However, it is common to request data in a joined fashion, especially when the applicant focuses on specific items from multiple assessments. Currently, each data request is satisfied by specific analysis views which are implemented as database views within the relational research database. We use LIO to formulate queries over the scientific data which are finally transformed and stored as project-specific analysis views in the research database. Fig. 4 sketches the query formulation and transformation process.

Firstly, LIO is used to formulate queries for each analysis project. The applicant can search for assessments of interest by browsing along LIO's structure and the associated instances, i.e., concrete assessments or items. She can select complete assessments as predefined item sets and specific items of an assessment. These selections are used by the query generator to create the query projection, i.e., the items for which data should be retrieved. These items are firstly sorted by the selected assessment in alphabetic order and, secondly, by their rank within each assessment, i.e., with respect to their position on the corresponding input form. Inclusion and exclusion criteria can be specified on item level. The query generator interrelates single conditions by the logical operator AND and creates the selection expression of the resulting query for each assessment. Per default, the query generator produces one query for each selected assessment or item set of an assessment. Moreover, experienced users can create new item sets containing items from multiple assessments. These item sets result in join queries using patient identifiers and examination time points (due to recurrent visits) as join criteria.
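To make the generation step concrete, the following sketch turns a small, ontology-level selection into a joined view definition. All table, column, and view names used here (PAT_ID, EXAM_TS, the assessment tables, the view name) are hypothetical placeholders and not the real research-database schema or the actual query generator.

def build_view(view_name, item_sets, criteria):
    # item_sets: {assessment table: [selected item columns]}
    # criteria:  list of SQL predicates, combined with AND (inclusion/exclusion criteria)
    # Tables are joined on patient identifier and examination time point.
    tables = sorted(item_sets)                       # assessments in alphabetic order
    cols = [f"{t}.{c}" for t in tables for c in item_sets[t]]
    base, joins = tables[0], []
    for t in tables[1:]:
        joins.append(f"JOIN {t} ON {t}.PAT_ID = {base}.PAT_ID "
                     f"AND {t}.EXAM_TS = {base}.EXAM_TS")
    where = f"\nWHERE {' AND '.join(criteria)}" if criteria else ""
    return (f"CREATE VIEW {view_name} AS\nSELECT {', '.join(cols)}\n"
            f"FROM {base}\n" + "\n".join(joins) + where)

# Hypothetical example: two assessments, one inclusion criterion.
print(build_view("ANALYSIS_VIEW_P42",
                 {"ANTHROPOMETRY": ["HEIGHT", "WEIGHT"],
                  "SOCIO_DEMOGRAPHY": ["GRADUATION"]},
                 ["ANTHROPOMETRY.WEIGHT IS NOT NULL"]))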
4 Implementation

LIO currently consists of 33 categories, more than 700 assessments and ca. 120 analysis results (the latter two are instances in LIO), together with more than 39,000 items in total. The large and increasing number of assessments, the items they contain and their correspondences (mappings) to database table and column metadata are stored in the MDR, which is implemented in a relational database system. Assessments and items are loaded on demand from the MDR and are associated to LIO categories as instances. Therefore, new assessments can easily be added to LIO without modifying LIO's core structure or changing ontology files. We implemented a Protégé [NFM00, Sc11] plug-in loading LIO and the corresponding instances from the MDR to support an applicant when she specifies the required data for a project proposal. She can navigate along LIO's structure and, hence, is able to find and pick the items of interest for her proposal. On the other hand, the plug-in allows us to formulate ontology-based queries and to transform them into SQL queries, which are then stored as project-specific analysis views over the scientific data of the research database. These views can be accessed in two different ways. Firstly, the views can be used for further database-internal data processing using the database API and SQL. This is the preferred way for persons with database skills. Secondly, the plug-in includes options to propagate views to a web-based reporting software which wraps the database views in tabular reports. These reports can be executed by an applicant. The retrieved data are then available for download to continue data processing with special analysis tools, such as SPSS, R etc.
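As a rough illustration of the on-demand loading described above, the sketch below reads assessments and items from hypothetical MDR tables and returns them as (instance, LIO category) pairs that a plug-in could attach to the ontology. The table and column names are invented and do not reflect the actual MDR schema or the Protégé API.

```python
# Illustrative sketch only: the MDR schema (tables "assessment", "item") and the
# column names are hypothetical; the real plug-in attaches the instances via the
# Protégé API instead of returning plain tuples.
import sqlite3

def load_lio_instances(mdr_path):
    """Read assessments and items from the MDR and pair them with LIO categories."""
    con = sqlite3.connect(mdr_path)
    instances = []
    # assessments carry their semantic classification, e.g. 'LIO:Interview'
    for name, category in con.execute("SELECT name, lio_category FROM assessment"):
        instances.append((name, category))
    # items of all assessments become instances of LIO:Item
    for (name,) in con.execute("SELECT name FROM item"):
        instances.append((name, "LIO:Item"))
    con.close()
    return instances
```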
There are other approaches which are highly related to our ontology-based framework. i2b2 [Mu10] is a framework for analyzing data in a clinical context. In contrast to our approach, it utilizes a separate data management and, thus, necessitates additional data load and transformation processes. Moreover, the goal of i2b2 is primarily to find relevant patients and not to retrieve scientific data. Like LIO, the Search Ontology [Uc14] is used to formulate queries over data. Its focus is on queries for search engines, while our approach focuses on structured data in a relational database. Similar to LIO, the Ontology of Biomedical Investigations (OBI) [Br10] classifies and describes biomedical investigations. In contrast to OBI, LIO utilizes a core structure which is dynamically extended by assessments fully described in a dedicated metadata repository. Hence, our framework is able to generate queries over the data of the research database without having to describe each investigation in detail using OBI.

5 Conclusion

We introduced an ontology-based framework to query large and heterogeneous sets of scientific data. The framework consists of the developed LIFE Investigation Ontology (LIO) on the top level, which semantically describes the scientific data of the central research database (base level). Both levels are interrelated by (technical) metadata which are managed in a metadata repository. On the one hand, LIO provides insight into which data are available within the research database; on the other hand, it is used to formulate queries over the collected scientific data. The ontology-based queries are transformed into database queries which are stored as analysis-specific database views. By default, the queries include items of a single assessment. Moreover, join queries merging items from multiple assessments are also supported. Together, ontology-based querying simplifies data querying for end users and frees IT staff from implementing rather complex SQL queries. In the future, we will extend LIO and the query generator to overcome current limitations, e.g., regarding the specification and transformation of query conditions.

Acknowledgment: This work was supported by the LIFE project. The research project is funded by financial means of the European Union and of the Free State of Saxony. LIFE is the largest scientific project of the Saxon excellence initiative.

References

[Br10] Brinkman, R. R. et al.: Modeling biomedical experimental processes with OBI. In Journal of Biomedical Semantics, 2010, 1 Suppl 1; pp. S7.

[He10] Herre, H.: General Formal Ontology (GFO): A Foundational Ontology for Conceptual Modelling. In (Poli, R.; Healy, M.; Kameas, A. Eds.): Theory and Applications of Ontology: Computer Applications. Springer Netherlands, Dordrecht, 2010; pp. 297–345.

[KK10] Kirsten, T.; Kiel, A.: Ontology-based Registration of Entities for Data Integration in Large Biomedical Research Projects. In (Fähnrich, K.-P.; Franczyk, B. Eds.): Proceedings of the annual meeting of the GI. Köllen Druck+Verlag GmbH, Bonn, 2010; pp. 711–720.

[Mu10] Murphy, S. N. et al.: Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). In Journal of the American Medical Informatics Association (JAMIA), 2010, 17; pp. 124–130.

[NFM00] Noy, N. F.; Fergerson, R. W.; Musen, M. A.: The Knowledge Model of Protégé-2000: Combining Interoperability and Flexibility. In (Goos, G. et al. Eds.): Knowledge Engineering and Knowledge Management: Methods, Models, and Tools. Springer Berlin Heidelberg, Berlin, Heidelberg, 2000; pp. 17–32.

[Sc11] Schalkoff, R. J.: Protégé, OO-Based Ontologies, CLIPS, and COOL. In: Intelligent Systems: Principles, Paradigms, and Pragmatics. Jones and Bartlett Publishers, Sudbury, Mass., 2011; pp. 266–272.

[Uc14] Uciteli, A. et al.: Search Ontology, a new approach towards Semantic Search. In (Plödereder, E. et al. Eds.): FoRESEE: Future Search Engines 2014 - 44. annual meeting of the GI, Stuttgart - GI Edition Proceedings P-232. Köllen, Bonn, 2014; pp. 667–672.
Improving Search Results in Life Science by Recommendations based on Semantic Information

Christian Colmsee1, Jinbo Chen1, Kerstin Schneider2, Uwe Scholz1, Matthias Lange1

1 Department of Cytogenetics and Genome Analysis, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Corrensstr. 3, 06466 Stadt Seeland, Germany, {colmsee,chenj,scholz,lange}@ipk-gatersleben.de

2 Department Automation / Computer Science, Harz University of Applied Sciences, Friedrichstr. 57-59, 38855 Wernigerode, Germany, {kschneider}@hs-harz.de

Abstract: The management and handling of big data is a major challenge in the area of life science. Besides data storage, information retrieval methods have to be adapted to huge data amounts as well. Therefore, we present an approach to improve search results in life science by recommendations based on semantic information. In detail, we determine relationships between documents by searching for shared database IDs as well as ontology identifiers. We have established a pipeline based on Hadoop allowing a distributed computation over large amounts of textual data. A comparison with the widely used cosine similarity has been performed; its results are presented in this work as well.

1 Introduction

Nowadays, the management and handling of big data is a major challenge in the field of informatics. Larger datasets are produced in less time. In particular, this aspect is intensively discussed in life science. At the technology level, new concepts and algorithms have to be developed to enable a seamless processing of huge data amounts. In the area of data storage, new database concepts are implemented, such as column-based storage or in-memory databases. With respect to data processing, distributed data storage and computation have been made available by new frameworks such as the Hadoop framework (http://hadoop.apache.org). Hadoop uses the MapReduce approach [DG08], which distributes tasks over the nodes of a cluster in the map phase and reduces the amount of data in the reduce phase. Furthermore, Hadoop is able to integrate extensions such as the column-oriented database HBase. Thus, the framework combines a distributed computation architecture with the advantages of a NoSQL database system. Hadoop has already been used in life science applications such as Hadoop-BAM [NKS+12] and Crossbow [LSL+09].

Besides these technological aspects, information retrieval (IR) plays an important role as well. In this context, search engines play a pivotal role for an integrative IR over widely spread and heterogeneous biological data. Search engines are complex software systems and have to fulfil various qualitative requirements to get accepted by the scientific community. Their major components are discussed in [LHM+14]:

- Linguistic (text and data decomposition, e.g. tokenization; language processing, e.g. stop words and synonyms)
- Indexing (efficient search, e.g. inverse text index)
- Query processing (fuzzy matching and query expansion, e.g. phonetic search, query suggestion, spelling correction)
- Presentation (intuitive user interface, e.g. faceted search)
- Relevance estimation (feature extraction and ranking, e.g. text statistics, text feature scoring and user pertinence)
- Recommender systems (semantic links between related documents, e.g. "page like this" and "did you mean")

The implementation of those components is part of the research project LAILAPS [ECC+14].
LAILAPS is an information retrieval system to link plant genomic data in the context of phenotypic attributes for detailed forward genetic research. The underlying search engine allows fuzzy querying for candidate genes linked to specific traits over a loosely integrated system of indexed and interlinked genome databases. Query assistance and an evidence-based annotation system enable a time-efficient and comprehensive information retrieval. The results are sorted by relevance using an artificial neural network incorporating user feedback and behaviour tracking. While the ranking algorithm of LAILAPS provides user-specific results, the user might also be interested in links from an entry of interest to other relevant database entries. Such a recommender system is still a missing LAILAPS feature but would have an enormous impact on the quality of search results. A scientist may search for a specific gene and retrieve all information relevant to this gene without a dedicated search in different databases. To realise such a goal, recommendation systems are a widely used method in information retrieval. This concept is already used in several life science applications. For example, EB-eye, the IR system for all databases hosted at the European Bioinformatics Institute (EBI), provides suggestions for alternative database records [VSG+10]. Other examples are PubMed-based IR systems for searching in biomedical abstracts [Lu11]. In this work, we describe a concept for providing recommendations in LAILAPS based on semantic information.

2 Results

When users search in LAILAPS for specific terms, the result is a list of relevant database entries. Besides the particular search result, the user would benefit from a list of related database entries that are potentially of interest. To implement this feature, it is necessary to measure the similarity between database entries. A widely used concept is the expression of a document as a vector of words (tokens) and the computation of document distances by cosine similarity. Within this approach, the tokens of the documents are compared, meaning that documents using similar words have a higher similarity to each other. But to get a more useful result, especially in the context of life science, it is necessary to integrate semantic information into the comparison of documents. Here we present a method allowing the estimation of semantic relationships between documents.

2.1 Get semantics with database identifiers

A widely used concept for providing semantic annotation in life science databases are ontologies such as the Gene Ontology (GO) [HCI+04]. Each GO term has a specific ID allowing the exact identification of a term. Besides the use of ontologies, annotation targets are repositories of gene functions. These databases, such as Uniprot [BAW+05], have in common that each database entry can be referenced by a unique identifier. With the help of such unique identifiers for ontologies and database entries, the documents in LAILAPS can be compared on a semantic level. If, for example, two documents share a GO identifier, this can be interpreted as a semantic connection between these documents. The final goal therefore is a recommendation system which determines this information and recommends database entries to the end user based on these unique identifiers. For the extraction of the above-mentioned identifiers, different methods can be applied. One method is the use of regular expressions with specific patterns; a token beginning with the letters GO, for example, is likely a GO term [BSL+11]. Another method is described by Mehlhorn et al. [MLSS12], where predictions are made with the support of a neural network. Feature extraction focuses on positions, symbols as well as word statistics to predict a database entry identifier. To include a very high number of database identifiers, we decided to use the neural-network-based approach, allowing the identification of IDs based on known ID patterns.
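For illustration, the regular-expression variant can be sketched in a few lines of Python; the patterns below (GO terms and EC numbers) are simplified stand-ins, and the pipeline described next actually uses the neural-network-based IDPredictor rather than such hand-written rules.

```python
# Simplified sketch of rule-based identifier extraction; the actual pipeline uses
# the neural-network-based IDPredictor [MLSS12] instead of hand-written patterns.
import re

PATTERNS = [
    re.compile(r"\bGO:\d{7}\b"),                 # Gene Ontology terms: GO: plus seven digits
    re.compile(r"\bEC\s?\d+\.\d+\.\d+\.\d+\b"),  # Enzyme Commission numbers
]

def extract_ids(text):
    """Return the set of candidate identifiers found in a document."""
    found = set()
    for pattern in PATTERNS:
        found.update(pattern.findall(text))
    return found

print(extract_ids("example text with GO:0001234 and EC 1.1.1.1"))
# {'GO:0001234', 'EC 1.1.1.1'}  (set order may vary)
```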
2.2 Determine document relations

We applied Hadoop to identify IDs in a high-throughput manner. The Hadoop pipeline has two MapReduce components (see Figure 1). The first MapReduce job has a database as its input file. Each database entry consists of a unique document ID as well as the document content. The mapper analyses each document and detects tokens that might be an ID, using the IDPredictor tool from Mehlhorn et al. [MLSS12]. The reducer then generates a list of pairs of a token and the documents including this token. The second MapReduce job determines the document relations. Here, the mapper builds pairs of documents having an ID in common. The reducer finally counts the number of shared IDs for each document pair. A high count of shared IDs means a high similarity between two documents. The source code of the pipeline is available at: http://dx.doi.org/10.5447/IPK/2014/18.

Figure 1: Hadoop pipeline including two MapReduce components as well as the ID prediction component
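The following single-machine Python sketch mimics the two jobs on toy data to show the data flow; the published pipeline runs these steps as Hadoop MapReduce jobs and detects IDs with IDPredictor, for which a trivial regular expression stands in here.

```python
# Single-machine sketch of the two MapReduce jobs (the published pipeline runs on
# Hadoop and uses IDPredictor for ID detection; a trivial regex stands in here).
import re
from collections import defaultdict
from itertools import combinations

ID_PATTERN = re.compile(r"\bGO:\d{7}\b|\bEC\s?\d+\.\d+\.\d+\.\d+\b")

docs = {  # document ID -> content (toy input)
    "doc1": "some annotation text GO:0001234 and EC 2.3.1.86",
    "doc2": "another entry mentioning GO:0001234 plus EC 2.3.1.86",
    "doc3": "unrelated entry with GO:0009999 only",
}

# Job 1: the mapper detects candidate IDs per document,
# the reducer groups documents per ID.
docs_per_id = defaultdict(set)
for doc_id, content in docs.items():                 # map phase
    for token in ID_PATTERN.findall(content):
        docs_per_id[token].add(doc_id)               # emit (ID, document ID)
# (the shuffle/reduce grouping is emulated here by the defaultdict)

# Job 2: the mapper builds document pairs per shared ID,
# the reducer counts the shared IDs per pair.
shared_ids = defaultdict(int)
for token, doc_ids in docs_per_id.items():           # map phase
    for a, b in combinations(sorted(doc_ids), 2):
        shared_ids[(a, b)] += 1                      # reduce: count per pair

print(dict(shared_ids))   # {('doc1', 'doc2'): 2}
```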
2.3 Cosine similarity versus ID prediction

As a benchmark, we computed documents from the Swissprot database and compared the ranking results with the cosine similarity mentioned in section 2. While the cosine similarity score between two documents is built upon word frequencies and results in a value between zero and one, the ID prediction score is an integer value based on shared IDs. To make both values comparable, we calculated z-scores for both ranking scores. To detect deviations in the ranking, the results were plotted in a scatterplot (see Figure 2). The plot illustrates that in most cases there are only small differences in the ranking. But there are some cases of large differences in the relative ranking, indicating that for specific document relations the semantic component leads to a completely different ranking compared to the simple approach of comparing words.

Figure 2: Scatterplot illustrating the different ranking results between cosine similarity and ID prediction score

When looking into specific results with strongly different rankings, semantic similarities could be detected. Picking up one example from Figure 2 (marked with a red circle), a document pair was ranked at place 1 by ID prediction and at place 148 by cosine similarity. When looking into these documents, we could determine that they share a lot of IDs such as EC (Enzyme Commission) numbers and GO terms. Both documents deal with fatty acid synthase in fungal species. A protein BLAST against Swissprot with the protein sequence of document A listed document B in the fourth position with a score of 1801 and an identity of 44%.

3 Discussion and Conclusion

In this work, we developed a system providing recommendations based on semantic information. With the support of a neural network, IDs were predicted. With this information, documents can be compared on a semantic level. To support big data in life science, we implemented the document distance computation as a Hadoop pipeline. The results of our approach have shown differences to cosine similarity in the rankings. The ID-prediction-based approach is able to detect semantic similarities between documents and to recommend this information to the users. However, to get a precise idea about the quality improvement, the new method should be applied to the LAILAPS frontend system to determine whether the users are more interested in this new information. To integrate the presented pipeline into LAILAPS, powerful systems such as ORACLE Big Data [Dj13] could be a solution. It supports multiple data sources including Hadoop, NoSQL as well as the ORACLE database itself. Although Hadoop is a powerful system, LAILAPS would also benefit from a more integrative approach such as using in-memory technology. Users who would like to install their own LAILAPS instance might not be able to set up their own Hadoop cluster. In-memory systems might allow a just-in-time computation of the available data as well. LAILAPS would benefit from further investigations in this field.

Acknowledgements

The expertise and support of Steffen Flemming regarding the establishment and maintenance of the Hadoop cluster is gratefully acknowledged.

References

[BAW+05] Bairoch, A.; Apweiler, R.; Wu, C.H.; Barker, W.C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M. et al.: The universal protein resource (UniProt). Nucleic Acids Research, 33 (suppl 1):D154-D159, 2005.

[BSL+11] Bachmann, A.; Schult, R.; Lange, M.; Spiliopoulou, M.: Extracting Cross References from Life Science Databases for Search Result Ranking. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, 2011.

[DG08] Dean, J.; Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.

[Dj13] Djicks, J.: Oracle: Big Data for the Enterprise. Oracle White Paper, 2013.

[ECC+14] Esch, M.; Chen, J.; Colmsee, C.; Klapperstück, M.; Grafahrend-Belau, E.; Scholz, U.; Lange, M.: LAILAPS – The Plant Science Search Engine. Plant and Cell Physiology, Epub ahead of print, 2014.

[HCI+04] Harris, M.A.; Clark, J.; Ireland, A.; Lomax, J.; Ashburner, M.; Foulger, R.; Eilbeck, K.; Lewis, S.; Marshall, B.; Mungall, C. et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32 (Database issue):D258, 2004.

[LHM+14] Lange, M.; Henkel, R.; Müller, W.; Waltemath, D.; Weise, S.: Information Retrieval in Life Sciences: A Programmatic Survey. In M. Chen, R. Hofestädt (editors): Approaches in Integrative Bioinformatics. Springer, 2014, pp. 73-109.

[LSL+09] Langmead, B.; Schatz, M.C.; Lin, J.; Pop, M.; Salzberg, S.L.: Searching for SNPs with cloud computing. Genome Biology, 10(11):R134, 2009.

[Lu11] Lu, Z.: PubMed and beyond: a survey of web tools for searching biomedical literature. Database, Oxford University Press, 2011.

[MLSS12] Mehlhorn, H.; Lange, M.; Scholz, U.; Schreiber, F.: IDPredictor: predict database links in biomedical database. Journal of Integrative Bioinformatics, 9, 2012.

[NKS+12] Niemenmaa, M.; Kallio, A.; Schumacher, A.; Klemelä, P.; Korpelainen, E.; Heljanko, K.: Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6):876-877, 2012.

[VSG+10] Valentin, F.; Squizzato, S.; Goujon, M.; McWilliam, H.; Paern, J.; Lopez, R.: Fast and efficient searching of biological data resources using EB-eye. Briefings in Bioinformatics, 11(4):375-384, 2010.
Sampling with Incremental MapReduce

Marc Schäfer, Johannes Schildgen, Stefan Deßloch
Heterogeneous Information Systems Group, Department of Computer Science, University of Kaiserslautern, D-67653 Kaiserslautern, Germany
{m schaef,schildgen,dessloch}@cs.uni-kl.de

Abstract: The goal of this paper is to increase the computation speed of MapReduce jobs by reducing the accuracy of the result. Often, timely processing is more important than the precision of the result. Hadoop has no built-in functionality for such an approximation technique, so the user has to implement sampling techniques manually. We introduce an automatic system for computing arithmetic approximations. The sampling is based on techniques from statistics, and the extrapolation is done generically. This system is also extended by an incremental component which enables the reuse of already computed results to enlarge the sampling size. This can be used iteratively to further increase the sampling size and thereby the precision of the approximation. We present a transparent incremental sampling approach, so the developed components can be integrated into the Hadoop framework in a non-invasive manner.

1 Introduction

Over the last ten years, MapReduce [DG08] has become an often-used programming model for analyzing Big Data. Hadoop (http://hadoop.apache.org) is an open-source implementation of the MapReduce framework and supports executing jobs on large clusters. Different from traditional relational database systems, MapReduce focuses on the three characteristics ("The 3 Vs") of Big Data, namely volume, velocity and variety [BL12]. Thus, efficient computations on very large, fast-changing and heterogeneous data are an important goal. One benefit of MapReduce is that it scales. So, it is well-suited for the KIWI approach ("Kill It With Iron"): if a computation is too slow, one can simply upgrade to better hardware ("Scale Up") or add more machines to a cluster ("Scale Out"). In this paper, we focus on a third dimension in addition to resources and time, namely computation accuracy. The dependencies of the dimensions can be depicted in a time-resources-accuracy triangle. It says that one cannot perform Big-Data analyses in a short time with few resources and perfect accuracy; the area of the triangle is constant. Thus, if one wants to be accurate and fast, more resources are needed (KIWI approach). If a hundred-percent accuracy is not mandatory, a job can run fast and without upgrading the hardware. On the one hand, most work regarding Big-Data analysis, i.e. frameworks and algorithms, is 100% precise. On the other hand, these approaches often give up the ACID properties and claim: eventual consistency ("BASE") is enough. So, let us add this: for many computations, a ninety-percent accuracy is enough. One example: who cares if the number of your friends' friends' friends in a social network is displayed as 1,000,000 instead of 1,100,000? Some people extend the definition of Big Data by a fourth "V": veracity [Nor13]. This means that the data sources differ in their quality. Data may be inaccurate, outdated or just wrong. So, in many cases, Big-Data analyses are already inaccurate. When using sampling, the accuracy of the result decreases again, but the computation time improves. Sampling means that only a part of the data is analyzed and the results are extrapolated at the end.
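A minimal sketch of this idea, independent of Hadoop and of the Marimba-based system described below: only a random fraction of the input is aggregated, and the result is extrapolated by the inverse sampling fraction. The function name and the toy data are illustrative only.

```python
# Minimal sketch of sampling-based approximation (not the Marimba implementation):
# analyze only a random sample of the input and extrapolate the aggregate.
import random

def approximate_sum(records, fraction=0.1, seed=42):
    """Estimate sum(records) from a Bernoulli sample of the given fraction."""
    rng = random.Random(seed)
    sample_sum = sum(r for r in records if rng.random() < fraction)
    return sample_sum / fraction          # extrapolate to the full data set

data = list(range(1_000_000))             # toy data set
print(approximate_sum(data, fraction=0.1))  # close to the exact value
print(sum(data))                            # exact value: 499999500000
```

In this picture, the incremental component described in the abstract would keep the already aggregated sample and add further sampled records to it in later iterations instead of starting over, which enlarges the sample and improves the precision of the estimate.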
Within this work, we extended the Marimba framework (see section 4.1) by a sampling component to execute existing Hadoop jo