ICT Seventh Framework Programme (ICT FP7)
Grant Agreement No: 318497
Data Intensive Techniques to Boost the Real-Time Performance of Global Agricultural Data Infrastructures
D5.1.2 Semantic Store Infrastructure
Deliverable Form
Project Reference No. ICT FP7 318497
Deliverable No. D5.1.2
Relevant Workpackage: WP5: Semantic Infrastructure
Nature: P
Dissemination Level: PU
Document version: V2.0
Date: 21/11/2014
Authors: IPB, UAH, NCSR-D
Document description: This document describes the development, integration, and deployment effort required to deploy the computational and software infrastructure needed to test and pilot SemaGrow technologies.
Document History
Version | Date | Author (Partner) | Remarks
ToC v0.1 | 20/09/2013 | IPB | Draft version of the ToC.
ToC v0.2 | 01/10/2013 | NCSR-D, UAH | Final version of the ToC.
Draft v0.3 | 15/10/2013 | IPB | Contribution from IPB.
Draft v0.4 | 31/10/2013 | NCSR-D | Contribution from NCSR-D and UAH.
Draft v0.9 | 1/11/2013 | NCSR-D, SWC | Internal review.
Final v1.0 | 11/11/2013 | IPB | Delivered as D5.1.1.
Draft v1.1 | 3/11/2014 | IPB, FAO | Added description of the FAO Web Crawler Database population and hosting at IPB (Sect. 4.3).
Draft v1.2 | 7/11/2014 | NCSR-D, UAH | Added description of the rdfCDF Toolkit and its connection to Repository Integration (Sect. 4.2).
Draft v1.3 | 14/11/2014 | UAH, IPB, NCSR-D | Added descriptions of various infrastructure tools deployed at NCSR-D and IPB (Chap. 4).
Draft v1.9 | 19/11/2014 | NCSR-D, SWC | Internal review.
Final v2.0 | 21/11/2014 | IPB | Delivered as D5.1.2.
EXECUTIVE SUMMARY
This document describes the development, integration, and deployment effort required to deploy the computational and software infrastructure needed to test and pilot SemaGrow technologies.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF TERMS AND ABBREVIATIONS
1. INTRODUCTION
1.1 Purpose and Scope
1.2 Approach
1.3 Relation to other Workpackages and Deliverables
1.4 Big Data Aspects
2. COMPUTATIONAL INFRASTRUCTURE
2.1 PARADOX III cluster
2.2 PARADOX IV cluster
2.3 4store cluster
3. ACCESS TO THE COMPUTATIONAL INFRASTRUCTURE
3.1 PARADOX batch system
3.2 EMI-based Grid layer
3.3 gUSE/WS-PGRADE portal layer
3.4 RESTful interface
4. SOFTWARE INFRASTRUCTURE
4.1 Toolkit For Repository Integration
4.2 rdfCDF Toolkit
4.3 Crawler Database Population and Hosting
4.4 Clean AGRIS Date/Time Service
4.5 SemaGrowREST
5. REFERENCES
LIST OF FIGURES
Figure 1: PARADOX III cluster
Figure 2: PARADOX IV installation
Figure 3: WS-PGRADE workflow example
Figure 4: Architecture of the toolkit for repository integration
LIST OF TABLES
Table 1: List of SemaGrow gLite services
Table 2: Characteristic examples of date cleaning
LIST OF TERMS AND ABBREVIATIONS
API: Application Programming Interface
BDII: Berkeley Database Information Index
CA: Certification Authority
CE: Computing Element
CLI: Command Line Interface
DB: Database
DC: Dublin Core
DCI: Distributed Computing Infrastructure
DPM: Disk Pool Manager
EMI: European Middleware Initiative
GP-GPU: General-purpose Computing on GPU
GPU: Graphics Processing Unit
GT: Globus Toolkit, a software framework for implementing computational grids
GUI: Graphical User Interface
HTTP: Hypertext Transfer Protocol
IEEE: Institute of Electrical and Electronics Engineers
JAR: Java Archive
JSON: JavaScript Object Notation
LFC: Logical File Catalogue
LOM: Learning Object Metadata
MCPD: Multi-crop Passport Descriptors
MPI: Message Passing Interface
MPICH: High-performance and widely portable implementation of the Message Passing Interface
OS: Operating System
PBS: Portable Batch System
RAM: Random-access Memory
RDF: Resource Description Framework
REST: Representational State Transfer, an architectural style for invoking web services over HTTP
SDMX: Statistical Data and Metadata Exchange
SE: Storage Element
SG: Science Gateway
SOA: Service Oriented Architecture
SPARQL: SPARQL Protocol and RDF Query Language
URL: Uniform Resource Locator
VO: Virtual Organisation
VOMS: Virtual Organization Membership Service
WMS: Workload Management System
XML: Extensible Markup Language
XSL: Extensible Stylesheet Language
XSLT: Extensible Stylesheet Language Transformations
1. INTRODUCTION
1.1 Purpose and Scope
The aim of this document is to present the semantic store infrastructure deployed for the project at the Institute of Physics Belgrade (IPB) data center, together with all developed and integrated components. The document describes the deployed large-scale computational infrastructure used for the project's experiments on distributed semantic stores, as well as the software infrastructure developed or integrated in order to support the SemaGrow experiments and pilots.
1.2 Approach
The deployed large-scale computational infrastructure used for the project's experiments is described in Chapter 2 and Chapter 3. Chapter 2 gives details on the technical characteristics of the infrastructure, while Chapter 3 lists the available end-user interfaces that enable access to the infrastructure. Chapter 4 documents the software developed in order to establish the infrastructure required for deploying Semagrow and carrying out experiments and pilots.
1.3 Relation to other Workpackages and Deliverables
The development, integration, and deployment effort documented in this deliverable was carried out within Task 5.1 of workpackage WP5. This work is needed for pilot deployment (Deliverable 6.2.1). The relationship between the software developed in this task and the software developed in Task 6.2 Pilot Deployment is as follows:
The software developed in Task 5.1 and documented in Chapter 4 of this deliverable is part
of the Semagrow ecosystem, developed by technical partners, but is not a core research
prototype. Its development was necessary in order to be able to run realistic pilots, but the
development itself was not part of piloting.
The software developed in Task 6.2 and documented in Deliverable 6.2.1 is client-side
software developed by use case partners. Its development was part of the piloting effort, in
the sense that it was used to evaluate the impact of Semagrow technologies on developing
clients that consume distributed data.
1.4 Big Data Aspects
Chapter 2 documents the computational infrastructure used to prepare and execute large-scale
experiments in SemaGrow. Specifically:
PARADOX III (Section 2.1) has been used for the large-scale triplification of NetCDF and XML
data using triplifiers developed in WP2 (cf. D2.2.2 and also Sect. 4.2) and the software
infrastructure developed in this task (Chapter 3).
PARADOX IV (Section 2.2) has been used for crawler database population.
During the third year:
PARADOX IV will be used for ISI-MIP hosting, as well as for creation of super-large scale
datasets using the data generator.
The 4store cluster (Section 2.3) will be used to populate and host the Crawler Database
(Section 4.3) service for the whole duration of the final project year.
Furthermore, Section 4.2 documents the development of the software infrastructure needed in order to expose large-scale NetCDF datasets without triplification. This will be used to compare the performance of serving triplified NetCDF datasets against serving queries directly over the raw data. This is particularly important for the large-scale NetCDF datasets used in the Heterogeneous Data Collections and Streams use cases, since their size makes their duplication in RDF stores a major concern, and experimentation is needed in order to understand and measure the performance gained by indexing in RDF stores.
2. COMPUTATIONAL INFRASTRUCTURE
The Institute of Physics Belgrade (IPB) provides the SemaGrow large-scale computational infrastructure for the project's experiments on distributed semantic stores. The same infrastructure is used for heterogeneous repository integration by the toolkit described in Chapter 4, as well as by the AgroTagger component provided by FAO. The infrastructure is organized in two clusters, PARADOX III and PARADOX IV, and their technical characteristics are given in this chapter. Besides the clustered resources, IPB provides additional hardware resources, services that support various access channels to the clusters, as well as SemaGrow-specific services (Triplestore server, 4store cluster, RESTful interface server). The technical characteristics of these additional hardware resources used for the installation of the various services are given in Chapter 3.
2.1 PARADOX III cluster
The PARADOX III cluster (Figure 1) consists of 88 computing nodes, each with two quad-core Intel Xeon E5345 processors at 2.33 GHz, totalling 704 processor cores. Each node contains 8 GB of RAM and 100 GB of local disk space. In addition to the scratch space (local disk space), PARADOX III provides up to 50 TB of disk storage that is shared between the machines. Nodes are interconnected in a star-topology Gigabit Ethernet network through three stacked high-throughput Layer 3 switches, each node being connected to the switch by two Gigabit Ethernet cables arranged in a channel bonding configuration. PARADOX III resources are available and can be accessed through various layers and gateways (Chapter 3), which are installed on 15 additional Xeon-based service nodes. In addition to standard applications developed using the serial programming approach, the cluster is optimized and heavily tested for parallel processing using the MPICH [2], MPICH-2, and OpenMPI [3] frameworks.
2.2 PARADOX IV cluster
The fourth major upgrade of the PARADOX installation (PARADOX IV, shown in Figure 2) consists of 106 working nodes and 3 service nodes. Working nodes (HP ProLiant SL250s Gen8, 2U height) are configured with two 8-core Intel Xeon E5-2670 Sandy Bridge processors at a frequency of 2.6 GHz and 32 GB of RAM (2 GB per CPU core). The total number of new processor cores in the cluster is 1696, in addition to the PARADOX III resources. Each working node contains an additional GP-GPU card (NVIDIA Tesla M2090) with 6 GB of RAM. With a total of 106 NVIDIA Tesla M2090 graphics cards, PARADOX IV is a premier computing resource in the wider region, providing access to a large production GPU cluster and new technology. The peak computing power of PARADOX IV is 105 TFlops, which is about 18 times more than the previous PARADOX III installation.
Figure 1: PARADOX III cluster.
One service node (HP DL380p Gen8), equipped with an uplink of 10 Gbps, is dedicated to cluster
management and user access (gateway machine). All cluster nodes are interconnected via
Infiniband QDR technology, through a non-blocking 144-port Mellanox QDR Infiniband switch. The
communication speed of all nodes is 40 Gbps in both directions, which is a qualitative step forward
over the previous (Gigabit Ethernet) PARADOX installation. The administration of the cluster is
enabled by an independent network connection through the iLO (Integrated Lights-Out) interface
integrated on motherboards of all nodes.
PARADOX IV also provides a data storage system, which consists of two service nodes (HP DL380p Gen8) and 5 additional disk enclosures. One disk enclosure is configured with 12 SAS drives of 300 GB (3.6 TB in total), while the other four disk enclosures are each configured with 12 SATA drives of 2 TB (96 TB in total), so that the cluster provides around 100 TB of storage space. Storage space is distributed via the Lustre high-performance parallel file system, which uses Infiniband technology, and is available on both working and service nodes.
The new PARADOX IV cluster is installed in four water-cooled racks, while an additional three racks contain the PARADOX III equipment. The cooling system consists of 7 cooling modules (one within each rack), which are connected via a system of pipes to a large industrial chiller and configured so as to minimize power consumption.
Figure 2: PARADOX IV installation.
2.3 4store cluster
4store (Garlik) is a scalable and stable RDF database that stores RDF triples in quad format. The 4store source code has been made available under the GNU General Public Licence version 3 and, together with the documentation, can be obtained from http://4store.org/. For SemaGrow purposes, a dedicated 4store cluster has been deployed within the IPB data center. The cluster consists of 8 quad-core Intel Xeon E5345 processors at 2.33 GHz, with 4 GB of RAM and 1 TB of disk space per processor. Nodes are interconnected in a star-topology Gigabit Ethernet network through a Layer 3 switch. The cluster is accessible via a SPARQL HTTP protocol server, which answers SPARQL queries using the standard SPARQL HTTP query protocol.
3. ACCESS TO THE COMPUTATIONAL INFRASTRUCTURE
The hardware resources organized in the PARADOX III and PARADOX IV clusters and described in Chapter 2 are available through three general-purpose layers (batch system, Grid layer, and gUSE/WS-PGRADE portal layer), and through a SemaGrow-dedicated RESTful interface.
3.1 PARADOX batch system
The PARADOX batch system uses the open-source Torque resource manager [4], also known as PBS or OpenPBS [5], and the Maui [6] batch scheduler. The batch system provides commands for job management that allow users to submit, monitor, and delete jobs, and it has the following main components:
pbs_server (job server), providing the basic batch services such as receiving/creating a batch
job, modifying the job, protecting the job against system crashes, and running the job;
pbs_mom (job executor), a daemon that places the job into execution when it receives a
copy of the job from the pbs_server; pbs_mom creates a new session with the environment
identical to a user login session and returns the job’s output to the user.
Maui (job scheduler), a daemon that holds the site's policies controlling job priorities, and where and when jobs will be executed. The Maui scheduler communicates with the various pbs_mom instances to learn about the state of system resources, and with the pbs_server to learn about the availability of jobs to execute.
In order to use the PARADOX batch system, a user has to access the PARADOX gateway machine (ui.ipb.ac.rs), create a job script (quite similar to a standard shell script), submit the job script file to the dedicated semagrow queue, and monitor the job. All technical details, recommendations, and PARADOX batch system usage instructions are described in the PARADOX User Guide [7].
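As a rough sketch of this procedure (not taken from the PARADOX User Guide; the resource requests and the Java application named below are placeholders), a job script for the semagrow queue could look as follows:

    #!/bin/bash
    #PBS -q semagrow              # dedicated SemaGrow queue
    #PBS -N triplify-example      # job name (placeholder)
    #PBS -l nodes=1:ppn=8         # example resource request (placeholder values)
    #PBS -l walltime=02:00:00

    cd $PBS_O_WORKDIR
    # Placeholder command; the actual converters are the Java applications
    # described in Section 4.1.
    java -jar converter.jar input.tar.gz

The script would then be submitted with qsub (e.g. qsub job.pbs) and monitored with qstat, as detailed in the PARADOX User Guide [7].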
3.2 EMI-based Grid layer
The gLite middleware [8] is a product of a number of current and past Grid projects, such as DataGrid, DataTag, Globus, EGEE, WLCG, and EMI. Through an almost decade-long process of development, gLite has become one of the most popular frameworks for building applications that tap into geographically distributed computing and storage resources. The gLite middleware is distributed as a set of software components providing services to discover, access, allocate, and monitor shared resources in a secure way and according to well-defined policies. These services form an intermediate layer (middleware) between the physical resources and the applications. Its architecture follows the Service Oriented Architecture (SOA) paradigm, simplifying interoperability among different, heterogeneous Grid services and allowing easier compliance with upcoming standards.
Generally, the complex structure of the gLite technology can be divided into four main groups:
Security services concern authentication, authorization, and auditing. An important role in
the context of authorization is played by the gLite Virtual Organization Membership Service
(VOMS), which allows fine-grained access control. Using a short-lived proxy to minimize the risk of identity compromise, the user is authenticated on the Grid infrastructure, while for long-running jobs the MyProxy (PX) service provides a proxy renewal mechanism that keeps the job proxy valid for as long as needed.
Job management services concern the execution and control of computational jobs throughout their whole lifetime on the Grid infrastructure. In gLite terminology, the Computing Element (CE) provides an interface to access and manage a computing resource, typically consisting of the batch queue of a cluster farm. The Workload Management System (WMS) provides a meta-scheduler that dispatches jobs to the available CEs best suited to run the user's job, according to its requirements and to well-defined VO-level and resource-level policies. Job status tracking during the job's lifetime and after its end is performed by the Logging and Bookkeeping (LB) service.
Data management services concern the access, transfer, and cataloguing of data. The granularity of data access control in gLite is at the file level. The Storage Element (SE) provides an interface to a storage resource, ranging from simple disk servers to complex hierarchical tape storage systems. The gLite LCG File Catalogue (LFC) service keeps track of the location of files (as well as of the relevant metadata) and of their replicas distributed in the Grid.
Information and monitoring services provide mechanisms to collect and publish information about the dynamic state of Grid services and resources, as well as to discover them. gLite has adopted two information systems: the Berkeley DB Information Index (BDII) and the Relational Grid Monitoring Architecture (R-GMA).
The Semagrow gLite services are located at IPB, and their technical characteristics are given in Table 1. All services are installed on dual-core Intel Xeon machines with 4 or more GB of RAM per machine. The machines run the Scientific Linux operating system [9] and EMI-3 middleware, which is regularly upgraded to the latest version.
Table 1: List of SemaGrow gLite services.
gLite service | Service endpoint | Technical characteristics
VOMS | voms.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
PX | myproxy.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
APEL | apel.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 6 GB RAM
CREAM CE | ce64.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
CREAM CE | cream.ipb.ac.rs | Dual-core Xeon E3110 @ 3.00 GHz, 4 GB RAM
WMS/LB | wms.ipb.ac.rs | Dual-core Xeon E3110 @ 3.00 GHz, 8 GB RAM
WMS/LB | wms-aegis.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 8 GB RAM
DPM SE | dpm.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
LFC | lfc.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
BDII | bdii.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
In order to use PARADOX resources, also known as the AEGIS01-IPB-SCL Grid site, through the Grid layer, a user has to be authenticated with a personal X.509 Grid certificate obtained from the corresponding national Certification Authority [10], and authorized at the SemaGrow-dedicated (vo.semagrow.rs) VOMS service [11]. Usage of the gLite interface is described in detail in the gLite User Guide [12].
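As a hedged illustration of this workflow, the following commands use the standard VOMS and gLite WMS client tools; the JDL contents and file names are placeholder examples rather than a tested SemaGrow recipe.

    # Create a VOMS proxy for the SemaGrow VO
    voms-proxy-init --voms vo.semagrow.rs

    # example.jdl -- a minimal job description (placeholder values)
    #   Executable    = "/usr/bin/java";
    #   Arguments     = "-jar converter.jar input.tar.gz";
    #   StdOutput     = "std.out";
    #   StdError      = "std.err";
    #   OutputSandbox = {"std.out", "std.err"};

    # Submit the job through the WMS and check its status
    glite-wms-job-submit -a example.jdl
    glite-wms-job-status <job-id-returned-by-submit>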
3.3 gUSE/WS-PGRADE portal layer
The gUSE/WS-PGRADE portal [13] includes a set of high-level Grid services (workflow manager, storage,
broker, grid-submitters for various types of Grids, etc.) and a graphical portal service based on
Liferay technology [14]. gUSE is implemented as a set of web services that bind together in flexible
ways on demand to deliver user services in Grid and/or Web services environments. User interfaces
for gUSE services are provided by the WS-PGRADE web application.
WS-PGRADE uses its own XML-based workflow language with a number of features: advanced parameter studies through special workflow entities (generator and collector jobs, parametric files), support for diverse distributed computing infrastructures (DCIs), condition-dependent workflow execution, and workflow embedding.
The structure of WS-PGRADE workflows can be represented by a directed acyclic graph (DAG),
illustrated in Figure 3. Big yellow boxes represent job nodes of the workflow, whereas smaller grey
and green boxes attached to the bigger boxes represent input and output file connectors (ports) of
the given node. Directed edges of the graph represent data dependency (and corresponding file
transfer) among the workflow nodes.
The execution of a workflow instance is data driven and governed by the graph structure: a node is activated (the associated job submitted) when the required input data elements (usually a file or a set of files) become available at each input port of the node. This node execution is represented as an instance of the created job. One node can be activated with several input sets (for example, in the case of a parameter sweep node), and each activation results in a new job instance. The job instances also contain status information, and in the case of successful termination the results of the calculation are represented in the form of data entities associated with the output ports of the corresponding node.
Figure 3: WS-PGRADE workflow example.
The SemaGrow-dedicated gUSE/WS-PGRADE portal is provided by IPB. It is hosted on a machine with two quad-core Xeon E5345 processors at 2.33 GHz and 8 GB of RAM, and is available at http://scibus.ipb.ac.rs/. The WS-PGRADE user interface is described in the WS-PGRADE Cookbook [15].
3.4 RESTful interface
Besides the previously described general-purpose layers, within the SemaGrow project the IPB team has developed a dedicated RESTful interface to the toolkit for large-scale integration of heterogeneous repositories. This interface is built on top of the CouchDB [16] REST API, which is extended by an additional layer, the CouchDB Proxy. This layer enables authentication with X.509 Grid certificates and X.509 RFC 3820-compliant proxy certificates, as well as tracking of document changes. It also significantly simplifies CouchDB document extension.

In parallel with CouchDB, several daemons track changes in the CouchDB documents and respond with corresponding actions (job submission, data management operations, etc.) on the Grid infrastructure. Since each action on the Grid requires authentication and authorization, the daemons are supplied with a robot certificate and are in this way certified on the Grid side.
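A minimal sketch of how such a daemon could follow document changes using CouchDB's standard _changes feed is given below; the database URL is a placeholder, and the actual daemons, their authentication, and the actions they trigger are part of the toolkit described in Section 4.1.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ChangesFollower {
        public static void main(String[] args) throws Exception {
            // Placeholder CouchDB database URL; the real database is internal to IPB.
            String db = "http://couchdb.example.org:5984/datasets";

            // Long-polling request against CouchDB's standard _changes API.
            // A real daemon would loop, passing the returned last_seq as 'since'.
            URL url = new URL(db + "/_changes?feed=longpoll&since=0");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Each change notification would normally be parsed as JSON and
                    // mapped to an action (job submission, data transfer, etc.).
                    System.out.println(line);
                }
            }
        }
    }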
4. SOFTWARE INFRASTRUCTURE
This chapter documents software developed in order to establish the infrastructure required for
deploying Semagrow and carrying out the pilots.
Specifically, this chapter documents:
The Toolkit for Repository Integration, comprising tools for using the large-scale distributed
computational infrastructure described above to prepare RDF data stores from non-RDF
datasets and to expose the resulting RDF datasets on SPARQL endpoints. This toolkit was
used to triplify NetCDF data for the Heterogeneous Data collections and Streams pilots and
XML data for the Reactive Data Discovery pilots.
The rdfCDF Toolkit, comprising tools for converting and serving NetCDF data as well as for
preparing NetCDF files by consuming data exposed on SPARQL endpoints. rdfCDF integrates
the NetCDF triplifier developed in WP2 (UAH, NCSR-D) with the NetCDF Creator and the
NetCDF Endpoint developed within this task (UAH, NCSR-D). This toolkit was used to triplify
and serve NetCDF data for the Heterogeneous Data collections and Streams pilots.
The FAO Web Crawler Database, populated by crawling and semantic annotation software
provided by FAO. Semagrow effort (IPB) pertains to deploying the software on IPB and
exposing the database contents as one of the endpoints to be federated for the Reactive
Data Analysis pilots.
cleanDT, an endpoint that serves a structured and well-formed publication date for AGRIS
bibliography entries. The dataset is automatically constructed by applying heuristics (UAH)
to the (often) informal values used in the AGRIS publication date field. These dates can be
more accurately joined against temporal specifications in other datasets used in the
Reactive Data Analysis pilots.
SemagrowREST (NCSR-D), a REST API that wraps the Semagrow Stack WebApp (D5.4.3)
under a NoSQL querying endpoint. This layer reduces the flexibility of what can be queried,
but simplifies client development for those cases that it does support in Reactive Data
Discovery pilots.
In the remainder of this chapter we will describe the software outlined above and explain its role in
the pilots. Please cf. Deliverable 6.2.1 Pilot Deployment for more details about the full data and
software suite used in each pilot.
4.1 Toolkit For Repository Integration
The project offers the SemaGrow SPARQL endpoint, which federates SPARQL endpoints over heterogeneous and diverse data sources. However, in several cases data are provided in non-RDF form. Furthermore, due to the heterogeneity of the data providers' backgrounds and the presence of several self-developed systems, data are made available using different metadata standards. In order to allow the integration of such non-SPARQL endpoints into the framework, during the first year of the project we developed a toolkit for large-scale integration of heterogeneous repositories.
Figure 4: Architecture of the toolkit for repository integration.
The data collections involved in the SemaGrow use cases are expressed in different formats, such as XML and NetCDF [1]. We designate as a dataset a set of such files from one collection, written using the same format and compressed into a single tarball (compressed tar archive).
The architecture of the toolkit is illustrated in Figure 4, and is organized into three main blocks:
RESTful interface that enables upload of datasets, and keeps technical metadata related to
datasets;
Gridified applications for conversion of the uploaded datasets to RDF datasets (Converters),
and upload of the produced RDFs to the triplestore (TripleStoreBuilder);
Triplestore with SPARQL endpoint.
Dataset providers use the RESTful interface of the toolkit to upload datasets. The upload is usually done in two steps: we first specify an HTTP location from where the dataset can be retrieved and stored on the Grid storage system, and afterwards send this information to the CouchDB
via an HTTP POST request. Once the information is stored in CouchDB, the upload of the dataset is performed by the Dataset Uploader component of the RESTful interface. This process is followed by the Grid Job Submitter component, which triggers submission of a job to the Grid infrastructure. The aim of the job is to convert the original files within the dataset to the corresponding RDF files. The triplification process is executed by a triplification module specific to the format and schema of each collection. For example, when the original data are expressed in XML and follow a certain metadata schema (e.g. LOM), the triplification module uses an XSL Transformation to produce the RDF triples from the original XML files. After the conversion, the produced RDFs are published to the triplestore by another gridified application, designated as TripleStoreBuilder.
Both the original datasets (provided manually) and the produced RDF datasets are stored permanently on the Grid storage system. Each step in the presented workflow is followed by a modification of the corresponding dataset's CouchDB document. Each modification introduces new technical metadata used for dataset tracking, such as: the location of the original and RDF datasets in the Grid storage system, the number of files within the dataset, relevant timestamps, problems encountered, etc.
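Purely as an illustration of such tracking metadata (all field names and values below are hypothetical, not the actual document schema used by the toolkit), a CouchDB dataset document could look roughly like this:

    {
      "note": "hypothetical example; field names and values are illustrative only",
      "_id": "agris-xml-2013-10",
      "original_url": "http://data.example.org/agris-2013-10.tar.gz",
      "original_grid_location": "lfn:/grid/vo.semagrow.rs/original/agris-2013-10.tar.gz",
      "rdf_grid_location": "lfn:/grid/vo.semagrow.rs/rdf/agris-2013-10.tar.gz",
      "format": "XML",
      "number_of_files": 12450,
      "uploaded": "2013-10-20T12:00:00Z",
      "converted": "2013-10-21T03:15:00Z",
      "problems": []
    }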
The applications ported to the Grid environment (the Converters and the TripleStoreBuilder) are written in the Java programming language. Java is designed to be platform-independent, which significantly simplified their porting. However, in order to avoid problems related to the compatibility of the applications with different Java versions, we have deployed the Java environment to the vo.semagrow.rs Grid software stack together with the applications. For more details on the architecture of the Converters and the TripleStoreBuilder, please cf. Deliverable 2.2.2, Data Streams and Collections.
4.2 rdfCDF Toolkit
The rdfCDF Toolkit serves as the intermediary between datasets in the NetCDF [1] format and the Semantic Web representation technologies and formats used by the SemaGrow Stack. Specifically,
the toolkit comprises the following:
The NetCDF Converter, a tool for the off-line triplification of NetCDF data into RDF, following an RDF schema developed in SemaGrow by appropriately extending the Data Cube schema.
The NetCDF Endpoint, a SPARQL endpoint that operates over unconverted NetCDF files. The
NetCDF Endpoint performs on-line, implicit triplification into the same schema as the
NetCDF Converter and serves the results via a Sesame SPARQL endpoint. This allows us to
establish SPARQL endpoints that serve the original NetCDF data without requiring
duplication into RDF stores.
The NetCDF Creator, a Semagrow Stack client that queries the triplified NetCDF data and uses the query results to construct NetCDF files. This allows us to dynamically select sub-datasets according to user-specified restrictions and to combine measurements (datapoints) from different NetCDF files.
The point of this ability to round-trip from data originally stored in NetCDF files back to NetCDF files is that it allows us to offer end-users results obtained by combining and filtering the contents of the original NetCDF files; such results cannot simply be found in the original collection, but need to be generated dynamically. Creating new NetCDF files out of these results is important for the users, since NetCDF files constitute appropriate input for their modelling experiments.
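To give a flavour of the kind of input that the NetCDF Converter and NetCDF Endpoint operate on, the following Java sketch uses the NetCDF-Java library to read one variable from a NetCDF file and print each datapoint in a triple-like form. The file name, variable name, and printed URIs are placeholders; the actual RDF schema is the SemaGrow extension of the Data Cube schema specified in D2.2.2.

    import ucar.ma2.Array;
    import ucar.nc2.NetcdfFile;
    import ucar.nc2.Variable;

    public class NetcdfSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder file and variable names.
            NetcdfFile nc = NetcdfFile.open("example.nc");
            try {
                Variable temperature = nc.findVariable("temperature");
                Array data = temperature.read();
                for (long i = 0; i < data.getSize(); i++) {
                    // Hypothetical triple-like rendering of one datapoint; the real
                    // Converter follows the SemaGrow Data Cube extension instead.
                    System.out.printf("<http://example.org/obs/%d> "
                            + "<http://example.org/schema#value> \"%f\" .%n",
                            i, data.getDouble((int) i));
                }
            } finally {
                nc.close();
            }
        }
    }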
4.3 Crawler Database Population and Hosting
The Crawler Database allows AGRIS Web Portal users to discover Web documents based on their relevance to a given bibliographic entry from the AGRIS database. This functionality exploits semantic annotations of crawled Web pages, allowing their semantic similarity to the bibliographic entry to be estimated. More specifically, the workflow of the application consists of:
Crawling: the Apache Nutch Web Crawler1, customized by FAO, crawls the Web starting from a list of Web sites configured by FAO to focus on the agricultural domain.
Parsing and annotating: the AgroTagger semantically annotates the crawled documents
with AGROVOC2 terms. The AgroTagger,3 developed by FAO, is based on the MAUI Indexer.4
This Java application produces RDF metadata that uses SKOS to link all the documents it
receives as input to AGROVOC terms.
Publishing: the produced RDF is loaded to the IPB 4store cluster (Section 2.3) and exposed
via a SPARQL endpoint.
This SPARQL endpoint is the Crawler Database that is one of the data sources used in the AGRIS
Demonstrator (cf. D6.2.1 Pilot Deployment).
4.4 Clean AGRIS Date/Time Service
The AGRIS dataset follows the BIBO schema and uses the bibo:issued property5 to provide publication dates. As a sub-property of dct:date,6 this property inherits a lack of specificity with respect to its range, which can be any RDF literal. This is appropriate for human consumption and allows the flexibility to leave publication dates under-specified, but it makes it considerably harder to develop queries that, for example, restrict a search to a given month or year. This difficulty extends to regular expression filtering, since month names are often provided in different languages. In order to offer the ability to safely author queries that join on dates, the project has added an endpoint that links AGRIS bibliographical entries with cleaned xsd:dateTime values.
These values are automatically gleaned from the original AGRIS data by cleanDT, a Java application that implements several regular expression heuristics to transform the AGRIS dates into the YYYY-MM-DD format. In cases of date ranges or under-specified dates that do not map to YYYY-MM-DD, the application responds with the most specific right-truncated under-specification of YYYY-MM-DD that includes the whole date range.7 Table 2 lists some characteristic examples.

The cleanDT source code is publicly available at http://bitbucket.org/bigopendata/cleandt

1 Please cf. http://nutch.apache.org
2 Please cf. http://aims.fao.org/standards/agrovoc/about
3 Please cf. https://github.com/agrisfao/agrotagger
4 Please cf. https://code.google.com/p/maui-indexer
5 Please cf. http://purl.org/ontology/bibo/issued
6 Please cf. http://purl.org/dc/terms/date
4.5 SemaGrowREST
SemaGrowREST is a REST API for the Semagrow Stack WebApp API. It provides the option to search a dataset's subjects, objects, or predicates using the q parameter, and to limit the number of results using the page_size parameter.
The SemaGrowREST source code is publicly available at
https://bitbucket.org/bigopendata/semagrowrest
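As a purely illustrative example (the host and path below are placeholders; only the q and page_size parameters come from the description above), a request might look like:

    GET http://semagrowrest.example.org/datasets/agris/subjects?q=wheat&page_size=10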
Table 2: Characteristic examples of date cleaning
AGRIS Value | Intention | cleanDT Value
8abr1995 | On 8 April 1995 | 1995-04-08
8-15dec1990 | Between 8 and 15 December 1990 | 1990-12
sum1985 | Summer 1985 | 1985
0000 | Unknown |

7 Formally speaking, the format used in XML Schema to lexically represent xsd:date, xsd:gMonth and xsd:gYear. Please cf. http://www.w3.org/TR/xmlschema-2/#isoformats
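The following Java fragment is a hypothetical sketch of the kind of regular expression heuristics described above; it is not the actual cleanDT implementation (available from the Bitbucket repository) and covers only the day-month-year pattern of the first example in Table 2.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DateHeuristicSketch {
        // Matches values like "8abr1995": day, month abbreviation, year.
        private static final Pattern DMY =
                Pattern.compile("(\\d{1,2})([a-z]{3})(\\d{4})");

        public static String clean(String agrisValue) {
            Matcher m = DMY.matcher(agrisValue.toLowerCase());
            if (m.matches()) {
                String month = monthNumber(m.group(2));
                if (month != null) {
                    return String.format("%s-%s-%02d",
                            m.group(3), month, Integer.parseInt(m.group(1)));
                }
            }
            return null; // unknown pattern: leave the date untouched
        }

        // Tiny multilingual month table (illustrative only).
        private static String monthNumber(String abbrev) {
            switch (abbrev) {
                case "jan": case "ene": return "01";
                case "abr": case "apr": return "04";
                case "dec": case "dic": return "12";
                default: return null;
            }
        }

        public static void main(String[] args) {
            System.out.println(clean("8abr1995")); // prints 1995-04-08
        }
    }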
5. REFERENCES
[1] Network Common Data Form (NetCDF), http://www.unidata.ucar.edu/software/netcdf
[2] MPICH Home Page, http://www.mpich.org/
[3] Open MPI: Open Source High Performance Computing, http://www.open-mpi.org/
[4] Garrick Staples, "TORQUE resource manager", SC'06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.
[5] OpenPBS patches, tools, and information, http://www.mcs.anl.gov/research/projects/openpbs/
[6] Maui Cluster Scheduler, http://www.adaptivecomputing.com/products/open-source/maui/
[7] PARADOX User Guide, http://www.scl.rs/paradox/PARADOX_UG-v1.pdf
[8] EMI Products, http://www.eu-emi.eu/products
[9] Scientific Linux Operating System, https://www.scientificlinux.org/
[10] EUGridPMA Clickable Map of Authorities, https://www.eugridpma.org/members/worldmap/
[11] SemaGrow VOMS service, https://voms.ipb.ac.rs:8443/voms/vo.semagrow.eu
[12] gLite User Guide, https://edms.cern.ch/file/722398/1.4/gLite-3-UserGuide.pdf
[13] Akos Balasko, Zoltan Farkas, Peter Kacsuk, "Building science gateways by utilizing the generic WS-PGRADE/gUSE workflow system", Computer Science 14 (2), 2013.
[14] Liferay technology, http://www.liferay.com/
[15] WS-PGRADE Cookbook, http://sourceforge.net/projects/guse/files/WS-PGRADECookbook.pdf
[16] Apache CouchDB, http://couchdb.apache.org/