MataNui – Building a Grid Data Infrastructure that “doesn’t suck!”
G. K. Kloss and M. J. Johnson
{G.Kloss | M.J.Johnson}@massey.ac.nz
Institute of Information and Mathematical Sciences, Massey University, Auckland
Introduction
MOA (Microlensing Observations in Astrophysics) [1] is a Japan/New Zealand collaboration project. It makes observations on dark matter, extra-solar planets and stellar atmospheres using the gravitational microlensing technique, one of the few techniques capable of detecting low-mass extra-solar planets. The technique works by analysing large quantities of imagery from optical telescopes.
Abstract
Data management is a common problem in science and engineering, particularly when the partners within a project are geographically distributed and require fast access to data. Ideally, these partners would access and store data on local servers only, while remote partners retain access without manual intervention. This project attempts to solve such data management problems in an international research collaboration (in astrophysics) with the participation of several New Zealand universities. Data is to be accessed and managed along with its meta-data in several distributed locations, and it has to integrate with the infrastructure provided by the BeSTGRID project. Researchers also need a simple but powerful graphical user interface for data management. This poster outlines the requirements, implementation and tools involved for such a Grid data infrastructure.
Keywords: Grid Computing; Data Fabric; distributed; meta-data; data replication; GUI client; DataFinder.
The Problem
Astronomers world-wide are producing telescopic images. Teams across New Zealand and Japan access these for their research, creating higher-level data products. All these data files need to be stored and retrieved. Currently they are either stored on remotely accessible servers, or they are transferred via removable offline media. Additionally, every stored item is annotated with potentially extensive sets of meta-data. Researchers commonly keep essential parts of this meta-data separately on their own systems, so that they can identify particular items to retrieve for their work.
This process is tedious and cumbersome, especially as not all data files are available online, and accessing them may require various forms of offline media. Files that are available online have to be retrieved from remote systems through potentially slow connections. Direct and homogeneous access patterns for all data files and their associated meta-data do not exist.
Data management is not a new topic, and many solutions for it are already available. Many of them are hand-knit, and many are commercial and potentially very expensive. More importantly, they usually do not work well with current Grid infrastructures, as they were not designed to be “Grid ready.” They are often complicated and require altering the research workflow to suit the needs of the system. Lastly, the ones meeting most of the requirements commonly do not provide graphical end-user tools to support data-intensive research.
Front Ends
GridFTP is the most common way to integrate data services into a Grid environment. It is commonly used for scripts and automation. GridFTP is the lowest common denominator for compatibility with the Grid, and it features (among others) Grid certificate based authentication and third-party transfers.
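As an illustration only (this requires a working Grid environment; the host names and paths below are hypothetical), a third-party transfer between two GridFTP servers can be initiated with the standard globus-url-copy client:

```shell
# Obtain a proxy from the user's Grid certificate, then let the two
# servers exchange the file directly (third-party transfer).
grid-proxy-init
globus-url-copy -vb \
    gsiftp://gridftp.example.ac.nz/data/obs-0001.fits \
    gsiftp://gridftp.example.ac.jp/data/obs-0001.fits
```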
File System Mounts are a common way to integrate externally stored file systems directly into the host system of compute resources (e.g. compute clusters or high-performance computing servers). This enables scripts and applications to use the data simply and directly, without an additional retrieval or upload step.
Figure 1: DataFinder concept: data modelling and storage.
The DataFinder GUI Client [2] is an easy-to-use end-user application supporting researchers’ data-intensive needs (Fig. 2 and 4). It has been developed as open source software by the German Aerospace Centre to support internal projects and external partners. The application allows easy and flexible access to remote data repositories with associated meta-data. The DataFinder is designed for scientific and engineering purposes, and it assists in this through the following:
• Handles access/transfer to/from data server(s)
• Retrieval and modification of meta-data
• Extensive (server-side) queries on all meta-data
• Support for project-specific policies:
– Data hierarchy definition
– Enforcement of workflows
– Meta-data specification
• Scripting to automate recurring tasks
• Can integrate 3rd party (GUI) tools
The DataFinder can act as a universal Grid/storage system client [3] (Fig. 1), as it is easily extensible to connect to further storage sub-systems (beyond those already available).
Figure 2: Integration of GUI
applications with DataFinder.
Implementation
See Fig. 3.
Storage back-end (GridFS on MongoDB) – For a straightforward implementation, a suitable data storage server was sought. We chose the “NoSQL” database MongoDB [4]. It features the “GridFS” storage mode, capable of storing file-like data (in large numbers and sizes) along with its meta-data. MongoDB can work in federation with distributed servers, with data automatically replicated to the other instances. Therefore, every site can operate its own local MongoDB server, keeping data access latencies low and performance high.
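As a minimal sketch of this storage mode (assuming the pymongo driver; the server address, database name and meta-data fields below are illustrative placeholders, not the project’s actual schema), a file and its meta-data can be stored in GridFS like this:

```python
def make_metadata(telescope, target, observed_at):
    """Build the meta-data document stored alongside each image.
    The field names are illustrative placeholders."""
    return {"telescope": telescope, "target": target, "observed_at": observed_at}

def store_image(db, path, metadata):
    """Store a local file in GridFS together with its meta-data document."""
    # Imported here so the pure helper above runs without pymongo installed.
    import gridfs
    fs = gridfs.GridFS(db)  # default "fs" bucket (fs.files / fs.chunks collections)
    with open(path, "rb") as f:
        return fs.put(f, filename=path, metadata=metadata)

if __name__ == "__main__":
    from pymongo import MongoClient
    # Hypothetical local replica; each site would run its own MongoDB instance.
    db = MongoClient("mongodb://localhost:27017/")["moa"]
    meta = make_metadata("MOA-II", "bulge-field-1", "2010-07-01T12:00:00Z")
    file_id = store_image(db, "obs-0001.fits", meta)
```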
Native file system mount (GridFS FUSE) – A GridFS FUSE driver [5] is already available, so a remote GridFS volume can be mounted into a local Linux system.
Grid front-end (GridFTP) – To provide access through Grid mechanisms, the Griffin GridFTP server [6] by the Australian Research Collaboration Service (ARCS) is equipped with a GridFS storage back-end. Through this, every Grid-capable tool can be used to store/retrieve files with any of the MongoDB instances interfaced by a Griffin server. This access method also allows Grid applications to access the storage server using the commonly used Grid certificates.
Requirements
The envisioned Grid-enabled data management system has to meet a few requirements. Most of all, it should be implementable without “reinventing the wheel”: it should be possible to source large portions of its essential components from existing (free) tools, with “just” some “plumbing” needed to join them so that they meet these requirements:
• Handle large amounts of data
• Arbitrary amounts of meta-data
• Manage storage/access from remote locations
• Use local access/storage through replication
• Perform (server side) queries on the meta-data
• Be robust, easy to deploy and easy to use
• Performance on larger data collections
• Use Grid Computing standards/practices
GUI front-end (DataFinder) – The DataFinder is interfaced with the GridFS storage back-end. To avoid giving a remote end-user client full access to the MongoDB server, a server interface layer is introduced. For this, a RESTful web service authenticating against a Grid certificate is implemented, based on the Apache web server through the WSGI interface layer [7]. On the client side, the DataFinder is equipped with a storage back-end accessing this web service. The DataFinder is currently the only client fully capable of making use of the available meta-data (creating, modifying and accessing meta-data, as well as performing efficient server-side queries on it). Particularly the server-side queries reduce data access latencies significantly and improve query performance.
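The service layer can be sketched as a minimal WSGI application. The in-memory store, URL scheme and meta-data fields below are illustrative assumptions; in the real service Apache’s mod_ssl validates the client’s Grid certificate and exposes its subject DN to the application via the SSL_CLIENT_S_DN environment variable:

```python
import json

# Hypothetical in-memory stand-in for the GridFS-backed store.
ITEMS = {"obs-0001": {"telescope": "MOA-II", "target": "bulge-field-1"}}

def application(environ, start_response):
    """Minimal WSGI sketch of the RESTful meta-data service.
    Here we only check that a certificate subject DN is present;
    a real deployment would authorise the DN as well."""
    if not environ.get("SSL_CLIENT_S_DN"):
        start_response("401 Unauthorized", [("Content-Type", "text/plain")])
        return [b"client certificate required"]
    item_id = environ.get("PATH_INFO", "/").lstrip("/")
    item = ITEMS.get(item_id)
    if item is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(item).encode("utf-8")]
```

The same callable can be served directly by Apache through mod_wsgi, which is one of the deployment options benchmarked in [7].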
WebDAV front-end (Catacomb) – A potential future pathway to access GridFS content is the Catacomb WebDAV server [8]. It can be modified to use GridFS/MongoDB as a storage back-end instead of the MySQL relational database it currently uses.
Results
By choosing suitable existing building blocks, it becomes comparatively simple to implement a consistent Grid data infrastructure with the desired features. The implementation is currently making good progress, and is expected to be simple to deploy and configure, as well as to integrate seamlessly into the BeSTGRID or other projects’ infrastructures. Particularly the problems of operating on large amounts of annotated data from astrophysics research stand to benefit significantly from this work. Data can be stored and accessed by geographically remote partners equally fast, and processing on the data can be performed locally. Data processing can easily be conducted on sets returned as the results of queries (e.g. for particular spatial regions, for specific phenomena indicated in the meta-data, for data produced by certain telescopes, within given time frames, etc.).
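For instance, a server-side query selecting observations by telescope, time frame and sky region could be expressed as a MongoDB filter over the GridFS file documents. The meta-data field names below are illustrative assumptions, not the project’s actual schema:

```python
def observation_query(telescope, t_start, t_end, ra_range, dec_range):
    """Build a MongoDB filter over GridFS file documents (fs.files).
    GridFS stores user meta-data under the 'metadata' sub-document;
    the fields within it are illustrative placeholders."""
    return {
        "metadata.telescope": telescope,
        "metadata.observed_at": {"$gte": t_start, "$lt": t_end},
        "metadata.ra": {"$gte": ra_range[0], "$lt": ra_range[1]},
        "metadata.dec": {"$gte": dec_range[0], "$lt": dec_range[1]},
    }

# A server holding the fs.files collection evaluates such a filter locally,
# e.g. db["fs.files"].find(observation_query(...)), so only matching
# documents cross the network.
```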
References
[1] I. A. Bond, F. Abe, R. Dodd, et al., “Real-time difference imaging analysis of MOA Galactic bulge observations during 2000,” Monthly
Notices of the Royal Astronomical Society, vol. 327, pp. 868–880, 2001.
[2] T. Schlauch and A. Schreiber, “DataFinder – A Scientific Data Management Solution,” in Proceedings of Symposium for Ensuring
Long-Term Preservation and Adding Value to Scientific and Technical Data 2007 (PV 2007), Munich, Germany, October 2007.
[3] T. Schlauch, A. Eifer, T. Soddemann, and A. Schreiber, “A Data Management System for UNICORE 6,” in Proceedings of EuroPar
Workshops – UNICORE Summit, ser. Lecture Notes in Computer Science (LNCS). Delft, Netherlands: Springer, August 2009.
[4] “MongoDB Project,” http://www.mongodb.org/.
[5] M. Stephens, “GridFS FUSE Project,” http://github.com/mikejs/gridfs-fuse.
[6] S. Zhang, P. Coddington, and A. Wendelborn, “Connecting arbitrary data resources to the Grid,” in Proceedings of the 11th International Conference on Grid Computing (Grid 2010). Brussels, Belgium: ACM/IEEE, October 2010.
[7] N. Piël, “Benchmark of Python WSGI Servers,” http://nichol.as/benchmark-of-python-web-servers, March 2010.
[8] M. Litz, “Catacomb WebDAV Server,” in UpTimes – German Unix User Group (GUUG) Members’ Magazine, April 2006, pp. 16–19.
Figure 3: Overview Grid data infrastructure.
Figure 4: Turbine simulation workflow with DataFinder (with custom GUI dialogues).