SDS: A Framework for Scientific Data Services
Bin Dong, Suren Byna, John Wu
Scientific Data Management Group
Lawrence Berkeley National Laboratory
Contact: [email protected]
The Scientific Data Services (SDS) Framework
[Figure: SDS framework architecture. Components: Application, SDS Query API, HDF5 API, SDS Client, Network, SDS Server, Database, Batch System, Parallel File System.]
SDS Client Implementation
[Figure: SDS Client data flow. Components: Application, SDS Query Interface, HDF5 API, Parser, Server Connector, Reader, Post Processor, SDS Server, Parallel File System. Labeled flows: File Handle and Query String; File Handle and Read Parameters; File Handle and Final Query String; Original File Metadata or Reorganized File Metadata; Requested Data; Data.]
HDF5 API
•  The HDF5 Virtual Object Layer (VOL) feature allows capturing HDF5 calls
•  Developed a VOL plugin for SDS for capturing file open, read, and close functions
SDS Query Interface
•  An interface to perform SQL-style queries on arrays
•  Functional API that can be used from C/C++ applications
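As an illustration of what an SQL-style selection over arrays means, the sketch below filters a set of equally long arrays by a condition on one of them. It is written in Python for brevity and is not the SDS Query API; the function name `select_where` and the dictionary-of-lists layout are assumptions made for this example.

```python
import operator

# Comparison operators accepted by this hypothetical query helper.
_OPERATORS = {
    "<": operator.lt, "<=": operator.le, ">": operator.gt,
    ">=": operator.ge, "==": operator.eq, "!=": operator.ne,
}


def select_where(arrays, column, op, threshold):
    """Keep the positions where `arrays[column] op threshold` holds.

    `arrays` maps a variable name to a list; all lists have equal length.
    """
    compare = _OPERATORS[op]
    keep = [i for i, value in enumerate(arrays[column])
            if compare(value, threshold)]
    return {name: [values[i] for i in keep]
            for name, values in arrays.items()}
```

A call such as `select_where({"E": [1.0, 1.2], "x": [10, 20]}, "E", ">", 1.1)` keeps only the second position of every array.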
Parser
•  Checks the conditions in a query
•  Verifies the validity of file names, etc.
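The two checks listed for the Parser can be sketched as follows, assuming a condition of the form `name op value`, such as `E > 1.10`. The function names `parse_condition` and `validate_query`, and the accepted operator set, are hypothetical and do not come from SDS.

```python
import os
import re

# Hypothetical grammar: <variable> <operator> <number>, e.g. "E > 1.10".
_CONDITION = re.compile(
    r"^\s*([A-Za-z_]\w*)\s*(<=|>=|==|!=|<|>)\s*([-+]?\d+(?:\.\d+)?)\s*$"
)


def parse_condition(text):
    """Split a condition string into (variable, operator, threshold)."""
    match = _CONDITION.match(text)
    if match is None:
        raise ValueError(f"malformed condition: {text!r}")
    name, op, value = match.groups()
    return name, op, float(value)


def validate_query(file_name, condition):
    """Check the file name and the condition before the query is sent on."""
    if not os.path.isfile(file_name):
        raise FileNotFoundError(file_name)
    return (file_name,) + parse_condition(condition)
```

Rejecting a malformed query on the client avoids a round trip to the server for a request that cannot succeed.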
Server Connector
•  Packages a query or the HDF5 read call information and sends it to the SDS server
•  Uses protocol buffers for communication
•  MPI rank 0 communicates with the server and then informs the remaining MPI processes
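The pattern of one rank contacting the server and sharing the answer can be sketched as below. This is a simplified stand-in, not the SDS implementation: JSON with a length prefix replaces protocol buffers so the example needs only the standard library, and `send_to_server` and `broadcast` are hypothetical callables supplied by the caller.

```python
import json
import struct


def pack_request(file_handle, query):
    """Serialize a request as a 4-byte big-endian length plus a JSON body."""
    payload = json.dumps({"file": file_handle, "query": query}).encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def unpack_request(message):
    """Inverse of pack_request, as the receiving side would apply it."""
    (length,) = struct.unpack("!I", message[:4])
    return json.loads(message[4:4 + length].decode("utf-8"))


def resolve_location(rank, request, send_to_server, broadcast):
    """Only rank 0 contacts the server; every rank receives the reply."""
    reply = send_to_server(pack_request(**request)) if rank == 0 else None
    return broadcast(reply)
```

With mpi4py, `broadcast` could be `lambda reply: comm.bcast(reply, root=0)`. Funnelling the request through a single rank keeps the number of server connections independent of the number of MPI processes.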
Reader
•  Reads data from the dataset location returned by the SDS Server
Post Processor
•  Performs any post-processing needed before copying to the application memory
•  E.g., decompression, transposition, etc.
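A minimal sketch of the two post-processing steps named above, assuming the data arrives as a zlib-compressed buffer of little-endian 64-bit floats in row-major order. The codec, the element type, and the function name `post_process` are assumptions chosen for illustration; the poster does not specify them.

```python
import struct
import zlib


def post_process(raw, rows, cols, compressed=True, transpose=False):
    """Decode a buffer of float64 values into a list of rows."""
    data = zlib.decompress(raw) if compressed else raw
    values = struct.unpack(f"<{rows * cols}d", data)
    # Rebuild the row-major layout as a list of rows.
    table = [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]
    if transpose:
        # Swap rows and columns before handing data to the application.
        table = [list(column) for column in zip(*table)]
    return table
```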
[Figure: SDS Server data flow. Components: Query Evaluator, Reorganization Evaluator, SDS Metadata Manager, Data Reorganizer, Data Organization Recommender, Parallel File System. Labeled flows: Query Request; Query Response; File Handle and Reorganization Type; SDS Metadata; Job Script; Organization List; File Data; Read Statistics; Job Results; Frequently Read File Handle and its SDS Metadata; Attribute of Reorganized File.]
Request Dispatcher
•  Receives requests from SDS clients and from the SDS Admin interface
•  The SDS Admin interface issues reorganization commands
•  Based on the request type, the dispatcher passes the request on to the Query Evaluator or the Reorganization Evaluator
Query Evaluator
•  Looks up SDS Metadata to find the available reorganized datasets, and their locations, for a given dataset
•  SDS Metadata: file name, HDF5 dataset info, permissions
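The lookup performed by the Query Evaluator can be sketched with an in-memory SQLite table. The schema, the column names, and the functions `open_metadata`, `register`, and `find_reorganized` are hypothetical; the poster lists only the kinds of metadata stored, not how they are laid out.

```python
import sqlite3


def open_metadata():
    """Create an in-memory metadata table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE sds_metadata ("
        " file_name TEXT, dataset TEXT, permissions TEXT,"
        " organization TEXT, location TEXT)"
    )
    return db


def register(db, file_name, dataset, permissions, organization, location):
    """Record one reorganized copy of a dataset."""
    db.execute(
        "INSERT INTO sds_metadata VALUES (?, ?, ?, ?, ?)",
        (file_name, dataset, permissions, organization, location),
    )


def find_reorganized(db, file_name, dataset):
    """Return (organization, location) pairs known for a dataset."""
    cursor = db.execute(
        "SELECT organization, location FROM sds_metadata"
        " WHERE file_name = ? AND dataset = ?"
        " ORDER BY organization",
        (file_name, dataset),
    )
    return cursor.fetchall()
```

An empty result means no reorganized copy is registered, in which case the client would read the original file.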
Reorganization Evaluator
•  Decides whether to reorganize based on the frequency of read accesses
•  Takes commands from the Admin interface
•  Instructs the Data Reorganizer to create a reorganization job script
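A frequency-based decision of this kind can be sketched as a read counter with a threshold. The class below is an illustration under assumed rules: the threshold value, the method names, and the choice to trigger each file only once are not taken from SDS.

```python
from collections import Counter


class ReorganizationEvaluator:
    """Decide when a file should be handed to the Data Reorganizer."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # reads needed before reorganizing
        self.reads = Counter()      # read count per file handle
        self.requested = set()      # files already scheduled

    def record_read(self, file_handle):
        """Count one read; return True the first time the threshold is met."""
        self.reads[file_handle] += 1
        hot = self.reads[file_handle] >= self.threshold
        return self._decide(file_handle, hot)

    def admin_request(self, file_handle):
        """An admin command bypasses the frequency check."""
        return self._decide(file_handle, True)

    def _decide(self, file_handle, wanted):
        # Schedule each file at most once.
        if wanted and file_handle not in self.requested:
            self.requested.add(file_handle)
            return True
        return False
```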
Data Organization Recommender
•  Identifies the optimal data reorganization
•  Informs the Reorganization Evaluator of the selected strategy
Data Reorganizer
•  Locates the reorganization code, such as sorting and indexing algorithms
•  Decides on the number of cores to use
•  Prepares a batch job script
•  Monitors the job execution
•  After the reorganization job completes, stores the new data location in the SDS Metadata Manager
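The core-count decision and the job script preparation can be sketched as follows. Several things here are assumptions for illustration: the poster does not name the batch system, so Slurm directives are used as an example; the sizing rule of one core per GiB, the job name `sds-reorg`, and the program path passed in by the caller are all hypothetical.

```python
import shlex


def choose_cores(size_bytes, bytes_per_core=1 << 30, max_cores=64):
    """Pick one core per GiB of input, between 1 and max_cores."""
    needed = -(-size_bytes // bytes_per_core)  # ceiling division
    return max(1, min(max_cores, needed))


def build_job_script(program, source, target, cores):
    """Return the text of a Slurm batch script for one reorganization job."""
    command = " ".join(shlex.quote(part) for part in (program, source, target))
    return "\n".join([
        "#!/bin/bash",
        "#SBATCH --job-name=sds-reorg",
        f"#SBATCH --ntasks={cores}",
        f"srun -n {cores} {command}",
        "",
    ])
```

Generating the script as text keeps the reorganizer independent of the scheduler: only `build_job_script` would change for a different batch system.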
Results
[Figure: SDS metadata read time compared with HDF5 open and close time. Series: HDF5 Open+Close, SDS Metadata Read. Y-axis: Time (sec). X-axis: Number of Concurrent Clients.]
[Figure legend and axis: Full Data Scan; Time (sec).]
[Figure: Implementation with SDS Client library and persistent SDS Server. Components: Application (MPI Process 0 through MPI Process N), SDS Client, SDS Admin Interface, SDS Server, Request Dispatcher, Original File, Reorganized File. Labeled flows: Client Request; User Reorganization Request; Periodic Self-start Request; Reorganization Request; Reorganization Job Running.]
Problem
•  File systems are static
•  Data stored on current file systems is immutable
•  Data is written, and often laid out on file systems, by data producers (simulations and experiments)
•  Consumers of data are responsible for organizing the data for accelerated analysis, a step that is often skipped, resulting in poor analysis performance
Solution
•  The Scientific Data Services (SDS) framework
•  Brings the merits of database management systems to managing scientific data on file systems
•  Transparent access to data with existing scientific data interfaces (starting with HDF5)
•  Transparent data reorganization and scientific data querying support
SDS Server Implementation
[Figure: query performance on reorganized data. Series: Read Reorganized Data. X-axis: Query. Speedup labels, paired with queries in their order of appearance:]
Query     Speedup
E>1.10    1.2X
E>1.15    2.3X
E>1.20    8.1X
E>1.25    18.9X
E>1.30    25.7X
E>1.35    36.3X
E>1.40    42.0X
E>1.45    44.8X
E>1.50    51.2X
More results available in:
Bin Dong, Suren Byna, and John Wu, “Expediting Scientific Data Analysis with Reorganization”, IEEE Cluster 2013
The authors thank Quincey Koziol and Mohamad Chaarawi from The HDF5 Group for their support with HDF5 VOL, Homa Karimabadi, William Daughton,
and Vadim Roytershteyn for their guidance in understanding the read patterns of VPIC data analysis, Arie Shoshani and Spyros Blanas for their thoughtful
comments about the work.
This work is supported in part by the Director, Office of Laboratory Policy and Infrastructure Management of the U.S. Department of Energy under
Contract No. DE-AC02-05CH11231, and used resources of The National Energy Research Scientific Computing Center (NERSC).
Contact:
Suren Byna ([email protected])