SDS: A Framework for Scientific Data Services
Bin Dong, Suren Byna, John Wu
Scientific Data Management Group, Lawrence Berkeley National Laboratory
Contact: [email protected]

The Scientific Data Services (SDS) Framework
[Figure: SDS framework overview. Components shown: Application, SDS Query API, HDF5 API, SDS Client, Network, Parallel File System, SDS Server, Database, Batch System.]

SDS Client Implementation
[Figure: SDS Client data flow. Components shown: Application, HDF5 API, SDS Query Interface, Parser, Server Connector, Reader, Post Processor, SDS Server, Parallel File System. Flow labels: Data; File Handle and Query String; File Handle and Read Parameters; File Handle and Final Query String; Requested Data; Original File Metadata or Reorganized File Metadata.]

HDF5 API
• The HDF5 Virtual Object Layer (VOL) feature allows capturing HDF5 calls
• Developed a VOL plugin for SDS that captures file open, read, and close functions

SDS Query Interface
• An interface to perform SQL-style queries on arrays
• Functional API that can be used from C/C++ applications

Parser
• Checks the conditions in a query
• Verifies the validity of file names, etc.

Server Connector
• Packages a query or the information from an HDF5 read call and sends it to the SDS Server
• Uses protocol buffers for communication
• MPI rank 0 communicates with the server and then informs the remaining MPI processes

Reader
• Reads data from the dataset location returned by the SDS Server

Post Processor
• Performs any post-processing needed before copying the data to the application memory
• E.g., decompression, transposition, etc.
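As a rough illustration of the Parser step described above, the following Python sketch checks one query condition and validates the file and dataset names. The condition syntax (`<dataset> <op> <number>`), the function names, and the `known_datasets` and `file_exists` parameters are all assumptions made for this example; the poster does not specify the actual SDS query syntax, and the real interface is a C/C++ API.

```python
import os
import re

# Hypothetical condition syntax: "<dataset> <op> <number>", e.g. "E > 1.10".
# Two-character operators are listed first so ">=" is not read as ">".
_CONDITION = re.compile(
    r"^\s*([A-Za-z_]\w*)\s*(>=|<=|==|!=|>|<)\s*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_condition(text):
    """Split a condition into (dataset, operator, value) or raise ValueError."""
    match = _CONDITION.match(text)
    if match is None:
        raise ValueError(f"malformed condition: {text!r}")
    dataset, op, value = match.groups()
    return dataset, op, float(value)


def check_query(file_name, condition, known_datasets, file_exists=os.path.isfile):
    """Validate the file name and the condition before anything is sent on.

    `known_datasets` stands in for whatever metadata the client has about
    the file; `file_exists` is injectable so the check can be tested
    without touching a real file system.
    """
    if not file_exists(file_name):
        raise FileNotFoundError(f"no such file: {file_name}")
    dataset, op, value = parse_condition(condition)
    if dataset not in known_datasets:
        raise ValueError(f"unknown dataset: {dataset}")
    return {"file": file_name, "dataset": dataset, "op": op, "value": value}
```

A call such as `check_query("vpic.h5", "E > 1.10", {"E"})` would return a small dictionary that a later stage (here, the Server Connector) could package and send; the file name `vpic.h5` is only a placeholder.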
SDS Server Implementation
[Figure: SDS Server data flow. Components shown: Application (MPI Process 0 through MPI Process N), SDS Client, SDS Admin Interface, Request Dispatcher, Query Evaluator, Reorganization Evaluator, Data Organization Recommender, Data Reorganizer, SDS Metadata Manager, Parallel File System (Original File, Reorganized File). Flow labels: Client Request; Query Request; Query Response; Reorganization Request; User Reorganization Request; Periodic Self-start Request; File Handle and Reorganization Type; SDS Metadata; Organization List; Read Statistics; Frequently Read File Handle and its SDS Metadata; Job Script; Reorganization Job Running; Job Results; File Data; Attribute of Reorganized File. Caption text: Implementation with SDS Client library and persistent SDS Server.]

Request Dispatcher
• Receives requests from SDS clients and from the SDS Admin interface
• The SDS Admin interface issues reorganization commands
• Based on the request, the dispatcher passes it on to the Query Evaluator or the Reorganization Evaluator

Query Evaluator
• Looks up SDS Metadata to find the available reorganized datasets, and their locations, for a given dataset
• SDS Metadata: file name, HDF5 dataset info, permissions

Reorganization Evaluator
• Decides whether to reorganize based on the frequency of read accesses
• Takes commands from the Admin interface
• Instructs the Data Reorganizer to create a reorganization job script

Data Organization Recommender
• Identifies the optimal data reorganization
• Informs the Reorganization Evaluator of the selected strategy

Data Reorganizer
• Locates reorganization code, such as sorting and indexing algorithms
• Decides on the number of cores to use
• Prepares a batch job script
• Monitors the job execution
• After the reorganization job is complete, stores the new data location in the SDS Metadata Manager

Results
[Figure: Time (sec), 0 to 1, versus Number of Concurrent Clients, 40 to 320. Series: HDF5 Open+Close; SDS Metadata Read.]
[Figure: Full Data Scan versus Read Reorganized Data, per query; axis labels: Time (sec), Query; percentage scale from 0.00% to 140.00%. Speedup labels, paired with the queries in the order extracted: E>1.10: 1.2X; E>1.15: 2.3X; E>1.20: 8.1X; E>1.25: 18.9X; E>1.30: 25.7X; E>1.35: 36.3X; E>1.40: 42.0X; E>1.45: 44.8X; E>1.50: 51.2X.]
More results available in: Bin Dong, Suren Byna, and John Wu, “Expediting Scientific Data Analysis with Reorganization”, IEEE Cluster 2013

Problem
• File systems are static
• Data stored on current file systems is immutable
• Data is written, and often laid out on file systems, by data producers (simulations and experiments)
• Consumers of data are responsible for organizing the data for accelerated analysis; this step is often ignored, resulting in poor analysis performance

Solution
• The Scientific Data Services (SDS) framework
• Brings the merits of database management systems to managing scientific data on file systems
• Transparent access to data with existing scientific data interfaces (starting with HDF5)
• Transparent data reorganization and scientific data querying support

The authors thank Quincey Koziol and Mohamad Chaarawi from The HDF5 Group for their support with HDF5 VOL; Homa Karimabadi, William Daughton, and Vadim Roytershteyn for their guidance in understanding the read patterns of VPIC data analysis; and Arie Shoshani and Spyros Blanas for their thoughtful comments about the work. This work is supported in part by the Director, Office of Laboratory Policy and Infrastructure Management of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and used resources of the National Energy Research Scientific Computing Center (NERSC).

Contact: Suren Byna ([email protected])
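The comparison in Results between a full data scan and reading reorganized data, for queries such as E>1.10, can be illustrated with a toy example. The Python sketch below assumes the reorganization is a plain sort of one array, which is only one of the strategies the poster mentions (sorting, indexing); the function names and sample values are invented for illustration and are not the SDS implementation.

```python
from bisect import bisect_right


def full_scan(values, threshold):
    """Answer `value > threshold` by touching every element."""
    return [v for v in values if v > threshold]


def reorganize_sorted(values):
    """One-time reorganization: keep a sorted copy of the data."""
    return sorted(values)


def query_sorted(sorted_values, threshold):
    """Answer the same query on the sorted copy.

    A binary search finds the first element greater than the threshold,
    so only one contiguous tail of the array has to be read.
    """
    start = bisect_right(sorted_values, threshold)
    return sorted_values[start:]


if __name__ == "__main__":
    energy = [1.05, 1.42, 0.97, 1.18, 1.31, 1.10]  # made-up sample values
    reorganized = reorganize_sorted(energy)
    print(sorted(full_scan(energy, 1.10)))
    print(query_sorted(reorganized, 1.10))
```

Both calls select the same elements. The difference is that the sorted copy lets a selective query read a small contiguous region instead of the whole dataset, which is consistent with the larger speedup labels reported for the more selective queries.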