Data Staging and Asynchronous I/O in ADIOS

Transcription

Hasan Abbasi (ORNL)
Jong Choi (ORNL)
Greg Eisenhauer (Georgia Tech)
Scott Klasky (ORNL)
Manish Parashar (Rutgers)
Norbert Podhorszki (ORNL)
Nagiza Samatova (NCSU)
Karsten Schwan (Georgia Tech)
Matthew Wolf (Georgia Tech)
ORNL is managed by UT-Battelle
for the US Department of Energy
Outline
•  ADIOS Overview
•  Introduction to Staging
•  Data Management in I/O Pipelines
•  Staging in ADIOS
•  Network and System service discussion
ADIOS
http://www.nccs.gov/user-support/center-projects/adios/
•  Abstracts Data-at-Rest to Data-in-Motion for HPC
   –  Provides portable, fast, scalable, easy-to-use, metadata-rich output
   –  Dynamically allows users to change the method during an experiment/simulation
•  Provides solutions for “90% of the applications”
•  ADIOS has been cited almost 1,000 times
[Figure: ADIOS framework diagram. Labels: Interface to apps for description of data (ADIOS, etc.); Data Management Services (Feedback, Buffering, Schedule); Multi-resolution methods; Data Compression methods; Data Indexing (FastBit) methods; Plugins to the hybrid staging area (Provenance, Workflow Engine, Runtime engine, Analysis Plugins, Visualization Plugins); output formats Adios-bp, IDX, HDF5, pnetcdf, “raw” data, Image data; Parallel and Distributed File System; Data movement; Viz. Client.]
•  Astrophysics
•  Climate
•  Combustion
•  CFD
•  Environmental Science
•  Fusion
•  Geoscience
•  Materials Science
•  Medical: Pathology
•  Neutron Science
•  Nuclear Science
•  Quantum Turbulence
•  Relativity
•  Seismology
•  Sub-surface modeling
•  Weather
Improving I/O Methods for High End Simulations
•  Reduce I/O overhead, reduce network data movement, improve writing and reading performance
•  To achieve this goal, ADIOS provides many methods
   –  Posix (1 file per process, independent set of files)
   –  Posix (1 file per process + metadata; read as one dataset)
   –  MPI-Lustre (MPI-IO writing to 1 global file)
   –  Aggregate (1 file per OST) + 1 metadata file
   –  BG (1 file per rack) + 1 metadata file
   –  …
•  There’s no single right answer for all users.
   –  ADIOS gives the user flexibility without rewriting code.
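The idea of switching the I/O method without touching application code can be illustrated with a small sketch. The registry, the method names `POSIX` and `SINGLE_FILE`, and the `write_output` function are hypothetical and are not the ADIOS API; they only show a write strategy being selected from configuration at run time.

```python
import os
import tempfile

# Hypothetical registry: method name -> writer function (not the ADIOS API).
METHODS = {}


def method(name):
    """Decorator that registers a writer function under a method name."""
    def register(fn):
        METHODS[name] = fn
        return fn
    return register


@method("POSIX")
def write_file_per_process(outdir, rank_buffers):
    """Write one file per process (here: one file per list entry)."""
    paths = []
    for rank, data in enumerate(rank_buffers):
        path = os.path.join(outdir, f"out.{rank}.dat")
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths


@method("SINGLE_FILE")
def write_one_global_file(outdir, rank_buffers):
    """Write every process's data into one shared file, in rank order."""
    path = os.path.join(outdir, "out.dat")
    with open(path, "wb") as f:
        for data in rank_buffers:
            f.write(data)
    return [path]


def write_output(config, outdir, rank_buffers):
    """The application always calls this; only the config entry changes."""
    return METHODS[config["method"]](outdir, rank_buffers)


def demo():
    buffers = [b"aaaa", b"bbbb", b"cccc"]
    with tempfile.TemporaryDirectory() as d:
        n_posix = len(write_output({"method": "POSIX"}, d, buffers))
        n_single = len(write_output({"method": "SINGLE_FILE"}, d, buffers))
    return n_posix, n_single
```

Calling `demo()` writes the same three buffers twice, once as three files and once as a single file, with no change to the calling code.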
Large Writes per Many Cores
•  First effort shows performance goes from 50 GB/s to over 100 GB/s
•  New feature for IBM BG/Q: the serial metadata creation step in ADIOS is now optional
   –  Metadata creation is serial due to the problem of thread-safe MPI on most systems
•  Testing has begun to use staging to write data
   –  Problem is the size of the staging area
      •  Requires over 10K cores for staging…
      •  GPU on staging is useless if we do NOT do other processing
Small Writes per Many Cores (Combustion)
•  Requires high performance I/O due to large output (200 GB/10 minutes)
•  Frequent reading of large datasets on a small number of processors for analytics
•  Individual process output is small, leading to low utilization of network bandwidth with other I/O solutions
•  Reading large datasets with a different access pattern than the one they were written with leads to
   –  frequent seeking for data
   –  very low read bandwidth
•  Analysis codes spend 90% of their time reading data
•  Allowed ADIOS team to focus on small but frequent output data
Spatial Temporal Aggregation
•  Temporal aggregation opens up another horizon for further consolidating data
•  Data of multiple time steps are merged at each process
•  Data is written out only at the last time step or when the buffer reaches the boundary of memory capacity
•  Achieved up to 70x speedup for read performance and 11x speedup for write performance in the mission-critical climate simulation GEOS-5 (NASA), on Jaguar
GEOS-5 Results
•  Common read patterns for GEOS-5 users are reduced from 10 to 0.1 seconds
•  Allows interactive data exploration for mission-critical visualizations
[Figure: data layout, Original vs. New. Original: variables V1, V2, V3, … stored per time step T1, T2, T3. New: time steps T1,2,3 merged. A 2-D variable at 3 time steps is merged into a 3-D variable with time as the new dimension.]
[Figure: read performance of a 2D slice of a 3D variable + time, Original vs. New. Plot title: Temperature tendency from moist physics; axes: Latitude and Date (10/29/79 to 10/31/79).]
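The buffering rule described on this slide (merge time steps in memory, write at the last step or when memory runs out) can be sketched as follows. This is a minimal illustration in plain Python, not the ADIOS implementation; the class name `TemporalAggregator`, the budget expressed as a count of values, and the `write_fn` callback are all assumptions made for the example.

```python
class TemporalAggregator:
    """Buffer 2-D fields from successive time steps and write them out
    as one 3-D block indexed [time][row][column]."""

    def __init__(self, max_buffered_values, write_fn):
        self.max_buffered_values = max_buffered_values  # assumed memory budget
        self.write_fn = write_fn                        # called once per flush
        self.steps = []
        self.buffered = 0

    def add_step(self, field_2d, last_step=False):
        size = sum(len(row) for row in field_2d)
        # Flush first if adding this step would exceed the memory budget.
        if self.steps and self.buffered + size > self.max_buffered_values:
            self.flush()
        self.steps.append([list(row) for row in field_2d])
        self.buffered += size
        # Always write at the last time step.
        if last_step:
            self.flush()

    def flush(self):
        if self.steps:
            self.write_fn(self.steps)
            self.steps = []
            self.buffered = 0


def demo(budget):
    """Feed three 2x2 time steps; return the number of steps in each write."""
    writes = []
    agg = TemporalAggregator(budget, writes.append)
    for t in range(3):
        agg.add_step([[t, t], [t, t]], last_step=(t == 2))
    return [len(block) for block in writes]
```

With a generous budget all three steps land in a single write; with a budget of 8 values the first two steps are flushed together and the third is written at the end.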
I/O Variability
•  Problem: performance variations on a single storage target
•  Techniques that achieved high performance I/O
   –  Aggregation with write-behind strategy
   –  Stripe alignment: to avoid contention
•  Are these techniques sufficient to get the peak I/O performance?
[Figure: Single Storage Target Performance Variations. Axes: Throughput (MB/sec) vs. Runs.]
[Figure: I/O Time (sec) per run on Hopper and on Titan, comparing Static and I/O Re-routing (TF=0.1). Run labels: "21:20PM to 23:30PM, 2/27/2013, no noise injected" and "21:20PM to 1:20AM, 3/1/2013, no noise injected".]
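Stripe alignment can be illustrated with simple offset arithmetic: each writer's region in a shared file is padded so that the next writer starts on a stripe boundary and no two writers touch the same stripe. The function below is a sketch under that assumption; the function name and the units are hypothetical and the slide does not specify how ADIOS computes its offsets.

```python
def stripe_aligned_extents(sizes, stripe_size):
    """Return one (offset, padded_length) pair per writer so that every
    writer begins on a multiple of stripe_size."""
    extents = []
    offset = 0
    for size in sizes:
        # Round the writer's size up to a whole number of stripes.
        padded = -(-size // stripe_size) * stripe_size
        extents.append((offset, padded))
        offset += padded
    return extents


# Three writers with 3, 5, and 4 units of data on a 4-unit stripe.
EXAMPLE = stripe_aligned_extents([3, 5, 4], stripe_size=4)
```

The cost of this layout is the padding; the benefit is that writers never share a stripe.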
Introduction to Staging
•  Initial development as a research effort to minimize I/O overhead
•  Draws from past work on threaded I/O
•  Exploits network hardware support for fast data transfer to remote memory
Hasan Abbasi, Matthew Wolf, Greg Eisenhauer, Scott Klasky, Karsten Schwan, Fang Zheng: DataStager: scalable data staging services for petascale applications. Cluster Computing 13(3): 277-290 (2010)
Ciprian Docan, Manish Parashar, Scott Klasky: DataSpaces: an interaction and coordination framework for coupled simulation workflows. Cluster Computing 15(2): 163-181 (2012)
Data Management in I/O Pipelines
•  Perform computation in the right location
•  Aggregation and chunking to improve data access
•  Support dynamic placement
•  End-to-End approach to data management
•  Use data reduction techniques
[Figure: Optimized Chunking Model, from Application to Storage. Data chunks reach the decision "Optimized Chunk Size?" with branches "<" (Hierarchical Spatial Aggregation), ">" (Dynamic Subchunking), and "Yes" (Space Filling Curve Reordering). Reordering is shown at chunk level and intra-chunk level, mapping Offset_org to Offset_new.]
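Space filling curve reordering assigns each chunk a one-dimensional position so that chunks close together in space tend to be close together in storage. The slide does not say which curve is used; the sketch below assumes a Z-order (Morton) curve on 2-D chunk coordinates purely for illustration, and the function names are hypothetical.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y: x occupies the even bit positions,
    y the odd ones. The result is the chunk's position on a Z-order curve."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def reorder_chunks(chunk_coords):
    """Sort (x, y) chunk coordinates into Z-order."""
    return sorted(chunk_coords, key=lambda c: morton_index(c[0], c[1]))


# The four chunks of a 2x2 grid, given in arbitrary order.
ORDERED = reorder_chunks([(1, 1), (0, 1), (1, 0), (0, 0)])
```

Reading a rectangular subregion then touches chunks whose storage offsets are clustered, which reduces seeking compared with a row-major layout read along the slow dimension.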
Indexing and Compression
•  Extreme scale data enhancement and reduction
•  Utilize in transit and in situ mechanisms
•  Scientific compression schemes (ISABELA and ISOBAR)
•  In situ indexing to enable fast query and data access
•  Deployed as services in the pipeline
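To show why an index built in situ speeds up later queries, here is a minimal binned bitmap index. It is a simplification of the general bitmap-indexing idea and is not the FastBit API: bitmaps are plain Python integers, nothing is compressed, and the function names and bin edges are assumptions for the example.

```python
def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin [bin_edges[k], bin_edges[k+1]).
    Bit i of a bitmap is set when values[i] falls in that bin."""
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, v in enumerate(values):
        for k in range(len(bitmaps)):
            if bin_edges[k] <= v < bin_edges[k + 1]:
                bitmaps[k] |= 1 << i
                break
    return bitmaps


def query_bins(bitmaps, first_bin, last_bin):
    """Return the row ids whose value lies in bins first_bin..last_bin
    (inclusive), using only bitwise OR on the index."""
    combined = 0
    for k in range(first_bin, last_bin + 1):
        combined |= bitmaps[k]
    return [i for i in range(combined.bit_length()) if (combined >> i) & 1]


VALUES = [0.5, 2.5, 1.2, 3.9, 2.1]
INDEX = build_bitmap_index(VALUES, bin_edges=[0, 1, 2, 3, 4])
ROWS = query_bins(INDEX, first_bin=2, last_bin=3)  # rows with 2 <= value < 4
```

The range query is answered from the bitmaps alone; the raw values are read only for the matching rows.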
Predata: I/O Pipelines
•  Use the staging nodes and create a workflow in the staging nodes.
•  Allows us to explore many research aspects.
•  Improves total simulation time by 2.7%
•  Provides the ability to generate online insights into the 260 GB of data being output from 16,384 compute cores in 40 seconds.
[Figure: staging workflow. Data: Particle array, sorted array, BP file, Index file. Operators: Sort, BP writer, Bitmap Indexing, Histogram, 2D Histogram, Plotter.]
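A workflow of this kind can be pictured as a chain of small operators applied to the particle data while it sits in staging memory. The sketch below is only a toy version of that structure: the particle fields `id` and `energy`, the stage functions, and the bin width are hypothetical and are not taken from the original system.

```python
from collections import Counter


def sort_stage(particles):
    """Sort particles by id, as a stand-in for the Sort operator."""
    return sorted(particles, key=lambda p: p["id"])


def histogram_stage(particles, key, bin_width):
    """Count particles per bin of the given field (Histogram operator)."""
    counts = Counter(int(p[key] // bin_width) for p in particles)
    return dict(sorted(counts.items()))


def run_pipeline(particles):
    """Run the stages in order on one in-memory batch of particles."""
    sorted_particles = sort_stage(particles)
    return {
        "sorted_ids": [p["id"] for p in sorted_particles],
        "energy_hist": histogram_stage(sorted_particles, "energy", 1.0),
    }


RESULT = run_pipeline([
    {"id": 3, "energy": 0.4},
    {"id": 1, "energy": 1.7},
    {"id": 2, "energy": 1.1},
])
```

Because every stage consumes and produces in-memory data, the simulation only pays for handing the batch to the staging nodes, not for the analysis itself.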
In-Memory Data Staging with DataSpaces
Staging-based (ADIOS DATASPACES transport method)
•  Extract data from running simulations into the memory of staging servers
•  Enables more loosely coupled data interactions
•  Reduced resource contention, e.g., on-node memory
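The loosely coupled interaction pattern, where a producer deposits data in staging memory and a consumer picks it up later, can be sketched with a toy in-process store. This is not the DataSpaces API; the `StagingSpace` class, its `put`/`get` methods, and the (name, version) key are assumptions, and threads stand in for separate applications.

```python
import threading


class StagingSpace:
    """Toy in-memory staging area keyed by (variable name, version)."""

    def __init__(self):
        self._store = {}
        self._cond = threading.Condition()

    def put(self, name, version, data):
        """Producer side: deposit data and wake any waiting consumers."""
        with self._cond:
            self._store[(name, version)] = data
            self._cond.notify_all()

    def get(self, name, version, timeout=5.0):
        """Consumer side: block until the requested version is available."""
        key = (name, version)
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._store, timeout):
                raise TimeoutError(f"{name} v{version} not staged in time")
            return self._store[key]


def demo():
    space = StagingSpace()
    results = []
    # The consumer may start before the data exists; it simply waits.
    consumer = threading.Thread(
        target=lambda: results.append(sum(space.get("density", 1)))
    )
    consumer.start()
    space.put("density", 1, [1.0, 2.0, 3.0])
    consumer.join()
    return results
```

Neither side needs to know when the other runs; the staging area is the only point of contact between them.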
In-Memory Data Staging with DIMES
DIMES (ADIOS DIMES transport method)
•  Extract data from running simulations directly into another application’s memory space
•  Enable more tightly coupled data interactions
•  Reduced network data movement (as compared to staging)
Application Coupling
•  Loose coupling of XGC0, M3D-OMP, and ELITE through ADIOS using DataSpaces
•  Tight coupling of XGC1 and XGCa in combination with a monitoring dashboard
[Figure: XGC1 and XGCa each connect through ADIOS to Staging Services (DataSpaces, DIMES). Other labels: BP File, Embedded Schema to add semantics of data, Chunk-based Operation, Plotter/Analysis, Monitoring System.]
Loosely Coupled In-Situ Visualization
•  Visualization and application are de-coupled
•  Uses ADIOS on DataSpaces
•  Viz operations and rendering performed on system nodes
•  XGC SciDAC simulation example
[Figure: HPC Application, ADIOS, Staging (DataSpaces), ADIOS, Vis/Analysis (EAVL).]
ICEE Method Enables WAN Staging
•  Memory-to-memory data delivery (code coupling)
•  Transparent workflow execution
•  Streaming data from experimental sources
•  Transparent abstraction for moving data over LAN or WAN
•  Integration of indexing/querying to minimize long distance data movement
•  Multiple methods for moving data (ICEE, DataSpaces, FlexPath)
[Figure: Data Generation and Data Hub, each with ADIOS, connected by WAN Transportation (FlexPath/EVPath, DataSpaces, ICEE). The Data Stream passes through FastBit Index, Data Transform, and Compression; Analysis components attach at both ends.]
Network Services
•  Managing Contention
   –  We need to avoid resource contention to minimize I/O overhead on application performance
•  Priority based flow control
   –  Partitioning of streams into data and control will improve scalability
•  Multicast for RDMA
   –  A single source can be feeding data to multiple consumers for data staging operations
•  Select/Callback support
   –  Simple mechanism to check availability of data, particularly useful in combination with 1-sided communication
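Partitioning a stream into control and data traffic can be sketched with a priority queue in which control messages always overtake queued data blocks. This is an illustration of the idea only; the `PriorityChannel` class and the two priority classes are hypothetical, and a real network service would enforce this in the transport layer rather than in a Python heap.

```python
import heapq
import itertools

CONTROL, DATA = 0, 1  # lower value is served first


class PriorityChannel:
    """Toy channel that delivers control messages before data messages,
    keeping FIFO order within each class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves send order

    def send(self, kind, payload):
        heapq.heappush(self._heap, (kind, next(self._seq), payload))

    def receive(self):
        kind, _, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self):
        return len(self._heap)


def demo():
    ch = PriorityChannel()
    ch.send(DATA, "block0")
    ch.send(DATA, "block1")
    ch.send(CONTROL, "pause")  # sent last, delivered first
    return [ch.receive() for _ in range(len(ch))]
```

The small control message is not stuck behind bulk transfers, which is the property the slide asks of the network layer.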
System Services
•  Fault Tolerance
   –  I/O services can add their own resiliency but need appropriate notification from the network layer to initiate recovery
•  Scatter-Gather
   –  More consistent support for scatter-gather (iovecs) across platforms
•  RDMA to NVRAM
   –  Particularly important for Summit
•  Rendezvous and discovery
   –  Identify I/O services, initiate and manage connections
   –  Support for WANs and LANs in a similar API
•  Feedback/progress APIs for RDMA operations
   –  Services that augment the data stream to improve performance need to keep track of performance
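For readers unfamiliar with scatter-gather, the POSIX vectored I/O calls show the idea on an ordinary file descriptor: several non-contiguous buffers are sent in one call and received into several separate buffers. The sketch uses Python's `os.writev` and `os.readv` over a pipe and therefore runs only on Unix-like systems; it is an analogy for iovec-style transfers, not an RDMA example, and the buffer contents are made up.

```python
import os


def gather_write_scatter_read():
    """Send three separate buffers with one writev call and receive them
    into three separate buffers with one readv call."""
    read_fd, write_fd = os.pipe()
    try:
        header, body, trailer = b"HDR:", b"payload-bytes", b":END"
        # Gather: one system call, no intermediate copy into a single buffer.
        written = os.writev(write_fd, [header, body, trailer])
        # Scatter: the kernel fills the destination buffers in order.
        bufs = [bytearray(len(header)), bytearray(len(body)),
                bytearray(len(trailer))]
        received = os.readv(read_fd, bufs)
        return written, received, [bytes(b) for b in bufs]
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

Consistent support for this pattern across platforms lets an I/O service ship metadata and payload together without first packing them into one contiguous region.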