Data Staging and Asynchronous I/O in ADIOS
Hasan Abbasi (ORNL), Jong Choi (ORNL), Greg Eisenhauer (Georgia Tech), Scott Klasky (ORNL), Manish Parashar (Rutgers), Norbert Podhorszki (ORNL), Nagiza Samatova (NCSU), Karsten Schwan (Georgia Tech), Matthew Wolf (Georgia Tech)

ORNL is managed by UT-Battelle for the US Department of Energy

Outline
• ADIOS Overview
• Introduction to Staging
• Data Management in I/O Pipelines
• Staging in ADIOS
• Network and System service discussion

ADIOS
http://www.nccs.gov/user-support/center-projects/adios/
• Abstracts Data-at-Rest to Data-in-Motion for HPC
  – Provides portable, fast, scalable, easy-to-use, metadata rich output
  – Dynamically allows users to change the method during an experiment/simulation
• Provides solutions for “90% of the applications”
• ADIOS has been cited almost 1,000 times

[Figure: ADIOS architecture. Layers, top to bottom: interface to apps for description of data (ADIOS, etc.); Data Management Services (feedback, buffering, schedule; multi-resolution methods, data compression methods, data indexing (FastBit) methods); plugins to the hybrid staging area (provenance, workflow engine, runtime engine, analysis plugins, visualization plugins); output as Adios-bp, IDX, HDF5, pnetcdf, “raw” data, image data; parallel and distributed file system. Data movement to a viz. client.]

• Astrophysics
• Climate
• Combustion
• CFD
• Environmental Science
• Fusion
• Geoscience
• Materials Science
• Medical: Pathology
• Neutron Science
• Nuclear Science
• Quantum Turbulence
• Relativity
• Seismology
• Sub-surface modeling
• Weather

Improving I/O Methods for High End Simulations
• Reduce I/O overhead, reduce network data movement, improve writing and reading performance
• To achieve this goal, ADIOS provides many methods
  – POSIX (1 file per process, independent set of files)
  – POSIX (1 file per process + metadata; read as one dataset)
  – MPI-Lustre (MPI-IO writing to 1 global file)
  – Aggregate (1 file per OST) + 1 metadata file
  – BG (1 file per rack) + 1 metadata file
  – ….
• There’s no single right answer for all users.
  – ADIOS gives the user flexibility without rewriting code.

Large Writes Per Many Cores
• First effort shows performance goes from 50 GB/s to over 100 GB/s
• New feature for IBM BG/Q: the serial metadata creation step in ADIOS is now optional
  – Metadata creation is serial due to problems with thread-safe MPI on most systems
• Testing has begun to use staging to write data
  – Problem is the size of the staging area
    • Requires over 10K cores for staging…
    • GPU on staging is useless if we do NOT do other processing

Small Writes Per Many Cores (Combustion)
• Requires high performance I/O due to large output (200 GB/10 minutes)
• Frequent reading of large datasets on a small number of processors for analytics
• Individual process output is small, leading to low utilization of network bandwidth with other I/O solutions
• Reading large datasets with a different access pattern than the one they were written with leads to
  – frequent seeking for data
  – very low read bandwidth
• Analysis codes spend 90% of their time reading data
• Allowed ADIOS team to focus on small but frequent output data

Spatial Temporal Aggregation
• Temporal aggregation opens up another horizon to further consolidate data
• Data of multiple time steps are merged at each process
• Data is written out only at the last time step or when the buffered data reaches the boundary of memory capacity
• Achieved up to 70x speedup for read performance, and 11x speedup for write performance in the mission critical climate simulation GEOS-5 (NASA), on Jaguar

GEOS-5 Results
• Common read patterns for GEOS-5 users are reduced from 10 seconds to 0.1 seconds
• Allows interactive data exploration for mission critical visualizations
• A 2-D variable at 3 time steps is merged into a 3-D variable with time as the new dimension

[Figure: Data layout, original (time steps t1, t2, t3 stored separately) vs. new (T1,2,3 merged)]
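The temporal aggregation described on the Spatial Temporal Aggregation slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ADIOS code: the class `TemporalAggregator`, its parameters (`last_step`, `max_buffered_bytes`, `bytes_per_value`) and the helper `make_slice` are hypothetical names, the `written` list stands in for the output file, and the memory check (flush once the buffered size reaches the limit) is a simplification of "reaches the boundary of memory capacity".

```python
class TemporalAggregator:
    """Buffers one 2-D slice per time step and writes them out as a 3-D block.

    Hypothetical sketch: 'written' stands in for the file on storage.
    """

    def __init__(self, last_step, max_buffered_bytes, bytes_per_value=8):
        self.last_step = last_step
        self.max_buffered_bytes = max_buffered_bytes
        self.bytes_per_value = bytes_per_value
        self.buffer = []          # 2-D slices collected so far
        self.buffered_bytes = 0
        self.written = []         # each entry is a 3-D block [time][row][col]

    def put(self, step, slice2d):
        """Add the 2-D variable for one time step; flush only when required."""
        copy = [list(row) for row in slice2d]
        self.buffer.append(copy)
        self.buffered_bytes += sum(len(row) for row in copy) * self.bytes_per_value
        # Write only at the last time step or when the memory budget is reached.
        if step == self.last_step or self.buffered_bytes >= self.max_buffered_bytes:
            self.flush()

    def flush(self):
        """Emit the buffered slices as one 3-D variable with time as first axis."""
        if self.buffer:
            self.written.append(self.buffer)
            self.buffer = []
            self.buffered_bytes = 0


def make_slice(step, rows=2, cols=2):
    """Build a small 2-D test slice whose values equal the time step."""
    return [[float(step)] * cols for _ in range(rows)]


agg = TemporalAggregator(last_step=3, max_buffered_bytes=1_000_000)
for step in (1, 2, 3):
    agg.put(step, make_slice(step))

block = agg.written[0]
# One write, holding 3 time steps of a 2 x 2 variable.
print(len(agg.written), len(block), len(block[0]), len(block[0][0]))  # prints: 1 3 2 2
```

With a small memory budget the same class writes more often, for example `TemporalAggregator(last_step=5, max_buffered_bytes=64)` flushes after every second 2 x 2 slice and once more at the last step. Reading one point across time then touches a single contiguous block instead of one block per time step, which is the effect the GEOS-5 read results rely on.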
[Figure: Temperature tendency from moist physics (latitude vs. date, 10/29/79 to 10/31/79); read performance of a 2D slice of a 3D variable + time, original vs. new layout]

I/O Variability
• Problem
• Techniques that achieved high performance I/O
  – Aggregation with write-behind strategy
  – Stripe alignment: to avoid contention
• Are these techniques sufficient to get the peak I/O performance?

[Figure: Single storage target performance variations; throughput (MB/sec) vs. runs]
[Figure: I/O time (sec) per run on Hopper and on Titan, static I/O vs. re-routing (TF=0.1); runs from 21:20PM to 23:30PM, 2/27/2013 and from 21:20PM to 1:20AM, 3/1/2013, no noise injected]

Introduction to Staging
• Initial development as a research effort to minimize I/O overhead
• Draws from past work on threaded I/O
• Exploits network hardware support for fast data transfer to remote memory

Hasan Abbasi, Matthew Wolf, Greg Eisenhauer, Scott Klasky, Karsten Schwan, Fang Zheng: DataStager: scalable data staging services for petascale applications. Cluster Computing 13(3): 277-290 (2010)
Ciprian Docan, Manish Parashar, Scott Klasky: DataSpaces: an interaction and coordination framework for coupled simulation workflows. Cluster Computing 15(2): 163-181 (2012)

Data Management in I/O Pipelines
• Perform computation in the right location
• Aggregation and chunking to improve data access
• Support dynamic placement
• End-to-end approach to data management
• Use data reduction techniques

[Figure: data chunks move from the application to storage. Labels: hierarchical spatial aggregation (chunk level), intra-chunk level, Optimized Chunking Model, decision "Optimized chunk size?" (yes), space filling curve reordering, dynamic subchunking, Offset_org, Offset_new]

Indexing and Compression
• Extreme scale data enhancement and reduction
• Utilize in transit and in situ mechanisms
• Scientific compression schemes (ISABELA and ISOBAR)
• In situ indexing to enable fast query and data access
• Deployed as services in the pipeline

Predata: I/O Pipelines
• Use the staging nodes and create a workflow in the staging nodes.
• Allows us to explore many research aspects.
• Improve total simulation time by 2.7%
• Allow the ability to generate online insights into the 260 GB of data being output from 16,384 compute cores in 40 seconds.

[Figure: Predata pipeline. Labels: particle array, sort, sorted array, BP writer, BP file, bitmap indexing, index file, histogram, 2D histogram, plotter]

In-Memory Data Staging with DataSpaces
Staging-based (ADIOS DATASPACES transport method)
• Extract data from running simulations into the memory of staging servers
• Enables more loosely coupled data interactions
• Reduced resource contention, e.g., on-node memory

In-Memory Data Staging with DIMES
DIMES (ADIOS DIMES transport method)
• Extract data from running simulations directly into another application’s memory space
• Enable more tightly coupled data interactions
• Reduced network data movement (as compared to staging)

Application Coupling
• Loose coupling of XGC0 and M3D-OMP and ELITE through ADIOS using DataSpaces
• Tight coupling of XGC1 and XGCa in combination with a monitoring dashboard

[Figure: coupling of XGC1 and XGCa. Labels: BP file, ADIOS, embedded schema to add semantics of data, plotter/analysis, staging services (DataSpaces, DIMES), chunk-based operation, monitoring system]

Loosely Coupled In-situ Visualization
• Visualization and application are de-coupled
• Uses ADIOS on DataSpaces
• Viz operations and rendering performed on system nodes
• XGC SciDAC simulation example

[Figure: HPC application, ADIOS, Vis/Analysis (EAVL), ADIOS staging (DataSpaces)]

ICEE Method Enables WAN Staging
• Streaming data from experimental sources
• Transparent abstraction for moving data over LAN or WAN
• Integration of indexing/querying to minimize long distance data movement
• Multiple methods for moving data (ICEE, DataSpaces, FlexPath)

[Figure: memory-to-memory data delivery (code coupling) and transparent workflow execution. Labels: ADIOS data generation, data stream, FastBit index, data transform, compression, WAN transportation (FlexPath/EVPath, DataSpaces, ICEE), ADIOS data hub, analysis]

Network Services
• Managing Contention
  – We need to avoid resource contention to minimize I/O overhead on application performance
• Priority based flow control
  – Partitioning of streams into data and control will improve scalability
• Multicast for RDMA
  – A single source can be feeding data to multiple consumers for data staging operations
• Select/Callback support
  – Simple mechanism to check availability of data, particularly useful in combination with 1-sided communication

System Services
• Fault Tolerance
  – I/O services can add their own resiliency but need appropriate notification from the network layer to initiate recovery
• Scatter-Gather
  – More consistent support for scatter-gather (iovecs) across platforms
• RDMA to NVRAM
  – Particularly important for Summit
• Rendezvous and discovery
  – Identify I/O services, initiate and manage connections
  – Support for WANs and LANs in a similar API
• Feedback/progress APIs for RDMA operations
  – Services that augment the data stream to improve performance need to keep track of performance
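
The scatter-gather (iovecs) item under System Services can be illustrated with ordinary file I/O. The sketch below is an assumption-laden example, not something taken from the slides: it uses the POSIX `writev`/`readv` calls as exposed by Python's `os` module (Unix only) on a temporary file, whereas the slide is asking for comparable support in network and RDMA layers. The function names `gather_write` and `scatter_read` and the sample buffers are hypothetical.

```python
import os
import tempfile


def gather_write(fd, buffers):
    """Write several separate buffers with one writev call (gather).

    os.writev may write fewer bytes than requested; a production writer
    would loop until everything is written. Here the caller checks the count.
    """
    return os.writev(fd, buffers)


def scatter_read(fd, sizes):
    """Read one contiguous byte range into several buffers (scatter)."""
    buffers = [bytearray(n) for n in sizes]
    nread = os.readv(fd, buffers)
    return nread, buffers


# Three pieces that live in different places in memory.
header = b"step=7;"
payload = b"\x01\x02\x03\x04"
footer = b";end"

fd, path = tempfile.mkstemp()
try:
    # One system call sends all three pieces without copying them together.
    written = gather_write(fd, [header, payload, footer])
    os.lseek(fd, 0, os.SEEK_SET)
    # One system call splits the data back into three separate buffers.
    nread, parts = scatter_read(fd, [len(header), len(payload), len(footer)])
finally:
    os.close(fd)
    os.remove(path)

print(written, nread, bytes(parts[0]))  # prints: 15 15 b'step=7;'
```

The point of the pattern is that a header, a payload and a trailer held in separate memory regions travel in one operation, which is why an I/O service benefits when every platform offers it in the same form.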