Transcription

A Distributed Memory Out-of-Core Method on HPC Clusters
and Its Application to Quantum Chemistry Applications

Karl W. Schulz
Director, Scientific Applications: Texas Advanced Computing Center and
Chief Software Architect: PECOS, Institute for Computational Engineering and Sciences
The University of Texas at Austin

XSEDE 12 Conference
Chicago, IL, July 18, 2012

PECOS
Predictive Engineering and Computational Sciences
Acknowledgements
• Co-Author: Chris Simmons, ICES/PECOS, UT Austin
  – Quantum chemistry guru
  – Famed Lustre file system killer
• XSEDE Resource Providers:
  – TACC (Ranger, Lonestar4, Longhorn)
  – SDSC (Gordon)
• Sponsors:
  – DoE: ICES/PECOS PSAAP Center (National Nuclear Security Administration; Award DE-FC52-08NA28615)
  – National Science Foundation (NSF)
Outline
• Background
  – CFOUR quantum chemistry application
  – Out-of-core usage
  – Scientific problem/motivation
• GRVY Library
  – New Ocore design
  – Benchmarking results
• Application Usage and Results
CFOUR – Quantum Chemistry Application
• Includes a variety of high-level ab initio (quantum) methods for the calculation of atomic and molecular properties, including:
  – Møller-Plesset (MP) perturbation theory
  – Coupled-cluster (CC)
• Research code under active development by groups at UT Austin and Universitat Mainz, Germany:
  – 1.4 million lines of Fortran (77) (and now some f90!)
  – Primary mode of parallelism is for BLAS (typically uses a threaded BLAS)
• Depending on the molecule under investigation and level of theory chosen, CFOUR can have large on-node memory requirements:
  – leads to an out-of-core solve if enough RAM is not available locally
  – records are dumped to one or more files (using Fortran direct I/O semantics)
Scientific Background/Motivation
• Chris's dissertation was focused on developing methods to determine energy levels in systems with strongly coupled electronic states:
  – focused on the NO3 radical (a notoriously hard problem in quantum mechanics)
  – goal was to compute highly accurate quasidiabatic Hamiltonians using only ab initio information
• In order to parameterize the vibronic Hamiltonian:
  – a large number of fixed-nuclei adiabatic calculations were performed
  – group theory and (pseudo) Jahn-Teller theory were used to transform from adiabatic to quasidiabatic
  – the vibronic Hamiltonian can be parameterized using a 3D grid of points along the 3 unique modes that govern the coupling (hint hint, may lead to lots of jobs)
• That's great – but what does this mean to the non quantum-chemist?
  – grid refinement studies determined a minimum grid size of 9x9x9 would achieve the desired accuracy
  – has to be run for both the X²A₂′ state and the doubly-degenerate B²E′ states
  ➤ Grand total: 2,187 individual CFOUR runs required (with 121 basis functions)
Where to Run 2,187 CFOUR Runs?
• The good news:
  – each of these runs is independent
  – hence, each run can be run separately in parallel (normally this is great for scaling)
• The so-so news:
  – on-node memory requirements are not trivial for this level of theory: 57 GB peak
  – no problem though: can leverage CFOUR's out-of-core capability to offload the storage of the derivative molecular orbital amplitude information (the largest resource consumer)
• Considered several resources:
  – a local Chemistry research cluster, hbar
  – various supercomputers at TACC
Chemistry Department Cluster (hbar)

[Photos from C. Simmons, "The Mother of all Hamiltonians" (Computational Details: Clandestine Clusters – hbar.cm.utexas.edu), captioned "Large memory and 16 disks per node ... ALWAYS BUSY"]

• Specifically designed for the requirements of quantum chemistry codes:
  – large memory
  – large local storage (16 disks per node)!
  – elegant HVAC solution
• Normal workhorse for individual CFOUR runs
• Problem: it's not big enough to complete the 2,187 runs in a reasonable amount of time
  ➤ Need to leverage larger HPC resources
TACC HPC Resources
• Longhorn: 256 nodes / 512 GPUs, 48 GB RAM, 30 GB local storage
• Lonestar 4: ~2000 nodes, 24 GB RAM, 64 GB local storage
• Ranger: ~4000 nodes, 32 GB RAM, no local storage

Ranger Runs
• Ranger chosen initially since it was the largest (with the highest-performing parallel file system, ~30 GB/sec aggregate for $SCRATCH)
  – measured run time for an individual job was ~48 hours (1 point of the 2,187 to be carried out)
  – remember, each point can be run in parallel, so the next step was to submit a bunch of jobs and relax as the system knocks out 2,187 points
• And now, the bad news: several jobs at a time ran great, but once a larger number were running simultaneously, the constant I/O load placed on Lustre by the out-of-core portion of the solve took its toll (and affected interactivity system-wide)
  – tens of thousands of outstanding Lustre requests observed when Chris attempted to run many points simultaneously; an unsustainable solution for long periods of time
  – we had to ask Chris to throttle his jobs to analyze 4 points at a time

  JOB      WR-MB  RD-MB  REQS   OWNER  WORKDIR
  1437436  16390  0      22335  csim   ...ANO1/grid/729/xap
  1437435  14525  0      20471  csim   ...ANO1/grid/729/xap

• Unfortunately, this limiter would effectively put the kibosh on Chris's graduation
  – 4 points in 2 days -> ~3 years to solution with no queue wait times
  – this experience motivated the idea of creating a new out-of-core utility for use by CFOUR...
Ocore Motivation and Desires
• Based on this initial Ranger experience, the thought occurred to support the out-of-core solve via a distributed memory ramdisk:
  – we could throw a few more resources (compute nodes) at the problem and remove the constant I/O stress
  – leverage the high-speed interconnect (which will support transfers faster than any local disk)
  – but, it needs to be easy to integrate into CFOUR, which reads/writes fixed-size records
• Resulting high-level goals for the Ocore design:
  – provide a flexible API for offloading fixed-length records to remote storage pools
  – leverage existing HPC data transfer mechanisms (MPI) for distributed memory processing
  – remove excessive I/O burden on file systems (e.g. Lustre at TACC) from out-of-core methods
  – improve application runtime performance by replacing disk-based I/O with data transfers across high-speed interconnects, e.g. InfiniBand
  – enable quantum chemistry analysis, which has traditionally run on small clusters of workstations (configured with large amounts of local scratch disk to accommodate out-of-core solves), to efficiently leverage larger HPC clusters for parametric studies
• This MPI-based, distributed Ocore approach was added to an existing utility library under development at PECOS (libGRVY)
libGRVY
• What is it?: A simple toolkit library used to house various support functions often required for scientific application development:
  – written in C++; APIs provided for C and F90
  – a flexible method for parsing ascii input files (with backwards-compatibility support)
  – a hierarchical timing mechanism to provide basic performance statistics
  – an HDF5-based historical performance mechanism for logging application performance over time on various compute resources
  – a simple priority-based logging mechanism to control application messages
  – miscellaneous file handling and math utilities
  – NEW: a suite of distributed-memory utilities for offloading out-of-core read/write operations to a pool of distributed shared-memory nodes using MPI for data transfer
    • data transfer accomplished via standard MPI semantics
    • data structures and record bookkeeping done using C++ STL (vectors, maps, queues, stacks)
    • master/slave polling method between the application thread and each remote memory pool task thread
• Open Source: Download @ https://red.ices.utexas.edu/projects/software/wiki/GRVY
The GRVY Ocore Basic Design
MPI-Based Out-of-Core Method for Distributed Memory Architectures

[Diagram: the application work loop on the server hands records to GRVY Ocore; records are distributed via round robin and transferred via MPI to memory pools hosted by additional cores or servers; disk-based I/O behind each memory pool provides storage used for overflow]

• Early design addition:
  – add monitoring of record read frequency
  – allow less frequently accessed records to be dumped from the Ocore memory pool to traditional disk
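The design above mentions round-robin distribution of records to remote memory pools, with bookkeeping kept in C++ STL containers. The sketch below illustrates that bookkeeping idea only; class and member names are invented and it is not libGRVY's actual internals.

  // Hypothetical illustration (not libGRVY's implementation): round-robin
  // assignment of fixed-size records to remote memory-pool tasks, with
  // STL-based bookkeeping as described on the design slide.
  #include <cstddef>
  #include <map>

  struct RecordLocation {
    int    pool_rank;   // MPI rank of the Ocore memory-pool task holding the record
    size_t slot;        // slot index within that task's storage pool
  };

  class RecordDirectory {
  public:
    explicit RecordDirectory(int num_pools) : num_pools_(num_pools) {}

    // Return an existing location, or assign a new one round-robin on first write.
    RecordLocation locate(size_t record_id) {
      auto it = directory_.find(record_id);
      if (it != directory_.end())
        return it->second;                        // record written before -> reuse its slot

      RecordLocation loc;
      loc.pool_rank = next_pool_;
      loc.slot      = slots_used_[next_pool_]++;  // next free slot on that pool task
      next_pool_    = (next_pool_ + 1) % num_pools_;
      directory_[record_id] = loc;
      return loc;
    }

  private:
    int num_pools_;
    int next_pool_ = 0;
    std::map<size_t, RecordLocation> directory_;  // sparse record bookkeeping (STL map)
    std::map<int, size_t> slots_used_;            // per-pool slot counters
  };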
GRVY Ocore Basic API

C++:

  #include <grvy.h>
  using namespace GRVY;

  int  GRVY_MPI_Ocore_Class::Initialize (std::string inputfile)
  void GRVY_MPI_Ocore_Class::Finalize   ()
  int  GRVY_MPI_Ocore_Class::Write      (size_t record_id, T *data)
  int  GRVY_MPI_Ocore_Class::Read       (size_t record_id, T *data)

C:

  #include <grvy.h>

  int  grvy_ocore_init         (const char *input_file)
  void grvy_ocore_finalize     ()
  int  grvy_ocore_write_double (size_t record_id, double *data)
  int  grvy_ocore_read_double  (size_t record_id, double *data)

F90:

  use grvy

  subroutine grvy_ocore_init        (input_file, ierr)
  subroutine grvy_ocore_finalize    ()
  function   grvy_ocore_write_real8 (record_id, data)
  function   grvy_ocore_read_real8  (record_id, data)
Intrinsic data type support includes:
• doubles
• reals
• 4-byte ints
• 8-byte ints
Additional semantics exist to retrieve records (for check-pointing purposes)

GRVY Runtime Settings
  # input file for mpi_ocore

  [grvy/mpi_ocore]

  enable_ocore         = 1      # use MPI out-of-core (1=yes,0=no)
  max_pool_size_in_mbs = 8192   # max storage pool size on each child processor [MBs]
  max_map_size_in_mbs  = 10     # max map size for sparse record access [MBs]
  blocksize            = 8192   # number of array elements in each ocore read/write

  use_disk_overflow    = 1      # enable disk-based overflow
  watermark_ratio      = 0.2    # % of records to dump to disk when memory cache is full
  allow_empty_records  = 0      # allow empty records to be returned if not written previously

  dump_raw_statistics  = true   # dump raw read/write statistics for each record?
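Putting the C API and the input file together, a driver could look roughly like the sketch below. This is a minimal illustration only: the input file name is arbitrary, return-code conventions are assumed, and the way grvy_ocore_init() splits the MPI tasks into one application task plus Ocore pool tasks is not detailed on these slides.

  // Minimal sketch of driving the GRVY Ocore C API from C++ (illustrative only).
  // Assumes the job is launched with >= 2 MPI tasks and that only the application
  // task returns from grvy_ocore_init() to run this work loop, with the remaining
  // tasks servicing the memory pools inside the library (an assumption).
  #include <algorithm>
  #include <cstdio>
  #include <vector>
  #include <grvy.h>

  int main()
  {
    const size_t blocksize = 8192;             // matches "blocksize" in the input file
    std::vector<double> record(blocksize);

    if (grvy_ocore_init("ocore.inp") != 0)     // hypothetical input file name (assuming 0 = success)
      return 1;

    // Offload a handful of fixed-size records, addressed by record_id
    for (size_t id = 0; id < 10; id++) {
      std::fill(record.begin(), record.end(), static_cast<double>(id));
      grvy_ocore_write_double(id, record.data());
    }

    // Pull one record back from the distributed memory pool
    grvy_ocore_read_double(5, record.data());
    std::printf("record 5, first entry = %g\n", record[0]);

    grvy_ocore_finalize();
    return 0;
  }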
Ocore Micro-Benchmarking
Level Set Expectations – what is reasonable?
• Before we begin benchmarking, we should characterize point-to-point MPI performance to put an upper bound on expected Ocore performance
• Remember, bandwidth is a function of message size
• The CFOUR record size is 8192 doubles

[Plot: MPI bandwidth (MB/sec) vs. message size (1 B to 32 MB), measured with MVAPICH2 on TACC Longhorn over QDR IB; peak ~3 GB/sec]
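The curve above is a standard point-to-point bandwidth sweep. The slide does not name the tool used to produce it; the sketch below is one way to generate an equivalent curve and is included purely for illustration.

  // Illustrative point-to-point bandwidth sweep (the actual measurement tool used
  // for the slide's numbers is not named there). Run with exactly 2 MPI ranks:
  //   mpirun -np 2 ./bw
  #include <mpi.h>
  #include <cstdio>
  #include <vector>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100;
    for (size_t bytes = 1; bytes <= (32UL << 20); bytes *= 32) {   // 1 B ... 32 MB, like the x-axis
      std::vector<char> buf(bytes, 0);
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < iters; i++) {
        if (rank == 0)
          MPI_Send(buf.data(), (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
          MPI_Recv(buf.data(), (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      // short handshake so rank 0 times delivered data rather than queued sends
      char ack = 0;
      if (rank == 0)
        MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      else if (rank == 1)
        MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
      double elapsed = MPI_Wtime() - t0;

      if (rank == 0)
        std::printf("%10zu bytes : %8.1f MB/sec\n", bytes,
                    (double)bytes * iters / elapsed / 1.0e6);
    }

    MPI_Finalize();
    return 0;
  }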
Micro-benchmark Test #1a
• First test mimics the characteristics of an external application reading and writing fixed records in random order as part of an out-of-core solve procedure
  – a random integer sequence was used to define record indices that were sequentially written to one or more Ocore threads (63.2% of the records were re-written)
  – requires a minimum of 2 MPI tasks (1 application and 1 Ocore storage pool)
  – three record blocksizes considered (4K, 8K, 16K)
  – tests run on Longhorn in normal production operation
• Look at strong scaling where 8 GB was offloaded from the application task (1 MPI task per storage pool node)
  – considered 1, 2, 4, and 8 remote nodes
  – efficiencies between 83% and 99% observed (compared to raw MPI)

[Chart: Ocore write speed (MB/sec) for 4K, 8K, and 16K blocksizes on 1, 2, 4, and 8 nodes, compared against the max MPI transfer rate]

Special environment sauce:
MV2_IBA_EAGER_THRESHOLD=4194304
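The test description above boils down to a simple driver pattern: draw random record indices and push fixed-size blocks through the Ocore write path while timing the aggregate rate. The sketch below is a hypothetical approximation of such a driver (record count, index distribution, input file name, and timing details are all assumptions), not the benchmark code actually used.

  // Hypothetical approximation of the Test #1a driver: write fixed-size records
  // in random order through the Ocore C API and report an aggregate write rate.
  #include <chrono>
  #include <cstdio>
  #include <random>
  #include <vector>
  #include <grvy.h>

  int main()
  {
    const size_t blocksize  = 8192;                    // doubles per record (8K blocksize case)
    const size_t num_writes = 128 * 1024;              // 128K writes x 64 KB = 8 GB offloaded
    std::vector<double> record(blocksize, 1.0);

    if (grvy_ocore_init("ocore.inp") != 0)             // hypothetical input file
      return 1;

    std::mt19937_64 rng(2012);
    std::uniform_int_distribution<size_t> pick(0, num_writes - 1);  // repeated indices give re-writes

    auto t0 = std::chrono::steady_clock::now();
    for (size_t n = 0; n < num_writes; n++)
      grvy_ocore_write_double(pick(rng), record.data());
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - t0;

    const double mbytes = (double)num_writes * blocksize * sizeof(double) / 1.0e6;
    std::printf("Ocore write speed: %.1f MB/sec\n", mbytes / elapsed.count());

    grvy_ocore_finalize();
    return 0;
  }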
Micro-benchmark Test #1b
• Previous test used 1 MPI task per node (i.e. one storage pool per node)
• But, GRVY Ocore supports an arbitrary number of offload tasks; let's repeat the test using multiple storage tasks per node:
  – run on Longhorn (8-way nodes, Nehalem)
  – 32 GB offloaded
  – blocksize = 8192
  – CPU and memory affinity settings used
• Results: 8 Ocore tasks per node outperformed 1 Ocore task per node by 21%
  – recommend using 1 Ocore task per core

[Chart: Ocore write speed (MB/sec) vs. # of Ocore MPI tasks per node (1, 2, 4, 8)]
Micro-benchmark Test #2a – compare to traditional RAMDISK (on-node)
• 1st micro-benchmark tests look reasonable (achieves most of the raw MPI performance)
• But, how does this approach compare to traditional file-system caching on node?
• Test #2 considers a larger 232 GB randomly-ordered data set which is dumped using:
  1. Fortran direct I/O semantics (a la CFOUR) writing to a memory RAMDISK
  2. GRVY Ocore using MPI
• Test environment:
  – ran on one of Lonestar's 1 TB big-memory nodes
  – 8K blocksize
  – MVAPICH2 with LiMIC (for faster SMP transfers)
  – measured wall-clock time to complete writing and subsequent re-read of 232 GB worth of 8K records

[Chart: wallclock time (secs), Filesystem/RAMDISK vs. Ocore/MPI]
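For readers unfamiliar with the term, "Fortran direct I/O semantics" here means fixed-length records addressed by record number. The sketch below shows a rough C++ analogue of that access pattern against a RAMDISK-backed file; it is illustrative only (the file path and sizes are assumptions) and is not the benchmark code.

  // Rough illustration of the fixed-length-record access pattern that Fortran
  // direct I/O provides (the pattern CFOUR uses for its out-of-core files).
  #include <cstdio>
  #include <vector>

  const size_t kBlocksize   = 8192;                       // doubles per record (8K records)
  const size_t kRecordBytes = kBlocksize * sizeof(double);

  // Write record 'record_id' at its fixed offset (like WRITE(unit, REC=record_id))
  void write_record(std::FILE *fp, size_t record_id, const double *data)
  {
    std::fseek(fp, (long)(record_id * kRecordBytes), SEEK_SET);
    std::fwrite(data, sizeof(double), kBlocksize, fp);
  }

  // Read record 'record_id' back from its fixed offset (like READ(unit, REC=record_id))
  void read_record(std::FILE *fp, size_t record_id, double *data)
  {
    std::fseek(fp, (long)(record_id * kRecordBytes), SEEK_SET);
    size_t n = std::fread(data, sizeof(double), kBlocksize, fp);
    (void)n;  // a real reader would check this
  }

  int main()
  {
    // hypothetical RAMDISK-backed scratch path
    std::FILE *fp = std::fopen("/dev/shm/ocore_scratch.dat", "w+b");
    if (!fp) return 1;

    std::vector<double> buf(kBlocksize, 1.0);
    write_record(fp, 42, buf.data());
    read_record(fp, 42, buf.data());

    std::fclose(fp);
    return 0;
  }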
Micro-benchmark Test #2b – compare to commercial, virtualized RAMDISK
• How does this approach compare to node aggregation techniques provided via commercial software, e.g. vSMP?
  – aggregates multiple compute nodes to have a single system image (with large memory)
  – provides a virtualized RAMDISK across the memory of all participating nodes
  – all data transfer done behind the scenes; no user data-management is required
• Test #2b repeats the previous test (232 GB) on SDSC's Gordon system:
  – Gordon aggregates 16 nodes (each with 64 GB RAM) to provide a single system image with 1 TB (vSMP)
  – we request ¼ of this super node (4 compute nodes) to have sufficient memory to write/read 232 GB
  – again, compare results using Fortran direct I/O semantics (to the vSMP ramdisk) against Ocore using MPI

[Chart: wallclock time (secs), vSMP RAMDISK vs. Ocore/MPI, Gordon – 4 compute nodes]

Special environment for MPI:
MV2_IBA_EAGER_THRESHOLD=4194304
Micro-benchmark Test #2c – examine weak scaling of GRVY Ocore
• Q: is there any substantial overhead incurred aggregating large numbers of nodes to increase the effective Ocore storage pools?
• To examine, we performed a quick weak-scaling test on Lonestar:
  – QDR InfiniBand
  – 12-way Westmere nodes
  – 2 GB/core (24 GB total per node)
  – we wrote/read 20 GB per node for the tests and look at weak scaling efficiency
  ➤ Efficiencies > 90% for this benchmark

[Chart: scaled efficiency (0% to 120%) vs. Ocore pool size, from 1 Node/20 GB through 2, 4, 8, 16, and 32 nodes up to 62 Nodes/1240 GB]
Application to CFOUR
• Using the new Ocore method in CFOUR, analysis was migrated to Longhorn:
  – 2 nodes used per run (3 MPI tasks total)
    • 1 MPI task/4 threads allocated for the main CFOUR application
    • 1 MPI Ocore task (32 GB pool) on the same node
    • 1 MPI Ocore task (32 GB pool) on a remote node
  – arbitrary number of jobs could be run simultaneously (only limit was queuing throughput)
  – total calculation wall time reduced from years to just under 30 days
  – on average, each CFOUR job wrote 385 GB of data as part of the out-of-core solve
  ➤ 833 TB of temporary data was *not* written to the file system
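For reference, the 32 GB pools mentioned above correspond to the max_pool_size_in_mbs setting from the runtime input file shown earlier; a configuration consistent with that description (illustrative only, not the actual production input) might contain:

  [grvy/mpi_ocore]
  enable_ocore         = 1
  max_pool_size_in_mbs = 32768   # 32 GB storage pool on each Ocore task
  blocksize            = 8192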
Application to CFOUR – Record Access
• We realized that it would be useful to see what the out-of-core access pattern looked like for CFOUR (one of the initial enhancements)
• Plot shows how frequently each record was read (note that this is shown on a log scale)

[Plot: # of record reads vs. Ocore record index (0 to ~700000), log scale on the read-count axis]

• Very non-uniform record access
Application to CFOUR – Record Access
• Seeing this non-uniformity led to the addition of another level of offload
• When the storage pool is full, offload less-frequently read records to disk
  – attempts to minimize the impact of slow I/O to disk
• Tested this approach with CFOUR on Longhorn (~11 GB oversubscribed)
  ➤ 1-node job with an on-node Ocore pool and disk-based overflow was only 7% slower than a 2-node job with two Ocore pools and no disk-based transactions

[Diagram: a storage pool filling over time from empty; once the pool is full, the desired percentage of records (the watermark) is dumped to disk]
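The read-frequency monitoring plus watermark overflow described above amounts to evicting the coldest records once the pool fills. The sketch below is a hypothetical illustration of that policy (names and layout are invented); it is not libGRVY's actual implementation.

  // Hypothetical illustration of watermark-based overflow: when the in-memory
  // pool is full, select the least-frequently-read fraction of records
  // (watermark_ratio, e.g. 0.2 from the input file) to dump to disk.
  #include <algorithm>
  #include <cstddef>
  #include <map>
  #include <vector>

  struct PoolEntry {
    size_t record_id;
    size_t read_count;   // updated on every read of this record
  };

  // Decide which records to push to disk once the pool reaches capacity.
  std::vector<size_t> records_to_evict(const std::map<size_t, PoolEntry> &pool,
                                       double watermark_ratio)
  {
    std::vector<PoolEntry> entries;
    for (const auto &kv : pool)
      entries.push_back(kv.second);

    // Coldest records first (fewest reads since they entered the pool)
    std::sort(entries.begin(), entries.end(),
              [](const PoolEntry &a, const PoolEntry &b) { return a.read_count < b.read_count; });

    const size_t n_evict = (size_t)(watermark_ratio * entries.size());
    std::vector<size_t> victims;
    for (size_t i = 0; i < n_evict; i++)
      victims.push_back(entries[i].record_id);
    return victims;   // caller writes these records to the disk-based overflow file
  }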
Results: Compare Energy Levels From Hamiltonian

Level position (cm-1):

  Calculated   Observed   Diff   Author
  369          365        +4     Kawaguchi
  777          758        +19    Neumark
  1069         1057       +12    Neumark
  1152         1173       -21    Jacox
  1424         1413       +11    Kawaguchi
  1494         1492       +2     Hirota
  1769         1774       -5     Jacox
  1845         1831       +14    Neumark
  1931         1927       +4     Hirota
  1579
  1642

1. Level of agreement with exp. observations is quite remarkable
2. New bands identified, spawning new experimental investigation

Summary
• A new MPI-based out-of-core utility has been added to the GRVY library:
  – distributes fixed-size records to an arbitrary number of MPI tasks, using local memory to support virtual ramdisk pools
  – can also leverage spinning storage to offload the least-frequently accessed records when a local memory pool surpasses a given threshold
  – MPI-based Ocore performance is similar to a traditional on-node ramdisk
  – can eliminate excessive file-system I/O at the expense of using more compute resources
  – micro-benchmarking tests showed reasonably low overhead and good scaling properties
• CFOUR was updated to use the new Ocore with minimal source code changes:
  – turned a big-data (I/O) problem into a computer science/networking problem
  – eliminated the need to write ~1 PB to disk
  – systematic analysis completed to parameterize the vibronic Hamiltonian for the NO3 radical with high accuracy; important because:
    • atmospheric NO3 is produced chiefly by the reaction of NO2 and ozone
    • thought to be the dominant oxidant in the atmosphere at night
    • understanding its photochemistry is important for understanding global warming and our inevitable demise
Thanks for your time! Questions?

[email protected]