Engineering Breakthroughs at NCSA
Engineering Breakthroughs at NCSA
4th International Industrial Supercomputing Workshop, Amsterdam, 23-24 October 2013
Seid Koric
Senior Technical Lead - Private Sector Program at NCSA
Adjunct Professor, Mechanical Science and Engineering Dept., University of Illinois
http://www.ncsa.illinois.edu
[email protected]

Who uses HPC?
Answer: anyone whose problem cannot fit on a PC or workstation, and/or would take a very long time to run on a PC.
Application areas include molecular science and materials engineering, geoscience, weather and climate, health/life science, astronomy, and finance modeling.

What makes HPC so "High Performance"?
Answer: parallelism, i.e., doing many things (computing) at the same time, with a set of independent processors working cooperatively to solve a single problem. (Source: CCT/LSU)

Scalable Speedup (Supercomputing 101)
• Speedup: Sp = wall-clock time on 1 core / wall-clock time on N cores.
• Speedup reveals the benefit of solving problems in parallel.
• Every problem has a "sweet spot" that depends on the parallel implementation and the problem size.
• Real speedup is smaller than theoretical speedup due to serial portions of the code, load imbalance between CPUs, network latency and bandwidth, specifics of the parallel implementation in the code, I/O, etc.
[Chart: speedup vs. number of cores, with the "sweet spot" marked]

Think Big!
"It is amazing what one can do these days on a dual-core laptop computer. Nevertheless, the appetite for more speed and memory, if anything, is increasing. There always seem to be some calculations that one wants to do that exceed the available resources. It makes one think that computers have always come, and will always come, in one size and one speed: too small and too slow. This will be the case despite supercomputers becoming the size of football fields!"
Tom Hughes, 2001, President of the International Association for Computational Mechanics (IACM)

The Industrial Software Challenge
• Performance: a single core typically reaches about 10-50% of theoretical peak.
• Scalability: promising, but still heavily behind scientific peta-scale applications.
• Granularity: communication (slow) vs. computation (fast).
• Load balancing: mapping tasks to cores to promote an equal amount of work.
• Serial code portions (remember good old Amdahl's law; see the sketch below).
• I/O strategies: parallel I/O vs. one file per core.
• Licensing models of commercial software vendors (ISVs) for HPC.
• New programming models and accelerators: hybrid (MPI/OpenMP, MPI/OpenACC), GPGPU (CUDA, OpenCL, OpenACC), Xeon Phi (OpenCL).
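To make the Amdahl's-law point above concrete, here is a small illustrative C sketch (not from the slides); the serial fractions and core counts in it are made-up example values.

```c
#include <stdio.h>

/* Amdahl's law: Sp(N) = 1 / (s + (1 - s)/N), where s is the serial
   fraction of the runtime and N is the number of cores. */
static double amdahl_speedup(double serial_fraction, int cores)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

int main(void)
{
    /* Illustrative serial fractions; real codes must be profiled. */
    const double fractions[] = { 0.01, 0.05, 0.10 };
    const int    cores[]     = { 16, 64, 256, 1024 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            printf("serial fraction %.2f, %4d cores -> speedup %.1fx\n",
                   fractions[i], cores[j],
                   amdahl_speedup(fractions[i], cores[j]));
    return 0;
}
```

Even a 5% serial portion caps the achievable speedup at about 20x no matter how many cores are added, which is why the serial-code and I/O items above matter as much as the parallel kernels.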
Blue Waters - Sustained Petascale System
• Cray system & storage cabinets: 300
• Compute nodes: 25,000
• Storage bandwidth: 1.2 TB/s
• System memory: 1.5 petabytes
• Memory per core: 4 GB
• Gemini interconnect topology: 3D torus
• Usable storage: 400 petabytes
• Peak performance: ~13 petaflops
• Number of AMD processors: 50,000
• Number of AMD x86 core modules: 400,000
• Number of NVIDIA GPUs: 4,200

iForge - NCSA's Premiere HPC Resource for Industry
• Platform 1: 2,048 x86 cores, "Sandy Bridge" CPUs at 3.2 GHz, 16 cores/node, 128 GB of 1600 MHz memory per node
• Platform 2: 576 x86 cores, "Abu Dhabi" CPUs at 3.4 GHz, 32 cores/node, 256 GB of 1600 MHz memory per node
• Global RAMdisk: 1.5 terabytes
• Total memory: 21 terabytes
• Storage: 700 terabytes, GPFS file system
• Interconnect: 40-gigabit QDR InfiniBand
• MPI: Platform, Intel, MVAPICH2; OpenMP
• Operating system: Red Hat Enterprise Linux 6.4

Massively Parallel Linear Solvers in Implicit FEA
• Implicit FEA codes spend 70-80% of their time solving large systems of linear equations, Ax = b, where A is sparse, i.e., most coefficients are zero.
• A wide range of applications: finite element solid mechanics, computational fluid dynamics, reservoir simulation, circuit design, linear programming, etc.

FE Model with Global Stiffness Matrix
[Figure: finite element model and its global stiffness matrix]

Problem Specification (Matrices)
• The matrices originate either from in-house industrial and academic codes or from a commercial FE code solving real-world engineering problems, mostly with unstructured automatic meshes.
• Mostly SPD, with N = 1-80 million and NNZ = 120-1600 million.
• Condition numbers of 10^3-10^8.

Problem Specification (Solvers)
• WSMP: direct solver from IBM/Watson, based on a multifrontal algorithm, hybrid (MPI and p-threads), symmetric and nonsymmetric.
• SuperLU: direct solver developed by LBNL, LU decomposition, MPI, nonsymmetric.
• MUMPS: direct solver funded by CEC ESPRIT IV, multifrontal algorithm, MPI, symmetric and nonsymmetric.
• Hypre: iterative solver from LLNL, conjugate gradient with AMG, IC, and SAI (sparse approximate inverse) preconditioners, MPI, symmetric.
• PETSc: iterative solver from ANL, conjugate gradient (CG), bi-conjugate gradient stabilized (BCGS), and conjugate residual (CR) methods with block Jacobi, ASM (additive Schwarz), and AMG (multigrid) preconditioners, MPI, symmetric and nonsymmetric.
• Commercial FEA codes (under NDA).

Solver Work in Progress (iForge)
[Chart: solution time in seconds (lower = better) for matrix "1M" (SPD, N = 1.5M, NNZ = 63.6M, COND = 6.9E4) on 16, 32, 64, 128, and 256 cores, comparing PETSc CG/Bjacobi, BCGS/Bjacobi, BCGS/ASM, and CR/Bjacobi (Rconv = 1.E-5), Hypre PCG/ParaSails (Rconv = 1.E-5), MUMPS (SPD, direct), WSMP (SPD, direct), and SuperLU (unsymmetric, direct)]

10x Larger Problem
[Chart: solution time in seconds (lower = better) for matrix "20M" (SPD, N = 20.05M, NNZ = 827.49M, COND ≈ 1.E7) on 16-512 cores, comparing PETSc CR/Bjacobi (Rconv = 1.0E-5), Hypre PCG/ParaSails (Rconv = 1.0E-5), WSMP (SPD, direct), and MUMPS (SPD, direct)]
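The PETSc entries in the two charts above pair a Krylov method with a lightweight preconditioner at a relative convergence tolerance of 1.E-5. As a hedged illustration of that kind of setup (not the actual benchmark driver), the sketch below solves a small stand-in SPD system with CG and block Jacobi; it assumes PETSc 3.5 or newer (three-argument KSPSetOperators), and the 1D Laplacian merely stands in for a real FE stiffness matrix.

```c
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat A; Vec x, b; KSP ksp; PC pc;
    PetscInt i, n = 100, Istart, Iend;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Small 1D Laplacian as a stand-in for an FE stiffness matrix. */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);            /* conjugate gradient */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCBJACOBI);          /* block-Jacobi preconditioner */
    KSPSetTolerances(ksp, 1.0e-5, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT);
    KSPSetFromOptions(ksp);            /* allow -ksp_type / -pc_type overrides */
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}
```

Because the Krylov method and preconditioner can also be overridden at run time (-ksp_type, -pc_type), the same driver can be switched to BCGS/ASM or CR/Bjacobi without recompiling, which is a convenient way to run such solver comparisons.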
The Largest Linear System Solved with a Direct Solver (N > 40M, NNZ > 1600M, Cond = 4.E7)
[Charts: solution time in seconds (lower = better) and parallel speedup (higher = better) for the 40M system on 16-512 cores, comparing PETSc, Hypre, and WSMP]

WSMP Performance on iForge
[Chart: sparse factorization performance in TFlop/s (higher = better) for the Watson Sparse Matrix Package hybrid (MPI/Pthreads) symmetric solver, N = 2.8M, NNZ = 107M, on 128-960 threads, comparing X5690/Westmere and Xeon E5-2670/Sandy Bridge nodes]

ISV Implicit FEA Benchmark on iForge
• ABAQUS model: 2,274,403 elements, 12,190,073 nodes, >30M DOFs.
• ABAQUS analysis job: direct sparse solver.
• Cluster: iForge, using 24-196 cores.
• Wall-clock time reduced from about 7 hours to 1 hour.
[Chart: wall-clock time in seconds vs. number of cores (up to 250)]

Explicit FEA: LS-Dyna on Blue Waters
• NCSA/PSP, the hardware vendor (Cray), the ISV (LSTC), and a PSP partner (NDA) all working together!
• Real geometry, loads, and boundary conditions: a highly nonlinear, transient dynamic problem with difficult contact conditions.
• The MPP-Dyna solver was fully ported and optimized for Cray's Linux Environment, taking full advantage of the Gemini interconnect.

LS-Dyna Breakthrough on Blue Waters
[Chart: wall-clock time in hours (lower = better) for a 26.5M-node, 80M-DOF model on 512-8,192 CPU cores, comparing iForge (MPI), Blue Waters (MPI), and Blue Waters (hybrid)]

Reaching Scalability on 10,000 Cores
[Chart: LS-Dyna parallel scalability, wall-clock time in hours (lower = better) for the Molding-10m-8x model (>70M nodes, >41M elements) on 512-10,240 cores, comparing Blue Waters (Cray XE6) and iForge (Intel Sandy Bridge)]
The highest known scaling of any ISV FEA code to date!

Typical MPP-Dyna Profiling
As scaling increases, performance becomes increasingly determined by communication!
[Charts: computing vs. communication shares of the runtime at 64 cores and at 512 cores]

LS-DYNA Work in Progress
• Benchmarking even larger real-world problems.
• Memory management is becoming a serious issue for DP (decomposition, distribution, MPMD, etc.).
• The hybrid (MPI/OpenMP) solver uses less memory and less communication (see the sketch below).
• Load balance in the contact and rigid-body algorithms.
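As a generic illustration of the hybrid MPI/OpenMP layout referred to above (MPI ranks between nodes, OpenMP threads within a node), here is a minimal, self-contained C sketch; it is of course not LS-DYNA code, and the per-thread work is only a placeholder.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request funneled threading: only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;

    /* OpenMP threads share the node-local work of this MPI rank. */
    #pragma omp parallel reduction(+:local)
    {
        local += 1.0;   /* placeholder for element-level computation */
    }

    /* MPI combines the per-rank results across nodes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks, %g threads counted in total\n", nranks, global);

    MPI_Finalize();
    return 0;
}
```

Run with one MPI rank per node and OMP_NUM_THREADS set to the cores per node; compared with pure MPI, the fatter ranks generally hold fewer halo copies in memory and exchange fewer messages, which is the memory and communication benefit noted above.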
Star-CCM+ Breakthrough on Blue Waters
• Source: NCSA Private Sector Partner "B" (confidential).
• Code/version: Star-CCM+ 7.6.9.
• Physics: transient, turbulent, single-phase compressible flow.
• Mesh size: 21.4 million unstructured polyhedral cells.
• Complexity: very complicated geometry, high-resolution mesh.
• A complex, real-life production case, highly demanding both in terms of the mesh and the physics involved.
[Chart: CD-adapco Star-CCM+ case from "Partner B", iterations per simulation hour (higher = better) on up to 2,048 CPU cores, comparing iForge and Blue Waters; scaling over InfiniBand on iForge levels off at 256 cores]
The highest known scaling of Star-CCM+ to date... and we broke the code!

The Future of HPC
A view from 11/2010.

Future of HPC: GPGPU Computing?

OpenACC: Lowering Barriers to GPU Programming

Minimize Data Movement!
The name of the game in GPGPU: keep data on the accelerator and minimize traffic over the PCI bus between the host CPU and the GPU.

OpenACC Example: Solving the 2D Laplace (Heat) Equation with Finite Differences
The Jacobi scheme iteratively converges to the correct value (temperature) by computing new values at each point from the average of the neighboring points:

$$\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = 0$$

$$T_{i,j}^{k+1} = \frac{T_{i+1,j}^{k} + T_{i-1,j}^{k} + T_{i,j+1}^{k} + T_{i,j-1}^{k}}{4}$$
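Below is a minimal OpenACC sketch of the Jacobi update above. It is not the benchmark code from the slides; the grid size, tolerance, iteration cap, and the flattened a[i*N+j] indexing (a workaround the "OpenACC - Today and Tomorrow" slide mentions later) are all assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N         1024      /* grid is N x N points */
#define MAX_ITERS 1000
#define TOL       1.0e-4

int main(void)
{
    double *T    = calloc((size_t)N * N, sizeof(double));
    double *Tnew = calloc((size_t)N * N, sizeof(double));

    /* A fixed temperature along the top boundary drives the solution. */
    for (int j = 0; j < N; j++) T[j] = Tnew[j] = 100.0;

    double err  = 1.0;
    int    iter = 0;

    /* Keep both arrays resident on the GPU for the whole run; only the
       scalar residual crosses the PCI bus on each sweep. */
    #pragma acc data copy(T[0:N*N]) create(Tnew[0:N*N])
    while (err > TOL && iter < MAX_ITERS) {
        err = 0.0;

        /* Jacobi sweep: new value = average of the four neighbours. */
        #pragma acc parallel loop collapse(2) reduction(max:err)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                Tnew[i*N + j] = 0.25 * (T[(i+1)*N + j] + T[(i-1)*N + j]
                                      + T[i*N + (j+1)] + T[i*N + (j-1)]);
                err = fmax(err, fabs(Tnew[i*N + j] - T[i*N + j]));
            }

        /* Copy the interior back for the next iteration. */
        #pragma acc parallel loop collapse(2)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                T[i*N + j] = Tnew[i*N + j];

        iter++;
    }

    printf("residual %g after %d iterations\n", err, iter);
    free(T);
    free(Tnew);
    return 0;
}
```

The acc data region is where the "minimize data movement" advice shows up in practice: both temperature arrays stay on the device across all sweeps. An equivalent OpenMP version can be obtained by swapping the acc directives for omp parallel for directives.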
Laplace 2D: Single-Node Performance, OpenACC vs. OpenMP
[Chart: wall-clock time in seconds (lower = better) for Laplace 2D on a 4096x4096 grid, comparing CPU only (1 OpenMP thread), CPU only (6 OpenMP threads), and GPU (OpenACC) on Blue Waters XK7 (Interlagos/Kepler) and KIDS (Westmere/Fermi); a 14x speedup!]

Multinode Performance: Hybrid Laplace 2D Solvers
[Chart: distributed 2D Laplace hybrid solvers on an 8196x8196 grid, speedup with respect to a serial CPU run (higher = better) on 1-16 nodes (MPI ranks across nodes, OpenMP or OpenACC threads within each node), comparing XE6 (MPI+OpenMP) and XK7 (MPI+OpenACC)]

OpenACC - Today and Tomorrow
• OpenACC compilers are still in development (we had to use a[i*ncol+j] instead of a[i][j], etc.).
• GPU (CUDA)-aware MPI: passing device (GPU) buffer pointers to MPI directly, instead of staging GPU buffers through the host (CPU).
• GPU/CPU load balancing: distribute the domain unequally and let the GPU work on the largest chunk, while CPU threads work on smaller chunks to keep the other CPU cores on the node busy.
• OpenACC programming for multiple GPUs attached to a CPU node.
• OpenACC merging with the OpenMP standard, with Xeon Phi support?

Multinode GPU Acceleration
[Chart: Abaqus/Standard 6.11 in Cluster Compatibility Mode, S4B benchmark (5.23M DOFs), parallel speedup with respect to a serial CPU run (higher = better) on 0.5-6 nodes, comparing Cray XE6 (CPU only) and Cray XK7 (CPU+GPU)]

NDEMC Public-Private Partnership
• US OEMs have gained a competitive edge through the use of high-performance computing (HPC) with modeling, simulation, and analysis (MS&A).
• The US Council on Competitiveness recognized that small and medium-sized enterprises (SMEs) are not able to take advantage of HPC.
• In the fall of 2011 a pilot program was started in the Midwestern supply base.

NDEMC: Multiphysics Simulation of a CAC
Objective: study the fatigue life of a charge air cooler (CAC) due to thermal stresses for the NDEMC project.
Description: a three-step, sequentially coupled simulation (model size about 15M nodes):
(1) A CFD analysis of the turbulent fluid flow through the CAC, coupled with advective heat transfer, provides thermal boundary conditions for the FEA.
(2) A thermo-mechanical FEA provides the transient thermal stresses in the solid part during the thermal cycle for the fatigue analysis.
(3) A fatigue model uses the history of thermal stresses to estimate the cycle life at critical points.

XSEDE ECSS Project
3D study of the elastic-plastic transition and fractal patterns in a 1-million-grain cube of grade 316 steel (2010-2012) (M. Ostoja-Starzewski, Jun Li, S. Koric, A. Saharan, Philosophical Magazine, 2012).
• The largest nonhomogeneous FEA simulations to date.
• Each of the 1 million elements (grains) has a different material property.
• The fractal dimension can be used to estimate the level of plasticity for damage assessment of various structures.
• Now aiming at (much) larger simulations on Blue Waters with ParaFEM!

Continuous Casting Consortium at UIUC, Steel Dynamics Inc., NCSA
• Molten steel freezes against the water-cooled walls of a copper mold to form a solid shell.
• Initial solidification occurs at the meniscus and is responsible for the surface quality of the final product.
• Thermal strains arise due to volume changes caused by temperature changes and phase transformations.
• Inelastic strains develop due to both strain-rate-independent plasticity and time-dependent creep.
• Superheat flux from the turbulent fluid flow mixing in the liquid pool.
• Ferrostatic pressure pushes against the shell, causing it to bulge outwards.
• Mold distortion and mold taper (the slant of the mold walls that compensates for shell shrinkage) affect the mold shape and the interfacial gap size.
Objective: a multiphysics approach simulating all three phenomena (fluid flow, heat transfer, and stress).

Thermo-Mechanical Model
Breakout shell thickness comparison between the model and plant data (Hibbeler, Koric, Thomas, Xu, Spangler, CCC-UIUC, SD Inc., 2009). The mismatch is due to an uneven superheat distribution!

Power of Multiphysics (Thermo-Mechanical-CFD Model)
Less superheat at the wide face (WF) (Koric, Hibbeler, Liu, Thomas, CCC-UIUC, 2011).
The HPC Innovation Excellence Award, 2011.
[Figure: results plotted against distance from the meniscus, 0-0.6 m]

Special Thanks
• Prof. Martin Ostoja-Starzewski (MechSE, UIUC)
• Prof. Brian G. Thomas and the CCC
• Dr. Ahmed Taha (NCSA)
• Cray
• 2 PSP partner companies (NDA)
• NDEMC
• LSTC
• IBM Watson (Dr. Anshul Gupta)
• Simulia, Dassault Systemes
• Blue Waters Team