GPU - ANSYS
Transcription
Stan Posey, NVIDIA, Santa Clara, CA, USA; [email protected]

GPUs Now Mainstream HPC Technology
- "Buyer plans to include accelerators in their next technical computing server purchase have more than doubled, from 29% to over 65%, in the last 20 months." (IDC Market Research, April 2013)

GPUs for Servers: Now as Common as CPUs

NVIDIA GPUs Accelerate CAE at Any Scale
- Same GPU technology runs from MAXIMUS workstations to TITAN at ORNL: 20+ petaflops, 18,688 NVIDIA Tesla K20X GPUs, #2 at Top500.org
- Key TITAN application: S3D for turbulent combustion; how to efficiently burn next-generation diesel and bio fuels?

NVIDIA HPC Technology and CAE Strategy
- Technology: development of professional GPUs as co-processing accelerators for x86 CPUs
- Strategic alliances: business and technical collaboration with ISVs, industry customers, and research organizations
- Applications engineering: technical collaboration with ISVs such as ANSYS on development of GPU-accelerated solvers
- Software development: NVIDIA linear solver toolkit (implicit iterative solvers), CUDA libraries, GPU compilers
- GPU system integration: HP, Dell, IBM, Cray, SGI, Fujitsu, and others; Kepler K20-based systems available since 2012

NVIDIA Leadership in Remote Visualization
[Diagram: an NVIDIA GRID GPU under a GRID-enabled hypervisor serving NVIDIA-driver virtual machines as virtual desktops (VDI)]

GPU Motivation: CAE Cost Trends Over 20 Years
- Cost trend: hardware is cheap, while people and software costs continue to increase
- Historically, hardware was very expensive relative to ISV software and people
- ISV software budgets are now 4x hardware budgets
- It is increasingly important that hardware choices drive cost-performance efficiency in people and ISV software

NVIDIA Uses ANSYS CAE in Product Engineering
- ANSYS Icepak: active and passive cooling of IC packages
- ANSYS Mechanical: large-deflection bending of PCBs
- ANSYS Mechanical: comfort and fit of 3D emitter glasses
- ANSYS Mechanical: shock and vibration of solder-ball assemblies

Progress Summary for GPU-Parallel CAE (I)
- Strong GPU investments by commercial CAE vendors (ISVs)
  - GPU adoption led by implicit FEA and CEM, followed by CFD
  - Recent CFD breakthroughs in linear solvers (AMG) and preconditioners
- GPUs are now production HPC for leading CAE end-user sites
  - Led by the automotive, electronics, and aerospace industries
- GPUs are contributing to fast growth in emerging CAE applications
  - New developments in particle-based CFD (LBM, SPH, DEM, etc.)
  - Rapid growth for a range of CEM applications and GPU adoption

Progress Summary for GPU-Parallel CAE (II)
- Every ISV has GPU-based products available or undergoing evaluation
- The 4 largest ISVs have products based on GPUs, some at the 3rd generation: #1 ANSYS, #2 DS SIMULIA, #3 MSC Software, and #4 Altair
- ANSYS 15.0 will have multiphysics capability on GPUs across 3 domains: ANSYS Mechanical (4th generation), ANSYS Fluent (2nd generation), and ANSYS HFSS transient (1st generation)
- Four of the top 5 ISV applications are available on GPUs today: ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, and MSC Nastran; LS-DYNA is available implicit-only
- Several new ISVs were founded with GPUs as a competitive strategy: Prometech (JP), FluiDyna (DE), Vratis (PL), IMPETUS (SE), Turbostream (UK)

GPU Focus on Acceleration of Implicit Solvers
- CPU: reads input, performs matrix set-up, forms the global solution, and writes output
- GPU: runs the implicit sparse matrix operations, which account for 50%-75% of the profiled time
- Parallelization of the implicit solver uses hand-coded CUDA, GPU libraries such as CUBLAS, and OpenACC directives (OpenACC is being investigated for moving more tasks onto the GPU)

Basics of GPU Computing for ANSYS Software
- GPUs are accelerators that attach to an x86 CPU; a GPU cannot operate without an x86 CPU present
- Most ANSYS GPU acceleration is user-transparent; the only requirement is to tell ANSYS how many GPUs to use
[Schematic: an x86 CPU with DDR memory and cache, connected through an I/O hub over PCI-Express to a GPU accelerator with GDDR memory]
1. The ANSYS job is launched on the CPU
2. Solver operations are sent to the GPU
3. The GPU sends results back to the CPU
4. The ANSYS job completes on the CPU
(A minimal code sketch of this offload flow follows the license slide below.)

ANSYS and NVIDIA Collaboration Roadmap
- Release 13.0 (Dec 2010)
  - ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers
  - ANSYS EM: ANSYS Nexxim
- Release 14.0 (Dec 2011)
  - ANSYS Mechanical: + Distributed ANSYS; + multi-node support
  - ANSYS Fluent: radiation heat transfer (beta)
  - ANSYS EM: ANSYS Nexxim
- Release 14.5 (Nov 2012)
  - ANSYS Mechanical: + multi-GPU support; + hybrid PCG; + Kepler GPU support
  - ANSYS Fluent: + radiation HT; + GPU AMG solver (beta), single GPU
  - ANSYS EM: ANSYS Nexxim
- Release 15.0 (Q4 2013)
  - ANSYS Mechanical: + CUDA 5 Kepler tuning
  - ANSYS Fluent: + multi-GPU AMG solver; + CUDA 5 Kepler tuning
  - ANSYS EM: ANSYS Nexxim; ANSYS HFSS (transient)

ANSYS 15.0 License Scheme for GPUs
- One HPC task is required to unlock one GPU (applies to all schemes: HPC, HPC Pack, HPC Workgroup, HPC Enterprise, etc.)
- Examples: 1 ANSYS HPC Pack = 8 HPC tasks total (4 GPUs max); 2 ANSYS HPC Packs = 32 HPC tasks total (16 GPUs max)
- Valid configurations with 8 tasks: 6 CPU cores + 2 GPUs, or 4 CPU cores + 4 GPUs
- With 32 tasks used across 2 servers: 24 CPU cores + 8 GPUs (a 3:1 ratio)
(A worked example of this arithmetic follows directly below.)
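As a worked example of the license arithmetic above, the following small helper (hypothetical, written as CUDA host code) encodes the rules implied by the slide; the 4x-per-pack growth rule and the GPUs-at-most-half-the-tasks cap are assumptions inferred only from the two pack sizes quoted (1 pack = 8 tasks, 2 packs = 32 tasks):

    #include <stdio.h>

    // Tasks unlocked by n HPC Packs, assuming 4x growth per additional pack.
    static int hpc_tasks(int packs) {
        int tasks = 8;
        for (int i = 1; i < packs; ++i) tasks *= 4;
        return tasks;
    }

    // The slide caps GPUs at half of the unlocked tasks (8 -> 4, 32 -> 16).
    static int max_gpus(int tasks) { return tasks / 2; }

    // One HPC task unlocks one GPU, and each CPU core also consumes a task,
    // so cores + gpus must fit within the unlocked task count.
    static int config_valid(int cores, int gpus, int tasks) {
        return gpus <= max_gpus(tasks) && cores + gpus <= tasks;
    }

    int main(void) {
        printf("6 cores + 2 GPUs  (1 pack):  %d\n", config_valid(6, 2, hpc_tasks(1)));
        printf("4 cores + 4 GPUs  (1 pack):  %d\n", config_valid(4, 4, hpc_tasks(1)));
        printf("24 cores + 8 GPUs (2 packs): %d\n", config_valid(24, 8, hpc_tasks(2)));
        return 0;  // all three print 1, matching the slide's valid configurations
    }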
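And as a minimal sketch of the four-step offload flow from the "Basics of GPU Computing" slide above: host data is staged, shipped over PCI-Express, processed by a GPU library call, and the result returns to the CPU. A cuBLAS dot product stands in here for the solver's sparse kernels, which are not public; this is illustrative CUDA, not ANSYS code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void) {
        const int n = 1 << 20;
        double *x = (double*)malloc(n * sizeof(double));  // 1. job starts on the CPU
        for (int i = 0; i < n; ++i) x[i] = 1.0;

        double *d_x;
        cudaMalloc(&d_x, n * sizeof(double));
        cudaMemcpy(d_x, x, n * sizeof(double),            // 2. work is sent to the GPU
                   cudaMemcpyHostToDevice);               //    across PCI-Express

        cublasHandle_t h;
        cublasCreate(&h);
        double result = 0.0;
        cublasDdot(h, n, d_x, 1, d_x, 1, &result);        //    heavy math on the GPU;
                                                          // 3. result returns to the host
        printf("dot = %.1f\n", result);                   // 4. job completes on the CPU

        cublasDestroy(h);
        cudaFree(d_x);
        free(x);
        return 0;
    }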
ANSYS Mechanical 14.5 GPU Acceleration
[Chart: ANSYS Mechanical number of jobs per day for the V14sp-5 model; Distributed ANSYS 14.5 with 8-core CPUs and single GPUs; higher is better]
- Westmere, Xeon X5690 3.47 GHz, 8 cores: 164 jobs/day; with Tesla C2075: 341 jobs/day (2.1x acceleration)
- Sandy Bridge, Xeon E5-2687W 3.10 GHz, 8 cores: 210 jobs/day; with Tesla K20: 395 jobs/day (1.9x acceleration)
- Model: turbine geometry, 2,100,000 DOF, SOLID187 finite elements, static nonlinear, one iteration (the final solution requires 25), direct sparse solver
- Results from a Supermicro X9DR3-F with 64 GB memory

(The ANSYS and NVIDIA collaboration roadmap slide is repeated here in the original deck.)

GPU Acceleration in ANSYS Fluent
- Beta release in 14.5; full product support in 15.0 (Dec 2013)
- GPU-based model: radiation heat transfer using OptiX (product in 14.5)
- GPU-based solver: coupled algebraic multigrid (AMG) PBNS linear solver
- Operating systems: both Linux and Win64, for workstations and servers
- Parallel methods: shared memory in 14.5; Distributed ANSYS in 15.0
- Multi-GPU support: single GPU in 14.5; full multi-GPU, multi-node in 15.0
- Model suitability: 3M cells or less in 14.5; unlimited in 15.0

ANSYS Fluent 14.5 and Radiation HT on GPU
- VIEWFAC utility: runs on CPUs, GPUs, or both; ~2x speedup
- RAY TRACING utility: uses the OptiX library from NVIDIA, with up to ~15x speedup (GPU only)
- Radiation HT applications: underhood cooling, cabin comfort HVAC, furnace simulations, solar loads on buildings, turbine combustors, electronics passive cooling

GPU-based AMG Solver for ANSYS Fluent 15.0
- New ANSYS Fluent AMG based on an NVIDIA-developed solver toolkit
- Developed with support for MPI across multiple nodes and multiple GPUs
- Solver collaboration on coupled pressure-based Navier-Stokes; others to follow
- Early results published at Parallel CFD 2013, 20-24 May, Changsha, CN: "GPU-Accelerated Algebraic Multigrid for Applied CFD"
(A structural sketch of the multigrid cycles named in the benchmarks follows the profile slide below.)

ANSYS Fluent Profile for Coupled PBNS Solver
[Flowchart: each non-linear iteration assembles the linear system of equations (~35% of runtime), solves the linear system Ax = b (~65% of runtime; accelerate this first), then checks convergence and either iterates again or stops]
- Since the linear solve is roughly 65% of the runtime, even a perfect GPU solver caps the end-to-end speedup at about 1/0.35, close to 3x, which is why this step is targeted first
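The benchmarks that follow repeatedly cite AMG "F-cycle" and "V-cycle" settings. As a rough structural sketch (not the nvAMG implementation, which the talk does not show): a V-cycle visits each coarser level exactly once per sweep, smoothing before and after the coarse-grid correction, while an F-cycle revisits coarse levels more often. The no-op stubs below stand in for the sparse-matrix kernels that the GPU accelerates:

    #include <stdlib.h>

    typedef struct Level {
        int n;                  // number of unknowns on this level
        struct Level *coarse;   // next, smaller level; NULL at the coarsest
    } Level;

    // Stand-ins for the real sparse kernels: smoothers such as the DILU /
    // MC-DILU variants named in the slides, restriction, and prolongation.
    static void smooth(Level *l, double *x, const double *b) {}
    static void restrict_residual(Level *l, const double *x,
                                  const double *b, double *b_coarse) {}
    static void prolong_correct(Level *l, double *x, const double *x_coarse) {}
    static void coarse_solve(Level *l, double *x, const double *b) {}

    static void v_cycle(Level *l, double *x, const double *b, int pre, int post) {
        if (l->coarse == NULL) {                  // coarsest level: solve directly
            coarse_solve(l, x, b);
            return;
        }
        for (int i = 0; i < pre; ++i) smooth(l, x, b);    // "0pre" => skipped

        double *xc = (double*)calloc(l->coarse->n, sizeof(double));
        double *bc = (double*)malloc(l->coarse->n * sizeof(double));
        restrict_residual(l, x, b, bc);           // restrict the residual r = b - Ax
        v_cycle(l->coarse, xc, bc, pre, post);    // exactly one recursive visit: a "V"
        prolong_correct(l, x, xc);                // interpolate and add the correction

        for (int i = 0; i < post; ++i) smooth(l, x, b);   // "3post" => three sweeps
        free(xc); free(bc);
    }

    int main(void) {
        Level c = {100, NULL}, f = {800, &c};     // toy two-level hierarchy
        double x[800] = {0}, b[800] = {0};
        v_cycle(&f, x, b, 0, 3);                  // the GPU settings from the slides
        return 0;
    }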
ANSYS Fluent Performance for a Single Tesla K20X
[Chart: ANSYS Fluent 14.5 AMG solver time per iteration in seconds; results by NVIDIA, Nov 2012; airfoil and aircraft models with hexahedral cells; lower is better]
- Tesla K20X vs. Xeon E5-2680 (2 x E5-2680 CPUs, only 8 cores used)
- Airfoil (hex, 784K cells): 1.9x; aircraft (hex, 1,798K cells): 1.8x
- Solver settings: CPU Fluent solver: F-cycle, agg8, DILU, 0 pre, 3 post; GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre, 3 post
- Note: times are for the solver only

Multi-GPU Preview of ANSYS Fluent 15.0
- Available late 2013 (Preview 3 from Aug 2013)

GPUs and Distributed Cluster Computing
- The geometry is decomposed and the partitions are placed on independent cluster nodes; the CPUs process them in distributed parallel using MPI
- With GPUs added, execution runs on CPU + GPU: the GPUs work shared-memory parallel using OpenMP underneath the distributed parallelism, and the partition results are combined into the global solution
[Diagram: partitions 1-4 created on the CPU, distributed via MPI to nodes N1-N4, each with an attached GPU G1-G4, feeding the global solution]
(A minimal MPI + GPU sketch of this pattern appears after the benchmark slides below.)

ANSYS Fluent Solver Times for 2 CPUs + 2 GPUs
[Chart: ANSYS Fluent 15.0 preview solver times; results by NVIDIA, Feb 2013; lower is better]
- 2 x Tesla K20X vs. 2 x Xeon E5-2680 (Sandy Bridge, 16 cores total; only 2 cores used with the GPUs)
- Helix (tet, 1,173K cells): 2.1x; airfoil (hex, 784K cells): 1.7x
- Solver settings: CPU Fluent solver: F-cycle, agg8, DILU, 0 pre, 3 post; GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre, 3 post
- Note: times are for the solver only

ANSYS Fluent Solver Times for Sedan: 4 GPUs
- Multi-GPU acceleration of 2.9x solver speedup over the 16-core CPU run
- Model: external aero, 3.6M mixed cells, steady, k-epsilon turbulence, coupled PBNS, double precision; AMG F-cycle on CPU, AMG V-cycle on GPU
- Hardware: Xeon E5-2667 CPUs + 4 x Tesla K20X GPUs (ANSYS Fluent 15.0 preview)
- CPU configuration: one 16-core server node; CPU + GPU configuration: 8 cores + GPUs G1, G2 and 8 cores + GPUs G3, G4

ANSYS Fluent Solution Times for the Sedan Case
[Chart: ANSYS Fluent number of jobs per day, 15.0 Preview 3; results by NVIDIA, Sep 2013; higher is better]
- Segregated solver, 8 cores: 12 jobs/day
- Coupled solver, 8 cores + 2 GPUs: 27 jobs/day, a 1.9x gain over the coupled solver on 8 cores alone
- 2 x E5-2680 Sandy Bridge CPUs, 16 cores total; only 8 cores used in the study
- Model: sedan, 3.6M mixed cells, steady, k-epsilon turbulence, coupled PBNS, double precision; AMG F-cycle on CPU, AMG V-cycle on GPU
- Note: all results fully converged

ANSYS Fluent Convergence for the Truck Case
- Truck body model: 14M mixed cells, steady, k-epsilon turbulence, PBNS, double precision, default URFs; CPU: AMG F-cycle; GPU: FGMRES with AMG preconditioner
- Coupled PBNS shows stable convergence of the drag coefficient at ~500 iterations
- Segregated PBNS shows oscillating drag-coefficient behavior, not converged after ~6000 iterations
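Returning to the "GPUs and Distributed Cluster Computing" picture above, here is a minimal MPI + CUDA sketch of the pattern: one rank per partition, each bound to one of its node's GPUs, with a reduction standing in for the global-solution step. All names are illustrative; this is not Fluent's source:

    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        // Bind this rank (partition) to one of the node's GPUs (G1..G4 above).
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
        if (ngpus > 0) cudaSetDevice(rank % ngpus);

        // Here each rank would copy its partition to its GPU and run the local
        // AMG solve there (omitted); a dummy value stands in for the residual.
        double local_res = 1.0;

        // Global-solution step: combine per-partition results across all nodes.
        double global_res = 0.0;
        MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        if (rank == 0)
            printf("combined residual over %d partitions = %g\n", nranks, global_res);

        MPI_Finalize();
        return 0;
    }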
ANSYS Fluent Solution Times for the Truck Case
- Same solution times: 64 cores vs. 32 cores + 8 GPUs
- Frees up 32 CPU cores and their HPC licenses for additional job(s)
- Approximately a 56% increase in overall productivity for a 25% increase in cost (a back-of-envelope check of this claim closes the transcription below)
[Chart: ANSYS Fluent number of jobs per day for the truck body model, 15.0 Preview 3; results by NVIDIA, Sep 2013; higher is better]
- 64 cores (4 nodes x 2 CPUs): 16 jobs/day
- 32 cores + 8 GPUs (2 nodes x 2 CPUs, 4 GPUs per node): 16 jobs/day
- Model: 14M mixed cells, steady, k-epsilon turbulence, coupled PBNS, double precision; total solution times; CPU: AMG F-cycle; GPU: FGMRES with AMG preconditioner
- Note: all results fully converged

Additional Information
- Configuration details for a workstation or server: www.nvidia.com/teslawtb and www.nvidia.com/workstationwtb
- Test-drive GPU computing with your ANSYS simulations: contact ANSYS or email [email protected]
- Read more about ANSYS and GPU computing: "ANSYS Unveils GPU Computing for Accelerated Engineering Simulations"; "Speed Up Simulations with a GPU" (ANSYS Advantage magazine); "Speeding to a Solution" (ANSYS Advantage magazine); "HPC Delivers a 3-D View" (ANSYS Advantage magazine)
- For more information on NVIDIA and ANSYS solutions: www.nvidia.com/ansys

Acknowledgements
ANSYS (www.ansys.com):
- Mr. Jeff Beisheim, ANSYS Mechanical parallel solver development
- Dr. Sunil Sathe, ANSYS Fluent parallel solver development
- Dr. Prasad Alavilli, Manager, ANSYS Fluent parallel HPC development
NVIDIA (www.nvidia.com):
- Mr. Jon Cohen, Manager, NVIDIA computational library development
- Dr. Joe Eaton, Manager, NVIDIA linear solver toolkit development
- Dr. Steve Rennich, Applications Engineer, Developer Technology Group
- Dr. Bhushan Desam, ANSYS alliances and CFD market development
- Mr. Vijay Sellappan, Applications Engineer, CAE technology

Stan Posey, NVIDIA, Santa Clara, CA, USA; [email protected]
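As a postscript, a back-of-envelope check of the truck-case productivity claim above. The 16 jobs/day, 56%, and 25% figures are the slide's; the throughput assumed for the two freed CPU nodes is a guess chosen only to reproduce the stated 56%:

    #include <stdio.h>

    int main(void) {
        const double jobs_64c = 16.0;  // 4 nodes, 64 cores: 16 jobs/day (slide data)
        const double jobs_gpu = 16.0;  // 2 nodes + 8 GPUs: same 16 jobs/day (slide data)
        // Assumption: the 2 freed CPU nodes run extra work at slightly more than
        // half the 4-node rate, roughly 9 jobs/day, as sub-linear scaling suggests.
        const double jobs_freed = 9.0;

        double gain = (jobs_gpu + jobs_freed) / jobs_64c - 1.0;
        printf("throughput gain ~ %.0f%%\n", gain * 100.0);  // prints ~56%
        // Cost side (the slide's assumption): adding 8 GPUs raises the 4-node
        // hardware cost by about 25%, hence "56% more output for 25% more cost".
        return 0;
    }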