GPU - ANSYS

Transcription

Stan Posey
NVIDIA, Santa Clara, CA, USA; [email protected]
GPUs Now Mainstream HPC Technology
"Buyer plans for including accelerators in their next technical computing server purchase have more than doubled, from 29% to over 65%, in the last 20 months."
IDC Market Research, April 2013
GPUs for Servers: Now as Common as CPUs
NVIDIA GPUs Accelerate CAE at Any Scale
The same GPU technology spans MAXIMUS workstations to TITAN at ORNL: 20+ petaflops from 18,688 NVIDIA Tesla K20X GPUs, ranked #2 at Top500.org.
Key application on TITAN: S3D for turbulent combustion (how to efficiently burn next-generation diesel and bio fuels?)
NVIDIA HPC Technology and CAE Strategy
Technology: development of professional GPUs as co-processing accelerators for x86 CPUs.
Strategy:
- Strategic alliances: business and technical collaboration with ISVs, industry customers, and research organizations
- Applications engineering: technical collaboration with ISVs like ANSYS on development of GPU-accelerated solvers
- Software development: NVIDIA linear solver toolkit (implicit iterative solvers), CUDA libraries, GPU compilers
- GPU system integration: HP, Dell, IBM, Cray, SGI, Fujitsu, and others; Kepler K20-based systems available since 2012
NVIDIA Leadership in Remote Visualization
[Diagram: NVIDIA GRID virtual desktop infrastructure (VDI). GRID-enabled virtual desktops with the NVIDIA driver run in virtual machines on a GRID-enabled hypervisor, backed by NVIDIA GRID GPUs.]
GPU Motivation: CAE Cost Trends Over 20 Years
Cost trends: hardware is cheap, while people and software costs continue to increase.
• Historically, hardware was very expensive relative to ISV software and people
• ISV software budgets are now roughly 4x hardware budgets
• It is increasingly important that hardware choices drive cost-performance efficiency in people and ISV software
NVIDIA Uses ANSYS CAE in Product Engineering
ANSYS Icepak – active and passive cooling of IC packages
ANSYS Mechanical – large deflection bending of PCBs
ANSYS Mechanical – comfort and fit of 3D emitter glasses
ANSYS Mechanical – shock and vibration of solder ball assemblies
Progress Summary for GPU-Parallel CAE (I)
Strong GPU investments by commercial CAE vendors (ISVs)
GPU adoption led by implicit FEA and CEM, followed by CFD
Recent CFD breakthroughs in linear solvers (AMG) and preconditioners
GPUs now production-HPC for leading CAE end-user sites
Led by automotive, electronics, and aerospace industries
GPUs contributing to fast growth in emerging CAE applications
New developments in particle-based CFD (LBM, SPH, DEM, etc.)
Rapid growth for range of CEM applications and GPU adoption
Progress Summary for GPU-Parallel CAE (II)
Every ISV has GPU-based products available or under evaluation
The 4 largest ISVs have products based on GPUs, some in their 3rd generation
#1 ANSYS, #2 DS SIMULIA, #3 MSC Software, and #4 Altair
ANSYS 15.0 will have multiphysics capability on GPUs across 3 physics domains
ANSYS Mechanical – 4th generation; ANSYS Fluent – 2nd generation; ANSYS HFSS (transient) – 1st generation
4 of the top 5 ISV applications are available on GPUs today
ANSYS Fluent, ANSYS Mechanical, Abaqus/Standard, MSC Nastran, and LS-DYNA (implicit only)
Several new ISVs were founded with GPUs as a competitive strategy
Prometech (JP), FluiDyna (DE), Vratis (PL), IMPETUS (SE), Turbostream (UK)
GPU Focus on Acceleration of Implicit Solvers
[Diagram: ANSYS application software split between CPU and GPU]
CPU: reads input and performs matrix set-up; computes the global solution and writes output.
GPU: implicit sparse matrix operations, which account for 50% to 75% of the profile time in the implicit solver, parallelized via hand-written CUDA, GPU libraries (CUBLAS), and OpenACC directives.
(Investigating OpenACC for putting more tasks on the GPU.)
Basics of GPU Computing for ANSYS Software
GPUs are accelerators that attach to an x86 CPU; they cannot operate without an x86 CPU present.
Most ANSYS GPU acceleration is user-transparent: the only requirement is to inform ANSYS of how many GPUs to use.
[Schematic of an x86 CPU (DDR memory, cache) connected through the I/O hub and PCI-Express to a GPU accelerator (GDDR memory)]
The CPU begins and ends the job; the GPU manages the heavy computations:
1. ANSYS job launched on CPU
2. Solver operations sent to GPU
3. GPU sends results back to CPU
4. ANSYS job completes on CPU
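This four-step flow can be sketched in a few lines of CUDA. A minimal, illustrative sketch only, not ANSYS source: the cuBLAS matrix multiply stands in for the heavy solver kernels of step 2, and all names here are hypothetical.

```
// Illustrative CUDA sketch of the four-step offload pattern above.
// Not ANSYS code: cublasDgemm stands in for the solver's dense
// factor-block kernels (slide 11 names cuBLAS as one approach).
#include <cublas_v2.h>
#include <cuda_runtime.h>

void solver_offload(int n, const double* hA, const double* hB, double* hC) {
    // Step 1 happens on the CPU: the job is launched and matrices assembled.
    double *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(double);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // Step 2: solver operands are sent to the GPU over PCI-Express.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const double one = 1.0, zero = 0.0;
    // The heavy computation runs on the GPU: C = A * B.
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dC, n);

    // Step 3: the GPU sends results back to the CPU.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    // Step 4: the job completes on the CPU (global solution, output).
}
```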
ANSYS and NVIDIA Collaboration Roadmap

Release 13.0 (Dec 2010)
  ANSYS Mechanical: SMP, single GPU, sparse and PCG/JCG solvers
  ANSYS EM: ANSYS Nexxim

Release 14.0 (Dec 2011)
  ANSYS Mechanical: + Distributed ANSYS; + multi-node support
  ANSYS Fluent: radiation heat transfer (beta)
  ANSYS EM: ANSYS Nexxim

Release 14.5 (Nov 2012)
  ANSYS Mechanical: + multi-GPU support; + hybrid PCG; + Kepler GPU support
  ANSYS Fluent: + radiation HT; + GPU AMG solver (beta), single GPU
  ANSYS EM: ANSYS Nexxim

Release 15.0 (Q4-2013)
  ANSYS Mechanical: + CUDA 5 Kepler tuning
  ANSYS Fluent: + multi-GPU AMG solver; + CUDA 5 Kepler tuning
  ANSYS EM: ANSYS Nexxim; ANSYS HFSS (transient)
ANSYS 15.0 License Scheme for GPUs
One HPC task is required to unlock one GPU.
(Applies to all schemes: HPC, HPC Pack, HPC Workgroup, HPC Enterprise, etc.)
Examples:
1 x ANSYS HPC Pack: total 8 HPC tasks (4 GPUs max)
  Valid configurations: 6 CPU cores + 2 GPUs, or 4 CPU cores + 4 GPUs
2 x ANSYS HPC Pack: total 32 HPC tasks (16 GPUs max)
  Total use of 2 servers: 24 CPU cores + 8 GPUs (3:1 ratio)
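A small sketch of that task accounting. This is my reading of the examples on this slide, not official ANSYS licensing logic: each parallel CPU core and each GPU consumes one HPC task, and at most half the tasks may unlock GPUs.

```
// Sketch of the HPC-task accounting on this slide (my inference from the
// examples shown; not official licensing logic).
#include <cstdio>

bool valid_config(int tasks, int cores, int gpus) {
    // Assumptions: one task per core, one task per GPU,
    // GPUs capped at half the task count (8 tasks -> 4 GPUs max).
    return cores >= 1 && gpus <= tasks / 2 && cores + gpus <= tasks;
}

int main() {
    // 1 HPC Pack = 8 tasks: both slide examples check out.
    std::printf("%d\n", valid_config(8, 6, 2));    // 6 cores + 2 GPUs  -> 1
    std::printf("%d\n", valid_config(8, 4, 4));    // 4 cores + 4 GPUs  -> 1
    // 2 HPC Packs = 32 tasks: 24 cores + 8 GPUs across 2 servers.
    std::printf("%d\n", valid_config(32, 24, 8));  //                  -> 1
    return 0;
}
```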
ANSYS Mechanical Number of Jobs Per Day
ANSYS Mechanical 14.5 GPU acceleration: results for Distributed ANSYS 14.5 with 8-core CPUs and single GPUs on the V14sp-5 benchmark; higher is better.
[Bar chart, jobs per day, CPU-only vs. CPU + GPU:]
  Westmere, Xeon X5690 3.47 GHz, 8 cores: 164 CPU-only; 341 with Tesla C2075 (2.1x acceleration)
  Sandy Bridge, Xeon E5-2687W 3.10 GHz, 8 cores: 210 CPU-only; 395 with Tesla K20 (1.9x acceleration)
Model: turbine geometry, 2,100,000 DOF, SOLID187 finite elements, static nonlinear; one iteration timed (final solution requires 25). Distributed ANSYS 14.5, direct sparse solver. Results from a Supermicro X9DR3-F with 64 GB memory.
GPU Acceleration in ANSYS Fluent
Beta release in 14.5; full product support in 15.0 (Dec 2013)
GPU-based model: radiation heat transfer using OptiX, product in 14.5
GPU-based solver: coupled algebraic multigrid (AMG) PBNS linear solver
Operating systems: both Linux and Win64 for workstations and servers
Parallel methods: shared memory in 14.5; Distributed ANSYS in 15.0
Multi-GPU support: single GPU for 14.5; full multi-GPU, multi-node in 15.0
Model suitability: sizes of 3M cells or less in 14.5; unlimited in 15.0
ANSYS Fluent 14.5 and Radiation HT on GPU
VIEWFAC utility: use on CPUs, GPUs, or both, with ~2x speedup
RAY TRACING utility: uses the OptiX library from NVIDIA with up to ~15x speedup (use on GPU only)
Radiation HT applications:
- Underhood cooling
- Cabin comfort HVAC
- Furnace simulations
- Solar loads on buildings
- Combustor in turbine
- Electronics passive cooling
GPU-based AMG Solver for ANSYS Fluent 15.0
New ANSYS Fluent AMG based on NVIDIA-developed solver toolkit
Developed with support for MPI across multiple nodes and multiple GPUs
Solver collaboration on coupled pressure-based Navier-Stokes, others to follow
Early results published at Parallel CFD 2013, 20-24 May, Changsha, CN
GPU-Accelerated Algebraic Multigrid for Applied CFD
ANSYS Fluent Profile for Coupled PBNS Solver
[Flowchart of the nonlinear iteration loop:]
1. Assemble the linear system of equations (~35% of runtime)
2. Solve the linear system Ax = b (~65% of runtime): accelerate this first
3. If not converged, repeat; otherwise stop.
Since only the ~65% solve phase is offloaded, Amdahl's law caps the overall speedup near 1/0.35, or about 2.9x, which is consistent with the solver speedups reported on the following slides.
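Inside the solve, AMG cycles are dominated by sparse matrix-vector products. As a hedged illustration (this talk does not show the nvAMG internals), a minimal CUDA kernel for SpMV on a CSR-format matrix, one thread per row:

```
// Illustrative CUDA kernel for y = A*x with A in CSR format, the kind of
// primitive that dominates AMG solve time. Not nvAMG source code.
__global__ void csr_spmv(int nrows, const int* rowPtr, const int* colIdx,
                         const double* vals, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row < nrows) {
        double sum = 0.0;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += vals[j] * x[colIdx[j]];
        y[row] = sum;
    }
}

// Launch example, 256 threads per block:
// csr_spmv<<<(nrows + 255) / 256, 256>>>(nrows, rowPtr, colIdx, vals, x, y);
```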
ANSYS Fluent Performance for Single Tesla K20X
ANSYS Fluent 14.5 AMG solver time per iteration (seconds); results by NVIDIA, Nov 2012. Airfoil and aircraft models with hexahedral cells; lower is better.
Hardware: 2 x Xeon E5-2680 CPUs (only 8 cores used) vs. one Tesla K20X.
[Bar chart:] GPU speedups of 1.8x and 1.9x over 8 CPU cores on the airfoil (hex, 784K cells) and aircraft (hex, 1,798K cells) models.
Solver settings: CPU Fluent solver: F-cycle, agg8, DILU, 0 pre, 3 post. GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre, 3 post.
NOTE: times are for the solver only.
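The "F-cycle" and "V-cycle" names in these settings refer to the multigrid control flow. Below is a minimal sketch of a V-cycle reduced to stub grid operations, wired to match the "0pre, 3post" smoothing settings above; the level count is an assumption for illustration, and nvAMG's actual implementation will differ.

```
// Hedged sketch of a multigrid V-cycle (host-side control flow only).
// The grid operations are stubs; nvAMG's real implementation is not public.
#include <cstdio>

static const int COARSEST = 3;        // assumed hierarchy depth

void smooth(int level, int sweeps)    { /* e.g. DILU / MC-DILU sweeps */ }
void restrict_residual(int level)     { /* "agg8": aggregate ~8 cells */ }
void prolongate_correction(int level) { /* interpolate back to fine grid */ }
void coarse_solve(int level)          { /* small direct or heavy solve */ }

void v_cycle(int level) {
    smooth(level, 0);                 // "0pre": no pre-smoothing sweeps
    if (level < COARSEST) {
        restrict_residual(level);
        v_cycle(level + 1);           // one recursive visit per level: a "V"
        prolongate_correction(level);
    } else {
        coarse_solve(level);
    }
    smooth(level, 3);                 // "3post": three post-smoothing sweeps
}

int main() { v_cycle(0); return 0; }
```

An F-cycle revisits coarse levels more than once per descent, which typically converges in fewer iterations at a higher cost per cycle.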
Multi-GPU Preview
ANSYS Fluent 15.0, available late 2013 (Preview 3 from Aug 2013)
GPUs and Distributed Cluster Computing
[Diagram: geometry decomposed into partitions 1-4, placed on independent cluster nodes N1-N4]
Geometry is decomposed and the partitions are placed on independent cluster nodes; the nodes run CPU distributed parallel processing using MPI, and the partition results are combined into the global solution.
GPUs and Distributed Cluster Computing
[Diagram: the same decomposition, with each node N1-N4 paired with a GPU G1-G4]
Execution on CPU + GPU: geometry is decomposed and partitions are placed on cluster nodes as before, distributed parallel using MPI. Within each node, the GPUs run shared-memory parallel (OpenMP) underneath the distributed level, and the results are combined into the global solution.
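A hedged sketch of the hybrid layout this slide describes: one MPI rank per partition, each binding to a node-local GPU. Illustrative only, not ANSYS Fluent source; it assumes ranks are placed consecutively within each node.

```
// Illustrative MPI + CUDA skeleton for the layout above: MPI ranks own the
// mesh partitions (N1..N4) and each rank binds to a local GPU (G1..G4).
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    // Assumes consecutive rank placement per node; rank 0..3 on node 1, etc.
    cudaSetDevice(rank % ndev);

    // ... per-partition work runs here: the CPU assembles its partition,
    //     the GPU accelerates the solve, and halo data is exchanged
    //     between partitions with MPI point-to-point calls ...

    MPI_Finalize();
    return 0;
}
```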
ANSYS Fluent Solver Times for 2 CPUs + 2 GPUs
ANSYS Fluent 15.0 preview performance; results by NVIDIA, Feb 2013. Lower is better.
Hardware: 2 x Xeon E5-2680 Sandy Bridge CPUs (16 cores total, only 2 cores used with the GPUs) vs. 2 x Tesla K20X.
[Bar chart, solver time per iteration:] speedups of 2.1x on the helix model (tet, 1,173K cells) and 1.7x on the airfoil model (hex, 784K cells).
Solver settings: CPU Fluent solver: F-cycle, agg8, DILU, 0 pre, 3 post. GPU nvAMG solver: V-cycle, agg8, MC-DILU, 0 pre, 3 post.
NOTE: times are for the solver only.
ANSYS Fluent Solver Times for Sedan – 4 GPUs
Multi-GPU acceleration of ANSYS Fluent 15.0 (preview), external aero: 2.9x solver speedup.
Model: 3.6M mixed cells; steady, k-epsilon turbulence; coupled PBNS, double precision; AMG F-cycle on CPU, AMG V-cycle on GPU.
Configurations: CPU: one 16-core server node (Xeon E5-2667). CPU + GPU: the same 16 cores plus 4 x Tesla K20X GPUs (2 GPUs per 8-core socket).
ANSYS Fluent Solution Times for Sedan Case
ANSYS Fluent number of jobs per day, 15.0 Preview 3 performance; results by NVIDIA, Sep 2013. Higher is better.
Sedan model: 3.6M mixed cells; steady, k-epsilon turbulence; coupled PBNS, double precision; AMG F-cycle on CPU, AMG V-cycle on GPU.
Hardware: 2 x Xeon E5-2680 Sandy Bridge CPUs (16 cores total, only 8 cores used in the study).
[Bar chart:] segregated solver, 8 cores: 12 jobs per day. Coupled solver, 8 cores + 2 GPUs: 27 jobs per day, a 1.9x acceleration over the 8-core coupled run.
NOTE: all results fully converged.
ANSYS Fluent Convergence for Truck Case
Truck body model: 14M mixed cells; steady, k-epsilon turbulence; PBNS, double precision; default URFs for each run. CPU: AMG F-cycle. GPU: FGMRES with AMG preconditioner.
Coupled PBNS shows stable convergence of the drag coefficient at ~500 iterations; segregated PBNS shows oscillating drag-coefficient behavior and is not converged after ~6000 iterations.
ANSYS Fluent Solution Times for Truck Case
• Same solution times:
64 cores vs.
32 cores + 8 GPUs
• Frees up 32 CPUs
and HPC licenses for
additional job(s)
• Approximate 56%
increase in overall
productivity for 25%
increase in cost
ANSYS Fluent Number of Jobs Per Day
ANSYS Fluent 15.0 Preview 3 Performance – Results by NVIDIA, Sep 2013
25
Truck Body Model
Higher
is
Better
20
15
16
16
64 Cores
32 Cores
+ 8 GPUs
4 x Nodes x 2 CPUs
(64 Cores Total)
2 x Nodes x 2 CPUs
(32 Cores Total)
8 GPUs (4 each Node)
10
5
14 M Mixed cells
Steady, k-e turbulence
Coupled PBNS, DP
Total solution times
CPU: AMG F-cycle
GPU: FGMRES with
AMG Preconditioner
0
NOTE: All results
fully converged
30
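A worked version of that productivity arithmetic. The jobs-per-day figures for the two 16-job configurations come from the slide; the throughput of the freed 32 cores and the 25% hardware-cost delta are assumptions used to reproduce the stated 56%.

```
// Worked productivity arithmetic for the truck case, under stated
// assumptions (marked below); not data taken from the slide.
#include <cstdio>

int main() {
    double jobs_64core       = 16.0;  // from the slide: 4-node CPU cluster
    double jobs_32core_8gpu  = 16.0;  // from the slide: same throughput
    double jobs_freed_32core = 9.0;   // ASSUMPTION: rate of the freed half

    double before = jobs_64core;                          // 16 jobs/day
    double after  = jobs_32core_8gpu + jobs_freed_32core; // ~25 jobs/day
    std::printf("productivity gain: %.0f%%\n",
                100.0 * (after / before - 1.0));          // ~56%
    // Cost side (ASSUMPTION): 8 GPUs add ~25% to the 4-node hardware cost.
    return 0;
}
```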
Additional Information
Configuration Details on Workstation or Server
www.nvidia.com/teslawtb
www.nvidia.com/workstationwtb
Test-drive GPU Computing with Your ANSYS Simulations
Contact ANSYS or email [email protected]
Read More About ANSYS and GPU Computing
ANSYS Unveils GPU Computing for Accelerated Engineering Simulations
Speed Up Simulations with a GPU, Article in ANSYS Advantage Magazine
Speeding to a Solution, Article in ANSYS Advantage Magazine
HPC Delivers a 3-D View, Article in ANSYS Advantage Magazine
For More Information on NVIDIA and ANSYS Solutions
www.nvidia.com/ansys
Acknowledgements
ANSYS
Mr. Jeff Beisheim, ANSYS Mechanical Parallel Solver Development
Dr. Sunil Sathe, ANSYS Fluent Parallel Solver Development
Dr. Prasad Alavilli, Manager, ANSYS Fluent Parallel HPC Development
www.ansys.com
NVIDIA
Mr. Jon Cohen, Manager, NVIDIA Computational Library Development
Dr. Joe Eaton, Manager, NVIDIA Linear Solver Tool Kit Development
Dr. Steve Rennich, Applications Engineer, Developer Technology Group
Dr. Bhushan Desam, ANSYS Alliances and CFD Market Development
Mr. Vijay Sellappan, Applications Engineer, CAE Technology
www.nvidia.com
Stan Posey
NVIDIA, Santa Clara, CA, USA; [email protected]