GPU multiprocessing - Prace Training Portal

Transcription

GPU multiprocessing
Manuel Ujaldón Martínez
Computer Architecture Department
University of Malaga (Spain)
Outline
1. Multichip solutions [10 slides]
2. Multicard solutions [2 slides]
3. Multichip + multicard [3 slides]
4. Performance on matrix decompositions [2 slides]
5. CUDA programming [5 slides]
6. Scalability on 3DFD [4 slides]
A world of possibilities
From lower to higher cost, we have:
1. Multichip: Voodoo5 (3Dfx), 3D1 (Gigabyte).
2. Multicard: SLI (Nvidia) / CrossFire (ATI).
3. Combination: two chips/card and/or two cards/connector. Example: Evans & Sutherland (2004).
[Pictured: Gigabyte (2005), NVIDIA (2007), ATI (2007), NVIDIA (2008).]
I. Multichip solutions
First choice: Multichip. A retrospective:
• Voodoo 5 5500 (3Dfx, 1999).
• Rage Fury Maxx (ATI, 2000).
• Volari V8 Duo (XGI, 2002).
• Dual Radeon 9800 prototype (Sapphire, 2003).
First choice: Multichip.
Example 1: 3D1 (Gigabyte - 2005).
• A double GeForce 6600 GT GPU on the same card (December 2005).
• Each GPU endowed with 128 MB of memory and a 128-bit bus.
First choice: Multichip.
Example 2: GeForce 7950 GX2 (Nvidia – 2006)
First choice: Multichip.
Example 3: GeForce 9800 GX2 (Nvidia - 2008)
• Double GeForce 8800 GPU, double printed circuit board and double video memory of 512 MB. A single PCI-express connector.
First choice: Multichip.
3D1 (Gigabyte). Cost and performance
Card                   | 3DMark 2003, 1024x768 | 3DMark 2003, 1600x1200 | 3DMark 2005, 1024x768 | 3DMark 2005, 1600x1200
GeForce 6600 GT        |                  8234 |                   2059 |                  3534 |                   2503
3D1 using a single GPU |                  8529 |                   2063 |                  3572 |                   2262
GeForce 6800 GT        |                 11493 |                   3846 |                  4858 |                   3956
GeForce 6600 GT SLI    |                 14049 |                   3924 |                  6122 |                   3542
3D1 using two GPUs     |                 14482 |                   4353 |                  6307 |                   3609
Cost: row 3 (GeForce 6800 GT) > row 4 (GeForce 6600 GT SLI) > row 5 (3D1 using two GPUs) > row 1 (GeForce 6600 GT).
First choice: Multichip.
3D1 (Gigabyte). Analysis.
• As compared to a single GeForce 6800 GT, the 3D1 has:
  - Lower cost.
  - Higher arithmetic performance, with the biggest advantage at lower resolutions and with shader-based software innovations.
  - Similar bandwidth.
  - Less memory space and usability:
    - Vertices and textures must be replicated.
    - A GPU cannot see the memory of its twin.
• As compared to two GeForce 6600 GTs connected through SLI:
  - Slightly lower cost.
  - Greater performance without demanding CPU bandwidth.
  - Less versatile for future expansion and/or single-card use.
First choice: Multichip. GeForce 7950 GX2
(2006)
• GPU developed by Nvidia in June 2006. The GPU has a “twin soul”: duality shaped its design.
• Clocks are slower than in the single-GPU model:
  - GPU: 500 MHz (twin) versus 650 MHz (stand-alone).
  - Memory: 2x600 MHz (twin) versus 2x800 MHz (stand-alone).
• Drivers were released almost a year later, which initially hurt the card's popularity.
• It offers 48 pixel processors (24 on each GPU) and 1 GB of video memory (512 MB connected to each GPU through its own 256-bit bus).
First choice: Multichip (2006). Transistors.
• A smaller chip with smaller transistors allows growth through GPU replication.
First choice: Multichip (2006). Frequency.
• A dual GPU allows clocks to be relaxed, with less heat and lower power consumption.
First choice: Multichip (2006).
Bandwidth.
• Two GPUs placed on parallel planes make it easier to double the bus width to 512 bits.
II. Multicard solutions
Second choice: Multicard.
A couple of GPUs:
• SLI (Nvidia on GeForces).
• CrossFire (ATI on Radeons).
Second choice: Multicard.
SLI (Nvidia). Elements.
• The motherboard must have several PCI-express 2.0 x16 slots.
• The power supply must deliver at least 700 Watts.
• Performance issues: a twin card may increase performance by 60-80%; a new generation of GPUs may increase it even more. The time frame becomes crucial!
III. Multichip + multicard
1+2 choice: Multichip+multicard
• First solution available on the marketplace: Gigabyte (2005), based on GeForce 6 GPUs. It allows heterogeneous graphics cards, but workload balancing gets complicated.
1+2 choice: Multichip+multicard.
Implementation details
1+2 choice: Multichip+multicard.
Newer designs
• A number of GeForce 9800 GX2 cards are combined with a multi-socket motherboard to configure up to quad-SLI: 2 GPUs/card x up to 4 cards = 8 GPUs.
[Pictured configurations: 2 GPUs, 4 GPUs, 8 GPUs.]
IV. Performance on
matrix decompositions
Multicard performance versus a newer
generation (LU decomposition)
• A second (twin) GPU improves performance by 1.6x, but does not reach the performance of a single card from the next generation.
CPU+GPU performance versus
a single quad-core CPU (more on this later)
• The benchmark is composed of three popular matrix decompositions used in linear algebra.
V. CUDA programming
for multi-GPU applications
Device Management
• The CPU can query and select GPU devices:
  - cudaGetDeviceCount( int *count )
  - cudaSetDevice( int device )
  - cudaGetDevice( int *current_device )
  - cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
  - cudaChooseDevice( int *device, cudaDeviceProp* prop )
• Multi-GPU setup:
  - Device 0 is used by default.
  - One CPU thread can control only one GPU.
  - Multiple CPU threads can control the same GPU (calls are serialized by the driver).
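A minimal sketch of how these calls fit together within one CPU thread (error checking omitted; picking device 1 here is arbitrary, and the printed messages are illustrative only):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);              /* how many CUDA devices are visible */
        printf("%d CUDA device(s) found\n", count);

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev); /* query the properties of device dev */
            printf("Device %d: %s, %.0f MB of global memory\n",
                   dev, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0));
        }

        /* Select a device explicitly; device 0 would be used by default otherwise. */
        if (count > 1)
            cudaSetDevice(1);

        int current = -1;
        cudaGetDevice(&current);                 /* which device does this CPU thread control? */
        printf("This CPU thread now controls device %d\n", current);
        return 0;
    }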
Multiple CPU Threads and CUDA
• CUDA resources allocated by a CPU thread can be consumed only by CUDA calls from the same CPU thread.
• Violation example:
  - CPU thread 2 allocates GPU memory and stores its address in p.
  - CPU thread 3 issues a CUDA call that accesses memory via p.
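This reflects the early CUDA model in which each CPU thread owns its own context. A hedged sketch of the safe pattern that follows from it, one OpenMP thread per GPU with no pointer ever shared between threads; the fill_kernel name, the vector size and the use of OpenMP are assumptions for illustration:

    #include <omp.h>
    #include <cuda_runtime.h>

    /* Placeholder kernel: each element receives its own index. */
    __global__ void fill_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = (float)i;
    }

    int main(void)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);
        if (num_gpus < 1) return 0;

        /* One CPU thread per GPU: each thread selects its own device and only
           ever uses the pointers it allocated itself, so no CUDA resource is
           touched from a different CPU thread. */
        #pragma omp parallel num_threads(num_gpus)
        {
            int gpu = omp_get_thread_num();
            cudaSetDevice(gpu);

            const int n = 1 << 20;
            float *d_data = NULL;
            cudaMalloc((void **)&d_data, n * sizeof(float));
            fill_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
            cudaThreadSynchronize();          /* wait for this GPU to finish */
            cudaFree(d_data);
        }
        return 0;
    }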
When using several GPUs, the
implementation gets complicated
• GPUs don't share video memory, so the programmer must move data across PCI-express (even when the GPUs belong to the same graphics card, as in the GeForce 9800 GX2).
• Steps to follow (sketched in the code below):
  - Copy data from GPU A to CPU thread A.
  - Copy data from CPU thread A to CPU thread B using MPI.
  - Copy data from CPU thread B to GPU B.
• We can use asynchronous copies to overlap kernel execution on the GPU with data copies, and “pinned memory” to share copies among CPU threads (use cudaHostAlloc()).
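A hedged sketch of those three copy steps, assuming one MPI rank per GPU (two per node), rank 0 as the sender and rank 1 as the receiver, and a pinned staging buffer obtained with cudaHostAlloc(); the buffer size and message tag are illustrative:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaSetDevice(rank % 2);               /* e.g. two GPUs per cluster node */

        const int count = 1 << 20;
        float *h_buf, *d_buf;
        cudaHostAlloc((void **)&h_buf, count * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void **)&d_buf, count * sizeof(float));

        if (rank == 0) {
            /* Step 1: copy data from GPU A to CPU thread A. */
            cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
            /* Step 2: copy data from CPU thread A to CPU thread B using MPI. */
            MPI_Send(h_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Step 3: copy data from CPU thread B to GPU B. */
            cudaMemcpy(d_buf, h_buf, count * sizeof(float), cudaMemcpyHostToDevice);
        }

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        MPI_Finalize();
        return 0;
    }

Replacing the cudaMemcpy calls with cudaMemcpyAsync on the pinned buffer is what enables the overlap between kernel execution and data copies mentioned in the last bullet.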
Host Synchronization
• All kernel launches are asynchronous:
  - Control returns to the CPU immediately.
  - The kernel executes after all previous CUDA calls have completed.
• cudaMemcpy is synchronous:
  - Control returns to the CPU after the copy completes.
  - The copy starts after all previous CUDA calls have completed.
• cudaThreadSynchronize():
  - Blocks until all previous CUDA calls complete.
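A small sketch of those rules in action; the scale kernel and the vector size are placeholders (cudaThreadSynchronize() is the call named on the slide; later CUDA releases call it cudaDeviceSynchronize()):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *v, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= a;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h = (float *)calloc(n, sizeof(float));
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));

        /* Synchronous: control returns only after the host-to-device copy completes. */
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        /* Asynchronous: control returns to the CPU immediately; the kernel runs
           after all previous CUDA calls have completed. */
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

        /* Independent CPU work placed here would overlap with the kernel. */

        cudaThreadSynchronize();   /* block until all previous CUDA calls complete */

        /* This copy would in any case start only after the kernel has finished. */
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d);
        free(h);
        return 0;
    }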
CPU↔GPU interactions: Conclusions
• CPU↔GPU memory bandwidth is much lower than GPU memory bandwidth.
• Use page-locked host memory (cudaMallocHost()) for maximum CPU↔GPU bandwidth:
  - 3.2 GB/s is common on PCI-e x16.
  - ~4 GB/s measured on nForce 680i chipsets (8 GB/s for PCI-e 2.0).
  - Be cautious, however, since allocating too much page-locked memory can reduce overall system performance.
• Minimize CPU↔GPU data transfers by moving more code from the CPU to the GPU:
  - Even if that means running kernels with low parallelism.
  - Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to CPU memory.
• Group data transfers:
  - One large transfer is much better than many small ones.
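A short sketch of two of these recommendations, page-locked allocation with cudaMallocHost() and one grouped transfer instead of many small ones; the buffer size is arbitrary:

    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t n = 1 << 22;
        float *h_pinned, *d_data;

        /* Page-locked (pinned) host memory gives the maximum CPU <-> GPU bandwidth,
           but allocating too much of it can hurt overall system performance. */
        cudaMallocHost((void **)&h_pinned, n * sizeof(float));
        cudaMalloc((void **)&d_data, n * sizeof(float));

        /* Group data transfers: one large copy instead of many small ones. */
        cudaMemcpy(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);

        /* ... keep intermediate data structures on the GPU: allocate, operate on
           and deallocate them here without ever copying them back to the CPU ... */

        cudaMemcpy(h_pinned, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        cudaFreeHost(h_pinned);
        return 0;
    }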
VI. Scalability for 3DFD
(Nvidia code)
Example: Multi-GPU implementation
for 3DFD
• 3DFD is a finite-differences code for the discretization of the seismic wave equation:
  - 8th order in space, 2nd order in time.
  - Using a regular mesh.
  - Fixed X and Y dimensions, varying Z.
• Data is partitioned among GPUs along the Z axis:
  - Computation increases with Z; communication (per node) stays constant.
  - A GPU has to exchange 4 xy-planes (ghost nodes) with each of its neighbors.
• Executed on a cluster with 2 GPUs per node and an InfiniBand SDR network.
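Nvidia's 3DFD source is not reproduced here; what follows is only a hedged sketch of the partitioning and ghost-plane exchange just described, assuming one MPI rank per GPU, a local slab of nz xy-planes padded with RADIUS = 4 ghost planes on each side, pinned staging buffers, and exchange in one direction only (the other direction is analogous). All names and sizes are illustrative:

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define RADIUS 4   /* 8th order in space -> 4 ghost xy-planes per side */

    /* Send the topmost RADIUS interior planes to the neighbour above and receive
       RADIUS ghost planes from the neighbour below. The slab layout is:
       [RADIUS ghosts | nz interior planes | RADIUS ghosts], each plane nx*ny floats. */
    void exchange_up(float *d_slab, float *h_send, float *h_recv,
                     int nx, int ny, int nz, int up, int down)
    {
        size_t plane = (size_t)nx * ny;
        size_t bytes = RADIUS * plane * sizeof(float);

        cudaMemcpy(h_send, d_slab + (size_t)nz * plane, bytes, cudaMemcpyDeviceToHost);
        MPI_Sendrecv(h_send, (int)(RADIUS * plane), MPI_FLOAT, up,   0,
                     h_recv, (int)(RADIUS * plane), MPI_FLOAT, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (down != MPI_PROC_NULL)
            cudaMemcpy(d_slab, h_recv, bytes, cudaMemcpyHostToDevice);  /* bottom ghosts */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        cudaSetDevice(rank % 2);                       /* two GPUs per cluster node */

        int nx = 480, ny = 480, nz = 800 / size;       /* fixed X and Y, Z split over GPUs */
        int up   = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
        int down = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;

        size_t plane = (size_t)nx * ny;
        float *d_slab, *h_send, *h_recv;
        cudaMalloc((void **)&d_slab, (nz + 2 * RADIUS) * plane * sizeof(float));
        cudaHostAlloc((void **)&h_send, RADIUS * plane * sizeof(float), cudaHostAllocDefault);
        cudaHostAlloc((void **)&h_recv, RADIUS * plane * sizeof(float), cudaHostAllocDefault);

        exchange_up(d_slab, h_send, h_recv, nx, ny, nz, up, down);
        /* ... launch the finite-difference kernel on the interior planes here ... */

        MPI_Finalize();
        return 0;
    }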
Performance for a couple of GPUs
• Linear scaling is achieved when computation time exceeds communication time.
Three or more cluster nodes
• Times are per cluster node.
• At least one cluster node needs two MPI communications, one with each of its neighbors.
Performance with 8 GPUs
• An 8x improvement factor is sustained for Z > 1300, exactly where computation exceeds communication.