GPU Programming with CUDA

Transcription

GPU Programming with CUDA David McCaughan, HPC Software Analyst, SHARCNET, University of Guelph This material is adapted from notes originally written by Lukasz Wawrzyniak (PhD, University of Guelph) Overview •  Introduction to GPU programming •  Introduction to CUDA •  Hands-on –  learn by doing: development of a “Hello, world!” application •  CUDA extensions to the C programming language •  Beyond the basics… Summer School 2010
D. McCaughan
Introduction to GPU programming •  A graphics processing unit (GPU) is a processor whose main job is to accelerate the rendering of 3D graphics primitives  
Modern GPUs feature a programmable graphics pipeline, which means that it is possible to write custom programs that execute on the GPU to process graphics data  
these programs are called shaders, because their original task was to compute shading and lighting for 3D scenes
D. McCaughan
A brief tour of graphics programming •  2D textures are wrapped around 3D meshes to assign colour to individual pixels on screen •  Lighting and shadow are applied to bring out 3D features •  Shaders allow programmers to define custom shadow and lighting techniques –  can also combine multiple textures in interesting ways •  Resulting pixels get sent to a framebuffer for display on the monitor
D. McCaughan
Introduction to GPGPU •  With the introduction of programmable graphics pipelines, people realized that a GPU can be used for more than graphics –  General Purpose computing on Graphics Processing Units •  We could disguise some arbitrary data as “textures”, upload and execute an arbitrary program as a “shader”, then have the results sent to an off-­‐screen “framebuffer” •  Programs could be created using shading languages such as GLSL, Cg and HLSL •  Communication with the device could be managed using a graphics API like OpenGL or Direct3D Summer School 2010
D. McCaughan
Introduction to GPGPU (cont.) •  This was cool for a while, but because all computations occurred within the graphics pipeline, there were limitations: –  limited inputs/outputs –  limited data types, graphics-specific semantics –  memory and processor optimized for short vectors, 2D textures –  lack of communication or synchronization between “threads” –  no writes to random memory locations (scatter) –  no in-out textures (had to ping-pong in multi-pass algorithms) –  graphics API overhead –  graphics API learning curve
D. McCaughan
Beyond the Graphics APIs •  NVIDIA now offers CUDA while AMD has moved toward OpenCL (although there are lower-level APIs available for AMD) •  These new computing platforms bypass the graphics pipeline and expose the raw computational capabilities of the hardware •  Modern GPU architectures provide flexible addressing modes (gather/scatter), shared memory and thread synchronization primitives •  Coming over the horizon is OpenCL –  a vendor-agnostic computing platform –  supports vendor-specific extensions akin to OpenGL –  goal is to support a range of hardware architectures including GPUs, CPUs, Cell processors, Larrabee and DSPs using a standard low-level API
D. McCaughan
The appeal of GPGPU •  “Supercomputing for the masses” –  significant computational horsepower at an attractive price point –  readily accessible hardware •  Scalability –  programs can execute without modification on a run-of-the-mill PC with a $150 graphics card or a dedicated multi-card supercomputer worth thousands of dollars •  Bright future – the computational capability of GPUs doubles each year –  more thread processors, faster clocks, faster DRAM, … –  “GPUs are getting faster, faster”
D. McCaughan
Comparing GPUs and CPUs
•  CPU
–  “Jack of all trades”
–  task parallelism (diverse tasks)
–  minimize latency
–  multithreaded
–  some SIMD
•  GPU
–  excel at number crunching
–  data parallelism (single task)
–  maximize throughput
–  super-threaded
–  large-scale SIMD
D. McCaughan
Stream computing •  A parallel processing model where a computational kernel is applied to a set of data (a stream) –  the kernel is applied to stream elements in parallel [Figure: an input stream (5 1 3 8 2 3 6 7 7 3 4 5) passes through the kernel to produce the output stream (6 2 4 9 3 4 7 8 8 4 5 6)] •  GPUs excel at this thanks to a large number of processing units and a parallel architecture (a minimal kernel sketch follows)
D. McCaughan
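A minimal sketch (not part of the original slides) of the stream model in CUDA: the kernel shown in the figure above adds 1 to each stream element, one element per thread. The kernel name and parameters are illustrative; the index computation is explained later in the deck.

/* Illustrative stream kernel: each thread transforms one element */
__global__ void add_one(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* element handled by this thread */
    if (i < n)
        out[i] = in[i] + 1.0f;                      /* apply the kernel to the stream element */
}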
Beyond stream computing •  Current GPUs offer functionality that goes beyond mere stream computing •  Shared memory and thread synchronization primitives eliminate the need for data independence •  Gather and scatter operations allow kernels to read and write data at arbitrary locations Summer School 2010
D. McCaughan
CUDA •  “Compute Unified Device Architecture” •  A platform that exposes NVIDIA GPUs as general purpose compute devices •  Is CUDA considered GPGPU? –  yes and no •  CUDA can execute on devices with no graphics output capabilities (the NVIDIA Tesla product line) – these are not “GPUs”, per se •  however, if you are using CUDA to run some generic algorithms on your graphics card, you are indeed performing some General Purpose computation on your Graphics Processing Unit…
D. McCaughan
What is CUDA used for? •  CUDA has been used in many different areas
–  options pricing in finance
–  electromagnetic simulations
–  fluid dynamics
–  GIS
–  geophysical data processing
–  3D visualization solutions
–  …
•  See http://www.nvidia.com/object/cuda_home.htm
–  long list of projects and speedups achieved
D. McCaughan
Speedup •  What kind of speedup can I expect? –  0x – 2000x reported –  10x – 80x considered typical –  >= 30x considered worthwhile •  Speedup depends on –  problem structure •  need many identical independent calculations •  preferably sequential memory access –  level of intimacy with hardware –  time investment Summer School 2010
D. McCaughan
CUDA programming model •  The main CPU is referred to as the host •  The compute device is viewed as a coprocessor capable of executing a large number of lightweight threads in parallel •  Computation on the device is performed by kernels, functions executed in parallel on each data element •  Both the host and the device have their own memory –  the host and device cannot directly access each other’s memory, but data can be transferred using the runtime API •  The host manages all memory allocations on the device, data transfers, and the invocation of kernels on the device Summer School 2010
D. McCaughan
GPU applications •  The GPU can be utilized in different capacities •  One is to use the GPU as a massively parallel coprocessor for number crunching applications
–  upload data and kernel to GPU
–  execute kernel
–  download results
–  CPU and GPU can execute asynchronously
•  Some applications use the GPU for both data crunching and visualization –  CUDA has bindings for OpenGL and Direct3D
D. McCaughan
GPU as coprocessor [Figure: host and device timelines – kernel execution is asynchronous; asynchronous memory transfers are also available] •  Basic paradigm –  host uploads inputs to device –  host remains busy while device performs computation •  prepare next batch of data, process previous results, etc. –  host downloads results •  Can be iterative or multi-stage
D. McCaughan
Simulation + visualization [Figure: host and device timelines] •  Basic paradigm –  host uploads inputs to device –  host may remain busy while device performs computation •  prepare next batch of data, etc. –  results used on device for rendering, no download to host
D. McCaughan
Simulation + visualization (cont.) Fluids Demo N-­‐Body Demo Summer School 2010
D. McCaughan
HANDS-ON: SHARCNET systems •  See also: https://www.sharcnet.ca/help/index.php/GPU_Accelerated_Computing
•  angel.sharcnet.ca –  11 NVIDIA Tesla S1070 GPU servers •  each with 4 GPUs + 16GB of global memory •  each GPU server connected to two compute nodes (2 4-core Xeon CPUs + 8GB RAM each) •  1 GPU per quad-core CPU; 1:1 memory ratio between GPUs/CPUs –  CUDA installed in /opt/sharcnet/cuda-toolkit+sdk/current
export CUDA_INSTALL_PATH=/opt/sharcnet/cuda-toolkit+sdk/current
–  sample projects in ${CUDA}/C •  copy to your work space (e.g. /work/username/cuda_sdk) & compile:
cp -R ${CUDA}/C /work/username/cuda_projects
export CUDA_BIN=/work/username/cuda_projects/bin/linux/release
Summer School 2010
D. McCaughan
Querying device parameters •  Source code: ${CUDA_INSTALL_PATH}/C/deviceQuery
–  if you copied this to your work space as described on previous page, you have a copy of this code •  Run: ${CUDA_BIN}/deviceQuery –  note output running on login node –  now try again on one of the compute nodes (submit it) Summer School 2010
D. McCaughan
Submitting GPU jobs •  See GPU Accelerated Computing article on training wiki for maximum detail –  note: queue details (mpi vs. gpu – test queue oddities) •  For Summer School purposes we have a queue set up for you to be able to submit single-process, single-GPU jobs –  essential usage: sqsub -q ss -n 1 -o output.file -r 5 ./executable
Thread batching •  To take advantage of the multiple multiprocessors, kernels are executed as a grid of thread blocks •  All threads in a thread block are executed by a single multiprocessor •  The resources of a multiprocessor are divided among the threads in a block (registers, shared memory, etc.) –  this has several important implications that will be discussed later
D. McCaughan
Hardware basics •  The compute device is composed of a number of multiprocessors, each of which contains a number of SIMD processors –  G80 has 16 multiprocessors, GTX280 has 30 •  A multiprocessor can execute K threads in parallel physically, where K is called the warp size –  thread = instance of kernel –  warp size on current hardware is 32 threads •  Each multiprocessor contains a large number of 32-­‐bit registers which are divided among the active threads –  G80 has 8192, GTX280 has 16384 Summer School 2010
D. McCaughan
Hardware architecture •  Each multiprocessor has: –  registers (on-chip) –  shared memory (16KB on-chip) –  access to fast DRAM •  local memory •  global memory •  constant memory •  texture memory
[Figure: device diagram – each multiprocessor has its own shared memory and registers, and each thread its own local memory; the host accesses the device’s global, constant and texture memory]
Thread batching: 1D example [Figure: a 1D grid of blocks (Block 0 … Block B), with one block expanded to show its threads (Thread 0 … Thread T)]
Thread batching: 2D example [Figure: a 3×2 grid of blocks; block (1, 1) expanded into a 5×3 array of threads (0, 0) … (4, 2)]
Thread batching (cont.) •  At runtime, a thread can determine the block that it belongs to, the block dimensions, and the thread index within the block •  These values can be used to compute indices into input and output arrays Summer School 2010
D. McCaughan
Introduction to GPU Programming: CUDA HELLO, CUDA! Summer School 2010
D. McCaughan
Language and compiler •  CUDA provides a set of extensions to the C programming language –  new storage qualifiers, kernel invocation syntax, intrinsics, vector types, etc. •  CUDA source code is saved in .cu files –  host and device code can coexist in the same file –  storage qualifiers determine the type of code •  Compiled to object files using the nvcc compiler –  object files contain executable host and device code •  Can be linked with object files generated by other C/C++ compilers
D. McCaughan
SAXPY •  SAXPY (Scalar Alpha X Plus Y) is a common linear algebra operation. It is a combination of scalar multiplication and vector addition: y = α ∙ x + y –  x and y are vectors, α is a scalar –  x and y can be arbitrarily large Summer School 2010
D. McCaughan
SAXPY: CPU version •  Here is SAXPY in vanilla C: void saxpy_cpu(float *vecY, float *vecX, float alpha, int n)
{
int i;
for (i = 0; i < n; i++)
vecY[i] = alpha * vecX[i] + vecY[i];
}
–  the CPU processes vector components sequentially using a for loop –  note that vecY is an in-­‐out parameter here Summer School 2010
D. McCaughan
HANDS-­ON: SAXPY: CUDA version •  Copy the saxpy_cpu program provided (use a .cu extension) –  add the following CUDA kernel function: __global__ void saxpy_gpu(float *vecY, float *vecX,
float alpha, int n)
{
int i;
i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
vecY[i] = alpha * vecX[i] + vecY[i];
}
•  The __global__ qualifier identifies this function as a kernel that executes on the device
D. McCaughan
SAXPY: CUDA version (cont.) __global__ void saxpy_gpu(float *vecY, float *vecX,
float alpha, int n)
{
int i;
i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
vecY[i] = alpha * vecX[i] + vecY[i];
}
•  blockIdx, blockDim and threadIdx are built-­‐in variables that uniquely identify a thread’s position in the execution environment –  they are used to compute an offset into the data array Summer School 2010
D. McCaughan
SAXPY: CUDA version (cont.) __global__ void saxpy_gpu(float *vecY, float *vecX,
float alpha, int n)
{
int i;
i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
vecY[i] = alpha * vecX[i] + vecY[i];
}
•  The host specifies the number of blocks and block size during kernel invocation: saxpy_gpu<<<numBlocks, blockSize>>>(y_d, x_d, alpha, n);
Summer School 2010
D. McCaughan
Computing the index i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
vecY[i] = alpha * vecX[i] + vecY[i];
[Figure: vecX and vecY divided among Blocks 0–4, eight threads per block; the highlighted element is handled by thread 5 of block 3]
blockIdx.x = 3, blockDim.x = 8, threadIdx.x = 5  →  i = 3 * 8 + 5 = 29
D. McCaughan
Key differences void saxpy_cpu(float *vecY, float *vecX, float alpha, int n)
{
int i;
for (i = 0; i < n; i++)
vecY[i] = alpha * vecX[i] + vecY[i];
}
__global__ void saxpy_gpu(float *vecY, float *vecX,
float alpha, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
vecY[i] = alpha * vecX[i] + vecY[i];
}
•  No need to explicitly loop over array elements – each element is processed in a separate thread
•  The element index is computed based on block index, block width and thread index within the block
D. McCaughan
Key differences __global__ void saxpy_gpu(float *vecY, float *vecX,
float alpha, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
vecY[i] = alpha * vecX[i] + vecY[i];
}
•  Could avoid testing whether i < n if we knew n is a multiple of the block size (e.g. use padded arrays – recall MPI_Scatter issues); a host-side padding sketch follows the kernel below __global__ void saxpy_gpu(float *vecY, float *vecX,
float alpha, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
vecY[i] = alpha * vecX[i] + vecY[i];
}
Summer School 2010
D. McCaughan
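A sketch (not from the original slides) of how the host could pad the problem size so the bounds test becomes unnecessary; the helper and variable names are hypothetical:

/* Illustrative helper: round n up to a multiple of the block size so the
   kernel needs no bounds test. */
static int pad_to_block(int n, int blockSize)
{
    return ((n + blockSize - 1) / blockSize) * blockSize;  /* smallest multiple of blockSize >= n */
}

/* host side (sketch): allocate nPadded elements on host and device, and zero
   the tail elements n .. nPadded-1 so the extra threads do harmless work */
int nPadded = pad_to_block(n, 512);
int nBlocks = nPadded / 512;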
HANDS-ON: Host code: overview •  The host performs the following operations:
1.  initialize device
2.  allocate and initialize input arrays in host DRAM
3.  allocate memory on device
4.  upload input data to device
5.  execute kernel on device
6.  download results
7.  check results
8.  clean-up
D. McCaughan
Host code: initialization
#include <cuda.h>    /* CUDA runtime API */
#include <cutil.h>   /* useful utilities */

int main(int argc, char *argv[])
{
    float *x_host, *y_host;    /* arrays for computation on host */
    float *x_dev, *y_dev;      /* arrays for computation on device */
    float *y_shadow;           /* host-side copy of device results */
    int n = 16 * 1024 * 1024;
    float alpha = 0.5f;
    size_t memsize;
    int i, blockSize, nBlocks;

    /*
     * find compute device and initialize it
     */
    CUT_DEVICE_INIT(argc, argv);
    …
Summer School 2010
D. McCaughan
Host code: memory allocation …
memsize = n * sizeof(float);
/*
* allocate arrays on host
*/
x_host = malloc(memsize);
y_host = malloc(memsize);
y_shadow = malloc(memsize);
/*
* allocate arrays on device
*/
CUDA_SAFE_CALL(cudaMalloc((void **)&x_dev, memsize));
CUDA_SAFE_CALL(cudaMalloc((void **)&y_dev, memsize));
…
Summer School 2010
D. McCaughan
Host code: upload data
    …
    /*
     * initialize arrays on host
     */
    for (i = 0; i < n; i++)
    {
        x_host[i] = rand() / (float)RAND_MAX;
        y_host[i] = rand() / (float)RAND_MAX;
    }

    /*
     * copy arrays to device memory (synchronous)
     */
    CUDA_SAFE_CALL(cudaMemcpy(x_dev, x_host, memsize, cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(y_dev, y_host, memsize, cudaMemcpyHostToDevice));
    …
Summer School 2010
D. McCaughan
Host code: kernel execution …
/*
* set up device execution configuration
*/
blockSize = 512;
nBlocks = n / blockSize + (n % blockSize > 0);
/*
* execute kernel (asynchronous!)
*/
saxpy_gpu<<<nBlocks, blockSize>>>(y_dev, x_dev, alpha, n);
/* check if kernel invocation generated an error */
CUT_CHECK_ERROR(“Kernel execution failed”);
/* execute host version (i.e. baseline reference results) */
saxpy_cpu(y_host, x_host, alpha, n);
…
Summer School 2010
D. McCaughan
Host code: download results …
/*
* retrieve results from device (synchronous)
*/
CUDA_SAFE_CALL(cudaMemcpy(y_shadow, y_dev,
memsize, cudaMemcpyDeviceToHost));
/*
* check results
*/
if (cutComparef(y_shadow, y_host, n))
printf(“PASSED!\n”);
else
printf(“FAILED!\n”);
…
Summer School 2010
D. McCaughan
Host code: clean-­‐up …
/*
* deallocate arrays on host
*/
free(x_host);
free(y_host);
free(y_shadow);
/*
* deallocate arrays on device
*/
CUDA_SAFE_CALL(cudaFree(x_dev));
CUDA_SAFE_CALL(cudaFree(y_dev));
return(0);
} /* main */
Summer School 2010
D. McCaughan
SAXPY Makefile
TARGET = saxpy_cuda
CUDA_INSTALL_PATH ?= /opt/sharcnet/cuda-toolkit+sdk/current
CUDA_PROJECT_PATH ?= /work/username/cuda_projects

# cuda.h
INCLUDES += -I$(CUDA_INSTALL_PATH)/include
# cutil.h
INCLUDES += -I$(CUDA_PROJECT_PATH)/common/inc

LIBS += -L$(CUDA_INSTALL_PATH)/lib64 -L$(CUDA_PROJECT_PATH)/lib
LIBS += -lcuda -lcudart -lcutil

$(TARGET): saxpy_cuda.o
	gcc -fPIC -o $(TARGET) saxpy_cuda.o $(LIBS)

saxpy_cuda.o: saxpy_cuda.cu
	nvcc $(INCLUDES) -c saxpy_cuda.cu -o saxpy_cuda.o

clean:
	rm -f *.o $(TARGET)
Summer School 2010
D. McCaughan
SAXPY using classical GPGPU •  It might be fun to take a look at a “classical” GPGPU implementation using OpenGL •  An excellent tutorial by Dominik Göddeke can be found here: http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html
Summer School 2010
D. McCaughan
Introduction to GPU Programming: CUDA CUDA EXTENSION TO THE C PROGRAMMING LANGUAGE Summer School 2010
D. McCaughan
Storage class qualifiers
Functions:
__global__    Device kernels callable from host
__device__    Device functions (only callable from device)
__host__      Host functions (only callable from host)
Data:
__shared__    Memory shared by a block of threads executing on a multiprocessor
__constant__  Special memory for constants (cached)
Summer School 2010
D. McCaughan
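A short illustration (added here, not from the slides) of how the function qualifiers combine – a __device__ helper called from a kernel, and a function compiled for both host and device; all names are hypothetical:

/* Illustrative use of the function qualifiers */
__device__ float squared(float x)           /* callable only from device code */
{
    return x * x;
}

__host__ __device__ float clampf(float x)   /* compiled for both host and device */
{
    return (x < 0.0f) ? 0.0f : x;
}

__global__ void apply(float *data, int n)   /* kernel, launched from the host */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clampf(squared(data[i]));
}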
CUDA data types •  C primitives: –  char, int, float, double, … •  Short vectors: –  int2, int3, int4, uchar2, uchar4, float2, float3, float4, … –  no built-in vector math (although a utility header, cutil_math.h, defines some common operations) •  Special type used to represent dimensions –  dim3
•  Support for user-defined structures, e.g.: struct particle
{
float3 position, velocity, acceleration;
float mass;
};
Summer School 2010
D. McCaughan
Library functions available to kernels •  Math library functions: –  sin, cos, tan, sqrt, pow, log, … –  sinf, cosf, tanf, sqrtf, powf, logf, … •  ISA intrinsics –  __sinf, __cosf, __tanf, __powf, __logf, … –  __mul24, __umul24, … •  Intrinsic versions of math functions are faster but less precise Summer School 2010
D. McCaughan
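An illustrative kernel (not from the slides) contrasting a standard single-precision math function with its faster, less precise intrinsic counterpart:

/* sinf() is the standard single-precision function; __sinf() is the
   faster but less precise hardware intrinsic. */
__global__ void sine_both(const float *x, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        accurate[i] = sinf(x[i]);
        fast[i]     = __sinf(x[i]);
    }
}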
Built-in kernel variables
dim3 gridDim –  number of blocks in the grid
dim3 blockDim –  number of threads per block
dim3 blockIdx –  index of the current block within the grid
dim3 threadIdx –  index of the current thread within the block (a short indexing sketch follows)
D. McCaughan
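A small sketch (added for illustration) of how these variables combine into a unique global index, and how gridDim can be used when there are more elements than threads (a so-called grid-stride loop); the kernel itself is hypothetical:

__global__ void scale(float *data, int n, float s)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;  /* unique index across the whole grid */
    int stride = gridDim.x * blockDim.x;                 /* total number of threads launched */

    for (; i < n; i += stride)                           /* each thread handles i, i+stride, ... */
        data[i] *= s;
}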
CUDA kernels: limitations •  No recursion •  No variable argument lists •  No dynamic memory allocation •  No pointers-­‐to-­‐functions •  No static variables inside kernels Summer School 2010
D. McCaughan
Launching kernels •  Launchable kernels must be declared as ‘__global__ void’ __global__ void myKernel(paramList);
•  Kernel calls must specify device execution environment –  grid definition – number of blocks in grid –  block definition – number of threads per block –  optionally, may specify amount of shared memory per block (more on that later; a sketch follows) •  Kernel launch syntax: myKernel<<<GridDef, BlockDef>>>(paramList);
Summer School 2010
D. McCaughan
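A sketch (not from the slides) of the optional third launch parameter: the number of bytes of shared memory to allocate per block. The kernel and variable names are illustrative; shared memory and __syncthreads() are covered later in the deck.

/* Kernel using dynamically sized shared memory (size set at launch time) */
__global__ void blockSum(const float *in, float *out)
{
    extern __shared__ float buf[];    /* sized by the third launch parameter */

    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0)             /* simple (non-optimal) per-block sum */
    {
        float s = 0.0f;
        for (int k = 0; k < blockDim.x; k++)
            s += buf[k];
        out[blockIdx.x] = s;
    }
}

/* host: 128 threads per block, 128 floats of shared memory per block */
blockSum<<<numBlocks, 128, 128 * sizeof(float)>>>(in_dev, out_dev);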
Thread addressing •  Kernel launch syntax: myKernel<<<GridDef, BlockDef>>>(paramList);
•  GridDef and BlockDef can be specified as dim3
objects –  grids can be 1D or 2D –  blocks can be 1D, 2D or 3D •  This makes it easy to set up different memory addressing for multi-­‐dimensional data. Summer School 2010
D. McCaughan
Thread addressing (cont.) •  1D addressing example: 100 blocks with 256 threads per block: dim3 gridDef1(100,1,1);
dim3 blockDef1(256,1,1);
kernel1<<<gridDef1, blockDef1>>>(paramList);
•  2D addressing example: 10x10 blocks with 16x16 threads per block: dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
•  Both examples launch the same number of threads, but block and thread indexing is different –  kernel1 uses blockIdx.x, blockDim.x and threadIdx.x
–  kernel2 uses blockIdx.[xy], blockDim.[xy], threadIdx.[xy]
Summer School 2010
D. McCaughan
Thread addressing (cont.) •  One-­‐dimensional addressing example: __global__ void kernel1(float *idata, float *odata)
{
int i;
i = blockIdx.x * blockDim.x + threadIdx.x;
odata[i] = func(idata[i]);
}
•  Two-­‐dimensional addressing example: __global__ void kernel2(float *idata, float *odata, int pitch)
{
int x, y, i;
x = blockIdx.x * blockDim.x + threadIdx.x;
y = blockIdx.y * blockDim.y + threadIdx.y;
i = y * pitch + x;
odata[i] = func(idata[i]);
}
Summer School 2010
D. McCaughan
Thread addressing (cont.) __global__ void kernel1(float *idata, float *odata)
{
int i;
i = blockIdx.x * blockDim.x + threadIdx.x;
odata[i] = func(idata[i]);
}
…
dim3 gridDef1(100,1,1);
dim3 blockDef1(256,1,1);
kernel1<<<gridDef1, blockDef1>>>(paramList);
[Figure: a 1D grid of 100 blocks (Block 0 … Block 99), each containing 256 threads (Thread 0 … Thread 255)]
Thread addressing (cont.) __global__ void kernel2(float *idata, float *odata, int pitch)
{
int x, y, i;
x = blockIdx.x * blockDim.x + threadIdx.x;
y = blockIdx.y * blockDim.y + threadIdx.y;
i = y * pitch + x;
odata[i] = func(idata[i]);
}
…
dim3 gridDef2(10,10,1);
dim3 blockDef2(16,16,1);
kernel2<<<gridDef2, blockDef2>>>(paramList);
[Figure: a 3×2 grid of blocks; block (1, 1) expanded into a 5×3 array of threads (0, 0) … (4, 2)]
D. McCaughan
Beyond the basics… •  Memory address coalescing •  Shared memory •  Thread synchronization Summer School 2010
D. McCaughan
Introduction to GPU Programming: CUDA OPTIMIZATION CASE STUDY: MATRIX TRANSPOSE Summer School 2010
D. McCaughan
Matrix transpose •  Write the rows of matrix A as columns of matrix AT •  A bandwidth-­‐limited problem –  most of the time is spent moving data in memory rather than number crunching •  Utilizing the memory architecture effectively tends to be the biggest challenge in CUDA-­‐fying algorithms •  See ${CUDA_INSTALL_PATH}/C/src/transpose
–  ${CUDA_PROJECT_PATH}/src/transpose
Summer School 2010
D. McCaughan
Matrix transpose (cont.) [Figure: the rows of the input matrix (elements 1–6) are written as columns in the output matrix]
D. McCaughan
The naïve matrix transpose __global__ void transpose_naive(float *odata, float *idata,
int width, int height)
{
unsigned int xIndex, yIndex, index_in, index_out;
xIndex = blockDim.x * blockIdx.x + threadIdx.x;
yIndex = blockDim.y * blockIdx.y + threadIdx.y;
if (xIndex < width && yIndex < height)
{
index_in = xIndex + width * yIndex;
index_out = yIndex + height * xIndex;
odata[index_out] = idata[index_in];
}
}
Summer School 2010
D. McCaughan
Kernel execution speed vs. CPU version •  Tesla S870, Xeon @ 2.0GHz (tope.sharcnet.ca) •  Averaged over 100 runs:
Size        CPU        GPU (kernel)  Speedup
128x128     0.080 ms   0.023 ms      x 3.5
256x256     0.313 ms   0.074 ms      x 4.2
512x512     1.265 ms   0.261 ms      x 4.8
1024x1024   12.450 ms  2.503 ms      x 5.0
256x4096    11.932 ms  3.112 ms      x 3.8
–  timings don’t include data transfers!!!
D. McCaughan
Kernel + data transfers vs. CPU version •  Tesla S870, Xeon @ 2.0GHz (tope.sharcnet.ca) •  Averaged over 100 runs:
Size        CPU        GPU (kernel + transfers)  Speedup
128x128     0.080 ms   0.171 ms                  x 0.5
256x256     0.313 ms   0.511 ms                  x 0.6
512x512     1.265 ms   1.983 ms                  x 0.6
1024x1024   12.450 ms  8.032 ms                  x 1.6
256x4096    11.932 ms  8.591 ms                  x 1.4
–  not so good…
D. McCaughan
Remarks •  Matrix transpose is not an ideal GPU application –  arithmetic intensity way too low (arithmetic instructions only used for computing offsets) –  bandwidth-­‐limited •  Data transfer overhead •  Kernel launch overhead •  Still, we could do a little better Summer School 2010
D. McCaughan
Naïve matrix transpose [Figure: the rows of the input matrix (elements 1–6) are written as columns in the output matrix]
D. McCaughan
Naïve matrix transpose (cont.) Since the matrices are stored as 1D arrays, here’s what is actually happening: [Figure: consecutive elements 1 2 3 4 of an input row end up scattered across non-consecutive locations of the 1D output array]
Naïve matrix transpose (cont.) •  This kind of memory access pattern is very inefficient •  Reading and/or writing non-consecutive locations from within a thread warp results in poor bandwidth •  Memory access coalescing (read/write) is a common strategy for optimizing global memory access
D. McCaughan
Coalesced reads and writes •  The DRAM interface is optimized for reading and writing consecutive locations in memory (a remnant of pixel shader heritage) •  To take advantage of this architecture, consecutive threads in a thread warp should access consecutive locations •  The compiler should be able to detect that pattern and generate optimal memory transfer instructions •  In the naïve matrix transpose example, the reads are coalesced, but the writes aren’t •  Coalescing the writes can yield up to an order of magnitude speed-up! –  will require the use of shared memory…
D. McCaughan
Shared memory •  Each multiprocessor has some fast on-chip shared memory •  Threads within a thread block can communicate using the shared memory •  Each thread in a thread block has R/W access to all of the shared memory allocated to a block •  Threads can synchronize using the intrinsic __syncthreads();
[Figure: device diagram – each multiprocessor has shared memory, registers and per-thread local memory, plus access to global, constant and texture memory]
Using shared memory •  To coalesce the writes, we will partition the matrix into 16x16 tiles, each processed by a different thread block •  A thread block will temporarily stage its tile in shared memory by copying it from the input matrix using coalesced reads •  Each tile is then transposed as it is written out to its proper location in the output matrix •  The main difference here is that the tile is written out using coalesced writes
D. McCaughan
Optimized matrix transpose [Figure: a tile of the input matrix is copied unchanged into shared memory, then transposed as it is written to the output matrix – input tile (i, j) lands at output tile (j, i)]
D. McCaughan
Optimized matrix transpose (1) __global__ void transpose(float *odata, float *idata,
int width, int height)
{
__shared__ float block[BLOCK_DIM][BLOCK_DIM];
unsigned int xIndex, yIndex, index_in, index_out;
/* read the matrix tile into shared memory */
xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
if ((xIndex < width) && (yIndex < height))
{
index_in = yIndex * width + xIndex;
block[threadIdx.y][threadIdx.x] = idata[index_in];
}
__syncthreads();
/* write the transposed matrix tile to global memory */
xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
if ((xIndex < height) && (yIndex < width))
{
index_out = yIndex * width + xIndex;
odata[index_out] = block[threadIdx.x][threadIdx.y];
}
}
Summer School 2010
D. McCaughan
Optimized matrix transpose (cont.) Copy of tile in shared memory Copy Input tile (1,0) The input tile is copied into shared memory row-­‐wise (coalesced reads). Each row of the input tile becomes a row in the shared memory tile. Summer School 2010
D. McCaughan
Optimized matrix transpose (cont.) Copy of tile in shared memory Output tile Transpose The output tile is written row-­‐wise (coalesced writes) by copying the corresponding columns from shared memory. Summer School 2010
D. McCaughan
Naïve vs. optimized GPU transpose •  Averaged over 100 runs:
Size        GPU (Naïve)  GPU (Optimized)  Speedup
128x128     0.023 ms     0.015 ms         x 1.5
256x256     0.074 ms     0.033 ms         x 2.2
512x512     0.261 ms     0.107 ms         x 2.4
1024x1024   2.503 ms     0.401 ms         x 6.2
256x4096    3.112 ms     0.402 ms         x 7.7
•  Lesson: optimizing global memory access is important!
D. McCaughan
Shared memory is great, right? •  Yes. But… •  This is a good first stab at an optimized matrix transpose •  However, shared memory bank conflicts occur when the tile in shared memory is accessed column-wise •  Say what…?
D. McCaughan
Shared memory banks •  To facilitate high memory bandwidth, the shared memory on each multiprocessor is organized into equally-sized banks which can be accessed simultaneously •  However, if more than one thread tries to access the same bank, the accesses must be serialized, causing delays –  this situation is called a bank conflict •  The banks are organized such that consecutive 32-bit words are assigned to consecutive banks
D. McCaughan
Shared memory banks (cont.) •  There are 16 banks, thus: bank# = address % 16
•  The number of shared memory banks is closely tied to the warp size •  Shared memory accesses are serviced such that the threads in the first half of a warp and the threads in the second half of the warp will not cause conflicts •  Thus we have NUM_BANKS = WARP_SIZE / 2
Summer School 2010
D. McCaughan
Bank conflict solution •  In the matrix transpose example, bank conflicts occur when the shared memory is accessed column-wise as the tile is being written •  The threads in each half-warp access addresses which are offset from each other by BLOCK_DIM elements (with BLOCK_DIM = 16) •  Given 16 shared memory banks, that means that all accesses hit the same bank!
D. McCaughan
Bank conflict solution •  The solution is surprisingly simple – instead of allocating a BLOCK_DIM x BLOCK_DIM shared memory tile, we allocate a BLOCK_DIM x (BLOCK_DIM+1) tile •  The extra padding breaks the pattern and forces concurrent threads to access different banks of shared memory –  the columns are no longer aligned on 16-word offsets –  no additional changes to the device code are needed –  a small arithmetic sketch of the bank mapping follows
D. McCaughan
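A small self-contained check (added for illustration, assuming 16 banks with one 32-bit word per bank) of why the one-element padding removes the conflicts:

/* Which bank does thread t of a half-warp hit when reading a column? */
#include <stdio.h>

int main(void)
{
    int t, stride;
    for (stride = 16; stride <= 17; stride++)   /* 16 = unpadded tile row, 17 = padded tile row */
    {
        printf("row stride %2d banks:", stride);
        for (t = 0; t < 16; t++)                /* the 16 threads of a half-warp */
            printf(" %2d", (t * stride) % 16);  /* bank hit by thread t */
        printf("\n");                           /* stride 16 -> all bank 0; stride 17 -> banks 0..15 */
    }
    return 0;
}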
Optimized matrix transpose (2) __global__ void transpose(float *odata, float *idata,
int width, int height)
{
__shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];
unsigned int xIndex, yIndex, index_in, index_out;
/* read the matrix tile into shared memory */
xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
if ((xIndex < width) && (yIndex < height))
{
index_in = yIndex * width + xIndex;
block[threadIdx.y][threadIdx.x] = idata[index_in];
}
__syncthreads();
/* write the transposed matrix tile to global memory */
xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
if ((xIndex < height) && (yIndex < width))
{
index_out = yIndex * width + xIndex;
odata[index_out] = block[threadIdx.x][threadIdx.y];
}
}
Summer School 2010
D. McCaughan
Evaluating performance •  Precept of the day: –  easy to program in CUDA… [ O(minutes, hours, days) ] –  …less easy to get it right [ O(days, weeks, months) ] •  Often, maximizing performance in CUDA comes down to benchmarking several alternative solutions •  The matrix transpose example shows how to create and use timers (a usage sketch follows this list) –  cutCreateTimer(&timer)
–  cutStartTimer(timer), cutStopTimer(timer),
cutResetTimer(timer)
–  cutGetTimerValue(timer)
/* returns ms */
Summer School 2010
D. McCaughan
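A usage sketch of these timers, assuming the cutil utility library shipped with the CUDA SDK; the kernel name and launch configuration are illustrative, and exact signatures may differ between SDK versions:

/* time a kernel with the SDK's cutil timers */
unsigned int timer = 0;
cutCreateTimer(&timer);

cutStartTimer(timer);
transpose<<<grid, threads>>>(odata_dev, idata_dev, width, height);
cudaThreadSynchronize();   /* kernel launches are asynchronous: wait before stopping the timer */
cutStopTimer(timer);

printf("kernel time: %f ms\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);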
Eliminating bank conflicts •  Averaged over 100 runs:
Size        GPU (Optimized, bank conflicts)  GPU (Optimized, no bank conflicts)  Speedup  Total speedup
128x128     0.015 ms                         0.013 ms                            x 1.15   x 1.77
256x256     0.033 ms                         0.025 ms                            x 1.32   x 2.96
512x512     0.107 ms                         0.076 ms                            x 1.41   x 3.43
1024x1024   0.401 ms                         0.282 ms                            x 1.42   x 8.88
256x4096    0.402 ms                         0.291 ms                            x 1.38   x 10.69
•  Order of magnitude speed-up over naïve GPU version
D. McCaughan
GPU vs. CPU (kernel execution speed) •  Averaged over 100 runs:
Size        CPU        GPU (Optimized)  Speedup
128x128     0.080 ms   0.013 ms         x 6.1
256x256     0.313 ms   0.025 ms         x 12.5
512x512     1.265 ms   0.076 ms         x 16.6
1024x1024   12.450 ms  0.282 ms         x 44.1
256x4096    11.932 ms  0.291 ms         x 41.0
•  Does not include data transfers!
D. McCaughan
GPU vs. CPU (with data transfers) •  Averaged over 100 runs:
Size        CPU        GPU (Optimized)  Speedup
128x128     0.080 ms   0.160 ms         x 0.5
256x256     0.313 ms   0.461 ms         x 0.8
512x512     1.265 ms   1.703 ms         x 0.7
1024x1024   12.450 ms  5.810 ms         x 2.1
256x4096    11.932 ms  5.818 ms         x 2.1
Summer School 2010
D. McCaughan
Final remarks •  Matrix transpose is not the ideal GPU application •  Arithmetic intensity too low •  However, can achieve order of magnitude speedup over naïve GPU implementation –  moreover, computation significantly faster on GPU if data transfers not included in timings –  useful if the transpose is a part of a multi-stage algorithm
D. McCaughan
General strategies •  Think B.I.G. –  GPU needs thousands of active threads for full occupancy –  SIMD pipelines can execute some threads while others wait for memory access (latency hiding) •  Maximize arithmetic intensity –  maximize ratio of ALU instructions to memory access instructions •  global memory is uncached •  latency measured in hundreds of clock cycles per access –  large volumes of threads can be used to hide latency, but this requires enough ALU instructions to make it work well Summer School 2010
D. McCaughan
General strategies (cont.) •  Avoid host-device memory transfers as much as possible –  do as much work on the device as possible –  device has very fast, high-bandwidth DRAM –  for example, rather than uploading arrays from the host, write a kernel that initializes arrays directly on the device •  e.g. pseudo-random number generation •  Avoid divergent branches (illustrated in the sketch below) –  if threads diverge within a thread warp, both branches must execute in series –  branch cost is minimal if all threads in a warp take the same path
D. McCaughan
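An illustrative pair of kernels (not from the slides) showing a warp-divergent branch versus one that is uniform within each warp:

__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i % 2 == 0)            /* neighbouring threads take different paths, */
        data[i] *= 2.0f;       /* so the warp executes both branches serially */
    else
        data[i] += 1.0f;
}

__global__ void uniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (blockIdx.x % 2 == 0)   /* condition is identical for every thread in the */
        data[i] *= 2.0f;       /* block (and hence in each warp), so no divergence */
    else
        data[i] += 1.0f;
}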
General strategies (cont.) •  Use coalesced memory accesses –  shared memory can be used to temporarily stage data –  be careful to avoid shared memory bank conflicts, but these are far less costly than non-coalesced global memory access –  shared memory can be as fast as registers when there are no bank conflicts •  Optimize register and shared memory usage –  minimize the number of registers and shared memory needed by each thread –  the less resources each thread uses, the more threads can be active simultaneously in the multiprocessor pipelines
D. McCaughan
General strategies: thread batching •  We must have at least as many thread blocks as there are multiprocessors, otherwise some multiprocessors will be dormant during kernel execution –  the number of blocks should be a multiple of the number of multiprocessors for balanced work distribution –  in practice, there should be hundreds or thousands of thread blocks to ensure full occupancy on current and future hardware •  Each thread block should contain a multiple of 32 threads (32 being the warp size), otherwise the multiprocessors are underutilized –  256 is a good number. 512 is maximum. Summer School 2010
D. McCaughan
General strategies (cont.) •  Constant and texture memory is cached –  both are read-only –  texture memory/cache optimized for 2D access patterns •  Take advantage of asynchronous calls to overlap CPU and GPU computations (see the stream sketch below) –  all kernel invocations are asynchronous –  cudaMemcpyAsync can be used for asynchronous memory transfers –  use cudaThreadSynchronize to wait for all GPU operations to finish
D. McCaughan
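A sketch (not from the slides) of overlapping CPU work with an asynchronous copy and kernel execution in a CUDA stream; it assumes page-locked host memory from cudaMallocHost, and the buffer, size and kernel names are illustrative:

/* queue an asynchronous copy and a kernel, then keep the CPU busy */
cudaStream_t stream;
float *h_buf, *d_buf;

cudaStreamCreate(&stream);
cudaMallocHost((void **)&h_buf, memsize);   /* page-locked host memory enables async copies */
cudaMalloc((void **)&d_buf, memsize);

cudaMemcpyAsync(d_buf, h_buf, memsize, cudaMemcpyHostToDevice, stream);
myKernel<<<nBlocks, blockSize, 0, stream>>>(d_buf);   /* queued behind the copy in the same stream */

/* ... CPU work can proceed here ... */

cudaStreamSynchronize(stream);   /* wait for the copy and the kernel to finish */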
A note regarding multi-­‐card setups •  SLI = “Scalable Link Interface” –  used to distribute work across multiple GPUs during graphics rendering •  No SLI with CUDA –  each device must be managed independently –  host responsible for dividing work –  one CPU thread per device •  Contradictory paradigms: –  SLI = automagical scalability –  CUDA = low-­‐level access to hardware → you do the magic! Summer School 2010
D. McCaughan
Further reading •  CUDA Programming Guide •  CUDA sample projects –  many contain extended documentation –  similar to the matrix transpose, the reduction project is an excellent step-by-step walkthrough of how to optimize code for the hardware (read/write coalescing, shared memory, bank conflicts, etc.) •  Lots of documentation/presentations/tutorials online –  one more time, Google is your friend!
D. McCaughan