CUDA Advanced Memory Usage and Optimization

Transcription

CUDA Advanced Memory Usage and Optimization
Yukai Hung [email protected]
Department of Mathematics, National Taiwan University
Register as Cache? Volatile Qualifier

!   Volatile qualifier

__global__ void kernelFunc(int* result)
{
int temp1;
int temp2;
if(threadIdx.x<warpSize)
{
temp1=array[threadIdx.x];
array[threadIdx.x+1]=2;
}
//identical read: the compiler optimizes this read away
temp2=array[threadIdx.x];
result[threadIdx.x]=temp1*temp2;
}
Volatile Qualifier

!   Volatile qualifier: the compiler caches the identical read in a register

__global__ void kernelFunc(int* result)
{
int temp1;
int temp2;
if(threadIdx.x<warpSize)
{
int temp=array[threadIdx.x];
temp1=temp;
array[threadIdx.x+1]=2;
}
temp2=temp;
result[threadIdx.x]=temp1*temp2;
}
Volatile Qualifier

!   Volatile qualifier: inserting __syncthreads() forces the second read

__global__ void kernelFunc(int* result)
{
int temp1;
int temp2;
if(threadIdx.x<warpSize)
{
temp1=array[threadIdx.x]*1;
array[threadIdx.x+1]=2;
__syncthreads();
temp2=array[threadIdx.x]*2;
result[threadIdx.x]=temp1*temp2;
}
}
Volatile Qualifier

!   Volatile qualifier: declaring the temporaries volatile forces both reads

__global__ void kernelFunc(int* result)
{
volatile int temp1;
volatile int temp2;
if(threadIdx.x<warpSize)
{
temp1=array[threadIdx.x]*1;
array[threadIdx.x+1]=2;
}
temp2=array[threadIdx.x]*2;
result[threadIdx.x]=temp1*temp2;
}
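In practice the volatile qualifier is more often placed on the shared or global array itself than on the temporaries. A minimal warp-synchronous reduction sketch, illustrative only and not from the lecture (it assumes a 32-thread block on a pre-Fermi/Fermi-era device where a warp executes in lockstep):

```cuda
//illustrative sketch: volatile on the shared array forces every
//read and write to go to shared memory instead of a register copy
__global__ void warpReduce(const float* in, float* out)
{
__shared__ volatile float sdata[32];
int tid=threadIdx.x;
sdata[tid]=in[blockIdx.x*32+tid];
//threads in one warp run in lockstep, so no __syncthreads() is used;
//without volatile the compiler could keep sdata[tid] in a register
//and miss the updates written by neighboring threads
if(tid<16) sdata[tid]+=sdata[tid+16];
if(tid< 8) sdata[tid]+=sdata[tid+ 8];
if(tid< 4) sdata[tid]+=sdata[tid+ 4];
if(tid< 2) sdata[tid]+=sdata[tid+ 2];
if(tid< 1) sdata[tid]+=sdata[tid+ 1];
if(tid==0) out[blockIdx.x]=sdata[0];
}
```

On later architectures the same effect is obtained with explicit warp synchronization, but the volatile form above matches the hardware generation discussed in these slides.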
Data Prefetch

!   Hide memory latency by overlapping loading and computing
-   double buffer is traditional software pipeline technique

[matrix-multiplication tiling diagram: Md, Nd, Pd with the Pdsub tile]
load blue block to shared memory
compute blue block on shared memory and load next block to shared memory
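The double-buffer technique can be written as a minimal CUDA kernel. This sketch is illustrative only (the names TILE and tileSum are not from the lecture, the block size is assumed to equal TILE, and the tile count is assumed to divide the input evenly):

```cuda
#define TILE 128

//illustrative double-buffer sketch: each block walks over ntiles
//tiles of its own region of the input
__global__ void tileSum(const float* input, float* output, int ntiles)
{
__shared__ float buffer[TILE];
int tid=threadIdx.x;
const float* base=input+blockIdx.x*ntiles*TILE;
float acc=0.0f;
//prefetch the first tile from global memory into a register
float reg=base[tid];
for(int t=0;t<ntiles;t++)
{
    //store the prefetched value from register to shared memory
    buffer[tid]=reg;
    __syncthreads();
    //issue the next global load while the current tile is consumed
    if(t+1<ntiles)
        reg=base[(t+1)*TILE+tid];
    //compute on the current tile in shared memory
    acc+=buffer[(tid+1)%TILE];
    __syncthreads();
}
output[blockIdx.x*TILE+tid]=acc;
}
```

The global load for tile t+1 is issued before the shared-memory computation on tile t, so the memory latency is hidden behind the arithmetic, exactly as in the pseudocode on these slides.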
Data Prefetch

!   Hide memory latency by overlapping loading and computing
-   double buffer is traditional software pipeline technique

for loop
{
load data from global to shared memory
synchronize block
compute data in the shared memory
synchronize block
}

Data Prefetch

!   Hide memory latency by overlapping loading and computing
-   double buffer is traditional software pipeline technique

load data from global memory to registers
for loop
{
store data from register to shared memory    //very small overhead: register and shared memory are independent and both very fast
synchronize block
load data from global memory to registers    //computing and loading overlap
compute data in the shared memory
synchronize block
}

Data Prefetch

!   Matrix-matrix multiplication

Constant Memory

!   Where is constant memory?
-   data is stored in the device global memory
-   read data through multiprocessor constant cache
-   64KB constant memory and 8KB cache for each multiprocessor

!   How about the performance?
-   optimized when warp of threads read same location
-   4 bytes per cycle through broadcasting to warp of threads
-   serialized when warp of threads read in different locations
-   very slow when cache miss (read data from global memory)
-   access latency can range from one to hundreds of clock cycles

Constant Memory

!   How to use constant memory?
-   declare constant memory at file scope (global variable)
-   copy data to constant memory from the host (because it is constant!!)

//declare constant memory
__constant__ float cst_ptr[size];
//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);
Constant Memory

//declare constant memory
__constant__ float cangle[360];
int main(int argc,char** argv)
{
int size=3200;
float* darray;
float hangle[360];
//allocate device memory
cudaMalloc((void**)&darray,sizeof(float)*size);
//initialize allocated memory
cudaMemset(darray,0,sizeof(float)*size);
//initialize angle array on host
for(int loop=0;loop<360;loop++)
hangle[loop]=acos(-1.0f)*loop/180.0f;
//copy host angle data to constant memory
cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);
Constant Memory

//execute device kernel
test_kernel<<<size/64,64>>>(darray);
//free device memory
cudaFree(darray);
return 0;
}
__global__ void test_kernel(float* darray)
{
int index;
//calculate each thread global index
index=blockIdx.x*blockDim.x+threadIdx.x;
#pragma unroll 10
for(int loop=0;loop<360;loop++)
darray[index]=darray[index]+cangle[loop];
return;
}
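The performance notes above say constant memory is only fast when a whole warp reads the same location. A hypothetical pair of kernels contrasting the two access patterns (illustrative only, not from the lecture):

```cuda
__constant__ float cangle[360];

//fast: all threads in a warp read the same address, so the value
//is broadcast from the constant cache in a single transaction
__global__ void broadcast_kernel(float* darray,int loop)
{
int index=blockIdx.x*blockDim.x+threadIdx.x;
darray[index]=cangle[loop];
}

//slow: threads in a warp read different addresses, so the
//constant cache must serialize the reads
__global__ void serialized_kernel(float* darray)
{
int index=blockIdx.x*blockDim.x+threadIdx.x;
darray[index]=cangle[index%360];
}
```

In the first kernel the loop index is uniform across the warp, which is exactly the broadcast case the slides describe; the second kernel makes each thread touch a different entry and loses the benefit.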
Texture Memory

!   Texture mapping

!   Texture filtering
-   nearest-neighborhood interpolation
-   linear/bilinear/trilinear interpolation
-   two times bilinear interpolation

Texture Memory

[G80 graphics pipeline diagram: Host, Input Assembler, Setup / Rstr / ZCull, Vtx and Pixel Thread Issue, Work Distribution, Thread Processor, SP and TF units with L1/L2 caches and FB partitions]
these units perform graphical texture operations

Texture Memory

two SMs are cooperated as one texture processing cluster
-   scalable units on graphics
-   texture specific unit only available for texture

Texture Memory

texture specific unit
-   texture address units compute texture addresses
-   texture filtering units compute data interpolation
-   read only texture L1 cache

Texture Memory

[G80 graphics pipeline diagram, repeated]
-   read only texture L2 cache for all TPCs
-   read only texture L1 cache for each TPC

Texture Memory

[die photo: texture specific units]

Texture Memory

!   Texture is an object for reading data
-   data is stored on the device global memory
-   global memory is bound with texture cache

[diagram: global memory (FB) bound with the texture caches]

What is the advantages of texture?

Texture Memory

!   Data caching
-   helpful when global memory coalescing is the main bottleneck

Texture Memory

!   Data filtering
-   support linear/bilinear and trilinear hardware interpolation
-   intrinsic interpolation by the texture specific unit
-   cudaFilterModePoint or cudaFilterModeLinear

Texture Memory

!   Access modes
-   clamp and wrap memory accessing for out-of-bound addresses
-   cudaAddressModeWrap (wrap boundary)
-   cudaAddressModeClamp (clamp boundary)

Texture Memory

!   Bound to linear memory
-   only support 1-dimension problems
-   only get the benefits from texture cache
-   not support addressing modes and filtering

!   Bound to cuda array
-   support float addressing
-   support addressing modes
-   support hardware interpolation
-   support 1/2/3-dimension problems

Texture Memory

!   Host code
-   allocate global linear memory or cuda array
-   create and set the texture reference at file scope
-   bind the texture reference to the allocated memory
-   unbind the texture reference to free cache resource

!   Device code
-   fetch data by indicating texture reference
-   fetch data by using texture fetch function

Texture Memory

!   Texture memory constraints

                             Compute capability 1.3   Compute capability 2.0
1D texture (linear memory)   8192                     32768
1D texture (cuda array)      1024x128
2D texture (cuda array)      (65536,32768)            (65536,65536)
3D texture (cuda array)      (2048,2048,2048)         (4096,4096,4096)

Texture Memory

!   Measuring texture cache miss or hit number
-   latest visual profiler can count cache miss or hit
-   need device compute capability higher than 1.2

Example: 1-dimension linear memory

Texture Memory

//declare texture reference
texture<float,1,cudaReadModeElementType> texreference;
int main(int argc,char** argv)
{
int size=3200;
float* harray;
float* diarray;
float* doarray;
//allocate host and device memory
harray=(float*)malloc(sizeof(float)*size);
cudaMalloc((void**)&diarray,sizeof(float)*size);
cudaMalloc((void**)&doarray,sizeof(float)*size);
//initialize host array before usage
for(int loop=0;loop<size;loop++)
harray[loop]=(float)rand()/(float)(RAND_MAX-1);
//copy array from host to device memory
cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);
Texture Memory

//bind texture reference with linear memory
cudaBindTexture(0,texreference,diarray,sizeof(float)*size);
//execute device kernel
kernel<<<(int)ceil((float)size/64),64>>>(doarray,size);
//unbind texture reference to free resource
cudaUnbindTexture(texreference);
//copy result array from device to host memory
cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);
//free host and device memory
free(harray);
cudaFree(diarray);
cudaFree(doarray);
return 0;
}
Texture Memory

__global__ void kernel(float* doarray,int size)
{
int index;
//calculate each thread global index
index=blockIdx.x*blockDim.x+threadIdx.x;
//fetch global memory through texture reference
doarray[index]=tex1Dfetch(texreference,index);
return;
}

Texture Memory

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
//compute each thread global index
int index=blockIdx.x*blockDim.x+threadIdx.x;
//copy data from global memory
odata[index]=idata[index+offset];
}

Texture Memory

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
//compute each thread global index
int index=blockIdx.x*blockDim.x+threadIdx.x;
//copy data from global memory through texture cache
odata[index]=tex1Dfetch(texreference,index+offset);
}

Example: 2-dimension cuda array

Texture Memory

#define size 3200
//declare texture reference
texture<float,2,cudaReadModeElementType> texreference;
int main(int argc,char** argv)
{
dim3 blocknum;
dim3 blocksize;
float* hmatrix;
float* dmatrix;
cudaArray* carray;
cudaChannelFormatDesc channel;
//allocate host and device memory
hmatrix=(float*)malloc(sizeof(float)*size*size);
cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);
//initialize host matrix before usage
for(int loop=0;loop<size*size;loop++)
hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

Texture Memory

//create channel to describe data type
channel=cudaCreateChannelDesc<float>();
//allocate device memory for cuda array
cudaMallocArray(&carray,&channel,size,size);
//copy matrix from host to device memory
int bytes=sizeof(float)*size*size;
cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);
//set texture filter mode property
//use cudaFilterModePoint or cudaFilterModeLinear
texreference.filterMode=cudaFilterModePoint;
//set texture address mode property
//use cudaAddressModeClamp or cudaAddressModeWrap
texreference.addressMode[0]=cudaAddressModeWrap;
texreference.addressMode[1]=cudaAddressModeClamp;

Texture Memory

//bind texture reference with cuda array
cudaBindTextureToArray(texreference,carray);
blocksize.x=16;
blocksize.y=16;
blocknum.x=(int)ceil((float)size/16);
blocknum.y=(int)ceil((float)size/16);
//execute device kernel
kernel<<<blocknum,blocksize>>>(dmatrix,size);
//unbind texture reference to free resource
cudaUnbindTexture(texreference);
//copy result matrix from device to host memory
cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);
//free host and device memory
free(hmatrix);
cudaFree(dmatrix);
cudaFreeArray(carray);
return 0;
}

Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
int xindex;
int yindex;
//calculate each thread global index
xindex=blockIdx.x*blockDim.x+threadIdx.x;
yindex=blockIdx.y*blockDim.y+threadIdx.y;
//fetch cuda array through texture reference
dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);
return;
}
Example: 3-dimension cuda array

Texture Memory

#define size 256
//declare texture reference
texture<float,3,cudaReadModeElementType> texreference;
int main(int argc,char** argv)
{
dim3 blocknum;
dim3 blocksize;
float* hmatrix;
float* dmatrix;
cudaArray* cudaarray;
cudaExtent volumesize;
cudaChannelFormatDesc channel;
cudaMemcpy3DParms copyparms={0};
//allocate host and device memory
hmatrix=(float*)malloc(sizeof(float)*size*size*size);
cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);
Texture Memory

//initialize host matrix before usage
for(int loop=0;loop<size*size*size;loop++)
hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);
//set cuda array volume size
volumesize=make_cudaExtent(size,size,size);
//create channel to describe data type
channel=cudaCreateChannelDesc<float>();
//allocate device memory for cuda array
cudaMalloc3DArray(&cudaarray,&channel,volumesize);
//set cuda array copy parameters
copyparms.extent=volumesize;
copyparms.dstArray=cudaarray;
copyparms.kind=cudaMemcpyHostToDevice;
copyparms.srcPtr=
make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
cudaMemcpy3D(&copyparms);
Texture Memory

//set texture filter mode property
//use cudaFilterModePoint or cudaFilterModeLinear
texreference.filterMode=cudaFilterModePoint;
//set texture address mode property
//use cudaAddressModeClamp or cudaAddressModeWrap
texreference.addressMode[0]=cudaAddressModeWrap;
texreference.addressMode[1]=cudaAddressModeWrap;
texreference.addressMode[2]=cudaAddressModeClamp;
//bind texture reference with cuda array
cudaBindTextureToArray(texreference,cudaarray,channel);
blocksize.x=8;
blocksize.y=8;
blocksize.z=8;
blocknum.x=(int)ceil((float)size/8);
blocknum.y=(int)ceil((float)size/8);
//execute device kernel
kernel<<<blocknum,blocksize>>>(dmatrix,size);
Texture Memory

//unbind texture reference to free resource
cudaUnbindTexture(texreference);
//copy result matrix from device to host memory
cudaMemcpy(hmatrix,dmatrix,sizeof(float)*size*size*size,cudaMemcpyDeviceToHost);
//free host and device memory
free(hmatrix);
cudaFree(dmatrix);
cudaFreeArray(cudaarray);

return 0;
}

Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
int loop;
int xindex;
int yindex;
int zindex;
//calculate each thread global index
xindex=threadIdx.x+blockIdx.x*blockDim.x;
yindex=threadIdx.y+blockIdx.y*blockDim.y;
for(loop=0;loop<size;loop++)
{
zindex=loop;
//fetch cuda array via texture reference
dmatrix[zindex*size*size+yindex*size+xindex]=
tex3D(texreference,xindex,yindex,zindex);
}
return;
}

Performance comparison: image projection

Texture Memory

image projection or ray casting

Texture Memory

-   global memory accessing is very close to random
-   intrinsic interpolation units are very powerful
-   trilinear interpolation on nearby 8 pixels

Texture Memory

object size 512 x 512 x 512 / ray number 512 x 512

Method                               Time    Speedup
global                               1.891   -
global/locality                      0.198   9.5
texture/point                        0.072   26.2
texture/linear                       0.037   51.1
texture/linear/locality              0.012   157.5
texture/linear/locality/fast math    0.011   171.9
Why texture memory is so powerful?

Texture Memory

!   CUDA Array is reordered to something like space filling Z-order
-   software driver supports reordering data
-   hardware supports spatial memory layout

Why only readable texture cache?

Texture Memory

!   Texture cache cannot detect the dirty data
[diagram: a float array is loaded from host memory into the cache; operations are performed on the cache; write-back is lazily updated; data modified by other threads must be reloaded from memory to cache]

Texture Memory

!   Write data to global memory directly without texture cache
-   only suitable for global linear memory, not cuda array

tex1Dfetch(texreference,index)   //read data through texture cache
darray[index]=value;             //write data to global memory directly
                                 //the texture cache may not be updated

How about the texture data locality?

Texture Memory

Why does CUDA distribute the work blocks in horizontal direction?
-   all blocks get scheduled round-robin based on the number of shaders

Texture Memory

-   load balancing on SMs: suppose consecutive blocks have very similar work load
-   overall texture cache data locality: suppose consecutive blocks use similar nearby data

Texture Memory

reorder the block index fitting into z-order to take advantage of texture L1 cache

Texture Memory

concurrent execution for independent units
-   streaming processors     temp1=a/b+sin(c)
-   special function units   temp2[loop]=__cos(d)
-   texture operation units  temp3=tex2D(ref,x,y)

Texture Memory

Memory        Location   Cache   Speed            Access
global        off-chip   no      hundreds         all threads
constant      off-chip   yes     one ~ hundreds   all threads
texture       off-chip   yes     one ~ hundreds   all threads
shared        on-chip    -       one              block threads
local         off-chip   no      very slow        single thread
register      on-chip    -       one              single thread
instruction   off-chip   yes     -                invisible

Texture Memory

Memory        Read/Write   Property
global        read/write   input or output
constant      read         no structure
texture       read         locality structure
shared        read/write   shared within block
local         read/write   -
register      read/write   local temp variable

!   Reference
-   Mark Harris http://www.markmark.net/
-   Wei-Chao Chen http://www.cs.unc.edu/~ciao/
-   Wen-Mei Hwu http://impact.crhc.illinois.edu/people/current/hwu.php