CUDA Advanced Memory Usage and Optimization
Transcription
CUDA Advanced Memory Usage and Optimization
Yukai Hung ([email protected])
Department of Mathematics, National Taiwan University

Register as Cache? Volatile Qualifier

- Without the volatile qualifier, the compiler assumes memory does not change between two identical reads in the same thread and optimizes the second read away (the slides omit the declaration of array, a shared memory array that neighboring threads write concurrently):

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x];
        array[threadIdx.x+1]=2;

        //identical read: the compiler optimizes this read away
        temp2=array[threadIdx.x];
        result[threadIdx.x]=temp1*temp2;
    }
}

- The compiler effectively keeps the first load in a register and reuses it:

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        int temp=array[threadIdx.x];

        temp1=temp;
        array[threadIdx.x+1]=2;

        //stale register value: the neighbor's write is missed
        temp2=temp;
        result[threadIdx.x]=temp1*temp2;
    }
}

- A block-level barrier forces the second read to happen after the write:

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;

        __syncthreads();

        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}

- The volatile qualifier forces the compiler to issue both reads, without a barrier:

__global__ void kernelFunc(int* result)
{
    volatile int temp1;
    volatile int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;

        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}

Data Prefetch

- Hide memory latency by overlapping loading and computing
  - double buffering is a traditional software pipeline technique

[Figure: tiled matrix multiplication Pd = Md x Nd; compute the current (blue) block of Pd in shared memory while loading the next block from global memory.]

- Without prefetching, each iteration waits for its own load:

for loop
{
    load data from global to shared memory
    synchronize block
    compute data in the shared memory
    synchronize block
}

- With prefetching, the next load overlaps the current computation:

load data from global memory to registers   (very small overhead)
for loop
{
    store data from registers to shared memory
    synchronize block
    load next data from global memory to registers   (overlaps computation)
    compute data in the shared memory
    synchronize block
}

- Computing and loading overlap; registers and shared memory are independent units and both are very fast.
- Matrix-matrix multiplication is the classic application [figure]; a sketch follows.
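A minimal sketch of this double-buffered scheme applied to tiled matrix multiplication C = A x B, assuming square n x n matrices with n a multiple of TILE (the kernel name and TILE value are illustrative, not from the slides):

#define TILE 16

__global__ void matmulPrefetch(const float* A,const float* B,float* C,int n)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row=blockIdx.y*TILE+threadIdx.y;
    int col=blockIdx.x*TILE+threadIdx.x;

    //prefetch the first tiles from global memory to registers
    float rA=A[row*n+threadIdx.x];
    float rB=B[threadIdx.y*n+col];

    float sum=0.0f;

    for(int t=0;t<n/TILE;t++)
    {
        //store the prefetched tiles from registers to shared memory
        sA[threadIdx.y][threadIdx.x]=rA;
        sB[threadIdx.y][threadIdx.x]=rB;
        __syncthreads();

        //load the next tiles into registers while computing on the current ones
        if(t+1<n/TILE)
        {
            rA=A[row*n+(t+1)*TILE+threadIdx.x];
            rB=B[((t+1)*TILE+threadIdx.y)*n+col];
        }

        //compute on the current tiles in shared memory
        for(int k=0;k<TILE;k++)
            sum+=sA[threadIdx.y][k]*sB[k][threadIdx.x];
        __syncthreads();
    }

    C[row*n+col]=sum;
}

Launched with dim3 block(TILE,TILE) and dim3 grid(n/TILE,n/TILE), each block keeps the next tile in flight in registers while the current tile is consumed from shared memory, matching the pseudocode above.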
Constant Memory

- Where is constant memory?
  - data is stored in the device global memory
  - data is read through the per-multiprocessor constant cache
  - 64KB constant memory and 8KB cache for each multiprocessor

- How about the performance?
  - optimized when a warp of threads reads the same location
  - 4 bytes per cycle through broadcasting to a warp of threads
  - serialized when a warp of threads reads different locations
  - very slow when the cache misses (data is read from global memory)
  - access latency can range from one to hundreds of clock cycles

- How to use constant memory?
  - declare constant memory at file scope (global variable)
  - copy data to constant memory from the host (because it is constant!)

//declare constant memory
__constant__ float cst_ptr[size];

//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);

A complete example:

//declare constant memory at file scope
__constant__ float cangle[360];

__global__ void test_kernel(float* darray);

int main(int argc,char** argv)
{
    int size=3200;
    float* darray;
    float hangle[360];

    //allocate device memory
    cudaMalloc((void**)&darray,sizeof(float)*size);

    //initialize allocated memory
    cudaMemset(darray,0,sizeof(float)*size);

    //initialize angle array on host
    for(int loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    //copy host angle data to constant memory
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);

    //execute device kernel
    test_kernel<<<size/64,64>>>(darray);

    //free device memory
    cudaFree(darray);

    return 0;
}

__global__ void test_kernel(float* darray)
{
    //calculate each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //every thread reads the same cangle[loop]: a broadcast
    #pragma unroll 10
    for(int loop=0;loop<360;loop++)
        darray[index]=darray[index]+cangle[loop];

    return;
}

Texture Memory

- Texture mapping [figures]

- Texture filtering
  - nearest-neighborhood interpolation [figure]
  - linear/bilinear/trilinear interpolation [figure]
  - trilinear interpolation as two bilinear interpolations [figure]

[Figure: GPU block diagram with streaming processors (SP), texture fetch (TF) units, L1/L2 caches, and framebuffer (FB) partitions; the TF units perform graphical texture operations.]

- Two SMs cooperate as one texture processing cluster (TPC)
  - scalable units on the graphics chip
  - texture-specific units, only available for texture

- The texture-specific unit contains
  - texture address units that compute texture addresses
  - texture filtering units that compute data interpolation
  - a read-only texture L1 cache

- A read-only texture L2 cache is shared by all TPCs; each TPC has its own read-only texture L1 cache [figures]

- Texture is an object for reading data
  - data is stored in the device global memory
  - global memory is bound with the texture cache

What are the advantages of texture?

- Data caching
  - helpful when global memory coalescing is the main bottleneck

- Data filtering
  - supports linear/bilinear and trilinear hardware interpolation
  - intrinsic interpolation in the texture-specific unit
  - cudaFilterModePoint or cudaFilterModeLinear

- Access modes
  - clamp and wrap memory accessing for out-of-bound addresses
  - cudaAddressModeWrap (wrap boundary) or cudaAddressModeClamp (clamp boundary)
  - a small sketch of these two properties follows
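As a small illustration of the filtering and addressing properties (not from the slides; the texture name, data, and sizes are made up), the sketch below binds a tiny 1D cuda array, enables linear filtering and wrap addressing (wrap requires normalized coordinates), and samples between the stored texels:

#include <stdio.h>

//separate reference name to avoid clashing with the examples below
texture<float,1,cudaReadModeElementType> sampletex;

__global__ void sample_kernel(float* out,int n)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch at normalized coordinates: the hardware interpolates
    //linearly between neighboring texels and wraps at the borders
    out[index]=tex1D(sampletex,(index+0.5f)/(float)n);
}

int main(int argc,char** argv)
{
    const int n=4;
    float hdata[n]={0.0f,1.0f,2.0f,3.0f};
    float hout[2*n];
    float* dout;
    cudaArray* carray;

    cudaChannelFormatDesc channel=cudaCreateChannelDesc<float>();

    //height 0 allocates a 1-dimensional cuda array
    cudaMallocArray(&carray,&channel,n,0);
    cudaMemcpyToArray(carray,0,0,hdata,sizeof(float)*n,cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dout,sizeof(float)*2*n);

    //hardware interpolation and wrap addressing on normalized coordinates
    sampletex.normalized=true;
    sampletex.filterMode=cudaFilterModeLinear;
    sampletex.addressMode[0]=cudaAddressModeWrap;
    cudaBindTextureToArray(sampletex,carray);

    sample_kernel<<<1,2*n>>>(dout,2*n);
    cudaMemcpy(hout,dout,sizeof(float)*2*n,cudaMemcpyDeviceToHost);

    for(int loop=0;loop<2*n;loop++)
        printf("%f\n",hout[loop]);

    cudaUnbindTexture(sampletex);
    cudaFreeArray(carray);
    cudaFree(dout);

    return 0;
}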
- Bound to linear memory
  - only supports 1-dimensional problems
  - only gets the benefit of the texture cache
  - does not support addressing modes and filtering

- Bound to a cuda array
  - supports float addressing
  - supports addressing modes
  - supports hardware interpolation
  - supports 1/2/3-dimensional problems

- Host code
  - allocate global linear memory or a cuda array
  - create and set the texture reference at file scope
  - bind the texture reference to the allocated memory
  - unbind the texture reference to free the cache resource

- Device code
  - fetch data by indicating the texture reference
  - fetch data by using the texture fetch function

- Texture memory constraints

                              Compute capability 1.3   Compute capability 2.0
  1D texture, linear memory   2^27                     2^27
  1D texture, cuda array      8192                     32768
  2D texture, cuda array      (65536, 32768)           (65536, 65536)
  3D texture, cuda array      (2048, 2048, 2048)       (4096, 4096, 4096)

- Measuring texture cache miss or hit numbers
  - the latest Visual Profiler can count cache misses and hits
  - requires device compute capability higher than 1.2

Example: 1-dimension linear memory

//declare texture reference at file scope
texture<float,1,cudaReadModeElementType> texreference;

__global__ void kernel(float* doarray,int size);

int main(int argc,char** argv)
{
    int size=3200;
    float* harray;
    float* diarray;
    float* doarray;

    //allocate host and device memory
    harray=(float*)malloc(sizeof(float)*size);
    cudaMalloc((void**)&diarray,sizeof(float)*size);
    cudaMalloc((void**)&doarray,sizeof(float)*size);

    //initialize host array before usage
    for(int loop=0;loop<size;loop++)
        harray[loop]=(float)rand()/(float)(RAND_MAX-1);

    //copy array from host to device memory
    cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);

    //bind texture reference with linear memory
    cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

    //execute device kernel
    kernel<<<(int)ceil((float)size/64),64>>>(doarray,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result array from device to host memory
    cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(harray);
    cudaFree(diarray);
    cudaFree(doarray);

    return 0;
}

__global__ void kernel(float* doarray,int size)
{
    //calculate each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch global memory through texture reference
    doarray[index]=tex1Dfetch(texreference,index);

    return;
}

A plain copy with a misaligned offset degrades global memory performance:

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data from global memory
    odata[index]=idata[index+offset];
}

The same copy fetched through the texture cache avoids the misalignment penalty:

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch data through the texture cache instead
    odata[index]=tex1Dfetch(texreference,index+offset);
}
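To attach numbers to this comparison, a cudaEvent timing harness along the following lines could be used (a sketch, not from the slides; the array size, offset, and launch shape are arbitrary, and each offsetCopy variant is assumed to be compiled in separately since both share one name):

#include <stdio.h>

int main(int argc,char** argv)
{
    const int num=1<<22;
    const int offset=1;   //misaligned by one element
    float* idata;
    float* odata;
    float elapsed;
    cudaEvent_t start,stop;

    cudaMalloc((void**)&idata,sizeof(float)*(num+offset));
    cudaMalloc((void**)&odata,sizeof(float)*num);
    cudaMemset(idata,0,sizeof(float)*(num+offset));

    //bind the input array for the texture variant
    cudaBindTexture(0,texreference,idata,sizeof(float)*(num+offset));

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start,0);
    offsetCopy<<<num/256,256>>>(idata,odata,offset);
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed,start,stop);

    printf("offset %d copy: %f ms\n",offset,elapsed);

    cudaUnbindTexture(texreference);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(idata);
    cudaFree(odata);

    return 0;
}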
Example: 2-dimension cuda array

//use a const instead of a #define so the kernel parameter may reuse the name
const int size=3200;

//declare texture reference at file scope
texture<float,2,cudaReadModeElementType> texreference;

__global__ void kernel(float* dmatrix,int size);

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    int bytes;
    float* hmatrix;
    float* dmatrix;
    cudaArray* carray;
    cudaChannelFormatDesc channel;

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);

    //initialize host matrix before usage
    for(int loop=0;loop<size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMallocArray(&carray,&channel,size,size);

    //copy matrix from host to device memory
    bytes=sizeof(float)*size*size;
    cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeClamp;

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,carray);

    blocksize.x=16;
    blocksize.y=16;
    blocknum.x=(int)ceil((float)size/16);
    blocknum.y=(int)ceil((float)size/16);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(carray);

    return 0;
}

__global__ void kernel(float* dmatrix,int size)
{
    //calculate each thread global index
    int xindex=blockIdx.x*blockDim.x+threadIdx.x;
    int yindex=blockIdx.y*blockDim.y+threadIdx.y;

    //fetch cuda array through texture reference
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);

    return;
}
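The cuda array path enables the hardware filtering described earlier, so a natural use of a 2D texture is image resampling. Below is a minimal sketch, not from the slides (image sizes, the test pattern, and all names are illustrative), that upscales an image with the texture unit's bilinear interpolation:

#include <stdlib.h>

texture<float,2,cudaReadModeElementType> imgtex;

__global__ void resample(float* dst,int dstW,int dstH)
{
    int x=blockIdx.x*blockDim.x+threadIdx.x;
    int y=blockIdx.y*blockDim.y+threadIdx.y;

    if(x>=dstW||y>=dstH)
        return;

    //normalized coordinates map both image sizes onto [0,1)
    float u=(x+0.5f)/(float)dstW;
    float v=(y+0.5f)/(float)dstH;

    //the texture unit performs the bilinear interpolation
    dst[y*dstW+x]=tex2D(imgtex,u,v);
}

int main(int argc,char** argv)
{
    const int srcW=256,srcH=256;
    const int dstW=512,dstH=512;
    float* himg=(float*)malloc(sizeof(float)*srcW*srcH);
    float* dout;
    cudaArray* srcarray;

    //arbitrary test pattern
    for(int loop=0;loop<srcW*srcH;loop++)
        himg[loop]=(float)(loop%srcW);

    cudaChannelFormatDesc channel=cudaCreateChannelDesc<float>();
    cudaMallocArray(&srcarray,&channel,srcW,srcH);
    cudaMemcpyToArray(srcarray,0,0,himg,sizeof(float)*srcW*srcH,
                      cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dout,sizeof(float)*dstW*dstH);

    //bilinear filtering with clamped normalized coordinates
    imgtex.normalized=true;
    imgtex.filterMode=cudaFilterModeLinear;
    imgtex.addressMode[0]=cudaAddressModeClamp;
    imgtex.addressMode[1]=cudaAddressModeClamp;
    cudaBindTextureToArray(imgtex,srcarray);

    dim3 blocksize(16,16);
    dim3 blocknum((dstW+15)/16,(dstH+15)/16);
    resample<<<blocknum,blocksize>>>(dout,dstW,dstH);

    cudaUnbindTexture(imgtex);
    cudaFreeArray(srcarray);
    cudaFree(dout);
    free(himg);

    return 0;
}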
Example: 3-dimension cuda array

//use a const instead of a #define so the kernel parameter may reuse the name
const int size=256;

//declare texture reference at file scope
texture<float,3,cudaReadModeElementType> texreference;

__global__ void kernel(float* dmatrix,int size);

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    int bytes;
    float* hmatrix;
    float* dmatrix;
    cudaArray* cudaarray;
    cudaExtent volumesize;
    cudaChannelFormatDesc channel;
    cudaMemcpy3DParms copyparms={0};

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);

    //initialize host matrix before usage
    for(int loop=0;loop<size*size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

    //set cuda array volume size
    volumesize=make_cudaExtent(size,size,size);

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMalloc3DArray(&cudaarray,&channel,volumesize);

    //set cuda array copy parameters
    copyparms.extent=volumesize;
    copyparms.dstArray=cudaarray;
    copyparms.kind=cudaMemcpyHostToDevice;
    copyparms.srcPtr=
        make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
    cudaMemcpy3D(&copyparms);

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeWrap;
    texreference.addressMode[2]=cudaAddressModeClamp;

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,cudaarray,channel);

    blocksize.x=8;
    blocksize.y=8;
    blocksize.z=1;   //each (x,y) thread loops over the whole z range
    blocknum.x=(int)ceil((float)size/8);
    blocknum.y=(int)ceil((float)size/8);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    bytes=sizeof(float)*size*size*size;
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(cudaarray);

    return 0;
}

__global__ void kernel(float* dmatrix,int size)
{
    //calculate each thread global index
    int xindex=threadIdx.x+blockIdx.x*blockDim.x;
    int yindex=threadIdx.y+blockIdx.y*blockDim.y;

    for(int loop=0;loop<size;loop++)
    {
        int zindex=loop;

        //fetch cuda array via texture reference
        dmatrix[zindex*size*size+yindex*size+xindex]=
            tex3D(texreference,xindex,yindex,zindex);
    }

    return;
}

Performance comparison: image projection

[Figure: image projection / ray casting through a volume.]

- Global memory accessing in ray casting is very close to random.
- The intrinsic interpolation units are very powerful: trilinear interpolation over the 8 nearby voxels.

Object size 512 x 512 x 512, ray number 512 x 512:

Method                               Time     Speedup
global                               1.891    -
global/locality                      0.198    9.5
texture/point                        0.072    26.2
texture/linear                       0.037    51.1
texture/linear/locality              0.012    157.5
texture/linear/locality/fast math    0.011    171.9

Why is texture memory so powerful?

- A cuda array is reordered into something like a space-filling Z-order
  - the software driver supports reordering the data
  - the hardware supports the spatial memory layout

Why is the texture cache read-only?

- The texture cache cannot detect dirty data
  - [Figure: a float array is loaded from memory into the texture cache; other threads later modify the array in memory (lazy update for write-back), and the cache keeps serving the stale values until it reloads from memory.]

- Writes go to global memory directly, bypassing the texture cache
  - only suitable for global linear memory, not for a cuda array
  - tex1Dfetch(texreference,index) reads data through the texture cache
  - darray[index]=value writes to global memory directly, so the texture cache may not be updated

How about the texture data locality?

- Why does CUDA distribute the work blocks in the horizontal direction? All blocks get scheduled round-robin based on the number of shaders.
- This balances the load across SMs (consecutive blocks have very similar work loads) at the cost of texture cache data locality (consecutive blocks use similar nearby data).
- Reordering the block index to fit a Z-order takes advantage of the texture L1 cache.

- Concurrent execution on independent units
  - streaming processors: temp1=a/b+sin(c)
  - special function units: temp2[loop]=__cosf(d)
  - texture operation units: temp3=tex2D(ref,x,y)

Memory summary:

Memory        Location   Cache   Speed            Access
global        off-chip   no      hundreds         all threads
constant      off-chip   yes     one ~ hundreds   all threads
texture       off-chip   yes     one ~ hundreds   all threads
shared        on-chip    -       one              threads in a block
local         off-chip   no      very slow        single thread
register      on-chip    -       one              single thread
instruction   off-chip   yes     -                invisible

Memory        Read/Write   Property
global        read/write   input or output
constant      read         no structure
texture       read         locality structure
shared        read/write   shared within block
local         read/write   -
register      read/write   local temp variable

Reference
- Mark Harris, http://www.markmark.net/
- Wei-Chao Chen, http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu, http://impact.crhc.illinois.edu/people/current/hwu.php