Auto-tuning a High-level Language Targeted to GPU Codes
Scott Grauer-Gray, Robert Searles, Lifan Xu, Sudhee Ayalasomayajula, John Cavazos
Dept. of Computer & Information Sciences, University of Delaware

Graphics Processing Units (GPUs)

General Purpose GPU (GPGPU)

Optimizing GPU Code
• Constantly "tweaking" GPU code
  – Lots of low-level details
• Resulting code is brittle
  – Optimizations are application- (and input-) and device-specific!

High-level Language for GPUs
• High-level languages
  – Good productivity, but low performance
• (Diagram: performance of high-level language code vs. manually written GPU code; can the gap be closed?)

New Research Area: Autotuning
• Interesting new technique:
  – Search a space of optimized programs

Solution: Autotuning + HLL + GPUs

Goals of Project
• (Diagram: high-level languages with low-level performance; an HLL program is transformed into many optimized GPU programs, from which the best optimized GPU program is selected)

HMPP WORKBENCH
• High-level language for GPUs
• Similar to OpenMP, but for GPUs
  – Modify code through directives
• Generates CUDA/OpenCL kernels

HMPP WORKBENCH (cont'd)
• Initiative to make an open standard
  – OpenHMPP (also OpenACC)
• Commercial product available at www.caps-entreprise.com/hmpp.html

HMPP WORKBENCH (cont'd)
• Directives also drive GPU optimizations
  – Permutation
  – Tiling/unrolling
  – Fusion/fission
• But there is no tool that helps the programmer decide which optimizations to use. Hard problem!

HMPP WORKBENCH Workflow
• C or Fortran (sequential code) with HMPP directives
• HMPP source-to-source translator
• OpenCL or CUDA compiler

HMPP Unroll Pragma
• Unroll makes copies of the loop body
  – Pragma specifies a "contiguous" unroll with factor 2
• Resulting code: the unrolled loop body is divided among threads 1 through N (figure)

HMPP Tiling Pragma
• Pragma specifies tiling with factor 2
  – Tiling introduces a new inner loop over the elements of each tile (figure)
  – (A plain-C sketch of the unroll and tile transformations appears after the search-space table below.)

PolyBench
• Collection of scientific kernels
  – Available at http://www.cse.ohio-state.edu/~pouchet/software/polybench/
  – Converted 14 of these kernels to CUDA, OpenCL, and HMPP

PolyBench (cont'd)
• Kernels converted to CUDA/OpenCL
  – Linear algebra: 2mm, 3mm, atax, bicg, gemm, gesummv, matmul, mvt, syr2k, syrk
  – Linear algebra solvers: gramschmidt
  – Data mining: correlation, covariance
  – Stencils: fdtd-2d

Optimization Search Space

  Pragma     Description                       Parameter Values
  Permute    Re-orders loops in the loop nest  Depends on kernel: different orderings of the loops
  Unroll     Unrolls loop at a given factor    Unroll factors 1 through 8
  Tile       Tiles loop at a given factor      Tiling factors 1 through 8
  Blocksize  Thread block dimensions           Kept fixed for these experiments
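The unroll and tiling slides originally showed the transformed code. As a stand-in, here is a minimal plain-C sketch of what a factor-2 "contiguous" unroll and a factor-2 tile do to a simple loop, viewed sequentially. The names (scale, a, b, N) are made up for this sketch and are not taken from the HMPP examples; the HMPP pragmas themselves are omitted because their exact spelling depends on the Workbench version.

    /* Illustrative only: what the unroll and tile transformations do to a
     * simple loop.  Names (a, b, N) are invented for this sketch. */

    #define N 1024

    /* Original loop. */
    void scale(float *a, const float *b)
    {
        for (int i = 0; i < N; i++)
            a[i] = 2.0f * b[i];
    }

    /* Factor-2 "contiguous" unroll: each iteration of the new loop handles
     * two adjacent elements (assumes N is even, so no remainder loop). */
    void scale_unroll2(float *a, const float *b)
    {
        for (int i = 0; i < N; i += 2) {
            a[i]     = 2.0f * b[i];
            a[i + 1] = 2.0f * b[i + 1];
        }
    }

    /* Factor-2 tiling: an outer loop walks tiles and a new inner loop walks
     * the elements of each tile, as described on the tiling slide. */
    void scale_tile2(float *a, const float *b)
    {
        for (int it = 0; it < N; it += 2)        /* loop over tiles */
            for (int i = it; i < it + 2; i++)    /* new inner loop  */
                a[i] = 2.0f * b[i];
    }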
Case Study: Optimizing GEMM
• Sequential code (shown on slide)

Pre-processed GEMM Code
• Permutation of the "i" and "j" loops (code shown on slide)

Pre-processed GEMM Code (cont'd)
• Unrolling and tiling of the "i", "j", and "k" loops (code shown on slide)

Best Optimized GEMM Version
• Original permutation
• No unrolling/tiling on the "i" and "j" loops
• Unrolling with the "contiguous" option on the innermost "k" loop

PolyBench Experiments
• Experiments performed on a C2050 GPU (Fermi)
  – 448 CUDA cores
• Autotuned HMPP versions of PolyBench
  – Generated optimized versions of CUDA and OpenCL
• Compared against hand-coded CUDA and OpenCL

Number of Optimized Versions

  Program       Optimized Versions    Program        Optimized Versions
  2mm           97                    gemm           168
  3mm           118                   gesummv        631
  atax          67                    matmul         337
  bicg          161                   gramschmidt    727
  correlation   153                   mvt            108
  covariance    448                   syr2k          97
  fdtd          141                   syrk           281

Autotuning HMPP vs. Manual CUDA
• (Chart: speedup over default HMPP CUDA for the best autotuned HMPP CUDA and the manual CUDA versions; the y-axis is clipped at 2.5, with off-scale bars labeled 32.8, 19.1, 2.51, and 3.04)
• Autotuning benefits 6 HMPP programs
• Manual CUDA is better than optimized HMPP for some programs
• Autotuning did not help some cases!

Autotuning HMPP vs. Manual OpenCL
• (Chart: speedup over default HMPP OpenCL for the best autotuned HMPP OpenCL and the manual OpenCL versions; the y-axis is clipped at 2.5, with off-scale bars labeled 46.4, 37.9, 5.51, 5.51, 3.94, 2.82, and 2.78)
• Autotuning benefits 4 HMPP programs targeted to OpenCL
• 7 manual codes performed better than the best-optimized HMPP
• 6 manual OpenCL programs performed poorly

Summary Results (geometric mean speedup over default HMPP)

              CUDA    OpenCL
  Best HMPP   1.46    1.42
  Manual      0.70    1.43

• On average, the best autotuned versions meet or exceed manual performance!

Best Optimizations Found
• MVT
  – HMPP CUDA: tile all four loops using factor 2
  – HMPP OpenCL: tile the first and third loops using factor 2
• 3MM
  – HMPP CUDA: unroll the 3rd, 6th, and 9th loops using the "split" option with factor 3
  – HMPP OpenCL: unroll the 3rd, 6th, and 9th loops using the "contiguous" option with factor 6
• GEMM
  – HMPP CUDA: unroll the innermost loop using the "contiguous" option with factor 64
  – HMPP OpenCL: unroll the innermost loop using the "contiguous" option with factor 64
• SYRK
  – HMPP CUDA: unroll the 3rd loop using the "split" option with factor 2
  – HMPP OpenCL: unroll the 3rd loop using the "split" option with factor 2

Predictive Modeling
• Autotuning can be expensive
• A model can predict the best optimization to apply
• (Diagram: program characterization and an optimization sequence are fed to the model; output: predicted performance)
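The predictive-modeling slide is motivated by the cost of exhaustive search: every candidate version must be generated and timed. The sketch below is a toy illustration of that kind of empirical search, not the HMPP/CAPS tool flow used in the study; the kernel, the three hand-written variants, and the timing harness are all invented for this example, and the real experiments evaluated hundreds of HMPP-generated versions per kernel.

    /* Toy empirical search: time a few hand-written variants of a small
     * kernel and keep the fastest.  Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)

    static void v_base(float *a, const float *b)
    {
        for (int i = 0; i < N; i++) a[i] = 2.0f * b[i] + 1.0f;
    }

    static void v_unroll2(float *a, const float *b)
    {
        for (int i = 0; i < N; i += 2) {          /* N is divisible by 2 */
            a[i]     = 2.0f * b[i]     + 1.0f;
            a[i + 1] = 2.0f * b[i + 1] + 1.0f;
        }
    }

    static void v_unroll4(float *a, const float *b)
    {
        for (int i = 0; i < N; i += 4) {          /* N is divisible by 4 */
            a[i]     = 2.0f * b[i]     + 1.0f;
            a[i + 1] = 2.0f * b[i + 1] + 1.0f;
            a[i + 2] = 2.0f * b[i + 2] + 1.0f;
            a[i + 3] = 2.0f * b[i + 3] + 1.0f;
        }
    }

    int main(void)
    {
        void (*variants[])(float *, const float *) = { v_base, v_unroll2, v_unroll4 };
        const char *names[] = { "base", "unroll2", "unroll4" };
        float *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++) b[i] = (float)i;

        int best = 0;
        double best_time = 1e30;
        for (int v = 0; v < 3; v++) {
            clock_t start = clock();
            for (int rep = 0; rep < 100; rep++)   /* repeat for a stable time */
                variants[v](a, b);
            double t = (double)(clock() - start) / CLOCKS_PER_SEC;
            printf("%-8s %.3f s\n", names[v], t);
            if (t < best_time) { best_time = t; best = v; }
        }
        printf("best variant: %s\n", names[best]);
        free(a); free(b);
        return 0;
    }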
Optimizing Belief Propagation
• (Chart: speedup over the CPU implementation, axis from 0 to 20, for the default HMPP, optimized HMPP, and manual CUDA versions of the CUDA and OpenCL implementations)

Conclusions
• Low-level performance can be achieved using a high-level language (HMPP)
  – Autotuned HMPP is comparable to hand-optimized GPU code
  – The best optimizations are different for CUDA and OpenCL

Extra Slides

Stereo Vision
• Two cameras take pictures
  – Separated by a distance
• The algorithm compares the two images by shifting
  – The shifted amount is the disparity

HMPP Belief Propagation
• Iterative algorithm
• Applied to stereo vision
  – Input: a stereo set of two images
  – Output: a disparity map between the images
• (Figures: an image from the Tsukuba stereo set and its ground-truth disparity map)

BP Message-Passing Function
• Disparity values are computed for each pixel and passed to its neighbors
  – Runtime is dominated by this "message-passing" step
• (Figure: messages ml, mr, mu, md exchanged between neighboring pixels; a simplified C sketch of one message update follows these slides)

HMPP Belief Propagation (cont'd)
• Input stereo set
  – Tsukuba: dimensions 384 x 288, 15 possible disparities
• Speedup over the CPU implementation
  – CUDA and OpenCL kernels generated by HMPP
  – 250 message-passing iterations
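To make the "message-passing" step concrete, here is a simplified scalar C sketch of one belief-propagation message update using a generic min-sum formulation with a truncated-linear smoothness term. This is not the HMPP or CUDA code from the talk: the function name, the smoothness constants, and the cost layout are assumptions; only the 15-disparity count comes from the Tsukuba slide above.

    /* Simplified min-sum BP message update for stereo: the message a pixel
     * sends to one neighbor combines its data cost with the messages arriving
     * from its other three neighbors, minimized over disparities with a
     * truncated-linear smoothness penalty.  Generic sketch, not the talk's
     * GPU implementation. */

    #define NUM_DISP   15        /* possible disparities (Tsukuba slide)    */
    #define SMOOTH_W   1.0f      /* smoothness weight (chosen for sketch)   */
    #define SMOOTH_MAX 2.0f      /* truncation of the smoothness term       */

    /* data:          data cost of this pixel for each disparity
     * in1, in2, in3: incoming messages from the three neighbors other than
     *                the one being sent to
     * m_out:         outgoing message, one value per disparity            */
    void bp_message(const float data[NUM_DISP],
                    const float in1[NUM_DISP],
                    const float in2[NUM_DISP],
                    const float in3[NUM_DISP],
                    float m_out[NUM_DISP])
    {
        for (int d = 0; d < NUM_DISP; d++) {
            float best = 1e30f;
            for (int dp = 0; dp < NUM_DISP; dp++) {
                float diff   = (float)(d > dp ? d - dp : dp - d);
                float smooth = SMOOTH_W * diff;
                if (smooth > SMOOTH_MAX) smooth = SMOOTH_MAX;
                float cost = data[dp] + in1[dp] + in2[dp] + in3[dp] + smooth;
                if (cost < best) best = cost;
            }
            m_out[d] = best;
        }
    }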