Auto-tuning a High-level Language Targeted to GPU Codes

Transcription

Scott Grauer-Gray, Robert Searles, Lifan Xu, Sudhee Ayalasomayajula, John Cavazos
Dept. of Computer & Information Sciences
University of Delaware
Graphics Processing Units (GPUs)

General-Purpose GPU (GPGPU)

Optimizing GPU Code
•  Constantly "tweaking" GPU code
  –  Lots of low-level details
•  Resulting code is brittle
  –  Optimizations are application- (and input-) and device-specific!

High-level Language for GPUs
•  High-level languages
  –  Good productivity, but low performance?
[Diagram: high-level language productivity vs. manual code performance]

New Research Area: Autotuning
•  Interesting new technique:
  –  Search the space of optimized programs

Solution: Autotuning + HLL + GPUs

Goals of Project:
[Diagram: an HLL program is transformed into many optimized GPU programs, from which the best optimized GPU program is selected, giving low-level performance from high-level languages]
HMPP Workbench
•  High-level language for GPUs
•  Similar to OpenMP, but for GPUs
  –  Modify code through directives
•  Generates CUDA/OpenCL kernels

HMPP Workbench (cont'd)
•  Initiative to make an open standard
  –  OpenHMPP (also OpenACC)
•  Commercial product available here:
   www.caps-entreprise.com/hmpp.html

HMPP Workbench (cont'd)
•  Directives also drive GPU optimizations
  –  Permutation
  –  Tiling/unrolling
  –  Fusion/fission
•  But there is no tool that helps the programmer decide which optimizations to use. Hard problem!

HMPP Workbench pipeline:
C or Fortran (sequential code) with HMPP directives → HMPP source-to-source translator → OpenCL or CUDA compiler
HMPP Unroll Pragma
•  Unroll makes copies of the loop body
  –  Pragma specifies "contiguous" unroll with factor 2

HMPP Unroll Pragma (cont'd)
•  Resulting code
[Diagram: unrolled loop body distributed across threads 1 through N]
HMPP Tiling Pragma
•  Pragma specifies tiling with factor 2

HMPP Tiling Pragma (cont'd)
•  Resulting code: new inner loop for tiles
PolyBench
•  Collection of scientific kernels
  –  Available at http://www.cse.ohio-state.edu/~pouchet/software/polybench/
  –  Converted 14 of these kernels to CUDA, OpenCL, and HMPP
PolyBench (cont'd)
•  Kernels converted to CUDA/OpenCL
  –  Linear algebra
    •  2mm, 3mm, atax, bicg, gemm, gesummv, matmul, mvt, syr2k, syrk
  –  Linear algebra solvers
    •  gramschmidt
  –  Data mining
    •  correlation, covariance
  –  Stencils
    •  fdtd-2d

Optimization Search Space
•  Permute: re-orders loops in the loop nest (parameter values depend on the kernel: different orderings of the loops)
•  Unroll: unrolls a loop at a given factor (unroll factors 1 through 8)
•  Tile: tiles a loop at a given factor (tiling factors 1 through 8)
•  Blocksize: thread block dimensions (kept fixed for these experiments)

Case Study: Optimizing GEMM
•  Sequential code

Pre-processed GEMM Code
•  Permutation of "i" and "j" loops

Pre-processed GEMM Code (cont'd)
•  Unrolling and tiling of "i", "j", and "k" loops

Best Optimized GEMM Version
•  Original permutation
•  No unrolling/tiling on "i" and "j" loops
•  Unrolling with "contiguous" option on innermost "k" loop

PolyBench Experiments
•  Experiments performed on C2050 GPU (Fermi)
  –  448 CUDA cores
•  Autotuned HMPP versions of PolyBench
  –  Generated optimized versions of CUDA and OpenCL
•  Compared against hand-coded CUDA and OpenCL

Number of Optimized Versions

Program        Optimized Versions    Program        Optimized Versions
2mm            97                    gemm           168
3mm            118                   gesummv        631
atax           67                    matmul         337
bicg           161                   gramschmidt    727
correlation    153                   mvt            108
covariance     448                   syr2k          97
fdtd           141                   syrk           281

Autotuning HMPP / Manual CUDA
[Chart: speedup of best optimized HMPP CUDA ("Opt HMPP CUDA") and manual CUDA over default HMPP CUDA for each kernel; y-axis runs 0 to 2.5, with off-scale bars labeled 2.51, 3.04, 19.1, and 32.8]
•  Autotuning benefits 6 HMPP programs
•  Manual better than optimized HMPP for some programs
•  Autotuning did not help some cases!
Autotuning HMPP / Manual OpenCL
[Chart: speedup of best optimized HMPP OpenCL ("Opt HMPP OpenCL") and manual OpenCL over default HMPP OpenCL for each kernel; y-axis runs 0 to 2.5, with off-scale bars labeled 2.78, 2.82, 3.94, 5.51, 5.51, 37.9, and 46.4]
•  Autotuning benefits 4 HMPP programs targeted to OpenCL
•  7 manual codes performed better than best-optimized HMPP
•  6 manual OpenCL programs performed poorly
Summary Results

              CUDA geo-mean    OpenCL geo-mean
Best HMPP     1.46             1.42
Manual        0.70             1.43

On average, the best autotuned versions meet or exceed manual performance!
Best Optimizations Found
•  MVT
  –  HMPP CUDA: tile all four loops using factor 2
  –  HMPP OpenCL: tile first and third loops using factor 2
•  3MM
  –  HMPP CUDA: unroll 3rd, 6th, and 9th loops using "split" option with factor 3
  –  HMPP OpenCL: unroll 3rd, 6th, and 9th loops using "contiguous" option with factor 6
•  GEMM
  –  HMPP CUDA: unroll innermost loop using "contiguous" option with factor 64
  –  HMPP OpenCL: unroll innermost loop using "contiguous" option with factor 64
•  SYRK
  –  HMPP CUDA: unroll 3rd loop using "split" option with factor 2
  –  HMPP OpenCL: unroll 3rd loop using "split" option with factor 2

Predictive Modeling
•  Autotuning can be expensive
•  A model can predict the best optimization to apply
[Diagram: program characterization and an optimization sequence feed the model; output is the predicted performance]

Optimizing Belief Propagation
[Chart: speedup over CPU (0 to 20) of default HMPP, manual CUDA, and optimized HMPP for the CUDA and OpenCL implementations]

Conclusions
•  Achieve low-level performance using a high-level language (HMPP)
  –  Autotuned HMPP comparable to hand-optimized GPU code
  –  The best optimizations are different for CUDA and OpenCL

Extra Slides

Stereo Vision
•  Two cameras take pictures
  –  Separated by a distance
•  Algorithm compares the two images by shifting
  –  Shifted amount is the disparity
HMPP Belief Propagation
•  Iterative algorithm
•  Applied to stereo vision
  –  Input: stereo set of two images
  –  Output: disparity map between the images
[Images: an image from the Tsukuba stereo set and its ground-truth disparity map]
BP Message-Passing Function
•  Disparity values computed for each pixel and passed to neighbors
  –  Runtime dominated by this "message-passing" step
[Diagram: messages ml, mr, mu, and md passed between each pixel and its left, right, up, and down neighbors in the grid]
HMPP Belief Propagation (cont'd)
•  Input stereo set:
  –  Tsukuba:
    •  Dimensions: 384 x 288
    •  15 possible disparities
•  Speedup over CPU implementation:
  –  CUDA and OpenCL kernels generated by HMPP
  –  250 message-passing iterations