OpenACC Tools - Debugging and Profiling
Tools for Debugging & Profiling
OpenACC Course 2016
Andreas Herten, Forschungszentrum Jülich, 24 October 2016
Member of the Helmholtz Association

Contents
What you will learn. Hopefully.
- OpenACC can greatly speed up porting to GPU
- But many details are hidden from the user
  → Compiler makes assumptions
  → Programmer makes mistakes
  ⇒ Insight into the program is needed
- Introduction
- PGI Tools: Runtime Measurements, pgprof
- NVIDIA Tools: cuda-memcheck, cuda-gdb, nvprof, Visual Profiler
- Tasks: Task 1, Task 2

Exposition
  $ ./spmv
  call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
- Where does the error come from?
- Is it an error at all?
- … and how do I find out?

General notes
Building for debugging
- -g: Add debug information to the executable. This adds overhead, so the program runs slower. In host code, -g usually has little impact.
- -ta=tesla:lineinfo: Add information to the assembly that relates instructions to source code (light debug info)
- Check compiler output: -Minfo=accel

PGI Runtime Measurements
For quick sanity checks
Applications compiled with the PGI compiler can be analyzed via environment variables. This may be the simplest and quickest check.
- PGI_ACC_TIME: Lightweight profiler for the time spent in data movement and kernels
- PGI_ACC_NOTIFY: Print information for GPU-related events.
  Set it to a number to print:
  =1   kernel launches only
  =2   data transfers only
  =3   kernel launches and data transfers
  =4   region entries/exits only
  =5   region entries/exits and kernel launches
  =8   wait operations, synchronizations
  =16  (de)allocation of device memory
Usage: PGI_ACC_NOTIFY=3 ./app

PGPROF
Graphical Performance Profiler
- PGI's graphical, interactive profiler
- Comes with PGI's compiler collection
- Nice visualizations, quick insight
- For OpenACC, OpenMP, CUDA
- Close to the NVIDIA Visual Profiler
→ https://www.pgroup.com/products/pgprof.htm
Screenshots: PGI's graphical profiler; NVIDIA Visual Profiler

cuda-memcheck
Command-line memory access analyzer
- Memory error detector; similar to Valgrind's memcheck
- One of the most helpful tools for finding errors:
  - Out-of-bounds accesses
  - Kernel/API execution failures
  - Memory leaks
- Has sub-tools, selected via cuda-memcheck --tool NAME:
  - memcheck: Memory access checking (default)
  - racecheck: Shared memory hazard checking
  - Also: synccheck, initcheck
- Remember to compile the program with debug information: -g
→ http://docs.nvidia.com/cuda/cuda-memcheck/

cuda-memcheck Example
Start via cuda-memcheck app

cuda-gdb
Symbolic debugger
- Powerful symbolic debugger for CUDA code
- Built on top of gdb
- Full usage: would need a course of its own

cuda-gdb 101
  run                 Start the application; give arguments with set args 1 2 …
  break L             Create breakpoint at L: function name, line number LN, or FILE:LN
  continue            Continue running
  print i             Print content of i
  info locals         Print all currently set local variables
  info cuda threads   Print current thread configuration
  cuda thread N       Switch context to thread number N
→ cheat sheet
→ http://docs.nvidia.com/cuda/cuda-gdb/

cuda-gdb With OpenACC
cuda-gdb can be used for OpenACC as well!
Problem: What is the name of the OpenACC-generated kernel?
→ Recipe: strings ./app | grep .*_gpu | sort | uniq
  strings ./app   Print sequences of ≥4 printable characters
  grep .*_gpu     Search for _gpu line endings
  sort | uniq     Eliminate duplicates from the list
Examples of kernel names
  Pattern: function_line_gpu
  C:       main_42_gpu
  Fortran: spmv_26_gpu

cuda-gdb Example
Start via cuda-gdb app → run
Set a breakpoint with break func, break L, or break file.c:L

nvprof / pgprof
Command-line GPU profiler
- Profiles CUDA kernels and API calls; also CPU code!
- Suitable for OpenACC as well
- pgprof: Very similar to nvprof, but with different default options
- Generates performance reports and timelines; measures events and metrics
⇒ Powerful, complete tool for GPU application analysis
→ http://docs.nvidia.com/cuda/profiler-users-guide/

nvprof Example
Start via nvprof ./app

Visual Profiler
Graphical analysis
- Timeline view of all things GPU (API calls, kernels, memory)
  → Study stages and interplay of the application
- View launch and run configurations
- Guided and unguided analysis, with (among others):
  - Performance limiters
  - Kernel and execution properties
  - Memory access patterns
→ https://developer.nvidia.com/nvidia-visual-profiler

Visual Profiler Example
Start via nvvp → File → New Session
Screenshot annotations: Timeline, Selection Details,
Expert Analysis

Task 1
Location of tasks: ACCOUNT/Course/Debugging/
Tasks available in C and Fortran
Task 1: Vector addition and reduction: $\vec{a} = \vec{b} + \vec{c} \rightarrow \gamma = \sum_i c_i$
Steps
- Build! make
- Run! srun ./vecAddRed.bin
- Fix! cuda-memcheck; cuda-gdb; the result should be 1.
JURECA Getting Started
  module load PGI CUDA
  salloc --reservation=openacc --partition=gpus --nodes=1 --time=1:30:00 --gres=mem128,gpu:4
  srun cuda-memcheck ./vecAddRed.bin

Task 2
Task 2: Sparse Matrix-Vector Product (SpMV): $\vec{x} = A\vec{y}$
CSR data layout

Full matrix (row index on the left, column index on top):
       0   1   2   3   4
  0   -2   1   0   0   0
  1    1  -2   1   0   0
  2    0   1  -2   1   0
  3    0   0   1  -2   1
  4    0   0   0   1  -2

Nonzero entries only:
       0   1   2   3   4
  0   -2   1
  1    1  -2   1
  2        1  -2   1
  3            1  -2   1
  4                1  -2

  row_ptr   0  2  5  8 11 13
  col_ptr   0  1  0  1  2  1  2  3  2  3  4  3  4
  val      -2  1  1 -2  1  1 -2  1  1 -2  1  1 -2

Steps: Build! Run! Fix!
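To make the CSR layout above concrete, here is a minimal sketch of how the three arrays drive the product x = A y. It is not the course's task code: the function name spmv_csr, its parameter list, and the choice of OpenACC data clauses are assumptions made for illustration; only the array names row_ptr, col_ptr, and val come from the slide.

```c
#include <assert.h>

/* Hypothetical helper (not from the course material):
 * computes x = A*y for a matrix A stored in CSR format.
 *   n_rows, n_cols : matrix dimensions
 *   row_ptr        : n_rows+1 offsets; row i owns entries
 *                    row_ptr[i] .. row_ptr[i+1]-1 of col_ptr and val
 *   col_ptr        : column index of each stored entry (slide's name)
 *   val            : value of each stored entry
 */
void spmv_csr(int n_rows, int n_cols,
              const int *row_ptr, const int *col_ptr,
              const double *val, const double *y, double *x)
{
    /* CSR offsets start at 0; row_ptr[n_rows] is the number of stored entries. */
    assert(row_ptr[0] == 0);

    /* Assumed data clauses: copy the CSR arrays and y to the device,
     * copy the result x back. A compiler without OpenACC support
     * ignores the pragmas and runs the loops on the host. */
    #pragma acc parallel loop \
        copyin(row_ptr[0:n_rows+1], col_ptr[0:row_ptr[n_rows]], \
               val[0:row_ptr[n_rows]], y[0:n_cols]) \
        copyout(x[0:n_rows])
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* Walk only the stored (nonzero) entries of row i. */
        #pragma acc loop reduction(+:sum)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            sum += val[k] * y[col_ptr[k]];
        }
        x[i] = sum;
    }
}
```

Called with the 5×5 arrays from the slide and y = (1, 1, 1, 1, 1), this sketch yields x = (-1, 0, 0, 0, -1). An index into col_ptr or val that runs past row_ptr[n_rows] is the kind of mistake that shows up as an illegal address at runtime, which is what cuda-memcheck is meant to locate.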
JURECA Getting Started
  module load PGI CUDA
  salloc --reservation=openacc --partition=gpus --nodes=1 --time=1:30:00 --gres=mem128,gpu:4
  srun cuda-memcheck ./spmv.bin

Summary & Conclusion
- All the CUDA debugging and performance measurement tools work:
  - pgprof
  - cuda-memcheck
  - cuda-gdb
  - nvprof
  - Visual Profiler
- Sometimes, a little digging is needed to find automatically generated function names

Happy Debugging!
a.herten@fz-juelich.de