OpenACC Tools - Debugging and Profiling

Tools for Debugging & Profiling
Member of the Helmholtz Association
OpenACC Course 2016
Andreas Herten, Forschungszentrum Jülich, 24 October 2016
Contents
What you will learn. Hopefully.
OpenACC can greatly speed up porting to GPU
But many details are hidden from the user
→ Compiler makes assumptions
→ Programmer makes mistakes
⇒ Insight into program needed
Andreas Herten | Tools for Debugging & Profiling | 24 October 2016
Introduction
PGI Tools
  Runtime Measurements
  pgprof
NVIDIA Tools
  cuda-memcheck
  cuda-gdb
  nvprof
  Visual Profiler
Tasks
  Task 1
  Task 2
Exposition
$ ./spmv
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
Where does the error come from?
Is it an error at all?
… and how do I find out?
General notes
Building for debugging
-g                  Add debug information to the executable; adds overhead → program runs slower.
                    Usually, in host code, -g has little impact.
-ta=tesla:lineinfo  Add information to the assembly that relates instructions to source code (light debug info)
Check compiler output: -Minfo=accel
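To make the flags concrete, here is a minimal sketch of the kind of loop they apply to. The file name saxpy.c, the function name, and the data clauses are assumptions for illustration and are not taken from the course material. With the PGI compiler, a build line combining the flags above might look like `pgcc -acc -g -ta=tesla:lineinfo -Minfo=accel saxpy.c` (where `-acc`, which enables OpenACC, is not on the slide). A compiler without OpenACC support simply ignores the pragma and runs the loop on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: y[i] = a * x[i] + y[i].
 * With OpenACC enabled, -Minfo=accel reports how the compiler
 * parallelizes this loop and which data movement it generates. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* x is only read on the device; y is read and written back. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Comparing the -Minfo=accel messages with the intent of the pragma is a quick way to spot wrong assumptions before any debugger is started.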
PGI Runtime Measurements
For quick sanity checks
Applications compiled with the PGI compiler: analyze via environment variables
Maybe the simplest/quickest check
PGI_ACC_TIME    Lightweight profiler for the time spent in data movement and kernels
PGI_ACC_NOTIFY  Print information for GPU-related events. Set to a number to print …
  =1 … kernel launches only
  =2 … data transfers only
  =3 … kernel launches and data transfers
  =4 … region entry/exits only
  =5 … region entry/exits and kernel launches
  =8 … wait operations, synchronizations
  =16 … (de)allocation of device memory
Usage: PGI_ACC_NOTIFY=3 ./app
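The PGI_ACC_NOTIFY values listed for this variable read as bit flags: 3 = 1 + 2 and 5 = 4 + 1. Assuming this additive reading holds for all combinations, the following sketch decodes a value into the event classes it selects. The enum and function names are hypothetical helpers for illustration, not part of any PGI interface.

```c
#include <assert.h>
#include <stdio.h>

/* Bit flags as listed for PGI_ACC_NOTIFY. */
enum {
    NOTIFY_KERNEL_LAUNCH = 1,
    NOTIFY_DATA_TRANSFER = 2,
    NOTIFY_REGION        = 4,
    NOTIFY_WAIT_SYNC     = 8,
    NOTIFY_DEVICE_ALLOC  = 16
};

/* Hypothetical helper: returns nonzero if `level` selects `flag`. */
int notify_enabled(int level, int flag)
{
    assert(level >= 0);
    return (level & flag) != 0;
}

/* Print the event classes selected by a given PGI_ACC_NOTIFY value. */
void describe_notify_level(int level)
{
    printf("PGI_ACC_NOTIFY=%d:", level);
    if (notify_enabled(level, NOTIFY_KERNEL_LAUNCH)) printf(" kernel launches;");
    if (notify_enabled(level, NOTIFY_DATA_TRANSFER)) printf(" data transfers;");
    if (notify_enabled(level, NOTIFY_REGION))        printf(" region entry/exits;");
    if (notify_enabled(level, NOTIFY_WAIT_SYNC))     printf(" waits/synchronizations;");
    if (notify_enabled(level, NOTIFY_DEVICE_ALLOC))  printf(" device (de)allocations;");
    printf("\n");
}
```

Calling describe_notify_level(3) lists kernel launches and data transfers, matching the =3 entry of the list.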
PGPROF Graphical Performance Profiler
PGI’s graphical profiler
Graphical, interactive profiler
Comes with PGI’s compiler collection
Nice visualizations, quick insight
For OpenACC, OpenMP, CUDA
Close to NVIDIA Visual Profiler
→ https://www.pgroup.com/products/pgprof.htm
[Screenshots: PGI’s graphical profiler; NVIDIA Visual Profiler]
cuda-memcheck
Command-line memory access analyzer
Memory error detector; similar to Valgrind’s memcheck
One of the most helpful tools for error-finding
— Out-of-bounds accesses
— Kernel/API execution failures
— Memory leaks
Has sub-tools, via cuda-memcheck --tool NAME:
— memcheck: Memory access checking (default)
— racecheck: Shared memory hazard checking
— Also: synccheck, initcheck
Remember to compile the program with debug information: -g
→ http://docs.nvidia.com/cuda/cuda-memcheck/
cuda-memcheck
Example
Start via cuda-memcheck app
cuda-gdb
Symbolic debugger
Powerful symbolic debugger for CUDA code
Built on top of gdb
Full usage: own course needed
cuda-gdb 101
  run                 Starts application, give arguments with set args 1 2 …
  break L             Create breakpoint; L: function name, line number LN, or FILE:LN
  continue            Continue running
  print i             Print content of i
  info locals         Print all currently set variables
  info cuda threads   Print current thread configuration
  cuda thread N       Switch context to thread number N
→ cheat sheet
→ http://docs.nvidia.com/cuda/cuda-gdb/
cuda-gdb
With OpenACC
cuda-gdb can be used for OpenACC as well!
Problem: What is the name of the OpenACC-generated kernel?
→ Recipe:
strings ./app | grep '.*_gpu' | sort | uniq
  strings ./app    Print sequences of ≥4 printable characters
  grep '.*_gpu'    Search for _gpu line endings
  sort | uniq      Eliminate duplicates from list
Examples of kernel names
Pattern: function_line_gpu
  C        main_42_gpu
  Fortran  spmv_26_gpu
cuda-gdb
Example
Start via cuda-gdb app → run
Set breakpoint with break func or break L or break file.c:L
nvprof / pgprof
Command-line GPU profiler
Profiles CUDA kernels and API calls; also CPU code!
Suitable for OpenACC as well
pgprof: Very similar to nvprof, but different default options
Generate performance reports, timelines; measure events and metrics
⇒ Powerful, complete tool for GPU application analysis
→ http://docs.nvidia.com/cuda/profiler-users-guide/
nvprof
Example
Start via nvprof ./app
Visual Profiler
Graphical analysis
Timeline view of all things GPU (API calls, kernels, memory)
→ study stages and interplay of application
View launch and run configurations
Guided and unguided analysis, with (among others):
— Performance limiters
— Kernel and execution properties
— Memory access patterns
→ https://developer.nvidia.com/nvidia-visual-profiler
Visual Profiler
Example
Start via nvvp → File → New Session
[Screenshot of the Visual Profiler with annotated areas: Timeline, Selection Details, Expert Analysis]
Task 1
Location of tasks: ACCOUNT/Course/Debugging/
Tasks available in C and Fortran
Task 1: Vector addition and reduction: $\vec{a} = \vec{b} + \vec{c} \;\rightarrow\; \gamma = \sum_i c_i$
Steps
  Build!  make
  Run!    srun ./vecAddRed.bin
  Fix!    cuda-memcheck; cuda-gdb; result should be 1.
JURECA Getting Started
module load PGI CUDA
salloc --reservation=openacc --partition=gpus --nodes=1 --time=1:30:00 --gres=mem128,gpu:4
srun cuda-memcheck ./vecAddRed.bin
Task 2
Task 2: Sparse Matrix-Vector Product (SpMV): $\vec{x} = A\vec{y}$
CSR data layout
Full matrix:
      0   1   2   3   4
  0  -2   1   0   0   0
  1   1  -2   1   0   0
  2   0   1  -2   1   0
  3   0   0   1  -2   1
  4   0   0   0   1  -2
Same matrix, non-zero entries only:
      0   1   2   3   4
  0  -2   1
  1   1  -2   1
  2       1  -2   1
  3           1  -2   1
  4               1  -2
row_ptr  0 2 5 8 11 13
col_ptr  0 1 0 1 2 1 2 3 2 3 4 3 4
val      -2 1 1 -2 1 1 -2 1 1 -2 1 1 -2
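The three CSR arrays are enough to compute the product row by row: row_ptr[r] and row_ptr[r + 1] delimit the entries of row r inside val, and col_ptr holds the column index of each entry. The following is a minimal sketch of such a loop, not the course's task code. The function name, the n_cols parameter, and the data clauses are assumptions for illustration. Without OpenACC support the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>

/* Sketch of x = A * y for a matrix A stored in CSR format.
 * row_ptr has n_rows + 1 entries; col_ptr and val each have
 * row_ptr[n_rows] entries (the number of non-zeros).
 * col_ptr is the slide's name for the array of column indices. */
void spmv_csr(int n_rows, int n_cols,
              const int *row_ptr, const int *col_ptr,
              const double *val, const double *y, double *x)
{
    assert(n_rows >= 0 && n_cols >= 0);
    const int nnz = row_ptr[n_rows];

    /* Rows are independent of each other, so the outer loop is parallel. */
    #pragma acc parallel loop \
        copyin(row_ptr[0:n_rows + 1], col_ptr[0:nnz], val[0:nnz], y[0:n_cols]) \
        copyout(x[0:n_rows])
    for (int row = 0; row < n_rows; ++row) {
        double sum = 0.0;
        /* Non-zeros of this row: val[row_ptr[row]] .. val[row_ptr[row + 1] - 1]. */
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
            sum += val[k] * y[col_ptr[k]];
        }
        x[row] = sum;
    }
}
```

For the 5×5 matrix above and a vector y of all ones, this yields x = (-1, 0, 0, 0, -1), since only the first and last rows lack one of their off-diagonal neighbours. An off-by-one in the row_ptr bounds is exactly the kind of mistake that shows up as an illegal address under cuda-memcheck.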
Steps
  Build!
  Run!
  Fix!
JURECA Getting Started
module load PGI CUDA
salloc --reservation=openacc --partition=gpus --nodes=1 --time=1:30:00 --gres=mem128,gpu:4
srun cuda-memcheck ./spmv.bin
Summary & Conclusion
All the CUDA debugging and performance measurement tools work with OpenACC as well
— pgprof
— cuda-memcheck
— cuda-gdb
— nvprof
— Visual Profiler
Sometimes, a little digging is needed to find automatically generated function names
Happy Debugging!
a.herten@fz-juelich.de