OpenACC (PGI Compiler)
LRZ, 27.4.2015, Dr. Volker Weinberg, [email protected]
OpenACC
● http://www.openacc-standard.org/
● A CAPS, Cray, Nvidia and PGI initiative
● AMD and PathScale joined the OpenACC Standards Group in 2014
● Open Standard:
   OpenACC 1.0 Spec (Nov. 2011):
    http://www.openacc.org/sites/default/files/OpenACC.1.0_0.pdf
   OpenACC 2.0a Spec (Aug. 2013):
    http://www.openacc.org/sites/default/files/OpenACC.2.0a_1.pdf
● OpenACC 1.0 quite similar to OpenMP 4.0
● OpenACC shares many features with the former PGI accelerator directives and the spirit of the CAPS HMPP compiler
● Quick Reference Guide:
   http://www.openacc.org/sites/default/files/213462%2010_OpenACC_API_QRG_HiRes.pdf
OpenACC
● The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.
● The directives and programming model defined in the OpenACC API document allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown.
● All of these details are implicit in the programming model and are managed by the OpenACC API-enabled compilers and runtimes. The programming model allows the programmer to augment information available to the compilers, including specification of data local to an accelerator, guidance on mapping of loops onto an accelerator, and similar performance-related details.
OpenACC
● OpenACC Compilers:
   PGI Compiler, Portland Group
    http://www.pgroup.com/
    Support for NVIDIA & AMD GPUs
    Extension of the x86 PGI compiler suite
   CAPS Compilers, CAPS Enterprise (until 2014)
    Support for NVIDIA & AMD GPUs, Intel Xeon Phi
    Source-to-source compilers
   Cray Compiler
    Only for Cray systems
Accelerator Block Diagram
Offload Execution Model
● Host:
   Executes most of the program
   Allocates memory on the accelerator device
   Initiates data copies from host memory to accelerator memory
   Sends the kernel code to the accelerator
   Waits for kernel completion
   Initiates data copies from the accelerator back to the host memory
   Deallocates memory
● Accelerator:
   Only compute-intensive regions should be executed on the accelerator
   Executes kernels, one after the other
   May concurrently transfer data between host and accelerator
OpenACC Execution Model
● The OpenACC execution model has three levels: gang, worker and vector.
● The model target architecture is a collection of processing elements or PEs, where each PE is multithreaded, and each thread on the PE can execute vector instructions.
● For an NVIDIA GPU, the PEs might map to the streaming multiprocessors, multithreading might map to warps, and the vector dimension might map to the threads within a warp. The gang dimension would map across the PEs, the worker across the multithreading dimension within a PE, and the vector dimension to the vector instructions.
● Mapping is compiler-dependent!
OpenACC Execution Model
● There is no support for any synchronization between
gangs, since current accelerators typically do not
support synchronization across PEs.
● A program should try to map parallelism that shares
data to workers within the same gang, since those
workers will be executed by the same PE, and will
share resources (such as data caches) that would
make access to the shared data more efficient.
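For illustration, a minimal sketch of the three levels on a hypothetical loop nest (function and variable names are only examples); how the compiler finally maps gangs, workers and vector lanes onto the hardware remains implementation-dependent:

/* Sketch only: a hypothetical loop nest annotated with the three OpenACC levels. */
void mapdemo(int n, int m, float *restrict a)
{
    #pragma acc parallel loop gang        /* rows spread across gangs (PEs) */
    for (int i = 0; i < n; ++i) {
        #pragma acc loop worker vector    /* columns spread across workers / vector lanes */
        for (int j = 0; j < m; ++j)
            a[i*m + j] *= 2.0f;
    }
}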
OpenACC Programming Model
● Main OpenACC constructs:
   Parallel Construct
   Kernels Construct
   Data Construct
   Loop Construct
● Runtime Library Routines
OpenACC Syntax
● C:
 #pragma acc directive-name [clause [, clause] …]
{
… // Offload Code
}
● Fortran:
 !$acc directive-name [clause [, clause] …]
! Offload Code
!$acc end directive-name
Kernels Directive
● Kernels Directive
 An accelerator kernels construct surrounds loops to
be executed on the accelerator, typically as a
sequence of kernel operations.
 Typically every loop will be a distinct kernel
 Number of Gangs and Workers can be different for
each kernel
Kernels Directive
● C:
   #pragma acc kernels [clause [, clause] …]
    {
      for(i=0;i<n;i++) { … }   /* 1st kernel */
      for(j=0;j<n;j++) { … }   /* 2nd kernel */
    }
● Fortran:
   !$acc kernels [clause [, clause] …]
    DO i=1,n        ! 1st kernel
      …
    END DO
    DO j=1,n        ! 2nd kernel
      …
    END DO
    !$acc end kernels
Important Kernels Clauses
if(condition)        When condition is true, the kernels region will execute on the acc; otherwise on the host
async(expression)    The kernels region executes asynchronously with the host
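A minimal sketch of both clauses (the threshold, the queue number and the function name are made up for illustration; the wait directive used at the end is part of OpenACC but not shown on this slide):

void scale(int n, float *restrict a)
{
    /* Run on the accelerator only for large n, and do not block the host. */
    #pragma acc kernels if(n > 100000) async(1)
    {
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;
    }
    /* ... independent host work could overlap here ... */
    #pragma acc wait(1)   /* block until async queue 1 has completed */
}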
Important Data Clauses
copy(list)        allocates data on the acc and copies data: host ↔ accelerator
copyin(list)      allocates data on the acc and copies data: host → accelerator
copyout(list)     allocates data on the acc and copies data: host ← accelerator
create(list)      allocates data on the acc but does not copy data between host and accelerator
present(list)     does not allocate data, but uses data already allocated on the acc
present_or_copy/copyin/copyout/create(list)   if data is already present, that data is used; otherwise behaves like copy/copyin/…

Can be used on parallel constructs, kernels constructs, data constructs and others (see the sketch below).
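A minimal saxpy-style sketch of the most common data clauses (function and variable names are illustrative): x is only read, so copyin suffices; y is read and written, so copy is used.

void saxpy(int n, float alpha, const float *restrict x, float *restrict y)
{
    /* x: host -> acc only; y: host -> acc before, acc -> host after the region */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }
}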
Parallel Directive
● Parallel Directive
 An accelerator parallel construct launches a number
of gangs executing in parallel, where each gang may
support multiple workers, each with vector or SIMD
operations.
 Number of Gangs and Workers remains constant for
the parallel region.
 One worker in each gang begins executing the code in
the region.
Parallel Directive
● C:
 #pragma acc parallel [clause [, clause] …]
{
Parallel region
}
● Fortran:
 !$acc parallel [clause [, clause] …]
Parallel region
!$acc end parallel
Important Parallel Clauses
if(condition)              When condition is true, the parallel region will execute on the acc; otherwise on the host
async(expression)          The parallel region executes asynchronously with the host
num_gangs(n)               Controls how many gangs are created
num_workers(n)             Controls how many workers are created in each gang
vector_length(n)           Controls the vector length on each worker
private(list)              A copy of each variable in list is allocated for each gang
firstprivate(list)         Same as private, but data is initialised with the value from the host
reduction(operator:list)   Allows reduction operations
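A minimal dot-product sketch combining launch-configuration clauses with a reduction (the gang and vector numbers are illustrative, not tuned values):

float dot(int n, const float *restrict x, const float *restrict y)
{
    float sum = 0.0f;
    /* Fixed launch configuration plus a sum reduction across all gangs. */
    #pragma acc parallel loop gang vector num_gangs(64) vector_length(128) reduction(+:sum) copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}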
Data Clauses specific to Data region
if(condition)        When condition is false, no data will be allocated or moved to/from the accelerator
async(expression)    Data movement between host and accelerator occurs asynchronously with the host
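A sketch of a data region spanning two kernels regions (names are illustrative): a, b and the device-only scratch array tmp stay resident on the accelerator, so no transfer happens between the two kernels.

#include <stdlib.h>

void two_steps(int n, const float *restrict a, float *restrict b)
{
    float *tmp = (float*)malloc(n * sizeof(float));

    #pragma acc data copyin(a[0:n]) copyout(b[0:n]) create(tmp[0:n])
    {
        #pragma acc kernels      /* 1st kernel: writes the device-only tmp */
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * a[i];

        #pragma acc kernels      /* 2nd kernel: reads tmp, produces b */
        for (int i = 0; i < n; ++i)
            b[i] = tmp[i] + 1.0f;
    }
    free(tmp);
}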
Loop Directive
● A loop directive applies to the immediately following
loop or nested loops, and describes the type of
accelerator parallelism to use to execute the
iterations of the loop.
● C:
 #pragma acc loop [clause [, clause] …]
● Fortran:
 !$acc loop [clause [, clause] …]
● Can also be combined, e.g. "#pragma acc kernels loop"
Loop Clauses
collapse(n)        Applies the directive to the following n tightly nested loops
seq                Executes this loop sequentially on the accelerator
private(list)      A copy of each variable in list is created for each iteration of the loop
gang[(num)]        Use at most num gangs
worker[(num)]      Use at most num workers of a gang
vector[(length)]   Executes the iterations of the loop in SIMD vector mode with at most the given vector length

num/length can only be given for loops in kernels regions, not within parallel regions.
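A minimal sketch of collapse plus an explicit vector length inside a kernels region (names and the length 128 are illustrative; as noted above, explicit numbers are only allowed in kernels regions):

void scale2d(int n, int m, float *restrict a)
{
    #pragma acc kernels
    {
        /* Treat the two tightly nested loops as one iteration space. */
        #pragma acc loop collapse(2) gang vector(128)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                a[i*m + j] *= 2.0f;
    }
}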
Runtime Library Routines
● Prototypes or interfaces for the runtime library
routines along with datatypes and enumeration types
are available as follows:
● C:
   #include "openacc.h"
● Fortran:
   use openacc   or   #include "openacc_lib.h"
OpenACC 1.0 Runtime Library Routines
acc_get_num_devices(devicetype)            Returns the number of acc devices of type devicetype
acc_set_device_type(devicetype)            Sets the acc device type to use for this host thread
acc_get_device_type()                      Returns the acc device type that is being used by this host thread
acc_set_device_num(devicenum, devicetype)  Sets the device number to use for this host thread
acc_get_device_num(devicetype)             Returns the acc device number that is being used by this host thread
acc_async_test(expression)                 Returns nonzero or .TRUE. if all asynchronous activities associated with the expression have been completed; otherwise returns zero or .FALSE.
OpenACC 1.0 Runtime Library Routines
acc_async_test_all()         Returns nonzero or .TRUE. if all asynchronous activities have been completed; otherwise returns zero or .FALSE.
acc_async_wait(expression)   Waits until all asynchronous activities associated with the expression have been completed
acc_init(devicetype)         Initializes the runtime system and sets the accelerator device type to use for this host thread
acc_shutdown(devicetype)     Disconnects this host thread from the accelerator device
acc_on_device(devicetype)    In a parallel or kernels region, this is used to take different execution paths depending on whether the program is running on an accelerator or on the host
acc_malloc(size_t)           Returns the address of memory allocated on the accelerator device
acc_free(void*)              Frees memory allocated by acc_malloc
acc_device_t
/lrz/sys/compilers/pgi/14/linux86-64/14.1/include/openacc.h:
typedef enum {
    acc_device_none          = 0,
    acc_device_default       = 1,
    acc_device_host          = 2,
    acc_device_not_host      = 3,
    acc_device_nvidia        = 4,
    acc_device_radeon        = 5,
    acc_device_xeonphi       = 6,
    acc_device_pgi_opencl    = 7,
    acc_device_nvidia_opencl = 8,
    acc_device_opencl        = 9
} acc_device_t;
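A minimal sketch of the runtime API together with the device-type constants above (device type and number are chosen only for illustration):

#include <stdio.h>
#include "openacc.h"

int main(void)
{
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("NVIDIA devices found: %d\n", ndev);

    if (ndev > 0) {
        acc_set_device_type(acc_device_nvidia);   /* use NVIDIA GPUs for this host thread */
        acc_set_device_num(0, acc_device_nvidia); /* pick device 0 */
        acc_init(acc_device_nvidia);              /* initialize the runtime up front */

        /* ... offloaded regions would go here ... */

        acc_shutdown(acc_device_nvidia);          /* disconnect from the accelerator */
    }
    return 0;
}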
Using PGI Compilers
● At LRZ: 5 floating licenses of the PGI Compiler Suite
   https://www.lrz.de/services/software/programmierung/pgi_lic/
● PGI OpenACC documentation:
   Resources: http://www.pgroup.com/resources/accel.htm
   Getting started guide: http://www.pgroup.com/doc/openACC_gs.pdf
   C: http://www.pgroup.com/lit/articles/insider/v4n1a1b.htm
   Fortran: http://www.pgroup.com/lit/articles/insider/v4n1a1a.htm
Running programs on the GPU
● Login to GPU Cluster
   ssh lxlogin_gpu
● Allocate 1 GPU
   salloc --gres=gpu:1 --reservation=gpu_course
● Load PGI modules:
   module unload ccomp fortran
   module load ccomp/pgi/13.10
● Compile
   pgcc -acc -Minfo=accel file.c
● Run interactively
   srun --gres=gpu:1 ./a.out
   or: export RUN="srun --gres=gpu:1"; $RUN ./a.out
Getting info about the Host CPU
lu65fok@lxa195:~> pgcpuid
vendor id       : AuthenticAMD
model name      : AMD Opteron(tm) Processor 6128 HE
cores           : 8
cpu family      : 16
model           : 9
stepping        : 1
processor count : 8
clflush size    : 8
L2 cache size   : 512KB
L3 cache size   : 10MB
flags           : abm apic cflush cmov cx8 de fpu fxsr fxsropt ht lm mca mce
flags           : mmx mmx-amd monitor msr mas mtrr nx pae pat pge pse pseg36
flags           : sep sse sse2 sse3 sse4a cx16 popcnt syscall tsc vme 3dnow
flags           : 3dnowext
type            : -tp istanbul-64
Getting Info about the GPU: pgaccelinfo
lu65fok@lxa195:~> $RUN pgaccelinfo
CUDA Driver Version:           5000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
Device Number:                 0
Device Name:                   Tesla X2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1548 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           6314891 microseconds
Current free memory:           5570027520
Upload time (4MB):             1494 microseconds ( 945 ms pinned)
Download time:                 1410 microseconds (1065 ms pinned)
Upload bandwidth:              2807 MB/sec (4438 MB/sec pinned)
Download bandwidth:            2974 MB/sec (3938 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20
Building Programs
● C:
   pgcc -acc -Minfo=accel file.c -o file
● Fortran:
   pgfortran -acc -Minfo=accel file.f90 -o file
Compiler Command Line Options
● PGI's old Accelerator Directives are still supported, but not conformant with OpenACC
● Better use -acc=strict or -acc=verystrict
● pgcc -help:
  -acc[=[no]autopar|strict|verystrict]
                   Enable OpenACC directives
      [no]autopar  Enable (default) or disable loop autoparallelization within acc parallel
      strict       Issue warnings for non-OpenACC accelerator directives
      verystrict   Fail with an error for any non-OpenACC accelerator directive
Compiler Command Line Options
-ta=nvidia:{nofma|[no]flushz|keep|noL1|noL1cache|maxregcount:<n>|[no]rdc|tesla|cc1x|fermi|cc2x|kepler|cc3x|fastmath|cuda5.0|cuda5.5}|radeon:{keep|tahiti|apu|buffercount:<n>}|host
                     Choose target accelerator
    nvidia             Select NVIDIA accelerator target
      nofma            Don't generate fused mul-add instructions
      [no]flushz       Enable flush-to-zero mode on the GPU
      keep             Keep kernel files
      noL1             Don't use the L1 hardware data cache to cache global variables
      noL1cache        Don't use the L1 hardware data cache to cache global variables
      maxregcount:<n>  Set maximum number of registers to use on the GPU
      [no]rdc          Generate relocatable device code; disables cc1x and cuda4.2
      tesla            Compile for Tesla architecture
      cc1x             Compile for compute capability 1.x
      fermi            Compile for Fermi architecture
      cc2x             Compile for compute capability 2.x
      kepler           Compile for Kepler architecture
      cc3x             Compile for compute capability 3.x
Compiler Command Line Options
      fastmath         Use fast math library
      cuda5.0          Use CUDA 5.0 Toolkit compatibility
      cuda5.5          Use CUDA 5.5 Toolkit compatibility
    radeon             Select AMD Radeon GPU accelerator target
      keep             Keep kernel source files
      tahiti           Compile for Radeon Tahiti architecture
      apu              Compile for Radeon APU architecture
      buffercount:<n>  Set max number of device buffers used by OpenCL kernel
    host               Compile for the host, i.e., no accelerator target
C Example
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main( int argc, char* argv[] ) {
    int n;              /* size of the vector */
    float *restrict a;  /* the vector */
    float *restrict r;  /* the results */
    float *restrict e;  /* expected results */
    int i;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;
    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    for( i = 0; i < n; ++i )
        a[i] = (float)(i+1);
    #pragma acc kernels
    {
        for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
    }
    /* compute on the host to compare */
    for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;
    /* check the results */
    for( i = 0; i < n; ++i )
        assert( r[i] == e[i] );
    printf( "%d iterations completed\n", n );
    return 0;
}
Compiling a C Program
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> pgcc -acc -Minfo=accel c1.c -o c1
main:
     23, Generating present_or_copyout(r[0:n])    ! Mind: r[start-index:nelements]
         Generating present_or_copyin(a[0:n])
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     25, Loop is parallelizable
         Accelerator kernel generated
         25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Executing a program
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> $RUN ./c1
100000 iterations completed
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> export ACC_NOTIFY=1
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> $RUN ./c1
launch CUDA kernel file=/home/hpc/pr28fa/lu65fok/openacc/pgi_tutorial/v1n1a1/c1.c function=main line=25 device=0 grid=782 block=128
100000 iterations completed
Compiling a Fortran Program
lu65fok@lxa195:~/openacc/pgi_tutorial/v1n1a1> pgfortran -acc -Minfo=accel f1.f90
main:
21, Generating present_or_copyin(a(1:n))
Generating present_or_copyout(r(1:n))
Generating NVIDIA code
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
22, Loop is parallelizable
Accelerator kernel generated
22, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Fortran Example
program main
    integer :: n                            ! size of the vector
    real,dimension(:),allocatable :: a      ! the vector
    real,dimension(:),allocatable :: r      ! the results
    real,dimension(:),allocatable :: e      ! expected results
    integer :: i
    character(10) :: arg1
    if( iargc() .gt. 0 )then
        call getarg( 1, arg1 )
        read(arg1,'(i10)') n
    else
        n = 100000
    endif
    if( n .le. 0 ) n = 100000
    allocate(a(n))
    allocate(r(n))
    allocate(e(n))
    do i = 1,n
        a(i) = i*2.0
    enddo
    !$acc kernels
    do i = 1,n
        r(i) = a(i) * 2.0
    enddo
    !$acc end kernels
    do i = 1,n
        e(i) = a(i) * 2.0
    enddo
    ! check the results
    do i = 1,n
        if( r(i) .ne. e(i) )then
            print *, i, r(i), e(i)
            stop 'error found'
        endif
    enddo
    print *, n, 'iterations completed'
end program
PGI Unified Binary
● PGI Unified Binary for Multiple Accelerator Types
   Compile PGI Unified Binary:
    pgcc -ta=nvidia,host   (default for -acc)
   Run PGI Unified Binary:
    export ACC_DEVICE=nvidia; $RUN ./a.out
    export ACC_DEVICE_NUM=1; $RUN ./a.out
    export ACC_DEVICE=host; ./a.out
● PGI Unified Binary for Multiple Processor Types
   pgcc -tp=nehalem,sandybridge
Insight into the Code
● To have insight into the generated CUDA code use:
 pgcc -acc -Mcuda=keepgpu -Minfo=accel c1.c
 File c1.n001.gpu will contain CUDA code
● Insight into host code:
 pgcc -S -acc -Minfo=accel c1.c
 Shows call of __pgi_uacc_* routines
Insight into __pgi_uacc_* calls
● pgcc -S -acc -Minfo=accel c1.c
call atoi
call malloc
call malloc
call malloc
call __pgi_uacc_begin
call __pgi_uacc_enter
call __pgi_uacc_dataona
call __pgi_uacc_dataona
call __pgi_uacc_datadone
call __pgi_uacc_launch
call __pgi_uacc_dataoffa
call __pgi_uacc_dataoffa
call __pgi_uacc_datadone
call __pgi_uacc_noversion
call __pgi_uacc_end
call __assert_fail
call printf
#include "cuda_runtime.h"
#include "pgi_cuda_runtime.h"
extern "C" __global__
__launch_bounds__(128) void
main_25_gpu(
int tc2,
signed char* p3,
signed char* p4)
{
int _i_1;
unsigned int _ui_1;
float _r_1;
unsigned int e30;
int j39;
int j38;
int j37;
unsigned int e37;
27/04/2015
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
e30 = ((int)gridDim.x)*(128);
e37 = (e30)*(4);
if( ((0)>=(tc2))) goto _BB_6;
_ui_1 = ((int)gridDim.x)*(128);
j37 = ((tc2)-((int)(_ui_1)))+((int)(_ui_1));
j38 = 0;
j39 = 0;
_BB_8: ;
if( (((j39)-(tc2))>=0)) goto _BB_9;
if( ((((((int)((int)threadIdx.x))(tc2))+((int)(((int)blockIdx.x)*(128))))+(j39))>=0))
goto _BB_9;
_i_1 =
((int)((((int)threadIdx.x)+(((int)blockIdx.x)*(128)))
*(4)))+(j38);
_r_1 = (*( float*)((p3)+((long long)(_i_1))));
*( float*)((p4)+((long long)(_i_1))) = _r_1+_r_1;
_BB_9: ;
_ui_1 = ((int)gridDim.x)*(128);
j37 = (j37)+(-((int)(_ui_1)));
j38 = (j38)+((int)(e37));
j39 = (j39)+((int)(_ui_1));
if( ((j37)>0)) goto _BB_8;
_BB_6: ;
}
Lab: OpenACC