Tutorial sessions 1 & 2 on OpenMP
Transcription
OpenMP Programming

OpenMP (Open Multi-Processing) is a standard API for writing multithreaded applications for a variety of shared-memory architectures. It comprises three major components:
- Compiler directives
- Runtime library routines
- Environment variables

C, C++ and Fortran are supported (JOMP offers similar functionality for Java). OpenMP is an explicit programming model, offering the user complete control over parallelization.

OpenMP uses the fork-join model of execution:
- The master thread forks (creates) a team of parallel threads, which execute the accompanying block of statements in parallel.
- Upon completion of the task, the threads join, leaving only the master.
- Each thread is assigned a unique id within the team, with the master assigned id = 0.
- A user can request any number of logical threads to be created.

    // file BasicOpenMP.c
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);
        int tid = 0;
        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            printf("Hello World from thread %d\n", tid);
        }
        return 0;
    }

Compilation – the gcc and g++ compilers only need the OpenMP flag enabled:

    gcc -fopenmp <filename>

Output (the ordering of the lines may vary between runs):

    Hello World from thread 0
    Hello World from thread 1
    Hello World from thread 2
    Hello World from thread 3

Compiler Directives

    #pragma omp directive-name [clause, ...] newline

A directive applies to the immediately following statement or code block, and can have other directives nested within it.

Parallel directive – causes the encountering thread to fork into a team, with itself as the master:

    #pragma omp parallel [clause, ...] newline
    { ... }

- The block following the directive is executed in parallel by all the team members.
- An implicit barrier at the end of the parallel section causes the threads to join, leaving only the master.
- Any part of the program that is to be executed in parallel is enclosed within this directive.

Clauses

Clauses specify conditions imposed while executing the code block. These include:
- num_threads(n) : the number of threads to be created
      #pragma omp parallel num_threads(4)
- shared / private / firstprivate / lastprivate : the behavior of data variables for each thread (a short sketch after the runtime library routines below illustrates private and firstprivate)
      #pragma omp parallel shared(list)
- schedule(type, size) : the scheduling mechanism
      #pragma omp parallel for schedule(type, size)

Runtime Library Routines

Declarations of all the routines are in omp.h.
- omp_set_num_threads() – sets the number of threads to be created for parallel execution. The actual number of threads created is implementation dependent.
- omp_get_num_threads() – gives the actual number of threads created for a parallel section. Most implementations provide a default value, either predefined or determined dynamically depending on the availability of resources.
- omp_get_dynamic() – returns true (1) if the number of threads to be created is determined dynamically at run time. The default value is usually true; it can be changed using omp_set_dynamic(), as in the following example.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        if (omp_get_dynamic() == 1) {
            omp_set_dynamic(0);    // disable dynamic adjustment
        }
        omp_set_num_threads(10);
        #pragma omp parallel
        {
            int NumThreads = omp_get_num_threads();
            printf("#threads created %d\n", NumThreads);
        }
        return 0;
    }

Output: with dynamic adjustment left enabled, a run may report "#threads created 4" (the implementation decides); with it disabled as above, the request is honored and each thread reports "#threads created 10".
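The data-scope clauses listed earlier are easiest to understand by example. The following is a minimal sketch (our own illustration, not from the tutorial; the variable names are made up): val is firstprivate, so each thread starts from a copy initialized to 10, while tid is private, so each thread's copy starts uninitialized and must be assigned before use.

    // Sketch: private vs. firstprivate data scoping.
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int val = 10;   // firstprivate: each thread's copy is initialized to 10
        int tid;        // private: each thread's copy starts uninitialized
        #pragma omp parallel num_threads(4) firstprivate(val) private(tid)
        {
            tid = omp_get_thread_num();   // assign before use
            val += tid;                   // modifies this thread's copy only
            printf("thread %d sees val = %d\n", tid, val);
        }
        printf("after the region, val = %d\n", val);  // still 10: the copies are discarded
        return 0;
    }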
Work Sharing Constructs

- Divide the execution of the enclosed code region among the members of the thread team.
- They do not create new threads, and hence must be enclosed within a parallel region.
- There is an implied barrier at the end of the construct.

For directive – shares the iterations of a loop among the threads (data parallelism):

    #pragma omp for [clauses, ...] newline
    <for_loop>

- The loop variable must be an integer and is implicitly private to each thread.
- The loop cannot contain any break or goto statements.
- The chunk allocated to each thread depends on the schedule clause specified.
- Beware of dependencies across iterations.

    #include <omp.h>

    #define N 1000
    #define chunk_size 100

    int main(void) {
        omp_set_dynamic(0);
        omp_set_num_threads(10);
        int i = 0, res = 0;
        #pragma omp parallel
        {
            #pragma omp for schedule(static, chunk_size) reduction(+: res)
            for (i = 0; i < N; i++) {
                res = res + i;
            }
        }
        return 0;
    }

Matrix multiplication:

    #define size 100
    int A[size][size], B[size][size], C[size][size];
    int i, j, k;

    #pragma omp parallel for
    for (i = 0; i < size; i++) {
        #pragma omp parallel for
        for (j = 0; j < size; j++) {
            int sum = 0;    // local to each (i, j) iteration
            #pragma omp parallel for reduction(+: sum)
            for (k = 0; k < size; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }

What happens if we add shared(j) to the outer directive? All the threads of the outer loop would then update a single shared copy of the inner loop counter j, racing on it and producing incorrect results. Note that to enable nested parallelism, omp_set_nested(1) must be used; otherwise the inner parallel regions execute with teams of one thread.

schedule(type, chunk) – determines how the loop iterations are divided among the threads; the default is implementation dependent (a sketch illustrating the schedule kinds appears at the end of this section):
- static – loop iterations are divided into contiguous groups of size chunk and statically assigned to threads.
- dynamic – loop iterations are divided as above, but are allocated to threads dynamically as the threads finish their chunks.
- guided – similar to dynamic, except that the chunk size decreases each time an iteration group is assigned to a thread.
- runtime – the scheduling decision is deferred until run time, when it is taken from the OMP_SCHEDULE environment variable.

reduction(op : list) – op is applied to the individual copies of each list variable generated by the threads, and the result is stored in the original variable. The list variables can only be scalars, and op can only be a non-overloaded, associative binary operator.

    #define min(a, b) ((a) < (b) ? (a) : (b))
    int A[n][n][n];

    for (k = 1; k < n; k++) {
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            #pragma omp parallel for
            for (j = 0; j < n; j++) {
                A[k][i][j] = min(A[k-1][i][j], A[k-1][i][k] + A[k-1][k][j]);
            }
        }
    }

Why can the outermost loop not be parallelized? Because iteration k reads the values A[k-1][...][...] written by iteration k-1, so there is a true dependency across the outer iterations; the i and j loops, in contrast, only read from level k-1 and write to level k.
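Returning to the schedule clause described above: a quick way to see the scheduling kinds in action is to print which thread executes which iteration. This sketch is our own illustration (not from the tutorial); replacing static with dynamic or guided in the clause changes the printed mapping, possibly from run to run.

    // Sketch: observing the iteration-to-thread mapping of a schedule clause.
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_dynamic(0);
        omp_set_num_threads(4);
        #pragma omp parallel for schedule(static, 4)   // try dynamic or guided here
        for (int i = 0; i < 16; i++) {
            printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }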
Sections Directive – independent section directives are nested within it, and each section is executed by one of the threads in the team (task parallelism):

    #pragma omp sections [clauses, ...]
    {
        #pragma omp section
        <structured block>
        #pragma omp section
        <structured block>
    }

- Scheduling of the sections among the threads is dynamic.
- If the number of threads is smaller than the number of sections, the scheduling of the extra sections is implementation dependent.
- goto and jump statements cannot be used inside sections.
- OpenMP pragmas must be encountered either by all threads in a team or by none at all; hence condition-based execution is not allowed.

Example on sections:

    void QuickSort(int list[], int lower, int upper) {
        if (lower < upper) {
            int pos = partition(list, lower, upper);
            #pragma omp parallel sections
            {
                #pragma omp section
                { QuickSort(list, lower, pos - 1); }
                #pragma omp section
                { QuickSort(list, pos, upper); }
            }
        }
    }

Data Scope

- private(list) – declares the variables in its list to be private to each thread.
- shared(list) – declares the variables in its list to be shared among all threads. By default, all variables are shared in most OpenMP implementations.
- firstprivate(list) – same as private, with automatic initialization of each copy from the variable's original value on entering the construct.
- lastprivate(list) – same as private, with automatic assignment back to the original object from the last loop iteration or the last section in program order (not execution order).
- default(scope | none) – allows the user to specify a default scope for the variables used in the construct. If 'none' is specified, the scope of every variable must be given explicitly.

Synchronization

Synchronization constructs are used to impose ordering constraints and to protect access to shared data.

Atomic – allows a specific memory location to be updated atomically:

    #pragma omp atomic newline
    <statement>

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        #pragma omp atomic
        res = res + i;
    }

Barrier – synchronizes all the threads in a team; a thread has to wait at this point until all the threads in the team reach it:

    #pragma omp barrier newline

Ordered – specifies that the iterations of the loop must be executed in serial order:

    #pragma omp parallel
    {
        #pragma omp for ordered
        for (i = 1; i < N; i++) {
            #pragma omp ordered
            A[i] = A[i] + A[i-1];
        }
    }

An ordered region must be closely nested inside a loop region with an ordered clause.

Critical – specifies that the enclosed block can be executed by only one thread at a time:

    #pragma omp critical [name] newline
    <structured block>

- If a thread is currently executing a critical section and another thread reaches it, the second thread blocks until the first completes its execution of the critical section.
- Multiple critical regions can be defined using names. Critical regions with the same name form a group and are mutually exclusive of each other; the anonymous (unnamed) critical regions together form one group.
- The critical directive enforces exclusive access with respect to critical directives in all threads, not just the current team.

Producer-consumer problem:

    int buffer[N], size;

    #pragma omp sections
    {
        #pragma omp section                      // producer
        {
            if (size < N) {
                #pragma omp critical (prod_cons)
                {
                    if (size < N) buffer[size++] = 1;
                }
            }
        }
        #pragma omp section                      // consumer
        {
            if (size > 0) {
                #pragma omp critical (prod_cons)
                {
                    if (size > 0) buffer[--size] = 0;
                }
            }
        }
    }

Nesting of Critical Regions

- OpenMP does not allow nesting of critical regions of the same name, to avoid potential deadlock scenarios.
- Critical sections inside subroutines have the same global scope, so care must be taken to ensure that the code does not nest two critical sections:

    void RecProd() {
        #pragma omp critical
        {
            RecProd();   // not allowed: nests two (anonymous) critical sections,
        }                // even though the same thread makes the recursive call
    }

Comparing Synchronization Directives

OpenMP provides fairly easy-to-use constructs for achieving synchronization; however, they can severely impact performance if not used wisely. Atomic and critical achieve the same objective, but atomic is much cheaper to implement internally. Prefer atomic over critical for single memory updates; critical makes sense when a group of statements needs to be executed by one thread at a time, as the sketch below shows.
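To make this guideline concrete, here is a small sketch of our own (not from the tutorial): the single-location counter update uses atomic, while the pair of updates that must happen together is guarded by critical.

    // Sketch: atomic for a single memory update, critical for a compound one.
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int count = 0;            // one location: atomic suffices
        int sum = 0, calls = 0;   // two locations updated together: use critical
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            #pragma omp atomic
            count++;

            #pragma omp critical
            {
                sum += i;         // both statements execute as one
                calls++;          // mutually exclusive unit
            }
        }
        printf("count=%d sum=%d calls=%d\n", count, sum, calls);
        return 0;
    }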
Locks: Low-Level Synchronization

OpenMP provides a set of runtime routines to support the use of locks.

Simple locks – used to protect resources; a simple lock can be set only if it is currently unset.
- omp_init_lock() – initializes the lock associated with a lock variable.
- omp_set_lock() – sets the lock; if it is already set, the calling thread waits.
- omp_test_lock() – tests whether the lock is set; if it is unset, sets it.
- omp_unset_lock() – releases the lock for others to use.
- omp_destroy_lock() – destroys the lock, freeing the lock variable.

    omp_lock_t lck;
    omp_init_lock(&lck);
    #pragma omp parallel private(tmp, id)
    {
        id = omp_get_thread_num();
        tmp = do_lots_of_work(id);
        omp_set_lock(&lck);          // one thread prints at a time
        printf("%d %d\n", id, tmp);
        omp_unset_lock(&lck);
    }
    omp_destroy_lock(&lck);

Nested locks – the same as simple locks, except that the thread currently holding the lock may re-acquire it; no other thread can acquire a set lock. The routines mirror those above, e.g. omp_init_nest_lock().

Synchronization pragmas versus locks: the pragmas are easier to use and significantly reduce the need to check for deadlocks and memory leaks, and most of them incur little to no implementation overhead. Locks, however, offer the user greater flexibility and control than constructs such as critical, in terms of nesting and of use across subroutine calls.

Other Clauses & Directives

- nowait clause – removes the implicit barrier at the end of a work-sharing construct.
- threadprivate(list) – declares static and file-scope variables to be private to each thread.
- if clause – used with the parallel directive; new threads are created for parallel execution of the enclosed code block only if the condition evaluates to true.
- #pragma omp master – the enclosed piece of code is executed only by the master thread.
- #pragma omp single – the enclosed piece of code is executed by only one of the threads in the team; an implicit barrier is placed at the end (see the closing sketch below).
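To close, a small sketch of our own construction contrasting the last two directives: the single block is executed by exactly one thread (whichever arrives first) and is followed by an implicit barrier, while the master block runs only on thread 0 and has no barrier.

    // Sketch: single vs. master inside a parallel region.
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);
        #pragma omp parallel
        {
            #pragma omp single
            printf("single: run by thread %d (any one thread)\n", omp_get_thread_num());
            // implicit barrier here: all threads wait for the single block

            #pragma omp master
            printf("master: run by thread %d (always 0), no barrier\n", omp_get_thread_num());
        }
        return 0;
    }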