Local Parallel Iteration in X10
Josh Milthorpe
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
[email protected]

Abstract

X10 programs have achieved high efficiency on petascale clusters by making significant use of parallelism between places; however, there has been less focus on exploiting local parallelism within a place. This paper introduces a standard mechanism - foreach - for efficient local parallel iteration in X10, including support for worker-local data. Library code transforms parallel iteration into an efficient pattern of activities for execution by X10's work-stealing runtime. Parallel reductions and worker-local data help to avoid unnecessary synchronization between worker threads. The foreach mechanism is compared with leading programming technologies for shared-memory parallelism using kernel codes from high performance scientific applications. Experiments on a typical Intel multicore architecture show that X10 with foreach achieves parallel speedup comparable with OpenMP and TBB for several important patterns of iteration. foreach is composable with X10's asynchronous partitioned global address space model, and therefore represents a step towards a parallel programming model that can express the full range of parallelism in modern high performance computing systems.

Categories and Subject Descriptors: D.1.3 [Concurrent Programming]: parallel programming

Keywords: X10, parallel iteration, loop transformations, work stealing

1. Introduction

Data parallelism is the key to scalable parallel programs [6]. Although X10 programs have demonstrated high efficiency at petascale, these impressive results have made little use of parallelism within a place, focusing instead on parallelism between places [8]. As most scientific codes make heavy use of iteration using for loops, parallel iteration (sometimes called a 'parallel for loop') is the most obvious approach to exploiting shared-memory parallelism.

The foreach statement was a feature of early versions of the X10 language. The statement was defined as follows:

    The foreach statement is similar to the enhanced for statement. An activity executes a foreach statement in a similar fashion except that separate async activities are launched in parallel in the local place of each object returned by the iteration. The statement terminates locally when all the activities have been spawned.

The requirement that each iteration of the loop be executed as a separate async made the original definition of foreach unsuitable for typical data-parallel patterns, for two reasons. Firstly, it allowed ordering dependencies between iterations, which prevented arbitrary reordering or coalescing of multiple iterations into a single activity.
For example, the loop in Figure 1 was previously a valid use of foreach, in which the (i+1)th iteration had to complete before the ith iteration could begin.

    val complete = new Rail[Boolean](ITERS);
    foreach (i in 0..(ITERS-1)) {
        when (complete(i+1));
        compute();
        atomic complete(i) = true;
    }

Figure 1: Loop with ordering dependence between iterations

Secondly, any intermediate data structures had to be duplicated for each iteration of the loop to avoid data races between iterations. For example, the loop in Figure 2 contains a data race due to sharing of the array temp between threads.

    val input:Rail[Double];
    val output:Rail[Double];
    val temp = new Rail[Double](N);
    foreach (i in 0..(ITERS-1)) {
        for (j in 0..(N-1)) {
            temp(j) = computeTemp(i, input(j));
        }
        output(i) = computeOutput(i, temp);
    }

Figure 2: Loop with an intermediate data structure shared between iterations

For correctness, temp had to be made private to the body of the loop, requiring that the array be duplicated once per iteration. For these reasons, the time to compute a parallel loop using foreach was often orders of magnitude greater than for an equivalent sequential loop. Furthermore, foreach (p in region) S was trivially expressible in terms of simpler X10 constructs as finish for (p in region) async S. The foreach construct was therefore removed in X10 version 2.1.

Despite its removal, there remains a strong need for an efficient mechanism for parallel iteration in the X10 language. The main contributions of this paper are:

• a standard mechanism for local parallel iteration in the X10 language;
• support for worker-local data in the X10 language;
• experimental evaluation of these mechanisms on a multicore architecture typical of HPC compute nodes;
• comparison with leading programming technologies for shared-memory parallelism, specifically OpenMP and TBB.

2. Related Work

The standard for shared-memory parallelism is OpenMP [1]. All leading C/C++ and Fortran compilers implement the OpenMP API, which provides efficient implementations of common parallel patterns. OpenMP parallel for loops support different scheduling choices, including static scheduling for regular workloads, and dynamic and guided scheduling for irregular workloads. In addition, OpenMP supports the creation of explicit tasks, which allow expression of a broader range of parallel patterns. However, the interaction between explicit tasks and the implicit tasks is not fully defined, which makes them difficult to compose [9].

Intel Threading Building Blocks (TBB) is a C++ template library for task parallelism [2]. In addition to efficient concurrency primitives, memory allocators, and a work-stealing task scheduler, TBB provides implementations of a range of common parallel algorithms and data structures. Parallel iteration and reduction are implemented as parallel_for and parallel_reduce, respectively. The TBB scheduler inspects the dynamic behavior of tasks to perform optimizations for cache locality and task size [7].

Unlike X10, neither OpenMP nor TBB provides support for distributed-memory parallelism.

3. Parallel Iteration With foreach

A new construct for parallel iteration may be defined as follows:

    foreach (Index in IterationSpace) Stmt

The body Stmt is executed for each value of Index, making use of available parallelism.
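For instance, an element-wise scaling of a Rail could be written as follows (an illustrative sketch of the proposed syntax; the array a and the scalar alpha are assumed to be defined elsewhere):

    // each index may be executed by any worker thread, in any order
    foreach (i in 0..(a.size-1)) {
        a(i) = alpha * a(i);
    }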
The iteration must be both serializable and parallelizable; in other words, it must be correct to execute Stmt for each index in sequence, and it must also be correct to execute Stmt in parallel for any subset of indices. The compiler applies one of a set of transformations (see Section 4) to generate a set of parallel activities that implements the foreach statement. The transformations available to the compiler may depend on the type of the index expression, and the choice of transformation may be controlled by annotations.

The index expression is evaluated on entry to the foreach statement to yield a set of indices, which must be invariant throughout the iteration. It is desirable that the index set support recursive bisection. Dense, rectangular index sets (Range and DenseIterationSpace) are trivially bisectable; for other region types, we envisage the introduction of a new interface SplittableRegion defining a split operation, similar to TBB's splitting constructor.

The body of the foreach statement must be expressible as a closure. In addition to the usual restrictions on closures – for example, var variables may not be captured – there are further restrictions specific to foreach. A conditional atomic statement (when) may not be included, as it could introduce ordering dependencies between iterations. Unconditional atomic may be included, as it cannot create an ordering dependency. These restrictions may be checked dynamically in the same manner as the X10 runtime currently enforces restrictions on atomic statements. Apart from these restrictions, foreach is composable with other X10 constructs including finish, async and at.

Correct execution of foreach assumes no preemption of X10 activities; each activity created by foreach runs to completion on the worker thread which started it. There is an implied finish: all activities created by foreach must terminate before execution proceeds to the statement following the construct.

3.1 Reduction Expression

Along with parallel iteration, reduction is a key parallel pattern and a feature of many scientific codes. The foreach statement may be enhanced to provide a parallel reduction expression as follows:

    result:U = reduce[T,U](reducer:(a:T, b:U)=>U, identity:U)
        foreach (Index in IterationSpace) {
            Stmt
            offer Exp:T;
        };

Each iteration offers a value Exp of type T; the reduction result is computed using the provided reducer function and an identity value identity, such that combining any value with identity leaves it unchanged. For example, the following code (with a single type argument, as the offered and result types are both Double) computes a vector dot product of arrays x and y:

    val x:Rail[Double];
    val y:Rail[Double];
    val dotProd = reduce[Double]((a:Double, b:Double) => a+b, 0.0)
        foreach (i in 0..(x.size-1)) {
            offer (x(i) * y(i));
        };

3.2 Worker-Local Data

A common feature of parallel iteration is the use of intermediate data structures. For example, in the loop in Figure 2, iterations of the loop that execute in parallel must operate on separate copies of the intermediate array temp to avoid data races. Note that it is not necessary that each iteration of the loop have a private copy of the array, only that no two iterations that execute in parallel share a copy.
It is possible to simply allocate a separate copy of the intermediate data for each iteration, for example:

    val input:Rail[Double];
    val output:Rail[Double];
    foreach (i in 0..(ITERS-1)) {
        val temp = new Rail[Double](N);
        for (j in 0..(N-1)) {
            temp(j) = computeTemp(i, input(j));
        }
        output(i) = computeOutput(i, temp);
    }

However, for data structures of any significant size, repeated allocation is unlikely to be efficient due to increased load on the garbage collector. An alternative option in Native X10 (using the C++ backend) is stack allocation, as follows:

    val input:Rail[Double];
    val output:Rail[Double];
    foreach (i in 0..(ITERS-1)) {
        @StackAllocate val temp = @StackAllocateUninitialized new Rail[Double](N);
        for (j in 0..(N-1)) {
            temp(j) = computeTemp(i, input(j));
        }
        output(i) = computeOutput(i, temp);
    }

The annotation @StackAllocate indicates that a variable should be allocated on the stack rather than the heap. The second annotation, @StackAllocateUninitialized, indicates that the constructor call should be elided, leaving the storage uninitialized. This avoids the cost of zeroing memory, but should be used with care to ensure values are not read before they are initialized. Stack allocation is a good choice for many applications; however, it is limited to variables that will fit on the stack (no large arrays), and is not supported in Managed X10 (using the Java backend).

As an alternative to either duplication or stack allocation, we propose a new class, x10.compiler.WorkerLocal, which provides a lazily initialized worker-local store. A worker-local store is created with an initializer function; the first time a given worker thread accesses the store, the initializer is called to create that worker's local copy of the data. The definition of foreach can be extended to support worker-local data as follows:

    foreach (Index in IterationSpace) local (
        val l1 = Initializer1;
        val l2 = Initializer2;
    ) {
        Stmt
    };

The value initializers in the local block may capture the environment of the parallel iteration, but may not reference any symbol defined inside the body. The body of the iteration may refer to any of the variables defined within the local block. Because the body may not include blocking statements, each execution of the body must run to completion on the worker thread on which it began; it therefore has exclusive access to its worker-local data for the entire duration. The x10.compiler.WorkerLocal class is very similar in design to TBB's enumerable_thread_specific type, which is also a lazily initialized thread-local store.

4. Implementation

The foreach, reduce and local keywords can be supported in X10 by extending the language syntax; however, we have not actually implemented these changes in the compiler. Instead, we have created two new classes, x10.compiler.Foreach and x10.compiler.WorkerLocal, which are intended as targets for future versions of the language; in the interim, these classes can be used directly from user code.

Given the definition of the foreach statement in Section 3, a variety of code transformations are possible. The X10 compiler should provide an efficient default transformation (for example, recursive bisection), combined with annotations to allow the user to choose different transformations for particular applications.
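Until such syntax is available, a loop such as that of Figure 2 might be written directly against these library classes. The following is a hypothetical sketch only: the constructor and method signatures of WorkerLocal and Foreach shown here are assumptions based on the descriptions in this paper, not a documented API.

    // assumed: WorkerLocal[T] takes an initializer, and temp() returns the
    // calling worker's lazily created copy
    val temp = new WorkerLocal[Rail[Double]](() => new Rail[Double](N));
    // assumed: Foreach.block(min, max, body) applies the block decomposition
    // to a body closure over an inclusive index range
    Foreach.block(0, ITERS-1, (min_i:Long, max_i:Long) => {
        val t = temp();  // this worker's private scratch array
        for (i in min_i..max_i) {
            for (j in 0..(N-1)) {
                t(j) = computeTemp(i, input(j));
            }
            output(i) = computeOutput(i, t);
        }
    });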
To illustrate some possible transformations, we consider the following implementation of a simple "DAXPY" operation using a foreach statement over a LongRange:

    foreach (i in lo..hi) {
        x(i) = alpha * x(i) + y(i);
    }

As a first step, the body of the foreach is extracted into a closure that executes the iteration sequentially over a range of indices passed as parameters:

    val body = (min_i:Long, max_i:Long) => {
        for (i in min_i..max_i) {
            x(i) = alpha * x(i) + y(i);
        }
    };

The body closure is then used to construct a parallel iteration using one of the code transformations described in the following subsections.

4.1 Basic

The basic transformation can be applied to any iterable index set, creating a separate activity for each index:

    finish for (i in lo..hi) async body(i, i);

This is equivalent to the original definition of foreach.

4.2 Block Decomposition

A block decomposition can be applied to any countable index set; it divides the indices into contiguous blocks of approximately equal size. By default, Runtime.NTHREADS blocks are created, one for each worker thread. Each block is executed as a separate async, except for the first block, which is executed synchronously by the worker thread that started the loop.

    val numElem = hi - lo + 1;
    val blockSize = numElem / Runtime.NTHREADS;
    val leftOver = numElem % Runtime.NTHREADS;
    finish {
        for (var t:Long = Runtime.NTHREADS-1; t > 0; t--) {
            val tLo = lo + ((t < leftOver) ? t*(blockSize+1) : t*blockSize + leftOver);
            val tHi = tLo + ((t < leftOver) ? (blockSize+1) : blockSize) - 1;
            async body(tLo, tHi);
        }
        body(lo, lo + blockSize - ((leftOver > 0) ? 0 : 1));
    }

4.3 Recursive Bisection

A recursive bisection transformation can be applied to any splittable index set. In this approach, the index set is divided into two approximately equal pieces, with each piece constituting an activity. Bisection recurs until a certain minimum grain size is reached. For multidimensional index sets, bisection applies preferentially to the largest dimension.

    static def doBisect1D(lo:Long, hi:Long, grainSize:Long,
                          body:(min:Long, max:Long)=>void) {
        if ((hi - lo) > grainSize) {
            async doBisect1D((lo+hi)/2L, hi, grainSize, body);
            doBisect1D(lo, (lo+hi)/2L, grainSize, body);
        } else {
            body(lo, hi-1);
        }
    }

    finish doBisect1D(lo, hi+1, grainSz, body);

With the recursive bisection transformation, if a worker thread's deque contains any activities, then the activity at the bottom of the deque will represent at least half of the index set held by that worker. Thus idle workers tend to steal large contiguous chunks of the index set, preserving locality.
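As noted above, for multidimensional index sets bisection splits the largest dimension first. The following sketch extends doBisect1D to a dense two-dimensional index set; it is illustrative only (the method name, signature and grain-size policy are assumptions, mirroring the 1D code above rather than any library implementation):

    // recursively bisect [xLo,xHi) x [yLo,yHi), always splitting the larger
    // dimension, until the remaining area is no more than grainSize
    static def doBisect2D(xLo:Long, xHi:Long, yLo:Long, yHi:Long,
                          grainSize:Long,
                          body:(xMin:Long, xMax:Long, yMin:Long, yMax:Long)=>void) {
        if ((xHi-xLo) * (yHi-yLo) > grainSize) {
            if ((xHi-xLo) >= (yHi-yLo)) {
                async doBisect2D((xLo+xHi)/2L, xHi, yLo, yHi, grainSize, body);
                doBisect2D(xLo, (xLo+xHi)/2L, yLo, yHi, grainSize, body);
            } else {
                async doBisect2D(xLo, xHi, (yLo+yHi)/2L, yHi, grainSize, body);
                doBisect2D(xLo, xHi, yLo, (yLo+yHi)/2L, grainSize, body);
            }
        } else {
            body(xLo, xHi-1, yLo, yHi-1);  // pass inclusive bounds to the leaf block
        }
    }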
5. Evaluation

We identified a number of application kernels representing common patterns in high-performance scientific applications. The use of kernels instead of full applications allows the effects of data-parallel transformations to be studied in a simplified context, free from scheduling effects due to other parts of the applications. Using these kernels, we compared the different compiler transformations for foreach and the different storage options for intermediate data structures. Finally, we compared the performance of the X10 versions of these kernels with versions written in C++ with OpenMP and/or TBB.

5.1 Application Kernels

5.1.1 DAXPY

The DAXPY kernel updates each element of a vector as x_i = α x_i + y_i.

    foreach (i in 0..(N-1)) {
        x(i) = alpha * x(i) + y(i);
    }

5.1.2 Dense Matrix Multiply

The MatMul kernel is an inner-product formulation of dense matrix multiplication, which updates each element c_{i,j} ← Σ_{k=1..K} a_{i,k} b_{k,j}. Parallel iteration is over a two-dimensional index set.

    foreach ([j, i] in 0..(N-1) * 0..(M-1)) {
        var temp:Double = 0.0;
        for (k in 0..(K-1)) {
            temp += a(i + k*M) * b(k + j*K);
        }
        c(i + j*M) = temp;
    }

Code for all versions of the DAXPY and Matrix Multiplication kernels can be found in ANUChem (https://sourceforge.net/projects/anuchem/).

5.1.3 Sparse Matrix-Vector Multiply

The SpMV kernel is taken from the X10 Global Matrix Library [3], available for download at http://x10-lang.org. It performs sparse matrix-vector multiplication and forms the basis of many GML algorithms.

    foreach (col in 0..(A.N-1)) {
        val colA = A.getCol(col);
        val v2 = B.d(offsetB + col);
        for (ridx in 0..(colA.size()-1)) {
            val r = colA.getIndex(ridx);
            val v1 = colA.getValue(ridx);
            C.d(r + offsetC) += v1 * v2;
        }
    }

5.1.4 Jacobi Iteration

The Jacobi kernel combines a stencil update of interior elements of a two-dimensional region with a reduction of an error residual. The Jacobi benchmark is available in the X10 applications repository (http://svn.code.sourceforge.net/p/x10/code/applications/jacobi).

    error = reduce[Double]((a:Double, b:Double) => { return a+b; }, 0.0)
        foreach (i in 1..(n-2)) {
            var my_error:Double = 0.0;
            for (j in 1..(m-2)) {
                val resid = (ax*(uold(i-1,j) + uold(i+1,j))
                             + ay*(uold(i,j-1) + uold(i,j+1))
                             + b*uold(i,j) - f(i,j)) / b;
                u(i,j) = uold(i,j) - omega * resid;
                my_error += resid * resid;
            }
            offer my_error;
        };

5.1.5 LULESH Hourglass Force

LULESH 2.0 [4, 5] is a mini-app for hydrodynamics on an unstructured mesh. It models an expanding shock wave in a single material originating from a point blast. The simulation iterates over a series of time steps up to a chosen end time. At each time step, node-centered kinematic variables and element-centered thermodynamic variables are advanced to a new state. The new values for each node/element depend on the values for neighboring nodes and elements at the previous time step. A model implementation is provided using C++, OpenMP and MPI; we ported this implementation to X10.

The LULESH application contains a number of important computational kernels which update different node and element variables. The kernel which computes the Flanagan-Belytschko anti-hourglass force for a single grid element accounts for the largest portion of the application runtime (around 20%). It requires a number of intermediate data structures, which are all small 1D or 2D arrays. The LULESH Hourglass Force kernel is available in the X10 applications repository (http://svn.code.sourceforge.net/p/x10/code/applications/lulesh2).

    foreach (i in 0..(numElem-1)) local (
        val hourgam = new Array_2[Double](hourgamStore, 8, 4);
        val xd1 = new Rail[Double](8);
        ...
    ) {
        val i3 = 8*i2;
        val volinv = 1.0 / determ(i2);
        for (i1 in 0..3) {
            ...
            val setHourgam = (idx:Long) => {
                hourgam(idx, i1) = gamma(i1, idx) - volinv *
                    (dvdx(i3+idx) * hourmodx +
                     dvdy(i3+idx) * hourmody +
                     dvdz(i3+idx) * hourmodz);
            };
            setHourgam(0);
            setHourgam(1);
            ...
            setHourgam(7);
        }
        ...
        calcElemFBHourglassForce(xd1, yd1, zd1, hourgam, coefficient, hgfx, hgfy, hgfz);
        ...
    }

5.2 Experimental Setup

The kernels described in §5.1 were executed on an Intel Xeon E5-4657L v2 @ 2.4 GHz. The machine has four sockets, each with 12 cores supporting 2-way SMT, for a total of 96 logical cores.
X10 version 2.5.2 was modified to implement the x10.compiler.Foreach and x10.compiler.WorkerLocal classes as described in Section 4. GCC version 4.8.2 was used for post-compilation of the Native X10 programs, as well as for the C++ versions of the kernels. Intel TBB version 4.3 update 4 was used for the TBB versions of the kernels. Each kernel was run for a large number of iterations (100–5000, enough to generate a minimum total runtime of several seconds), recording the mean time over a total of 30 test runs.

5.3 Comparison of Compiler Transformations

We first compare the efficiency of parallel iteration using the compiler transformations described in Section 4. Each kernel was compiled using the basic, block and recursive bisection transformations. Figure 3 shows the scaling with number of threads for each kernel using the different transformations.

Figure 3: Scaling with number of threads using different X10 compiler transformations. Panels: (a) DAXPY, (b) MatMul, (c) Jacobi, (d) SpMV, (e) LULESH Hourglass Force; each panel plots parallel speedup against number of threads (up to 96) for the block, bisect and basic transformations.

Parallel speedup (single-threaded time / multi-threaded time) is reported with respect to a baseline of the best mean time per iteration for a single thread. The results for the basic transformation illustrate why the original definition of foreach in X10 was infeasible: for all kernels, the basic transformation fails to achieve parallel speedup for any number of threads. The block and bisect transformations are more promising: all codes show some speedup up to at least 32 threads. The results fail to completely separate the two transformations; for each kernel, each transformation exhibits a greater parallel speedup for some portion of the tested thread range (2–96). The block transformation achieves the greatest maximum speedup for DAXPY, Jacobi and LULESH, whereas the 1D and 2D bisection transformations achieve the greatest speedup for SpMV and MatMul respectively. The fact that neither transformation is obviously superior indicates the importance of allowing the programmer to choose between them on a per-application or even per-loop basis.

We next compare the three different approaches to storage of local data that were discussed in Section 3.2.
Figure 4 shows the scaling with number of threads for the LULESH hourglass force kernel using per-iteration heap allocation, stack allocation, and x10.compiler.WorkerLocal. The greatest total speedup for LULESH (22×) is achieved with 56 threads using stack allocation; however, there is not a significant performance difference between the three approaches over the entire range. The intermediate data structures are not large (no array is larger than 32 8-byte elements), so it may be that the cost of allocating multiple copies is insignificant compared to other factors. Other application examples are needed to more thoroughly evaluate approaches to storing local data.

Figure 4: LULESH Hourglass Force kernel scaling with number of threads using different approaches for storage of local data (heap, stack, and worker-local).

5.4 Comparison of Programming Models

We implemented the kernels listed in §5.1 using OpenMP and TBB. OpenMP codes used static (block) scheduling for parallel for loops, and TBB codes used the default auto_partitioner. Figure 5 shows the scaling with number of threads for each kernel using the different programming models. Parallel speedup is normalized to the best single-thread time for any of the three models.

Figure 5: Scaling with number of threads using X10, OpenMP and TBB. Panels: (a) DAXPY, (b) MatMul, (c) Jacobi, (d) LULESH Hourglass Force.

Each programming model achieves the greatest maximum speedup for one of the kernels. For the DAXPY kernel, OpenMP significantly outperforms both X10 and TBB. TBB was not tested for the Jacobi or LULESH kernels. None of the kernel codes presented here achieve anything near perfect parallel speedup across the full range of threads tested. The maximum speedup achievable for a code depends on many factors in addition to the programming model, including: the level of parallelism available in the algorithm; the balance between floating point, memory and other operations; and cache locality. We hope to further explore these issues with regard to particular kernels, to determine whether enhancements to the X10 scheduler – for example, support for affinity-based scheduling [7] – are necessary to achieve greater parallel performance.

6. Conclusion

This paper presented the foreach construct, a new standard mechanism for local parallel iteration in the X10 language. It was shown that this mechanism achieves parallel speedup comparable with OpenMP and TBB for a range of kernels typical of high performance scientific codes. None of the compiler transformations are novel, nor is the provision of a mechanism for worker-local data. However, the mechanisms presented in this paper are composable with the X10 APGAS model, which exposes data locality in the form of places and supports asynchronous remote activities.
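For example, a local parallel loop composes directly with the existing distributed constructs; the following hypothetical sketch (the per-place accessor localData() and the scalar alpha are assumed) runs a foreach at every place:

    // one activity per place; each place runs a local data-parallel update
    finish for (p in Place.places()) at (p) async {
        val x = localData();   // assumed accessor for this place's portion of the data
        foreach (i in 0..(x.size-1)) {
            x(i) = alpha * x(i);
        }
    }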
The mechanisms presented here therefore represent a further step towards a programming model that can express the full range of parallelism in modern high performance computing systems.

Acknowledgments

Thanks to Olivier Tardieu and David Grove for their advice on many important details of the X10 runtime and compiler. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research under Award Number DE-SC0008923.

References

[1] OpenMP application program interface version 4.0. Technical report, OpenMP Architecture Review Board, July 2013. URL http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
[2] Intel Threading Building Blocks reference manual version 4.2. Technical report, Intel Corporation, 2014.
[3] S. S. Hamouda, J. Milthorpe, P. E. Strazdins, and V. Saraswat. A resilient framework for iterative linear algebra applications in X10. In 16th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2015), May 2015.
[4] I. Karlin, J. Keasler, and R. Neely. LULESH 2.0 updates and changes. Technical Report LLNL-TR-641973, August 2013.
[5] Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory. Technical Report LLNL-TR-490254.
[6] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier, July 2012. ISBN 9780123914439.
[7] A. Robison, M. Voss, and A. Kukanov. Optimization via reflection on work stealing in TBB. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), pages 1–8, 2008.
[8] O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. Saraswat, A. Shinnar, M. Takeuchi, and M. Vaziri. X10 and APGAS at petascale. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14), pages 53–66, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2656-8.
[9] X. Teruel, M. Klemm, K. Li, X. Martorell, S. Olivier, and C. Terboven. A proposal for task-generating loops in OpenMP. In Proceedings of the 9th International Workshop on OpenMP (IWOMP 2013), pages 1–14, 2013.