D5.4 – MAQAO ARM Port
Version 1.1

Document Information
Contract Number: 610402
Project Website: www.montblanc-project.eu
Contractual Deadline: M30/03/2015
Dissemination Level: PU
Nature: Other
Authors: Olivier Aumage, Denis Barthou, James Tombi M'Ba, Christopher Haine and Enguerrand Petit (INRIA)
Contributors: Brice Videau, Kevin Pouget (CNRS)
Reviewers: Bernd Mohr (Juelich)
Keywords: Performance analysis, MAQAO, SIMDization, performance/energy tradeoffs

Notices: The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 610402.

(c) Mont-Blanc 2 Consortium Partners. All rights reserved.

Change Log
Version  Description of Change
v1.0     Initial version
v1.1     Several corrections, in particular typos and graph figures
v1.2     Better phrasing and graph caption corrections

Contents
Executive Summary  4
1 Introduction  5
  1.1 MAQAO for Performance Tuning  5
  1.2 Porting MAQAO to ARM  5
2 Preliminary Performance Study on ARM big.LITTLE  6
  2.1 Architectural context  6
  2.2 Impact of vectorization on energy on ARM32  8
    2.2.1 TSVC Benchmark  8
    2.2.2 Vectorization/Energy tradeoffs  9
3 Porting MAQAO on ARM  11
  3.1 Installation and usage  11
  3.2 Instrumentation  13
4 Performance Analysis with MAQAO on ARM  14
  4.1 Dependence Graph Building  14
    4.1.1 Register Dependence Analysis  14
    4.1.2 Detecting loop counters and induction variables  15
  4.2 Interpretation and Analysis  15
    4.2.1 Detection of Vectorization Opportunities  16
    4.2.2 Identifying Transformations  18
5 Performance Analysis  22
  5.1 TSVC benchmark  22
    5.1.1 Reductions  22
    5.1.2 Large Memory Strides  22
    5.1.3 Complex Control Flow  23
    5.1.4 Non vectorizable loops  23
  5.2 Porting and Optimizing SMMP with MAQAO  24
    5.2.1 Porting and profiling  25
    5.2.2 MAQAO Analysis  25
    5.2.3 Optimizing for SIMDization  26
  5.3 Single Precision Limitations for Other Applications  28
6 On-going Developments  29
7 Conclusion  32

Executive Summary

This deliverable presents the main features of MAQAO on ARM. After a study of the impact of vectorization and of vectorization/energy tradeoffs on ARM32 architectures, we present the static analyses used on ARM and briefly describe the currently working instrumentation feature. We then apply MAQAO to a benchmark in order to illustrate the hints given by the tool, and to SMMP, a Mont-Blanc application, in order to optimize it. Finally, we describe the on-going work on data layout transformations.
1 Introduction

This deliverable presents MAQAO [2] ported to ARM. We first describe a performance study on ARM, in order to capture the importance of some metrics on performance; we then present the MAQAO features implemented on ARM and, finally, describe performance studies of Mont-Blanc applications with MAQAO.

1.1 MAQAO for Performance Tuning

MAQAO is a performance tuning tool for multicore codes. It analyzes binary executable codes and provides performance reports and hints to the users. These reports are obtained through the combined use of global static analyses on the binary and dynamic information captured through binary instrumentation. For the static analyses, MAQAO finds functions, loops and control flow in the code, and detects SIMDization and other possible optimizations performed by the compiler (inlining, for instance). MAQAO proposes a C API to disassemble and analyze binary codes. All analyses can be extended and scripted with Lua; the SIMD analyses proposed in this deliverable, for instance, are written in Lua.

To capture the dynamic behavior of a code, MAQAO proposes an instrumentation API able to perform time and value profiling. The functions of this API can be given parameters coming from the global static analysis (such as the depth of a loop, or whether the code is part of an inlined function) and from registers. The probes are user defined and can be written in assembly, in dynamically loaded libraries or even in Lua (and compiled through a JIT). This API is described for Intel architectures in a previous paper [4]. Value profiling of data streams (load and store addresses) can be handled with the compression proposed by NLR [6]. The combination of static and dynamic results is performed statically. MAQAO consists of different modules, and we discuss in the following section the modules that need to be ported.
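The value profiling of load and store addresses mentioned above relies on compressing highly regular address streams. As a rough illustration only (the actual NLR algorithm [6] recognizes full nested-loop patterns; the function below and its names are ours, not MAQAO's API), the sketch collapses an address trace into (base, stride, count) runs:

```python
def compress_stream(addrs):
    """Collapse an address trace into (base, stride, count) runs.

    A simplified stand-in for NLR-style compression: a strided loop of
    N accesses is stored as a single triple instead of N addresses."""
    runs = []  # each run: (base address, stride in bytes, number of accesses)
    for a in addrs:
        if runs:
            base, stride, count = runs[-1]
            if count == 1:
                # the second access of a run defines its stride
                runs[-1] = (base, a - base, 2)
                continue
            if a == base + stride * count:
                # access continues the current arithmetic progression
                runs[-1] = (base, stride, count + 1)
                continue
        runs.append((a, 0, 1))  # start a new run
    return runs

# 7 float accesses from two strided loops compress into 2 triples
trace = [0x1000, 0x1004, 0x1008, 0x100c, 0x2000, 0x2004, 0x2008]
print(compress_stream(trace))
```

A real trace compressor must additionally recognize nested repetitions (a run of runs), which is precisely what NLR's loop-nest reconstruction provides.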
1.2 Porting MAQAO to ARM

MAQAO has been developed for a large range of Intel architectures, in collaboration with Intel in particular. The objective of this deliverable is to port MAQAO to the ARM architecture. Figure 1 gives an overview of MAQAO. MAQAO consists of:

• A binary parser: this element is central for an ARM port. As there exist different ISAs for ARM processors, we focus first on the ARM32 instruction set (not Thumb), and later on ARM64.

• The code representation, built through 3 key analyses: (i) the call graph, finding which functions are called within each function; (ii) the control flow graph, detecting the execution paths within a function; and (iii) the dependence graph, defining the partial execution order between instructions that most optimizations must preserve. This part is also dependent on the architecture, due to particularities in the ISA, for instance in the description of returns or of post-increments. Besides, the instrumentation, based on the information collected by these analyses, relies on the ISA and has to be ported.

• The performance analysis module generates the hints provided to the user. For this port, a specific analysis is proposed. The first part, presented in this document, corresponds to the SIMD analysis. The second part, which will be part of the second deliverable, will focus on data layout analysis and optimization.

Figure 1: MAQAO overview, giving the different parts of MAQAO and their interaction.

The ARM port therefore requires rewriting many MAQAO modules. However, two parts are essential for the rest of the development on ARM: the binary parser and the instrumentation API. The performance analysis will be developed during the whole project.

2 Preliminary Performance Study on ARM big.LITTLE

The target architecture is the ARM big.LITTLE processor. This is a heterogeneous architecture whose key metrics are driven by frequency, cache capacity and SIMD instruction efficiency.
We highlight in this section the key features of this architecture in terms of performance and energy consumption. The tests have been conducted on an ARM32 big.LITTLE architecture; tests on ARM64 big.LITTLE are still in progress. Instead of modeling one particular ARM processor architecture, our objective is to capture essential metrics and measure performance (instead of predicting it). To this intent, this first preliminary study shows how SIMDization relates to energy.

2.1 Architectural context

The ARM big.LITTLE architectures have been designed to deliver performance and to save energy, according to the need. For our experiments we have two architecture setups, one with an ARM32 Cortex-A15/Cortex-A7 processor (Samsung ODROID-XU), one with an ARM64 Cortex-A57/Cortex-A53 processor (ARM Juno development board). Figure 2 shows the two processors Cortex-A57/Cortex-A53 (with possibly the GPU) sharing an interconnect [1] able to maintain cache coherency between the two L2 caches. The same architecture holds for the Cortex-A15/Cortex-A7 of the Samsung ODROID XU+E board we used in our benchmarks.

Figure 2: Cortex-A15/Cortex-A7 processor in the ODROID-XU+E architecture

Figure 3: Steps for switching from big to LITTLE (or LITTLE to big).

The decision to execute a code on the big or LITTLE processor can be taken automatically by the system or imposed by the developer. When governed automatically by the system, the execution is switched from big to LITTLE or the reverse according to heuristics based on the current load of the machine. On the ODROID XU+E board, only the active processor is on; the inactive one is off. Only one of the two Cortex clusters works at a time. Here, changing from one processor to the other at some point of an application requires flushing data out of one of the L2 caches in order to transfer them to the other L2 cache.
The different steps required to perform this transfer are explained in Figure 3, including task migration. On more recent configurations, it is possible to use both processors at the same time.

The board we used for the development of MAQAO on ARM has so far been the ODROID XU+E, since we received it in February 2014. We received the ARM64 Juno development board in January 2015 and have not yet carried out a performance study on it. The ODROID XU+E is a Samsung development board with 2GB of RAM, featuring the Exynos 5 Octa processor with a quad-core Cortex-A15 (from 800MHz to 1.6GHz, 2MB L2) and a quad-core Cortex-A7 (from 250MHz to 1.2GHz, 512KB L2). The Cortex-A15 is an out-of-order core, while the Cortex-A7 is a simpler in-order core. The PowerVR GPU available on the board has not been used. Switching from one processor to the other requires operating both at 800MHz. On this processor, SIMD vectors are 128 bits wide (4 single-precision floats) and the SIMD unit has no native double-precision instruction. The operating system is Linux Ubuntu 12.04 LTS Linaro and the compiler used is GCC 4.6.3.

2.2 Impact of vectorization on energy on ARM32

Vectorization is one of the key optimizations for high performance on modern architectures. We evaluate here the performance of both processors for vectorization.

2.2.1 TSVC Benchmark

The Test Suite for Vectorizing Compilers (TSVC) [3] is a benchmark suite of small kernels exercising the capacity of compilers to vectorize. It was improved in 2011 [7] and rewritten in C. It has 151 kernels, mostly simple loops, and most of them can be vectorized. Figures 4, 5 and 6 show sample kernels from TSVC.

    for (int i = 1; i < LEN; i++)
        X[i] = Y[i] + 1;

Figure 4: function s000, TSVC

    for (int i = 0; i < LEN2; i++)
        for (int j = 0; j < i; j++)
            aa[i][j] = aa[j][i] + bb[i][j];

Figure 5: function s114, TSVC
Figure 4 shows a straightforward example of vectorizable code, Figure 5 exhibits a triangular iteration domain and Figure 6 has a non-trivial control flow, with an induction variable (j). All of them are vectorizable on modern architectures, but require different techniques for SIMDization to be efficient. The dataset is small enough to fit into the L2 cache.

We first try to evaluate the impact of vectorization on performance and energy, and thus consider kernels for which GCC is able to generate both versions. We have selected 10 simple kernels from TSVC that GCC is able to vectorize (checked thanks to the vectorization reports): va, vpv, vtv, vpvtv, vpvts, vpvpv, vtvtv, vsumr, vdotr, vbor.

    int j = -1;
    for (int i = 0; i < LEN; i++)
        if (b[i] > (float)0.) {
            j++;
            a[j] = b[i] + d[i] * e[i];
        } else {
            j++;
            a[j] = c[i] + d[i] * e[i];
        }

Figure 6: function s124, TSVC

The expression computed by each kernel is shown in Figure 7, with its number of floating point operations.

    Kernel   Expression             Flops
    va       a[i] = b[i]            0
    vpv      a[i] += b[i]           64 x 10^9
    vtv      a[i] *= b[i]           64 x 10^9
    vpvtv    a[i] += b[i] * c[i]    25.6 x 10^9
    vpvts    a[i] += b[i] * s       6.4 x 10^9
    vpvpv    a[i] += b[i] + c[i]    51.2 x 10^9
    vtvtv    a[i] *= b[i] * c[i]    51.2 x 10^9
    vsumr    sum += a[i]            64 x 10^9
    vdotr    dot += a[i] * b[i]     64 x 10^9
    vbor     all combinations       12.288 x 10^9

Figure 7: Expressions and Flops for simple TSVC kernels

We explore two dimensions for these kernels: their performance on the LITTLE and big processors, and their energy consumption. In particular, we want to determine the gain in terms of energy provided by vectorization, and whether trade-offs are interesting to explore in the context of HPC applications.

2.2.2 Vectorization/Energy tradeoffs

Two metrics have been considered: the instantaneous consumption, showing the cost in terms of energy of the vector/scalar pipelines, and the global consumption, taking into account the time needed to execute a kernel.
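The two metrics are related in a simple way: the global consumption is the mean instantaneous power integrated over the execution time. The sketch below, with made-up numbers chosen only for illustration, shows how a vectorized run can draw more power yet consume less total energy because it finishes sooner:

```python
def mean_power(samples_w):
    # instantaneous consumption: average of the power samples (in watts)
    # collected while the kernel runs
    return sum(samples_w) / len(samples_w)

def energy(samples_w, runtime_s):
    # global consumption: mean power multiplied by execution time (joules)
    return mean_power(samples_w) * runtime_s

# hypothetical measurements: the vector pipeline draws more power, but the
# vectorized kernel runs much faster, so it wins on total energy
vec_samples, vec_time = [3.0, 3.1, 2.9], 1.0      # watts, seconds
scal_samples, scal_time = [2.5, 2.5, 2.5], 2.2

print(energy(vec_samples, vec_time))    # lower total energy
print(energy(scal_samples, scal_time))
```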
Figure 8 shows the results for the instantaneous consumption, on both the ARM Cortex-A15 and the Cortex-A7.

Figure 8: Mean instantaneous consumption on A15 (left) and A7 (right) for vectorized (runvec) and scalar (runnovec) versions of TSVC kernels

Figure 8 shows that the vector pipeline of the Cortex-A15 requires more energy than the scalar one, with a mean measured gap of 0.5 W. For the Cortex-A7 processor, this gap is too small to be accounted for.

The global consumption determines whether vectorizing pays off, in terms of energy but also of performance measured in Gflops/cycle. Figure 9 shows the impact on energy for the A15 and A7. As these plots confirm, vectorization has an important impact on both performance and energy, even if the vector pipeline requires more energy.

Figure 9: Global energy consumption on A15 (left) and on A7 (right) for vectorized (runvec) and scalar (runnovec) versions of TSVC kernels

The timing measurements obtained (not shown here) follow the same pattern: kernels consuming less energy are kernels that execute faster. Between the two architectures, the energy requirement is within a mean factor of 1.8 (the A7 consumes less than the A15). However, some kernels such as vpvtv, vpvpv and vtvtv have similar energy consumption on the A7 and the A15, while they are faster on the A15. One possible reason is that the A15 architecture may be more efficient on three-operand kernels. The plots of Figure 9 enabled the comparison between two versions, vectorized and non-vectorized, on each architecture.
Now, Figure 10 compares the efficiency of the same (vectorized) version of the kernels on both architectures.

Figure 10: Comparison of the Gflop/W ratio (left) and Gflop/W/s (right) on Cortex-A15 and Cortex-A7 architectures, for TSVC kernels. For both, higher is better; the right figure has a log scale.

Figure 10 on the left shows the Gflop/W for each of the 10 kernels, while the right figure presents the Gflop/W/s metric. The first figure therefore focuses on the energy efficiency of the code on the architecture. Globally for these simple kernels, the A7 outperforms the A15 in terms of energy efficiency (for the vsumr and vbor kernels), but for the kernels cited previously (vpvtv, vpvpv, vtvtv) both architectures have the same consumption behavior. In terms of efficiency measured in Gflop/W/s, the Cortex-A15 outperforms the Cortex-A7 nearly all the time, but there are still examples where both architectures are on par. Determining precisely the conditions on the kernels that lead to such a situation remains to be done.

Figure 11 provides the speed-up on execution time, comparing both architectures for the same (vectorized) kernels. Combining these observations with the previous ones, Figure 11 shows that the Cortex-A15 offers a speedup of 2.2 compared to the Cortex-A7, even if for the vsumr kernel both architectures provide the same performance in terms of Gflop/W/s. For most of the other kernels, the Cortex-A15 offers both better Gflop/W/s and better performance.
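One plausible reading of the two plotted ratios (our interpretation; the report does not spell out the unit conventions): Gflop/W divides the work done by the mean power drawn, and Gflop/W/s additionally divides by the runtime, which makes it equivalent to Gflop per joule. The numbers below are invented for illustration only:

```python
def gflop_per_watt(gflop, mean_power_w):
    # work delivered per watt of mean power draw
    return gflop / mean_power_w

def gflop_per_watt_per_s(gflop, mean_power_w, runtime_s):
    # the same ratio divided by runtime: equivalently, Gflop per joule
    return gflop_per_watt(gflop, mean_power_w) / runtime_s

def speedup(time_a7_s, time_a15_s):
    # execution-time speedup of the A15 over the A7
    return time_a7_s / time_a15_s

# invented sample point for one kernel (watts and seconds are hypothetical)
gflop = 64.0
a15 = {"power": 3.0, "time": 100.0}
a7 = {"power": 0.45, "time": 220.0}

print(gflop_per_watt(gflop, a7["power"]))    # the A7 can lead on Gflop/W...
print(speedup(a7["time"], a15["time"]))      # ...while the A15 runs 2.2x faster
```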
This study shows that SIMDization is indeed critical for performance, on both big and LITTLE processors.

Figure 11: Speed-up between A7 and A15 processors for different vectorized TSVC kernels

The SIMD pipeline of the benchmarked architecture consumes more energy (around 10%) than the scalar one, but in terms of energy efficiency, even for simple kernels, the comparison between big and LITTLE architectures is in general in favor of the big processor, due to the reduced execution time. In the following, only the big architecture will be considered. SIMDization is the main metric used by the current version of MAQAO on ARM.

3 Porting MAQAO on ARM

We provide hereafter a short description of how to use MAQAO on the ARM32 architecture and what kind of results it provides.

3.1 Installation and usage

The prerequisites for MAQAO are provided in the README file:

• gcc and g++
• cmake, with a version higher than 2.8.8

To build MAQAO inside the build directory, type:

> cmake .. -DARCHS=arm
> make

Other usual flags can be used for the cmake command. By default, MAQAO and its libraries are built inside the bin and lib directories. A system-wide installation can be obtained by using make install. The environment variable MAQAO_SA_PATH must be set to the source directory of MAQAO. MAQAO can be compiled either on an ARM architecture or on an x86 Intel machine (faster).

The MAQAO disassembler can be tested through:

> bin/maqao madras -d <mybinaryfile>

The binary files considered should be compiled with the -marm flag. Thumb instructions are not analyzed by MAQAO (even if they are disassembled). To generate a report:

> bin/maqao simd_analyzer.lua <mybinaryfile>

with mybinaryfile the executable to analyze.
The output generated corresponds to the analysis of the innermost loops of all functions. For a more pinpointed analysis, the name of the function to analyze can be given as a parameter:

> bin/maqao simd_analyzer.lua runvec:s242

MAQAO then generates a report for the function named s242 of the binary called runvec. This report is:

analysing: s242
s242 debug data unavailable
---------- raw instruction listing
o l.137     0xf160: vldr s15, [r5, #0]      ; flags(0x4010)
o l.137     0xf164: ldr r5, [pc, #208]      ; flags(0x10)
o l.137     0xf168: ldr ip, [pc, #208]      ; flags(0x10)
o l.137     0xf16c: mov r3, #1              ; flags(0x10)
o l.137     0xf170: ldr r0, [pc, #204]      ; flags(0x10)
o l.137     0xf174: ldr r1, [pc, #204]      ; flags(0x10)
o l.137     0xf178: mov r2, r5              ; flags(0x10)
o l.137:136 0xf17c: vadd.f32 s15, s15, s16  ; flags(0x4010)
o l.137:136 0xf180: add ip, ip, #4          ; flags(0x10)
o l.137:136 0xf184: vldr s12, [ip, #0]      ; flags(0x10)
o l.137:136 0xf188: add r0, r0, #4          ; flags(0x10)
o l.137:136 0xf18c: vldr s13, [r0, #0]      ; flags(0x10)
o l.137:136 0xf190: add r1, r1, #4          ; flags(0x10)
o l.137:136 0xf194: vldr s14, [r1, #0]      ; flags(0x10)
o l.137:136 0xf198: add r3, r3, #1          ; flags(0x10)
o l.137:136 0xf19c: cmp r3, #32000          ; flags(0x10)
o l.137:136 0xf1a0: vadd.f32 s15, s15, s12  ; flags(0x10)
o l.137:136 0xf1a4: vadd.f32 s15, s15, s13  ; flags(0x10)
o l.137:136 0xf1a8: vadd.f32 s15, s15, s14  ; flags(0x10)
o l.137:136 0xf1ac: vmov lr, s15            ; flags(0x10)
o l.137:136 0xf1b0: str lr, [r2, #4]!       ; flags(0x10)
o l.137:136 0xf1b4: bne f17c                ; flags(0x19)
o l.137     0xf1b8: ldr r0, [pc, #124]      ; flags(0x10)
o l.137     0xf1bc: ldr r1, [pc, #124]      ; flags(0x10)
o l.137     0xf1c0: ldr r2, [pc, #124]      ; flags(0x10)
o l.137     0xf1c4: ldr r3, [pc, #124]      ; flags(0x10)
o l.137     0xf1c8: str r6, [sp, #16]       ; flags(0x10)
o l.137     0xf1cc: stm sp, r7, r8, sl      ; flags(0x10)
o l.137     0xf1d0: str sb, [sp, #12]       ; flags(0x10)
o l.137     0xf1d4: blx 8814                ; flags(0x12)
o l.137     0xf1d8: subs r4, r4, #1         ; flags(0x10)
o l.137     0xf1dc: bne f160                ; flags(0x19)
---------- circuits analysis - checking whether data dependences are compatible with vectorization
. l.136 has 1 dependence circuit(s) on FP instructions
. analysing dependence circuit 1 of l.136:
.   0xf17c: vadd
.   0xf1a0: vadd
.   0xf1a4: vadd
.   0xf1a8: vadd
> reduction with instruction vadd
> loop is vectorizable with reduction
- generation of dot files ---->./cfg_runvecs242.dot

The first part provides the listing of the assembly code analyzed. The left annotations, l.137 and l.136, show the scope of two nested loops; here, loop 136 is included in loop 137. SIMD analysis is only performed on innermost loops. Then follows a list of analyses. In the example above, the circuit analysis finds that there is a loop-carried dependence cycle involving floating point (FP) instructions. As all the instructions involved are additions, this is a reduction and the loop can therefore be vectorized. The analysis results are discussed in the following section. Finally, the data dependence graph of the innermost loop is generated, and its name is given at the end of the report. The files generated are dot files, and png figures can be obtained with the graphviz tool, using the dot command:

> dot -Tpng ./cfg_runvecs242.dot

An example can be seen in Figure 12.

3.2 Instrumentation

MAQAO is able to patch ARM binaries. In this preliminary version, patching is only supported for ARM32 architectures.
The command line is the following:

> bin/maqao madras <mybinaryfile> --function="test;@0x8490"

where the argument --function defines the probe function to insert and the address where the call will be inserted. The function (here test) has to be defined statically in the binary code. This means that the instrumentation functions have to be provided at compile time and the code needs to be recompiled. Another approach, easier to use and with no need for recompilation, would be to load the probe functions dynamically from a library. The dynamic loading of an instrumentation library containing user probes is not yet functional.

The address given to the patcher can be found automatically by MAQAO according to user constraints. This low-level patching method can be used within MAQAO Lua scripts, where the API can iterate through the code structure (loops, instructions, blocks) and provide the corresponding code addresses. The patching mechanism proceeds by block relocation: the whole basic block containing the address to patch is relocated to a new code section, a jump is inserted at its previous address, and the block is padded with nops. Registers are saved before calling the probe function and restored after its execution.

4 Performance Analysis with MAQAO on ARM

The main objective of this deliverable is to analyze ARM binaries and detect opportunities for optimization. MAQAO analyzes binary executable codes and restructures them by finding functions, loops and blocks, and by computing the dependence graph. From this information, user hints are generated, suggesting possible ways to improve the code. The construction of the control flow graph and call graph is based on usual techniques and is not described here. The disassembling itself relies on the objdump library (libopcode), adapted for MAQAO. The dependence graph is detailed in the following, together with the hints generated by MAQAO.
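For reference, the "usual technique" for finding loops on a control flow graph is back-edge detection followed by natural-loop construction. A textbook sketch (not MAQAO's actual implementation) over a CFG given as predecessor lists:

```python
def natural_loop(preds, tail, head):
    """Blocks of the natural loop of back edge tail -> head: the head plus
    every block that can reach the tail without passing through the head."""
    loop = {head, tail}
    work = [tail]
    while work:
        block = work.pop()
        for p in preds.get(block, ()):  # walk predecessors, stopping at head
            if p not in loop:
                loop.add(p)
                work.append(p)
    return loop

# CFG of a simple counted loop: entry -> header -> body -> latch -> header
preds = {"header": ["entry", "latch"], "body": ["header"],
         "latch": ["body"], "exit": ["header"]}
print(sorted(natural_loop(preds, "latch", "header")))
```

Block and edge names here are illustrative; a real analysis first computes dominators to confirm that latch -> header is indeed a back edge.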
4.1 Dependence Graph Building

We propose a method to build the dependence graph between instructions statically. This method combines reduction detection with a register dependence analysis, performed statically on the code by MAQAO. Later in the project, a dynamic memory dependence graph, obtained through instrumentation, will be used to better capture memory dependences.

4.1.1 Register Dependence Analysis

This dependence analysis is performed by MAQAO on the instructions of innermost loops. It computes the existing dependences between any couple of instructions in the loop, due to the use of registers. The dependences are of one of three types: RAW (read after write), WAR (write after read) and WAW (write after write). For the vectorization analysis, only RAW dependences (true dependences) are analyzed, since WAW and WAR dependences are due to register reuse and can be removed by choosing a different register allocation. All register dependences and their distances (0: dependence inside the same iteration; 1: the write occurs one iteration before the read) are computed. Several particularities of the assembly code are taken into account:

• Zeroing registers: when a XOR-like instruction is applied with its two operands being the same register, this register is initialized to 0. The outcome of the instruction does not depend on the initial value of the register, even if there is a read, hence there is no dependence with previous instructions. Compilers use this special case to initialize registers, in particular SIMD registers, and our dependence analysis takes it into account.

• Post-incrementing address registers: on ARM, address registers can be post-incremented, with LDMIA/STMIA instructions for instance, or more generally with a ! suffix.

• Return and call instructions are usually achieved through explicit manipulation of the instruction pointer register.
In particular, returns are usually generated through the pop instruction, restoring the instruction pointer value from the stack.

Besides, when registers are used by instructions as memory indices, the dependences to these instructions for these registers are tagged as memory index computation. This is used later to separate the computation that needs to be vectorized from the induction variable and address computation.

Figure 12 presents the code and dependence graph for function s000 from the TSVC benchmark. The nodes in the graph are assembly instructions; the edges are RAW dependences with their distance.

    for (int i = 0; i < lll; i++)
        X[i] = Y[i] + 1;

Figure 12: Source code and dependence graph of the binary code of function s000. Labels on edges represent dependence distances. Red edges are dependences for registers used in address computation. Two recurrences occur, one with a post-incremented address register, the other with the loop counter.

4.1.2 Detecting loop counters and induction variables

Induction variables are detected through the analysis of the dependence graph, taking into account the specificities of some instructions (such as post-increments). Following the usual analyses proposed in the literature and implemented in compilers, dependence cycles are detected; whenever all the instructions involved are used for address computation, and these instructions are only mov and simple arithmetic operations, the registers written are tagged as induction variables for the address computation. Capturing the stride of these variables is important to grasp how data structures are iterated over. This approach is simple but cannot capture all cases. To complement it, a trace-based approach using instrumentation is planned, which could capture more complex memory access patterns.
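To make the two analyses above concrete, here is a toy re-implementation (ours, not MAQAO's code) on a simplified instruction form (mnemonic, written registers, read operands), loosely modeled on the s000 loop of Figure 12:

```python
def raw_deps(body):
    """RAW dependences (producer index, consumer index, distance) for one
    loop body given in program order."""
    deps = []
    for i, (_, _, reads) in enumerate(body):
        for r in reads:
            writers_before = [j for j in range(i) if r in body[j][1]]
            if writers_before:
                # nearest earlier write in the same iteration: distance 0
                deps.append((writers_before[-1], i, 0))
            else:
                writers_after = [j for j in range(i, len(body)) if r in body[j][1]]
                if writers_after:
                    # value produced in the previous iteration: distance 1
                    deps.append((writers_after[-1], i, 1))
    return deps

def induction_registers(body):
    """Registers updated once per iteration as r <- r + constant, with their
    stride: the simple pattern described in the text."""
    return {w[0]: [x for x in reads if isinstance(x, int)][0]
            for op, w, reads in body
            if op == "add" and len(w) == 1 and w[0] in reads
            and any(isinstance(x, int) for x in reads)}

# simplified body of the s000 loop from Figure 12
body = [
    ("add",  ["r3"],  ["r3", 4]),        # 0xb888: advance the address register
    ("vldr", ["s15"], ["r3"]),           # 0xb88c: load Y[i]
    ("vadd", ["s15"], ["s15", "s14"]),   # 0xb890: add the constant
    ("cmp",  [],      ["r3", "r4"]),     # 0xb894: loop bound test
    ("str",  [],      ["s15", "r2"]),    # 0xb89c: store X[i]
]
print(raw_deps(body))
print(induction_registers(body))        # r3 is an induction register, stride 4
```

The distance-1 self-dependence on the add mirrors the loop-counter recurrence visible in Figure 12; a real implementation also handles post-increment side effects and the zeroing idiom described above.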
Figure 13 shows an example of induction variable detection. Two independent registers are used to compute the 4 addresses (3 loads and one store); both use a 1024-byte increment, which advocates either for an optimization at the loop level (here, an interchange with the outer loop) or for a restructuring of the data layout (a transpose).

4.2 Interpretation and Analysis

The dependence graph, associated with the knowledge of the individual instructions and register accesses provided by MAQAO, constitutes the foundation of the vectorization analysis. In this section, we describe the methods used to identify different vectorization opportunities in codes, in particular in the TSVC benchmark suite [7]. Besides, we propose a set of hints to guide the user towards vectorization.

    for (int j = 0; j < LEN2; j++)
        aa[j][i] = aa[j][i] + bb[j][i] * cc[j][i];

Figure 13: Source code and dependence graph of the binary code of function s2275. Two induction registers are detected, r2 and r3, with an increment of 1024 in both cases. This corresponds to a large stride in the data structure.

4.2.1 Detection of Vectorization Opportunities

The method to detect possible vectorization is based on the dependence graph, with memory and register-based dependences, and only determines whether a schedule exists for a possible vectorized version. The first step is to separate in the graph (if possible) the instructions relative to the computation of memory addresses or induction variables from the rest of the computation. The existence of a schedule for a vectorized version then depends on structural conditions on the remaining graph.
We describe in detail the different steps of this method.

Identifying Address Computation

Vectorization concerns only a fraction of the instructions: some instructions are necessary for control and for address computation, and these will not be vectorized. It is therefore essential to separate these different instructions in the dependence graph. By tagging RAW dependences for registers used in address indexing (loads and stores), we partition the instructions belonging to the same connected component by cutting these edges. One partition corresponds to the instructions relative to address computation, while the other corresponds to the instructions to vectorize. Post-incremented address registers are handled specifically: for such instructions (usually a load or store), two nodes are considered, one reading and writing the address register, the other performing the load/store.

This edge cut may lead to more than 2 partitions of the same connected component. This reflects the case where there are indirections (such as A[B[i]]). This property will be used in the following section to suggest data reshaping. Moreover, it is possible to find dependence graphs where no cut exists. This does not occur in the TSVC benchmark suite, but the following example illustrates the case:

    for i = 0, n
        A[i] = B[i-1];
        B[i] = B[A[i]]

In this code, there is a dependence cycle between the two instructions, and one of the dependences is due to address computation. In this case, we cannot partition the dependence graph, and the following conditions for vectorization will be applied to the whole graph.

Figure 14: Partitioning the dependence graph of function s171.
The first partition corresponds to address computation, the second one to floating-point computation. Red edges connect the two partitions.

Figure 14 represents the dependence graph on the binary code of function s171. The graph is partitioned so as to analyze the floating-point part of the computation separately from the address computation. Only the floating-point instructions will be vectorized.

Dependence Cycles

We only consider here the partitions of the dependence graph that do not contribute to address computation (if partitioning is possible). We present in this section a sufficient condition for vectorization, based on the preservation of the dependences. Dependences are weighted by their distance, and cycles in the dependence graph have a cumulative weight > 0 (assuming a single-dimension distance vector). We assume the vectorization we want to achieve places inside the same SIMD vector the data accessed by a few consecutive iterations, such as 4 floats for 128-bit Neon vectors. These 4 floats, which were accessed in different iterations, are accessed after vectorization during the same iteration. In terms of dependence distances, vectorization divides the distance by a factor corresponding to the number of elements in a SIMD vector (4 in our example). Hence vectorization is possible if there exists a schedule for the vectorized dependence graph. Such a schedule exists if and only if there is no cycle of weight 0. For multidimensional distance vectors, only the distance of the innermost loop (the one that will be vectorized) is considered. Large strides or dependence cycles for this dimension may lead to considering loop interchange or transposition.

This condition is a structural property of the graph and can be checked automatically. In the current state of the dependence graph computation, as only scalar dependences are computed (with distance 0 or 1), this vectorization check is not necessary.
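The cycle-weight condition above can be checked structurally. A minimal sketch (illustrative Python; names and the graph encoding are assumptions): vectorization by a factor VL is rejected whenever some dependence cycle has cumulative distance below VL, since such a cycle would become a 0-weight cycle after vectorization. An exhaustive DFS is sufficient for the small per-loop graphs considered here.

```python
def cycle_weights(edges):
    """edges: dict node -> list of (successor, distance).
    Yields the cumulative weight of every elementary cycle,
    each cycle enumerated once from its smallest node."""
    def dfs(start, node, weight, path):
        for succ, d in edges.get(node, []):
            if succ == start:                         # cycle closed
                yield weight + d
            elif succ not in path and succ > start:   # canonical start node
                yield from dfs(start, succ, weight + d, path | {succ})
    for n in edges:
        yield from dfs(n, n, 0, {n})

def vectorizable(edges, vl):
    """True if no cycle would collapse to weight 0 after dividing
    distances by the vector length vl."""
    return all(w >= vl for w in cycle_weights(edges))

# a[i] = a[i-1] + ...: self-cycle of distance 1 -> not vectorizable by 4
print(vectorizable({"vadd": [("vadd", 1)]}, 4))   # False
# a[i] = a[i-4] + ...: cycle of weight 4 -> a schedule exists
print(vectorizable({"vadd": [("vadd", 4)]}, 4))   # True
```

With only scalar distances 0 or 1, as in the current ARM implementation, every cycle fails the test, which is why the check only becomes useful once trace-based distances are available.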
However, as we plan to use memory traces to capture more memory dependences, this check will then need to be performed.

Reductions

We present in this section another sufficient condition for vectorization, this time not preserving dependences. Similarly to the previous section, we consider in the dependence graph the cycles with a weight lower than the size of the SIMD vectors (in number of elements). These cycles would be transformed after vectorization into 0-weight cycles, preventing a schedule from being found. When all the instructions involved in such a cycle perform the same associative operation (such as ADD, MAX, MIN) and the cycle is elementary, the computation boils down to a reduction: the dependences can be broken and the computation can be rescheduled thanks to associativity. Depending on the dependence distance on the cycle, this will require some data layout transformation or shuffling (depending on the SIMD ISA).

for (int i = 0; i < LEN; i += 5)
    dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2] * b[i + 2]
              + a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4];

[Dependence graph over the binary code of s352: the vldr loads feed a chain of vmul/vmla/vadd instructions accumulating into s16; edges are labeled with dependence distances 0 or 1.]

Figure 15: Reduction detection: code and dependence graph for function s352. Address computation instructions have been removed. The cycle of length 4 is a reduction with a combination of vadd and vmla.

Figure 15 illustrates the case where the dependence graph has a cycle of weight 1, and all nodes in the cycle are additions (either plain additions, or additions combined with a multiply).
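The reduction condition above can be sketched as follows (illustrative Python; the opcode sets and function name are assumptions, not MAQAO's API). A cycle that would collapse under vectorization is still acceptable when all of its instructions perform the same associative operation, as in the vadd/vmla cycle of function s352:

```python
# vmla (multiply-accumulate) combines with vadd since its accumulation is an add.
ASSOCIATIVE = {"vadd", "vmla", "vmax", "vmin"}

def is_reduction(cycle_opcodes, cycle_weight, vl):
    """cycle_opcodes: opcodes of the instructions on one elementary cycle.
    cycle_weight: cumulative dependence distance of the cycle.
    vl: SIMD vector length in elements."""
    if cycle_weight >= vl:
        return False        # the cycle survives vectorization; nothing to break
    ops = set(cycle_opcodes)
    # either a pure add/multiply-accumulate cycle, or a single associative op
    return ops <= {"vadd", "vmla"} or (len(ops) == 1 and ops <= ASSOCIATIVE)

print(is_reduction(["vadd", "vmla"], 1, 4))   # True: s352-style dot product
print(is_reduction(["vsub"], 1, 4))           # False: not associative
```

When the test succeeds, the dependence can be broken and the accumulation rescheduled, at the cost of a final cross-lane combine (shuffling, depending on the SIMD ISA).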
The code presented is not vectorized (it uses scalar registers instead of vectors). Therefore, this code computes a reduction and is vectorizable, provided that this 4-term reduction can be rewritten with SIMD vector code. In this particular case, the code is unrolled and each memory access has large strides. MAQAO will not find the reroll transformation, but suggests changing the data structure (e.g., turning an array of structures into a structure of arrays) and implementing the reduction.

4.2.2 Identifying Transformations

In addition to the analysis of the dependence graph, MAQAO can help the user determine the transformations needed to make the code vectorizable.

Instruction Scheduling

When the dependence graph has no cycle, the code can be vectorized, provided that the instructions are scheduled in the loop according to the dependences. With intrinsics, load and store operations can be scheduled independently of computation operations. MAQAO can help the user find a correct schedule for intrinsics instructions, in particular in the presence of loop-carried dependences in the original code.

for (int i = 0; i < LEN-1; i++) {
    a[i] = b[i] * c[i] * d[i];
    b[i] = a[i] * a[i+1] * d[i];
}

[Dependence graph over the binary code of s241: the vldr loads feed two vmul.f32 chains whose results are written back by str and vstmia; edges are labeled with dependence distances 0 or 1.]

Figure 16: Instruction scheduling: code and dependence graph for function s241 (address computations have been removed for clarity). This function is vectorizable. All the loads involved in the multiplications have to be scheduled before the stores. However, the compiler did not succeed in vectorizing it.

Figure 16 shows, for function s241, a code that can be vectorized with no difficulty.
There are two expressions computed here, sharing some common variables. The compiler has scheduled the left computation first, and then the second one. This prevents vectorization. The graph also shows a valid schedule where all loads are scheduled first (vectorized), followed by all computations. MAQAO can here advise the user to explicitly reschedule the source instructions in order to ease vectorization.

Data Reshaping

The MAQAO static dependence analysis finds the strides used by the address registers. This provides several hints for transformations:

• The value of s_n is larger than the size of an accessed element: either another stride s_k has a value equal to the size of an element, in which case this advocates for a loop interchange or for a data layout transformation corresponding to a transposition; or s_n is the smallest stride, but it does not correspond to the size of an element (as for function s2275, presented in Fig. 13, where the strides of all accesses are 1024 bytes long). In this case, there are “holes” in the structure that call for array reshaping. This generally corresponds to changing an array of structures into a structure of arrays.

• The value of s_n is negative: negative strides may prevent the compiler from vectorizing. If all other memory accesses also have negative strides for the same loop, loop reversal can be a solution. For function s112 for instance, all loads and stores have strides of −4 (in bytes). Loop reversal is therefore possible here. For function s122, only one load has a negative stride, and this advocates for changing the data layout of this array.

Loop Transformation

The dependence graph can show some opportunities for loop distribution, loop reversal (see data reshaping) or loop interchange. Loop distribution is beneficial when one part of the computation is sequential while the other is parallel, and these parts can be separated. This occurs for function s222, illustrated in Figure 17.
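The stride-based hints of the Data Reshaping paragraph can be sketched roughly as follows (illustrative Python; the function name, input encoding and hint wording are assumptions, not MAQAO's API). `strides` maps each memory access of the innermost loop to its byte stride, and `elem` is the element size in bytes:

```python
def stride_hints(strides, elem):
    """strides: dict access id -> byte stride in the innermost loop.
    elem: size in bytes of the accessed elements."""
    hints = []
    smallest = min(strides.values(), key=abs)
    if all(s < 0 for s in strides.values()):
        hints.append("all strides negative: consider loop reversal")       # s112 case
    elif any(s < 0 for s in strides.values()):
        hints.append("mixed-sign strides: consider reshaping the "
                     "negative-stride array")                              # s122 case
    if abs(smallest) > elem:
        if elem in strides.values():
            hints.append("unit-stride access exists: consider loop "
                         "interchange or transposition")
        else:
            hints.append("holes in the structure: consider array-of-structures "
                         "to structure-of-arrays conversion")              # s2275 case
    return hints

# s2275-like situation: every access strides by 1024 bytes over 4-byte floats
print(stride_hints({"ld_aa": 1024, "ld_bb": 1024, "ld_cc": 1024, "st_aa": 1024}, 4))
```

For s112, where all strides are −4 over 4-byte floats, the same sketch reports the loop-reversal hint and nothing else.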
for (int i = 1; i < LEN; i++) {
    a[i] += b[i] * c[i];
    e[i] = e[i - 1] * e[i - 1];
    a[i] -= b[i] * c[i];
}

[Dependence graph over the binary code of s222: two distinct slices, one built from vldmia/vmul/vadd/vsub/vstr around a post-incremented address register, the other sequential through array e; edges are labeled with dependence distances 0 or 1.]

Figure 17: Loop distribution: code and dependence graph for function s222. The loop has two distinct slices of computation. It could be distributed into two loops.

The cycle in the left slice is not a reduction: it is only due to the post-increment instruction. This part is vectorizable. On the right, the loop is sequential due to a memory dependence (not detected here by the static dependence analysis).

Reduction Rewriting

Reduction detection should generate a hint showing how to write a reduction with intrinsics. The code will depend on the SIMD ISA used.

Idiom Recognition

Idiom recognition consists in recognizing a (vector) expression from the dependence graph. Using the memory trace to identify the different memory arrays used by the computation, and the register-based dependence graph to identify the vector operations, it is possible to detect the following operations from the TSVC benchmark (X, Y denote vectors, c a scalar) and build the vector expression, as a hint for the user:

• memcopy, on dense or sparse arrays

• reductions such as c += X[i] (s311), c = dot(X, Y) (s313), c = max(c, max(X)) (s314)

• vector operations such as X = Y + c with c a scalar (s000), X[i] = Y[i] + X[i-1] (s111), and all functions vpv, vtv, etc.

For all the functions that can be associated with a library function, the hint is to replace the computation with the appropriate call.
Figure 18 shows a dependence graph where the heart of the computation is a memcopy, and the array read is accessed through an indirection. The user then has to determine the appropriate library call to replace this code, according to the indirection array.

for (int i = 0; i < LEN; i++)
    a[i] = b[ip[i]];

[Dependence graph over the binary code of vag: an ldr on the index array feeds an address add, a second ldr and a post-incremented str; edges are labeled with dependence distances 0 or 1.]

Figure 18: Idiom recognition: code and dependence graph for function vag. By naming the independent data streams in the dependence graph, it can be found that this function computes the vector expression Z[i] = X[Y[i]]. This is a sparse memcpy.

Another case corresponds to a code where the dependence graph has no memory dependence, and MAQAO could build a vector expression corresponding to the computation. In the case presented in Figure 19, this is a DAXPY operation.

for (int i = 0; i < LEN; i++)
    a[i] += b[i] * c[i];

[Dependence graph over the binary code of vpvtv: three vldr loads feed a vmla.f32 whose result is stored by vstmia; edges are labeled with dependence distances 0 or 1.]

Figure 19: Idiom recognition: code and dependence graph for function vpvtv. By naming the independent data streams in the dependence graph, it can be found that this function computes the vector expression Z = Z + X ∗ Y.

Limits

Most of the limitations come directly from the dependence analysis itself. The remaining limits concern the identification of the correct vectorization transformation. Structurally, MAQAO focuses on loops.
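The stream-naming step behind Figures 18 and 19 can be sketched as follows (illustrative Python under simplified assumptions; the DAG encoding and function name are not MAQAO's). Each node is an `(opcode, operands)` pair; loads open fresh stream names X, Y, Z, and walking the register dependence DAG rebuilds the vector expression:

```python
import string

def expr(dag, node, names):
    """dag: dict node id -> (opcode, tuple of operand node ids).
    names: dict mapping load nodes to stream names (mutated in place)."""
    op, args = dag[node]
    if op == "vldr":                  # a load starts a new independent stream
        if node not in names:
            # name streams X, Y, Z (this sketch handles up to three loads)
            names[node] = string.ascii_uppercase[23 + len(names)]
        return names[node]
    if op == "vmla":                  # multiply-accumulate: acc = acc + a * b
        acc, a, b = (expr(dag, n, names) for n in args)
        return f"{acc} + {a} * {b}"
    sym = {"vadd": "+", "vmul": "*", "vsub": "-"}
    a, b = (expr(dag, n, names) for n in args)
    return f"{a} {sym[op]} {b}"

# vpvtv (Figure 19): a vmla accumulating two loaded streams into a loaded one
dag = {"z": ("vldr", ()), "x": ("vldr", ()), "y": ("vldr", ()),
       "acc": ("vmla", ("z", "x", "y"))}
print(expr(dag, "acc", {}))          # X + Y * Z
```

The resulting string is exactly the kind of vector-expression hint described above; matching it against a catalog of known idioms (memcopy, dot product, vpv, vtv, ...) then yields the library-call suggestion.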
When the loops are fully unrolled, or when the loop bodies are part of other functions (not inlined), MAQAO cannot detect vectorization opportunities. More generally, loop rerolling (even after a partial unroll) would be difficult to advocate for. Indeed, it requires identifying that different slices of computation are equivalent and can be “factorized” into a loop. This rerolling comes with data reshaping issues and is not considered so far.

5 Performance Analysis

The evaluation of our method has been conducted on the TSVC benchmark suite [7] and on several codes from the Mont-Blanc project. The objective differs in the two cases: TSVC illustrates the capacity of MAQAO to detect a number of optimization opportunities, while for the real applications, the goal is to guide optimization with MAQAO and obtain better performance results.

5.1 TSVC benchmark

MAQAO performs a dependence analysis on the code in order to find conditions for vectorization. For ARM, the dependence graph is only based on register dependences. Contrary to the x86_64 implementation, the trace of all memory references is not collected, and memory dependences are therefore not computed. This implies that some loops that may not be vectorizable, due to a memory dependence, may be reported as vectorizable. The instrumenter is not yet connected to the memory trace library NLR [6], but this will be the focus of the following developments. However, as for the x86_64 analysis, we generate hints corresponding to vectorization opportunities. The method is similar to the one exposed in [8]. We present in the following the list of hints automatically generated.

5.1.1 Reductions

Reductions correspond to cycles in the dependence graph using an associative operation (such as ADD, MAX, MIN). If the weight of the cycle is lower than the size of a vector, then this is a reduction (otherwise the code is directly vectorizable).
Reductions are vectorizable but require some additional transformation and rescheduling. The table in Figure 20 shows the reductions found in the TSVC benchmark suite. Most functions use either vadd or vmla. The latter corresponds to a dot product.

5.1.2 Large Memory Strides

Memory strides are detected statically thanks to the recurrence computing the address register used by load and store instructions. For instance, function s1115 in Figure 21 exhibits one load access with a 1024-byte stride, due to array c. This appears in its dependence graph as an add recurrence with a 1024-byte increment. The hint generated is the following:

- memory stride analysis - checking whether data strides are compatible with vectorization
. l.39 has references with large or negative index strides
> memory load with stride of 1024 bytes

Function  Length  Operator
s1118     1       vmla
s126      1       vmla
s221      2       vadd
s231      1       vadd
s233      1       vadd
s2233     1       vadd
s235      1       vmla
s242      4       vadd
s256      1       vsub
s275      1       vmla
s2111     1       vadd
s311      1       vadd
s312      1       vmul
s313      1       vmla
s317      1       vmul
s319      1       vadd
s3112     1       vadd
s323      2       vmla
s421      1       vadd
s1421     1       vadd
s422      1       vadd
s423      1       vadd
s424      1       vadd
s453      1       vadd
s471      1       vadd
s4115     1       vmla
s4116     1       vmla
vsumr     1       vadd
vdotr     1       vmla
vbor      1       vadd

Figure 20: List of functions of TSVC with reductions on floating-point operations. MAQAO lists for each function the type (load or store) and the number of accesses with large strides. All strides are given in bytes.

5.1.3 Complex Control Flow

When several execution paths are present in the innermost loop of a function, vectorization becomes more difficult. Using versioning, loop splitting or conditional masks (or conditional moves for ARM) is required in general. The table in Figure 22 provides the list of functions with complex control flow. It is interesting to see that for a number of functions with conditionals, the compiler is able to flatten the conditional and generate code with conditional moves (such as s274).
The most complex case corresponds to a switch statement.

5.1.4 Non-vectorizable loops

The following functions have been considered as non-vectorizable: s321, s3111, s322, s352. Since only dependences between registers are taken into account on ARM, this implies that there is a dependence cycle involving more than one instruction. These codes may correspond to scans (computing partial sums) that could be vectorizable (s321), codes with flattened conditionals (s3111), or codes mixing different instructions for the addition (vadd and vmla, such as s352) that could also be vectorizable.

for (int i = 0; i < LEN2; i++)
    for (int j = 0; j < LEN2; j++)
        aa[i][j] = aa[i][j]*cc[j][i] + bb[i][j];

[Dependence graph over the binary code of s1115: the vldr loads feed a vmla.f32 stored by vstmia; the address arithmetic includes an add of 1024 on r1; edges are labeled with dependence distances 0 or 1.]

Figure 21: Function s1115 with its dependence graph. One of the induction registers, r1, has an increment of 1024. This corresponds to a large stride in the vector cc.

5.2 Porting and Optimizing SMMP with MAQAO

Simple Molecular Mechanics for Proteins (SMMP) is an application proposed by the University of Oklahoma, the Leibniz Institute for Molecular Pharmacology, the Academia Sinica and the Juelich Supercomputing Centre. SMMP implements Monte-Carlo algorithms to simulate the thermodynamics of proteins. The code is written in Fortran and several parallel versions are available. However, it had not yet been ported to the ARM architecture.

Function  # of execution paths
s253      2
s272      2
s277      2
s278      2
s279      2
s1279     2
s2710     2
s441      2
s442      3

Figure 22: List of functions with complex control flow. Two paths mean there is an ‘if’ statement that has not been flattened by the compiler. More than 2 paths implies a switch statement, for instance.
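The path counts of Figure 22 come from the control flow of one loop iteration. A minimal sketch (illustrative Python; the CFG encoding is an assumption): with the back-edge removed, the loop body is a DAG, and the number of execution paths from the loop header to its tail can be counted recursively.

```python
from functools import lru_cache

def count_paths(cfg, entry, exit):
    """cfg: dict basic block -> list of successor blocks
    (one loop iteration, back-edge removed, hence acyclic)."""
    @lru_cache(maxsize=None)
    def walk(block):
        if block == exit:
            return 1
        return sum(walk(s) for s in cfg.get(block, []))
    return walk(entry)

# An 'if' that was not flattened by the compiler: two paths per iteration
cfg = {"head": ["then", "else"], "then": ["tail"], "else": ["tail"]}
print(count_paths(cfg, "head", "tail"))   # 2
```

A 3-way branch (e.g. a small switch, as for s442) yields 3 paths with the same count; any result above 1 triggers the versioning/loop-splitting/mask hint.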
5.2.1 Porting and profiling

The main limitation of our benchmark platform (the ODROID-XU+E, with ARM32 processors) is that it is only able to perform single-precision computation. Changing the code accordingly still leads to an execution with no error, and the first results are correct. The execution time does not exceed one hour and is therefore appropriate for performance tuning. To compile and vectorize the code, the following flags are used: -O2 -g -funsafe-math-optimizations -ftree-vectorize -mfpu=neon -ftree-vectorizer-verbose=4.

This application comes with 5 examples, 3 of which are used in this study:

• annealing
• multicanonical
• parallel tempering s

The two other examples are not considered: one has issues related to single precision (it does not converge), and the other has too short an execution time to expect any interesting performance gain.

A basic profiling is conducted with gprof, in order to focus our study on hot functions only. For the 3 considered examples, the profiling graphs are shown in Figures 23, 24 and 25. The exclusive time spent in each function is the value given in parentheses. The hot spots of these examples are the two functions enyshe and enysol, taking respectively 93.38% and 77.58% of the total exclusive time (time spent in the function itself only). Most of the time is spent in loop nests. Figure 26 shows one of the innermost loops of function enysol. Note that the accesses to xyz and spoint do not allow vectorization. The iteration count of this loop is small, but the loop is executed many times.

5.2.2 MAQAO Analysis

The SIMD report of MAQAO for the enysol function is not reproduced in full, but we highlight some interesting points in this section. We compare the output of the optimization report given by GCC with the hints given by MAQAO. Note that at this point MAQAO only performs a static dependence analysis, hence some memory dependences, possibly preventing SIMDization, are not taken into account.
MAQAO therefore provides “optimistic” hints. For instance, Figure 27 presents a loop reported as SIMDizable by MAQAO, while a memory dependence prevents this. This will be corrected with trace-based dependences (see instrumentation). Conversely, the loop presented in Figure 28 cannot be vectorized according to GCC, but is detected as vectorizable by MAQAO. The GCC optimization report indicates that the booleans have type LOGICAL(kind=4), which prevents the compiler from vectorizing. A number of initialization loops and computation loops involving these booleans therefore cannot be vectorized. A possible modification is to change the type of these variables (as in C, where integers encode booleans) and observe the impact on vectorization.

On loops 650 and 653, presented in Figure 29, MAQAO indicates that a 16000-byte stride exists between elements. The corresponding report fragment is given in Figure 30. This stride comes from the way the elements of arrays spoint and xyz are iterated. To get rid of this stride, there are two possibilities:

• loop interchange,
• transposing the arrays.

[Figure 23: gprof profile graph for the annealing example.]

[Figure 24: gprof profile graph for the multicanonical example.]

The MAQAO report shows that loop 660 has 8 possible execution paths. For inner loops, multiple execution paths prevent the compiler from aggressively optimizing and vectorizing. The report is shown in Figure 32. The dependence analysis provided by MAQAO shows that there is no dependence between the instructions of the different execution paths, as shown in Figure 31. We could therefore split this loop into at least 3 different loops, and this could help the compiler to vectorize. This loop also exhibits stride issues, since elements are not read consecutively. However, even when considering these transformations, conditional instructions remain in the loop bodies, and this prevents many optimizations. Loops 645 and 655 have respectively 2 and 3 possible execution paths.
Loop 655 has 2 nested conditionals. These conditions are interdependent and cannot be modified. The two paths of loop 645 result from one conditional depending on an indirect memory access, as can be seen in Figure 33. A possible solution would be to first copy the values of rvdw(indsort(ii)) into a new array, outside of the loops, so that the compiler can vectorize the code (using a mask).

5.2.3 Optimizing for SIMDization

In order to improve the SIMDization by the compiler, the following optimizations have been performed on the code of the enysol function.

[Figure 25: gprof profile graph for the parallel tempering s example.]

Booleans have been changed into integers throughout the code, with the following convention (as in C):

• TRUE = 1
• FALSE = 0

All conditions have been rewritten to preserve code correctness. This change (difficult to do automatically) significantly improved the capacity of GCC to vectorize loops.

Loop 660 has been split into 3 loops in order to reduce its complexity. However, GCC was still not able to vectorize it. This modification has another beneficial impact: as the loop accesses 4 large arrays (16000 bytes each), reducing the number of different arrays accessed reduces the possibility of cache conflicts.

Several optimizations have been performed on loops 650 and 653. Loop interchange is not possible due to memory dependences. Transposition is still possible, requiring a copy-in outside of the loop for the xyz and spoint arrays. This solution was tested, but the gains are nullified by the time taken by the copy-in. Another possibility is to change the data layout of these arrays from the beginning, so that no additional copy is required. A more global analysis shows that xyz and spoint are only used in this function. The initialization was therefore modified so as to transpose all accesses. The modified loops are shown in Figure 34.
After all optimizations have been applied, compilation is performed with the flags -O2 -g -funsafe-math-optimizations -ftree-vectorize -mfpu=neon -ftree-vectorizer-verbose=4.

lst=1
do il=1,npnt
    sdd=0.0
    do ilk=1,3
        sdd=sdd+(xyz(lst,ilk)+spoint(il,ilk))**2
    end do
    ...

Figure 26: Loop example from enysol

do i=1,ncbox
    inbox(i+1)=inbox(i+1)+inbox(i)
end do

Figure 27: Sequential loop given as vectorizable by MAQAO, enysol function

The GCC optimization report indicates that loops 650 and 653 have been vectorized, together with the initialization loops. Inspection of the assembly code confirms that vectorization has indeed been achieved. In order to further improve performance, the initialization loops have been replaced by calls to memset in enysol.

We compare the impact of these optimizations on performance and energy. Both code versions are compiled with the same flags: -O2 -g -funsafe-math-optimizations -ftree-vectorize -mfpu=neon. The first version has no hand-tuned optimizations, while the second corresponds to the version with improved vectorization and memsets. The first version executes in 55min39s and the optimized version in 34min9s, exhibiting a speedup of 1.6. Figure 35 represents the energy measurements on both versions of SMMP, with the parallel tempering s example. Figure 35, on the left, shows the instantaneous consumption of the two versions of SMMP: there is no difference between the two versions. The additional energy cost due to the SIMD pipeline, observed on the TSVC benchmarks, has been smoothed out here. Figure 35, on the right, shows the total energy for the execution of SMMP, optimized and not optimized. The optimization brings a factor of 1.5. This mainly shows the impact of vectorization on energy consumption.

5.3 Single Precision Limitations for Other Applications

The study of applications such as Profasi, Quantum Espresso or BigDFT has led to some difficulties on the single-precision architecture considered.
Profasi

This code has not yet been ported to the ARM architecture. It is a protein folding and aggregation simulator, developed at the Juelich Supercomputing Centre. The application relies on an iterative solver. By transforming the data into single precision, the time for convergence increased dramatically (from 40 min to 8 days).

do il=1,npnt
    surfc(il)=.false.
end do

Figure 28: Initialization loop of an array of booleans, enysol function

lst=1
do il=1,npnt
    sdd=0.0
    do ilk=1,3
650     sdd=sdd+(xyz(lst,ilk)+spoint(il,ilk))**2
    end do
    if(sdd.gt.radv2(lst)) then
        do ik=1, nnei
            sdd=0.0
            do ilk=1,3
653             sdd=sdd + (xyz(ik,ilk)+spoint(il,ilk))**2
            end do
    ...

Figure 29: Loops 650 and 653 of enysol

- memory stride analysis - checking whether data strides are compatible with vectorization
. l.650 has references with large or negative index strides
- memory load with stride of 16000 bytes
- memory stride analysis - checking whether data strides are compatible with vectorization
. l.653 has references with large or negative index strides
- memory load with stride of 16000 bytes

Figure 30: Report fragment of MAQAO for loops 650 and 653

Quantum Espresso

This code is developed and used by Cineca, the Democritos National Simulation Center and the University Pierre and Marie Curie. It studies the structure of materials at the nanoscopic scale. Transforming this code from double precision to single precision generates approximations: atoms overlap due to the lack of precision. The application detects the errors and stops.

BigDFT

This application is developed at the CEA of Grenoble and by the Institut Nanosciences et Cryogénie (INAC). This code has been developed and optimized for ARM. However, the optimization considered does not concern vectorization, and the code works in double precision. When simplified, the code corresponds to a stencil pattern. The optimization of the code relies not only on SIMDization but also on the reuse of data elements [10].
This requires the trace-based analysis we are currently porting to ARM.

6 On-going Developments

The on-going developments on MAQAO are the following:

• Porting MAQAO to ARM64: this is an extension of what has been done on ARM32, as the libopcode library (the disassembling routines of objdump, used in MAQAO) for ARM64 relies on what is written for ARM32.

do j=nlow+1,nup
    if(xat(j).le.xmin) then
        xmin=xat(j)
    else if(xat(j).ge.xmax) then
        xmax=xat(j)
    end if
    avr_x=avr_x+xat(j)
    if(yat(j).le.ymin) then
        ymin=yat(j)
    else if(yat(j).ge.ymax) then
        ymax=yat(j)
    end if
    avr_y=avr_y+yat(j)
    if(zat(j).le.zmin) then
        zmin=zat(j)
    else if(zat(j).ge.zmax) then
        zmax=zat(j)
    end if
    avr_z=avr_z+zat(j)
    if(rvdw(j).ge.rmax) rmax=rvdw(j)
end do

Figure 31: Loop 660, enysol

- control flow analysis
. l.660 has 8 different execution paths
> complex control flow hinders vectorization. Use versioning, loop splitting or masks

Figure 32: Report fragment of MAQAO for loop 660

do ii=inbox(jbox)+1, inbox(jbox+1)
    if(rvdw(indsort(ii)).gt.0.0) then
        look(jcnt)=indsort(ii)
        jcnt=jcnt+1
    end if
end do

Figure 33: Loop 645, enysol

lst=1
do il=1,npnt
    sdd=0.0
    do ilk=1,3
650     sdd=sdd+(xyz(ilk,lst)+spoint(ilk,il))**2
    end do
    if(sdd.gt.radv2(lst)) then
        do ik=1, nnei
            sdd=0.0
            do ilk=1,3
653             sdd=sdd + (xyz(ilk,ik)+spoint(ilk,il))**2
            end do
    ...

Figure 34: Optimized loops 650 and 653, enysol

[Bar plots comparing SMMP_base and SMMP_optimized: mean instantaneous power (left) and total energy over the run (right).]

Figure 35: Mean instantaneous consumption (left) and global consumption (right) for SMMP original and optimized codes

• Higher-level instrumentation: a higher-level instrumentation first requires the dynamic loading of an instrumentation library. This requires a modification of the PLT ELF section; this is on-going work.
In particular, capturing memory traces will be possible as soon as this goal is achieved.

• Analysis of the data layout: this analysis requires the use of memory traces, in order to capture the full dependence graph (including memory dependences) and the memory patterns. As presented at LCPC 2014 [5], this analysis aims at measuring the impact of data layout transformations.

• Integration into BOAST [9]: this corresponds to the next deliverable for MAQAO. To reach this goal, we rely on the previous analysis, together with the current SIMD analysis, in order to provide feedback to BOAST.

The first two items are incremental w.r.t. the current MAQAO state. The last item does not depend on the target architecture. It requires, however, that the memory trace be captured through instrumentation. The algorithmic part of this approach has already been worked on, and the results for Intel x86 have been presented in a workshop. As soon as the instrumentation API enables memory traces, the third item will be tested on ARM.

The objective of the data layout transformations is to evaluate the impact of this kind of transformation on other optimizations (in particular vectorization) and to evaluate the impact of reuse, in particular for BigDFT. The solutions proposed by MAQAO will not be semantically correct in the general case, since the analysis is based on a profile. However, depending on the code (in particular if the code is regular), the data layout transformations can be made more generic. As is already the case, performance will not be predicted through a performance model. Instead, we will resort to performance measurements of the generated code. This is particularly important for data layout transformations, where cache effects, coherence traffic and memory bandwidth are difficult to model and have a large impact on performance.

7 Conclusion

This report shows how MAQAO can analyze ARM codes and provide feedback to the user.
This analysis is essentially static and concerns vectorization, while a preliminary working instrumentation of ARM binary codes has been developed in MAQAO, opening the way for combined static/dynamic analyses. Several applications of the Mont-Blanc project have been analyzed with MAQAO on ARM32, and the TSVC benchmarks have been used to test the different features of MAQAO. The development of instrumentation using dynamic libraries is the next step, together with the extension to ARM64 codes. The method to analyze memory layouts and reuse, which has been developed in parallel, will then be tested on the ARM architecture in the context of performance tuning with BOAST.

References

[1] ARM. ARM CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual. Technical report, ARM, 2011.

[2] Denis Barthou, Andres Charif Rubial, William Jalby, Souad Koliai, and Cedric Valensi. Performance tuning of x86 OpenMP codes with MAQAO. In Parallel Tools Workshop, pages 95–113, Dresden, Germany, September 2009. Springer-Verlag.

[3] D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: A test suite and results. In Proceedings of the 1988 ACM/IEEE Conference on Supercomputing, Supercomputing '88, pages 98–105, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.

[4] Andres S. Charif-Rubial, Denis Barthou, Cedric Valensi, Sameer Shende, Allen Malony, and William Jalby. MIL: A language to build program analysis tools through static binary instrumentation. In IEEE Intl. High Performance Computing Conference (HiPC), pages 206–215, Hyderabad, India, December 2013.

[5] Christopher Haine, Olivier Aumage, and Denis Barthou. Exploring and evaluating array layout restructuration for SIMDization. In Intl. Workshop on Languages and Compilers for Parallel Computing (LCPC), Hillsboro, OR, USA, September 2014. To appear.

[6] Alain Ketterlin and Philippe Clauss. Prediction and trace compression of data access addresses through nested loop recognition. In ACM/IEEE Intl. Symp.
on Code Generation and Optimization, pages 94–103, New York, NY, USA, 2008. ACM Press.

[7] Saeed Maleki, Yaoqing Gao, Maria J. Garzarán, Tommy Wong, and David A. Padua. An evaluation of vectorizing compilers. In ACM/IEEE Intl. Conf. on Parallel Architectures and Compilation Techniques, 2011.

[8] C. Haine, O. Aumage, D. Barthou, and T. Meunier. Detecting SIMDization opportunities through static-dynamic dependence analysis. In Workshop on Productivity and Performance, 2013.

[9] Brice Videau, Vania Marangozova-Martin, and Johan Cronsioe. BOAST: Bringing Optimization through Automatic Source-to-Source Transformations. In Proceedings of the 7th International Symposium on Embedded Multicore/Manycore System-on-Chip (MCSoC), Tokyo, Japan, 2013. IEEE Computer Society.

[10] Brice Videau, Vania Marangozova-Martin, Luigi Genovese, and Thierry Deutsch. Optimizing 3D Convolutions for Wavelet Transforms on CPUs with SSE Units and GPUs. In Proceedings of the 19th Euro-Par International Conference, Aachen, Germany, 2013. Springer Berlin Heidelberg.