PTF Energy Tuning Plugin for HPC Applications
FAKULTÄT FÜR INFORMATIK DER TECHNISCHEN UNIVERSITÄT MÜNCHEN

Masterarbeit in Informatik

PTF Energy Tuning Plugin for HPC Applications
Energieoptimierung für HPC-Anwendungen mit Periscope

Author: Umbreen Sabir Mian
Supervisor: Prof. Dr. Michael Gerndt
Date: December 16, 2013

Statement of Academic Integrity

I, Last name: Mian, First name: Umbreen Sabir, ID No.: 03624284, hereby confirm that the attached thesis, "PTF Energy Tuning Plugin for HPC Applications" or "Energieoptimierung für HPC-Anwendungen mit Periscope", is my own work and I have documented all sources and material used.

Munich, December 15, 2013
Umbreen Sabir Mian

Acknowledgments

I would like to thank my supervisor, Prof. Dr. Michael Gerndt, for his support and for being the inspiration for this work. Deepest gratitude is due to Prof. Gerndt, without whose knowledge and assistance this thesis would not have been successful. I offer my sincere thanks to Robert Mijakovic for his valuable advice, kind guidance, and technical support; I am very thankful that he always spared some time from his busy schedule to help me. Furthermore, it has been a very pleasant experience to work at the Lehrstuhl für Rechnertechnik und Rechnerorganisation, Fakultät für Informatik der Technischen Universität München. The acknowledgments would be incomplete without special thanks to Dr. Carmen Navarrete from LRZ for providing the required information and guidance. Last but not least, I am grateful for the support of my family and friends, who have always stood beside me in both the ups and downs of my life.

Abstract

The PCAP plugin targets the energy consumption of OpenMP applications. The plugin uses energy measurements from the Enopt library to optimize the energy consumption of instrumented applications at runtime by changing the tuning parameter "Number of Threads" for certain instrumented code regions (OpenMP parallel regions). The energy consumption and the execution time of a code region are combined using the Energy Delay Product (EDP). The application is tuned by applying different values of the tuning parameter to different code regions via a cross product and then finding the globally optimal EDP for the application. The scalability of a code region affects the chosen tuning parameter value, and the granularity of a code region is reflected in the energy consumption measured for that region.

Contents

Acknowledgments
Abstract
Outline of the Thesis

I. Introduction and Background
  1. Introduction
  2. Related Work
  3. SuperMUC
     3.1. System purpose and target users
     3.2. System overview
     3.3. Energy Efficiency
     3.4. System Configuration Details
          3.4.1. Memory Architecture
          3.4.2. Details on processors
     3.5. System Software
     3.6. Storage Systems
          3.6.1. Home file systems
          3.6.2. Work and Scratch areas
          3.6.3. Tape backup and archives
     3.7. Energy Measurement - enopt
          3.7.1. Using EnOpt
  4. Energy Delay Product
     4.1. Auto-tuning Feedback Metric
     4.2. Energy Measurement
  5. Periscope Tuning Framework
     5.1. Periscope
     5.2. Periscope Tuning Framework (PTF)
          5.2.1. PTF main components
          5.2.2. PTF Repository Structure
          5.2.3. PTF Plugins

II. Design and Implementation
  6. PCAP Plugin
     6.1. Tuning Plugin Interface (TPI)
          6.1.1. Initialize
          6.1.2. Start Tuning Step
          6.1.3. Create Scenarios
          6.1.4. Prepare Scenarios
          6.1.5. Define Experiment
          6.1.6. Get Restart Info
          6.1.7. Process Results
     6.2. PCAP Plugin
          6.2.1. Tuning Objective
          6.2.2. Tuning points
          6.2.3. Initialize
          6.2.4. Start Tuning Step
          6.2.5. Create Scenarios
          6.2.6. Prepare Scenarios
          6.2.7. Define Experiment
          6.2.8. Get Restart Info
          6.2.9. Process Results
          6.2.10. Objective Functions
  7. Exhaustive Search
     7.1. Add Search Space
     7.2. Create Scenarios
     7.3. Iterate Search Spaces
     7.4. Iterate Tuning Points
     7.5. Generate Scenarios
     7.6. Search Finished

III. Experiments and Results
  8. Experimental Analysis
     8.1. NAS Parallel Benchmarks (NPB)
     8.2. Results on SuperMUC Compute Node (16 processing cores)
     8.3. First Experiment Set
     8.4. Second Experiment Set
     8.5. Third Experiment Set
     8.6. Fourth Experiment Set
  9. Conclusion

Bibliography
Outline of the Thesis

Part I: Introduction and Background

Chapter 1: Introduction. This chapter presents an overview of the work and its purpose.

Chapter 2: Related Work. This chapter gives a brief overview of related work on the same topic as this thesis.

Chapter 3: SuperMUC. This chapter gives a brief overview of the SuperMUC machine used for all experiments in this work and gives details about the Enopt library.

Chapter 4: EDP. This chapter discusses the EDP objective function used in the plugin.

Chapter 5: PTF. This chapter presents an overview of Periscope and PTF.

Part II: Design and Implementation

Chapter 6: PCAP. This chapter presents the design and implementation of the tuning plugin.

Chapter 7: Exhaustive Search. This chapter presents the implementation of the search algorithm.

Part III: Experiments and Results

Chapter 8: Experiments. This chapter gives details about all the experiments performed and their results.

Chapter 9: Conclusion. This part briefly concludes the work done in this thesis.

Part I. Introduction and Background

1. Introduction

In the last few years, the emergence of multicore architectures has revolutionized the landscape of high-performance computing. Recent developments in computer architecture, especially those related to multicore and manycore designs, have triggered considerable performance gains in high performance computing. The multicore shift has not only increased the per-node performance potential of computer systems but has also made great strides in curbing power and heat dissipation.

High-performance computing is and has always been performance-oriented. However, a consequence of the push towards maximum performance has been increased energy consumption, especially in datacenters and supercomputing centers. Moreover, as peak performance is rarely attained, some of this energy consumption results in little or no performance gain. Today, the running costs for energy of many systems have passed the break-even point where they far exceed the acquisition costs of the hardware platform. Besides these economic issues, more fundamental aspects related to the adequate design of efficient IT infrastructures in datacenters must be addressed in this context. Even the supply of the needed amount of energy is becoming critical for many installations. In addition, large energy consumption costs datacenters and supercomputing centers a significant amount of money and wastes natural resources.

Furthermore, a typical supercomputer consumes a large amount of electrical power, almost all of which is converted into heat, requiring cooling. Concerns about a looming energy crisis and global warming add to the unmistakable call for energy efficiency in high-performance computing. On one hand, datacenters and supercomputing centers have to bear large cooling costs; on the other hand, the large amount of heat emitted by these giant machines has a severe impact on the environment and adds to the global warming problem. Because of the sheer scale of these machines, regulating energy consumption is a very important research area in supercomputing. Recognition of the importance of power in the field of high performance computing, whether as an obstacle, an expense, or a design consideration, has never been greater or more pervasive.
So nowadays, energy consumption is a major concern with high-performance multicore systems. At the same time, performance analysis and tuning is an important step in programming multicore-based parallel architectures. Aspects for tuning application performance are as diverse as the selection of compilation options, the energy consumption, the execution efficiency of MPI/OpenMP, the execution time of GPU kernels, and so on. A lot of work has been done on the tuning of parallel applications, and it is still a very active research topic. Many analysis tools have been developed which analyze the performance properties of parallel applications.

The goal of this thesis is to tune the energy consumption of OpenMP-based parallel applications run at very large scale while minimizing the impact on run-time performance (wall-clock execution time). As the basic principle of parallel processing in OpenMP is the creation of multiple threads and the assignment of part of the application's execution to each thread, the ultimate goal of the thesis is to optimize the number of threads in such a way that optimal performance is achieved along with reduced energy consumption. It can easily be seen that there is a tradeoff between these two properties of a parallel application. Increasing the number of cores can increase performance by reducing the execution time of the application, but it consumes more energy. Furthermore, not all applications are scalable, so there is an optimal number of threads beyond which increasing the number of threads does not yield a lower execution time; in some cases the scalability gain is lower than the increase in energy consumption, possibly due to memory bandwidth limitations. Concerning the measurement of the energy consumed by an application, there are special energy measurement systems which can measure energy consumption both at the CPU level and at the node level. Using these basic concepts as the basis for the tuning process, a very efficient way of tuning parallel applications has been developed.

The framework used for this tuning task is the Periscope Tuning Framework (PTF). PTF, developed at Technical University Munich, is an automatic tuning framework for the design of auto-tuning tools targeting various types of applications and optimization opportunities. The framework is based on the performance analysis tool Periscope and offers a set of plugins to automatically tune application performance in diverse aspects. Periscope is an automatic performance analysis tool for highly parallel applications written in MPI and/or OpenMP, which has also been developed at Technical University Munich. A Periscope analysis run provides values of different properties of the application under consideration, which can then be optimized using some tuning strategy. So the properties that are the output of Periscope are tuned using the tuning plugins of PTF. After an auto-tuning run, the user is given recommendations on how the application performance can be improved.

During the course of this thesis, a plugin has been developed inside PTF which tunes the execution time and energy efficiency of a given OpenMP application. The tuning plugin uses an objective function that takes the property values from Periscope and tries to tune those values to achieve optimal application performance. The objective function used in this tuning plugin is known as the Energy Delay Product.
As the name of the objective function suggests, this function minimizes the execution time of the application while lowering its energy consumption.

Testing of the plugin has been done using the NAS Parallel Benchmarks (NPB). These benchmarks are used for the performance evaluation of highly parallel supercomputers. They consist of five parallel kernel benchmarks and three simulated application benchmarks, which together depict the computation and data movement characteristics of large-scale computational fluid dynamics applications.

The next chapter of the introductory part gives an overview of the related work. As mentioned previously, the energy consumption of supercomputers is a very active research field in high performance computing, so a lot of work has been done in this context to make these massive machines energy efficient, both in terms of hardware and in terms of the applications running on them. The subsequent chapters give background details about SuperMUC and the previous work done in the PTF project: they start with the architecture, layout, and memory scheme of SuperMUC and the energy measurement done by the LRZ library, then describe the architecture and working principle of Periscope, and conclude after explaining the details of PTF. Part two describes the thesis work itself; more precisely, it presents the design of the tuning plugin and all its subcomponents. It elaborates on the whole tuning process, starting from the details of the design and implementation of the tuning plugin and then describing the entire tuning process. Part three presents all the tests performed on the tuning plugin using the NPB benchmarks. It gives the details of each experiment, the problems encountered during the course of the experiments, and the results generated after tuning these benchmarks. Lastly, a brief summary of the entire work is given in the conclusion, which also gives an overview of some aspects of the area which can be worked on in the future.

2. Related Work

With exascale systems on the horizon, we have ushered in an era with power and energy consumption as the primary concerns for scalable computing. To address them, revolutionary methods are required, with a stronger integration among hardware features, system software, and applications. Current approaches to energy-efficient computing rely heavily on power-efficient hardware in isolation. Existing hardware has been successfully leveraged by present-day operating systems to conserve energy whenever possible, but these approaches have proven ineffective and even detrimental at large scale. While hardware must provide part of the solution, how these solutions are leveraged on large-scale platforms requires a new and flexible approach. It is particularly important that any approach taken has a system- and node-level view of these issues. To address the issue of efficient energy usage, many low-power architectural techniques have been proposed and implemented.
For example, they include putting the system in sleep mode [9]; scaling the voltage and/or frequency [16][18][29]; switching contexts to a job that consumes less power [30]; reconfiguring hardware structures [1]; gating pipeline signals, for example to control speculation [3][26]; throttling the instruction cache [9]; clock optimizations, including multiple clocks and clock gating [15]; better signal encoding [15]; low-power memory design techniques [20] like bank partitioning or divided word lines; low-power cache design techniques like cache block buffering [10], sub-banking [14][32], or filter caches [23]; and TLB optimizations [22].

Research on software-controlled dynamic power management has focused extensively on controlling voltage supply and frequency in sequential microprocessors. This research has derived analytical models for DVFS [35], compiler-driven techniques [36], and control-theoretic approaches [34]. Similar techniques have been employed for dynamic power management in system components other than processors, such as RAM [8] and disk [4]. Researchers have recently modeled and analyzed the impact of the control knobs DVFS and concurrency throttling on dynamic power management on shared-memory [7][19][24][28] and distributed-memory parallel systems [12][17][31].

All of the above-mentioned work focuses on techniques at the architectural or software level to minimize energy usage. This thesis focuses on how applications can be run on existing hardware while using energy efficiently. The main focus of this work is OpenMP-based applications: using different configurations of the number of threads used to run the application, the application's energy usage and execution time are tuned. The focus is on tuning the application as a whole as well as tuning the individual parallel regions, in such a way that a globally optimal configuration for minimum energy usage can be found.

Similar work has been presented in [33], in which the authors show that there are non-trivial interactions between compiler performance optimization strategies and energy usage. They used a Watts up? Pro power meter to measure energy usage and leveraged open-source projects to explore the energy and performance optimization space for computation-intensive kernels, with the clock frequency as the tuning parameter. This thesis shares similar objectives, using the number of threads as the tuning parameter and measuring energy usage with the Enopt library to explore the energy and performance optimization space of OpenMP-based applications.

3. SuperMUC

SuperMUC is the name of the new supercomputer at the Leibniz-Rechenzentrum (Leibniz Supercomputing Centre) in Garching near Munich (the MUC suffix is borrowed from the Munich airport code). With more than 155,000 cores and a peak performance of 3 Petaflop/s (1 Petaflop/s = 10^15 floating point operations per second), SuperMUC is one of the fastest supercomputers in the world.

3.1. System purpose and target users

SuperMUC strengthens the position of Germany's Gauss Centre for Supercomputing in Europe by delivering outstanding compute power and integrating it into the European high performance computing ecosystem. With the operation of SuperMUC, LRZ acts as a European centre for supercomputing and is a Tier-0 centre of PRACE, the Partnership for Advanced Computing in Europe.
SuperMUC is available to all European researchers to expand the frontiers of science and engineering. Since August 2011, a migration system (nicknamed SuperMIG) has enabled porting applications to the new programming environment [25].

3.2. System overview

• 155,656 processor cores in 9,400 compute nodes
• 300 TB RAM
• Infiniband FDR10 interconnect
• 4 PB of NAS-based permanent disk storage
• 10 PB of GPFS-based temporary disk storage
• 30 PB of tape archive capacity
• Powerful visualization systems
• Highest energy efficiency

3.3. Energy Efficiency

SuperMUC uses a new, revolutionary form of warm water cooling developed by IBM. Active components like processors and memory are directly cooled with water that can have an inlet temperature of up to 40 degrees Celsius. This "High Temperature Liquid Cooling", together with very innovative system software, promises to cut the energy consumption of the system. In addition, all LRZ buildings will be heated by re-using this energy [25].

3.4. System Configuration Details

LRZ's target for the architecture is a combination of a large number of moderately powerful compute nodes, with a peak performance of several hundred GFlop/s each, and a small number of fat compute nodes with a large shared memory. The network interconnect between the nodes allows for perfectly linear scaling of parallel applications up to the level of more than 10,000 tasks [25].

SuperMUC consists of 18 Thin Node Islands and one Fat Node Island, which initially also serves as the migration system SuperMIG. Each Island contains more than 8,192 cores. All compute nodes within an individual Island are connected via a fully non-blocking Infiniband network (FDR10 for the Thin Nodes, QDR for the Fat Nodes). Above the Island level, the high-speed interconnect provides a bi-directional bi-section bandwidth ratio of 4:1 (intra-Island / inter-Island) [25]. The SuperMUC system will be expanded in 2015, doubling its performance.

3.4.1. Memory Architecture

SuperMUC has 18 partitions called Islands. Each Island consists of 512 nodes. A node is a shared memory system with two processors, as shown in Figure 3.1. Each node consists of:

• Two Sandy Bridge-EP Intel Xeon E5-2680 8C processors
  – Each processor has eight cores.
  – Each core supports 2-way hyperthreading.
  – 172.8 GFlops per processor, i.e., 21.6 GFlops per core at 2.7 GHz.
• 32 GByte of memory
• An Infiniband network interface

Figure 3.1.: SuperMUC NUMA Node

3.4.2. Details on processors

• Westmere-EX for the Fat Node Island / migration system
• Sandy Bridge-EP for the Thin Node Islands (Intel Xeon E5-2680 8C) at 2.7 GHz (Turbo 3.5 GHz). The architecture of a Sandy Bridge processor is shown in Figure 3.2.

3.5. System Software

SuperMUC uses the following software components:

• Suse Linux Enterprise Server (SLES)
• System management: xCat from IBM
• Batch processing: LoadLeveler from IBM

On the user side, a wide range of compilers, tools, and commercial and free applications is provided. Many scientists also build and run their own software.

Figure 3.2.: Sandy Bridge Processor Architecture

3.6. Storage Systems

SuperMUC has a powerful I/O subsystem which helps to process the large amounts of data generated by simulations.

3.6.1. Home file systems

Permanent storage for data and programs is provided by a 16-node NAS cluster from Netapp. This primary cluster has a capacity of 2 Petabytes and has demonstrated an aggregated throughput of more than 10 GB/s using NFSv3.
Netapp's Ontap 8 "Cluster-mode" provides a single namespace for several hundred project volumes on the system. Users can access multiple snapshots of the data in their home directories [25]. Data is regularly replicated to a separate 4-node Netapp cluster with another 2 PB of storage for recovery purposes. Replication uses Snapmirror technology and runs at up to 2 GB/s in this setup. The storage hardware consists of 3,400 SATA disks of 2 TB each, protected by double-parity RAID and integrated checksums.

3.6.2. Work and Scratch areas

For highest-performance checkpoint I/O, IBM's General Parallel File System (GPFS) with 10 PB of capacity and an aggregated throughput of 200 GB/s is available. The disk storage subsystems were built by DDN [25].

3.6.3. Tape backup and archives

LRZ's tape backup and archive systems, based on TSM (Tivoli Storage Manager) from IBM, are used for archiving and backup. They have been extended to provide more than 30 Petabytes of capacity to the users of SuperMUC. Digital long-term archives help to preserve the results of scientific work on SuperMUC. User archives are also transferred to a disaster recovery site [25].

3.7. Energy Measurement - enopt

LRZ provides an energy monitoring library for measuring the energy consumed by an application. The aim of the library, known as Enopt, is to gain knowledge of the distribution of the energy consumption among the different components of the compute nodes of a supercomputer system, taking into account the characteristics of the application running on it. For this purpose, the PAPI-RAPL component and the native ibmaem-HWMON kernel module have been integrated.

The library supports Fortran and C/C++ applications parallelized with MPI, OpenMP, or hybrid computing. At the moment, it runs on Sandy Bridge processors. The library provides classes to monitor not only energy counters but also other PAPI performance counters, in order to find correlations between energy consumption and the behavior of the application with respect to these counters (such as cache misses, number of cycles, instructions per second, etc.) as well as the application runtime [5].

PAPI aims to provide the tool designer and application engineer with a consistent interface and methodology for using the performance counter hardware found in most microprocessors. PAPI enables software engineers to see, in near real time, the relationship between software performance and processor events. One of the components of the PAPI library is the PAPI-RAPL component, which makes use of the RAPL sensors available in the Sandy Bridge microarchitecture. The PAPI-RAPL component provides energy consumption measurements of CPU-level components by examining the MSR registers [5].

The specific RAPL domain counters available on Intel platforms vary across "product segments":

• Platforms targeting the client segment support the following RAPL domain hierarchy:
  – Package (PKG)
  – Two power planes (PP0 and PP1), where PP0 refers to the processor cores and PP1 refers to the uncore devices.
• Platforms targeting the server segment support:
  – Package (PKG)
  – The power plane PP0, which again refers to the processor cores; the PP1 domain is not supported.
  – DRAM

The package domain PKG, regardless of the targeted segment, is defined as the processor die.
The specific MSR interfaces defined for the RAPL domain are:

• MSR_PKG_POWER_LIMIT: allows software to set power limits for the package.
• MSR_PKG_POWER_INFO: reports the package power range information for RAPL usage.

PAPI-RAPL provides a set of PAPI native events to interact with the RAPL interface. These events are, among others (see Figure 3.1 [5]):

• PACKAGE_ENERGY:PACKAGEx: Energy used by chip package 0 or 1, respectively [5].
• DRAM_ENERGY:PACKAGEx: Energy used by the DRAM on package 0 or 1, respectively. It is unavailable for client segments [5].
• PP0_ENERGY:PACKAGEx: Energy used by all cores in package 0 or 1, respectively [5].
• PP1_ENERGY:PACKAGEx: Energy used by all uncore devices in package 0 or 1, respectively. It is unavailable for server segments [5].

Figure 3.3 [5] shows the location of the RAPL counters on the Sandy Bridge microarchitecture. Blocks with a red background belong to PACKAGE0, whereas the green ones belong to PACKAGE1. The violet block represents the DRAM device. The yellow circles represent the RAPL sensors, whereas the blue one represents the HWMON counter.

Figure 3.3.: Graphical location of the RAPL counters on the Sandy Bridge microarchitecture [5]

The available RAPL counters on SuperMUC are:

1. PP0_ENERGY:PACKAGE0, for measuring the energy consumption of the cores (PP0) belonging to PKG0;
2. PP0_ENERGY:PACKAGE1, the same but for PKG1;
3. PACKAGE_ENERGY:PACKAGE0, the processor die of PKG0;
4. PACKAGE_ENERGY:PACKAGE1, the same but for PKG1;
5. DRAM_ENERGY:PACKAGE0, the energy consumption of the DRAM belonging to PKG0;
6. DRAM_ENERGY:PACKAGE1, the same but for PKG1;
7. the DC energy counters provided by the paddle cards;
8. the AC energy counter provided by the paddle cards. Each AC counter is shared by two nodes.

The IBM paddle cards are hardware devices located on the motherboard for measuring the AC and DC power consumption. The uncore measurements can be emulated as the difference between PACKAGE_ENERGY:PACKAGEx and PP0_ENERGY:PACKAGEx [5].

The codes to use the 8 sensors shown in Figure 3.3 [5] in an application are as follows:

• ENOPT_ALL_CORES = 1 + 2
• ENOPT_ALL_UNCORES = (3 - 1) + (4 - 2)
• ENOPT_ALL_SOCKETS = 3 + 4
• ENOPT_ALL_DRAMS = 5 + 6
• ENOPT_NODE = 7
• ENOPT_PDU = 8
• ENOPT_CORES_1 = 1
• ENOPT_CORES_2 = 2
• ENOPT_UNCORES_1 = 3 - 1
• ENOPT_UNCORES_2 = 4 - 2
• ENOPT_SOCKET_1 = 3
• ENOPT_SOCKET_2 = 4
• ENOPT_DRAM_1 = 5
• ENOPT_DRAM_2 = 6

The library also allows reducing the energy consumption by changing the CPU frequency. It provides two ways to do this: one is to set the frequency of the CPU directly, and the other is to choose a governor that sets the power policy of the node. The following five governor policies are available:

• Conservative: based on two thresholds.
• Ondemand: uses one threshold.
• Performance: sets the maximal frequency of 2.7 GHz.
• Powersave: sets the minimal frequency of 1.7 GHz.
• Userspace: sets a user-defined frequency.

3.7.1. Using EnOpt

In order to use Enopt for energy measurement in an application, the following functions are available:

• enopt_init(): initializes the library at the start of the program.
• enopt_finalize(): ends the use of the library before exiting the program.
• enopt_start(): starts the counter to be measured.
• enopt_stop(): stops the counter under consideration.
• enopt_get(ENOPT_NAME, &localVariable): called immediately after enopt_stop() to retrieve the measured value.
• enopt_setGoverner(int): sets a particular governor policy.
• enopt_setFrequency(int): sets the frequency of the cores.
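To make the measurement flow concrete, the following is a minimal usage sketch combining these calls around an OpenMP region. It assumes the C-style signatures listed above; the header name enopt.h and the error-handling conventions are assumptions, not taken from the library documentation quoted here.

    #include <cstdio>
    #include <omp.h>
    #include "enopt.h"   // assumed header name for the LRZ Enopt library

    int main() {
        double nodeEnergy = 0.0;            // Joules, from the node-level DC counter

        enopt_init();                       // initialize the library at program start
        enopt_start();                      // start counting before the region of interest

        #pragma omp parallel
        {
            // ... parallel work to be measured ...
        }

        enopt_stop();                       // stop the counter under consideration
        enopt_get(ENOPT_NODE, &nodeEnergy); // read the DC node counter (sensor 7)

        std::printf("node energy: %f J\n", nodeEnergy);

        enopt_finalize();                   // end the use of the library before exiting
        return 0;
    }

Measuring the same region once per candidate thread count (set, for example, via omp_set_num_threads()) yields the energy and time values that the EDP-based tuning described in the following chapters operates on.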
4. Energy Delay Product

Until recently, performance was the single most important feature of a microprocessor. Today, however, designers have become more concerned with power dissipation, and in some cases low power is one of the key design goals. This has led to an increasing diversity in the processors available. Comparing processors across this wide spectrum is difficult, and we need a suitable metric for energy efficiency.

Power is not a good metric to compare these processors, since it is proportional to the clock frequency: by simply reducing the clock speed we can reduce the power dissipated in any of these processors. While the power decreases, the processor does not really become "better". Another possible metric is energy, measured in Joules/instruction, or its inverse SPEC/W, where SPEC is the rate of instructions per second:

SPEC/W = (instructions/second)/W = instructions/Joule [2]

While better than power, this metric also has problems. It is proportional to CV^2, so one can reduce the energy per instruction by reducing the supply voltage or decreasing the capacitance with smaller transistors. Both of these changes increase the delay of the circuits, so we would expect the lowest-energy processor to also have very low performance. Since we usually want minimum power at a given performance level, or more performance for the same power, we need to consider both quantities simultaneously. The simplest way to do so is by taking the product of energy and delay (in Joules/SPEC, or its inverse SPEC^2/W). To improve the energy-delay product of a processor we must either increase its performance or reduce its energy dissipation without adversely affecting the other quantity [2].

So power consumption, delay, throughput, and energy consumption are metrics commonly used to compare systems. Considering each of these metrics in isolation does not permit a fair comparison of systems, because of the ability of CMOS circuits to trade performance for energy. When multiple criteria need to be optimized simultaneously, it is common to optimize their weighted product. In the case of energy and time, this product may be represented as the metric M for a circuit configuration C such that [15]:

M(C) = E * D^n

Here n is a weight that represents the relative importance of the two criteria. Since energy and time can be traded off against each other, consider the infinitesimally small quantity of energy ΔE that needs to be expended to reduce the time for a computation by an infinitesimally small amount ΔD. Using Newton's binomial expansion and ignoring products and higher powers of ΔE and ΔD, we get:

M(C) = (E + ΔE) * (D - ΔD)^n ≈ E * D^n - n * E * D^(n-1) * ΔD + D^n * ΔE

If this new operating point is equivalent to the old operating point under the metric M:

E * D^n - n * E * D^(n-1) * ΔD + D^n * ΔE = E * D^n

Rearranging this equation yields:

ΔE/E = n * ΔD/D
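For reference, the derivation above can be restated compactly in LaTeX notation (a restatement of the same steps, adding nothing beyond the text):

    M(C) = E D^{n}, \qquad
    M(C') = (E + \Delta E)(D - \Delta D)^{n}
          \approx E D^{n} - n E D^{n-1} \Delta D + D^{n} \Delta E

    M(C') = M(C)
    \;\Longrightarrow\; D^{n} \Delta E = n E D^{n-1} \Delta D
    \;\Longrightarrow\; \frac{\Delta E}{E} = n \, \frac{\Delta D}{D}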
Intuitively, this means that a small reduction in time is considered n times more valuable than a corresponding reduction in energy. For example, if n = 1, a 1% reduction in time is considered worth paying a 1% increase in energy. If n = 2, then it is acceptable to pay for a 1% increase in performance with a 2% increase in energy consumption. In general, when n = 1, energy and delay are equally important; when n > 1, performance is valued more than energy; and when 0 < n < 1, energy savings are considered more important than performance. The case of n = 0 optimizes just for energy, and n = -1 optimizes for power. Other negative values of n are not useful for optimization, since E * D^n then changes in opposite directions for improvements in energy and delay.

As described earlier, the metric EDP is commonly used to compare processors with different underlying technologies. In this thesis, we are not going to compare architectures; rather, we focus on tuning OpenMP applications run on SuperMUC with different numbers of threads. We can characterize an application run with four metrics: performance (measured as total execution time, also called delay), average power consumption, total energy consumption, and the product of energy and execution time (the energy-delay product). We strive for a low energy-delay product, since it implies a good balance between high speed and low energy consumption.

The EDP is defined as the amount of energy consumed during the execution of a program multiplied by the execution time of the program. This EDP metric, and more generally E * D^n, where n is an integer, is commonly used in circuit design. However, the E * D^n product emphasizes performance over energy, particularly as n increases. A metric of increasing interest is the amount of computational work completed per Joule. The question here becomes: what constitutes work? In some literature, work completed per Joule is defined as operations per Joule. However, our set of applications does not use that unit of work, nor do the applications share a common unit of work other than instructions, and those are debatable since different instructions may be chosen by the compilers for each platform. As such, we treat a complete run of a given application as a single unit of work and report the total energy consumed per application run.

4.1. Auto-tuning Feedback Metric

The most common feedback metric used by auto-tuners is application execution time, which can also be expressed as runtime delay with respect to some baseline. For energy auto-tuning, however, we need a feedback metric (objective function) that combines power usage with the execution time of a given program. There has been much debate in the literature about the appropriateness of different combinations of power and performance for investigating energy-reduction techniques on today's architectures. All of them hinge on how much a delay in execution time should be penalized in return for lower energy. We can use four different feedback metrics: E (total energy), ED (energy * delay), ED^2 (energy * delay * delay), and T (execution time). Total energy (E) is derived by multiplying the average power usage by the application execution time. E does not penalize execution time delay at all; T penalizes only execution time delay, with no credit for saving energy. Between these extremes, the ED and ED^2 metrics put more emphasis on the total application execution time than the total energy metric does. Which metric is appropriate depends on the overall goal of the tuning exercise.

4.2. Energy Measurement

For the purposes of this thesis, energy measurements are done using the Enopt library provided by LRZ. All results are normalized against the base case of running the application with one thread.
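To illustrate how such feedback metrics can be computed from raw measurements, here is a small sketch. The normalization against the one-thread base case follows the description above, while the type and function names are illustrative assumptions rather than code from the plugin:

    #include <cmath>

    // Energy/time measurements for one scenario (one thread-count configuration).
    struct Measurement {
        double energyJ;  // total energy in Joules
        double timeS;    // execution time (delay) in seconds
    };

    // Weighted energy-delay metric E * D^n: n = 0 gives E, n = 1 EDP, n = 2 ED^2P.
    double edn(const Measurement& m, int n) {
        return m.energyJ * std::pow(m.timeS, n);
    }

    // Normalize a scenario's metric against the one-thread baseline,
    // as the objective functions described in this thesis do.
    double normalizedEdn(const Measurement& m, const Measurement& base, int n) {
        return edn(m, n) / edn(base, n);
    }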
5. Periscope Tuning Framework

5.1. Periscope

Periscope is an automatic performance analysis tool for large-scale parallel systems. It consists of a frontend and a hierarchy of communication and analysis agents. Each of the analysis agents, i.e., the nodes of the agent hierarchy, searches autonomously for inefficiencies in a subset of the application processes.

Before the analysis can be conducted, the application has to be instrumented. Source-level instrumentation is used to selectively instrument code regions, i.e., functions, loops, vector statements, OpenMP blocks, I/O statements, and call sites. The region types to be instrumented are determined via command line switches of the Fortran 95 instrumenter [13]. The application processes are linked with a monitoring system that provides the Monitoring Request Interface (MRI). The agents attach to the monitor via sockets. The MRI allows the agent to configure the measurements; to start, halt, and resume the execution; and to retrieve the performance data.

Figure 5.1.: Periscope Architecture [13]

The application and the agent network are started through the frontend process. It analyzes the set of processors available, determines the mapping of application and analysis agent processes, and then starts the application and the agent hierarchy. After startup, a command is propagated down to the analysis agents to start the search. The search is performed in one or more experiments. Most applications in HPC have an iterative behavior, e.g., a loop in which each iteration performs the next time step of the simulated time. If the application has such an iterative phase, a single execution of the phase is an experiment. If such a phase is missing or not marked by the programmer, the whole program is executed for an experiment [13].

The search is performed according to a search strategy selected when the frontend is started. The strategy defines an initial set of hypotheses, i.e., properties that are to be checked in the first experiment, as well as the refinement from found properties to a new set of hypotheses. The agents start from the set of hypotheses, request the necessary information for proving the hypotheses via the MRI, release the application for a single execution of the repetitive program phase, retrieve the information from the monitor after the processes have been suspended again, and evaluate which hypotheses hold. If necessary, the found hypotheses may be refined and the next execution-evaluation cycle performed [13].

The strategies analyzing single-node performance are multi-step strategies; they typically go through multiple refinement steps. The strategy used for analyzing the MPI behavior is a single-step strategy. At the end of the local search, the detected performance properties are reported back via the agent hierarchy to the frontend. The communication agents combine similar properties found in their child agents and forward only the combined properties.

5.2. Periscope Tuning Framework (PTF)

The AutoTune project focuses on extending Periscope to the Periscope Tuning Framework, combining performance and energy efficiency analysis with automatic tuning plugins. The Periscope Tuning Framework (PTF) is an extension of the automatic online performance analysis tool Periscope. PTF identifies tuning alternatives based on codified expert knowledge and evaluates the alternatives within the same run of the application (online), dramatically reducing the overall search time for a tuned code version. The application is executed under the control of the framework, either in interactive or batch mode.
During the application's execution, the analysis is performed; the performance and energy properties found are forwarded to tuning plugins that determine code alternatives and evaluate the different tuned versions. At the end of the application run, detailed recommendations are given to the code developer on how to improve the code with respect to performance and energy consumption [6].

5.2.1. PTF main components

Figure 5.2 [6] outlines the main PTF components. The Eclipse-based graphical user interface allows the user to investigate the results of a PTF tuning run. It visualizes the performance and energy properties as well as the tuning recommendations. The PSC Frontend controls the entire execution. The main new components that extend the Periscope Frontend are the tuning plugins, the search algorithms, and the Scenario Execution Engine. The agent hierarchy is composed of a master agent, several high-level agents, and the analysis agents. The analysis agents provide a new tuning strategy that configures the MRI Monitor linked to the application with tuning actions and runtime measurements for the evaluation of code alternatives.

Figure 5.2.: PTF Main Components [6]

5.2.2. PTF Repository Structure

The PTF repository covers the code for the components introduced in the previous section, except the graphical user interface. Figure 5.3 [6] outlines its directory structure. It consists of:

• frontend: the files of the frontend.
• aagent: the files of the analysis agent, including the performance properties for different programming models and target systems as well as the analysis strategies.
• hagent: the files for the master and the high-level agents. In fact, the master agent is a high-level agent from a code point of view.
• mrimonitor: the files of the PTF monitor that implements the Monitoring Request Interface (MRI).
• util: files common to multiple components.
• autotune: the files for the tuning plugins and search algorithms. It has three major subdirectories:
  – datamodel: files implementing base concepts for tuning in PTF, e.g., tuning points.
  – plugins: code specific to a tuning plugin goes into a tuning-plugin-specific directory.
  – searchalgorithms: files implementing generic search algorithms for the tuning plugins.

Figure 5.3.: PTF Repository [6]

5.2.3. PTF Plugins

Periscope has been extended by a number of tuning plugins that fall into two categories: online and semi-online plugins. An online tuning plugin performs transformations to the application and/or the execution environment without requiring a restart of the application; a semi-online tuning plugin is based on a restart of the application, but without restarting the agent hierarchy [27].

Figure 5.4 [27] illustrates the control flow in PTF. The tuning process starts with a preprocessing of the application source files. This preprocessing performs instrumentation and static analysis. Periscope is based on source-level instrumentation for C/C++ and Fortran. The instrumenter also generates a SIR file (Standard Intermediate Representation) that includes static information such as the instrumented code regions and their nesting. When the preprocessing is finished, the tuning can be started via the Periscope frontend, either interactively or in a batch job. As in Periscope, the application is started by the frontend before the agent hierarchy is created. Periscope uses an analysis strategy, e.g.,
for MPI, OpenMP, and single-core analysis, to guide the search for performance properties. This overall control strategy now becomes part of a higher-level tuning strategy. The tuning strategy controls the sequence of analysis and tuning steps. Typically, the analysis determines the application properties to guide the selection of a tuning plugin as well as the tuning actions performed by the plugin. After the plugin finishes, the tuning strategy might restart the same or another analysis strategy to continue with further tuning [27].

Figure 5.4.: Tuning Control Flow [27]

Once the tuning process is finished, PTF generates a tuning report documenting the remaining properties as well as the recommended tuning actions. These tuning actions can then be integrated into the application so that subsequent production runs will be more efficient.

Tuning Plugin Design

Given the number of programming models, parallel patterns, and hardware targets to be supported by PTF, it provides a sufficiently generic tuning plugin design. This section describes some of the terminology used in the plugin design. The tuning plugins try to improve the application execution by influencing certain tuning points. Tuning points TP = {v1, v2, ...} are the features for influencing the execution of a region. Each tuning point has a name and an enumeration type or an interval of integer values with a stride. For example, a tuning point is the clock frequency of the CPU, which determines the overall energy consumption [27].

All tuning points of a tuning plugin define a multidimensional tuning space. The tuning space of a tuning plugin P is the cross product of the individual tuning points, i.e., TS_P = TP1 × TP2 × ... × TPk [27]. For a program region, the tuning plugin will select a set of variants that may lead to a potential improvement and that need to be evaluated by experiments. The variant space VS_r of a program region r is a subset of the overall tuning space, i.e., VS_r ⊆ TS_P. A variant of a code region r is a concrete vector of values for the region's tuning points, v_r = (v1, ..., vk) [27]. The variant space is explored by a search strategy to optimize certain objectives. An objective is a function obj: REG_appl × TS_P → R, where REG_appl is the set of all regions in the application. One or more objectives are to be optimized by the tuning plugin for a given program region over the region's variant space.

The tuning plugin creates a sequence of tuning scenarios that are executed by Periscope to determine the values of one or more objectives. A tuning scenario is a tuple sc_r = (r, v_r, {obj1, ..., objn}), where r is the program region, v_r ∈ VS_r is a variant from the region's variant space, and obj1, ..., objn are the objectives to be evaluated [27]. During the execution of a tuning scenario, tuning actions are executed to select the individual values of the tuning points. A tuning action TA_i is executed for each tuning point TP_i with 1 ≤ i ≤ k during the execution of a tuning scenario. It enforces the value v_i for tuning point i given by the variant v_r = (v1, ..., vk) [27].
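To make these definitions concrete, here is a minimal sketch of how tuning points, variants, and scenarios could be represented in C++. The type and field names are illustrative assumptions and do not reproduce PTF's actual data model:

    #include <map>
    #include <string>
    #include <vector>

    // A tuning point: a named integer interval with a stride.
    struct TuningPoint {
        std::string name;          // e.g. "NUMTHREADS"
        int min, max, stride;      // e.g. 1..16 with stride 1
    };

    // A variant: one concrete value per tuning point of a region, v_r = (v1, ..., vk).
    using Variant = std::map<std::string, int>;

    // A tuning scenario sc_r = (r, v_r, {obj1, ..., objn}).
    struct Scenario {
        std::string region;                   // e.g. an OpenMP parallel region
        Variant variant;
        std::vector<std::string> objectives;  // e.g. {"EDP", "ED2P", "ED3P"}
    };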
Tuning Plugin Control Flow

The PTF frontend controls the overall tuning process. For auto-tuning, the frontend enforces a predefined sequence of operations that are implemented by the tuning plugins. This predefined sequence has to fulfill the requirements of all tuning plugins developed in AutoTune and is therefore quite complex; this section presents a simplified version, shown in Figure 5.5 [27]. All steps are involved in creating and processing the scenarios that need to be evaluated by experiments. Scenarios are stored in pools that are accessed and shared by the plugins as well as the frontend:

• Created Scenario Pool (CSP): scenarios that were created by a search algorithm.
• Prepared Scenario Pool (PSP): scenarios that are already prepared for execution.
• Experiment Scenario Pool (ESP): scenarios that are selected for the next experiment.
• Finished Scenario Pool (FSP): scenarios that were executed.

Figure 5.5 [27] presents the sequence of steps followed by a tuning plugin:

1. Initialization: First, the plugin is initialized and the tuning points are created.

2. Scenario Creation: From the defined tuning space, the plugin creates the scenarios and inserts them into the CSP. The plugin first selects the variant space to be explored. It then creates the individual scenarios, which combine a region, a variant, and the objectives, either via a generic search algorithm, e.g., exhaustive search, or via its own search algorithm.

3. Scenario Preparation: Scenarios are selected from the CSP, prepared, and moved into the PSP. The preparation of scenarios typically covers tuning actions that cannot be executed at runtime, e.g., recompilation with a certain set of compilation flags or generation of special source code for the scenario's variant. Only the plugin can decide whether certain scenarios can be prepared at the same time. For example, two scenarios requesting different compiler flag combinations for the same file cannot be prepared at the same time. If no preparation is required, the plugin simply copies all the created scenarios to the PSP.

4. Define Experiment: A subset of the prepared scenarios is then selected for the next experiment and moved into the ESP. When the plugin selects the scenarios for the next experiment, it has to take constraints into account. For example, different scenarios for the same program region cannot be executed in the same experiment unless they can be assigned, for example, to different processes of an MPI application. The assignment of scenarios to processes or threads is decided by the plugin in this step.

5. Experiment Execution: The Scenario Execution Engine (SEE) is responsible for executing the experiment. It first checks with the plugin whether a restart of the application is necessary to implement the tuning actions. For example, the scenarios generated by the MPI tuning plugin explore certain parameters of the MPI runtime environment; these can only be set via environment variables before launching the application. After the potential restart of the application, the SEE runs the experiment by releasing the application for a phase, i.e., for the execution of the phase region. If multiple phases are required to gather all the measurements for the objectives, the SEE automatically takes care of that. It will even restart the application if it terminates before all the measurements have finished. At the end of this step, the executed scenarios are moved into the FSP and the objectives are returned to the plugin.

6. Process Results: The plugin accesses the objectives, which are implemented as standard Periscope properties. Each objective specifies its scenario. The objectives' values are then used to select the best scenario and return the tuning recommendation.

Figure 5.5.: Tuning plugin control and data flow [27]
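The pool organization above amounts to a pipeline through which each scenario moves in order, CSP -> PSP -> ESP -> FSP. A minimal sketch of that lifecycle follows; the enum and function are illustrative assumptions, not PTF code:

    // Pools a scenario passes through during a tuning step.
    enum class Pool { CSP, PSP, ESP, FSP };

    // Advance a scenario to its next pool:
    // created -> prepared -> in experiment -> finished.
    Pool advance(Pool p) {
        switch (p) {
            case Pool::CSP: return Pool::PSP;  // step 3: prepare (or just move)
            case Pool::PSP: return Pool::ESP;  // step 4: selected for the next experiment
            case Pool::ESP: return Pool::FSP;  // step 5: executed by the SEE
            default:        return Pool::FSP;  // finished scenarios stay in the FSP
        }
    }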
The objectivesÕ value is then used to select the best scenario and return the tuning recommendation. Figure 5.5.: Tuning plugin control and data flow [27] 42 Part II. Design and Implementation 43 6. PCAP Plugin This chapter gives a detailed description of the implementation of energy tuning plugin developed during the course of this thesis. The plugin is named as PCAP abbreviated from Power Capping. 6.1. Tuning Plugin Interface (TPI) First all the major methods of the Tuning Plugin Interface (TPI) are described. These methods must be implemented by all plugins and their conformance is checked when the plugin is loaded. 6.1.1. Initialize After the frontend has initialized itself and is ready to start the tuning process, it loads the plugin specified by the user. Before the plugin can be utilized, it needs to be instantiated and initialized. The frontend instantiates the plugin and then invokes this method to do so. In this method, the plugin needs to set up its internal data structure for tuning points. 45 6. PCAP Plugin 6.1.2. Start Tuning Step In this method, the plugin needs to set up its internal data structures for the tuning space, search algorithms to be used, and the objectives. 6.1.3. Create Scenarios After the plugin has initialized its data structures and search algorithm, the next step is to create scenarios. The plugin generates the scenarios using a search algorithm and inserts them into the CSP, so that the Frontend has access to them. The search algorithm might go through multiple rounds of scenario generation. The selection of new scenarios that are generated in the next step might depend on the objective values for the scenarios in the previous step. Before the frontend calls the final method to process the results, it checks if the search algorithm needs to generate additional scenarios. If so, the frontend triggers an additional iteration of creation, preparation, and execution of scenarios. 6.1.4. Prepare Scenarios Some scenarios require preparation before experiments can be executed. If a set of scenarios needs preparation, it should be done in this method. However, if no preparation is necessary, the Prepare Scenario method can simply move the scenarios from the CSP to the PSP. After the execution of an experiment, the frontend checks if the CSP is empty. If there are still scenarios, the frontend calls the Prepare Scenarios method again. 46 6.1. Tuning Plugin Interface (TPI) 6.1.5. Define Experiment Once generated and prepared, the scenarios need to be assembled into an experiment. An experiment will go through at least one execution of the phase region of the application. There are two ways to execute multiple scenarios in a single experiment. Either they can be assigned to a single process because they affect different regions or they can be assigned to different processes. Only the plugin can decide whether this is possible or not. Therefore the frontend calls the Define Experiment method to decide which scenarios are executed in the next experiment and to assign the executing process to the scenarios. Scenarios selected by the plugin for the next experiment are moved from the PSP to the EPS. After the plugin defined the experiment, the frontend transfers the control to the scenario execution engine, which forwards the scenarios to the analysis agents and triggers the experiment. At the end of the experiment the objectives of the scenarios are returned to the plugin. 
6.2. PCAP Plugin

The plugin tunes the energy consumption and the execution time of OpenMP-based applications at runtime by changing the number of threads used for the execution of the parallel regions in the application. The PCAP plugin has been designed to perform tuning in two steps. In the first step, the plugin performs a speedup analysis of the application by executing the application and measuring the scalability of the parallel regions. The result of this first step is used to shrink the search space used in the second step. In the second step, the tuning action is applied to find the best energy delay product for the application. The energy delay product captures the tradeoff between the energy consumption and the execution time of the application. Depending on the requirements of the user, more weight can be given to energy consumption or to execution time.

6.2.1. Tuning Objective

The main tuning objective of this plugin is to reduce the energy delay product of OpenMP-based applications. To optimize the EDP of the application, the tuning action is applied to the OpenMP parallel regions. The execution time property and the energy consumption property of each region are returned as the result of a tuning action. Both properties are used to calculate the EDP, and an optimal value is suggested to the user.

6.2.2. Tuning points

For the PCAP plugin we define one tuning point:

• Number of Threads

This is simply an integer value specifying the number of threads used for executing the OpenMP parallel regions. On SuperMUC, as described in Chapter 3, each compute node consists of two sockets and each socket has 8 cores. So, to limit the application execution to a single node, a range of 1-16 threads has been used for all measurements. The following subsections describe the implementation of the PCAP plugin in terms of its Tuning Plugin Interface (TPI) methods.

6.2.3. Initialize

As mentioned in Section 6.1.1, the plugin must create the following data structures at initialization time: tuning points, search algorithms, and objectives. In this plugin, a single tuning point named "NUMTHREADS" is created to specify the number of threads used to execute the parallel regions.
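Using the illustrative types sketched in Chapter 5, the initialization described here might look roughly as follows. This is an assumption-laden sketch, not the plugin's actual code, which uses PTF's own tuning point API:

    // One integer tuning point covering the 1-16 threads of a single
    // SuperMUC compute node (2 sockets x 8 cores); 1 thread is also the
    // baseline used for normalizing the objective functions.
    TuningPoint makeNumThreadsTuningPoint() {
        TuningPoint tp;
        tp.name   = "NUMTHREADS";
        tp.min    = 1;
        tp.max    = 16;
        tp.stride = 1;
        return tp;
    }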
In the following subsections, we describe the implementation of the PCAP plugin in terms of its Tuning Plugin Interface (TPI) methods.

6.2.3. Initialize

As mentioned in Section 6.1.1, the plugin must create the following data structures at initialization time: tuning points, search algorithms, and objectives. In this plugin, a single tuning point named "NUMTHREADS" is created, specifying the number of threads used to execute the parallel regions.

6.2.4. Start Tuning Step

This method is subdivided into two steps corresponding to the two-step tuning process of the plugin:

• StartTuningStep1SpeedupAnalysis
• StartTuningStep2EnergyTuning

In this method, the search algorithm is selected, which in this case is the exhaustive search algorithm. Search spaces are created for each parallel region, a variant space is added to each search space, and the search spaces are then added to the search algorithm. The variant space is simply a range of integers for NUMTHREADS.

6.2.5. Create Scenarios

This method is subdivided into two steps corresponding to the two-step tuning process of the plugin:

• CreateScenarios1SpeedupAnalysis
• CreateScenarios2EnergyTuning

Scenarios are created by the exhaustive search algorithm. For the available search spaces, a cross product is created recursively to generate the scenarios. Details of the implementation of scenario creation are given in Chapter 7.

6.2.6. Prepare Scenarios

This method is subdivided into two steps corresponding to the two-step tuning process of the plugin:

• PrepareScenarios1SpeedupAnalysis
• PrepareScenarios2EnergyTuning

Since no recompilation is needed, the created scenarios are moved directly to the PSP.

6.2.7. Define Experiment

This method is subdivided into two steps corresponding to the two-step tuning process of the plugin:

• DefineExperiment1SpeedupAnalysis
• DefineExperiment2EnergyTuning

In this case, every single scenario defines an execution of the region with a different value of the manipulated variable. The experiments are executed to request the energy and execution time properties. When the agent network has finished the experiment, the objectives are propagated to the frontend. The frontend then places them in a properties pool that is available to both the plugin and the search algorithm. The frontend also moves all the scenarios from the ESP to the Finished Scenario Pool (FSP).

6.2.8. Get Restart Info

This method returns false to indicate that no restart is required.

6.2.9. Process Results

The optimal energy delay product combinations for the phase region, i.e., EDP, ED²P and ED³P, are selected and provided to the user as a recommendation. In addition, a detailed summary of all the results for the created scenarios is provided in the form of a table for evaluation by the user. Three objective functions for calculating EDP, ED²P and ED³P are implemented and used to present the results to the user.

6.2.10. Objective Functions

The objective functions are implemented to calculate the EDP, ED²P and ED³P for each scenario. In the case of PCAP, the scenarios are run on the phase region with different numbers of threads. The first scenario is always run with one thread, so the objective functions are normalized with respect to the first scenario, i.e., the first scenario with one thread is taken as the base case.

7. Exhaustive Search

The purpose of a search algorithm is to create combinations of the search spaces and the variant spaces provided for a given application. Each combination is called a "Scenario" in PTF. A Scenario is a list of Tuning Specifications. Each Tuning Specification is a tuple containing a variant value and a variant context, where a variant context can be a single region in the application, a list of regions, or the entire application. In the case of exhaustive search, all possible scenarios are created and explored. The implementation of the exhaustive search algorithm is divided into the following methods:

7.1. Add Search Space

This method simply creates a data structure to store the search spaces created by the plugin. Each search space is a tuple containing one or more regions and a variant space, i.e.,

Search Space (SS) = {Region(s), Variant Space}
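As a minimal sketch, this tuple can be represented as follows; the type and member names are illustrative, not the actual PTF data structures.

    #include <string>
    #include <vector>

    // Illustrative only: a search space pairs one or more regions with a
    // variant space; here the variant space is the integer range that the
    // NUMTHREADS tuning point may take.
    struct VariantSpace {
        int from;  // smallest NUMTHREADS value, e.g. 1
        int to;    // largest NUMTHREADS value, e.g. 16
    };

    struct SearchSpace {
        std::vector<std::string> regions;  // the region(s) of the tuple
        VariantSpace variants;             // the variant space of the tuple
    };

    // addSearchSpace then only appends to a container owned by the
    // search algorithm:
    std::vector<SearchSpace> searchSpaces;

    void addSearchSpace(const SearchSpace& ss) {
        searchSpaces.push_back(ss);
    }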
7.2. Create Scenarios

The purpose of this method is to create all possible combinations of the search spaces and their variants, i.e., the cross product of all the search spaces and their variant spaces. To do so, we recurse at two levels: the first level iterates over the search spaces, and the second level iterates over the variant space of the respective search space, so a nested recursive algorithm is used to create the cross product. Within one invocation of the first level, each value chosen at the second level creates one Tuning Specification; once the recursion has descended through all search spaces, the Tuning Specifications created along the way are gathered to form one Scenario. The recursive calls then return one by one, creating a total of (Number of Variants)^(Number of Search Spaces) Scenarios. The recursive algorithm is graphically depicted in Figure 7.1, and a code sketch of this recursion is given at the end of this chapter.

Figure 7.1.: Graphical Representation of the Recursive Algorithm for Creating Scenarios

7.3. Iterate Search Spaces

This method implements the first level of recursion in the recursive algorithm of Figure 7.1. The iteration is carried out over all the search spaces recursively.

7.4. Iterate Tuning Points

This method implements the second level of recursion in the recursive algorithm of Figure 7.1. For each search space, the iteration runs over the entire variant space recursively.

7.5. Generate Scenarios

The list of Tuning Specifications created after each complete descent of the recursion is passed to this method. The method clones the list, to avoid pointer references to deleted objects, and creates a local copy of the Tuning Specifications. This local copy is then used to create a Scenario. The created Scenarios are added to the Created Scenario Pool (CSP).

7.6. Search Finished

Once all the scenarios have been executed and the requested properties have been added to the property pool by the frontend, this method is invoked by the search engine. It calls the requested objective function, obtains the results of the objective function, finds the optimal value, and returns it to the plugin for display to the user.
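The following self-contained C++ sketch illustrates the nested recursion of Sections 7.2-7.5; the type names are illustrative, not the PTF implementation. One search space is created per parallel region, and its variant space is the NUMTHREADS range.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct VariantSpace { int from, to; };          // NUMTHREADS range
    struct SearchSpace  { VariantSpace variants; }; // one per parallel region
    struct TuningSpec   { std::size_t region; int numThreads; };
    using Scenario = std::vector<TuningSpec>;

    static std::vector<Scenario> csp;  // Created Scenario Pool

    // First level (7.3): recurse over the search spaces. Second level (7.4):
    // loop over the variant space of the current search space. Once every
    // search space is bound, clone the partial list into a Scenario (7.5).
    void iterateSearchSpaces(const std::vector<SearchSpace>& spaces,
                             std::size_t level, Scenario& partial) {
        if (level == spaces.size()) {
            csp.push_back(partial);  // deep copy: generateScenarios
            return;
        }
        for (int t = spaces[level].variants.from;
             t <= spaces[level].variants.to; ++t) {
            partial.push_back(TuningSpec{level, t});
            iterateSearchSpaces(spaces, level + 1, partial);
            partial.pop_back();
        }
    }

    int main() {
        std::vector<SearchSpace> spaces(3, SearchSpace{{1, 16}});
        Scenario partial;
        iterateSearchSpaces(spaces, 0, partial);
        std::printf("%zu scenarios\n", csp.size());  // 16^3 = 4096
        return 0;
    }

The scenario count is thus (number of variants)^(number of search spaces), e.g. 16³ = 4096 for three regions with 1-16 threads each, which matches the third experiment set in Chapter 8.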
Part III. Experiments and Results

8. Experimental Analysis

In this chapter, we discuss the evaluation of our energy tuning plugin (PCAP) on SuperMUC, which has been done through different experiments. We begin with a brief description of the experimental setup. For the first set of experiments, we analyze the scalability of the NAS Parallel Benchmarks for the user region on SuperMUC, pointing out the configurations which allow for optimal performance and power consumption. Then, we evaluate the speedup analysis step of the plugin by first estimating the static and dynamic power per core of a SuperMUC compute node and then, using the speedup formula, shrinking the search space in which the energy consumption of multiple regions is evaluated. Finally, we discuss the set of experiments in which the NPB benchmarks are evaluated for multiple parallel regions inside the user region, to find the global optimum for the entire user region in terms of both performance and energy benefits.

8.1. NAS Parallel Benchmarks (NPB)

The OpenMP version of NPB 3.3 has been used for the evaluation of PCAP. The NPBs were derived from CFD codes. They were designed to compare the performance of parallel computers and are widely recognized as a standard indicator of computer performance. NPB consists of five kernels and three simulated CFD applications derived from important classes of aerophysics applications. The five kernels mimic the computational core of five numerical methods used by CFD applications. The benchmarks are specified only algorithmically ("pencil and paper" specifications) and referred to as NPB-1. Details of the NPB-1 suite can be found in [11], but for completeness of the discussion we outline the five benchmarks that have been used in this thesis to evaluate PCAP.

• BT is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite difference solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5×5 blocks and are solved sequentially along each dimension [21].

• SP is a simulated CFD application that has a similar structure to BT. The finite difference solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system has scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension [21].

• LU is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system, resulting from a finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block lower and upper triangular systems [21].

• CG uses a Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries [21].

• EP is an Embarrassingly Parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish a reference point for the peak performance of a given platform [21].

8.2. Results on a SuperMUC Compute Node (16 processing cores)

We begin by profiling the scalability of all applications in the benchmark suite and determining its effect on performance and energy consumption. To do so, the applications are executed with a variable concurrency ranging from one to sixteen threads and bound to processor/core combinations as shown in Figure 8.1. The notation (X, Y) denotes non-adaptive (static) execution with X × Y threads bound to X processors and Y cores per processor. The cores marked in green are the ones used by the threads. The bindings shown are not an exhaustive collection of the bindings possible on this machine.

Figure 8.1.: Thread to Processor/Core Bindings.

The execution time and the energy are the two properties requested for the user region by the plugin. These properties are then used to calculate the objective functions. The execution time is simply the wall clock time required for the execution of the user region. The energy is measured using the Enopt library provided by LRZ on SuperMUC. We also evaluate the objective functions for all the benchmarks, showing that all three objective functions (EDP, ED²P and ED³P) are a very good depiction of the optimal configuration for the given benchmark.
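For reference, the three objective functions can be written as follows, assuming the standard energy-delay-product definitions and the normalization to the single-thread base case described in Section 6.2.10; this reconstruction is our reading of the text, not a formula quoted from the implementation.

    \[
      \mathrm{ED}^{n}\mathrm{P}(s) \;=\; \frac{E_s}{E_1}\,
        \Bigl(\frac{T_s}{T_1}\Bigr)^{n}, \qquad n \in \{1, 2, 3\},
    \]

where E_s and T_s are the energy and execution time measured for scenario s, and E_1 and T_1 belong to the single-thread base scenario. Larger exponents weight the execution time more heavily, which is why ED²P and ED³P favor performance-centric configurations.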
8.3. First Experiment Set

In the first set of experiments, the NPB benchmarks are run on SuperMUC in Periscope with the PCAP plugin applied for tuning. The benchmarks are run with a range of 1-16 threads on a compute node of SuperMUC. The execution time and energy properties of the user region are measured. Using these properties, the objective functions are calculated and the results are returned, suggesting the best optimum. All the benchmarks are run for the problem sizes W, S, A, B and C. These experiments are used to analyze the scalability pattern of the benchmarks.

The performance analysis shows that there are three categories of benchmarks. First are those applications that achieve a reasonable speedup through the utilization of additional cores (BT, EP, LU and CG). BT, EP, LU and CG show that the machine allows linear speedup with appropriately written code, with speedups of 2.1x, 2.5x, 2.0x and 2.12x respectively. They also reduce their energy consumption proportionally, indicating an optimal use of the cores. Second are the applications that neither substantially gain nor lose performance with higher concurrency (LU-HP and SP). Lastly, there are the applications that incur a non-negligible performance loss when using more cores. MG falls under this category, losing as much as 1.16x in performance compared to single-threaded execution. This shows that this benchmark is essentially memory bound and loses performance due to a high degree of contention for this resource with increased concurrency.

The most energy-efficient configuration coincides with the most performance-efficient configuration for 4 out of the 7 benchmarks (BT, EP, LU and CG). For 2 benchmarks (LU-HP and SP), the user can use fewer than the performance-optimal number of cores to achieve substantial energy savings at a marginal performance loss.

Figures 8.2, 8.3, 8.4, 8.5, 8.6 and 8.7 show the results of the first set of experiments for BT, EP, LU, CG, LU-HP and SP respectively. Each of these figures has two graphs. In each figure, the graph on the right side shows the execution time (bars) and energy consumption (lines) of the respective benchmark, and the graph on the left side shows the three objective function graphs for the corresponding benchmark. In the right-side graph, the configurations with the best performance and energy for each benchmark are marked with a blue gradient and a large diamond respectively. In the left-side graph, the best optima for the three objective functions are marked with a large diamond. In each of these figures, the subfigures (a), (b), (c), (d) and (e) show the results for the problem sizes A, B, C, W and S respectively.

Figure 8.2.: BT results for the entire user region for the problem sizes A, B, C, W and S

Figure 8.3.: EP results for the entire user region for the problem sizes A, B, C, W and S
Figure 8.4.: LU results for the entire user region for the problem sizes A, B, C, W and S

Figure 8.5.: CG results for the entire user region for the problem sizes A, B, C, W and S

Figure 8.6.: LU-HP results for the entire user region for the problem sizes A, B, C, W and S

Figure 8.7.: SP results for the entire user region for the problem sizes A, B, C, W and S

The results in Figures 8.2 - 8.7 show that the scalability of the application is a good depiction of its energy usage. When the scalability of the application is good, i.e., in the case of BT, EP, LU and CG, the optima for the execution time and the energy usage coincide. This also corresponds to the optimum of the three objective functions. For LU and CG, it can easily be seen that the scalability is linear for the larger problem sizes (A, B, C and W) but really poor for the problem size of class S. So the results for problem size S in the case of LU and CG show that, as we keep adding more cores, the energy and the execution time increase because of the poor scalability. On the other hand, when the application does not have good scalability, i.e., in the case of LU-HP and SP, the optima for the energy and the execution time do not coincide. The optima of the objective functions show that we can run these applications with a smaller number of threads and obtain a good energy saving while compromising the execution time only marginally. When the user is more interested in performance, ED²P or ED³P can be used instead of EDP to determine the optimum, because both of these objective functions give more weight to the execution time. This is shown in the results of the benchmarks LU-HP and SP in Figure 8.6 and Figure 8.7. The markers in the graphs clearly show that the energy-delay-product-based objective functions are a good measure for the energy tuning framework, as they point to the right scenario regardless of whether the application has good scalability properties or not.

This first set of experiments has been done with all the iterations in one scenario: the application restarts after each scenario and performs all the iterations within each scenario. Since the application has to be restarted every time, running the tests takes a lot of time. To avoid this, we have done the next set of experiments.

8.4. Second Experiment Set

The basis for this set of experiments is the hypothesis that initializing the data structures with a smaller number of threads and then running the scenarios with more threads may have an impact on the execution time because of the memory distribution. To evaluate this, two test cases have been examined; the sketch below illustrates the effect under test.
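As a hedged illustration of the hypothesis: with first-touch page placement, the threads that initialize an array determine where its pages are allocated, so initializing with one thread count and computing with another could, in principle, change the memory distribution. The array size and thread counts below are illustrative.

    #include <cstddef>
    #include <cstdio>
    #include <omp.h>

    int main() {
        const std::size_t n = 1u << 26;  // illustrative array size
        double* a = new double[n];       // uninitialized: first touch below

        // Initialization phase, e.g. with 4 threads: with first-touch page
        // placement, the thread that writes a page first determines the
        // socket on which that page is allocated.
        #pragma omp parallel for num_threads(4)
        for (std::size_t i = 0; i < n; ++i) a[i] = 1.0;

        // Compute phase with a different thread count (16): under the
        // hypothesis, the page placement produced by the initialization
        // phase could affect the execution time measured here.
        double sum = 0.0;
        #pragma omp parallel for num_threads(16) reduction(+:sum)
        for (std::size_t i = 0; i < n; ++i) sum += a[i];

        std::printf("sum = %.0f\n", sum);
        delete[] a;
        return 0;
    }

As the measurements below show, on SuperMUC this effect turned out to be insignificant for the tested benchmark.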
In the first test case, experiments have been performed on the BT benchmark by running one scenario per iteration of the main computation loop, with and without re-initializing for every scenario. Figure 8.8 graphically depicts the setup of the first test case. The experiments have been done on BT with problem size CLASS = B.

Figure 8.8.: Test Case 1: (a) Initialization outside the user region. (b) Initialization inside the user region

The second test case examines whether (a) initializing with, say, 16 threads and running with four threads gives a different result than (b) initializing with four threads and running with four threads. Both test cases for the benchmark BT have shown that there is no difference between case (a) and case (b). So, from the results of these experiments, it has been confirmed that we can run the further experiments without restarting the application for every scenario. This makes the tests run fast, which is critical when the number of scenarios is large, because the search algorithm used in PCAP is the exhaustive search algorithm.

8.5. Third Experiment Set

The third set of experiments is conducted to find out whether there can be a global optimum for multiple parallel regions. The parallel regions inside the user region are taken, and for all these regions the cross product is created as described in Chapter 7 under exhaustive search. The experiments have been performed for the benchmark CG with problem size CLASS = A. To select only the parallel regions in the user region, the regions which are not of interest, i.e., the ones in the initialization etc., have been excluded manually by editing the .sir file produced by Periscope. There are three parallel regions inside the user region of benchmark CG. The experiments have been done with a range of 1-16 threads. The cross product produces (Number of Threads)^(Number of Parallel Regions) scenarios; in the case of CG, 16³ = 4096 scenarios have been created. The results of this experiment are shown in Figure 8.9.

The results for the execution time show that the best scenario is scenario number 1685, i.e., ((R1, 7), (R2, 10), (R3, 6)). In this particular case, one of the regions takes twice the time of the other two, but all three regions show good scalability up to 10 threads, after which the scalability degrades. This is also shown in the graph of Figure 8.9. The result exhibits a periodic behavior: the EDP decreases until 10 threads within the cross product and then starts increasing. So it can be seen that, by using PCAP, we can find the optimal number of threads for each parallel region in the application. As the result repeats itself after every 256 scenarios, R2 is the most influential region in this case and plays the vital role in deciding the best configuration for optimal energy and performance.

Figure 8.9.: CG Results for the Multiple Regions Experiment
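Assuming the scenarios are numbered in cross-product order with R1 as the most significant "digit" (an assumption consistent with the numbers reported above, but not stated explicitly for the implementation), a scenario index decodes into per-region thread counts like a base-16 number:

    #include <cstdio>
    #include <vector>

    // Hedged sketch: thread counts 1-16 per region map to digits 0-15,
    // with R1 as the most significant digit of the scenario index.
    std::vector<int> decode(int scenario, int numRegions, int base = 16) {
        std::vector<int> threads(numRegions);
        for (int r = numRegions - 1; r >= 0; --r) {
            threads[r] = scenario % base + 1;  // digit 0-15 -> 1-16 threads
            scenario /= base;
        }
        return threads;
    }

    int main() {
        // 1685 = 6*256 + 9*16 + 5 -> (R1,7), (R2,10), (R3,6)
        std::vector<int> t = decode(1685, 3);
        std::printf("(R1,%d) (R2,%d) (R3,%d)\n", t[0], t[1], t[2]);
        return 0;
    }

Under this numbering, scenario 1685 indeed decodes to ((R1, 7), (R2, 10), (R3, 6)), and the R1 digit changes only once every 256 scenarios, matching the 256-scenario periodicity visible in Figure 8.9.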
8.6. Fourth Experiment Set

All three sets of experiments above have been performed with the tuning step of the PCAP plugin, without first applying the speedup analysis step to shrink the search space. The speedup analysis step first requires the validation of the energy consumption formula shown in Equation 8.1 below:

E = P_static * T + P_dynamic * N_threads * T        (8.1)

where
E = energy consumed by the application,
P_static = static power of a SuperMUC compute node,
P_dynamic = dynamic power per core for a compute-intensive application,
N_threads = number of threads,
T = execution time of the application.

The static and dynamic power per core of a SuperMUC compute node has been estimated by running a compute-intensive application on a compute node. The application is run with 8 threads, which are pinned to the 8 cores of Package 0. The application always gets assigned a whole compute node with 16 cores; we run it on 8 cores, so the other 8 cores are not used by the application. The energy of Package 0 and of Package 1 of the compute node is measured. The energy of Package 1 represents the static energy of a socket. Subtracting the energy of Package 1 from the energy of Package 0 gives the dynamic energy of the entire socket, and dividing this by 8 gives the dynamic energy per core. The measurements are done for six different frequencies (1.2 GHz, 1.5 GHz, 1.8 GHz, 2.1 GHz, 2.4 GHz and 2.7 GHz). Power is calculated from the measured energy and the execution time of the application, and this power is then used to validate the formula.

For the validation, the BT benchmark has been used. The BT benchmark has been run with the Energy Policy Tag set to none, which ensures that the application runs at a frequency of 2.3 GHz. Checking the results against the energy consumption formula, the values do not add up exactly: the calculated value is always higher than the measured value, although it is in the right range. More work is needed here to determine the correct static and dynamic power per core of a SuperMUC compute node.

The idea is to first estimate the speedup of an application by using the energy consumption estimate of the application, and then to use only the range of variant values which gives a good speedup. This speedup analysis is passed to the tuning step, which performs the actual tuning. Each parallel region then uses its own specific range of variant values, making the search space small. As the search algorithm being used is exhaustive search, it is crucial to shrink the search space to run the tuning efficiently.
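The following sketch summarizes the estimation arithmetic described above. The input values are placeholders to be filled with the Enopt measurements; the variable names are ours, not Enopt's API.

    #include <cstdio>

    int main() {
        // Placeholders: fill in the measured values described above.
        double e_pkg0 = 0.0;  // energy of Package 0, 8 pinned threads [J]
        double e_pkg1 = 0.0;  // energy of Package 1, idle [J]
        double t      = 1.0;  // execution time of the run [s]
        const int loadedCores = 8;

        double pStatic    = e_pkg1 / t;               // static power, per socket
        double pDynSocket = (e_pkg0 - e_pkg1) / t;    // dynamic power, socket
        double pDynCore   = pDynSocket / loadedCores; // dynamic power, per core

        // Plugging the estimates back into Equation (8.1):
        int nThreads = 8;
        double eModel = pStatic * t + pDynCore * nThreads * t;
        std::printf("P_static = %.1f W, P_dyn/core = %.1f W, E_model = %.1f J\n",
                    pStatic, pDynCore, eModel);
        return 0;
    }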
9. Conclusion

In this thesis, a tuning plugin for performance and energy efficiency optimization has been developed. The plugin was developed inside an existing framework, the Periscope Tuning Framework. Periscope is a performance analysis tool which has been developed at the chair of computer architecture and high performance computing at TUM. Periscope has been extended to PTF in order to use the performance analysis results generated by Periscope for tuning parallel applications. For this purpose, multiple tuning plugins have been developed inside PTF, each targeting one category of parallel applications. In this work, we have developed a plugin named PCAP which focuses mainly on the energy and performance optimization of OpenMP-based applications. The OpenMP version of the NAS Parallel Benchmarks has been used to test PCAP.

The plugin specifies one tuning parameter, named "Number of Threads", for the application. This parameter specifies with how many threads the main kernel (user region) of the application is executed. The plugin creates the search space of the possible scenarios which have to be evaluated; the exhaustive search algorithm is used to create the scenarios. Two properties, namely 'Energy' and 'Execution Time', are requested by the plugin for the user region. These properties are then used to calculate the objective functions. Three objective functions are used in the plugin to evaluate the tuning configurations. The Energy Delay Product is used as the objective function which tells the user the best configuration for the application. If the user desires a more performance-centric configuration, the other two objective functions, ED²P and ED³P, give the best configuration.

The results of the tests show that when the application under consideration has good scalability properties, the optima for the energy consumed and the execution time coincide, and the EDP points to exactly the same configuration as the best one. If the application shows poor scalability, the optima for energy consumption and execution time do not coincide. In this case, the EDP gives the configuration with optimal energy usage at a marginal loss in performance. It has also been shown that the memory distribution does not have a significant effect when the number of threads used to initialize the memory differs from the number of threads used to run the kernel of the application. This means that, while running experiments, restarting the application is not necessary for each of the scenarios.

PCAP has also been tested for tuning the individual regions inside the kernel, rather than tuning the entire kernel, in order to find the optimal configuration per parallel region. The cross product of the parallel regions and the variants is created by the exhaustive search algorithm to create the scenarios, and then the best configuration is returned. In the case of individual parallel region tuning, when there is a large number of parallel regions in the application, the search space expands very fast. As the plugin uses exhaustive search, it is crucial to shrink the search space. For this purpose, before applying the tuning process, each region has to be analyzed for speedup, and the variant space of each region should be shrunk according to the speedup properties of the region. Some theoretical work has been done in this regard: the static and dynamic power of a SuperMUC compute node has been estimated. This step is not complete; some implementation work is still left and is referred to as future work.

Bibliography

[1] D. Albonesi. Dynamic IPC/Clock Rate Optimization. International Symposium on Computer Architecture, pages 282-292, July 1998.

[2] D. Baeck, A. Loeoef, and M. Roennbaeck. Evaluation of Techniques for Reducing the Energy-Delay Product in a JAVA Processor. 1999.

[3] D. Brooks and M. Martonosi. Adaptive Thermal Management for High-Performance Microprocessors. Workshop on Complexity Effective Design, June 2000.

[4] E. V. Carrera, E. Pinheiro, and R. Bianchini. Conserving Disk Energy in Network Servers. In Proceedings of the 17th International Conference on Supercomputing, June 2003.

[5] C. B. Navarrete, A. Auweter, C. Guillen, W. Hesse, and M. Brehm. Energy Consumption Comparison for Running Applications on SandyBridge Supercomputers. 2013.

[6] I. A. Compres. PTF Demonstrator. http://www.autotune-project.eu/sites/default/files/Materials/Deliverables/12/D2.2 PTF Demonstrator final.pdf.

[7] M. Curtis-Maury, J. Dzierwa, C. Antonopoulos, and D. Nikolopoulos. Online Power-Performance Adaptation of Multithreaded Programs using Hardware Event-Based Prediction.
In Proceedings of the International Conference on Supercomputing, June 2006.

[8] B. Diniz, D. O. G. Neto, W. Meira Jr., and R. Bianchini. Limiting the Power Consumption of Main Memory. In Proceedings of the International Symposium on Computer Architecture, June 2007.

[9] H. Sanchez et al. Thermal Management System for High Performance PowerPC Microprocessor. IEEE Computer Society International Conference, pages 325-330, February 1997.

[10] N. Vijaykrishnan et al. Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower. International Symposium on Computer Architecture, pages 96-106, June 2000.

[11] High Performance Fortran Forum. High Performance Fortran Language Specification. January 1997. http://www.crpc.rice.edu/CPRC/softlib/TRs online.html.

[12] R. Ge, X. Feng, and K. W. Cameron. Performance Constrained Distributed DVS Scheduling for Scientific Applications on Power-aware Clusters. In Proceedings of Supercomputing, November 2005.

[13] M. Gerndt and M. Ott. Automatic Performance Analysis with Periscope. In Concurrency and Computation: Practice and Experience, April 2010. http://www.lrr.in.tum.de/~ottmi/publications/ccpe2008.pdf.

[14] K. Ghose and M. Kamble. Reducing Power in Superscalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation. International Symposium on Low Power Electronics and Design, pages 70-75, August 1999.

[15] R. Gonzalez and M. Horowitz. Energy Dissipation in General Purpose Microprocessors. IEEE Journal of Solid-State Circuits, pages 1277-1284, September 1996.

[16] T. Halfhill. Transmeta Breaks x86 Low-Power Barrier. Microprocessor Report, pages 9-18, February 2000.

[17] C.-H. Hsu and W. Feng. A Power-Aware Run-Time System for High-Performance Computing. In Proceedings of Supercomputing '05, November 2005.

[18] Intel. Pentium III Processor Mobile Module: Mobile Module Connector 2 (MMC-2) Featuring Intel SpeedStep Technology. 2000.

[19] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi. An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. In Proceedings of the International Symposium on Microarchitecture, December 2006.

[20] K. Itoh. Low Power Memory Design. Low Power Design Methodologies, pages 201-251, September 1996.

[21] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. NAS Technical Report NAS-99-011, October 1999.

[22] T. Juan, T. Lang, and J. Navarro. Reducing TLB Power Requirements. International Symposium on Low Power Electronics and Design, pages 196-201, August 1997.

[23] J. Kin, M. Gupta, and W. Mangione-Smith. The Filter Cache: An Energy Efficient Memory Structure. International Symposium on Microarchitecture, pages 187-193, December 1997.

[24] C. Liu, A. Sivasubramaniam, M. T. Kandemir, and M. J. Irwin. Exploiting Barriers to Optimize Power Consumption of CMPs. In Proceedings of the 19th International Parallel and Distributed Processing Symposium, April 2005.

[25] LRZ. SuperMUC Petascale System. https://www.lrz.de/services/compute/supermuc/systemdescription/.

[26] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduction. International Symposium on Computer Architecture, pages 132-141, July 1998.

[27] L. Morin. Design of the Tuning Plugins. http://www.autotune-project.eu/sites/default/files/Materials/Deliverables/12/D4.1 Tuning Plugins final.pdf.
[28] S. Park, W. Jiang, Y. Zhou, and S. V. Adve. Managing Energy-Performance Tradeoffs for Multithreaded Applications on Multiprocessor Architectures. In Proceedings of the 2007 ACM SIGMETRICS, June 2007.

[29] T. Pering, T. Burd, and R. Brodersen. The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms. International Symposium on Low Power Electronics and Design, pages 76-81, August 1998.

[30] E. Rohou and M. Smith. Dynamically Managing Processor Temperature and Power. 2nd Workshop on Feedback-Directed Optimization, November 1999.

[31] R. Springer, D. K. Lowenthal, B. Rountree, and V. W. Freeh. Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster. In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, March 2006.

[32] C.-L. Su and A. Despain. Cache Design Trade-offs for Power and Performance Optimization: A Case Study. International Symposium on Low Power Electronics and Design, pages 63-68, April 1995.

[33] A. Tiwari, M. A. Laurenzano, L. Carrington, and A. Snavely. Auto-tuning for Energy Usage in Scientific Applications. In Proceedings of the Euro-Par 2011 Workshops, Part II, pages 178-187, 2012.

[34] A. Varma, B. Ganesh, M. Sen, S. R. Choudhury, L. Srinivasan, and B. L. Jacob. A Control-Theoretic Approach to Dynamic Voltage Scheduling. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, October 2003.

[35] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark. Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2000.

[36] Q. Wu, M. Martonosi, D. Clark, V. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks. Dynamic Compiler-Driven Control for Microprocessor Energy and Performance. IEEE Micro, 26(3), 2006.