PTF Energy Tuning Plugin for HPC Applications

FAKULTÄT FÜR INFORMATIK
DER TECHNISCHEN UNIVERSITÄT MÜNCHEN

Masterarbeit in Informatik

PTF Energy Tuning Plugin for HPC Applications
Energieoptimierung für HPC-Anwendungen mit Periscope
Author:
Umbreen Sabir Mian
Supervisor: Prof. Dr. Michael Gerndt
Date:
December 16, 2013
Statement of Academic Integrity
I,
Last name: Mian
First name: Umbreen Sabir
ID No.: 03624284
hereby confirm that the attached thesis,
“PTF Energy Tuning Plugin for HPC Applications”
or
“Energieoptimierung für HPC-Anwendungen mit Periscope”
is my own work and I have documented all sources and material used.
Munich, December 15, 2013
Umbreen Sabir Mian
Acknowledgments
I would like to thank my supervisor, Prof. Dr. Michael Gerndt, for his support and for being the inspiration for this work. Deepest gratitude is due to Prof. Gerndt, without whose knowledge and assistance this thesis would not have been successful. I pay my sincere thanks to Robert Mijakovic for his valuable advice, kind guidance, and technical support; I am very thankful that he always spared some time from his busy schedule to help me. Furthermore, it has been a very pleasant experience to work at the Lehrstuhl für Rechnertechnik und Rechnerorganisation, Fakultät für Informatik der Technischen Universität München.
The acknowledgement is incomplete without saying special thanks to Dr. Carmen Navarrete from LRZ for providing me with the required information and guidance.
Last but not least, I also appreciate the support of my family and friends, who have always stood beside me through the ups and downs of my life.
Abstract
The PCAP plugin targets the energy consumption of OpenMP applications. The plugin uses the energy measurements from the Enopt library to optimize the energy consumption of instrumented applications at runtime by changing the tuning parameter “Number of Threads” on certain instrumented code regions (OpenMP parallel regions). Energy consumption and execution time for a code region are combined using the Energy Delay Product (EDP). The application is tuned by applying different values of the tuning parameter to different code regions via a cross product and then finding the global optimum EDP for the application. The scalability of a code region affects the chosen tuning parameter value, and the granularity of the code region is reflected in the energy consumption measured for that region.
Contents

Acknowledgments
Abstract
Outline of the Thesis

I. Introduction and Background

1. Introduction
2. Related Work
3. SuperMUC
   3.1. System purpose and target users
   3.2. System overview
   3.3. Energy Efficiency
   3.4. System Configuration Details
        3.4.1. Memory Architecture
        3.4.2. Details on processors
   3.5. System Software
   3.6. Storage Systems
        3.6.1. Home file systems
        3.6.2. Work and Scratch areas
        3.6.3. Tape backup and archives
   3.7. Energy Measurement - enopt
        3.7.1. Using EnOpt
4. Energy Delay Product
   4.1. Auto-tuning Feedback Metric
   4.2. Energy Measurement
5. Periscope Tuning Framework
   5.1. Periscope
   5.2. Periscope Tuning Framework (PTF)
        5.2.1. PTF main components
        5.2.2. PTF Repository Structure
        5.2.3. PTF Plugins

II. Design and Implementation

6. PCAP Plugin
   6.1. Tuning Plugin Interface (TPI)
        6.1.1. Initialize
        6.1.2. Start Tuning Step
        6.1.3. Create Scenarios
        6.1.4. Prepare Scenarios
        6.1.5. Define Experiment
        6.1.6. Get Restart Info
        6.1.7. Process Results
   6.2. PCAP Plugin
        6.2.1. Tuning Objective
        6.2.2. Tuning points
        6.2.3. Initialize
        6.2.4. Start Tuning Step
        6.2.5. Create Scenarios
        6.2.6. Prepare Scenarios
        6.2.7. Define Experiment
        6.2.8. Get Restart Info
        6.2.9. Process Results
        6.2.10. Objective Functions
7. Exhaustive Search
   7.1. Add Search Space
   7.2. Create Scenarios
   7.3. Iterate Search Spaces
   7.4. Iterate Tuning Points
   7.5. Generate Scenarios
   7.6. Search Finished

III. Experiments and Results

8. Experimental Analysis
   8.1. NAS Parallel Benchmarks (NPB)
   8.2. Results on SuperMUC Compute Node (16 processing cores)
   8.3. First Experiment Set
   8.4. Second Experiment Set
   8.5. Third Experiment Set
   8.6. Fourth Experiment Set
9. Conclusion

Bibliography
Outline of the Thesis

Part I: Introduction and Background

Chapter 1: Introduction
This chapter presents an overview of the work and its purpose.

Chapter 2: Related Work
This chapter gives a brief overview of related work on the same topic as this thesis.

Chapter 3: SuperMUC
This chapter gives a brief overview of the SuperMUC machine used for all experiments during this work and gives details about the Enopt library.

Chapter 4: EDP
This chapter discusses details of the EDP objective function used in the plugin.

Chapter 5: PTF
This chapter presents an overview of Periscope and PTF.

Part II: Design and Implementation

Chapter 6: PCAP
This chapter presents the details of the design and implementation of the tuning plugin.

Chapter 7: Exhaustive Search
This chapter presents the details of the implementation of the search algorithm.

Part III: Experiments and Results

Chapter 8: Experiments
This chapter gives details about all the experiments performed and their results.

Chapter 9: Conclusion
This part briefly concludes all the work done in this thesis.
Part I.
Introduction and Background
1. Introduction
In the last few years, the emergence of multicore architectures has revolutionized the landscape of high-performance computing. Recent developments in computer architecture, especially those related to multicore and manycore designs, have triggered considerable performance gains in high-performance computing. The multicore shift has not only increased the per-node performance potential of computer systems but has also made great strides in curbing power and heat dissipation. High-performance computing is, and has always been, performance-oriented.
However, the push towards maximum performance has increased energy consumption, especially in datacenters and supercomputing centers. Moreover, as peak performance is rarely attained, some of this energy consumption results in little or no performance gain. Today, the running costs for energy of many systems have passed the break-even point where they far exceed the acquisition costs of the hardware platform. Besides these economic issues, more fundamental aspects related to the adequate design of efficient IT infrastructures in datacenters must be addressed in this context. Even the supply of the needed amount of energy is becoming critical for many installations. In addition, large energy consumption costs datacenters and supercomputing centers a significant amount of money and wastes natural resources. Furthermore, a typical supercomputer consumes a large amount of electrical power, almost all of which is converted into heat, requiring cooling.
Concerns about a looming energy crisis and global warming lead to an unmistakable call for energy efficiency in high-performance computing. On the one hand, datacenters and supercomputing centers have to bear large cooling costs; on the other hand, the large amount of heat emitted by these giant machines has a severe impact on the environment and adds to the global warming problem. Because of the sheer scale of these machines, regulating energy consumption is a very important research area in supercomputing. Recognition of the importance of power in the field of high-performance computing, whether as an obstacle, an expense, or a design consideration, has never been greater or more pervasive. Energy consumption is therefore a major concern with today's high-performance multicore systems.
On the other hand, performance analysis and tuning is an important step in programming multicore-based parallel architectures. Aspects for tuning application performance are as diverse as the selection of compilation options, the energy consumption, the execution efficiency of MPI/OpenMP, the execution time of GPU kernels, and so on. A lot of work has been done on the tuning of parallel applications, and it is still a very hot research topic. Many analysis tools have been developed that analyze the performance properties of parallel applications.
The goal of this thesis is to tune the energy consumption of OpenMP-based parallel applications run at very large scale while minimizing the impact on run-time performance (wall-clock execution time). As the basic principle of parallel processing in OpenMP is the creation of multiple threads and the assignment of part of the application's execution to each thread, the ultimate goal of the thesis is to optimize the number of threads in such a way that optimal performance is achieved along with reduced energy consumption. It is easy to see that there is a tradeoff between these two properties of a parallel application. Increasing the number of cores can increase performance by reducing the execution time of the application, but it consumes more energy. Furthermore, not all applications are scalable, so there is an optimal number of threads beyond which adding threads does not reduce the execution time, or in some cases the scalability gain is lower than the increase in energy, possibly due to memory bandwidth limitations. Concerning the measurement of the energy consumed by the application, there are special energy measurement systems that can measure energy consumption both at the CPU level and at the node level. Using these basic concepts as the basis for the tuning process, a very efficient way of tuning parallel applications has been developed.
The framework used for this tuning task is the Periscope Tuning Framework (PTF). PTF is an automatic tuning framework for the design of autotuning tools targeting various types of applications and optimization opportunities, developed at the Technical University of Munich. The framework is based on the performance analysis tool Periscope and offers a set of plugins to automatically tune application performance in diverse aspects. Periscope is an automatic performance analysis tool for highly parallel applications written in MPI and/or OpenMP, which has also been developed at the Technical University of Munich. A Periscope analysis run provides values of different properties of the application under consideration, which can then be optimized using some tuning strategy. The properties output by Periscope are thus tuned using the tuning plugins of PTF. After an auto-tuning run, the user is given recommendations on how the application performance can be improved.
During the course of this thesis, a plugin has been developed inside PTF which tunes the execution time and energy efficiency of a given OpenMP application. The tuning plugin uses an objective function that combines the property values from Periscope and tries to tune those values to obtain optimal application performance. The objective function used in this tuning plugin is known as the Energy Delay Product. As its name suggests, this function minimizes the execution time of the application while lowering the application's energy consumption.
The plugin has been tested using the NAS Parallel Benchmarks (NPB). These benchmarks are used for the performance evaluation of highly parallel supercomputers. They consist of five parallel kernel benchmarks and three simulated application benchmarks, which together depict the computation and data movement characteristics of large-scale computational fluid dynamics applications.
The next chapter of the introduction section gives an overview of related work. As mentioned previously, the energy consumption of supercomputers is a very active research field in high-performance computing, so a lot of work has been done in this context to make these massive machines energy efficient, both in terms of hardware and in terms of the applications running on them.
The second chapter gives some background details about SuperMUC and the previous work done in the PTF project. It starts with details about the architecture, layout, and memory scheme of SuperMUC and the energy measurement done by the LRZ library. Then it describes the architecture and working principle of Periscope. Finally, the chapter concludes after explaining all the details about PTF.
Section two describes the thesis work in detail; more precisely, it presents the design of the tuning plugin and all its subcomponents. It elaborates on the whole process of tuning, starting from the details of the design and implementation of the tuning plugin and then describing the entire tuning process.
Section three presents all the tests performed on the tuning plugin using the NPB benchmarks. It gives the details of each experiment, the problems encountered during the course of the experiments, and the results generated by running the tuning on these benchmarks.
Lastly, a brief summary of the entire work is given in the conclusion section. This section also gives an overview of some aspects of the area that can be worked on in the future.
2. Related Work
With exascale systems on the horizon, we have entered an era in which power and energy consumption are primary concerns for scalable computing. Addressing them requires revolutionary methods with a stronger integration among hardware features, system software, and applications. Current approaches to energy-efficient computing rely heavily on power-efficient hardware in isolation.
Existing hardware has been successfully leveraged by present-day operating systems to conserve energy whenever possible, but these approaches have proven ineffective and even detrimental at large scale. While hardware must provide part of the solution, how these solutions are leveraged on large-scale platforms requires a new and flexible approach. It is particularly important that any approach taken has both a system-level and a node-level view of these issues.
To address the issue of efficient energy usage, many low-power architectural techniques have been proposed and implemented. Examples include putting the system in sleep mode [9]; scaling the voltage and/or frequency [16] [18] [29]; switching contexts to a job that consumes less power [30]; reconfiguring hardware structures [1]; gating pipeline signals, for example to control speculation [3] [26]; throttling the instruction cache [9]; clock optimizations, including multiple clocks and clock gating [15]; better signal encoding [15]; low-power memory design techniques [20] such as bank partitioning or divided word lines; low-power cache design techniques such as cache block buffering [10], sub-banking [14] [32], or filter caches [23]; and TLB optimizations [22].
Research on software-controlled dynamic power management has focused extensively on controlling supply voltage and frequency in sequential microprocessors. This research has derived analytical models for DVFS [35], compiler-driven techniques [36], and control-theoretic approaches [34].
Similar techniques have been employed for dynamic power management in system components other than processors, such as RAM [8] and disks [4]. Researchers have recently modeled and analyzed the impact of two control knobs, DVFS and concurrency throttling, on dynamic power management on shared-memory [7] [19] [24] [28] and distributed-memory parallel systems [12] [17] [31].
All the above-mentioned work focuses on exploring techniques, either at the architectural level or at the software level, to minimize energy usage. This thesis focuses instead on how applications can be run on existing hardware while using energy efficiently. The main focus of this work is OpenMP-based applications: by using different configurations of the number of threads used to run the application, the application's energy usage and execution time are tuned. The focus is on tuning the application as a whole as well as tuning the individual parallel regions in such a way that a globally optimal configuration for minimum energy usage can be found.
Similar work has been presented in [33], in which the authors show that there are non-trivial interactions between compiler performance optimization strategies and energy usage. They used a Watts Up? Pro power meter to measure energy usage and leveraged open-source projects to explore the energy and performance optimization space for computation-intensive kernels. They used the clock frequency as the tuning parameter.
This thesis shares similar objectives, using the number of threads as the tuning parameter and measuring energy usage with the Enopt library in order to explore the energy and performance optimization space of OpenMP-based applications.
3. SuperMUC
SuperMUC is the name of the new supercomputer at the Leibniz-Rechenzentrum (Leibniz Supercomputing Centre) in Garching near Munich (the MUC suffix is borrowed from the Munich airport code). With more than 155,000 cores and a peak performance of 3 Petaflop/s (= 10^15 floating point operations per second), SuperMUC is one of the fastest supercomputers in the world.
3.1. System purpose and target users
SuperMUC strengthens the position of Germany's Gauss Centre for Supercomputing in Europe by delivering outstanding compute power and integrating it into the European high-performance computing ecosystem. With the operation of SuperMUC, LRZ acts as a European centre for supercomputing and is a Tier-0 centre of PRACE, the Partnership for Advanced Computing in Europe. SuperMUC is available to all European researchers to expand the frontiers of science and engineering. Since August 2011, a migration system (nicknamed SuperMIG) has enabled porting applications to the new programming environment [25].
3.2. System overview
• 155,656 processor cores in 9400 compute nodes
• 300 TB RAM
• Infiniband FDR10 interconnect
• 4 PB of NAS-based permanent disk storage
• 10 PB of GPFS-based temporary disk storage
• 30 PB of tape archive capacity
• Powerful visualization systems
• Highest energy-efficiency
3.3. Energy Efficiency
SuperMUC uses a new, revolutionary form of warm-water cooling developed by IBM. Active components like processors and memory are directly cooled with water that can have an inlet temperature of up to 40 degrees Celsius. This "High Temperature Liquid Cooling", together with very innovative system software, promises to cut the energy consumption of the system. In addition, all LRZ buildings will be heated by re-using this energy [25].
3.4. System Configuration Details
LRZ's target for the architecture is a combination of a large number of moderately powerful compute nodes, with a peak performance of several hundred GFlop/s each, and a small number of fat compute nodes with a large shared memory. The network interconnect between the nodes allows for perfectly linear scaling of parallel applications up to the level of more than 10,000 tasks [25].
SuperMUC consists of 18 Thin Node Islands and one Fat Node Island, which initially also serves as the migration system SuperMIG. Each Island contains more than 8,192 cores. All compute nodes within an individual Island are connected via a fully non-blocking Infiniband network (FDR10 for the Thin Nodes, QDR for the Fat Nodes). Above the Island level, the high-speed interconnect provides a bi-directional bi-section bandwidth ratio of 4:1 (intra-Island to inter-Island) [25].
The SuperMUC system will be expanded in 2015, doubling its performance.
3.4.1. Memory Architecture
SuperMUC has 18 partitions called Islands. Each Island consists of 512 nodes. A node is a shared-memory system with two processors, as shown in Figure 3.1. Each node consists of:
• Two Sandy Bridge-EP Intel Xeon E5-2680 8C processors
– Each processor has eight cores.
– Each core supports 2-way hyperthreading.
– 172.8 GFlops per processor, with 21.6 GFlops at 2.7 GHz per core.
• 32 GByte of memory
• An Infiniband network interface
Figure 3.1.: SuperMUC NUMA Node
3.4.2. Details on processors
• Westmere-EX for the Fat Node Island / migration system
• Sandy Bridge-EP for the Thin Node Islands (Intel Xeon E5-2680 8C) - 2.7
GHz (Turbo 3.5 GHz). Architecture of a Sandy Bridge processor is shown
in Figure 3.2.
3.5. System Software
SuperMUC uses the following software components:
• Suse Linux Enterprise Server (SLES)
• System management: xCat from IBM
• Batch processing: LoadLeveler from IBM
From the user side a wide range of compilers, tools and commercial and free
applications is provided. Many scientists also build and run their own software.
Figure 3.2.: Sandy Bridge Processor Architecture
3.6. Storage Systems
SuperMUC has a powerful I/O subsystem, which helps to process the large amounts of data generated by simulations.
3.6.1. Home file systems
Permanent storage for data and programs is provided by a 16-node NAS cluster from NetApp. This primary cluster has a capacity of 2 Petabytes and has demonstrated an aggregated throughput of more than 10 GB/s using NFSv3. NetApp's Ontap 8 "Cluster-mode" provides a single namespace for several hundred project volumes on the system. Users can access multiple snapshots of the data in their home directories [25].
Data is regularly replicated to a separate 4-node NetApp cluster with another 2 PB of storage for recovery purposes. Replication uses SnapMirror technology and runs at up to 2 GB/s in this setup.
The storage hardware consists of 3400 SATA disks of 2 TB each, protected by double-parity RAID and integrated checksums.
3.6.2. Work and Scratch areas
For highest-performance checkpoint I/O, IBM's General Parallel File System (GPFS) is available with 10 PB of capacity and an aggregated throughput of 200 GB/s. The disk storage subsystems were built by DDN [25].
3.6.3. Tape backup and archives
LRZ's tape backup and archive systems, based on TSM (Tivoli Storage Manager) from IBM, are used for archiving and backup. They have been extended to provide more than 30 Petabytes of capacity to the users of SuperMUC. Digital long-term archives help to preserve the results of scientific work on SuperMUC. User archives are also transferred to a disaster recovery site [25].
3.7. Energy Measurement - enopt
LRZ provides an energy monitoring library for measuring the energy consumed by an application. The aim of the library, known as Enopt, is to gain knowledge of the distribution of the energy consumption among the different components of the compute nodes of a supercomputer system, taking into account the characteristics of the application running on it. For this purpose, the PAPI-RAPL component and the native ibmaem-HWMON kernel module have been integrated.
The tool supports Fortran and C/C++ applications parallelized with MPI, OpenMP, or a hybrid of the two. At the moment, it runs on Sandy Bridge processors. The library provides classes to monitor not only energy counters but also other PAPI performance counters, in order to find correlations between the energy consumption and the behavior of the application with respect to these counters (such as cache misses, number of cycles, instructions per second, etc.) and the application runtime [5].
PAPI aims to provide tool designers and application engineers with a consistent interface and methodology for using the performance counter hardware found in most microprocessors. PAPI enables software engineers to see, in near real time, the relationship between software performance and processor events. One of the components of the PAPI library is the so-called PAPI-RAPL component, which makes use of the RAPL sensors available in the Sandy Bridge microarchitecture. The PAPI-RAPL component provides energy consumption measurements of CPU-level components by reading the MSR registers [5].
The specific RAPL domain counters available on Intel platforms vary across "product segments":
• Platforms targeting the client segment support the following RAPL domain hierarchy:
– Package (PKG)
– Two power planes (PP0 and PP1), where PP0 refers to the processor cores and PP1 to the uncore devices.
• Platforms targeting the server segment support:
– Package (PKG)
– The power plane PP0, which again refers to the processor cores; the PP1 domain is not supported.
– DRAM
The package domain PKG, regardless of the targeted segment, is defined as
the processor die.
The specific MSR interfaces defined for the RAPL domain are:
• MSR_PKG_POWER_LIMIT: allows software to set power limits for the package.
• MSR_PKG_POWER_INFO: reports the package power range information for RAPL usage.
PAPI-RAPL provides a set of PAPI native events to interact with the RAPL interface. These events include, among others (see Figure 3.1 [5]; a short usage sketch follows the list):
• PACKAGE_ENERGY:PACKAGEx: energy used by chip package 0 or 1, respectively [5].
• DRAM_ENERGY:PACKAGEx: energy used by the DRAM on package 0 or 1, respectively. It is unavailable for client segments [5].
• PP0_ENERGY:PACKAGEx: energy used by all cores in package 0 or 1, respectively [5].
• PP1_ENERGY:PACKAGEx: energy used by all uncore devices in package 0 or 1, respectively. It is unavailable for server segments [5].
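As an illustration of how such native events are typically read, the following sketch uses the standard PAPI event-set API. The exact RAPL event name string is an assumption and may differ between PAPI versions; actual names can be listed with the papi_native_avail utility.

```cpp
#include <cstdio>
#include <papi.h>

int main() {
    // Initialize PAPI and create an event set containing one RAPL native event.
    PAPI_library_init(PAPI_VER_CURRENT);
    int eventSet = PAPI_NULL;
    PAPI_create_eventset(&eventSet);

    // The event name string is an assumption for illustration purposes.
    PAPI_add_named_event(eventSet, "rapl:::PACKAGE_ENERGY:PACKAGE0");

    long long value = 0;
    PAPI_start(eventSet);
    // ... code region whose package energy is to be measured ...
    PAPI_stop(eventSet, &value);

    // PAPI RAPL energy events are commonly reported in nanojoules.
    std::printf("Package 0 energy: %lld nJ\n", value);
    return 0;
}
```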
Figure 3.3.: Graphical location of the RAPL counters on the SandyBridge microarchitecture [5]

Figure 3.3 [5] shows the location of the RAPL counters on the Sandy Bridge microarchitecture. Blocks with a red background belong to PACKAGE0, whereas the green ones belong to PACKAGE1. The violet block represents the DRAM device. The yellow circles represent the RAPL sensors, whereas the blue one represents the HWMON counter. The available RAPL counters on SuperMUC are:
1. PP0_ENERGY:PACKAGE0, measuring the energy consumption of the cores (PP0) belonging to PKG0;
2. PP0_ENERGY:PACKAGE1, the same but for PKG1;
3. PACKAGE_ENERGY:PACKAGE0, the processor die of PKG0;
4. PACKAGE_ENERGY:PACKAGE1, the same but for PKG1;
5. DRAM_ENERGY:PACKAGE0, the energy consumption of the DRAM belonging to PKG0;
6. DRAM_ENERGY:PACKAGE1, the same but for PKG1;
7. the DC energy counters provided by the paddle cards;
8. the AC energy counter provided by the paddle cards.
Each AC counter is shared by two nodes. The IBM paddle cards are hardware devices located on the motherboard for measuring the AC and DC power consumption. The uncore measurements can be emulated as the difference between PACKAGE_ENERGY:PACKAGEx and PP0_ENERGY:PACKAGEx [5].
The codes for using the 8 sensors shown in Figure 3.3 [5] in an application are as follows:
• ENOPT_ALL_CORES = 1 + 2
• ENOPT_ALL_UNCORES = (3 - 1) + (4 - 2)
• ENOPT_ALL_SOCKETS = 3 + 4
• ENOPT_ALL_DRAMS = 5 + 6
• ENOPT_NODE = 7
• ENOPT_PDU = 8
• ENOPT_CORES_1 = 1
• ENOPT_CORES_2 = 2
• ENOPT_UNCORES_1 = 3 - 1
• ENOPT_UNCORES_2 = 4 - 2
• ENOPT_SOCKET_1 = 3
• ENOPT_SOCKET_2 = 4
• ENOPT_DRAM_1 = 5
• ENOPT_DRAM_2 = 6
The library also allows reducing the energy consumption by changing the CPU frequency. It provides two ways to change the frequency: one is to set the frequency of the CPU directly, and the other is to choose a governor that sets the power policy of the node. The following five governor policies are available:
• Conservative: based on two thresholds.
• Ondemand: uses one threshold.
• Performance: sets the maximal frequency of 2.7 GHz.
• Powersave: sets the minimal frequency of 1.7 GHz.
• Userspace: sets a user-defined frequency.
3.7.1. Using EnOpt
To use Enopt for energy measurement in an application, the following functions are available (a combined usage sketch follows the list):
• enopt_init(): initializes the library at the start of the program.
• enopt_finalize(): finalizes the library before exiting the program.
• enopt_start(): starts the counters.
• enopt_stop(): stops the counters under consideration.
• enopt_get(ENOPT_NAME, &localVariable): called immediately after enopt_stop() to retrieve the measured value.
• enopt_setGoverner(int): sets a particular governor policy.
• enopt_setFrequency(int): sets the frequency of the cores.
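As an illustration, here is a minimal sketch of how these calls might be combined to measure the core energy of a region. The header name, the constant spelling, and the value type are assumptions based on the function list above, not the authoritative Enopt API.

```cpp
#include <cstdio>
#include "enopt.h"  // assumed header exposing the Enopt functions listed above

int main() {
    double coreEnergy = 0.0;  // accumulated energy, assumed to be in Joules

    enopt_init();   // initialize the library once at program start
    enopt_start();  // start the energy counters

    // ... region of interest, e.g., an OpenMP parallel loop ...

    enopt_stop();                              // stop the counters
    enopt_get(ENOPT_ALL_CORES, &coreEnergy);   // read the measured value
    std::printf("Core energy: %.3f J\n", coreEnergy);

    enopt_finalize();  // release the library before exiting
    return 0;
}
```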
4. Energy Delay Product
Until recently, performance was the single most important feature of a microprocessor. Today, however, designers have become more concerned with power dissipation, and in some cases low power is one of the key design goals. This has led to an increasing diversity in the processors available. Comparing processors across this wide spectrum is difficult, and we need a suitable metric for energy efficiency.
Power is not a good metric for comparing these processors, since it is proportional to the clock frequency: by simply reducing the clock speed we can reduce the power dissipated in any of these processors. While the power decreases, the processor does not really become "better".
Another possible metric is energy, measured in Joules per instruction, or its inverse SPEC/W, where SPEC is the rate of instructions per second [2]:

SPEC/W = (instructions/second)/Watt = instructions/Joule
While better than power, this metric also has problems. It is proportional to CV^2, so one can reduce the energy per instruction by reducing the supply voltage or decreasing the capacitance with smaller transistors. Both of these changes increase the delay of the circuits, so we would expect the lowest-energy processor to also have very low performance. Since we usually want minimum power at a given performance level, or more performance for the same power, we need to consider both quantities simultaneously. The simplest way to do so is by taking the product of energy and delay (in Joules/SPEC, or its inverse SPEC^2/W). To improve the energy-delay product of a processor, we must either increase its performance or reduce its energy dissipation without adversely affecting the other quantity [2].
Power consumption, delay, throughput, and energy consumption are thus metrics commonly used to compare systems. Considering each of these metrics in isolation does not permit a fair comparison of systems, because CMOS circuits can trade performance for energy. When multiple criteria need to be optimized simultaneously, it is common to optimize their weighted product. In the case of energy and time, this product may be represented as the metric M for a circuit configuration C such that [15]:

M(C) = E · D^n

Here n is a weight that represents the relative importance of the two criteria. Since energy and time can be traded off for each other, consider the infinitesimally small quantity of energy ∆E that needs to be expended to reduce the time for a computation by an infinitesimally small amount ∆D. Using Newton's binomial expansion and ignoring products and higher powers of ∆E and ∆D, we get:

M(C) = (E + ∆E)(D - ∆D)^n ≈ E·D^n - nE·D^(n-1)·∆D + D^n·∆E
If this new operating point is equivalent to the old operating point under the metric M:

E·D^n - nE·D^(n-1)·∆D + D^n·∆E = E·D^n

Rearranging this equation yields:

∆E/E = n · ∆D/D
Intuitively, this means that a small reduction in time is considered n times more valuable than a corresponding reduction in energy. For example, if n = 1, a 1% reduction in time is considered worth paying a 1% increase in energy. If n = 2, then it is acceptable to pay for a 1% increase in performance with a 2% increase in energy consumption. In general, when n = 1, energy and delay are equally important; when n > 1, performance is valued more than energy; and when 0 < n < 1, energy savings are considered more important than performance. The case n = 0 optimizes just for energy, and n = -1 optimizes for power. Other negative values of n are not useful for optimization, since E·D^n then changes in opposite directions for improvements in energy and delay.
As described earlier, the EDP metric is commonly used to compare processors with different underlying technologies. In this thesis, we do not compare architectures; rather, we focus on tuning OpenMP applications run on SuperMUC with different numbers of threads. We can characterize an application run with four metrics: performance (measured as total execution time, also called delay), average power consumption, total energy consumption, and the product of energy and execution time (the energy-delay product). We strive for a low energy-delay product, since it implies a good balance between high speed and low energy consumption.
The EDP is defined as the amount of energy consumed during the execution of a program multiplied by the execution time of the program. This EDP metric, and more generally E·D^n, where n is an integer, is commonly used in circuit design. However, the E·D^n product emphasizes performance over energy, particularly as n increases. A metric of increasing interest is the amount of computational work completed per Joule. The question here becomes: what constitutes work? In some literature, work completed per Joule is defined as operations per Joule. However, our set of applications does not use that unit of work, nor do the applications share a common unit of work other than instructions, and those are debatable since different instructions may be chosen by the compilers for each platform. As such, we treat a complete run of a given application as a single unit of work and report the total energy consumed per application run.
4.1. Auto-tuning Feedback Metric
The most common feedback metric used by auto-tuners is application execution time, which can also be expressed as runtime delay with respect to some baseline. For energy auto-tuning, however, we need a feedback metric (objective function) that combines power usage with the execution time of a given program. There has been much debate in the literature about the appropriateness of different combinations of power and performance when investigating techniques for reducing energy consumption on today's architectures. All of them hinge on how much a delay in execution time should be penalized in return for lower energy. We can use four different feedback metrics: E (total energy), ED (energy × delay), ED^2 (energy × delay × delay), and T (execution time). Total energy (E) is derived by multiplying the average power usage by the application execution time. E does not penalize execution time delay at all; T penalizes only execution time delay, with no credit for saving energy. Between these extremes, the ED and ED^2 metrics put more emphasis on the total application execution time than the total energy metric does. Which metric is appropriate depends on the overall goal of the tuning exercise.
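To make the candidate metrics concrete, the following sketch computes E, ED, ED^2, and T from one measured energy value and execution time. The function name and the numeric values are illustrative only.

```cpp
#include <cmath>
#include <cstdio>

// Weighted energy-delay metric E * D^n: n = 0 gives E, n = 1 gives ED,
// and n = 2 gives ED^2; T is just the delay itself.
double energyDelayMetric(double energyJoules, double delaySeconds, int n) {
    return energyJoules * std::pow(delaySeconds, n);
}

int main() {
    double energy = 850.0;  // example measurement in Joules (illustrative value)
    double delay  = 12.5;   // example execution time in seconds (illustrative value)

    std::printf("E    = %.1f J\n",     energyDelayMetric(energy, delay, 0));
    std::printf("ED   = %.1f J*s\n",   energyDelayMetric(energy, delay, 1));
    std::printf("ED^2 = %.1f J*s^2\n", energyDelayMetric(energy, delay, 2));
    std::printf("T    = %.1f s\n",     delay);
    return 0;
}
```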
4.2. Energy Measurement
For the purposes of this thesis, energy measurements are done using the Enopt library provided by LRZ. All results need to be normalized against the base case of running the application with one thread.
5. Periscope Tuning Framework
5.1. Periscope
Periscope is an automatic performance analysis tool for large-scale parallel systems. It consists of a frontend and a hierarchy of communication and analysis agents. Each of the analysis agents, i.e., the nodes of the agent hierarchy, searches autonomously for inefficiencies in a subset of the application processes.
Before the analysis can be conducted, the application has to be instrumented. Source-level instrumentation is used to selectively instrument code regions, i.e., functions, loops, vector statements, OpenMP blocks, I/O statements, and call sites. The region types to be instrumented are determined via command-line switches of the Fortran 95 instrumenter [13].
The application processes are linked with a monitoring system that provides
the Monitoring Request Interface (MRI). The agents attach to the monitor via
sockets. The MRI allows the agent to configure the measurements; to start, halt,
and resume the execution; and to retrieve the performance data.
Figure 5.1.: Periscope Architecture [13]

The application and the agent network are started through the frontend process. It analyzes the set of processors available, determines the mapping of application and analysis agent processes, and then starts the application and the agent hierarchy. After startup, a command is propagated down to the analysis
agents to start the search.
The search is performed in one or more experiments. Most applications in HPC have an iterative behavior, e.g., a loop where each iteration performs the next time step of the simulated time. If the application has such an iterative phase, a single execution of the phase is an experiment. If such a phase is missing or not marked by the programmer, the whole program is executed for an experiment [13].
The search is performed according to a search strategy selected when the frontend is started. The strategy defines an initial set of hypotheses, i.e., properties that are to be checked in the first experiment, as well as the refinement from found properties to a new set of hypotheses. The agents start from the set of hypotheses, request the necessary information for proving the hypotheses via MRI, release the application for a single execution of a repetitive program phase, retrieve the information from the monitor after the processes have been suspended again, and evaluate which hypotheses hold. If necessary, the found hypotheses may be refined and the next execution evaluation cycle performed [13].
The strategies analyzing single-node performance are multi-step strategies; they typically go through multiple refinement steps. The strategy used for analyzing the MPI behavior is a single-step strategy.
At the end of the local search, the detected performance properties are reported back via the agent hierarchy to the frontend. The communication agents
combine similar properties found in their child agents and forward only the
combined properties.
5.2. Periscope Tuning Framework (PTF)
The AutoTune project focuses on extending Periscope into the Periscope Tuning Framework, combining performance and energy efficiency analysis with automatic tuning plugins. The Periscope Tuning Framework (PTF) is an extension of the automatic online performance analysis tool Periscope. PTF identifies tuning alternatives based on codified expert knowledge and evaluates the alternatives within the same run of the application (online), dramatically reducing the overall search time for a tuned code version. The application is executed under the control of the framework in either interactive or batch mode. During the application's execution, the analysis is performed and the found performance and energy
properties are forwarded to tuning plugins that determine code alternatives
and evaluate different tuned versions. At the end of the application run, detailed recommendations are given to the code developer on how to improve
the code with respect to performance and energy consumption [6].
5.2.1. PTF main components
Figure 5.2 [6] outlines the main PTF components. The Eclipse-based graphical user interface allows the user to investigate the results of a PTF tuning run. It visualizes the performance and energy properties as well as the tuning recommendations. The PSC Frontend controls the entire execution. The main new components that extend the Periscope Frontend are the tuning plugins, the search algorithms, and the Scenario Execution Engine. The agent hierarchy is composed of a master agent, several high-level agents, and the analysis agents. The analysis agents provide a new tuning strategy that configures the MRI Monitor linked to the application with tuning actions and runtime measurements for the evaluation of code alternatives.

Figure 5.2.: PTF Main Components [6]
5.2.2. PTF Repository Structure
The PTF repository contains the code for the components introduced in the previous section, except the graphical user interface. Figure 5.3 [6] outlines its directory structure. It consists of:
• frontend: Covering the files of the frontend.
• aagent: Covering the files of the analysis agent, including the performance properties for different programming models and target systems as well as the analysis strategies.
• hagent: Covering the files for the master and the high level agents. In
fact, the master agent is a high level agent from a code point of view.
• mrimonitor: Covering the files of the PTF monitor that implements the
Monitoring Request Interface (MRI).
• util: Covering files common to multiple components.
• autotune: Covering the files for the tuning plugins and search algorithms.
It has three major subdirectories:
– datamodel: Files implementing base concepts for tuning in PTF, e.g.,
tuning points.
– plugins: Code specific to a tuning plugin goes into a tuning plugin-specific directory.
– searchalgorithms: Files implementing generic search algorithms for
the tuning plugins.
Figure 5.3.: PTF Repository [6]
5.2.3. PTF Plugins
Periscope has been extended by a number of tuning plugins that fall into two
categories: online and semi-online plugins. An online tuning plugin performs
transformations to the application and/or the execution environment without
requiring a restart of the application; a semi-online tuning plugin is based on a
restart of the application but without restarting the agent hierarchy [27].
Figure 5.4 [27] illustrates the control flow in PTF. The tuning process starts with a preprocessing of the application source files, which performs instrumentation and static analysis. Periscope is based on source-level instrumentation for C/C++ and Fortran. The instrumenter also generates a SIR file (Standard Intermediate Representation) that includes static information such as the instrumented code regions and their nesting. When the preprocessing is finished, the tuning can be started via the Periscope frontend, either interactively or in a batch job. As in Periscope, the application is started by the frontend before the agent hierarchy is created.
Periscope uses an analysis strategy, e.g., for MPI, OpenMP, or single-core analysis, to guide the search for performance properties. This overall control strategy now becomes part of a higher-level tuning strategy. The tuning strategy controls the sequence of analysis and tuning steps. Typically, the analysis determines the application properties to guide the selection of a tuning plugin as well as the tuning actions performed by the plugin. After the plugin finishes, the tuning strategy might restart the same or another analysis strategy to continue with further tuning [27].
Once the tuning process is finished, PTF generates a tuning report documenting the remaining properties as well as the recommended tuning actions. These tuning actions can then be integrated into the application so that subsequent production runs will be more efficient.
Tuning Plugin Design
Given the number of programming models, parallel patterns, and hardware targets to be supported by PTF, it provides a sufficiently generic tuning plugin design. This section describes some of the terminology used in the plugin design.
The tuning plugins try to improve the application execution by influencing certain tuning points.
A tuning point TP = {v_1, v_2, ...} is a feature for influencing the execution of a region. Each tuning point has a name and either an enumeration type or an interval of integer values with a stride. For example, a tuning point is the clock frequency of the CPU, which determines the overall energy consumption [27].
All tuning points of a tuning plugin define a multidimensional tuning space. The tuning space of a tuning plugin P is the cross product of the individual tuning points, i.e., TS_P = TP_1 × TP_2 × ... × TP_k [27].
Figure 5.4.: Tuning Control Flow [27]

For a program region, the tuning plugin will select a set of variants that may lead to a potential improvement and that need to be evaluated by experiments. The variant space VS_r of a program region r is a subset of the overall tuning space, i.e., VS_r ⊆ TS_P. A variant of a code region r is a concrete vector of values for the region's tuning points v_r = (v_1, ..., v_k) [27].
The variant space is explored by a search strategy to optimize certain objectives. An objective is a function obj: REG_appl × TS_P → R, where REG_appl is the set of all regions in the application. A single objective or multiple objectives are to be optimized by the tuning plugin for a given program region over the region's variant space. The tuning plugin creates a sequence of tuning scenarios that are executed by Periscope to determine the values of one or more objectives.
A tuning scenario is a tuple sc_r = (r, v_r, {obj_1, ..., obj_n}) where r is the program region, v_r ∈ VS_r is a variant of the region's variant space, and obj_1, ..., obj_n are the objectives to be evaluated [27].
During the execution of a tuning scenario, tuning actions are executed to select the individual values of the tuning points. A tuning action TA_i is executed for each tuning point TP_i with 1 ≤ i ≤ k during the execution of a tuning scenario. It enforces the value v_i for tuning point i given by the variant v_r = (v_1, ..., v_k) [27].
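To illustrate these definitions, here is a minimal data-model sketch in C++. The type and field names are illustrative only and do not correspond to the actual PTF classes.

```cpp
#include <string>
#include <vector>

// A tuning point: a named integer interval with a stride,
// e.g., "NUMTHREADS" over [1, 16] with stride 1.
struct TuningPoint {
    std::string name;
    int min, max, stride;
};

// A variant assigns one concrete value to each tuning point of a region.
using Variant = std::vector<int>;

// A tuning scenario ties a region to one variant and the objectives
// (e.g., "EnergyConsumption", "ExecutionTime") to be measured for it.
struct Scenario {
    std::string region;
    Variant variant;
    std::vector<std::string> objectives;
};
```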
Tuning Plugin Control Flow
The PTF frontend controls the overall tuning process. For auto-tuning, the frontend enforces a predefined sequence of operations that are implemented by the tuning plugins.
The predefined sequence of operations has to fulfill the requirements of all tuning plugins developed in AutoTune and is therefore quite complex. In this section, we present a simplified version, shown in Figure 5.5 [27]. All steps are involved in creating and processing the scenarios that need to be evaluated by experiments. Scenarios are stored in pools that are accessed and shared by the plugins as well as the frontend.
• Created Scenario Pool (CSP): Scenarios that were created by a search algorithm.
• Prepared Scenario Pool (PSP): Scenarios that are already prepared for
execution.
• Experiment Scenario Pool (ESP): Scenarios that are selected for the next
experiment.
• Finished Scenario Pool (FSP): Scenarios that were executed.
Figure 5.5 [27] presents the sequence of steps followed by a tuning plugin.
1. Initialization: First, the plugin is initialized and the tuning points are
created.
2. Scenario Creation: From the defined tuning space, the plugin creates
the scenarios and inserts them into the CSP. Here, the plugin first selects
the variant space to be explored. It then creates the individual scenarios, which combine the region, a variant, and the objectives, either via
a generic search algorithm, e.g., exhaustive search, or by its own search
algorithm.
3. Scenario Preparation: Scenarios are selected from the CSP, prepared, and moved into the PSP. The preparation of scenarios typically covers tuning actions that cannot be executed at runtime, e.g., recompilation with a certain set of compilation flags or generation of special source code for the scenario's variant. Only the plugin can decide whether certain scenarios can be prepared at the same time. For example, two scenarios requesting different compiler flag combinations for the same file cannot be prepared at the same time. If no preparation is required, the plugin simply copies all the created scenarios to the PSP.
4. Define Experiment: A subset of the prepared scenarios is then selected
for the next experiment and moved into the ESP. When the plugin selects the scenarios for the next experiment it has to take constraints into
account. For example, different scenarios for the same program region
cannot be executed in the same experiment unless they can be assigned,
for example, to different processes of the MPI application. The assignment of scenarios to processes or threads is decided by the plugin in this
step.
5. Experiment Execution: The Scenario Execution Engine (SEE) is responsible for executing the experiment. It first checks with the plugin whether a restart of the application is necessary to implement the tuning actions. For example, the scenarios generated by the MPI tuning plugin explore certain parameters of the MPI runtime environment, which can only be set via environment variables before launching the application. After the potential restart of the application, the SEE runs the experiment by releasing the application for a phase, i.e., the execution of the phase region. If multiple phases are required to gather all the measurements for the objectives, the SEE automatically takes care of that. It will even restart the application if it terminates before all the measurements have finished. At the end of this step, the executed scenarios are moved into the FSP and the objectives are returned to the plugin.
6. Process Results: The plugin accesses the objectives, which are implemented as standard Periscope properties. Each objective specifies its scenario. The objectives' values are then used to select the best scenario and return the tuning recommendation.
Figure 5.5.: Tuning plugin control and data flow [27]
Part II.
Design and Implementation
6. PCAP Plugin
This chapter gives a detailed description of the implementation of the energy tuning plugin developed during the course of this thesis. The plugin is named PCAP, abbreviated from Power Capping.
6.1. Tuning Plugin Interface (TPI)
First, the major methods of the Tuning Plugin Interface (TPI) are described. These methods must be implemented by all plugins, and their conformance is checked when the plugin is loaded.
6.1.1. Initialize
After the frontend has initialized itself and is ready to start the tuning process, it loads the plugin specified by the user. Before the plugin can be utilized, it needs to be instantiated and initialized: the frontend instantiates the plugin and then invokes this method. In this method, the plugin needs to set up its internal data structures for tuning points.
6.1.2. Start Tuning Step
In this method, the plugin needs to set up its internal data structures for the
tuning space, search algorithms to be used, and the objectives.
6.1.3. Create Scenarios
After the plugin has initialized its data structures and search algorithm, the next step is to create scenarios. The plugin generates the scenarios using a search algorithm and inserts them into the CSP, so that the frontend has access to them. The search algorithm might go through multiple rounds of scenario generation; the selection of the scenarios generated in the next round might depend on the objective values for the scenarios of the previous round. Before the frontend calls the final method to process the results, it checks whether the search algorithm needs to generate additional scenarios. If so, the frontend triggers an additional iteration of creation, preparation, and execution of scenarios.
6.1.4. Prepare Scenarios
Some scenarios require preparation before experiments can be executed. If a set of scenarios needs preparation, it should be done in this method. However, if no preparation is necessary, the Prepare Scenarios method can simply move the scenarios from the CSP to the PSP.
After the execution of an experiment, the frontend checks whether the CSP is empty. If there are still scenarios, the frontend calls the Prepare Scenarios method again.
6.1.5. Define Experiment
Once generated and prepared, the scenarios need to be assembled into an experiment. An experiment goes through at least one execution of the phase region of the application. There are two ways to execute multiple scenarios in a single experiment: either they can be assigned to a single process because they affect different regions, or they can be assigned to different processes. Only the plugin can decide whether this is possible. Therefore, the frontend calls the Define Experiment method to decide which scenarios are executed in the next experiment and to assign the executing process to each scenario. Scenarios selected by the plugin for the next experiment are moved from the PSP to the ESP.
After the plugin has defined the experiment, the frontend transfers control to the Scenario Execution Engine, which forwards the scenarios to the analysis agents and triggers the experiment. At the end of the experiment, the objectives of the scenarios are returned to the plugin.
After the execution of an experiment, the frontend checks whether there are additional prepared scenarios in the PSP, and if so calls the Define Experiment method again to evaluate the next set of scenarios.
6.1.6. Get Restart Info
This method is called by the Scenario Execution Engine and returns true if a restart of the application is necessary for the execution of the experiment. For example, a restart is necessary if the application was recompiled according to the scenario with a special combination of compiler flags.
It also permits returning parameters for the application launch command, e.g., if a scenario requires certain configuration parameters of the MPI library to be set during the launch of the application.
6.1.7. Process Results
If the CSP is empty and the search algorithm is finished, the frontend calls the Process Results method. In this method, the plugin analyzes the acquired properties and either indicates that extra tuning steps are necessary or indicates that it is finished and generates the tuning advice for the user.
6.2. PCAP Plugin
The plugin tunes the energy consumption and the execution time of OpenMP-based applications at runtime by changing the number of threads used for the execution of the parallel regions in the application. The PCAP plugin has been designed to perform tuning in two steps. In the first step, the plugin performs a speedup analysis of the application by executing it and measuring the scalability of the parallel regions. The result of this first step is used to shrink the search space used in the second step. In the second step, the tuning action is applied to find the optimal energy delay product for the application. The energy delay product captures the tradeoff between the energy consumption and the execution time of the application. Depending on the requirements of the user, more weight can be given to energy consumption or to execution time.
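For reference, with E the energy consumed and T the execution time of a region, the three metrics used in this work can be written as:

    \mathrm{EDP} = E \cdot T, \qquad \mathrm{ED^2P} = E \cdot T^2, \qquad \mathrm{ED^3P} = E \cdot T^3

The higher powers of T give more weight to the execution time and therefore yield more performance-centric optima.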
6.2.1. Tuning Objective
The main tuning objective of this plugin is to reduce the energy delay product of OpenMP-based applications. To optimize the EDP of the application, the tuning action is applied to the OMP parallel regions. The execution time property and the energy consumption property of the region are returned as the result of the tuning action. Both properties are used to calculate the EDP, and an optimal value is suggested to the user.
6.2.2. Tuning points
For the PCAP plugin we define one tuning point:
• Number of Threads It is simply an integer value specifying the number
of threads used for executing the OMP parallel regions. In case of SuperMUC, as described in Chapter 2, each compute node consists of two
sockets and each socket has 8 cores. So to limit the application execution
to a single node a range of 1-16 threads has been used for all measurements.
The following subsections describe the implementation of the PCAP plugin in terms of its Tuning Plugin Interface (TPI) methods.
6.2.3. Initialize
As mentioned in Section 6.1.2, the plugin must create the following data structures at initialization time: tuning points, search algorithms, and objectives. In this plugin, a single tuning point named “NUMTHREADS” is created for specifying the number of threads used to execute the parallel regions, as sketched below.
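A minimal sketch of this initialization is shown below; the types and names are simplified for illustration and do not reproduce the actual PTF interface.

    // Illustrative sketch (not the verbatim PTF API): create the single
    // tuning point "NUMTHREADS" with the 1-16 thread range used on SuperMUC.
    #include <string>
    #include <vector>

    struct TuningParameter {
        std::string name;
        int min, max, step;                  // integer variant range
    };

    std::vector<TuningParameter> tuningParameters;

    void initializePcap() {
        tuningParameters.push_back({"NUMTHREADS", 1, 16, 1});
    }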
6.2.4. Start Tuning Step
This method is subdivided into two steps, corresponding to the two-step tuning process of the plugin.
• StartTuningStep1SpeedupAnalysis
• StartTuningStep2EnergyTuning: In this method the search algorithm is selected, which in this case is the exhaustive search algorithm. Search spaces are created for each parallel region, a variant space is added to each search space, and the search spaces are then added to the search algorithm. The variant space is simply a range of integers for NUMTHREADS.
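Since the exhaustive search later enumerates the full cross product of these search spaces (Chapter 7), the number of scenarios grows exponentially with the number of regions: for S search spaces with variant spaces V_s,

    |\mathrm{Scenarios}| = \prod_{s=1}^{S} |V_s| = V^S \quad \text{when every search space has } V \text{ variants}

which is exactly why the speedup analysis step is used to shrink the per-region variant ranges.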
6.2.5. Create Scenarios
This method is subdivided into two steps, corresponding to the two-step tuning process of the plugin.
• CreateScenarios1SpeedupAnalysis
• CreateScenarios2EnergyTuning: Scenarios are created by the exhaustive search algorithm. For the available search spaces, the cross product is formed recursively to create the scenarios. Details of the implementation of creating scenarios are given in Chapter 7.
6.2.6. Prepare Scenarios
This method is subdivided into two steps, corresponding to the two-step tuning process of the plugin.
• PrepareScenarios1SpeedupAnalysis
• PrepareScenarios2EnergyTuning: Since no recompilation is needed, the created scenarios are moved directly to the PSP.
6.2.7. Define Experiment
This method is subdivided into two steps, corresponding to the two-step tuning process of the plugin.
• DefineExperiment1SpeedupAnalysis
• DefineExperiment2EnergyTuning: In this case, every scenario defines an execution of the region with a different value of the manipulated variable. The experiments are executed to request the energy and execution time properties. When the agent network has finished the experiment, the objectives are propagated to the frontend. The frontend then places them in a properties pool that is available to both the plugin and the search algorithm. The frontend also moves all the scenarios from the ESP to the FSP.
6.2.8. Get Restart Info
This method returns false to indicate that no restart is required.
6.2.9. Process Results
The optimal energy delay product combinations for the phase region, i.e. EDP, ED2P and ED3P, are selected and provided to the user as a recommendation. In addition, a detailed summary of all results for the created scenarios is provided in the form of a table for evaluation by the user. Three objective functions for calculating EDP, ED2P and ED3P are implemented and used to present the results to the user.
6.2.10. Objective Functions
The objective functions are implemented to calculate the EDP, ED2P and ED3P for each scenario. In the case of PCAP, the scenarios are run on the phase region with different numbers of threads. The first scenario is always run with one thread, so the objective functions are normalized with respect to the first scenario, i.e. the first scenario with one thread is taken as the base case.
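With scenario 1 (one thread) as the base case, the normalized objective functions can be written as:

    f_k(i) = \frac{E_i}{E_1} \cdot \left( \frac{T_i}{T_1} \right)^k, \qquad k \in \{1, 2, 3\}

where E_i and T_i are the energy and execution time of scenario i, so that f_k(1) = 1 for the single-threaded base case and k selects EDP, ED2P or ED3P.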
7. Exhaustive Search
The purpose of a search algorithm is to create combinations of the search spaces and the variant spaces provided for a given application. Each combination is called a “Scenario” in PTF. A Scenario is a list of Tuning Specifications. Each Tuning Specification is a tuple containing a variant value and a variant context, where a variant context can be a single region in the application, a list of regions, or the entire application. In the case of exhaustive search, all possible scenarios are created and explored.
The entire implementation of the exhaustive search algorithm is divided into the following methods:
7.1. Add Search Space
This method simply creates a data structure to hold the search spaces created by the plugin. Each search space is a tuple containing one or more regions and a variant space, i.e. Search Space (SS) = {Region(s), Variant Space}.
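In code, this data structure can be as simple as the following sketch; the types are illustrative, not the actual PTF declarations.

    // Illustrative sketch: a search space pairs one or more regions with
    // the variant space that should be explored for them.
    #include <vector>

    struct Region;                          // opaque handle to a code region

    struct SearchSpace {
        std::vector<Region*> regions;       // the region(s) this space applies to
        std::vector<int>     variantSpace;  // e.g. NUMTHREADS values 1..16
    };

    std::vector<SearchSpace> searchSpaces;  // filled by Add Search Space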
7.2. Create Scenarios
The purpose of this method is to create all possible combinations of the search spaces and their variants. To do so, it forms the cross product of all search spaces and their variant spaces. This is done recursively, at two levels: the first level iterates over the search spaces, and the second level iterates over the variant space of the respective search space. A nested recursive algorithm has thus been implemented to create the cross product. In each step of the first level, one Tuning Specification is created after a complete pass of the second level. After the first complete pass of the first level, all the Tuning Specifications created so far are gathered to form one Scenario. The recursive calls then return one by one, creating up to (Number of Variants)^(Number of SS) Scenarios. The recursive algorithm is graphically depicted in Figure 7.1 and sketched in code below.
Figure 7.1.: Graphical Representation of Recursive Algorithm for Creating Scenarios
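The following is a simplified sketch of this nested recursion, building on the SearchSpace structure sketched in Section 7.1; it illustrates the algorithm of Figure 7.1 rather than reproducing the actual implementation.

    // Illustrative sketch of the recursive cross product. The outer level
    // recurses over the search spaces, the inner level iterates over the
    // variants of the current search space; each complete path through all
    // search spaces yields one list of Tuning Specifications, i.e. one
    // Scenario, which is copied into the CSP.
    struct TuningSpecification {             // (variant value, variant context)
        std::vector<Region*> regions;        // the variant context
        int                  value;          // e.g. the number of threads
    };
    struct Scenario { std::vector<TuningSpecification> specs; };
    std::vector<Scenario> csp;               // Created Scenario Pool

    void iterate(size_t level, std::vector<TuningSpecification>& current) {
        if (level == searchSpaces.size()) {  // one full combination completed
            csp.push_back(Scenario{current}); // copy avoids dangling pointers
            return;
        }
        for (int v : searchSpaces[level].variantSpace) {
            current.push_back({searchSpaces[level].regions, v});
            iterate(level + 1, current);     // recurse into next search space
            current.pop_back();              // backtrack to try the next variant
        }
    }

Calling iterate(0, specs) with an empty list enumerates every combination exactly once, producing the full cross product.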
7.3. Iterate Search Spaces
This method implements the first level of recursion in the recursive algorithm of Figure 7.1. The iteration is carried out over all the search spaces recursively.
7.4. Iterate Tuning Points
This method implements the second level of recursion in the recursive algorithm of Figure 7.1. For each search space, iterations are done over the entire
variant space recursively.
7.5. Generate Scenarios
The list of Tuning Specifications created after each pass of the recursion is passed to this method. The method clones the list, to avoid pointer references to deleted objects, and creates a local copy of the Tuning Specifications. This local copy is then used to create a Scenario. The created Scenarios are added to the Created Scenario Pool (CSP).
7.6. Search Finished
Once all the scenarios have been executed and the requested properties have been added to the property pool by the frontend, this method is invoked by the search engine. It calls the requested objective function, obtains the results, finds the optimal value, and returns it to the plugin for display to the user.
Part III.
Experiments and Results
8. Experimental Analysis
In this chapter, we discuss the evaluation of our energy tuning plugin (PCAP) on SuperMUC, which has been carried out through several sets of experiments. We begin with a brief description of the experimental setup. For the first set of experiments, we analyze the scalability of the NAS Parallel Benchmarks for the user region on SuperMUC, pointing out configurations which allow for optimal performance and power consumption. Then, we evaluate the speedup analysis step of the plugin by first estimating the static and dynamic power per core of a SuperMUC compute node and then, using the speedup formula, shrinking the search space used to evaluate the energy consumption for multiple regions. Finally, we discuss the experiments in which the NPB benchmarks are evaluated for multiple parallel regions inside the user region in order to find the global optimum for the entire user region, in terms of both performance and energy benefits.
8.1. NAS Parallel Benchmarks (NPB)
The OpenMP version of NPB 3.3 has been used for the evaluation of PCAP. The NPB were derived from CFD codes. They were designed to compare the performance of parallel computers and are widely recognized as a standard indicator of computer performance. NPB consists of five kernels and three simulated CFD applications derived from important classes of aerophysics applications. These five kernels mimic the computational core of five numerical methods used by CFD applications. The benchmarks are specified only algorithmically (“pencil and paper” specifications) and referred to as NPB-1. Details of the NPB-1 suite can be found in [11], but for completeness of discussion we outline the five benchmarks that have been used in this thesis to evaluate PCAP.
• BT is a simulated CFD application that uses an implicit algorithm to solve the 3-dimensional (3-D) compressible Navier-Stokes equations. The finite-difference solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y and z dimensions. The resulting systems are block-tridiagonal with 5×5 blocks and are solved sequentially along each dimension [21].
• SP is a simulated CFD application that has a structure similar to BT. The finite-difference solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y and z dimensions. The resulting system consists of scalar pentadiagonal bands of linear equations that are solved sequentially along each dimension [21].
• LU is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve the seven-block-diagonal system resulting from a finite-difference discretization of the Navier-Stokes equations in 3-D, by splitting it into block Lower and Upper triangular systems [21].
• CG uses a Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries [21].
• EP is an Embarrassingly Parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish a reference point for the peak performance of a given platform [21].
8.2. Results on SuperMUC Compute Node (16 processing cores)
We begin by profiling the scalability of all applications in the benchmark suite and determining its effect on performance and energy consumption. To do so, the applications are executed with variable concurrency, ranging from one to sixteen threads, and bound to processor/core combinations as shown in Figure 8.1. The notation (X, Y) denotes non-adaptive (static) execution with X × Y threads bound to X processors and Y cores per processor; for example, (2, 8) denotes 16 threads using all 8 cores of both processors. The cores marked in green are being used by the threads. The bindings shown are not an exhaustive collection of the bindings possible on this machine. The execution time and the energy are the two properties requested for the user region by the plugin. These properties are then used to calculate the objective functions. The execution time is simply the wall clock time required for the execution of the user region. The energy is measured using the EnOpt library provided by LRZ on SuperMUC. We also evaluate the objective functions for all the benchmarks, showing that all three objective functions (EDP, ED2P and ED3P) are a very good depiction of the optimal configuration for the given benchmark.
Figure 8.1.: Thread to Processor/Core Bindings.
8.3. First Experiment Set
In the first set of experiments, the NPB benchmarks are run on SuperMUC in Periscope, applying the PCAP plugin for tuning. The benchmarks are run for a range of 1-16 threads on a compute node of SuperMUC. The execution time and energy properties for the user region are measured. Using these properties, the objective functions are calculated and the results are returned, suggesting the optimum. All the benchmarks are run for problem sizes W, S, A, B and C. These experiments are used to analyze the scalability pattern of the benchmarks. The performance analysis shows that there are three categories of benchmarks. First are those applications that achieve a reasonable speedup through the utilization of additional cores (BT, EP, LU and CG). BT, EP, LU and CG show that the machine allows linear speedup with appropriately written code, with speedups of 2.1x, 2.5x, 2.0x and 2.12x respectively. They also reduce their energy consumption proportionally, indicating an optimal use of the cores. Second are applications that neither substantially gain nor lose performance from higher concurrency (LU-HP and SP). Lastly, there are applications that incur a non-negligible performance loss when using more cores. MG falls into this category, losing as much as 1.16x performance over single-threaded execution. This shows that this benchmark is essentially memory-bound and loses performance due to a high degree of contention for this resource with increased concurrency. The most energy-efficient configuration coincides with the most performance-efficient configuration for 4 out of the 7 benchmarks (BT, EP, LU and CG). For 2 benchmarks (LU-HP and SP), the user can use fewer than the performance-optimal number of cores to achieve substantial energy savings at a marginal performance loss.
Figures 8.2, 8.3, 8.4, 8.5, 8.6 and 8.7 show the results of the first set of experiments for BT, EP, LU, CG, LU-HP and SP respectively. Each of these figures has two graphs. In each figure, the graph on the right shows the execution time (bars) and energy consumption (lines) of the respective benchmark, and the graph on the left shows the three objective functions for the corresponding benchmark. In the right-hand graph, the configurations with the best performance and energy for each benchmark are marked with a blue gradient and a large diamond respectively. In the left-hand graph, the optima for the three objective functions are marked with a large diamond. In each of these figures, the subfigures (a), (b), (c), (d) and (e) show the results for the problem sizes A, B, C, W and S respectively for the respective benchmark.
Figure 8.2.: BT results for the entire user region for problem sizes A, B, C, W and S (subfigures (a)-(e): BT.A, BT.B, BT.C, BT.W, BT.S)
Figure 8.3.: EP results for the entire user region for problem sizes A, B, C, W and S (subfigures (a)-(e): EP.A, EP.B, EP.C, EP.W, EP.S)
Figure 8.4.: LU results for the entire user region for problem sizes A, B, C, W and S (subfigures (a)-(e): LU.A, LU.B, LU.C, LU.W, LU.S)
Figure 8.5.: CG results for the entire user region for problem sizes A, B, C, W and S (subfigures (a)-(e): CG.A, CG.B, CG.C, CG.W, CG.S)
Figure 8.6.: LU-HP results for the entire user region for problem sizes A, B, C, W and S (subfigures (a)-(e): LUHP.A, LUHP.B, LUHP.C, LUHP.W, LUHP.S)
Figure 8.7.: SP results for the entire user region for problem sizes A, B, C, W and S (subfigures (a)-(e): SP.A, SP.B, SP.C, SP.W, SP.S)
The results in Figures 8.2 - 8.7 show that the scalability of an application is a good indicator of its energy usage. When an application scales well, i.e. in the case of BT, EP, LU and CG, the optima for execution time and energy usage coincide. This also corresponds to the optimum of the three objective functions. For LU and CG, it can easily be seen that the scalability is linear for the larger problem sizes (A, B, C and W) but very poor for problem size class S. The class S results for LU and CG thus show that, as more cores are added, both the energy and the execution time increase because of the poor scalability.
On the other hand, when an application does not scale well, i.e. in the case of LU-HP and SP, the optima for energy and execution time do not coincide. The optima of the objective functions show that these applications can be run with fewer threads to obtain substantial energy savings while compromising the execution time only marginally. When the user is more interested in performance, ED2P or ED3P can be used instead of EDP to determine the optimum, because both of these objective functions give more weight to the execution time. This is shown in the results of the benchmarks LU-HP and SP in Figure 8.6 and Figure 8.7. The markers in the graphs clearly show that the energy-delay-product-based objective functions are a good measure for the energy tuning framework, as they point to the right scenario regardless of whether the application has good scalability properties or not.
This first set of experiments has been performed with all iterations in one scenario: the application restarts after each scenario and performs all iterations within each scenario. Since the application has to be restarted every time, running the tests takes a lot of time. To avoid this, we have performed the next set of experiments.
8.4. Second Experiment Set
This set of experiments is based on the hypothesis that initializing the data structures with a smaller number of threads and running the scenarios with more threads may have an impact on the execution time because of the memory distribution. To evaluate this, two test cases have been examined. In the first test case, experiments have been performed on the BT benchmark by running one scenario per iteration of the main computation loop, with and without re-initializing for every scenario. Figure 8.8 graphically depicts the first test case. The experiments have been performed on BT with problem size CLASS = B.
Figure 8.8.: Test Case 1: (a) Initialization outside user region. (b) Initialization
inside user region
The second test case examines whether (a) initializing with, say, 16 threads and running with four threads gives a different result than (b) initializing with four threads and running with four threads. For the benchmark BT, both test cases show that there is no difference between case (a) and case (b). From the results of these experiments, it has thus been confirmed that the further experiments can be run without restarting the application for every scenario. This makes the tests run fast, which is critical when the number of scenarios is large, because the search algorithm used in PCAP is the exhaustive search algorithm.
8.5. Third Experiment Set
The third set of experiments is conducted to find out whether there can be a global optimum for multiple parallel regions. The parallel regions inside the user region are taken, and for all these regions the cross product is created as described in Chapter 7 under exhaustive search. The experiments have been performed for the benchmark CG with problem size CLASS = A. To select only the parallel regions in the user region, the regions which are not of interest, i.e. the ones in the initialization etc., have been excluded manually by editing the .sir file produced by Periscope. There are three parallel regions inside the user region of benchmark CG. The experiments have been performed with a range of 1-16 threads. The cross product produces (Number of Threads)^(Number of Parallel Regions) scenarios; in the case of CG, 16^3 = 4096 scenarios have been created. The results of this experiment are shown in Figure 8.9.
The results for the execution time show that the optimal scenario is scenario number 1685 ((R1, 7), (R2, 10), (R3, 6)). In this particular case, one of the regions takes twice as long as the other two, but all three regions show good scalability up to 10 threads, after which the scalability degrades. This is also visible in the graph of Figure 8.9. The result shows a periodic behavior: the EDP decreases up to 10 threads within the cross product and then starts increasing. It can thus be seen that, by using PCAP, we can find the optimal number of threads for each parallel region in the application.
As the result repeats itself after every 256 scenarios, R2 is the most influential region in this case and plays the decisive role in determining the configuration with optimal energy and performance.
Figure 8.9.: CG Results for Multiple Regions Experiment
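The scenario numbering can be made explicit with a small decoding sketch. Assuming zero-based scenario IDs and an enumeration in which R1 is the slowest-varying "digit" of the cross product (an assumption that is consistent with the reported optimum), scenario 1685 decodes to 7, 10 and 6 threads for R1, R2 and R3:

    // Illustrative sketch: decode a zero-based scenario ID of the
    // 16 x 16 x 16 cross product into per-region thread counts,
    // assuming R1 varies slowest and R3 varies fastest.
    #include <array>
    #include <cstdio>

    std::array<int, 3> decode(int id) {
        int r3 = id % 16; id /= 16;        // fastest-varying digit
        int r2 = id % 16; id /= 16;
        int r1 = id % 16;                  // slowest-varying digit
        return {r1 + 1, r2 + 1, r3 + 1};   // thread counts start at 1
    }

    int main() {
        std::array<int, 3> t = decode(1685);   // yields {7, 10, 6}
        std::printf("R1=%d R2=%d R3=%d\n", t[0], t[1], t[2]);
        return 0;
    }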
8.6. Fourth Experiment Set
All three sets of experiments described above have been performed with the tuning step of the PCAP plugin, without first applying the speedup analysis step to shrink the search space. The speedup analysis step first requires the validation of the energy consumption formula shown in Equation 8.1 below:
E = P_{static} \cdot T + P_{dynamic} \cdot N_{threads} \cdot T \qquad (8.1)
where
E = energy consumed by the application,
P_{static} = static power of the SuperMUC compute node,
P_{dynamic} = dynamic power per core for a compute-intensive application,
N_{threads} = number of threads (i.e. active cores),
T = execution time of the application.
The static and dynamic power per core of a SuperMUC compute node have been estimated by running a compute-intensive application on a compute node. The application is run with 8 threads, which are pinned to the 8 cores of Package 0. The application always gets assigned a whole compute node with 16 cores; we run the application on 8 cores, so the other 8 cores are not used by the application. The energy of Package 0 and Package 1 of the compute node is measured. The energy of Package 1 represents the static energy of a socket. Subtracting the energy of Package 1 from the energy of Package 0 gives the dynamic energy for the entire socket, and dividing this by 8 gives the dynamic energy per core. The measurements have been done for six different frequencies (1.2 GHz, 1.5 GHz, 1.8 GHz, 2.1 GHz, 2.4 GHz and 2.7 GHz).
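The arithmetic behind this estimation is summarized in the following sketch; the energy and time values are placeholders, not measured data.

    // Illustrative sketch of the static/dynamic power estimation. Package 1
    // is idle, so its power approximates the static power of one socket; the
    // difference to the loaded Package 0 is the dynamic power of the 8 busy
    // cores.
    #include <cstdio>

    int main() {
        double energyPkg0 = 4000.0;  // J, loaded socket (placeholder value)
        double energyPkg1 = 1500.0;  // J, idle socket   (placeholder value)
        double runtime    = 60.0;    // s                (placeholder value)

        double pStaticSocket   = energyPkg1 / runtime;
        double pDynamicSocket  = (energyPkg0 - energyPkg1) / runtime;
        double pDynamicPerCore = pDynamicSocket / 8.0;  // 8 cores per socket

        std::printf("P_static(socket) = %.1f W, P_dynamic/core = %.1f W\n",
                    pStaticSocket, pDynamicPerCore);
        return 0;
    }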
The power is then calculated from the measured energy and the execution time of the application, and this power is used to validate the formula. For the validation, the BT benchmark has been used. The BT benchmark has been run with the Energy Policy Tag set to none, which ensures that the application runs at a frequency of 2.3 GHz. When checking the results against the energy consumption formula, the values do not add up exactly: the calculated value is always higher than the measured value, although it is in the right range. This part needs more work to determine the correct static and dynamic power per core for the SuperMUC compute node.
The idea is to first estimate the speedup of an application using the energy consumption estimate of the application, to retain only the range of variant values which gives a good speedup, and to pass this speedup analysis to the tuning step, which then performs the actual tuning. Each parallel region will use its own specific range of variant values, making the search space small. As the search algorithm being used is exhaustive search, it is crucial to shrink the search space in order to run the tuning efficiently.
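One way to make this concrete is to combine Equation 8.1 with a measured speedup S(n): with T(n) = T(1)/S(n), the model predicts

    E(n) = \left( P_{static} + n \cdot P_{dynamic} \right) \cdot \frac{T(1)}{S(n)}

so the variant range for a region can be cut off where S(n) stops growing fast enough for E(n), and hence the EDP, to keep improving. This is a sketch of the intended use, not the final implementation.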
9. Conclusion
In this thesis, a tuning plugin for performance and energy efficiency optimization has been developed. The plugin has been developed inside an existing framework, the Periscope Tuning Framework (PTF). Periscope is a performance analysis tool which has been developed at the chair of computer architecture and high performance computing at TUM. Periscope has been extended into PTF in order to use the performance analysis results generated by Periscope for tuning parallel applications. For this purpose, multiple tuning plugins have been developed inside PTF, each of which targets one category of parallel applications.
In this work, we have developed a plugin named PCAP which focuses mainly on the energy and performance optimization of OpenMP-based applications. The OpenMP version of the NAS Parallel Benchmarks has been used to test PCAP.
The plugin specifies one tuning parameter, named “Number of Threads”, for the application. This parameter specifies with how many threads the main kernel (user region) of the application will be executed. The plugin creates the search space of the possible scenarios which have to be evaluated. The exhaustive search algorithm is used to create the scenarios. Two properties, namely ‘Energy’ and ‘Execution Time’, are requested by the plugin for the user region. These properties are then used to calculate the objective functions.
Three objective functions are used in the plugin to evaluate the tuning configurations. The energy delay product (EDP) is used as the objective function that tells the user the optimal configuration for the application. If the user desires a more performance-centric configuration, then the other two objective functions, ED2P and ED3P, give the optimal configuration.
The results of the tests show that, when the application under consideration has good scalability properties, the optima for the energy consumed and the execution time coincide, and the EDP gives exactly the same optimal configuration as the best one. If the application shows poor scalability, the optima for energy consumption and execution time do not coincide. In this case the EDP gives the configuration with optimal energy usage at a marginal loss in performance.
It has also been shown that the memory distribution does not have a significant effect when the number of threads used while initializing the memory differs from the number of threads used to run the kernel of the application. This means that, while running experiments, restarting the application is not necessary for each of the scenarios.
PCAP has also been tested for tuning the individual regions inside the kernel, in addition to tuning the entire kernel. The purpose of tuning the individual regions is to find the optimal configuration per parallel region. The cross product of the parallel regions and the variants is created by the exhaustive search algorithm to create the scenarios, and then the optimal configuration is returned.
In the case of individual parallel region tuning, when there is a large number of parallel regions in the application, the search space grows very fast. As the plugin uses exhaustive search, it is crucial to shrink the search space. For this purpose, before applying the tuning process, each region has to be analyzed for speedup, and the variant space for each region should be shrunk according to the speedup properties of the region. Some theoretical work has been done in this regard: the static and dynamic energy estimation for the SuperMUC compute node has been carried out. This step is not yet complete; some implementation work remains and is left as future work.
Bibliography
[1] D. Albonesi. Dynamic IPC/Clock Rate Optimization. International Symposium on Computer Architecture, pages 282-292, July 1998.
[2] D. Baeck, A. Loeoef, and M. Roennbaeck. Evaluation of Techniques for Reducing the Energy-Delay Product in a Java Processor. 1999.
[3] D. Brooks and M. Martonosi. Adaptive Thermal Management for High-Performance Microprocessors. Workshop on Complexity Effective Design, June 2000.
[4] E. V. Carrera, E. Pinheiro, and R. Bianchini. Conserving Disk Energy in
Network Servers. In Proceedings of the 17th International Conference on Supercomputing, June 2003.
[5] C. B. Navarrete, A. Auweter, C. Guillen, W. Hesse, and M. Brehm. Energy Consumption Comparison for Running Applications on SandyBridge Supercomputers. 2013.
[6] I. A. Compres. PTF Demonstrator. http://www.autotune-project.eu/sites/default/files/Materials/Deliverables/12/D2.2 PTF Demonstrator final.pdf.
[7] M. Curtis-Maury, J. Dzierwa, C. Antonopoulos, and D. Nikolopoulos.
Online Power-Performance Adaptation of Multithreaded Programs using
Hardware Event-Based Prediction. In Proceedings of the International Conference on Supercomputing, June 2006.
[8] Bruno Diniz, D. O. G. Neto, W. Meira Jr., and R. Bianchini. Limiting the
Power Consumption of Main Memory. In Proceedings of the International
Symposium on Computer Architectures, June 2007.
[9] H. Sanchez et al. Thermal Management System for High Performance
PowerPC Microprocessor. IEEE Computer Society International Conference,
pages 325 – 330, February 1997.
[10] N. Vijaykrishnan et al. Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower. International Symposium on Computer Architecture, pages 96–106, June 2000.
[11] High Performance Fortran Forum. High Performance Fortran Language Specification. January 1997. http://www.crpc.rice.edu/CPRC/softlib/TRs online.html.
[12] R. Ge, X. Feng, and K. W. Cameron. Performance Constrained Distributed
DVS Scheduling for Scientific Applications on Power-aware Clusters. In
Proceedings of Supercomputing, November 2005.
[13] M. Gerndt and M. Ott. Automatic Performance Analysis with Periscope. In Concurrency and Computation: Practice and Experience, April 2010. http://www.lrr.in.tum.de/~ottmi/publications/ccpe2008.pdf.
[14] K. Ghose and M. Kamble. Reducing Power in Superscalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation. International Symposium on Low Power Electronics and Design, pages 70-75, August 1999.
[15] R. Gonzalez and M. Horowitz. Energy Dissipation in General Purpose
Microprocessors. IEEE Journal on Solid-State Circuits, pages 1277 – 1284,
September 1996.
[16] T. Halfhill. Transmeta Breaks x86 Low-Power Barrier. Microprocessor Report, pages 9 – 18, February 2000.
[17] C.-H. Hsu and W. Feng. A Power-Aware Run-Time System for High-Performance Computing. In Proceedings of Supercomputing ’05, November 2005.
[18] Intel. Pentium III Processor Mobile Module: Mobile Module Connector 2
(MMC-2) Featuring Intel SpeedStep Technology. 2000.
[19] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi. An
Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget. In Proceedings of the International Symposium on Microarchitecture, December 2006.
[20] K. Itoh. Low Power Memory Design. Low Power Design Methodologies,
pages 201–251, September 1996.
[21] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS
Parallel Benchmarks and Its Performance. October 1999. NAS Technical
Report NAS-99-011.
[22] T. Juan, T. Lang, and J. Navarro. Reducing TLB Power Requirements. International Symposium on Low Power Electronics and Design, pages 196–201,
August 1997.
[23] J. Kin, M. Gupta, and W. Mangione-Smith. The Filter Cache: An Energy
Efficient Memory Structure. International Symposium on Microarchitecture,
pages 187–193, December 1997.
[24] C. Liu, A. Sivasubramaniam, M. T. Kandemir, and M. J. Irwin. Exploiting
Barriers to Optimize Power Consumption of CMPs. In Proceedings of the
19th International Parallel and Distributed Processing Symposium, April 2005.
[25] LRZ. SuperMUC Petascale System. https://www.lrz.de/services/compute/supermuc/systemdescription/.
[26] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation
Control for Energy Reduction. International Symposium on Computer Architecture, pages 132 – 141, July 1998.
[27] L. Morin. Design of the Tuning Plugins. http://www.autotune-project.eu/sites/default/files/Materials/Deliverables/12/D4.1 Tuning Plugins final.pdf.
[28] S. Park, W. Jiang, Y. Zhou, and S. V. Adve. Managing Energy-Performance
Tradeoffs for Multithreaded Applications on Multiprocessor Architectures. In Proceedings of the 2007 ACM SIGMETRICS, June 2007.
[29] T. Pering, T. Burd, and R. Brodersen. The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms. International Symposium on Low Power
Electronics and Design, pages 76 – 81, August 1998.
[30] E. Rohou and M. Smith. Dynamically Managing Processor Temperature
and Power. 2nd Workshop on Feedback-Directed Optimization, November
1999.
[31] R. Springer, D. K. Lowenthal, B. Rountree, and V. W. Freeh. Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster. In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, March 2006.
[32] C-L. Su and A. Despain. Cache Design Trade-offs for Power and Performance Optimization: A Case Study. International Symposium on Low Power
Electronics and Design, pages 63–68, April 1995.
[33] Ananta Tiwari, Michael A. Laurenzano, Laura Carrington, and Allan
Snavely. Auto-tuning for Energy Usage in Scientific Applications. In Proceedings of the Euro-Par 2011 Workshops Part II, pages 178 – 187, 2012.
[34] A. Varma, B. Ganesh, M. Sen, S. R. Choudhury, L. Srinivasan, and B. L.
Jacob. A Control-Theoretic Approach to Dynamic Voltage Scheduling.
In Proceedings of the International Conference on Compilers, Architectures and
Synthesis for Embedded Systems, October 2003.
[35] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark. Formal Online Methods
for Voltage/Frequency Control in Multiple Clock Domain Microprocessors. In Proceedings of the International Conference on Architectural Support
for Programming Languages and Operating Systems, 2000.
[36] Q. Wu, M. Martonosi, D. Clark, V. Reddi, D. Connors, Y. Wu, J. Lee, and
D. Brooks. Dynamic Compiler-Driven Control for Microprocessor Energy
and Performance. IEEE Micro, 26(3), 2006.