A Novel Technique to Improve Parallel Program Performance Co-executing with Dynamic Workloads

Murali Krishna Emani, Michael O'Boyle
School of Informatics, University of Edinburgh, UK
[email protected], [email protected]

Abstract—In current multi- and many-core computing systems, multiple parallel programs co-execute, sharing resources in a highly dynamic environment. This dynamicity includes variations in hardware, software, input data and external workloads. It leads to contention in the system, seriously degrading the performance of some or all of the co-executing programs. Many existing solutions assume that a program executes in isolation and devise techniques around this false assumption. Here we propose a machine learning based technique to improve program performance when a program executes alongside external workloads that are dynamic in nature. We use static program features and dynamic runtime features, obtained during the compilation and execution phases respectively. We show that our approach improves speedup by over 1.5x over the best existing scheme on a 12-core machine.

Keywords—Parallelism mapping, Workloads, Compile and runtime optimization, Machine Learning

I. INTRODUCTION

Multicore-based parallel systems now dominate the computing landscape, from data centers to mobile devices. Efficient techniques for mapping programs onto the underlying multi- and many-cores are essential to program performance in a dynamic environment. Designing such solutions for parallel programs is particularly challenging owing to the complex underlying implementation of parallel programming models. General research solutions to the broad problem of parallelism mapping tend to ignore the basic reality of a shared, interactive execution environment. In any realistic scenario, the computing environment is dynamic; the varying parameters include program input data, hardware, software, load caused by external programs, and others. Programs share computing resources with a wide variety of emerging workloads that span from light to heavy, leading to significant resource contention [1], [2]. Hardware is becoming increasingly heterogeneous, with processors of asymmetric computing capabilities, and any hardware failure changes the amount of available computing resources; if a failure occurs during a program's execution, the program needs to adapt instantaneously to the resources that remain. Input data-driven applications are emerging in day-to-day computing, where the input data size varies during program execution. This has a profound effect on the memory and I/O systems, and the Big Data trend adds further complexity as parallel programs must process huge amounts of data [3]. Software upgrades are also frequent, and an upgraded version may provide a different programming environment with a different set of features affecting application performance. Existing thread-to-core parallel mapping solutions may therefore be inappropriate in these scenarios, and there is a critical need to revise mapping solutions to take the dynamic environment into account. Given this highly dynamic environment, applications need to adapt to varying parameters and autotune in order to execute efficiently with minimal intervention from the application programmer.
A widespread assumption in the parallel computing research community is that the program under consideration is the only execution unit in the system, with the resources remaining the same throughout its execution. This may be true for certain applications, but in reality the assumption no longer holds for the majority of them.

a) Hardware Adaptability: Modern NUMA machines are made up of multiple, often heterogeneous cores, which vary in operating frequency and in kind (e.g. multicore CPU versus GPU). Applications executing on these systems need to leverage the maximum potential of these processors. As hardware is prone to different types of failures, which are not unusual in any computing environment, special mechanisms are employed to ensure minimal disruption to running applications. Planned outages are a widely employed method, where computing units are switched off or their work migrated elsewhere during hardware repair or maintenance. The major problem occurs when there is a sudden hardware failure, leaving minimal time to provide alternate computing resources for the executing applications. One prominent hardware failure is the loss or malfunctioning of processors, as shown in figure 1. This reduces the number of available computing resources, which can adversely affect applications. For latency-sensitive applications, performance can degrade drastically due to the delay caused by the shortage of computing units. Several techniques exist today to ensure the smooth running of applications when a hardware failure occurs; however, these techniques do not reduce the load in proportion to the available computing resources. Cloud computing fits this scenario, as applications executed in a cloud are resilient to hardware failures owing to the cloud's elastic nature. However, cloud computing deals with this problem at a macro level and remains out of reach for many applications that cannot be migrated directly to a cloud.

Fig. 1. Thread mapping strategies for two programs P1, P2: (a) default with fully-functional processors, (b) default with faulty processors, (c) ideal with faulty processors.

b) Co-execution and contention: The complexity of managing the smooth execution of an application during hardware failure is further increased by the contention caused by external programs co-executing with the application. Minimizing the contention caused by competition for shared resources among co-executing programs is a widely studied area. One common assumption in proposed solutions for parallel program mapping is that the machine is fully available to a single application and that all resources are static throughout the program's lifetime. This assumption does not hold on the majority of computing platforms. In most purely static compiler approaches, program structure and machine characteristics are analysed to determine the best mapping of a program. These approaches have no knowledge of program behaviour at runtime and typically make simplifying assumptions about resource availability and external workloads. Purely runtime approaches, on the other hand, are generic in adapting to environment change, but they lack program knowledge, which is a great source of potential performance improvement.
Solutions employing machine learning based approaches [4], [2], [5], [6], [7] are proving reliable and promising, showing significant results in improving program performance during parallelism mapping. These approaches are generally trained offline using a training data set: features are collected during training runs and a model is learnt from them. During deployment, the same features are extracted from the system and input to the learnt model, which predicts the optimal mapping policy. Existing techniques rely on program features only or on runtime features only, or may degrade the external workloads' performance while trying to optimize the target program and reduce contention [8]. In this work we aim to improve a parallel program's performance under resource contention when it executes alongside varying external programs. We propose an approach in which a machine learning model uses both static and dynamic features to deliver better execution efficiency in unseen dynamic environments.

Throughout this paper, Target means the program we are trying to optimize, and Workload means any other program co-executing with the target that generates load in the system. We use Core and Processor interchangeably to denote a processing unit.

Contributions: Our contributions include:
• We propose a novel technique using a machine learning model to enable a parallel program to adapt to changing workloads.
• We show the effectiveness of our approach by achieving better speedup over the OpenMP default, and 1.5x over the best existing scheme [9].
• Our technique has no impact on external workloads and does not degrade their performance.

II. MOTIVATION

To show that a program's performance degrades significantly when it co-executes with another program, we ran a target program, cg from the NAS parallel benchmark suite, along with another program chosen from the same suite. We measure the speedup over the OpenMP default with different numbers of target program threads. To capture variation in the nature of workloads, we repeated the experiments with increasing numbers of workload threads. Figure 2 shows the resultant speedup of the target program. We observe that the default behaviour of the target is severely affected in the presence of an external workload, and that the degradation grows as the number of workload threads increases. This shows that a program run as-is with the OpenMP default policy always executes with the same number of threads, equal to the maximum number of available processors, and slows down greatly. This is due to the increased contention arising from multiple programs competing for resources at the same time.

Fig. 2. Graph showing the performance degradation of a program co-executing with different external workloads.

In figure 3 we show a microscopic view of the thread configurations assigned by different policies when a target program is co-executed with a workload [1]. We observe that the OpenMP default assigns the same number of threads irrespective of any external programs. A state-of-the-art technique uses a hill climbing optimization policy that adjusts thread numbers in unit steps. The best possible (oracle) thread configuration is also shown. The existing techniques vary greatly in the threads they assign to the parallel loops of the program. The speedup obtained by the various approaches in this scenario is shown in figure 4.
The OpenMP default scheme performs barely the same as the sequential version, while a best static scheme and the hill climbing method perform better. However, the best possible oracle scheme can improve program performance considerably further, and the existing techniques are nowhere near the oracle. There is thus great scope for improvement, and our approach aims to fill this gap and achieve the best possible speedup.

Fig. 3. A microscopic view of thread number assignment by different approaches (threads of the target program over time under the Default and Hill Climbing policies, for workloads W0 and W1).

Fig. 4. Comparison of speedup of various techniques.

III. APPROACH

Our goal is to develop a model that determines the optimal number of threads for each parallel section of the target program, based on static program features and runtime system features representing the external workload. Instead of building a model bound to a particular setting, we use supervised learning to generate this heuristic automatically. This portable approach ensures we can use our technique on any platform. Figure 5 describes how our approach works. For every new target program, the compiler extracts significant characteristic information about the program during the compilation phase, in the form of static code features. It then links the compiled program with a light-weight runtime library containing the automatically learnt heuristic. For each parallel section, the compiler inserts a call to the runtime, passing the static program features of that section as a parameter. At execution time, the runtime combines these program features with dynamic external workload information as inputs to our predictive model, which returns the optimal number of threads for this parallel section. The program then executes the parallel loop with the newly determined thread number.

We build our machine learning model following the generic three-step process for supervised learning: (1) generate training data, (2) train a model, (3) use the heuristic. We generate training data by exhaustively running each training program together with a workload program. During training, we vary the number of threads used for the target and workload programs and record their execution times. We collect a set of features during the generation of training data, used to characterize the target program and the external workload. This training data is used to build the model offline; once the model is deployed, no further learning takes place. The feature set is a collection of feature vectors, each consisting of the numerical values of the chosen program and dynamic workload features.

Feature vector: The set of features used is arranged as a numerical vector. The static features of a program comprise the total instruction count together with memory and branch summary information, with the corresponding values normalized to the total number of instructions. The workload is characterized by the load it generates on the CPU. We obtain this information from the proc file system of the Linux kernel. The Linux kernel also provides a very useful tool, sar, that collects cumulative activity counters of the operating system and can be used to obtain system characteristics at every time unit. To characterize the runtime environment, we use three features from the /proc filesystem: run queue length, ldavg-1 and ldavg-5. The run queue length represents the number of processes waiting for scheduling in the Linux kernel, which gives an indication of how many tasks are running on the system. The ldavg-n is the system load average, calculated as the average number of runnable or running tasks plus the number of tasks in uninterruptible sleep over an interval of n (n = 1, 5) minutes. These runtime features reflect the load created by the external workloads. The number of workload threads and the number of cores form the rest of the feature vector. These 8 features, as listed in table I, constitute the feature vector that is fed as input to our machine learning model.

TABLE I. LIST OF FEATURES

  Type      Feature
  Static    Total number of Load/Store instructions
  Static    Total number of Branches
  Static    Total number of Instructions
  Dynamic   Number of available Processors
  Runtime   # Workload threads
  Runtime   Run queue length
  Runtime   cpu load ldavg-1
  Runtime   cpu load ldavg-5
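To make the feature collection concrete, the following sketch (in C, for Linux) shows how the feature vector of table I might be assembled at runtime. The structure and function names are illustrative choices, not those of our actual runtime library; the static fields are assumed to be filled in by the compiler, and the workload thread count by the runtime's own bookkeeping.

    /* Illustrative sketch: the 8-entry feature vector of Table I and
       sampling of the runtime features from /proc/loadavg. */
    #include <stdio.h>
    #include <unistd.h>

    typedef struct {
        /* Static features, normalized to the total instruction count;
           assumed to be filled in by the compiler per parallel section. */
        double loads_stores;
        double branches;
        double instructions;
        /* Dynamic and runtime features, sampled at execution time. */
        double num_processors;
        double workload_threads;  /* assumed known to the runtime */
        double run_queue_len;
        double ldavg_1;
        double ldavg_5;
    } feature_vector;

    /* /proc/loadavg has the form: "ld1 ld5 ld15 runnable/total last_pid".
       The runnable count approximates sar's run queue length (runq-sz). */
    static int collect_runtime_features(feature_vector *fv)
    {
        double ld1, ld5, ld15;
        long runnable, total;
        FILE *f = fopen("/proc/loadavg", "r");

        if (!f)
            return -1;
        if (fscanf(f, "%lf %lf %lf %ld/%ld",
                   &ld1, &ld5, &ld15, &runnable, &total) != 5) {
            fclose(f);
            return -1;
        }
        fclose(f);

        fv->num_processors = (double)sysconf(_SC_NPROCESSORS_ONLN);
        fv->run_queue_len  = (double)runnable;
        fv->ldavg_1        = ld1;
        fv->ldavg_5        = ld5;
        return 0;
    }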
A. Training Data

We train our heuristic using training data collected in a synthetic workload setting and apply the trained heuristic to various unseen dynamic runtime environments. This differs from previous approaches [10], where a model is trained for each target program. Training data are generated from experiments where each target program is executed with one workload program, and we vary the number of threads used by the workload program. To find the best possible scheme for each such experiment, we exhaustively assigned different thread numbers to each parallel loop, then recorded the best performing scheme and its thread setting. We also extract runtime features during the training run. These runtime and static program features, together with the best-performing thread number, are put together to form the training data set. Although producing training data takes time, it is a one-off cost of our heuristic. Generating and collecting the data is a completely automatic process and is performed offline. The model is trained only once, offline, and then frozen; no further learning takes place during program execution. Figure 6 depicts the training phase of our machine learning model.

Fig. 5. An overview of our approach. During the compilation phase, program features are extracted. These are combined with runtime features if a workload exists and fed to the predictive model, which determines the optimal thread number; otherwise the OpenMP default number of threads is returned.

B. Machine learning model

Our machine learning model is based on an artificial neural network [11]. We employ a standard multilayer perceptron with one hidden layer that learns by the back propagation algorithm, a generalization of the linear least mean squares algorithm. The heuristic is automatically constructed from the training data. Figure 6 describes how the heuristic is trained: we supply the training algorithm with the training data collected offline, where each data item includes the static program features of the training program, the runtime features and the best mapping.

Fig. 6. Training phase of the machine learning model used in our approach. We use Artificial Neural Networks to build our model.

The training algorithm tries to find a function γ which takes in a feature vector, fv, and gives a prediction, th, that closely matches the actual best mapping, th_optimal, in the training data set.
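As a concrete illustration of the inference step, the sketch below shows a forward pass through such a one-hidden-layer perceptron. The hidden-layer width, the sigmoid activation and the interpretation of the output as a fraction of the available cores are our own assumptions for illustration; the weights would be produced by the offline back-propagation training, not written by hand.

    /* Illustrative sketch of MLP inference: 8 input features, one
       hidden layer, one output interpreted as a thread count. */
    #include <math.h>

    #define N_FEATURES 8
    #define N_HIDDEN   8    /* illustrative hidden-layer width */

    /* Weights and biases learnt offline by back propagation. */
    static double w_hid[N_HIDDEN][N_FEATURES], b_hid[N_HIDDEN];
    static double w_out[N_HIDDEN], b_out;

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* Map a feature vector to a predicted optimal thread number. */
    static int predict_threads(const double fv[N_FEATURES], int max_threads)
    {
        double out = b_out;

        for (int h = 0; h < N_HIDDEN; h++) {
            double acc = b_hid[h];
            for (int i = 0; i < N_FEATURES; i++)
                acc += w_hid[h][i] * fv[i];
            out += w_out[h] * sigmoid(acc);
        }
        /* Read the output as a fraction of the available cores. */
        int t = (int)(sigmoid(out) * max_threads + 0.5);
        if (t < 1) t = 1;
        if (t > max_threads) t = max_threads;
        return t;
    }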
C. Deployment

Once we have gathered training data and built the heuristic, we can use it to select the mapping for any new, unseen program. At execution time, the library is called and checks whether a workload program is running on the system. If any workload program is detected, runtime features are collected from /proc and act as inputs to the neural network, which outputs the optimal number of threads for the target program. The runtime uses this number of threads to execute the corresponding parallel region. If there is no workload, the target program runs with the default configuration using all available physical threads.
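A minimal sketch of this deployment path, reusing the illustrative feature_vector, collect_runtime_features() and predict_threads() helpers from the earlier sketches, might look as follows; the workload-detection heuristic (any runnable task beyond our own counts as external load) is likewise an assumption, not our actual library logic.

    /* Illustrative sketch of the compiler-inserted runtime hook that
       selects a thread count before each parallel section. Reuses the
       feature_vector, collect_runtime_features() and predict_threads()
       helpers sketched earlier. */
    #include <omp.h>
    #include <unistd.h>

    /* Crude stand-in: treat any runnable task beyond this process as
       an external workload. */
    static int workload_detected(const feature_vector *fv)
    {
        return fv->run_queue_len > 1.0;
    }

    void select_threads_for_section(double loads_stores, double branches,
                                    double instructions)
    {
        feature_vector fv = {0};
        int max_threads = (int)sysconf(_SC_NPROCESSORS_ONLN);

        /* Static features were extracted at compile time and passed in. */
        fv.loads_stores = loads_stores;
        fv.branches     = branches;
        fv.instructions = instructions;

        if (collect_runtime_features(&fv) != 0 || !workload_detected(&fv)) {
            omp_set_num_threads(max_threads);  /* OpenMP default */
            return;
        }
        double x[N_FEATURES] = {
            fv.loads_stores, fv.branches, fv.instructions,
            fv.num_processors, fv.workload_threads,
            fv.run_queue_len, fv.ldavg_1, fv.ldavg_5
        };
        omp_set_num_threads(predict_threads(x, max_threads));
    }

The parallel region that follows this call then executes with the predicted number of threads.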
IV. EXPERIMENTAL SETUP

A. Hardware and Software Configurations

We carried out experiments to evaluate our approach on an Intel Xeon platform with two 2.4 GHz six-core processors (12 threads in total) and 16GB RAM, with a Red Hat 4.1.2-50 operating system running Linux kernel 2.6.18. All programs were compiled using gcc 4.6 with the parameters "-O3 -fopenmp".

B. Benchmarks

We used all C programs from the NAS parallel benchmark suite [12], the SPECOMP-2006 suite [13] and the Parsec benchmark suite [14]. These are representative parallel programs, providing a pool of widely varying parallel programs and emerging workloads.

C. Varying workloads

To introduce dynamicity in the workloads, we invoke workload programs at low frequency and at high frequency, where the inter-arrival time between two programs is 2 and 5 seconds respectively. To vary the nature of the workloads, we define three categories of workload, minimal, normal and heavy, as shown in table II.

TABLE II. WORKLOAD SETTINGS

  Workload   Number of programs   Number of threads
  Minimal    <2                   <6
  Normal     [2-5]                [6-12]
  Heavy      >5                   >12

V. RESULTS

In this section we compare the performance improvement gained by our approach against the existing state-of-the-art technique. We first summarize the performance of our approach against alternative approaches across all workload settings; due to limited space, we omit detailed results and performance graphs for each workload setting and each workload frequency. We then evaluate our approach as a case study on a workload scenario derived from a large-scale warehouse system. We show the performance improvement of the target program averaged over all benchmark programs for six experimental settings. To show the effectiveness of our technique in another dimension, we also show the impact of our approach on the external workloads for each benchmark program, averaged across the different experimental scenarios.

Figure 7 shows the performance results on six different workload scenarios averaged across all benchmark programs. These scenarios are formed by combining the two levels of frequency with each of the minimal, normal and heavy workload settings. In a given workload setting, the speedup improvement varies for different programs; hence, the min-max bars in the graph show the range of speedups achieved across the various target programs. Our approach not only gives better performance than the OpenMP default but also significantly outperforms the best existing technique, based on hill climbing optimization, across all workload scenarios.

Fig. 7. Comparison of our approach over OpenMP default and the state-of-the-art scheme across the six workload scenarios. We improve program performance by 1.5x over the best existing scheme. Ranges over bars denote the extent of speedup improvement across the variety of benchmark programs.

For minimal workloads, the OpenMP default scheme performs reasonably well, as the amount of resource contention is minimal. Under such a setting, our automatic approach gives its least improvement, a speedup of 1.5x; this still translates to a 1.15 times improvement over the best existing scheme. For the normal and heavy workload settings, our approach has a clear advantage, with speedups above 2.4x (up to 3.2x) over the OpenMP default scheme. This translates to a speedup of over 1.36x (up to 2.3x) compared to the hill climbing approach. The min-max bars make clear that our technique delivers stable performance across all workload scenarios without slowing down any program. Overall, the automatic approach achieves a geometric mean speedup of 2.3x, which translates to a 1.5 times improvement over the best existing scheme.

A. Effect on workload

In figure 8 we compare the speedups of the external workload under the various approaches. We observe that the default and the best existing schemes affect the workload and degrade its performance. This is undesirable: an optimization technique's motive should be to improve a program's performance without depleting and degrading other programs in a greedy fashion. Our approach does not hurt the workload in any experimental setting, as we reduce system contention to a great extent, which indirectly benefits the workload as well. Hence we observe a mild improvement in workload performance too.

Fig. 8. Comparison of the effect of the various techniques on the external workload. Our technique does not penalize the workload in any case, creating a win-win situation for target and workloads.

B. Case study

To validate our approach in a real-world setting, we selected a workload environment based on a sample from an in-house high performance cluster of computing systems. A large number of different jobs are submitted to this cluster by many departments that require extensive computational resources. The distribution of job arrivals in this cluster and the number of requested processors over a period of 30 hours were obtained from the inbuilt system log. We extracted jobs from a 15-minute snapshot of this real-world workload, selected to highlight the variation in the workload pattern. Over this workload scenario, figure 9 shows the speedup of one target program, lu, with different schemes. It can be observed that our predictive model fares better than the OpenMP default and the state-of-the-art technique, with 1.37 and 1.22 times performance improvement respectively. This clearly shows that our model adapts well to dynamic external workload programs in any computing environment. Even in this experiment, the impact of our approach on the workload is minimal, creating a win-win situation for both target and workload programs.

Fig. 9. Performance with live system workload.

VI. WORK IN PROGRESS: LEARN-ON-THE-FLY

Machine learning models show significant performance improvement when the settings in which they are evaluated are similar to those they were trained for. Exhaustively exploring the possible states to find the best scheme during offline training is not always feasible. If some parameter of the execution environment changes during program execution in a way the model was not trained for, it is highly likely that the predictions are no longer optimal for the changed environment. In all existing machine learning based mapping techniques, there is no mechanism to verify whether the prediction made was indeed the best possible one. Moreover, without realizing that predictions were faulty, the models continue with the same logic in the new environment.
We are currently working to tackle the problem of determining whether the model's predictions are invalid in a new, changed execution environment. If the model can be enabled to learn on-the-fly in such cases, it can avoid the pitfalls of misprediction. We use concepts from Reinforcement Learning to obtain feedback from the computing environment, verify the quality of the predictions and, if necessary, learn and update the model to improve prediction quality on-the-fly. Figure 10 shows an overview of a generic reinforcement learning framework.

Fig. 10. Reinforcement Learning framework where the agent improves its control logic based on the feedback obtained from its interaction with the environment.

VII. CONCLUSION

This paper has introduced a novel technique based on predictive modeling to devise an optimal mapping policy for a parallel program co-executing with dynamic external workloads. The approach employs static and dynamic parameters, in the form of program features and system runtime features, to optimize an application. Our method improves program performance significantly (1.5x) over the best existing technique in spite of severe resource contention, with minimal impact on external workloads. To strengthen our proposal, we evaluated this method in a real-world case study. Further, we envision improving this technique to enable any parallel program to adapt to a dynamic environment using online learning as its key strength, and to exploit heterogeneous cores with a mix of OpenMP and OpenCL programs.

REFERENCES

[1] M. K. Emani, Z. Wang, and M. F. P. O'Boyle, "Smart, adaptive mapping of parallelism in the presence of external workload," in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1–10, 2013.
[2] D. Grewe, Z. Wang, and M. F. P. O'Boyle, "A workload-aware mapping approach for data-parallel programs," in HiPEAC '11, pp. 117–126.
[3] D. Vengerov, L. Mastroleon, D. Murphy, and N. Bambos, "Adaptive data-aware utility-based scheduling in resource-constrained systems," J. Parallel Distrib. Comput., vol. 70, no. 9, pp. 871–879, 2010.
[4] J. Martinez and E. Ipek, "Dynamic multicore resource management: A machine learning approach," in Micro '09, pp. 8–17.
[5] P. Radojković, V. Čakarević, M. Moretó, J. Verdú, A. Pajuelo, F. J. Cazorla, M. Nemirovsky, and M. Valero, "Optimal task assignment in multithreaded processors: a statistical approach," in ASPLOS '12, pp. 235–248.
[6] Z. Wang and M. F. O'Boyle, "Mapping parallelism to multi-cores: a machine learning based approach," in PPoPP '09, pp. 75–84.
[7] R. Bitirgen, E. Ipek, and J. F. Martinez, "Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach," in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41), pp. 318–329, 2008.
[8] J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa, "Contention aware execution: online contention detection and response," in CGO '10, pp. 257–265.
[9] A. Raman, A. Zaks, J. W. Lee, and D. I. August, "Parcae: a system for flexible parallel execution," in PLDI '12, pp. 133–144.
[10] R. W. Moore and B. R. Childers, "Using utility prediction models to dynamically choose program thread counts," in ISPASS '12, pp. 135–144.
[11] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.
[12] "NAS parallel benchmarks 2.3, OpenMP C version." http://phase.hpcc.jp/Omni/benchmarks/NPB/index.html
[13] "SPECOMP Benchmark suite." http://www.spec.org/omp/
[14] "Parsec benchmark suite." http://parsec.cs.princeton.edu/