NEWSLETTER - SUPER SciDAC
NEWSLETTER November 2013

Upcoming Events:
SC’13: sc13.supercomputing.org/, November 17-22, 2013; Denver, Colorado
SUPER meeting at SC’13, 9:00-11:00am, Monday, November 18, 2013, Room 703
PACT 2014, www.pactconf.org, August 23-27, 2014, Edmonton, Alberta, Canada; Abstract and paper due February 25, 2014
ICS 2014, www.icsconference.org, June 10-13, 2014, Munich, Germany; Abstracts due January 10, 2014; Papers due January 17, 2014

From the Director

The SUPER Institute has started its third year after a productive first two years. We have had a major impact by pushing forward architecture-aware and application-aware research in the areas of performance engineering, energy reduction, resilience, and co-optimization of multiple objectives from these areas. New methodologies for optimizing performance, energy consumption, and resilience are all driven by changes in architectures as systems evolve to exploit advances in technology. Collaborations with the DOE computational sciences community both focus our research on real challenges and ensure broad and immediate impact of our research results.

This issue of the SUPER newsletter focuses on the resilience area. Our researchers have developed new formal methods for assessing the resilience of particular regions of code and algorithms using fault injection. They have also developed a novel fault injector that can automate the process of assessing the resilience of specific areas of code. They have developed programmer-assisted annotations that mark areas of the code with low resilience, and they have begun to automate compiler techniques that can be used to armor regions of the code that are so annotated. They have also made significant strides in improving checkpoint/restart for practical usage at large scales. Furthermore, they have conducted a detailed study of the overhead and effectiveness of resilience-enhancing compiler transformations when used independently and together.
Together, these efforts further SUPER’s progress towards a comprehensive, automatic tool set for vulnerability assessment and fault mitigation. The articles in this issue highlight some of these research efforts.

Report of DFG/NSF/SRC International Workshop: Cross-layer Resiliency

A joint workshop sponsored by NSF (US), DFG (German NSF), and the Semiconductor Research Corporation (SRC) was held in Austin, Texas on July 11-12, 2013. This write-up presents a brief description of the project from the University of Utah and the University of Southern California (USC) that was presented at the workshop, followed by an overview of the workshop and broader community interactions summarized by Ganesh Gopalakrishnan.

The project "Localized, Layered, Formal Hardware/Software Resilience Methods" from USC (Pedro) and Utah (Gopalakrishnan, Zvonimir) was presented at the Principal Investigators’ presentation session on July 12. The project investigates system resilience from the perspective of the NSF/SRC community. The plan is to study fault handling in embedded multicore systems, considering localized as well as layered approaches for maximum efficacy. The work has been underway for six months. LLVM-level fault injectors are being developed, and the use of system invariants for fault detection will be explored. For embedded multicore systems and networks on chips, the project will investigate the kinds of detectors appropriate for use (a) within microprocessor cores, (b) within the network-on-chip (NOC) hardware, and (c) in communication libraries. Finally, the resulting ideas will be evaluated in an FPGA setting. The project website is being prepared, and when deployed it will be linked from (i) http://www.cs.utah.edu/fv and (ii) http://soarlab.org.
Fig: Vision of the "Localized, Layered, Formal Hardware/Software Resilience Methods" Project

Summary of Overview and Broader Community Interactions at the DFG/NSF/SRC International Workshop
Ganesh Gopalakrishnan

Overview

The main takeaway from this workshop (viewed from the SUPER community) was the realization that a few other communities are already engaged in system resilience. Their focus at this point covers many issues at lower levels of the programming abstraction as well as in hardware (such as thermal stresses, spread of cracks, aging effects, and single points of failure in networks on chips). These are crucial issues that can nevertheless be informative for the resilience community focused on higher levels of abstraction.

Another prominent theme was variability, presented by an NSF Expeditions project led by Dr. Rajesh Gupta of UCSD. Their URL is variability.org. The margins allowed within chips are growing as a percentage. This is because smaller devices and lower voltages imply that many devices will find themselves precariously close to the allowed operating margins. Advanced body-bias adjustment schemes were proposed, which help put the devices at nearly the same operating points. Industrial representatives also presented their work on active power management schemes. This is clearly a topic of concern for HPC. It is also a concern for researchers focused on lower levels of abstraction, especially variability.

The workshop also hosted a speaker from Japan (Dr. Onodera of Kyoto University) and a few speakers from Germany, offering Japanese and German perspectives, respectively, as well as details on some projects.
Broader Community Interactions

While the need for broader community interactions was brought up, no specific mechanisms or action items were discussed. Due to the splintering of conferences and workshops, the interest groups are widely dispersed. Perhaps, by working with NSF and SRC, which have funded many PIs under the Failure Resistant Systems (FRS) program, one might be able to invite prominent players in the NSF/SRC effort to give talks or keynote addresses at HPC conferences and workshops. I also hope this move will be reciprocated by having HPC folks interested in resilience speak to the NSF/SRC community, thereby creating rich interactions.

Along these lines, Augusto Vega, Alper Buyuktosunoglu and Pradip Bose (IBM) are organizing an IEEE Micro Special Issue called "Harsh Chips": http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=31691&copyownerid=44235

Although this call seems to have an embedded systems focus, it is of interest to us because the embedded and high-end communities share many concerns. Architectural experiments such as CARMA (CUDA + ARM) and FAWN (Fast Array of Wimpy Nodes, a project at CMU) suggest that there could be tradeoffs in the direction of gaining energy efficiency by adopting mobile technologies at the high end. Bill Harrod’s talk at SC’12 suggested similar directions as well. It seems reasonable to have people like Dr. Bose speak to our community, especially because he worked in this area in conjunction with the Power6 and Power7 architectures, and we can certainly compile a list of interesting speakers in this context. In summary, interactions between the NSF/SRC community and the DOE community focused on resilience can be a win/win for both sides.
Finally, the website for the DFG/NSF/SRC International Workshop on Cross-layer Resiliency provides a summary of the speakers and activities and can be accessed at: http://ces.itec.kit.edu/cdnc/ws-cross-layer.htm

A Holistic Approach to Failure Mitigation
George Bosilca, Aurelien Bouteiller, Thomas Herault, and Jack Dongarra

Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Failures may manifest in a variety of forms. The most visible is the processor crash, in which one node fails completely, severing the application from one of its processes and the associated part of the dataset. A more insidious form of failure results from silent data corruption, in which some bits of data are modified, thereby modifying the computed solution, possibly without being detected. Addressing both types of failures is of paramount importance to maintain future scientific productivity and confidence in produced results.

In the past we have designed models to capture the variety of commonly employed failure recovery strategies. These models are intended to help steer fault tolerance research in the right direction by pointing at potential performance issues before we encounter them in production. These models have highlighted the fact that traditional fault tolerance approaches, based on checkpointing or replication, will exhibit a considerable overhead in terms of performance, and even more in terms of energy efficiency. The sheer size of the envisioned distributed memory systems calls for more advanced failure mitigation strategies.
Algorithm-Based Fault Tolerance (ABFT) strategies promise to drastically improve the efficiency of failure recovery and mitigation by considering the mathematical properties of the algorithm in the design phase of its protection scheme. Typically, during the initial phase of the application, the dataset is augmented with some redundant data, linked with the original dataset by a linear mathematical relation (for example, being the sum of the elements on a row). During the course of the algorithm, mathematical transformations are applied both to the original dataset, in order to compute the solution, and to the redundant protection data, in order to maintain the invariant relation between the protection data and the original data. In many algorithms, this approach yields excellent scalability properties, because it avoids the caveat of relying on I/O to offer protection and incurs a purely computational overhead that often asymptotically approaches zero as the number of participating resources grows (as illustrated by the performance of the QR factorization, PDGEQRF, when protected against fail-stop failures by ABFT, see Fig. 1). Furthermore, a rupture of the mathematical relation between the protection data and the original data permits detection, and possibly correction, of otherwise silent memory corruption errors. Again, the performance penalty incurred by the protection scheme is small, even when considering extreme-scale distributed memory machines (see Fig. 2, where the LU factorization, PDGETRF, is protected against silent memory corruptions by ABFT).

Fig 1

Fig 2

The deployment of these advanced ABFT techniques in widely used parallel environments and support libraries, such as MPI, is very important to ensure an incremental migration path for today's production codes. However, typical standard MPI implementations offer extremely limited support for software-level fault tolerance approaches.
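
The checksum relation behind ABFT, as described above, can be illustrated with a minimal Python sketch. This is not the protected PDGEQRF or PDGETRF code referred to in the figures. The function names (add_checksum, right_multiply, invariant_holds, recover_row) and the small integer matrices are assumptions made purely for illustration, and the sketch uses one checksum row holding the element-wise sum of the data rows, with each data row standing in for the data of one hypothetical process.

```python
def add_checksum(rows):
    """Append one checksum row: the element-wise sum of all data rows."""
    return rows + [[sum(col) for col in zip(*rows)]]


def right_multiply(rows, b):
    """Apply the same linear update (row times b) to every row, checksum included."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in rows]


def invariant_holds(rows):
    """True when the last row still equals the sum of the data rows."""
    *data, checksum = rows
    return [sum(col) for col in zip(*data)] == checksum


def recover_row(rows, lost):
    """Rebuild one lost data row from the checksum row and the survivors."""
    *data, checksum = rows
    survivors = [r for i, r in enumerate(data) if i != lost]
    return [c - sum(r[j] for r in survivors) for j, c in enumerate(checksum)]


# Hypothetical data: three data rows, then one linear update step.
a = [[1, 2], [3, 4], [5, 6]]
b = [[2, 0], [1, 3]]
updated = right_multiply(add_checksum(a), b)

print(invariant_holds(updated))   # True: the update preserved the relation
print(recover_row(updated, 1))    # [10, 12]: row 1 rebuilt without any I/O

updated[0][0] += 1                # simulate a silent corruption
print(invariant_holds(updated))   # False: the broken relation exposes it
```

Because the update is linear, applying it to the checksum row keeps the relation intact, so a lost row can be rebuilt from the survivors and a corrupted entry shows up as a broken relation. Integers are used here so that the comparison is exact; a floating-point version would need a tolerance.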
The novel idea of Checkpoint-on-Failure enables advanced recovery strategies like ABFT on an MPI library (or other communication stack) that cannot continue to transport messages after a first failure has struck. When a failure damages the capability to communicate, the only mandate for the communication library is that it return control and raise an appropriate error. The application code can then checkpoint the surviving processes after the failure has occurred. The entire application is then restarted from the checkpoint, except for the processes that actually failed (for which no checkpoint is available), which are restarted blank. The ABFT recovery algorithm is then employed to restore the dataset of these blank restarted processes, after which the application can continue. Note that this scheme achieves the optimal number of checkpoints, one per failure, in sharp contrast with the customary periodic checkpoint scheme.

In the light of all these efforts and their modeling, a possible path forward is suggested. On this path, no single fault management strategy can efficiently handle the complexity of all types of failures; instead, each strategy is part of the solution, as every one of them brings features and capabilities that together will allow for the design and implementation of a more generic and efficient solution.

Featured SUPER Researcher: Todd Gamblin

1. Where do you work and how are you involved with SUPER?
I work in the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). I've been involved with SUPER since before its inception. Its predecessor, PERI, funded part of my Ph.D. work on performance analysis at UNC Chapel Hill. Back then, I was developing parallel compression tools and clustering algorithms for performance data. Since SUPER started, I've been working with Bronis de Supinski on resilience.
Recently, I've started developing a tool called dragnet, which is designed to detect silent memory errors on large machines. It was inspired by some mysterious problems we've seen in full-scale LINPACK runs on the Sequoia machine at LLNL.

2. Can you briefly summarize your educational and work background?
I was a double Computer Science and Japanese major at Williams College, and after graduating there I spent a year in Tokyo doing web development. I came back to the U.S. for graduate school at UNC. After a brief stint doing research on clockless logic, I switched over to doing performance analysis work with Dan Reed. I ended up collaborating with Bronis de Supinski at LLNL as part of my dissertation work, and after I graduated I took a job at LLNL as a postdoc. I've been here ever since.

3. Where are you from originally?
From second grade on, I grew up in Raleigh, North Carolina. Before that I moved around a lot because my father was in the Navy.

4. What are your research areas of interest?
Scalability, performance modeling, parallel performance analysis, visualizing performance data, resilience, and programming models. I'm also very interested in ways to bridge the gap between research and production. I would like to get more research tools directly into users' hands.

5. What do you see yourself doing five years from now?
I'd like to be at LLNL, still working on scalability and leading a larger research group.

6. What are some things you enjoy doing that don’t involve computers?
I was a swimmer all through college, but these days I enjoy indoor and outdoor climbing.

SUPER Activities at SC’13

PAPERS

1. Date/Day: 11-20-2013, Wednesday
   Time: 3:30PM - 4:00PM
   Session: Engineering Scalable Applications
   Presentation: Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems
   Speakers: Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker
   Location: 401/402/403
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pap402

2. Date/Day: 11-20-2013, Thursday
   Time: 10:30AM - 11:00AM
   Session: Matrix Computations
   Presentation: Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance
   Speakers: Yulu Jia, George Bosilca, Piotr Luszczek, Jack Dongarra
   Location: 401/402/403
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pap342

WORKSHOP PAPERS

1. Date/Day: Monday the 18th
   Time: 9:00AM - 5:30PM
   Session: 4th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
   Presentation: 4th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
   Speakers: Vassil Alexandrov, Jack Dongarra, Al Geist, Christian Engelmann
   Location: 507
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=wksp138

2. Date/Day: Monday the 18th
   Time: 9:00AM - 5:30PM
   Session: 4th International Workshop on Performance Modeling, Benchmarking and Simulation of HPC Systems (PMBS13)
   Presentation: Multi-Objective Optimization of HPC Kernels for Performance, Power, and Energy
   Speakers: Prasanna Balaprakash, Ananta Tiwari, and Stefan M. Wild
   Location: Room 502
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=wksp120

3. Date/Day: Monday the 18th
   Time: 9:00AM - 5:30PM
   Session: Extreme-Scale Programming Tools
   Presentation: Building on Lessons Learned from Over a Decade of MRNet Research and Development
   Speakers: B. P. Miller (with significant contributions from P.C. Roth and other current and former members of the Paradyn research group; the workshop schedule only lists the first author)
   Location: Room 501
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=wksp118

4. Date/Day: Monday the 18th
   Time: 12:00PM - 12:30PM
   Session: 4th International Workshop on Performance Modeling, Benchmarking and Simulation of HPC Systems (PMBS13)
   Presentation: Performance Tuning of Fock Matrix and Two-Electron Integral Calculations for NWChem on Leading HPC Platforms
   Speakers: H. Shan, W. Jong, L. Oliker, N. Wright, and B. Austin
   Location: Room 502
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=wksp120

5. Date/Day: Monday the 18th
   Time: 12:00PM - 12:30PM
   Session: 4th International Workshop on Performance Modeling, Benchmarking and Simulation of HPC Systems (PMBS13)
   Presentation: Quantifying Architectural Requirements of Contemporary Extreme-Scale Scientific Applications
   Speakers: J. Vetter, S. Lee, D. Li, G. Marin, C. McCurdy, J. Meredith, P. Roth, K. Spafford
   Location: Room 502
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=wksp120

POSTERS

1. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: Research Poster Reception
   Presentation: Generating Customized Eigenvalue Solutions Using Lighthouse
   Speakers: Luke Groeninger, Ramya Nair, Sa-Lin Cheng Bernstein, Javed Hossain, Elizabeth R. Jessup, Boyana Norris
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=post266

2. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: Research Poster Reception
   Presentation: Framework for Optimizing Power, Energy, and Performance (nominated for best poster award)
   Speakers: Prasanna Balaprakash, Ananta Tiwari, Stefan Wild
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=post154

3. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: Research Poster Reception
   Presentation: Designing and Auto-Tuning Parallel 3-D FFT for Computation-Communication Overlap
   Speakers: Sukhyun Song, Jeffrey K. Hollingsworth
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=post220

4. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: ACM Student Research Competition Poster Reception
   Presentation: Towards Co-Evolution of Auto-Tuning and Parallel Languages
   Speakers: Ray S. Chen
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=spost115

5. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: Research Poster Reception
   Presentation: Hybrid MPI/OpenMP/GPU Parallelization of XGC1 Fusion Simulation Code
   Speakers: Eduardo F. D'Azevedo, Jianying Lang, Patrick H. Worley, Stephane A. Ethier, Seung-Hoe Ku, Choong-Seock Chang
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=post215

6. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: Research Poster Reception
   Presentation: Structural Comparison of Parallel Applications
   Speakers: Matthias Weber, Kathryn Mohror, Martin Schulz, Holger Brunst, Bronis R. de Supinski, Wolfgang E. Nagel
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=post180

7. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: Research Poster Reception
   Presentation: Test-Driven Parallelization of a Legacy Fortran Program
   Speakers: Damian W. I. Rouson, Hari Radhakrishnan, Karla Morris, Sameer Shende, Stavros C. Kassinos
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=post257

8. Date/Day: Tuesday the 19th
   Time: 5:15PM - 7:00PM
   Session: Research Poster Reception
   Presentation: High-Performance Design Patterns for Modern Fortran
   Speakers: Damian W. I. Rouson, Karla Morris, Magne Haveraaen, Jim Xia, Sameer Shende
   Location: Mile High Pre-Function
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=post259

PANELS

1. Date/Day: Wednesday the 20th
   Time: 3:30PM - 5:00PM
   Session: Fault Tolerance/Resilience at Petascale/Exascale: Is it Really Critical? Are Solutions Necessarily Disruptive?
   Presentation: Fault Tolerance/Resilience at Petascale/Exascale: Is it Really Critical? Are Solutions Necessarily Disruptive?
   Speakers: Franck Cappello, Marc Snir, Bronis De Supinski, Al Geist, John Daly, Ana Gainaru, Satoshi Matsuoka
   Location: 301/302/303
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pan101

2. Date/Day: Thursday the 21st
   Time: 3:30PM - 5:00PM
   Session: SC13 Silver Anniversary: Retrospective on Supercomputing Technologies
   Presentation: SC13 Silver Anniversary: Retrospective on Supercomputing Technologies
   Speakers: Mary Hall, Fran Berman, David Keyes, Kenichi Miura, Warren Washington, Hans Zima
   Location: Mile High
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pan109

3. Date/Day: Friday the 22nd
   Time: 10:30AM - 12:00PM
   Session: Big Computing: From the Exa-Scale to the Sensor-Scale
   Presentation: Big Computing: From the Exa-Scale to the Sensor-Scale
   Speakers: Frederica Darema, James Kahle, Sangtae Kim, Robert Lucas, Burton Smith, Kathy Yelick
   Location: 405/406/407
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pan115

TUTORIALS AND HPC UNDERGRADUATE

1. Date/Day: Sunday the 17th
   Time: 8:30AM - 5:00PM
   Session: Hands-On Practical Hybrid Parallel Application Performance Engineering
   Presentation: Hands-On Practical Hybrid Parallel Application Performance Engineering
   Speakers: Markus Geimer, Sameer Shende, Bert Wesarg, Brian Wylie
   Location: 301
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=tut120

2. Date/Day: Monday the 18th
   Time: 8:30AM - 5:00PM
   Session: Linear Algebra Libraries for HPC: Scientific Computing with Multicore and Accelerators
   Presentation: Linear Algebra Libraries for HPC: Scientific Computing with Multicore and Accelerators
   Speakers: Jack Dongarra, James Demmel, Michael Heroux, Jakub Kurzak
   Location: 407
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=tut143

3. Date/Day: Monday the 18th
   Time: 3:00PM - 5:00PM
   Session: Experiencing HPC For Undergraduates Orientation
   Presentation: HPC for Undergraduates "boot camp"
   Speakers: Jeffrey K. Hollingsworth, Alan Sussman
   Location: Room 703
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pec131

4. Date/Day: Tuesday the 19th
   Time: 10:30AM - 12:00PM
   Session: Introduction to HPC Research Topics
   Presentation: Introduction to HPC Research Topics
   Speakers: John Mellor-Crummey, Chris Johnson, Laura Carrington, Andrew Chien
   Location: 703
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pec135

BROADER ENGAGEMENT, BOOTH ACTIVITIES, AND BoFs

1. Date/Day: Sunday the 17th
   Time: 11:00AM - 11:30AM
   Session: Fortran 2008 Coarrays and Performance Analysis
   Presentation: Performance Evaluation Using the TAU Performance System
   Speakers: Sameer Shende
   Location: 705/707/709/711
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=bespkr114

2. Date/Day: Sunday the 17th
   Time: 11:30AM - 12:00PM
   Session: Fortran 2008 Coarrays and Performance Analysis
   Presentation: Panel on Fortran 2008 Coarrays and Performance Analysis
   Speakers: Fernanda Foertter, Karla Morris, Sameer Shende
   Location: 705/707/709/711
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=bespkr115

3. Date/Day and Time: Monday 18th, 6:00PM to 9:00PM; Tuesday 19th, 9:00AM to 5:30PM; Tuesday 19th, 6:00PM to 9:00PM; Wednesday 20th, 9:00AM to 5:30PM; Thursday 21st, 9:00AM to 2:00PM
   Session: Emerging Technologies
   Presentation: Scalable Tools for Debugging, Performance Analysis and Performance Visualization
   Speakers: Martin Schulz, Abhinav Bhatele, Peer-Timo Bremer, Todd Gamblin
   Location: Booth 3547
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=emt127_3

4. Date/Day: Tuesday the 19th
   Time: 12:15PM - 1:15PM
   Session: Open MPI State of the Union
   Presentation: Open MPI State of the Union
   Speakers: Jeffrey Squyres, George Bosilca
   Location: 301/302/303
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=bof130

5. Date/Day: Wednesday the 20th
   Time: 3:30PM - 5:00PM
   Session: Broader Engagement Session VI: Journal Publishing 101
   Presentation: Journal Publishing 101 for Postdocs and New Faculty
   Speakers: Jeff Hollingsworth, Rebecca Capone
   Location: Room 703
   More information: http://sc13.supercomputing.org/schedule/event_detail.php?evid=pec124

SERVICE BY MEMBERS OF SUPER

1. Boyana Norris: Tutorial Committee member
2. David H. Bailey: ACM Gordon Bell Award Chair, ACM Gordon Bell Prize Committee Member, Member Performance, Analysis, & Tools (PAT), Test of Time Award Committee Member, Silver Anniversary Committee Member
3. Leonid Oliker: Test of Time Award Vice-Chair
4. Bronis R. de Supinski: Posters Committee Member, Member System Software, Tutorials Chair
5. Todd Gamblin: Panels Committee Member, Member Performance, Analysis, & Tools (PAT)
6. Philip C. Roth: BoFs Chair, Posters Committee Member, Member Performance, Analysis, & Tools (PAT), co-organizer of the 2013 International Workshop on Data Intensive Scalable Computing Systems (DISCS-2013), in cooperation with SIGHPC
7. Patrick Haven Worley: Member Algorithm
8. Jeffrey K. Hollingsworth: Test of Time Award Committee Member, Experiencing HPC for Undergraduates Chair
9. Allen Malony: Area Co-Chair, Performance, Analysis & Tools (PAT)
10. Sameer Shende: Member Performance, Analysis, & Tools (PAT)
11. Jacqueline Chame: BoFs Committee Member
12. Robert F. Lucas: Emerging Technologies Co-Chair, Invited Talks Committee Member, Member System Software
13. Jack Dongarra: Technical Program Deputy Chair, Test of Time Award Committee Member
14. Shirley Moore: Member Performance, Analysis, & Tools (PAT), Tutorials Committee Member

SUPER website: http://www.super-scidac.org/
Contact: Bob Lucas, [email protected]
Newsletter Editors: Shirley Moore, [email protected] and Sarala Arunagiri
Webpage Creator: Akshita Gurram