How to Deal with Radiation: Evaluation and Mitigation of GPUs Soft-Errors
Paolo Rech – GTC 2016, April 6th 2016, San José, CA

Motivation: Automotive Applications
Pedestrian detection systems: embedded GPUs increase car safety. [Figure: pedestrian-detection output with an observed radiation-induced error.] And the fine print offers no comfort: "The insurance does not cover those accidents caused by: [...] exposure to ionizing radiation"* (*Paolo's car insurance).

Motivation: HPC Industry
Titan (Oak Ridge National Lab) hosts 18,688 GPUs, so the probability of having a corrupted GPU is high: Titan's MTBF is ~44 h* (*field data from Tiwari et al., HPCA'15). That field study considers only crashes and hangs, because in production the correct output is unknown. We perform radiation experiments to also measure Silent Data Corruption (SDC) rates.

Outline
- Radiation Effects Essentials
- Evaluation of GPU Radiation Sensitivity: Experimental Setup; Parallel Algorithms Error Rates
- Hardening Solutions Efficiency
- Code Optimizations Effects on HPC Reliability
- What's the Plan?

Radiation Effects Essentials

Terrestrial Radiation Environment
Galactic cosmic rays interact with the atmosphere and produce a shower of energetic particles: muons, pions, protons, gamma rays, and neutrons. The neutron flux is about 13 n/(cm²·h) at sea level and increases exponentially with altitude.

Radiation Effects - Soft Errors
Soft errors: the device is not permanently damaged, but an ionizing particle may generate:
- One or more bit-flips in storage elements: a Single Event Upset (SEU) flips a single bit (0 to 1 or 1 to 0), while a Multiple Bit Upset (MBU) flips several at once.
- A transient voltage pulse in combinational logic: a Single Event Transient (SET), which becomes an error only if it is latched by a flip-flop.
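To make an SEU concrete, here is a minimal host-side sketch (not from the talk; the flip_bit helper is hypothetical) showing how a single bit-flip in an IEEE-754 float can change a value slightly or catastrophically, depending on which bit is hit:

    #include <cstdio>
    #include <cstdint>
    #include <cstring>

    // Hypothetical helper: flip bit 'pos' (0 = LSB) of a 32-bit float,
    // mimicking a Single Event Upset in a register or memory cell.
    static float flip_bit(float value, int pos) {
        uint32_t bits;
        memcpy(&bits, &value, sizeof bits);  // reinterpret the bits safely
        bits ^= 1u << pos;                   // the SEU: exactly one bit-flip
        memcpy(&value, &bits, sizeof value);
        return value;
    }

    int main() {
        float x = 1.0f;
        // A low mantissa bit barely perturbs the value...
        printf("mantissa bit 0:  %g -> %g\n", x, flip_bit(x, 0));
        // ...while the top exponent bit turns 1.0 into +inf.
        printf("exponent bit 30: %g -> %g\n", x, flip_bit(x, 30));
        return 0;
    }

Either outcome is silent: no flag is raised, and only the corrupted value reveals that a particle struck.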
Radiation Effects on GPUs
A CUDA GPU consists of a blocks scheduler and dispatcher, an L2 cache, DRAM, and an array of Streaming Multiprocessors (SMs); each SM contains an instruction cache, warp schedulers with dispatch units, a register file, CUDA cores, and shared memory / L1 cache. [Figure: GPU and SM block diagrams, with particle strikes (X) marked on cores, register file, shared memory/L1, L2 cache, DRAM, and the schedulers.] A particle can strike any of these resources, and the observed effect depends on which one is hit.

Silent Data Corruption vs Crash & Hang
Errors in the data cache, register file, logic gates (ALU), or scheduler lead to Silent Data Corruption: the program terminates normally but delivers a wrong output, with no error indication. Errors in the instruction cache, scheduler/dispatcher, or PCI-e bus controller lead to a Crash or Hang.

Evaluation of GPU Radiation Sensitivity

Radiation Test Facilities
Experiments were performed at the Weapons Neutron Research (WNR) facility of the Los Alamos Neutron Science Center (LANSCE), whose beam reproduces the atmospheric neutron spectrum.

Neutron Spectrum @LANSCE
The accelerated neutron flux at LANSCE is 1.8×10⁹ n/(cm²·h), against 13 n/(cm²·h) in New York City. The cross section is the probability for one neutron to generate an output error:

    cross section σ [cm²] = (observed errors/s) / (beam flux [n/(cm²·s)])

Multiplying the cross section by the natural flux gives the expected error rate in the field:

    error rate @NYC = σ × 13 n/(cm²·h)
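As a worked example of these formulas (the two fluxes are from the slide; the cross section value is assumed purely for illustration):

    % Acceleration factor: one hour in the beam emulates ~1.4e8 hours
    % (~16,000 years) of natural exposure at NYC.
    \frac{\phi_{\mathrm{LANSCE}}}{\phi_{\mathrm{NYC}}}
      = \frac{1.8\times10^{9}\,\mathrm{n/(cm^{2}\,h)}}{13\,\mathrm{n/(cm^{2}\,h)}}
      \approx 1.4\times10^{8}

    % Field error rate for an assumed measured cross section:
    \sigma = 5\times10^{-7}\,\mathrm{cm^{2}} \;\Rightarrow\;
    \lambda = \sigma\,\phi_{\mathrm{NYC}}
            = 5\times10^{-7}\times 13
            = 6.5\times10^{-6}\,\mathrm{errors/h}

    % FIT (Failures In Time) = errors per 10^9 device-hours:
    \mathrm{FIT} = \lambda\times10^{9} = 6.5\times10^{3}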
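The "errors/s" numerator comes from a harness that runs the code repeatedly under the beam and classifies each run. A minimal sketch (my illustration, with a hypothetical kernel; a real setup also needs an external watchdog to catch hangs and power-cycle the board):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Hypothetical kernel under test; any of the benchmarked codes would do.
    __global__ void kernel_under_test(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i] + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        float *golden = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; i++) { in[i] = (float)i; golden[i] = 2.0f * in[i] + 1.0f; }

        for (int run = 0; run < 1000000; run++) {     // loop for the whole beam time
            kernel_under_test<<<(n + 255) / 256, 256>>>(in, out, n);
            cudaError_t err = cudaDeviceSynchronize();
            if (err != cudaSuccess) {                 // abnormal termination: Crash
                printf("run %d: CRASH (%s)\n", run, cudaGetErrorString(err));
                break;                                // hangs need the external watchdog
            }
            int corrupted = 0;                        // normal termination: verify output
            for (int i = 0; i < n; i++)
                if (out[i] != golden[i]) corrupted++;
            if (corrupted > 0)                        // wrong output, no error raised: SDC
                printf("run %d: SDC, %d corrupted elements\n", run, corrupted);
        }
        free(golden);
        return 0;
    }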
GPU Radiation Test Setup
[Figure: devices aligned on the beam line: SoCs, flash memories, FPGAs, GPUs, APUs, and microcontrollers.] In the GPU campaign, an Intel Xeon Phi, an NVIDIA K20, and an AMD APU are placed in the beam; the GPU power-control circuitry and the host desktop PCs are kept out of the beam, so only the devices under test are irradiated.

Tested Parallel Codes
- Matrix Multiplication (linear algebra)
- Matrix Transpose (memory)
- FFT (signal processing)
- Needleman–Wunsch (biology)
- lavaMD (physical simulations)
- Hotspot (physical simulations)
- HOG (pedestrian detection)
The selected algorithms are heterogeneous and representative.

Experimental Results (ECC OFF)
The SDC rate varies by ~3 orders of magnitude across codes (details in Oliveira et al., IEEE Trans. on Computers, 2015). [Figure: Crash and SDC Failures In Time @NYC, log scale 1 to 10,000, for MxM, MTrans, FFT, NW, lavaMD, Hotspot.] Codes whose execution is dominated by memory latencies sit at the low end; codes with higher instruction counts that heavily employ registers sit at the high end. Matrix Multiplication: 6.46×10² FIT, i.e., 1 error every 15 years on a single GPU; on Titan, that is 18,688 errors every 15 years, or 1 error every 7.3 h.

Error Correction Code - SDC
ECC reduces the SDC FIT by ~1 order of magnitude, with almost no code dependence. [Figure: SDC FIT @NYC, ECC OFF vs ECC ON, for MxM, FFT, NW, lavaMD, Hotspot.] (On NVIDIA GPUs such as the K20, ECC is typically toggled with nvidia-smi -e 0/1 followed by a GPU reset.)

Error Correction Code - Crash
ECC increases the Crash FIT by about 50%, again with almost no code dependence: Double Bit Errors, which ECC detects but cannot correct, cause a crash, and the scheduler is not protected. [Figure: Crash FIT @NYC, ECC OFF vs ECC ON.]

ECC ON - SDC vs Crashes
When ECC is ON, crashes are more likely to occur than SDCs. This is GOOD for HPC centers: a crash is detected and can be handled with checkpoint/restart, while an SDC passes unnoticed. [Figure: Crash vs SDC FIT @NYC with ECC ON.]

Hardening Solutions Efficiency

Algorithm Based Fault Tolerance
ABFT is a hardening technique designed specifically for an algorithm. It requires input coding, a modification of the algorithm, and output decoding with error detection/correction. For matrix multiplication, A is extended with a row of column checksums and B with a column of row checksums, so the product C carries its own row and column checksums; after the multiplication, the checksums are recomputed from C and compared (Huang and Abraham '84; Freivalds '79; Rech et al., IEEE TNS '13). [Figure: A × B = C with checksum row and column, row-check vs row-sum and col-check vs col-sum comparisons.]
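A minimal, detection-only sketch of this checksum idea for C = A × B (host-side and simplified for brevity, not the authors' implementation; the GPU versions keep the checksums inside the device matrices and can also locate and correct the corrupted element):

    #include <cstdio>
    #include <cmath>
    #include <vector>

    // Detection-only MxM checksum test (all matrices n x n, row-major).
    static bool abft_check(const std::vector<float> &A, const std::vector<float> &B,
                           const std::vector<float> &C, int n) {
        const float tol = 1e-2f;  // rounding tolerance; tuned per application in practice
        // Column checksums: (1^T A) B must equal 1^T C -- O(n^2) work vs O(n^3).
        std::vector<float> colsumA(n, 0.0f), expected(n, 0.0f), colsumC(n, 0.0f);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                colsumA[j] += A[i * n + j];
                colsumC[j] += C[i * n + j];
            }
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                expected[j] += colsumA[k] * B[k * n + j];
        for (int j = 0; j < n; j++)
            if (std::fabs(expected[j] - colsumC[j]) > tol)
                return false;     // checksum mismatch: SDC detected
        return true;              // a full ABFT also checks row checksums: C*1 vs A*(B*1)
    }

    int main() {
        const int n = 64;
        std::vector<float> A(n * n), B(n * n), C(n * n, 0.0f);
        for (int i = 0; i < n * n; i++) { A[i] = (i % 7) * 0.5f; B[i] = (i % 5) * 0.25f; }
        for (int i = 0; i < n; i++)       // reference multiply, standing in for the GPU kernel
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
        printf("clean run:      %s\n", abft_check(A, B, C, n) ? "OK" : "SDC detected");
        C[3 * n + 4] += 1.0f;             // inject a fault into one output element
        printf("after bit-flip: %s\n", abft_check(A, B, C, n) ? "OK" : "SDC detected");
        return 0;
    }

The check costs O(n²) against the O(n³) of the multiplication itself, which is why the measured ABFT overhead stays low.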
FFT Hardening Idea
The same philosophy applies to FFT: the input is encoded, the unhardened FFT runs unchanged, and the output is decoded with error detection (J.Y. Jou and Abraham '88; Pilla et al., IEEE TNS '13). [Figure: input coding, unhardened FFT, output decoding with error detection.]

ECC vs ABFT
ECC reduces the FIT by ~10 times; ABFT reduces it by ~56 times! Moreover, ECC increases crashes by 50%, while ABFT increases them by only 10%. [Figure: SDC and Crash FIT, log scale, Unhardened vs ECC vs ABFT, for MxM and FFT.] In execution time, the ECC overhead is 10% for MxM and 50% for FFT, while the ABFT overhead is below 20% for both. [Figure: normalized execution time, Unhardened vs ECC vs ABFT, for MxM and FFT.]

Duplication With Comparison
Three DWC schemes are compared:
- Spatial: blocks i and i+N are duplicated, so the copies (a and a', b and b', ...) run on different SMs.
- E-O (Even-Odd) Spatial: blocks i and i+1 are duplicated, so the two copies land on the same SM.
- Time: each thread executes its operations twice, with no extra blocks scheduled (a minimal kernel is sketched after the Hotspot results below).

Hotspot - DWC Results*
Spatial DWC detects all SDCs, E-O Spatial detects 80% of SDCs, and Time DWC detects 90%. Only Time DWC also reduces crashes, since it requires no additional block scheduling. [Figure: SDC and Crash FIT, log scale, Unhardened vs ECC vs Spatial DWC vs E-O Spatial DWC vs Time DWC.] DWC is promising: it is generic, easily implemented, and effective. BUT the execution-time overhead is 2.5x for Spatial DWC and E-O Spatial DWC, and 2x for Time DWC (data is not copied). The way forward: duplicate only the code's critical portions.
(*details in Oliveira et al., IEEE Trans. Nucl. Sci., 2014)
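A minimal sketch of the Time DWC scheme (my illustration on a hypothetical SAXPY-like kernel, not the talk's code): each thread computes its result twice and raises a flag on mismatch. The volatile qualifier forces both reads, and thus both computations, to actually happen, so the compiler cannot merge them into one.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Time DWC: each thread executes its operation twice and compares.
    __global__ void saxpy_time_dwc(const volatile float *x, const volatile float *y,
                                   float *out, float a, int n, int *error_flag) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float r1 = a * x[i] + y[i];   // first execution
        float r2 = a * x[i] + y[i];   // redundant execution (time redundancy)
        if (r1 != r2)                 // a transient fault hit one of the two
            atomicExch(error_flag, 1);
        out[i] = r1;
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y, *out; int *flag;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        cudaMallocManaged(&flag, sizeof(int));
        for (int i = 0; i < n; i++) { x[i] = 0.5f * i; y[i] = 1.0f; }
        *flag = 0;
        saxpy_time_dwc<<<(n + 255) / 256, 256>>>(x, y, out, 2.0f, n, flag);
        cudaDeviceSynchronize();
        printf(*flag ? "mismatch: SDC detected, recompute\n" : "duplicated results match\n");
        return 0;
    }

The spatial variants instead launch twice as many blocks and map the copies to different SMs (or to the same SM for E-O), which is exactly the extra block scheduling that increases the crash rate.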
Code Optimizations (just baked!)
Novel and incrementally optimized algorithm implementations are continuously being developed [Rodinia suite]. Do code optimizations impact GPU reliability? Three case studies compare naïve and optimized versions of Matrix Multiplication, FFT, and Needleman–Wunsch, with different input sizes (on GPUs, the effect of an optimization depends on the workload).

Experimental Results - MxM
The optimized MxM has a higher FIT. Errors in obsolete data are NOT critical, and the optimized code keeps more useful data in the caches: a higher hit rate in the caches means a higher FIT. In addition, the FIT increases by ~20% with input size, caused by the additional threads instantiated. [Figure: normalized FIT (a.u.), Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash, for input sizes 1024, 2048, 4096, 8192.]

Mean Workload Between Failures
Optimization increases the cross section and the FIT, but it also reduces the execution time (fewer neutrons hit the GPU per run) and raises the throughput. We therefore need a metric that considers cross section, execution time, and throughput together: the Mean WORKLOAD Between Failures (MWBF), the amount of data correctly produced before a failure. [Figure: throughput in GFLOPS vs input size (1024 to 8192), MxM-naive vs MxM-opt, up to ~600 GFLOPS.]

MxM - MWBF
Opt-MxM produces more correct data than Naïve-MxM, and its efficiency advantage grows with input size: when the code is optimized, the throughput increases more than the error rate. [Figure: MWBF (data elaborated), Naive-SDC vs Opt-SDC, for input sizes 1024 to 8192.]

What's The Plan?
Exascale = 55x Titan. Can we afford a 55x error rate? Probably not. Self-driving cars: reliability is a major concern! How we can help:
- Understand SDC criticality. Not all errors significantly affect the output: are there "acceptable" SDCs?
- Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters).
- Understand how algorithm/code/compiler optimizations will impact the error rate of future machines.
- Use fault injection to better understand error propagation.

Acknowledgments
Caio Lunardi, Caroline Aguiar, Laercio Pilla, Daniel Oliveira, Vinicius Frattin, Philippe Navaux, Luigi Carro, Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender, Timothy Tsai, Siva Hari, Steve Keckler, Chris Frost, David Kaeli, and the NUCAR group.