April 6th, 2016 – San José, CA

How to Deal with Radiation:
Evaluation and Mitigation of GPU Soft Errors

Paolo Rech
Motivation: Automotive Applications

Pedestrian Detection System: embedded GPUs increase car safety.

[Image: pedestrian-detection output; one frame is labeled "Observed error"]

"The insurance does not cover those accidents caused by: […] exposure to ionizing radiation"*
*Paolo's car insurance
Motivation: HPC Industry

Titan (Oak Ridge National Lab): 18,688 GPUs.
High probability of having a GPU corrupted: Titan's MTBF is ~44h* *(field data from Tiwari et al., HPCA'15)

Only Crashes/Hangs were considered in that field study (in the field the correct output is unknown, so SDCs cannot be counted).
We perform radiation experiments to measure Silent Data Corruption (SDC) rates.
Outline

• Radiation Effects Essentials
• Evaluation of GPU Radiation Sensitivity
  - Experimental Setup
  - Parallel Algorithms Error Rates
• Hardening Solutions Efficiency
• Code Optimization Effects on HPC Reliability
• What's the Plan?
Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere, generating a shower of energetic particles: muons, pions, protons, gamma rays, neutrons.

~13 n/(cm² h) at sea level; the neutron flux increases exponentially with altitude.
Radiation Effects - Soft Errors

Soft Errors: the device is not permanently damaged, but the particle may generate:

• One or more bit-flips: Single Event Upset (SEU), Multiple Bit Upset (MBU)
  [Diagram: an ionizing particle strikes a memory cell, flipping the stored bits (0→1, 1→0)]

• A transient voltage pulse: Single Event Transient (SET)
  [Diagram: an ionizing particle strikes combinational logic; the resulting pulse may be latched by a flip-flop (FF)]
Radiation Effects on GPUs

[Diagram, shown as a sequence of builds: a CUDA GPU with a Blocks Scheduler and Dispatcher, an array of Streaming Multiprocessors (SMs), a shared L2 Cache, and DRAM; each SM contains an Instruction Cache, two Warp Schedulers with Dispatch Units, a Register File, many cores, and Shared Memory / L1 Cache. Successive builds add strike marks (X) on cores, the Register File, the Shared Memory / L1 Cache, the L2 Cache, DRAM, and the Blocks Scheduler, illustrating that a single particle can corrupt one core's result, shared data used by many threads, or the scheduling of the whole GPU.]
Silent Data Corruption vs Crash&Hang

Silent Data Corruption: errors in
- data cache
- register file
- logic gates (ALU)
- scheduler

Crash & Hang: errors in
- instruction cache
- scheduler / dispatcher
- PCI-e bus controller
Radiation Test Facilities

Weapons Neutron Research facility (LANSCE, Los Alamos National Laboratory)
Neutrons Spectrum

@LANSCE: 1.8x10⁹ n/(cm² h)
@NYC: 13 n/(cm² h)

cross section [cm²] = (errors/s) / (flux [n/(cm² s)])
The cross section is the probability for 1 neutron to generate an output error.

Error Rate = cross section × flux at sea level (13 n/(cm² h))
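As a worked illustration of these two formulas, here is a minimal sketch; every numeric value except the two fluxes above is hypothetical:

#include <cstdio>

// Minimal sketch of the cross-section and error-rate formulas above.
// Hypothetical numbers: only the two fluxes come from the slide.
int main() {
    double errors_observed = 100.0;   // errors counted during the beam test (hypothetical)
    double hours_in_beam   = 10.0;    // exposure time (hypothetical)
    double beam_flux       = 1.8e9;   // LANSCE flux [n/(cm^2 h)]

    // cross section [cm^2] = (errors per hour) / flux
    double cross_section = (errors_observed / hours_in_beam) / beam_flux;

    double nyc_flux   = 13.0;                       // sea-level flux [n/(cm^2 h)]
    double error_rate = cross_section * nyc_flux;   // expected errors per hour @NYC
    double fit        = error_rate * 1e9;           // FIT: errors per 10^9 device-hours

    printf("cross section = %.3g cm^2, FIT @NYC = %.3g\n", cross_section, fit);
    return 0;
}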
GPU Radiation Test Setup

[Photo: boards aligned along the beam line: SoCs, Flash, FPGAs, a GPU, an APU, and microcontrollers]
GPU Radiation Test Setup

[Photo: Intel Xeon Phi, NVIDIA K20, and AMD APU boards under test, driven by desktop PCs. The GPU power control circuitry is kept out of the beam.]
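A sketch of how SDCs are counted with such a setup (an assumed harness based on these slides, not the authors' code; run_kernel() is a stub standing in for the real CUDA launch):

#include <cstdio>
#include <cstring>

// Hypothetical beam-test harness: run the kernel repeatedly under the beam,
// compare each output against a pre-computed golden copy, classify outcomes.
static bool run_kernel(float* out, int n) {     // stub standing in for a CUDA launch
    for (int i = 0; i < n; i++) out[i] = 2.0f * i;
    return true;                                // false would mean a CUDA error (Crash)
}

int main() {
    const int n = 1024;
    float golden[n], out[n];
    run_kernel(golden, n);                      // golden output, computed before irradiation

    unsigned long runs = 0, sdc = 0;
    for (; runs < 100000; runs++) {             // under beam this loops for hours
        if (!run_kernel(out, n)) { /* count a Crash, reset the board */ continue; }
        if (memcmp(out, golden, sizeof(out)) != 0)
            sdc++;                              // Silent Data Corruption detected
    }
    printf("runs=%lu SDCs=%lu\n", runs, sdc);
    return 0;
}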
Tested Parallel Codes

- Matrix Multiplication (linear algebra)
- Matrix Transpose (memory)
- FFT (signal processing)
- Needleman–Wunsch (biology)
- lavaMD (physical simulations)
- Hotspot (physical simulations)
- HOG (pedestrian detection)

The selected algorithms are heterogeneous and representative.
Experimental Results (ECC OFF)

SDC rate varies by ~3 orders of magnitude (details in Oliveira et al., Trans. Comp. 2015).

[Chart: Failures In Time (FIT) @NYC, log scale 1 to 10000, Crashes and SDC bars for MxM, MTrans, FFT, NW, lavaMD, Hotspot. Annotations: "execution dominated by memory latencies"; "codes that heavily employ registers"; "higher #instructions".]

Matrix Multiplication: 6.46x10² FIT, i.e. 1 error every 15 years.
Titan: 18,688 errors every 15 years, i.e. 1 error every 7.3h (the per-GPU interval divided across 18,688 GPUs).
Error Correction Code - SDC

ECC reduces the SDC FIT by ~1 order of magnitude (there is almost no code dependence).

[Chart: Failures In Time @NYC, log scale 1 to 10000, ECC OFF vs ECC ON for MxM, FFT, NW, lavaMD, Hotspot]
Error Correction Code - Crash

ECC increases the Crash FIT by about 50% (there is almost no code dependence): Double Bit Errors cause a crash, and the scheduler is not protected.

[Chart: Failures In Time @NYC, log scale 1 to 10000, ECC OFF vs ECC ON for MxM, FFT, NW, lavaMD, Hotspot]
ECC ON – SDC vs Crashes

When the ECC is ON, Crashes are more likely to occur than SDCs (this is GOOD for HPC centers!).

[Chart: Failures In Time @NYC, log scale 1 to 10000, Crash vs SDC for MxM, FFT, NW, lavaMD, Hotspot]
Algorithm Based Fault Tolerance

ABFT: a technique designed specifically for an algorithm.
ABFT requires: input coding, algorithm modification, and output decoding with error detection/correction.

[Diagram: A x B = M with checksums. A is extended with a checksum row (its column sums), B with a checksum column (its row sums); after the multiplication, the row-sums and col-sums of M are compared against the row-check and col-check values. Huang and Abraham '84; Freivalds '79; Rech et al., TNS '13]
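A minimal host-side sketch of the checksum scheme (plain C++ for clarity, not the talk's GPU implementation): extend A with a column-sum row and B with a row-sum column; the extended product then carries its own checksums, and a single corrupted element sits at the intersection of the failing row check and column check.

#include <cstdio>
#include <cmath>

// Minimal ABFT matrix-multiply sketch (checksum scheme of Huang & Abraham '84),
// with single-error detection and correction.
const int N = 4;                       // small square matrices for brevity
double A[N+1][N], B[N][N+1], C[N+1][N+1];

int main() {
    // Fill A and B with arbitrary data.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }

    // Input coding: extra row of A holds column sums, extra column of B holds row sums.
    for (int j = 0; j < N; j++) { A[N][j] = 0; for (int i = 0; i < N; i++) A[N][j] += A[i][j]; }
    for (int i = 0; i < N; i++) { B[i][N] = 0; for (int j = 0; j < N; j++) B[i][N] += B[i][j]; }

    // C = A x B; the extended product computes the checksums automatically.
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j];
        }

    C[1][2] += 42;                     // inject a fault to exercise the decoder

    // Output decoding: a failing row check and column check intersect at the error.
    int bad_i = -1, bad_j = -1;
    for (int i = 0; i < N; i++) {
        double s = 0; for (int j = 0; j < N; j++) s += C[i][j];
        if (fabs(s - C[i][N]) > 1e-9) bad_i = i;
    }
    for (int j = 0; j < N; j++) {
        double s = 0; for (int i = 0; i < N; i++) s += C[i][j];
        if (fabs(s - C[N][j]) > 1e-9) bad_j = j;
    }
    if (bad_i >= 0 && bad_j >= 0) {
        double s = 0; for (int j = 0; j < N; j++) if (j != bad_j) s += C[bad_i][j];
        C[bad_i][bad_j] = C[bad_i][N] - s;   // correct using the row checksum
        printf("corrected C[%d][%d]\n", bad_i, bad_j);
    }
    return 0;
}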
FFT Hardening Idea

[Diagram: the unhardened FFT is wrapped by input coding and output de-coding with error detection. J.Y. Jou and Abraham '88; Pilla et al., TNS '13]
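The slide does not detail the coding scheme. To give a flavor of FFT output checking, the sketch below uses Parseval's theorem, a classic detection idea and a simpler stand-in for the weighted checksums of Jou and Abraham (the naive DFT is only there to make the example self-contained):

#include <cstdio>
#include <cmath>
#include <complex>
#include <vector>

// Sketch of an FFT output check via Parseval's theorem:
//   sum |x[n]|^2 == (1/N) * sum |X[k]|^2
// A corrupted output bin breaks the energy balance (up to rounding error).
using cd = std::complex<double>;
const double PI = std::acos(-1.0);

std::vector<cd> dft(const std::vector<cd>& x) {   // naive O(N^2) DFT for brevity
    size_t N = x.size();
    std::vector<cd> X(N);
    for (size_t k = 0; k < N; k++)
        for (size_t n = 0; n < N; n++)
            X[k] += x[n] * std::polar(1.0, -2.0 * PI * k * n / N);
    return X;
}

int main() {
    std::vector<cd> x = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<cd> X = dft(x);

    double ein = 0, eout = 0;
    for (auto& v : x) ein  += std::norm(v);       // input energy
    for (auto& v : X) eout += std::norm(v);       // output energy
    eout /= x.size();

    puts(std::fabs(ein - eout) < 1e-6 * ein ? "FFT output OK" : "SDC detected");
    return 0;
}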
ECC vs ABFT

ECC reduces the SDC FIT ~10 times; ABFT reduces it ~56 times!
ECC increases Crashes by 50%; ABFT by only 10%!

[Chart: FIT, log scale 1 to 10000, Unhardened vs ECC vs ABFT, with SDC and Crash bars for MxM and FFT]
ECC vs ABFT

ECC overhead for MxM is 10%, for FFT 50%! ABFT overhead is less than 20%.

[Chart: normalized execution time, 0 to 1.6, Unhardened vs ECC vs ABFT for MxM and FFT]
Duplication With Comparison

Spatial DWC: blocks i and i+N are duplicates.
[Timeline: SM0 runs blocks a, b, c, d; SM1 runs the copies a', b', c', d']

E-O (Even-Odd) Spatial DWC: blocks i and i+1 are duplicates.
[Timeline: SM0 runs b, b', d, d'; SM1 runs a, a', c, c'; a block and its copy execute on the same SM]

Time DWC: each thread executes its operations twice.
[Timeline: SM0 runs b & b', d & d'; SM1 runs a & a', c & c']
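A minimal CUDA sketch of the Time DWC idea (a hypothetical kernel, not the papers' code); note that a real implementation must keep the compiler from collapsing the duplicated computation into a single one:

#include <cstdio>

// Hypothetical Time DWC kernel: each thread computes its result twice and
// flags a mismatch so the host can re-execute the kernel.
__global__ void saxpy_time_dwc(int n, float a, const float* x, const float* y,
                               float* out, int* sdc_flag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // volatile loads keep the compiler from merging the two executions
    volatile const float* vx = x;
    volatile const float* vy = y;
    float r1 = a * vx[i] + vy[i];    // first execution
    float r2 = a * vx[i] + vy[i];    // duplicated execution
    if (r1 != r2)                    // a transient fault corrupted one of them
        atomicExch(sdc_flag, 1);     // detection only; recovery is a re-execution
    out[i] = r1;
}

Because duplication happens inside each thread, no extra blocks are scheduled, which matches the observation below that only Time DWC avoids increasing Crashes.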
Hotspot - DWC results*

Spatial DWC detects all SDCs.
E-O Spatial DWC detects 80% of SDCs.
Time DWC detects 90% of SDCs.

Only Time DWC reduces Crashes (no additional Blocks scheduling is required).

DWC is promising: it is generic, easily implemented, and effective...
BUT the execution time overhead is 2.5x for Spatial DWC and E-O Spatial DWC, and 2x for Time DWC (data is not copied).
=> Duplicate only the code's critical portions.

[Chart: FIT, log scale 1 to 1000, SDC and Crash bars for Unhardened, ECC, Spatial DWC, E-O Spatial DWC, Time DWC]

*details in Oliveira et al., Trans. Nucl. Sci., 2014
Code Optimizations (just baked!)

Novel and incremental algorithm implementations are continuously developed [Rodinia suite].
Do code optimizations impact GPU reliability?

Three case studies (naïve vs optimized), each with different input sizes, since on GPUs the benefit of an optimization depends on the workload (a generic sketch of the naïve-vs-optimized contrast for MxM follows this list):
- Matrix Multiplication
- FFT
- Needleman–Wunsch
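For concreteness, the naïve-vs-optimized contrast for MxM typically looks like the sketch below (generic CUDA kernels, not the exact codes tested; TILE x TILE thread blocks and n divisible by TILE are assumed):

#define TILE 16

// Naive MxM: every operand is fetched from global memory.
__global__ void mxm_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; k++) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Optimized MxM: tiles staged in shared memory are reused TILE times each,
// raising the hit rate in on-chip memories.
__global__ void mxm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; k++) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

The tiled version's higher on-chip hit rate is exactly the behavior analyzed on the next slide.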
Experimental Results – MxM

Opt-MxM FIT is higher. Errors in obsolete (dead) data are NOT critical; the optimized code's higher hit rate in the caches means more of the cached data is still live, hence a higher FIT.
The ~20% FIT increase with input size is caused by the additional threads instantiated.

[Chart: normalized FIT (a.u.), 1 to 26, Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash, for input sizes 1024, 2048, 4096, 8192]
Mean Workload Between Failures

The optimized code has a higher cross section and FIT, but a shorter execution time, so fewer neutrons hit the GPU while it computes. We need to consider cross section, execution time, and throughput together.

Mean WORKLOAD Between Failures (MWBF): the amount of data produced before a failure.

[Chart: GFLOPs, 0 to 600, MxM-naive vs MxM-opt for input sizes 1024, 2048, 4096, 8192]
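Putting the three quantities together (a minimal sketch; all numbers are hypothetical, and the formula's structure is an inference from the definition above, not a formula shown on the slide):

#include <cstdio>

// Sketch of the Mean Workload Between Failures metric:
//   MTBF = 1 / (cross_section * flux)        [hours]
//   MWBF = data_per_run * MTBF / exec_time   [data elements before a failure]
int main() {
    double flux = 13.0;                                  // NYC flux [n/(cm^2 h)]
    double sigma_naive = 2.0e-12, sigma_opt = 4.0e-12;   // [cm^2] (hypothetical)
    double t_naive = 10.0,        t_opt = 2.0;           // run time [s] (hypothetical)
    double W = 8192.0 * 8192.0;                          // output elements per run

    double mwbf_naive = W / (sigma_naive * flux * (t_naive / 3600.0));
    double mwbf_opt   = W / (sigma_opt   * flux * (t_opt   / 3600.0));

    // Here the optimized code has twice the cross section but runs 5x faster,
    // so it produces 2.5x more correct data between failures.
    printf("MWBF naive = %.3g, MWBF opt = %.3g\n", mwbf_naive, mwbf_opt);
    return 0;
}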
MxM - MWBF

Opt-MxM produces more correct data than Naïve-MxM, and its efficiency increases with input size!
If the code is optimized, the throughput increases more than the error rate!

[Chart: MWBF (data elaborated), up to 4x10¹³, Naive-SDC vs Opt-SDC for input sizes 1024, 2048, 4096, 8192]
What's The Plan?

Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.
Self-Driving Cars: reliability is a major concern!

How we can help:
- Understand SDC criticality. Not all errors significantly affect the output: are there "acceptable" SDCs?
- Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters)
- Understand how algorithm/code/compiler optimizations will impact future machines' error rates
- Use fault injection to better understand error propagation
Acknowledgments
Caio Lunardi
Caroline Aguiar
Laercio Pilla
Daniel Oliveira
Vinicius Frattin
Philippe Navaux
Luigi Carro
Nathan DeBardeleben
Sean Blanchard
Heather Quinn
Thomas Fairbanks
Steve Wender
Timothy Tsai
Siva Hari
Steve Keckler
Chris Frost
David Kaeli
NUCAR group