April 6th, 2016 – San José, CA

How to Deal with Radiation:
Evaluation and Mitigation of GPU Soft Errors

Paolo Rech
Motivation: Automotive Applications

Pedestrian Detection System: embedded GPUs increase car safety.

[Image: pedestrian-detection output; one frame is labeled "Observed error"]

"The insurance does not cover those accidents caused by: […] exposure to ionizing radiation"*
*Paolo's car insurance
Motivation: HPC Industry

Titan (Oak Ridge National Lab): 18,688 GPUs.
High probability of having a GPU corrupted: Titan's MTBF is ~44h* *(field data from Tiwari et al., HPCA'15)

Only Crashes/Hangs were considered in that field study (in the field the correct output is unknown, so SDCs cannot be counted).
We perform radiation experiments to measure Silent Data Corruption (SDC) rates.
Outline

• Radiation Effects Essentials
• Evaluation of GPU Radiation Sensitivity
  - Experimental Setup
  - Parallel Algorithms Error Rates
• Hardening Solutions Efficiency
• Code Optimization Effects on HPC Reliability
• What's the Plan?
Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere, generating a shower of energetic particles: muons, pions, protons, gamma rays, neutrons.

~13 n/(cm² h) at sea level; the neutron flux increases exponentially with altitude.
Radiation Effects - Soft Errors

Soft Errors: the device is not permanently damaged, but the particle may generate:

• One or more bit-flips: Single Event Upset (SEU), Multiple Bit Upset (MBU)
  [Diagram: an ionizing particle strikes a memory cell, flipping the stored bits (0→1, 1→0)]

• A transient voltage pulse: Single Event Transient (SET)
  [Diagram: an ionizing particle strikes combinational logic; the resulting pulse may be latched by a flip-flop (FF)]
Radiation Effects on GPUs

[Diagram, shown as a sequence of builds: a CUDA GPU with a Blocks Scheduler and Dispatcher, an array of Streaming Multiprocessors (SMs), a shared L2 Cache, and DRAM; each SM contains an Instruction Cache, two Warp Schedulers with Dispatch Units, a Register File, many cores, and Shared Memory / L1 Cache. Successive builds add strike marks (X) on cores, the Register File, the Shared Memory / L1 Cache, the L2 Cache, DRAM, and the Blocks Scheduler, illustrating that a single particle can corrupt one core's result, shared data used by many threads, or the scheduling of the whole GPU.]
Silent Data Corruption vs Crash&Hang

Silent Data Corruption: errors in
- data cache
- register file
- logic gates (ALU)
- scheduler

Crash & Hang: errors in
- instruction cache
- scheduler / dispatcher
- PCI-e bus controller
Radiation Test Facilities

Weapons Neutron Research facility (LANSCE, Los Alamos National Laboratory)
Neutrons Spectrum

@LANSCE: 1.8x10⁹ n/(cm² h)
@NYC: 13 n/(cm² h)

cross section [cm²] = (errors/s) / (flux [n/(cm² s)])
The cross section is the probability for 1 neutron to generate an output error.

Error Rate = cross section × flux at sea level (13 n/(cm² h))
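As a worked illustration of these two formulas, here is a minimal sketch; every numeric value except the two fluxes above is hypothetical:

#include <cstdio>

// Minimal sketch of the cross-section and error-rate formulas above.
// Hypothetical numbers: only the two fluxes come from the slide.
int main() {
    double errors_observed = 100.0;   // errors counted during the beam test (hypothetical)
    double hours_in_beam   = 10.0;    // exposure time (hypothetical)
    double beam_flux       = 1.8e9;   // LANSCE flux [n/(cm^2 h)]

    // cross section [cm^2] = (errors per hour) / flux
    double cross_section = (errors_observed / hours_in_beam) / beam_flux;

    double nyc_flux   = 13.0;                       // sea-level flux [n/(cm^2 h)]
    double error_rate = cross_section * nyc_flux;   // expected errors per hour @NYC
    double fit        = error_rate * 1e9;           // FIT: errors per 10^9 device-hours

    printf("cross section = %.3g cm^2, FIT @NYC = %.3g\n", cross_section, fit);
    return 0;
}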
GPU Radiation Test Setup

[Photo: boards aligned along the beam line: SoCs, Flash, FPGAs, a GPU, an APU, and microcontrollers]
GPU Radiation Test Setup

[Photo: Intel Xeon Phi, NVIDIA K20, and AMD APU boards under test, driven by desktop PCs. The GPU power control circuitry is kept out of the beam.]
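A sketch of how SDCs are counted with such a setup (an assumed harness based on these slides, not the authors' code; run_kernel() is a stub standing in for the real CUDA launch):

#include <cstdio>
#include <cstring>

// Hypothetical beam-test harness: run the kernel repeatedly under the beam,
// compare each output against a pre-computed golden copy, classify outcomes.
static bool run_kernel(float* out, int n) {     // stub standing in for a CUDA launch
    for (int i = 0; i < n; i++) out[i] = 2.0f * i;
    return true;                                // false would mean a CUDA error (Crash)
}

int main() {
    const int n = 1024;
    float golden[n], out[n];
    run_kernel(golden, n);                      // golden output, computed before irradiation

    unsigned long runs = 0, sdc = 0;
    for (; runs < 100000; runs++) {             // under beam this loops for hours
        if (!run_kernel(out, n)) { /* count a Crash, reset the board */ continue; }
        if (memcmp(out, golden, sizeof(out)) != 0)
            sdc++;                              // Silent Data Corruption detected
    }
    printf("runs=%lu SDCs=%lu\n", runs, sdc);
    return 0;
}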
Tested Parallel Codes

- Matrix Multiplication (linear algebra)
- Matrix Transpose (memory)
- FFT (signal processing)
- Needleman–Wunsch (biology)
- lavaMD (physical simulations)
- Hotspot (physical simulations)
- HOG (pedestrian detection)

The selected algorithms are heterogeneous and representative.
Experimental Results (ECC OFF)

SDC rate varies by ~3 orders of magnitude (details in Oliveira et al., Trans. Comp. 2015).

[Chart: Failures In Time (FIT) @NYC, log scale 1 to 10000, Crashes and SDC bars for MxM, MTrans, FFT, NW, lavaMD, Hotspot. Annotations: "execution dominated by memory latencies"; "codes that heavily employ registers"; "higher #instructions".]

Matrix Multiplication: 6.46x10² FIT, i.e. 1 error every 15 years.
Titan: 18,688 errors every 15 years, i.e. 1 error every 7.3h (the per-GPU interval divided across 18,688 GPUs).
Error Correction Code - SDC

ECC reduces the SDC FIT by ~1 order of magnitude (there is almost no code dependence).

[Chart: Failures In Time @NYC, log scale 1 to 10000, ECC OFF vs ECC ON for MxM, FFT, NW, lavaMD, Hotspot]
Error Correction Code - Crash

ECC increases the Crash FIT by about 50% (there is almost no code dependence): Double Bit Errors cause a crash, and the scheduler is not protected.

[Chart: Failures In Time @NYC, log scale 1 to 10000, ECC OFF vs ECC ON for MxM, FFT, NW, lavaMD, Hotspot]
ECC ON – SDC vs Crashes

When the ECC is ON, Crashes are more likely to occur than SDCs (this is GOOD for HPC centers!).

[Chart: Failures In Time @NYC, log scale 1 to 10000, Crash vs SDC for MxM, FFT, NW, lavaMD, Hotspot]
Algorithm Based Fault Tolerance

ABFT: a technique designed specifically for an algorithm.
ABFT requires: input coding, algorithm modification, and output decoding with error detection/correction.

[Diagram: A x B = M with checksums. A is extended with a checksum row (its column sums), B with a checksum column (its row sums); after the multiplication, the row-sums and col-sums of M are compared against the row-check and col-check values. Huang and Abraham '84; Freivalds '79; Rech et al., TNS '13]
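A minimal host-side sketch of the checksum scheme (plain C++ for clarity, not the talk's GPU implementation): extend A with a column-sum row and B with a row-sum column; the extended product then carries its own checksums, and a single corrupted element sits at the intersection of the failing row check and column check.

#include <cstdio>
#include <cmath>

// Minimal ABFT matrix-multiply sketch (checksum scheme of Huang & Abraham '84),
// with single-error detection and correction.
const int N = 4;                       // small square matrices for brevity
double A[N+1][N], B[N][N+1], C[N+1][N+1];

int main() {
    // Fill A and B with arbitrary data.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }

    // Input coding: extra row of A holds column sums, extra column of B holds row sums.
    for (int j = 0; j < N; j++) { A[N][j] = 0; for (int i = 0; i < N; i++) A[N][j] += A[i][j]; }
    for (int i = 0; i < N; i++) { B[i][N] = 0; for (int j = 0; j < N; j++) B[i][N] += B[i][j]; }

    // C = A x B; the extended product computes the checksums automatically.
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j];
        }

    C[1][2] += 42;                     // inject a fault to exercise the decoder

    // Output decoding: a failing row check and column check intersect at the error.
    int bad_i = -1, bad_j = -1;
    for (int i = 0; i < N; i++) {
        double s = 0; for (int j = 0; j < N; j++) s += C[i][j];
        if (fabs(s - C[i][N]) > 1e-9) bad_i = i;
    }
    for (int j = 0; j < N; j++) {
        double s = 0; for (int i = 0; i < N; i++) s += C[i][j];
        if (fabs(s - C[N][j]) > 1e-9) bad_j = j;
    }
    if (bad_i >= 0 && bad_j >= 0) {
        double s = 0; for (int j = 0; j < N; j++) if (j != bad_j) s += C[bad_i][j];
        C[bad_i][bad_j] = C[bad_i][N] - s;   // correct using the row checksum
        printf("corrected C[%d][%d]\n", bad_i, bad_j);
    }
    return 0;
}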
FFT Hardening Idea

[Diagram: the unhardened FFT is wrapped by input coding and output de-coding with error detection. J.Y. Jou and Abraham '88; Pilla et al., TNS '13]
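The slide does not detail the coding scheme. To give a flavor of FFT output checking, the sketch below uses Parseval's theorem, a classic detection idea and a simpler stand-in for the weighted checksums of Jou and Abraham (the naive DFT is only there to make the example self-contained):

#include <cstdio>
#include <cmath>
#include <complex>
#include <vector>

// Sketch of an FFT output check via Parseval's theorem:
//   sum |x[n]|^2 == (1/N) * sum |X[k]|^2
// A corrupted output bin breaks the energy balance (up to rounding error).
using cd = std::complex<double>;
const double PI = std::acos(-1.0);

std::vector<cd> dft(const std::vector<cd>& x) {   // naive O(N^2) DFT for brevity
    size_t N = x.size();
    std::vector<cd> X(N);
    for (size_t k = 0; k < N; k++)
        for (size_t n = 0; n < N; n++)
            X[k] += x[n] * std::polar(1.0, -2.0 * PI * k * n / N);
    return X;
}

int main() {
    std::vector<cd> x = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<cd> X = dft(x);

    double ein = 0, eout = 0;
    for (auto& v : x) ein  += std::norm(v);       // input energy
    for (auto& v : X) eout += std::norm(v);       // output energy
    eout /= x.size();

    puts(std::fabs(ein - eout) < 1e-6 * ein ? "FFT output OK" : "SDC detected");
    return 0;
}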
ECC vs ABFT

ECC reduces the SDC FIT ~10 times; ABFT reduces it ~56 times!
ECC increases Crashes by 50%; ABFT by only 10%!

[Chart: FIT, log scale 1 to 10000, Unhardened vs ECC vs ABFT, with SDC and Crash bars for MxM and FFT]
ECC vs ABFT

ECC overhead for MxM is 10%, for FFT 50%! ABFT overhead is less than 20%.

[Chart: normalized execution time, 0 to 1.6, Unhardened vs ECC vs ABFT for MxM and FFT]
Duplication With Comparison

Spatial DWC: blocks i and i+N are duplicates.
[Timeline: SM0 runs blocks a, b, c, d; SM1 runs the copies a', b', c', d']

E-O (Even-Odd) Spatial DWC: blocks i and i+1 are duplicates.
[Timeline: SM0 runs b, b', d, d'; SM1 runs a, a', c, c'; a block and its copy execute on the same SM]

Time DWC: each thread executes its operations twice.
[Timeline: SM0 runs b & b', d & d'; SM1 runs a & a', c & c']
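A minimal CUDA sketch of the Time DWC idea (a hypothetical kernel, not the papers' code); note that a real implementation must keep the compiler from collapsing the duplicated computation into a single one:

#include <cstdio>

// Hypothetical Time DWC kernel: each thread computes its result twice and
// flags a mismatch so the host can re-execute the kernel.
__global__ void saxpy_time_dwc(int n, float a, const float* x, const float* y,
                               float* out, int* sdc_flag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // volatile loads keep the compiler from merging the two executions
    volatile const float* vx = x;
    volatile const float* vy = y;
    float r1 = a * vx[i] + vy[i];    // first execution
    float r2 = a * vx[i] + vy[i];    // duplicated execution
    if (r1 != r2)                    // a transient fault corrupted one of them
        atomicExch(sdc_flag, 1);     // detection only; recovery is a re-execution
    out[i] = r1;
}

Because duplication happens inside each thread, no extra blocks are scheduled, which matches the observation below that only Time DWC avoids increasing Crashes.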
Hotspot - DWC results*

Spatial DWC detects all SDCs.
E-O Spatial DWC detects 80% of SDCs.
Time DWC detects 90% of SDCs.

Only Time DWC reduces Crashes (no additional Blocks scheduling is required).

DWC is promising: it is generic, easily implemented, and effective...
BUT the execution time overhead is 2.5x for Spatial DWC and E-O Spatial DWC, and 2x for Time DWC (data is not copied).
=> Duplicate only the code's critical portions.

[Chart: FIT, log scale 1 to 1000, SDC and Crash bars for Unhardened, ECC, Spatial DWC, E-O Spatial DWC, Time DWC]

*details in Oliveira et al., Trans. Nucl. Sci., 2014
Code Optimizations (just baked!)

Novel and incremental algorithm implementations are continuously developed [Rodinia suite].
Do code optimizations impact GPU reliability?

Three case studies (naïve vs optimized), each with different input sizes, since on GPUs the benefit of an optimization depends on the workload (a generic sketch of the naïve-vs-optimized contrast for MxM follows this list):
- Matrix Multiplication
- FFT
- Needleman–Wunsch
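For concreteness, the naïve-vs-optimized contrast for MxM typically looks like the sketch below (generic CUDA kernels, not the exact codes tested; TILE x TILE thread blocks and n divisible by TILE are assumed):

#define TILE 16

// Naive MxM: every operand is fetched from global memory.
__global__ void mxm_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; k++) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Optimized MxM: tiles staged in shared memory are reused TILE times each,
// raising the hit rate in on-chip memories.
__global__ void mxm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; k++) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

The tiled version's higher on-chip hit rate is exactly the behavior analyzed on the next slide.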
Experimental Results – MxM

Opt-MxM FIT is higher. Errors in obsolete (dead) data are NOT critical; the optimized code's higher hit rate in the caches means more of the cached data is still live, hence a higher FIT.
The ~20% FIT increase with input size is caused by the additional threads instantiated.

[Chart: normalized FIT (a.u.), 1 to 26, Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash, for input sizes 1024, 2048, 4096, 8192]
Mean Workload Between Failures

The optimized code has a higher cross section and FIT, but a shorter execution time, so fewer neutrons hit the GPU while it computes. We need to consider cross section, execution time, and throughput together.

Mean WORKLOAD Between Failures (MWBF): the amount of data produced before a failure.

[Chart: GFLOPs, 0 to 600, MxM-naive vs MxM-opt for input sizes 1024, 2048, 4096, 8192]
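Putting the three quantities together (a minimal sketch; all numbers are hypothetical, and the formula's structure is an inference from the definition above, not a formula shown on the slide):

#include <cstdio>

// Sketch of the Mean Workload Between Failures metric:
//   MTBF = 1 / (cross_section * flux)        [hours]
//   MWBF = data_per_run * MTBF / exec_time   [data elements before a failure]
int main() {
    double flux = 13.0;                                  // NYC flux [n/(cm^2 h)]
    double sigma_naive = 2.0e-12, sigma_opt = 4.0e-12;   // [cm^2] (hypothetical)
    double t_naive = 10.0,        t_opt = 2.0;           // run time [s] (hypothetical)
    double W = 8192.0 * 8192.0;                          // output elements per run

    double mwbf_naive = W / (sigma_naive * flux * (t_naive / 3600.0));
    double mwbf_opt   = W / (sigma_opt   * flux * (t_opt   / 3600.0));

    // Here the optimized code has twice the cross section but runs 5x faster,
    // so it produces 2.5x more correct data between failures.
    printf("MWBF naive = %.3g, MWBF opt = %.3g\n", mwbf_naive, mwbf_opt);
    return 0;
}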
MxM - MWBF

Opt-MxM produces more correct data than Naïve-MxM, and its efficiency increases with input size!
If the code is optimized, the throughput increases more than the error rate!

[Chart: MWBF (data elaborated), up to 4x10¹³, Naive-SDC vs Opt-SDC for input sizes 1024, 2048, 4096, 8192]
What's The Plan?

Exascale = 55x Titan. Can we afford a 55x error rate? Probably not.
Self-Driving Cars: reliability is a major concern!

How we can help:
- Understand SDC criticality. Not all errors significantly affect the output: are there "acceptable" SDCs?
- Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters)
- Understand how algorithm/code/compiler optimizations will impact future machines' error rates
- Use fault injection to better understand error propagation
Acknowledgments
Caio Lunardi
Caroline Aguiar
Laercio Pilla
Daniel Oliveira
Vinicius Frattin
Philippe Navaux
Luigi Carro
Nathan DeBardeleben
Sean Blanchard
Heather Quinn
Thomas Fairbanks
Steve Wender
Timothy Tsai
Siva Hari
Steve Keckler
Chris Frost
David Kaeli
NUCAR group