Fault Injection Methodologies - CAD Group

DMA
Assessing and Implementing the Fault Tolerance of Digital Circuits
Celia López Ongil & Marta Portela García
Microelectronics Design & Applications Group
Electronic Technology Department
BELAS School. Torino, June 1st 2016
CARLOS III UNIVERSITY OF MADRID
Students
19,872
Foreign Students
18%
Master
22%
PhD
44%
1st year employed
22%
Bachelor Degrees
29 programs
Engineering: 13
Staff (Teaching&Research)
627 professors & assoc. prof.+
1,112 PhD in teaching staff
50%
bilingual
option
Foreign Teaching Staff
13%
Doubled Degrees 11
Master Degrees
64 programs
Doctoral Courses Engineering
29
19 programs
ELECTRONIC TECHNOLOGY
DEPARTMENT
Subjects
100
Bachelor Deg.
71
Master Degree in
Electronic Systems
and Applications
21 subjects
Master: 28
2 Tracks: research &
innovative technologies
Bachelor / Master
Theses: 100
Seminars & Companies
practices
PhD Program in Electrical,
Electronics and Industrial
Automation
Seminars & PhD annual
meetings
High level research
(H2020 projects,
International
projects…)
PhD: ∼10
Motivation: A tale
Once upon a time…
Solar activity periods every 10-12 years
Solar storms
Very few particles reach the Earth's surface
Motivation: A tale
At the beginning…
And, some time after…
But then…
… and …
Motivation: A tale
… and …
… now…
we produce complex technology equipment
• Smaller devices
• Lower voltages
• Higher frequencies
• Higher complexity
for....
1. Learning more
2. Being more comfortable
3. …
And they are not working properly for ever and ever…
Motivation: Electronic Circuits Do Fail
“Solar storms could disrupt satellite navigation systems
causing errors in positioning of tens of meters”
July 2012
http://www.bbc.co.uk/news/science-environment-19905878
…
• New technologies
• Lower voltages
• Higher frequencies
• Higher complexity
Hard Errors
Soft Errors
SEEs
SEUs
SETs
Fault Tolerance
Circuits at sea level can also fail
Motivation: Electronic Circuits Do Fail
Hardening process: DESIGN → Mitigation → Assessment → OK? (No: back to Mitigation; Yes: Hardened DESIGN)
How can we assess the Fault Tolerance of a Digital
Electronic Device?
1. Low cost
2. Short times
3. Early assessment
Assessing and Implementing
the Fault Tolerance of Digital Circuits
Part 0: Radiation Effects in Semiconductor Devices
Part 1: Assessing Fault Tolerance of Digital Circuits
Part 2: Error Mitigation Techniques against Single Event Effects
Radiation: a bit of History
Radiation: can be defined as the propagation of energy through matter or
space. It can be in the form of electromagnetic waves or energetic
particles.
Ionizing radiation has the ability to knock an electron from an atom, i.e. to
ionize.
Alpha particles
Beta particles
Neutrons
Gamma rays
X-rays
Non-ionizing radiation does not have enough energy to ionize atoms in
the material it interacts with.
Microwaves
Visible light
Radio waves
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006. (Massachusetts Institute of Technology: MIT OpenCourseWare),
http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA
Radiation: a bit of History
Source: Lawrence Berkeley Laboratories, “MicroWorlds: Electromagnetic Spectrum.”
http://www.lbl.gov/MicroWorlds/ALSTool/EMSpec/EMSpec2.html (Accessed 27 March, 2014)
Radiation: a bit of History
Radiation discovery
Röntgen 1895 (X-Rays)
Radioactivity
Becquerel 1896
Curie 1897 (U, Th, Po)
Rutherford 1899 (α and β rays)
Villard 1900 (γ rays)
Radiation: a bit of History
Natural Radiation
Victor Hess 1911
Ionizing radiation measured in the atmosphere
(electroscope)
Above 1km altitude the level of radiation increases
considerably
Robert Millikan 1925
Cosmic Rays
Source: American Physical Society
Source: Hong Kong Observatory.
http://www.hko.gov.hk/education/edu02rga/radiation/radiation_06-e.htm
(Accessed 27 March, 2014)
Radiation: a bit of History
In the 1930s and 1940s correlations of global electromagnetic disturbances were
made with Solar Activity. Simultaneous effects at both poles.
Monthly averaged sunspot cycles
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006. (Massachusetts Institute of Technology: MIT OpenCourseWare),
http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA
Radiation in the Earth’s issues
External sources of Ionizing radiation
Galactic Cosmic Rays
• Outside the Solar System and even the Milky Way
Sun activity (Solar wind and Solar flares)
• Periodical “solar storms”
Interstellar particles
Terrestrial sources of Ionizing radiation
https://www.youtube.com/watch?v=HFT7ATLQQx8
https://www.youtube.com/watch?v=nmDZhQAIeXM
Alpha particles
• From the natural disintegration of radioactive materials
Neutrons
• From secondary radiation in the atmosphere
Radiation Sources around the Earth
Solar effects on Magnetosphere: sources
Solar wind:
Modify Magnetosphere Shape
Solar flares (& CMEs)
Visible Effects
Van Allen Belts
Source: NASA
http://www.bbc.com/news/science-environment-26381685
Source: Lifeng Astronomy Web. Astronomy&Physics Courses. http://lifeng.lamost.org/
Radiation Sources around the Earth
Solar effects on Magnetosphere: consequences
Auroras (Borealis and Australis)
2014.http://www.bbc.com/news/uk-26378027
Source: http://science.nasa.gov/science-news/science-at-nasa/1998/ast29oct98_1/
http://www.swpc.noaa.gov/products/aurora-3-day-forecast
Radiation Sources around the Earth
Solar effects on Magnetosphere: consequences
Van Allen Belts (Trapped particles: protons and electrons )
Protons:
40 MeV@ 3 radii, 8 MeV@ 4 radii
Source: NASA: ROSAT Project.
https://heasarc.gsfc.nasa.gov/docs/rosat/gallery/display/saa.html
Source: Lifeng Astronomy Web. Astronomy&Physics Courses. http://lifeng.lamost.org/
Ionizing Particles (Space Radiation Environment)
Heavy Ions
Protons
Electrons
Alpha Particles
Photons
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006. (Massachusetts Institute of Technology: MIT OpenCourseWare),
http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA
Ionizing Particles (Space Radiation Environment)
Earth orbits?
Source: NASA. http://www.nasa.gov/multimedia/imagegallery/image_feature_1283.html
Ionizing Particles (Natural/Earth Radiation Environment)
Neutrons
Alpha Particles
Protons
Electrons
Source: J.F. Ziegler et al., “IBM experiments in soft fails in computer electronics
(1978-1994)”, IBM J. Res. Dev., vol. 40, no. 1, pp. 3–18, Jan.1996
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006. (MIT OpenCourseWare),
http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA
Ionizing particles (Natural/Earth Radiation Environment)
Alpha particles are generated by traces of radioactive material in the
packaging. They have a very high ability to generate upsets.
Industry requirements call for low to ultra-low alpha emissions of materials
(0.01 to 0.002 alpha/cm²/hr)
Chip manufacturers need to verify the emission level of their package and
the immunization of their design.
Highly Critical in flip chip package
IRPS 1978 - May and Woods (Intel) paper: definition of “Soft Errors”, in
2107-series 16-kb DRAMs, caused by alpha particles emitted in the
radioactive decay of Uranium and Thorium impurities
Effects of Ionizing Particles
Nuclear Physics and Semiconductor Device Physics
Source: D.Alexandrescu. “Radiation_Environments_&_Anomalies”. SERESSA School, Ansan, Korea, 2012.
Effects of Ionizing Particles
Nuclear Physics and Semiconductor Device Physics
Figure: MOS transistor cross-section (gate G, source S, drain D, SiO2 gate oxide, n+ regions, channel, p substrate). A charged particle crossing the device generates electron-hole pairs; positive charge trapped in the SiO2 accumulates as Total Ionizing Dose (TID).
Effects of Ionizing Particles
Figure: a particle strike on an n+/p junction generates charge that is collected as a current pulse (I): a transient fault / SET. If the resulting voltage change (V) flips a stored value, the result is a “soft” error / SEU.
Effects of Ionizing Particles
Total Ionizing Dose (krad) TID
Slow gradual degradation of device’s performance. Changes in Threshold
voltages, frequency in crystal oscillators, etc.
Prompt Dose
Survivability
Single Event Effects (SEE): interaction of an energetic particle with a
microelectronic device
Hard errors (permanent)
Single Event Latchup (SEL)
Burnout
Soft errors (transient)
Single Event Upset (SEU)
Single Event Transient (SET)
Displacement Damage (crystal lattice semi-permanent displacement) DD
Effects of Ionizing Particles
A Single Event Functional Interrupt (SEFI) is an SEU that causes the circuit to
stop operating. SEFIs occur in a register that controls configuration in, for
example, processors, reconfigurable FPGAs or SDRAMs
The stored bit upsets, i.e., “1” goes to “0” or “0” goes to “1”.
There is no permanent damage to the device.
The error can be corrected by rewriting the original information which might
involve hard reboot (power cycle) or soft reboot (software restart)
SEFIs (functional interrupt) are more serious than SEUs (e.g. data upset)
Multiple Bit Upsets (MBUs) consist of multiple SEUs caused by a single ion
MBUs occur through charge spreading (diffusion) or through track intersection with
more than one storage element.
On a 16 Mbit DRAM, a single ion produced more than 50 SEUs
MBUs are harder to mitigate using error detection and correction
To avoid MBUs in memories, physically separate the bits of the same word
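The last point (separating the bits of one word) is commonly realized by column interleaving. The sketch below is only an illustration under assumed parameters: the interleaving factor, the word width and all function names are hypothetical and not taken from the slides. It maps logical (word, bit) positions to physical columns so that adjacent cells belong to different words, then counts how many bits per word a cluster of adjacent upsets would hit.

```python
# Hypothetical layout: WORDS_PER_ROW words share one physical row, column-interleaved.
WORDS_PER_ROW = 4   # interleaving factor (assumed for illustration)
BITS_PER_WORD = 8   # word width (assumed for illustration)


def physical_column(word: int, bit: int) -> int:
    """Column of logical (word, bit); neighbouring columns belong to different words."""
    return bit * WORDS_PER_ROW + word


def logical_position(column: int) -> tuple[int, int]:
    """Inverse mapping: physical column -> (word, bit)."""
    return column % WORDS_PER_ROW, column // WORDS_PER_ROW


def upsets_per_word(first_column: int, cluster_size: int) -> dict[int, int]:
    """Count flipped bits per word for a cluster of physically adjacent upsets."""
    counts: dict[int, int] = {}
    for column in range(first_column, first_column + cluster_size):
        word, _bit = logical_position(column)
        counts[word] = counts.get(word, 0) + 1
    return counts


# A single ion upsetting three adjacent cells (columns 5, 6 and 7).
hits = upsets_per_word(first_column=5, cluster_size=3)
print(hits)
```

With this mapping, a cluster no wider than the interleaving factor leaves at most one flipped bit in each word, which a single-error-correcting code per word could then handle.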
Effects of Ionizing Particles: Real examples
Satellites
Binder et al. reported four anomalies that occurred in satellite electronics during a
period of 17 years, mainly due to solar wind (100 MeV iron atoms) causing SEUs in
flip-flops, 1975.
R. Harboe-Sorensen reported on an ESA satellite disaster, 2005:
A satellite was launched from Baikonur Cosmodrome, Kazakhstan on 31 May 2005
It failed during the 5th orbit instead of after a total of 253 orbits (2% use)
A latch-up condition in an SRAM was concluded to be the possible cause
Source: Reno Harboe-Sørensen, ESA-ESTEC, ” Radiation Effects in Spacecraft Electronics“, 5th LHC Radiation Workshop –Nov-29, 2005.
Effects of Ionizing Particles: Real Examples
Commercial Devices
Alpha particles
Intel, 1978. Memories of the 2107 series (DRAM) showed soft errors due to
Uranium and Thorium impurities in the package material. May and
Woods, 1978. (Old uranium mine contaminating water used for
manufacturing process of ceramic packages)
This paper started the tradition of research on soft errors in sea-level applications
Ziegler (cosmic rays (neutron produced) also affect commercial memory
devices)
Baumann (Texas Inst.) showed the high influence of neutrons on Boron
Boron Phosphosilicate Glass layers removed (BPSG)
IBM, 1986. LSI memories manufactured in USA showed soft errors while
identical Europe products behaved error-free. 210Po contamination in the
cleaning process of nitric acid bottles for manufacturing process.
Effects of Ionizing Particles: Real Examples
Commercial Devices
SEU in SRAM
2001. SUN Ultra Enterprise Workstations (300,000$)
– SEU provoked by neutron effects in SRAM Memory Cache
– No mitigation techniques included (EDAC codes)
– Tens of millions of dollars extra cost for SUN
2004. CISCO Systems 12,000 Series Routers (200,000$)
– SEU provoked by neutron effects in memories and ASICs
– Software mitigation techniques (OS)
Belgium elections (March 5th, 2003)
– Electronic voting machines
– 4,096 votes more than registered people!
– SEU provoked by neutron effects
Radiation Effects vs Technology Scaling
• Smaller devices
• Lower supply voltages
• Higher frequencies
• Higher complexity
DRAMs and SRAMs:
Higher system-level Soft Error Rate
Saturation of the system-level SER in the future (currently FIT/Mbit is decreasing)
Not the same trend with flip-flops, latches and combinational logic
Radiation Effects vs Technology Scaling
SRAM:
SER per bit is constant, but device SER is
continuously growing
MBU and MCU!!
Source: N. Seifert et al., “Radiation-induced soft error rates of advanced CMOS bulk devices”, in Proc. Int’l Rel. Phys. Symp. (IRPS), pp. 217-225, 2006
Radiation Effects vs Technology Scaling
DRAM
3D Storage capacitors are less vulnerable
Address and control logic circuitry is becoming more susceptible
Multiple error bit detection and correction should be developed and
applied
Bit error rate for DRAM, neutron flux at New York City (1Gb chip)
Source: L. Borocky “Comparison of accelerated DRAM soft error rates measured at component and system level”, in Proc. Int’l Rel. Phys. Symp.
(IRPS), pp. 482-487, 2008.)
Radiation Effects vs Technology Scaling
FFs
More difficult to harden
Low level mitigation techniques
SER per bit is decreased beyond 90 nm.
In modern CMOS processes, the average SER/bit of a flip-flop is comparable
to that of an SRAM
Source: T. Heijmen et al. “A Comprehensive Study on the Soft-Error Rate of Flip-Flops from 90-nm Production Libraries”, IEEE Trans. Dev. Mater.
Reliab., vol. 7, no. 1, pp. 84-96, Mar. 2007.
Radiation Effects vs Technology Scaling
Combinational logic
Single Event Transients (SETs) are produced in combinational gates
Only dangerous if reaching a memory element
Increasing with operational frequency
Three masking effects reduce the probability that an SET is captured
Timing Masking
Electrical Masking
Logical Masking
Estimated contribution to Soft Error Rate:
SET in combinational logic: 11%
SEU: 89% (sequential elements 49%, unprotected SRAM 40%)
Source: Mitra et al “Robust System Design with Built-In Soft-Error Resilience”, Computer vol. 38, no. 2, pp. 43-52, Feb. 2005.
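Logical masking can be illustrated with a toy example. The sketch below is not from the slides: the circuit y = (a AND b) OR c and the function names are assumptions chosen only for illustration. It inverts the internal node n = a AND b, as an SET would, and enumerates all input vectors to see how often the flip reaches the output.

```python
from itertools import product


def circuit(a: int, b: int, c: int, flip_n: bool = False) -> int:
    """Toy circuit y = (a AND b) OR c; flip_n inverts the internal node n (the SET)."""
    n = a & b
    if flip_n:
        n ^= 1
    return n | c


vectors = list(product((0, 1), repeat=3))
# The transient propagates only when the faulty output differs from the golden one.
propagated = sum(
    circuit(a, b, c) != circuit(a, b, c, flip_n=True) for a, b, c in vectors
)
masked_fraction = 1 - propagated / len(vectors)
print(f"propagated for {propagated} of {len(vectors)} input vectors")
print(f"logically masked fraction: {masked_fraction:.2f}")
```

Whenever c = 1 the OR gate forces the output to 1, so the transient on n is logically masked. Timing and electrical masking are not modeled here.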
Conclusions
Effects of ionizing particles on semiconductor devices
Hard errors: avoidance
SEL: current measuring
Soft errors
SEU, SBU, MBU, MCU: redundancy at different abstraction levels
SETs: needed? Yes!
“Because of the increasing system sizes and complexities,
ongoing researches are necessary to reduce the reliability risk
for electronic systems”
Source: T. Heijmen et al “Soft Errors from Space to Ground: Historical Overview”, Soft Errors in Modern Electronic Systems. Springer, Chapter 1. 2011
What can we do?
Digital VLSI
circuits
Electronic circuits do fail in the field!
• Why does the circuit fail?
• Testing circuits is necessary:
- Is my circuit robust enough?
- Which are the weakest areas?
• Hardening methodologies:
- Different techniques exist
- Designer: Hardening by design
Assessing Fault Tolerance: Objectives
• To verify that the circuit satisfies the reliability and fault
tolerance requirements.
• To forecast the circuit behavior under faults. Could a fault
provoke a catastrophic or unacceptable failure?
• To identify critical parts (the most sensitive to soft errors) of
the circuit.
• To validate the mitigation approaches
Assessing Fault Tolerance: Metrics
• Soft Error Rate:
• Failure rate (Failure In Time, FIT)
▪ 1 FIT = 1 failure / 10^9 hours
• Mean Time To Failure (MTTF) / Mean Time Between Failures (MTBF)
• Fault coverage
▪ τ = #Failures/#Faults
• Cross section (sensitivity)
▪ σ = (#events/fluence); fluence = #particles per area unit
Figure: Fault → Error → Failure
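The metrics above reduce to simple ratios. The following sketch works through them with made-up numbers; the function names, the sample values and the constant-failure-rate assumption behind MTTF = 10^9 / FIT are illustrative assumptions, not data from the slides.

```python
HOURS_PER_FIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_rate(failures: float, device_hours: float) -> float:
    """Failure rate expressed in FIT."""
    return failures / device_hours * HOURS_PER_FIT


def mttf_hours(fit: float) -> float:
    """MTTF in hours, assuming a constant failure rate (an assumption of this sketch)."""
    return HOURS_PER_FIT / fit


def failure_ratio(failures: int, injected_faults: int) -> float:
    """The slide's tau = #Failures / #Faults."""
    return failures / injected_faults


def cross_section(events: float, fluence_per_cm2: float) -> float:
    """sigma = #events / fluence, in cm^2 when fluence is in particles/cm^2."""
    return events / fluence_per_cm2


# Hypothetical numbers, for illustration only.
fit = fit_rate(failures=3, device_hours=2e6)
mttf = mttf_hours(fit)
tau = failure_ratio(failures=150, injected_faults=10_000)
sigma = cross_section(events=120, fluence_per_cm2=1e7)
print(f"FIT={fit:.0f}  MTTF={mttf:.0f} h  tau={tau:.3f}  sigma={sigma:.2e} cm^2")
```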
Soft Error Sensitivity Assessment Approaches
• Analytical methods
• Obtain a numerical estimation of dependability parameters by means
of mathematical models (combinatorial models, stochastic models,
etc.)
• Mathematical models are complex and difficult to apply to real circuits
• Experimental methods
• Direct measure:
▪ Observe the circuit in-field to analyze the response to real faults
and extract dependability data (e.g. MTBF)
• Fault Injection
▪ The deliberate injection of faults into a Circuit Under Test (CUT)
In-field assessment
ASTEP (Altitude SEE Test European Platform);
Pic de Bure, French Alps (2,552 m)
7 Gbits of SRAM
~300 days
~10 SEU/MCU per month
Real-time SER setup involving 512 SRAM circuits manufactured in 40nm technology by STMicroelectronics
(www.astep.eu)
… On-board satellites: Space Environment Testbeds (NASA), “Living with a Star”
Fault Injection
• Fault Injection: The deliberate injection of faults into a Circuit
Under Test (CUT)
Golden
Faulty
comparison
➢ Failure
➢ Latent
➢ Silent (no effect)
Fault injection
• Accelerate
• Higher number of faults
• Earlier assessment (also at design stages)
• High controllability
• Identification of critical parts
Fault coverage =
Fault Injection System
Figure: fault injection system elements. A stimuli generator (workload) drives a golden circuit and a faulty circuit; fault injection inserts the faults, which are activated by the workload; the results are compared and classified as:
➢ Failure
➢ Latent
➢ Silent (no effect)
The soft error (SE) measurements give the SER.
Classification of Fault Injection methods
Target | Perturbation source | Fault Injection method
Manufactured circuit | Physical Fault Injection (external source) | Irradiation, Laser, EMI, Pin level
Manufactured circuit | Logical Fault Injection (Fault Model + circuit resources) | Software-based (SWIFI), JTAG / scan chain, On-Chip Debugger (OCD), Reconfiguration
Design | Logical Fault Injection (Fault Model + Design & Prototyping tools) | Simulation, Emulation (FPGA), Hybrid
Test strategy
• Soft Error sensitivity depends on:
• Technology
• Workload
• Static test to assess the technology and architectural factors
→ Physical Fault Injection methods
• Dynamic test to take into account the effect of a given
workload (only static test is a pessimistic measurement)
• σ(SEU) = σ·τ (cross section)
• σ is the result of the static test
• τ is the result of the dynamic test (non-physical FI method) → reducing the cost of injection campaigns.
Test strategy
• A soft error rate (FIT) for the final application in the real
environment can be estimated by combining the obtained
σ(SEU) with the environmental information (orbit conditions)
SER(in-field) = σ(SEU) · Expected_fluence
• CREME96 → Standard model for cosmic ray environment
• SRIM (tool), GEANT4 (libraries)→ Monte Carlo methods to
calculate LET
• CAD → SPACERAD and SPENVIS
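The two relations of the test strategy, σ(SEU) = σ·τ and SER(in-field) = σ(SEU) · Expected_fluence, can be chained as in the sketch below. All numeric values and function names are hypothetical; in practice the fluence would come from an environment model such as the tools listed above.

```python
def dynamic_cross_section(static_sigma_cm2: float, tau: float) -> float:
    """sigma(SEU) = sigma * tau: static cross section scaled by the workload factor."""
    return static_sigma_cm2 * tau


def expected_soft_errors(sigma_seu_cm2: float, fluence_per_cm2: float) -> float:
    """Expected number of soft errors = sigma(SEU) * expected fluence."""
    return sigma_seu_cm2 * fluence_per_cm2


# Hypothetical inputs, for illustration only.
static_sigma = 1e-8       # cm^2 per device, from a static (irradiation) test
tau = 0.05                # fraction of faults that become failures (dynamic test)
mission_fluence = 1e9     # particles/cm^2 expected over the mission

sigma_seu = dynamic_cross_section(static_sigma, tau)
errors = expected_soft_errors(sigma_seu, mission_fluence)
print(f"sigma(SEU) = {sigma_seu:.1e} cm^2, expected soft errors = {errors:.2f}")
```

Using only the static σ (τ = 1) would give a pessimistic estimate, which is the reason the slides give for combining it with the dynamic test.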
Physical Fault Injection
• Physical Fault Injection: use a physical means to inject faults
• Main techniques
• Forced irradiation:
▪ protons, neutrons, alpha particles, heavy ions.
• Laser Fault Injection
• Pin level
• EMI-based
• Power supply
Ground testing for space applications
Forced Irradiation
• Observe the circuit behavior under the same environmental
conditions as the final application
• SEEs→ Particle accelerators
• TID → Co60 sources, protons, electrons
• Main characteristics
• Require at least a circuit prototype and
cannot be used during the design phase
• Limited observability and controllability
• Risk of damage of the Device Under Test (DUT): TID, SEL
• Required for device qualification in particular application environments
• Special facilities: expensive tests
Forced Irradiation: Facilities in Europe
Facility | Beams (HI = heavy ions)
CYCLONE-Belgium | HI (>10 MeV), protons (65 MeV)
JYFL-Finland | HI (>10 MeV), protons (60 MeV)
GANIL-France | HI (100 MeV)
CPO-France | Protons (200 MeV)
IPN-France | HI (<10 MeV), protons (20 MeV)
LNL-Italy | HI (<10 MeV)
SIRAD-Italy | Protons (28 MeV)
PSI-Switzerland | HI (300 MeV)
CNA-Spain | Protons (up to 18 MeV)
Forced Irradiation: CNA-Spain
Logical Fault Injection on the final implementation
• The injected fault is a model of the physical effect. It is
inserted in the final implementation of the circuit under test.
(Similar to manufacturing tests)
• Logic resources of the circuit under test are used to perform
the fault injection.
• Main characteristics:
• The accessibility to sensitive locations is limited.
• Fault injection can be performed in the real environment.
• Useful during the design stage of a system with COTS elements.
• The accuracy of the results is limited by the accuracy of the fault model.
Logical Fault Injection on the final implementation:
Fault Models
• Bit-flip: a temporary change of logic value in a memory
element
Feature | SEU/MBU
Effect | Single/Multiple bit-flip
Where? | Any FF
When? | Any clock cycle
For how long? | 1 clock cycle (typically)
F. Alcalá et al. “Fault Tolerant VHDL Architectures for Space
Applications” Proc.VHDL Users’ Forum. Toledo, Spain. 1997
• Others:
• Permanent faults: stuck-at, bridge, open, etc.
• A variety of models can be used to represent the effect of faults
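The bit-flip model can be sketched on a small sequential circuit. The example below is a hypothetical illustration, not a tool from the slides: a 4-bit counter stands in for the circuit under test, and a fault is described by the clock cycle and the flip-flop (bit) that is inverted once.

```python
from typing import Optional


def run_counter(cycles: int, fault: Optional[tuple[int, int]] = None,
                width: int = 4) -> list[int]:
    """Simulate a free-running counter; fault = (cycle, bit) flips one FF once."""
    mask = (1 << width) - 1
    state = 0
    trace = []
    for cycle in range(cycles):
        if fault is not None and fault[0] == cycle:
            state ^= 1 << fault[1]       # single bit-flip in the chosen flip-flop
        trace.append(state)              # value observed during this cycle
        state = (state + 1) & mask       # next-state logic
    return trace


golden = run_counter(6)
faulty = run_counter(6, fault=(2, 3))    # flip bit 3 at clock cycle 2
print(golden)
print(faulty)
```

The flip is injected in a single clock cycle, but because the counter feeds its own state back, the error stays in the circuit afterwards; in a circuit that overwrites the register, the same fault could vanish after one cycle.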
Logical Fault Injection on the final implementation
FPGAs: reconfiguration resources (SEUs in LUT contents and interconnections; examples with stuck-at 0 and stuck-at 1 effects on signals)
Microprocessors: debugging resources, software
Logical Fault Injection on the final implementation : FPGAs
• Actually most faults in FPGAs appear in configuration memory
• Bitstreams are quite large and only a fraction of bits is
generally used
• Tools:
• FLIPPER (Xilinx)
M. Alderighi et al. “Evaluation of Single Event Upset
Mitigation Schemes for SRAM based FPGAs using the
FLIPPER Fault Injection Platform“
22nd IEEE Int. Symp. on Defect and Fault Tolerance in VLSI
Systems. 2007. pp 105-113
• SEM (Soft Error Mitigation) core of Xilinx
▪ Emulation of SEUs by injecting errors into the configuration
memory (7-Series, Virtex-6, Spartan-6)
Logical Fault Injection on the final implementation:
Microprocessors
• Software-Implemented Fault Injection (SWIFI) techniques
• Modify code execution to emulate fault effects
• Technique also used for addressing other kinds of faults (manufacturing,
aging, design,…)
• Fault Injection through internal resources
• Use internal resources which are not intended for operational working
modes (e.g., test and debugging resources)
Figure: On-Chip Debugger (ARM-OCD) example. A host (SW or HW) accesses the circuit under test (CPU/µP and memory, ADDR and DATA buses) through the OCD interface, Scan-Chain 1, Scan-Chain 2 and the TAP controller (TCK, TDI, TMS, TDO).
Simulation-based Fault Injection
• At which level of abstraction?
Abstraction levels: System, RT, Logic, Transistor, Physical
1. Circuit characterization (HDL)
2. Library / cell characterization
Trade-off: accuracy vs. performance
Simulation Fault Injection at transistor /physical level
• Simulations at electrical level
• The results of 2D/3D physical simulations are the input for simulations
at transistor levels.
• SPICE
• Electrical masking model: degradation of the current pulse through
the logic gates.
▪ Circuit delays can modify the rise and fall time of a pulse
▪ A short pulse can be eventually filtered
• Still time-consuming
Emulation-based Fault Injection
• Simulation-based Fault Injection is very flexible, but also
rather slow:
• Typical fault injection rate in the order of 1 fault/s
• Fault list collapsing could dramatically reduce test campaigns
• Solution: improve speed by emulating the circuit in an FPGA
• Similar idea to FPGA prototyping
• Much faster: circuit can run nearly at-speed
• May support interaction with real environments
• Emulation-based Fault Injection approaches
• Partial or total FPGA reconfiguration
• Circuit instrumentation: substitute every FF by a modified FF that
supports fault injection
Figure: Design → FPGA (Injection, Comparison)
Emulation-based Fault Injection: Reconfiguration
• A dedicated board for Fault Injection by partial reconfiguration
FT-UNSHADES 2
Univ. Sevilla- Spain
J.M. Mogollón et al
"FTUnshades2: A Novel Platform for Early Evaluation
of Robustness Against SEEs”
RADECS 2011. Seville, Spain
Emulation-based Fault Injection: Autonomous
Figure: a host computer connected through an interface to an emulation board. The FPGA contains the stimuli generator, the emulation control, the fault injection logic, the circuit under test and the fault classification; the fault dictionary is stored in on-board RAM.
Fault injection rates: 10^4 faults/s, 10^6 faults/s
Lopez-Ongil et al. “Autonomous Fault Emulation: A New FPGA-based Acceleration System for Hardness Evaluation”, IEEE T. on Nuclear Science. February 2007
Autonomous Emulation: Advanced Fault Classification
FAULT CLASSIFICATION: Silent (no-effect), Failure, Latent
Figure: timeline of testbench cycles (0 to 21) in the FPGA. After the injection, the fault emulation is compared against the fault-free emulation: comparing outputs and comparing FF values support failure/silent detection and latent detection.
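The classification into failure, latent and silent faults can be sketched as a comparison between a golden run and a faulty run. The definitions used below are one common reading and are an assumption of this sketch: a failure shows an output mismatch, a latent fault leaves the outputs correct but the flip-flop values different at the end of the run, and a silent fault leaves no difference. The function name and the sample runs are hypothetical.

```python
from collections import Counter


def classify_fault(golden_outputs, faulty_outputs, golden_state, faulty_state) -> str:
    """Classify one injected fault by comparing a faulty run against the golden run."""
    if faulty_outputs != golden_outputs:
        return "failure"        # the error reached the circuit outputs
    if faulty_state != golden_state:
        return "latent"         # outputs correct, but the fault remains in the FFs
    return "silent"             # no observable effect


# Hypothetical campaign: (golden outputs, faulty outputs, golden FFs, faulty FFs).
runs = [
    ([1, 0, 1], [1, 1, 1], 0b0101, 0b0101),
    ([1, 0, 1], [1, 0, 1], 0b0101, 0b0111),
    ([1, 0, 1], [1, 0, 1], 0b0101, 0b0101),
    ([1, 0, 1], [0, 0, 1], 0b0101, 0b0001),
]
summary = Counter(classify_fault(*run) for run in runs)
failure_ratio = summary["failure"] / len(runs)
print(dict(summary), failure_ratio)
```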
Turbo Decoder
30 million fault injection campaign
Thales Alenia Space (AMERHIS)
Weakest areas (latents, SEU)
Selective insertion of fault mitigation structures
Hardening 9% of FFs, 54% of Failures are eliminated
M. Portela-García et al. “Analysis of Turbo Decoder Robustness against SEU Effects”, IEEE Transactions on Nuclear Science. Vol 56. No. 4. August 2009
SET Fault Injection at RTL
• SET Fault Injection is much more difficult
• Access to internal combinational logic is required
• Need to consider ASIC delays (SDF back-annotation)
• The SET space to explore is huge…
• SETs may occur at any gate, at any time and with any pulse duration
• … but most SETs have no effect!
• Need to inject a large number of SETs to achieve significant sensitivity
results
• Simulation / Emulation?
• Timing simulation is very slow
• Synthesizing the CUT for an FPGA would produce a functionally equivalent
circuit model but with different delays
SET Fault Injection approaches
• Simulation-based (Verilog PLI):
• SOCFIT (IRoC)
• Complemented by statistical de-rating estimation tool (TFIT)
“A Practical Approach to Single Event Transients Analysis for Highly Complex Designs”
D. Alexandrescu et al.
IEEE Int. Symp. on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT) 2011
• Hybrid:
Fault Injection & combinational logic propagation
RTL propagation across clock cycles
Simulation-based Fault Injection
Emulation-based Fault Injection
• Emulation-based:
• AMUSE tool
SET Emulation
SOLUTION: CLK PERIOD QUANTIZATION
• Delay Quantization [García-Valderas, JETTA, 2008]
  • Implement a delay with a shift register (1 LUT)
• Voltage-Time Quantization [Entrena, TNS, 2009]
  • Quantize transition curves
  • Implement delay with Non-Linear Counters
  • Covers dynamic delay effects (accuracy close to an electrical simulator)
[Figure: signals Vi, VQ and Vo around a Non-Linear Counter (NLC)]
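The delay quantization idea can be illustrated in software. The sketch below is a behavioural model only, not the FPGA implementation from the cited work: a gate delay is rounded to a whole number of quantization steps, and a shift register of that length delays a sampled waveform. The names `quantize_delay` and `ShiftRegisterDelay` and the picosecond units are assumptions.

```python
from collections import deque


def quantize_delay(delay_ps, quantum_ps):
    """Round a gate delay to a whole number of quantization steps."""
    return max(0, round(delay_ps / quantum_ps))


class ShiftRegisterDelay:
    """Delay a sampled signal by a fixed number of ticks."""

    def __init__(self, stages, initial=0):
        self.stages = stages
        self.register = deque([initial] * stages, maxlen=stages)

    def tick(self, value):
        if self.stages == 0:
            return value  # zero delay: pass the input straight through
        oldest = self.register[0]
        self.register.append(value)  # maxlen drops the oldest sample
        return oldest


# A two-tick SET pulse travelling through a gate with a 200 ps delay,
# using a 100 ps quantum (both values are made up for the example).
gate = ShiftRegisterDelay(quantize_delay(200, 100))
waveform = [0, 1, 1, 0, 0, 0]
print([gate.tick(sample) for sample in waveform])
```

The pulse keeps its width and appears two ticks later, which is the behaviour a one-LUT shift register provides on the FPGA.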
The AMUSE System
(Autonomous MUltilevel emulation system for Soft Error evaluation)
[Block diagram labels: Emulation Platform, FPGA, Emulation Controller, Input Stimuli, Interface, Emulation Manager, Fault Dictionary (RAM), AMUSE, Fault Classification, Fault List, Fault Injection, RTL CUT, CUT State, GL CUT]
“Soft Error Sensitivity Evaluation of Microprocessors by Multilevel Emulation-Based Fault Injection”
L. Entrena et al.
IEEE T. on Computers, vol 61, Issue 3, pp 313-322
Univ. Carlos III - Madrid
Summary and Conclusions
• Fault Injection provides practical solutions for assessment of SEU and SET effects
Irradiation
  ☹ 10,000 faults
  ☹ 10 h
  ☹ 0.28 faults/s
  • 8,200 €
  ☹ 0.82 €/fault
Autonomous Emulation
  ☺ 129,750,000 faults
  ☺ 11.6 min
  ☺ 185,588 faults/s
  • 14 € (0.001 € for 10,000 faults)
  ☺ 0.11e-06 €/fault
Simulation
  ☹ 10,000 faults
  • 2.38 h
  • 1.17 faults/s
  • 40 €
  • 0.01 €/fault
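The throughput and cost-per-fault figures follow from simple arithmetic on the campaign sizes, durations and costs. The sketch below recomputes them; the pairing of each number with a method is a reading of the slide, and the dictionary layout and function names are assumptions. Because the slide values are rounded, recomputed results can differ slightly from the printed ones.

```python
# Campaign figures as read from the slide (durations converted to seconds).
CAMPAIGNS = {
    "irradiation": {"faults": 10_000, "seconds": 10 * 3600, "cost_eur": 8200},
    "simulation": {"faults": 10_000, "seconds": 2.38 * 3600, "cost_eur": 40},
    "autonomous emulation": {
        "faults": 129_750_000,
        "seconds": 11.6 * 60,
        "cost_eur": 14,
    },
}


def faults_per_second(campaign):
    return campaign["faults"] / campaign["seconds"]


def cost_per_fault(campaign):
    return campaign["cost_eur"] / campaign["faults"]


for name, campaign in CAMPAIGNS.items():
    print(
        f"{name}: {faults_per_second(campaign):.2f} faults/s, "
        f"{cost_per_fault(campaign):.2e} EUR/fault"
    )
```

The gap of about five orders of magnitude in injection rate between emulation and the other two methods is what makes campaigns with millions of faults practical.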
Summary and Conclusions
Irradiation
  ☺ Realistic
  ☹ Final implementation
  ☹ Few faults
  ☹ Low controllability and observability
Autonomous Emulation
  • Fault model and circuit description
  ☺ Design stage
  ☺ Millions of faults
  ☺ High controllability and observability
Simulation
  • Fault model and circuit description
  ☺ Design stage
  ☺ Many faults
  ☺ 100% controllability and observability
What can we do?
Digital VLSI circuits
Electronic circuits do fail in the field!
• Why does the circuit fail?
• Testing circuits is necessary:
- Is my circuit robust enough?
- Which are the weakest areas?
• Hardening methodologies:
- Different techniques exist
- Designer: Hardening by design
Classic hardening techniques
Hardware redundancy
Replicate hardware components of the circuit and check the result of each replica
Types of techniques:
  Passive techniques: mask fault effects
  Active techniques: detect the occurrence of faults and implement recovery actions
  Hybrid techniques: use a combination of both
[Figure: three Operation replicas feeding a VOTER that produces the Result; two Operation replicas feeding a comparator (COMP) that produces an Error Signal]
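The passive and active schemes above can be sketched in a few lines. This is a software illustration with hypothetical function names, operating on integers as bit vectors; it is not a hardware description.

```python
def majority_vote(a, b, c):
    """Passive redundancy (TMR): bitwise majority of three replica results."""
    return (a & b) | (a & c) | (b & c)


def duplicate_and_compare(a, b):
    """Active redundancy: return the first result and an error flag."""
    return a, a != b


correct = 0b1010
upset = correct ^ 0b0100  # one replica suffers a single bit flip

print(bin(majority_vote(correct, upset, correct)))  # the fault is masked
print(duplicate_and_compare(correct, upset))         # the fault is only detected
```

The voter hides the fault without any recovery action, while the comparator can only raise the error signal and leaves recovery to the rest of the system.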
Information redundancy
Use redundant codes
Types of codes
Error detecting codes (EDCs)
Error correcting codes (ECCs)
[Figure: Data → Encoder → Memory → Decoder & Checker → Data Out, with an Error Signal output from the checker]
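A single even-parity bit is the simplest error detecting code and follows the encoder / checker structure of this slide. The sketch below uses hypothetical function names and an assumed 8-bit data width.

```python
def encode_parity(data, width=8):
    """Append an even-parity bit above the data bits (data < 2**width)."""
    parity = bin(data).count("1") % 2
    return (parity << width) | data


def check_parity(codeword, width=8):
    """Return (data, error): error is True when the parity check fails."""
    error = bin(codeword).count("1") % 2 == 1
    data = codeword & ((1 << width) - 1)
    return data, error


stored = encode_parity(0b10110010)
print(check_parity(stored))                # clean read
print(check_parity(stored ^ (1 << 3)))     # single bit flip: detected
print(check_parity(stored ^ 0b11))         # double bit flip: goes unnoticed
```

The last call shows the limit of an EDC with one check bit: an even number of flipped bits leaves the parity unchanged, which is why correcting codes use more redundancy.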
Time Redundancy
Repeat computations two or more times and compare the results
[Figure: the same Function is evaluated at t0, t0+t1 and t0+t2; each result is stored in a register (Reg.) and the registers feed a Voter]
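Time redundancy can be sketched as running the same function several times and voting on the stored results. The helper names below are hypothetical, and the "flaky" function is only a stand-in that emulates a transient bit flip on one chosen call.

```python
from collections import Counter


def run_with_time_redundancy(func, arg, runs=3):
    """Run func(arg) several times; return (voted_result, fault_seen)."""
    results = [func(arg) for _ in range(runs)]
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= runs:
        raise RuntimeError("no majority among repeated runs")
    return value, votes < runs


def make_flaky_square(faulty_call):
    """Build a squaring function whose result is corrupted on one call."""
    calls = {"count": 0}

    def square(x):
        calls["count"] += 1
        result = x * x
        if calls["count"] == faulty_call:
            result ^= 1  # emulate a transient single-bit upset
        return result

    return square


print(run_with_time_redundancy(make_flaky_square(2), 7))
```

The cost is execution time rather than area: three runs take roughly three times as long, and a permanent fault would corrupt every run in the same way and go unnoticed.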
Specific hardening techniques
Memories
ASIC
FPGA
Microprocessors
Memories: characteristics
Very sensitive to SEU / MBU
DRAM: data stored in a capacitor
SRAM: data stored in a flip-flop
Soft Error Rate
Around 1,000-10,000 FIT/Mbit
Increases with altitude: x5 at 2,500 m, compared to sea level
Error characteristics
SRAM: 100% single bit errors
DRAM: 94% single bit errors, 6% adjacent bits
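A FIT is one failure per 10^9 device-hours, so the rates above translate directly into expected upsets for a given memory size and operating time. The sketch below does this conversion; the function names, the example memory size and the use of the x5 altitude factor as a plain multiplier are assumptions for illustration.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10**9 hours


def expected_upsets(fit_per_mbit, size_mbit, hours, altitude_factor=1.0):
    """Expected number of soft errors over the given operating time."""
    return fit_per_mbit * size_mbit * hours * altitude_factor / HOURS_PER_FIT_UNIT


def mean_hours_between_upsets(fit_per_mbit, size_mbit, altitude_factor=1.0):
    """Average time between soft errors for the whole memory."""
    return HOURS_PER_FIT_UNIT / (fit_per_mbit * size_mbit * altitude_factor)


# 1,000 Mbit of memory at 1,000 FIT/Mbit, sea level versus 2,500 m.
print(mean_hours_between_upsets(1000, 1000))
print(mean_hours_between_upsets(1000, 1000, altitude_factor=5))
```

With these example numbers one upset is expected about every 1,000 hours at sea level and about every 200 hours at 2,500 m.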
Memory hardening: techniques
Parity
  Frequently used in the past, mainly due to manufacturing defects
  Error detection halts the computer
ECC memory (Error Checking and Correction, Error Correcting Code)
  Extra bits per 64-bit word to detect (2 errors) and correct (1 error)
  Implemented by adding an extra chip to every memory module
  Correcting more than one bit in a word is rarely required
Scrubbing
  ECC: data is checked only when it is accessed
  Use a background task to inspect memory periodically
  Prevents errors from accumulating
Memory hardening: techniques
Interleaving
Bits of the same word are NOT adjacent, so that an MBU cannot affect several bits in the same word
[Figure: memory array with one column group per bit position (Bit 0, Bit 1, Bit 2, ...); each group holds that bit for words 0 to 31. An MBU hitting adjacent cells produces four different words with a single fault each.]
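The effect of interleaving can be checked with a small address-mapping model. The sketch below assumes a simplified single-row layout in which physical column `c` holds word `c % words_per_row` and bit `c // words_per_row`; this mapping and the function names are illustrative assumptions, not the layout of any particular memory.

```python
from collections import Counter


def cell_interleaved(column, words_per_row):
    """Map a physical column to (word, bit) in an interleaved row."""
    return column % words_per_row, column // words_per_row


def cell_contiguous(column, bits_per_word):
    """Map a physical column to (word, bit) when each word is contiguous."""
    return column // bits_per_word, column % bits_per_word


def upsets_per_word(first_column, width, mapping, parameter):
    """Count flipped bits per word for an MBU covering adjacent columns."""
    hits = Counter()
    for column in range(first_column, first_column + width):
        word, _bit = mapping(column, parameter)
        hits[word] += 1
    return dict(hits)


# A 4-cell MBU starting at column 2, with 8 words per row or 8 bits per word.
print(upsets_per_word(2, 4, cell_interleaved, 8))  # four words, one bit each
print(upsets_per_word(2, 4, cell_contiguous, 8))   # one word, four bits
```

With interleaving every affected word has a single error, which a single-error-correcting code can repair; without it the same MBU exceeds what such a code can handle.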
Memory hardening: general case
Interleaving
Avoid multiple errors in the same word
ECC:
Correct 1 error, detect 2 errors
Hamming code
Scrubbing
Prevent error accumulation
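The "correct 1, detect 2" behaviour and the role of scrubbing can be sketched with a toy code. The example below uses an extended Hamming (8,4) code on 4-bit data instead of the 64-bit words used in real ECC memory, purely to keep it short; the function names and the list-of-bits representation are assumptions.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (bit list)."""
    code = [0] * 8
    # Data bits go to positions 3, 5, 6, 7; positions 1, 2, 4 hold parity.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    if sum(code) % 2 == 1:
        code[syndrome] ^= 1  # single error; syndrome 0 is the parity bit itself
        status = "corrected"
    elif syndrome:
        status = "double"  # two errors: detectable but not correctable
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def scrub(memory):
    """Rewrite every correctable word in place; return the number of repairs."""
    repaired = 0
    for address, word in enumerate(memory):
        nibble, status = decode(word)
        if status == "corrected":
            memory[address] = encode(nibble)
            repaired += 1
    return repaired


memory = [encode(value) for value in range(16)]
memory[3][5] ^= 1  # inject one upset into word 3
print(scrub(memory), decode(memory[3]))
```

Running `scrub` periodically removes single errors before a second upset can land in the same word and turn a correctable error into an uncorrectable one.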
Specific hardening techniques
Memories
ASIC
FPGA
Microprocessors
ASIC: characteristics
Technology: different radiation sensitivity
  SEL: immunity
  TID: enough resistance for the purpose (environment and duration)
Sequential logic: sensitive to SEU
  Flip-flops
  Memory
Combinational logic: sensitive to SET
  Usually neglected: %SEU >> %SET
  Modern technologies are more sensitive to SET
Technology solutions
Modify manufacturing process to reduce sensitivity to radiation
  e.g. Silicon-On-Sapphire (SOS) and Silicon-On-Insulator (SOI) technologies
Rad-hard / Rad-tolerant technologies: features
  Increased resistance to TID
  Reduced sensitivity to SEL, even immunity
  Reduced sensitivity to SEU and SET (unlikely)
Rad-hard / Rad-tolerant technologies: drawbacks
  Higher cell area to increase TID tolerance
  Higher area to integrate redundancy (TMR flip-flops)
  Non general purpose technology: high cost
[Figures: SRAM cell; Dual Interlocked storage Cell (DICE)]
DARE-180 ASIC library
Developed at IMEC (Belgium) within ESA projects
Technology: UMC 180nm (general purpose)
Little sensitivity to SET
Features
TID: 1Mrad
SEL immune
Low SEU sensitivity for “normal” flip-flops and RAM
SEU immunity for HIT flip-flops (DICE based)
Can use standard ASIC design flow
Drawbacks
Circuit size: 2-4 times the equivalent for standard 180nm
Power: 2.2 times the equivalent for standard 180nm
Must use EDAC for RAM
Future: DARE-90 ASIC library
Technology: UMC 90nm
Few cells available
90 nm technology is VERY sensitive to SET
Need to use SET filtering cells: big area
Specific hardening techniques
Memories
ASIC
FPGA
Microprocessors
FPGA: technologies
SRAM based (Xilinx, Altera, Atmel): low radiation tolerance
  Large size, lots of resources
  Configuration memory is SRAM: sensitive to SEU
  Errors in configuration memory behave as PERMANENT errors
  Volatile configuration: need configuration ROM
Flash based (Actel-Microsemi): medium radiation tolerance
  Medium-large size
  Higher power
  Configuration memory is Flash: not sensitive to SEU
  Non-volatile configuration
Antifuse based (Actel-Microsemi): high radiation tolerance
  Medium size
  One time programmable (not reconfigurable)
  No configuration memory
  Traditionally used for critical applications
Radiation hardened FPGAs
General
  Latchup immune
  TID resistant: depends on the environment required
Antifuse
  Insensitive configuration memory
  SEU insensitive: internal TMR
  SET sensitive
Flash
  Insensitive configuration memory
  SEU and SET sensitive
SRAM
  Sensitive configuration memory
  SEU and SET sensitive
Radiation hardened FPGAs: protections
SRAM based FPGAs configuration memory
  Errors in configuration memory behave like PERMANENT faults:
  • Logic misbehaviour
  • Wrong interconnections
  Scrubbing:
    Read configuration memory in background
    Partial reconfiguration to correct errors
  Xilinx SEM IP:
    Virtex-6 and Virtex-7
    Scrubbing and fault injection
SEU and SET protection
  Global TMR (Xilinx XTMR)
    Offers additional protection against configuration memory errors
  Local TMR (SEU)
  Local TMR + pulse filtering (SEU+SET)
Additional tools
  RoRA: custom place & route to reduce critical bits in conf. memory
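Configuration scrubbing can be pictured as a background loop that reads back each configuration frame, compares it with a trusted copy and rewrites only the frames that differ. The sketch below is a plain software model under that assumption: frames are byte strings in Python lists, the names are hypothetical, and no vendor API (including the Xilinx SEM IP) is used or implied.

```python
def scrub_configuration(device_frames, golden_frames):
    """Compare each frame with the golden copy and rewrite corrupted ones.

    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for index, (current, golden) in enumerate(zip(device_frames, golden_frames)):
        if current != golden:
            device_frames[index] = golden  # models a partial reconfiguration
            repaired.append(index)
    return repaired


golden = [bytes([i] * 4) for i in range(6)]  # trusted bitstream copy
device = list(golden)                        # what the FPGA currently holds
device[4] = bytes([4, 4, 5, 4])              # an SEU flips a bit in frame 4

print(scrub_configuration(device, golden))   # frame 4 is rewritten
print(device == golden)
```

Until the scrubber reaches the corrupted frame the circuit behaves as if it had a permanent fault, which is why scrubbing is usually combined with TMR rather than used alone.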
Specific hardening techniques
Memories
ASIC
FPGA
Microprocessor
Microprocessor hardening: SWIFT
SWIFT: Software Implemented Fault Tolerance
Used with COTS microprocessors (Commercial Off-The-Shelf): no other protection mechanism available
Based on time and information redundancy
Active redundancy: faults are detected and actively corrected
[Figure: effects of an SEU in a microprocessor: No effect, Transient effect, Persistent effect. Labels: Op-code, Data, Address, Performance (cache flush), Execution flow, SEFI (proc. hang)]
Microprocessor hardening: SWIFT
SWIFT cannot handle errors which “hang” the processor
  Additional HW required: watchdog
SWIFT requires compiler optimizations to be disabled
  Avoid redundancy removal
Techniques
  Data oriented
  Control flow oriented
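The two families of techniques can be illustrated with a minimal sketch. It shows the idea only: real software hardening is applied to the compiled program, whereas this example uses Python with hypothetical names, a made-up control-flow graph and shadow copies passed explicitly.

```python
class FaultDetected(Exception):
    """Raised when a redundancy check fails."""


def checked_add(a, a_shadow, b, b_shadow):
    """Data oriented: compute on originals and shadow copies, then compare."""
    result = a + b
    result_shadow = a_shadow + b_shadow
    if result != result_shadow:
        raise FaultDetected("data mismatch between original and shadow copy")
    return result


# Control flow oriented: each basic block lists its legal predecessors.
LEGAL_PREDECESSORS = {"B1": {"B0"}, "B2": {"B0"}, "B3": {"B1", "B2"}}


class ControlFlowChecker:
    def __init__(self, entry_block):
        self.current = entry_block

    def enter(self, block):
        """Check that the jump into `block` comes from a legal predecessor."""
        if self.current not in LEGAL_PREDECESSORS.get(block, set()):
            raise FaultDetected(f"illegal jump {self.current} -> {block}")
        self.current = block


checker = ControlFlowChecker("B0")
checker.enter("B1")
checker.enter("B3")          # B0 -> B1 -> B3 is a legal path
print(checked_add(2, 2, 3, 3))
```

An optimizing compiler would see the shadow computation as redundant and remove it, which is the reason optimizations have to be disabled for this kind of hardening.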
Microprocessor hardening: hybrid techniques
Usually hardware-software combinations
Use Infrastructure IP (I-IP) (Politecnico di Torino)
  I-IPs are not functional, they are embedded just for reliability purposes
  I-IPs may detect, analyze and correct errors on-line
CPU-checker (UC3M)
  Functions are executed twice
  CPU-checker calculates signatures from execution: opcode, data
  Signatures from different executions are checked by CPU-checker
[Figure: system with Microprocessor, Memory, I-IP, Peripherals and App. Spec. Logic]
Hardening: other considerations
COTS (Commercial Off-The-Shelf) in harsh environments
  Compared to rad-hard components
    Low cost, high performance
    Not reliable
Different hardening approaches
  Distributed hardening: the same task is replicated in different devices
OPTOS: optical nanosat by INTA
Low-cost triple-cube nanosatellite
Distributed OBDH (On-Board Data Handling) subsystem based on FPGAs and CPLDs
Optical wireless communication system (OBCom) with a reduced CAN (Controller Area Network)
Conclusions
Fault tolerance is not free!
  A large area or performance overhead is usually involved (even higher than 100% in many cases)
Full fault tolerance is neither feasible nor required in general
  Only the most critical parts of the design should be hardened
For critical designs, fault tolerance is a design requirement, just like area or timing
Conclusions
Fault tolerant design tasks
  Identify the most critical parts of the design and apply the best technique for each case
  Validate the fault tolerant design (check the circuit behaviour in the presence of errors)
  Evaluate the fault tolerance achieved and repeat the process until the design satisfies dependability specs
Is it possible to produce robust digital circuits?
  Yes, we can assess SEU/SET sensitivity in early stages of the design cycle
  Yes, we can use / propose mitigation techniques at different abstraction levels
  Yes, we must assess fault tolerance with a forced irradiation campaign
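The identify / harden / evaluate loop listed in the design tasks can be sketched as a greedy selection over fault injection results. This is a simplified illustration, not the authors' tool: the flip-flop names and failure counts are invented, and it assumes that hardening a flip-flop removes every failure attributed to it.

```python
def select_flip_flops(failures_per_ff, target_reduction):
    """Pick the most critical flip-flops until the failure target is met.

    failures_per_ff: mapping of flip-flop name to failures seen in a campaign.
    target_reduction: fraction of failures to eliminate (0.0 to 1.0).
    Returns (names_to_harden, achieved_reduction).
    """
    total = sum(failures_per_ff.values())
    ranked = sorted(failures_per_ff.items(), key=lambda item: item[1], reverse=True)
    hardened, removed = [], 0
    for name, count in ranked:
        if total == 0 or removed / total >= target_reduction:
            break
        hardened.append(name)
        removed += count
    achieved = removed / total if total else 0.0
    return hardened, achieved


campaign = {"ff_a": 50, "ff_b": 30, "ff_c": 15, "ff_d": 5}  # hypothetical data
print(select_flip_flops(campaign, 0.75))
```

Because failures are usually concentrated in a small part of the design, a selection like this can remove a large share of failures while hardening only a small share of the flip-flops.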
Acknowledgments
Spanish National Research Funding Programme:
Project RENASER+
Radiation effects in aerospace systems, research on emulation (ESP2007-65914-C03-01)
Platform for Integral Design and Analysis of Circuits for Aerospace Applications. Distributed systems (TEC2010-22095-C03-03)
Universidad de Sevilla
Centro Nacional de Aceleradores
Universidad Carlos III de Madrid
Thank you for your attention