Fault Injection Methodologies - CAD Group
DMA
Assessing and Implementing the Fault Tolerance of Digital Circuits
Celia López Ongil & Marta Portela García
Microelectronics Design & Applications Group
Electronic Technology Department
BELAS School. Torino, June 1st 2016

CARLOS III UNIVERSITY OF MADRID
Students 19,872 Foreign Students 18% Master 22% PhD 44% 1st year employed 22% Bachelor Degrees 29 programs Engineering: 13 Staff (Teaching & Research) 627 professors & assoc. prof. + 1,112 PhD in teaching staff 50% bilingual option Foreign Teaching Staff 13% Doubled Degrees 11 Master Degrees 64 programs Doctoral Courses Engineering 29 19 programs

ELECTRONIC TECHNOLOGY DEPARTMENT
Subjects 100 Bachelor Deg. 71 Master Degree in Electronic Systems and Applications 21 subjects Master: 28 2 Tracks: research & innovative technologies Bachelor / Master Theses: 100 Seminars & Companies practices PhD Program in Electrical, Electronics and Industrial Automation Seminars & PhD annual meetings High level research (H2020 projects, International projects…) PhD: ∼10

Motivation: A tale
Once upon a time…
Solar activity periods every 10-12 years
Solar storms
Very few particles reach the Earth's surface

Motivation: A tale
At the beginning… And, some time after… But then… … and …

Motivation: A tale
… and … … now… we produce complex technology equipment
• Smaller devices
• Lower voltages
• Higher frequencies
for....
1. Learning more
2. Being more comfortable
3.
…
• Higher complexity
And they are not working properly for ever and ever…

Motivation: Electronic Circuits Do Fail
“Solar storms could disrupt satellite navigation systems causing errors in positioning of tens of meters” July 2012
http://www.bbc.co.uk/news/science-environment-19905878
…
• New technologies
• Lower voltages
• Higher frequencies
• Higher complexity
Hard Errors, Soft Errors, SEEs, SEUs, SETs, Fault Tolerance
And there are also circuits at sea level that can also fail

Motivation: Electronic Circuits Do Fail
Hardening process (flow diagram): DESIGN, Mitigation, Assessment, OK? (Yes / No), DESIGN Hardened
How can we assess the Fault Tolerance of a Digital Electronic Device?
1. Low cost
2. Short times
3. Early assessment

Assessing and Implementing the Fault Tolerance of Digital Circuits
Part 0: Radiation Effects in Semiconductor Devices
Part 1: Assessing Fault Tolerance of Digital Circuits
Part 2: Error Mitigation Techniques against Single Event Effects

Radiation: a bit of History
Radiation can be defined as the propagation of energy through matter or space. It can be in the form of electromagnetic waves or energetic particles.
Ionizing radiation has the ability to knock an electron from an atom, i.e. to ionize.
• Alpha particles
• Beta particles
• Neutrons
• Gamma rays
• X-rays
Non-ionizing radiation does not have enough energy to ionize atoms in the material it interacts with.
• Microwaves
• Visible light
• Radio waves
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006.
(Massachusetts Institute of Technology: MIT OpenCourseWare), http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA

Radiation: a bit of History
Source: Lawrence Berkeley Laboratories, “MicroWorlds: Electromagnetic Spectrum.” http://www.lbl.gov/MicroWorlds/ALSTool/EMSpec/EMSpec2.html (Accessed 27 March, 2014)

Radiation: a bit of History
Radiation discovery
• Röntgen 1895 (X-rays)
Radioactivity
• Becquerel 1896
• Curie 1897 (U, Th, Po)
• Rutherford 1899 (α and β rays)
• Villard 1900 (γ rays)

Radiation: a bit of History
Natural Radiation
• Victor Hess 1911: ionizing radiation measured in the atmosphere (electroscope). Above 1 km altitude the level of radiation increases considerably
• Robert Millikan 1925: Cosmic Rays
Source: American Physical Society
Source: Hong Kong Observatory. http://www.hko.gov.hk/education/edu02rga/radiation/radiation_06-e.htm (Accessed 27 March, 2014)

Radiation: a bit of History
In the 1930s and 1940s, global electromagnetic disturbances were correlated with solar activity. Simultaneous effects at both poles.
Monthly averaged sunspot cycles
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006. (Massachusetts Institute of Technology: MIT OpenCourseWare), http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA
Radiation in the Earth’s issues
External sources of ionizing radiation
• Galactic Cosmic Rays
▪ Outside the Solar System and even the Milky Way
• Sun activity (solar wind and solar flares)
▪ Periodical “solar storms”
• Interstellar particles
Terrestrial sources of ionizing radiation
• Alpha particles
▪ From the natural disintegration of radioactive materials
• Neutrons
▪ From secondary radiation in the atmosphere
https://www.youtube.com/watch?v=HFT7ATLQQx8
https://www.youtube.com/watch?v=nmDZhQAIeXM

Radiation Sources around the Earth
Solar effects on Magnetosphere: sources
• Solar wind: Modify Magnetosphere Shape
• Solar flares (& CMEs)
• Visible Effects
• Van Allen Belts
Source: NASA http://www.bbc.com/news/science-environment-26381685
Source: Lifeng Astronomy Web. Astronomy & Physics Courses. http://lifeng.lamost.org/

Radiation Sources around the Earth
Solar effects on Magnetosphere: consequences
• Auroras (Borealis and Australis)
2014. http://www.bbc.com/news/uk-26378027
Source: http://science.nasa.gov/science-news/science-at-nasa/1998/ast29oct98_1/
http://www.swpc.noaa.gov/products/aurora-3-day-forecast

Radiation Sources around the Earth
Solar effects on Magnetosphere: consequences
• Van Allen Belts (trapped particles: protons and electrons)
▪ Protons: 40 MeV @ 3 radii, 8 MeV @ 4 radii
Source: NASA: ROSAT Project. https://heasarc.gsfc.nasa.gov/docs/rosat/gallery/display/saa.html
Source: Lifeng Astronomy Web. Astronomy & Physics Courses. http://lifeng.lamost.org/
Ionizing Particles (Space Radiation Environment)
• Heavy Ions
• Protons
• Electrons
• Alpha Particles
• Photons
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006. (Massachusetts Institute of Technology: MIT OpenCourseWare), http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA

Ionizing Particles (Space Radiation Environment)
Earth orbits?
Source: NASA. http://www.nasa.gov/multimedia/imagegallery/image_feature_1283.html

Ionizing Particles (Natural/Earth Radiation Environment)
• Neutrons
• Alpha Particles
• Protons
• Electrons
Source: J.F. Ziegler et al., “IBM experiments in soft fails in computer electronics (1978-1994)”, IBM J. Res. Dev., vol. 40, no. 1, pp. 3–18, Jan. 1996
Source: Coderre, Jeffrey. 22.01 Introduction to Ionizing Radiation. Fall 2006. (MIT OpenCourseWare), http://ocw.mit.edu (Accessed 27 March, 2014). License: Creative Commons BY-NC-SA

Ionizing Particles (Natural/Earth Radiation Environment)
• Alpha particles are generated by traces of radioactive material in the packaging.
• They have a very high ability to generate upsets.
• Industry requirements call for low to ultra-low alpha emission materials (0.01 to 0.002 alpha/cm²/hr).
• Chip manufacturers need to verify the emission level of their package and the immunity of their design.
• Highly critical in flip-chip packages.
IRPS 1978, May and Woods (Intel) paper: definition of “Soft Errors”, in 2107-series 16-kb DRAMs, caused by alpha particles emitted in the radioactive decay of Uranium and Thorium impurities
Effects of Ionizing Particles
Nuclear Physics and Semiconductor Device Physics
Source: D. Alexandrescu. “Radiation_Environments_&_Anomalies”. SERESSA School, Ansan, Korea, 2012.

Effects of Ionizing Particles
Nuclear Physics and Semiconductor Device Physics
[Figure: MOS transistor cross-sections (gate G, source S, drain D, SiO2, n+ regions, channel, p substrate); labels: charging particle, TID]

Effects of Ionizing Particles
[Figure: particle strike on n+/p junctions; labels: I, transient fault / SET; V, “soft” error / SEU]

Effects of Ionizing Particles
• Total Ionizing Dose (TID, krad)
▪ Slow gradual degradation of the device’s performance
▪ Changes in threshold voltages, frequency in crystal oscillators, etc.
• Prompt Dose
▪ Survivability
• Single Event Effects (SEE): interaction of an energetic particle with a microelectronic device
▪ Hard errors (permanent): Single Event Latchup (SEL), Burnout
▪ Soft errors (transient): Single Event Upset (SEU), Single Event Transient (SET)
• Displacement Damage (DD): semi-permanent displacement of the crystal lattice

Effects of Ionizing Particles
• A Single Event Functional Interrupt (SEFI) is an SEU that causes the circuit to stop operating.
▪ SEFIs occur in a register that controls configuration in, for example, processors, reconfigurable FPGAs or SDRAMs.
▪ The stored bit upsets, i.e., “1” goes to “0” or “0” goes to “1”.
▪ There is no permanent damage to the device.
▪ The error can be corrected by rewriting the original information, which might involve a hard reboot (power cycle) or a soft reboot (software restart).
▪ SEFIs (functional interrupt) are more serious than SEUs (e.g. data upset).
• Multiple Bit Upsets (MBUs) consist of multiple SEUs caused by a single ion.
▪ MBUs occur through charge spreading (diffusion) or through track intersection with more than one storage element.
▪ On a 16 Mbit DRAM a single ion produced more than 50 SEUs.
▪ MBUs are harder to mitigate using error detection and correction.
▪ To avoid MBUs in memories, physically separate the bits of the same word.

Effects of Ionizing Particles: Real Examples
Satellites
• Binder et al. (1975) reported four anomalies that occurred in satellite electronics over a period of 17 years, mainly due to solar wind (100 MeV iron atoms) causing SEUs in flip-flops.
• R. Harboe-Sørensen reported on an ESA satellite disaster, 2005:
▪ A satellite was launched from Baikonur Cosmodrome, Kazakhstan, on 31 May 2005
▪ It failed during the 5th orbit instead of after a total of 253 orbits (2% use)
▪ A latch-up condition in an SRAM was concluded as the possible cause
Source: Reno Harboe-Sørensen, ESA-ESTEC, “Radiation Effects in Spacecraft Electronics”, 5th LHC Radiation Workshop, Nov-29, 2005.

Effects of Ionizing Particles: Real Examples
Commercial Devices
Alpha particles
• Intel, 1978. Memories of the 2107-series (DRAM) showed soft errors due to Uranium and Thorium impurities in the package material. May and Woods, 1978. (An old uranium mine contaminated the water used in the manufacturing process of ceramic packages.)
▪ This paper started the tradition of research on soft errors in sea-level applications
▪ Ziegler (cosmic rays (neutron produced) also affect commercial memory devices)
▪ Baumann (Texas Inst.)
showed the high influence of neutrons on Boron: Boron Phosphosilicate Glass (BPSG) layers removed
• IBM, 1986. LSI memories manufactured in the USA showed soft errors while identical European products behaved error-free. ²¹⁰Po contamination in the cleaning process of nitric acid bottles for the manufacturing process.

Effects of Ionizing Particles: Real Examples
Commercial Devices
SEU in SRAM
• 2001. SUN Ultra Enterprise Workstations (300,000$)
▪ SEU provoked by neutron effects in SRAM cache memory
▪ No mitigation techniques included (EDAC codes)
▪ Tens of millions of dollars extra cost for SUN
• 2004. CISCO Systems 12,000 Series Routers (200,000$)
▪ SEU provoked by neutron effects in memories and ASICs
▪ Software mitigation techniques (OS)
• Belgium elections (March 5th, 2003)
▪ Electronic voting machines
▪ 4,096 votes more than registered people!
▪ SEU provoked by neutron effects

Radiation Effects vs Technology Scaling
• Smaller devices
• Lower supply voltages
• Higher frequencies
• Higher complexity
DRAMs and SRAMs: higher system-level Soft Error Rate
Saturation of the system-level SER in the future (currently FIT/Mbit is decreasing)
Not the same trend with flip-flops, latches and combinational logic

Radiation Effects vs Technology Scaling
SRAM: SER per bit is constant, but device SER is continuously growing
MBU and MCU!!
Source: N. Seifert et al., “Radiation-induced soft error rates of advanced CMOS bulk devices”, in Proc. Int’l Rel. Phys. Symp. (IRPS), pp. 217-225, 2006
Radiation Effects vs Technology Scaling
DRAM
• 3D storage capacitors are less vulnerable
• Address and control logic circuitry is becoming more susceptible
• Multiple-bit error detection and correction should be developed and applied
Bit error rate for DRAM, neutron flux at New York City (1Gb chip)
Source: L. Borucki, “Comparison of accelerated DRAM soft error rates measured at component and system level”, in Proc. Int’l Rel. Phys. Symp. (IRPS), pp. 482-487, 2008.

Radiation Effects vs Technology Scaling
FFs
• More difficult to harden
• Low level mitigation techniques
• SER per bit is decreased beyond 90 nm.
• In modern CMOS processes, the average SER/bit of a flip-flop is comparable to that of SRAM
Source: T. Heijmen et al., “A Comprehensive Study on the Soft-Error Rate of Flip-Flops from 90-nm Production Libraries”, IEEE Trans. Dev. Mater. Reliab., vol. 7, no. 1, pp. 84-96, Mar. 2007.

Radiation Effects vs Technology Scaling
Combinational logic
• Single Event Transients (SETs) are produced in combinational gates
• Only dangerous if reaching a memory element
• Increasing with operational frequency
• Three masking effects reduce the probability that an SET is captured
▪ Timing masking
▪ Electrical masking
▪ Logical masking
Estimated contribution to Soft Error Rate
• Combinational logic (SET): 11%
• Sequential elements (SEU): 49%
• Unprotected SRAM (SEU): 40%
• SEU total: 89%
Source: Mitra et al., “Robust System Design with Built-In Soft-Error Resilience”, Computer, vol. 38, no. 2, pp. 43-52, Feb. 2005.
Conclusions
Effects of ionizing particles on semiconductor devices
• Hard errors (SEL): avoidance, current measuring
• Soft errors (SEU, SBU, MBU, MCU, SETs): redundancy at different abstraction levels
Needed? Yes!
“Because of the increasing system sizes and complexities, ongoing researches are necessary to reduce the reliability risk for electronic systems”
Source: T. Heijmen et al., “Soft Errors from Space to Ground: Historical Overview”, Soft Errors in Modern Electronic Systems. Springer, Chapter 1. 2011

What can we do?
Digital VLSI circuits: electronic circuits do fail in the field!
• Why does the circuit fail?
• Testing circuits is necessary:
▪ Is my circuit robust enough?
▪ Which are the weakest areas?
• Hardening methodologies:
▪ Different techniques exist
▪ Designer: hardening by design

Assessing Fault Tolerance: Objectives
• To verify that the circuit satisfies the reliability and fault tolerance requirements.
• To forecast the circuit behavior under faults. Could a fault provoke a catastrophic or unacceptable failure?
• To identify critical parts (the most sensitive to soft errors) of the circuit.
• To validate the mitigation approaches.

Assessing Fault Tolerance: Metrics
Fault → Error → Failure
• Soft Error Rate:
• Failure rate (Failure In Time, FIT)
▪ 1 FIT = 1 failure / 10^9 hours
• Mean Time To Failure (MTTF) / Mean Time Between Failures (MTBF)
• Fault coverage
▪ τ = #Failures / #Faults
• Cross section (sensitivity)
▪ σ = #events / fluence; fluence = #particles per unit area
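The metrics on this slide can be tried out numerically. The following Python sketch is only an illustration: the function names are hypothetical, the sample numbers are invented, and the FIT to MTTF conversion assumes a constant failure rate, which the slides do not state.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_mttf_hours(fit):
    """MTTF in hours for a given FIT value (assumes a constant failure rate)."""
    return HOURS_PER_FIT_UNIT / fit


def cross_section(events, fluence):
    """sigma = #events / fluence, with fluence in particles per unit area."""
    return events / fluence


def failure_ratio(failures, faults):
    """tau = #Failures / #Faults, as written on the slide."""
    return failures / faults


if __name__ == "__main__":
    # Invented example values, not measurements from the talk.
    print("MTTF (h):", fit_to_mttf_hours(1000))
    print("sigma (cm^2):", cross_section(50, 1e10))
    print("tau:", failure_ratio(54, 1000))
```

Keeping each metric as a one-line function makes the units explicit: the cross section inherits the area unit of the fluence, and τ is dimensionless.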
Soft Error Sensitivity Assessment Approaches
• Analytical methods
▪ Obtain a numerical estimation of dependability parameters by means of mathematical models (combinatorial models, stochastic models, etc.)
▪ Mathematical models are complex and difficult to apply to real circuits
• Experimental methods
▪ Direct measure: observe the circuit in-field to analyze the response to real faults and extract dependability data (e.g. MTBF)
▪ Fault Injection: the deliberate injection of faults into a Circuit Under Test (CUT)

In-field assessment
• ASTEP (Altitude SEE Test European Platform); Pic de Bures, French Alps (2,552 m)
▪ 7 Gbits of SRAM
▪ ~300 days
▪ ~10 SEU/MCU per month
▪ Real-time SER setup involving 512 SRAM circuits manufactured in 40 nm technology by STMicroelectronics (www.astep.eu)
• …
• On-board satellites: Space Environment Testbeds (NASA), “Living with a Star”

Fault Injection
• Fault Injection: the deliberate injection of faults into a Circuit Under Test (CUT)
• Golden vs. faulty comparison:
➢ Failure
➢ Latent
➢ Silent (no effect)
Fault injection
• Accelerate
• Higher number of faults
• Earlier assessment (also at design stages)
• High controllability
• Identification of critical parts
Fault coverage = …

Fault Injection System
Injection elements:
• Stimuli Generator (Workload)
• Golden and Faulty circuits
• Fault Injection (faults; activation: workload)
• Circuit results comparison:
➢ Failure
➢ Latent
➢ Silent (no effect)
• SE measurements: SER
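The golden/faulty comparison can be written down in a few lines. The Python sketch below assumes one common reading of the three classes, which the slide itself does not spell out: a fault is a Failure if the observed outputs differ from the golden run, Latent if the outputs match but the final internal state differs, and Silent otherwise. The names `FaultClass` and `classify` are hypothetical.

```python
from enum import Enum


class FaultClass(Enum):
    FAILURE = "failure"
    LATENT = "latent"
    SILENT = "silent"


def classify(golden_outputs, faulty_outputs, golden_state, faulty_state):
    """Classify one injected fault by comparing a faulty run with the golden run.

    Assumed definitions (not taken verbatim from the slides):
      - FAILURE: the output traces differ at some cycle
      - LATENT:  the outputs match but the final internal state differs
      - SILENT:  neither outputs nor final state differ
    """
    if list(golden_outputs) != list(faulty_outputs):
        return FaultClass.FAILURE
    if golden_state != faulty_state:
        return FaultClass.LATENT
    return FaultClass.SILENT


if __name__ == "__main__":
    print(classify([0, 1, 1], [0, 0, 1], 0b101, 0b101))  # outputs differ
    print(classify([0, 1, 1], [0, 1, 1], 0b101, 0b111))  # only state differs
    print(classify([0, 1, 1], [0, 1, 1], 0b101, 0b101))  # no difference
```

Checking outputs before state means a fault that corrupts both is counted once, as a failure.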
Classification of Fault Injection methods
Target / Perturbation source / Fault Injection method
• Manufactured circuit
▪ Physical Fault Injection (external source): Irradiation, Laser, EMI, Pin level
▪ Logical Fault Injection (Fault Model + circuit resources): Software-based (SWIFI), JTAG / scan chain, On-Chip Debugger (OCD), Reconfiguration
• Design
▪ Logical Fault Injection (Fault Model + Design & Prototyping tools): Simulation, Emulation (FPGA), Hybrid

1.3. Test strategy
• Soft Error sensitivity depends on:
▪ Technology
▪ Workload
• Static test to assess the technology and architectural factors → Physical Fault Injection methods
• Dynamic test to take into account the effect of a given workload (a static test alone gives a pessimistic measurement)
• σ(SEU) = σ·τ (cross section)
▪ σ is the result of the static test
▪ τ is the result of the dynamic test (non-physical FI method) → reducing the cost of injection campaigns.

1.3. Test strategy
• A soft error rate (FIT) for the final application in the real environment can be estimated by combining the obtained σ(SEU) with the environmental information (orbit conditions)
SER(in-field) = σ(SEU) · Expected_fluence
• CREME96 → standard model for cosmic ray environment
• SRIM (tool), GEANT4 (libraries) → Monte Carlo methods to calculate LET
• CAD → SPACERAD and SPENVIS

Physical Fault Injection
• Physical Fault Injection: use a physical means to inject faults
• Main techniques
• Forced irradiation:
▪ protons, neutrons, alpha particles, heavy ions.
• Laser Fault Injection
• Pin level
• EMI-based
• Power supply
Ground testing for space applications

Forced Irradiation
• Observe the circuit behavior under the same environmental conditions as the final application
▪ SEEs → particle accelerators
▪ TID → Co-60 sources, protons, electrons
• Main characteristics
▪ Requires at least a circuit prototype and cannot be used during the design phase
▪ Limited observability and controllability
▪ Risk of damage to the Device Under Test (DUT): TID, SEL
▪ Required for device qualification in particular application environments
▪ Special facilities: expensive tests

Forced Irradiation: Facilities in Europe
• CYCLONE-Belgium: HI (>10 MeV), protons (65 MeV)
• JYFL-Finland: HI (>10 MeV), protons (60 MeV)
• GANIL-France: HI (100 MeV)
• CPO-France: protons (200 MeV)
• IPN-France: HI (<10 MeV), protons (20 MeV)
• LNL-Italy: HI (<10 MeV)
• SIRAD-Italy: protons (28 MeV)
• PSI-Switzerland: HI (300 MeV)
• CNA-Spain: protons (up to 18 MeV)

Forced Irradiation: CNA-Spain

Logical Fault Injection on the final implementation
• The injected fault is a model of the physical effect. It is inserted in the final implementation of the circuit under test. (Similar to manufacturing tests)
• Logic resources of the circuit under test are used to perform the fault injection.
• Main characteristics:
▪ The accessibility to sensitive locations is limited.
▪ Fault injection can be performed in the real environment.
▪ Useful during the design stage of a system with COTS elements.
▪ Accuracy of results is limited by the accuracy of the fault model.

Logical Fault Injection on the final implementation: Fault Models
• Bit-flip: a temporary change of logic value in a memory element
Feature: SEU/MBU
▪ Effect: single/multiple bit-flip
▪ Where? Any FF
▪ When? Any clock cycle
▪ For how long? 1 clock cycle (typically)
F. Alcalá et al. “Fault Tolerant VHDL Architectures for Space Applications”, Proc. VHDL Users’ Forum. Toledo, Spain. 1997
• Others:
▪ Permanent faults: stuck-at, bridge, open, etc.
▪ A variety of models can be used to represent the effect of faults

Logical Fault Injection on the final implementation
[Figure: FPGAs: reconfiguration resources; SEUs in LUT contents and interconnections, with stuck-at 0 / stuck-at 1 effects. Microprocessors: debugging resources, software]

Logical Fault Injection on the final implementation: FPGAs
• Actually most faults in FPGAs appear in configuration memory
• Bitstreams are quite large and only a fraction of the bits is generally used
• Tools:
▪ FLIPPER (Xilinx). M. Alderighi et al. “Evaluation of Single Event Upset Mitigation Schemes for SRAM based FPGAs using the FLIPPER Fault Injection Platform”, 22nd IEEE Int. Symp. on Defect and Fault Tolerance in VLSI Systems. 2007. pp 105-113
▪ SEM (Soft Error Mitigation) core of Xilinx: emulation of SEUs by injecting errors into the configuration memory (7-Series, Virtex-6, Spartan-6)
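The bit-flip fault model described above (any flip-flop, any clock cycle, one flip) can be exercised on a small software model. The Python sketch below is purely illustrative: the three-flip-flop toy circuit, the workload and all names (`step`, `run`, `campaign`) are hypothetical and are not taken from any tool mentioned in the slides.

```python
from itertools import product

N_FFS = 3  # flip-flops of the toy circuit, bit positions 0..2


def step(state, inp):
    """One clock cycle of a hypothetical toy circuit.

    bit0 -> bit1 form a shift register whose output is bit1;
    bit2 is a scratch flip-flop reloaded every cycle and never observed.
    """
    out = (state >> 1) & 1
    nxt = ((state & 1) << 1) | (inp & 1) | ((inp & 1) << 2)
    return nxt, out


def run(workload, fault=None):
    """Run the workload; fault=(ff, cycle) flips one flip-flop before that cycle."""
    state, outputs = 0, []
    for cycle, inp in enumerate(workload):
        if fault is not None and fault[1] == cycle:
            state ^= 1 << fault[0]  # single bit-flip (SEU model)
        state, out = step(state, inp)
        outputs.append(out)
    return outputs, state


def campaign(workload):
    """Inject every (flip-flop, cycle) pair and count output failures."""
    golden_outputs, _ = run(workload)
    fault_list = list(product(range(N_FFS), range(len(workload))))
    failures = sum(run(workload, f)[0] != golden_outputs for f in fault_list)
    return failures, len(fault_list)


if __name__ == "__main__":
    failures, faults = campaign([1, 0, 1, 1])
    print(f"{failures} failures out of {faults} injected bit-flips")
```

For this four-cycle workload the exhaustive fault list has 12 entries and 7 of them change the output trace. Flips in the scratch flip-flop are always overwritten before being observed, which is the kind of workload-dependent masking that a dynamic test is meant to capture.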
Logical Fault Injection on the final implementation: Microprocessors
• Software-Implemented Fault Injection (SWIFI) techniques
▪ Modify code execution to emulate fault effects
▪ Technique also used for addressing other kinds of faults (manufacturing, aging, design,…)
• Fault Injection through internal resources
▪ Use internal resources which are not intended for operational working modes (e.g., test and debugging resources)
[Figure: On-Chip Debugger. Labels: Host (SW or HW), Interface, TAP Controller (TCK, TMS, TDI, TDO), ARM-OCD, Scan-Chain 1, Scan-Chain 2, circuit under test, CPU / µP, ADDR, DATA, Memory]

Simulation-based Fault Injection
• At which level of abstraction?
[Figure: abstraction levels System, RT, Logic, Transistor, Physical, ordered by accuracy and performance. 1: circuit characterization (HDL). 2: library / cell characterization]

Simulation Fault Injection at transistor / physical level
• Simulations at electrical level
▪ The results of 2D/3D physical simulations are the input for simulations at transistor level.
▪ SPICE
• Electrical masking model: degradation of the current pulse through the logic gates.
▪ Circuit delays can modify the rise and fall time of a pulse
▪ A short pulse can eventually be filtered
• Still time-consuming
Emulation-based Fault Injection
• Simulation-based Fault Injection is very flexible, but also rather slow:
▪ Typical fault injection rate in the order of 1 fault/s
▪ Fault list collapsing could dramatically reduce test campaigns
• Solution: improve speed by emulating the circuit in an FPGA
▪ Similar idea to FPGA prototyping
▪ Much faster: circuit can run nearly at-speed
▪ May support interaction with real environments
• Emulation-based Fault Injection approaches
▪ Partial or total FPGA reconfiguration
▪ Circuit instrumentation: substitute every FF by a modified FF that supports fault injection
[Figure: Design, Injection, FPGA, Comparison]

Emulation-based Fault Injection: Reconfiguration
• A dedicated board for Fault Injection by partial reconfiguration: FT-UNSHADES 2, Univ. Sevilla, Spain
J.M. Mogollón et al. “FTUnshades2: A Novel Platform for Early Evaluation of Robustness Against SEEs”, RADECS 2011. Seville, Spain

Emulation-based Fault Injection: Autonomous
[Figure: Computer (host), Interface, Emulation Board, FPGA (Stimuli Generator, Emulation Control, Fault injection, Circuit Under Test, Fault Classification), On-board RAM, Fault Dictionary. Labels: 10^4 faults/s, 10^6 faults/s]
Lopez-Ongil et al. “Autonomous Fault Emulation: A New FPGA-based Acceleration System for Hardness Evaluation”. IEEE T.
on Nuclear Science. February 2007

Autonomous Emulation: Advanced Fault Classification
FAULT CLASSIFICATION: Silent (no-effect), Failure, Latent
[Figure: FPGA emulation timeline over testbench cycles 0 to 21, showing fault-free emulation, injection and fault emulation. Labels: Compare outputs, Compare FF values, Failure/Silent detection, Latent detection]

Turbo Decoder
• 30 million fault injection campaign (SEU)
• Thales Alenia Space (AMERHIS)
• Weakest areas, latents
• Selective insertion of fault mitigation structures
• Hardening 9% of FFs, 54% of failures are eliminated
M. Portela-García et al. “Analysis of Turbo Decoder Robustness against SEU Effects”, IEEE Transactions on Nuclear Science. Vol 56. No. 4. August 2009

SET Fault Injection at RTL
• SET Fault Injection is much more difficult
▪ Access to internal combinational logic is required
▪ Need to consider ASIC delays (SDF back-annotation)
• The SET space to explore is huge…
▪ SETs may occur at any gate, at any time and with any pulse duration
• … but most SETs have no effect!
▪ Need to inject a large number of SETs to achieve significant sensitivity results
• Simulation / Emulation?
▪ Timing simulation is very slow
▪ Synthesizing the CUT for an FPGA would produce a functionally equivalent circuit model but with different delays
SET Fault Injection approaches
• Simulation-based (Verilog PLI):
  - SOCFIT (IRoC)
  - Complemented by statistical de-rating estimation tool (TFIT)
  D. Alexandrescu et al., "A Practical Approach to Single Event Transients Analysis for Highly Complex Designs", IEEE Int. Symp. on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2011.
• Hybrid:
  [Figure labels: Fault injection & combinational logic propagation; RTL propagation across clock cycles; Simulation-based Fault Injection; Emulation-based Fault Injection]
• Emulation-based:
  - AMUSE tool

SET Emulation
• Solution: clock period quantization
• Delay quantization
  - Implement a delay with a shift register (1 LUT) [García-Valderas, JETTA, 2008]
• Voltage-time quantization
  - Quantize transition curves
  - Implement delay with Non-Linear Counters (NLC)
  - Covers dynamic delay effects (accuracy close to an electrical simulator) [Entrena, TNS, 2009]
[Figure labels: Vi, VQ, NLC, Vo]

The AMUSE System (Autonomous MUltilevel emulation system for Soft Error evaluation)
[Figure: Emulation Platform / FPGA block diagram. Labels: Emulation Controller, Input Stimuli, Interface, Emulation Manager, Fault Dictionary (RAM), Fault Classification, Fault List, Fault Injection, RTL CUT, CUT State, GL CUT]
L. Entrena et al., "Soft Error Sensitivity Evaluation of Microprocessors by Multilevel Emulation-Based Fault Injection", IEEE T. on Computers, vol. 61, issue 3, pp. 313-322.
Univ. Carlos III - Madrid
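The delay quantization idea used for SET emulation can be sketched in software: a gate delay is rounded to a whole number of emulation clock ticks and implemented as a shift register that the signal has to traverse. This is a minimal illustration under assumed values, not the AMUSE implementation; the class name, the 50 ps tick and the 130 ps delay are hypothetical.

```python
from collections import deque

def quantize_delay(delay_ps, tick_ps):
    """Round a gate delay to a whole number of emulation ticks (at least 1)."""
    return max(1, round(delay_ps / tick_ps))

class ShiftRegisterDelay:
    """Delay line built from n_ticks one-bit stages, shifted once per tick."""

    def __init__(self, n_ticks, initial=0):
        # maxlen makes append() drop the oldest stage automatically.
        self.stages = deque([initial] * n_ticks, maxlen=n_ticks)

    def tick(self, value_in):
        value_out = self.stages[0]      # oldest stored value leaves the line
        self.stages.append(value_in)    # new value enters the line
        return value_out

def propagate(signal, delay_ps, tick_ps):
    """Pass a sampled signal (one bit per tick) through the quantized delay."""
    line = ShiftRegisterDelay(quantize_delay(delay_ps, tick_ps))
    return [line.tick(bit) for bit in signal]

if __name__ == "__main__":
    # A 2-tick pulse through a 130 ps gate, sampled every 50 ps (3 ticks).
    print(propagate([0, 1, 1, 0, 0, 0, 0, 0], delay_ps=130, tick_ps=50))
```

A finer tick gives better delay accuracy but needs more shift-register stages per gate, which is the trade-off that quantization exposes.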
Summary and Conclusions
• Fault Injection provides practical solutions for assessment of SEU and SET effects

                  Irradiation       Autonomous Emulation     Simulation
  Faults          ☹ 10,000          ☺ 129,750,000            ☹ 10,000
  Campaign time   2.38 h            ☺ 11.6 min               ☹ 10 h
  Injection rate  1.17 faults/s     ☺ 185,588 faults/s       ☹ 0.28 faults/s
  Cost            8,200 €           14 €                     40 €
  Cost per fault  ☹ 0.82 €/fault    ☺ 0.11e-06 €/fault       0.01 €/fault
                                    (0.001 € for 10,000 faults)

Summary and Conclusions
• Fault Injection provides practical solutions for assessment of SEU and SET effects
• Irradiation
  ☺ Realistic
  ☹ Final implementation
  ☹ Few faults
  ☹ Low controllability and observability
• Autonomous Emulation
  • Fault model and circuit description
  ☺ Design stage
  ☺ Millions of faults
  ☺ High controllability and observability
• Simulation
  • Fault model and circuit description
  ☺ Design stage
  ☺ Many faults
  ☺ 100% controllability and observability

What can we do?
• Digital VLSI circuits: electronic circuits do fail in the field!
• Why does the circuit fail?
• Testing circuits is necessary:
  - Is my circuit robust enough?
  - Which are the weakest areas?
• Hardening methodologies:
  - Different techniques exist
  - Designer: hardening by design

Classic hardening techniques
Hardware redundancy
• Replicate hardware components of the circuit and check the result of each replica
• Types of techniques:
  - Passive techniques: mask fault effects
  - Active techniques: detect the occurrence of faults and implement recovery actions
  - Hybrid techniques: use a combination of both
[Figure labels: Operation (replicated), VOTER, Result, COMP, Error Signal]

Information redundancy
• Use redundant codes
• Types of codes:
  - Error detecting codes (EDCs)
  - Error correcting codes (ECCs)
[Figure labels: Data, Encoder, Memory, Decoder & Checker, Data Out, Error Signal]

Time Redundancy
• Repeat computations two or more times and compare the results
[Figure labels: Function and Reg. at t0, t0+t1 and t0+t2; Voter]

Specific hardening techniques
• Memories
• ASIC
• FPGA
• Microprocessors

Memories: characteristics
• Very sensitive to SEU / MBU
  - DRAM: data stored in a capacitor
  - SRAM: data stored in a flip-flop
• Soft Error Rate
  - Around 1,000-10,000 FIT/Mbit
  - Increases with altitude: x5 at 2500 m, compared to sea level
• Error characteristics
  - SRAM: 100% single-bit errors
  - DRAM: 94% single-bit errors, 6% adjacent bits
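To get a feeling for the soft error rate figures above, the sketch below converts FIT/Mbit into an expected time between upsets. One FIT is one failure per 10^9 device-hours. The 1 Gbit (1024 Mbit) memory size is a hypothetical example, and the function names are illustrative only.

```python
FIT_HOURS = 1e9  # 1 FIT = 1 failure per 10^9 device-hours

def expected_upsets(fit_per_mbit, size_mbit, hours, altitude_factor=1.0):
    """Expected number of soft errors in a memory over a given time."""
    return fit_per_mbit * size_mbit * altitude_factor * hours / FIT_HOURS

def mean_hours_between_upsets(fit_per_mbit, size_mbit, altitude_factor=1.0):
    """Average time between soft errors for the whole memory."""
    return FIT_HOURS / (fit_per_mbit * size_mbit * altitude_factor)

if __name__ == "__main__":
    size_mbit = 1024  # hypothetical 1 Gbit memory
    for fit in (1_000, 10_000):
        sea = mean_hours_between_upsets(fit, size_mbit)
        high = mean_hours_between_upsets(fit, size_mbit, altitude_factor=5.0)
        print(f"{fit} FIT/Mbit: {sea:.0f} h at sea level, {high:.0f} h at 2500 m")
```

With these assumptions, even the lower bound of 1,000 FIT/Mbit gives roughly one upset every 1,000 hours at sea level, which is why large memories are usually protected.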
Memory hardening: techniques
• Parity
  - Frequently used in the past, mainly due to manufacturing defects
  - Error detection halts the computer
• ECC memory (Error Checking and Correction, Error Correcting Code)
  - Extra bits per 64-bit word to detect (2 errors) and correct (1 error)
  - Implemented by adding an extra chip to every memory module
  - Correcting more than one bit in a word is rarely required
• Scrubbing
  - ECC: data is checked only when it is accessed
  - Use a background task to inspect memory periodically
  - Prevents errors from accumulating

Memory hardening: techniques
• Interleaving
  - Bits of the same word are NOT adjacent, so that an MBU cannot affect several bits of the same word
[Figure: three bit arrays (Bit 0, Bit 1, Bit 2), each holding cells 0-31; Word 0 is spread across the arrays, and an MBU yields four different words with a single fault each]

Memory hardening: general case
• Interleaving: avoid multiple errors in the same word
• ECC: correct 1 error, detect 2 errors (Hamming code)
• Scrubbing: prevent error accumulation
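The "correct 1 error, detect 2 errors" behaviour of a Hamming-based ECC can be sketched on a toy 4-bit word using an extended Hamming (8,4) code. Real ECC memories protect 64-bit words with more check bits; the word size, function names and status strings here are assumptions made for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.
    Indices 1..7 follow the classic Hamming(7,4) layout (parity at 1, 2, 4);
    index 0 holds an overall parity bit used for double-error detection."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double_error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall = sum(code) % 2          # 0 for an even number of bit flips
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "double_error"  # detected, but not correctable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status

if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                     # single upset: corrected
    print(decode(word))
    word[2] ^= 1                     # second upset in the same word: detected only
    print(decode(word))
```

The second upset in the example shows why scrubbing matters: a single error left uncorrected can later combine with another one into an uncorrectable double error.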
ASIC: characteristics
• Technology: different radiation sensitivity
  - SEL: immunity
  - TID: enough resistance for the purpose (environment and duration)
• Sequential logic: sensitive to SEU
  - Flip-flops
  - Memory
• Combinational logic: sensitive to SET
  - Usually neglected: %SEU >> %SET
  - Modern technologies are more sensitive to SET

Technology solutions
• Modify the manufacturing process to reduce sensitivity to radiation
  - e.g. Silicon-On-Sapphire (SOS) and Silicon-On-Insulator (SOI) technologies
• Rad-hard / Rad-tolerant technologies: features
  - Increased resistance to TID
  - Reduced sensitivity to SEL, even immunity
  - Reduced sensitivity to SEU and SET (unlikely)
• Rad-hard / Rad-tolerant technologies: drawbacks
  - Higher cell area to increase TID tolerance
  - Higher area to integrate redundancy (TMR flip-flops)
  - Non-general-purpose technology: high cost
[Figure: SRAM cell and Dual Interlocked storage Cell (DICE); node labels A1, A2, H1-H4]

DARE-180 ASIC library
• Developed at IMEC (Belgium) within ESA projects
• Technology: UMC 180nm (general purpose)
  - Little sensitivity to SET
• Features
  - TID: 1 Mrad
  - SEL immune
  - Low SEU sensitivity for "normal" flip-flops and RAM
  - SEU immunity for HIT flip-flops (DICE based)
  - Can use standard ASIC design flow
• Drawbacks
  - Circuit size: 2-4 times the equivalent for standard 180nm
  - Power: 2.2 times the equivalent for standard 180nm
  - Must use EDAC for RAM

Future: DARE-90 ASIC library
• Technology: UMC 90nm
• Few cells available
• 90 nm technology is VERY sensitive to SET
• Need to use SET filtering cells: big area
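A behavioural sketch can show what a SET filtering cell does. The slides do not describe the internals of the DARE cells, so the structure below is an assumption: a commonly used scheme in which the signal and a delayed copy of itself drive a guard gate that changes its output only when both agree. The function name and the sample-based time model are hypothetical.

```python
def set_filter(samples, delay, initial=0):
    """Filter a sampled signal (one value per time step).
    The output follows the input only when the current sample and the sample
    'delay' steps earlier agree; otherwise the previous output is held."""
    out = []
    state = initial
    for t, current in enumerate(samples):
        delayed = samples[t - delay] if t >= delay else initial
        if current == delayed:
            state = current   # both copies agree: accept the value
        out.append(state)     # otherwise keep the old output
    return out

if __name__ == "__main__":
    glitch = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]   # 2-step transient
    step = [0, 0, 1, 1, 1, 1, 1, 1]           # genuine transition
    print(set_filter(glitch, delay=3))        # transient is removed
    print(set_filter(step, delay=3))          # transition passes, 3 steps late
```

In this model, pulses no longer than the delay are removed, while genuine transitions arrive late by the same delay. The delay element and guard gate are added to every protected path, which is consistent with the area penalty noted above.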
FPGA: technologies
• SRAM based (Xilinx, Altera, Atmel): low radiation tolerance
  - Large size, lots of resources
  - Configuration memory is SRAM: sensitive to SEU
  - Errors in configuration memory behave as PERMANENT errors
  - Volatile configuration: need configuration ROM
• Flash based (Actel-Microsemi): medium radiation tolerance
  - Medium-large size
  - Higher power
  - Configuration memory is Flash: not sensitive to SEU
  - Non-volatile configuration
• Antifuse based (Actel-Microsemi): high radiation tolerance
  - Medium size
  - One time programmable (not reconfigurable)
  - No configuration memory
  - Traditionally used for critical applications

Radiation hardened FPGAs
• General
  - Latchup immune
  - TID resistant: depends on the environment required
• Antifuse
  - Insensitive configuration memory
  - SEU insensitive: internal TMR
  - SET sensitive
• Flash
  - Insensitive configuration memory
  - SEU and SET sensitive
• SRAM
  - Sensitive configuration memory
  - SEU and SET sensitive

Radiation hardened FPGAs: protections
• SRAM based FPGAs: configuration memory
  - Scrubbing: read configuration memory in background
  - Partial reconfiguration to correct errors
  - Xilinx SEM IP (Virtex-6 and Virtex-7): scrubbing and fault injection
  - Errors in configuration memory behave like PERMANENT faults: logic misbehaviour, wrong interconnections
• SEU and SET protection
  - Global TMR (Xilinx XTMR): offers additional protection against configuration memory errors
  - Local TMR (SEU)
  - Local TMR + pulse filtering (SEU+SET)
• Additional tools
  - RoRA: custom place & route to reduce critical bits in configuration memory

Microprocessor hardening: SWIFT
• SWIFT: Software Implemented Fault Tolerance
• Used with COTS (Commercial Off-The-Shelf) microprocessors: no other protection mechanism available
• Based on time and information redundancy
• Active redundancy: faults are detected and actively corrected
[Figure labels: SEU; No effect; Transient effect; Persistent effect; Op-code; Data; Address; Performance (cache flush); Execution flow; SEFI (proc. hang)]

Microprocessor hardening: SWIFT
• SWIFT cannot handle errors which "hang" the processor
  - Additional HW required: watchdog
• SWIFT requires compiler optimizations to be disabled
  - Avoid redundancy removal
• Techniques
  - Data oriented
  - Control flow oriented

Microprocessor hardening: hybrid techniques
• Usually hardware-software combinations
• Politecnico di Torino: use Infrastructure IP (I-IP)
  - I-IPs are not functional; they are embedded just for reliability purposes
  - I-IPs may detect, analyze and correct errors on-line
• UC3M: CPU-checker
  - Functions are executed twice
  - CPU-checker calculates signatures from execution: opcode, data
  - Signatures from different executions are checked by CPU-checker
[Figure labels: Microprocessor, Memory, I-IP, Peripherals, App. Spec. Logic]

Hardening: other considerations
• COTS (commercial off-the-shelf) in harsh environments
• Compared to rad-hard components:
  - Low cost, high performance
  - Not reliable
• Different hardening approaches
  - Distributed hardening: the same task is replicated in different devices

OPTOS: optical nanosat by INTA
• Low-cost triple-cube nanosatellite
• Distributed OBDH (On-Board Data Handling) subsystem based on FPGAs and CPLDs
• Optical wireless communication system (OBCom) with a reduced CAN (Controller Area Network)

Conclusions
• Fault tolerance is not free!
  - A large area or performance overhead is usually involved (even higher than 100% in many cases)
• Full fault tolerance is neither feasible nor required in general
  - Only the most critical parts of the design should be hardened
• For critical designs, fault tolerance is a design requirement, as much as area or timing

Conclusions
• Fault tolerant design tasks
  - Identify the most critical parts of the design and apply the best technique for each case
  - Validate the fault tolerant design (check the circuit behaviour in the presence of errors)
  - Evaluate the fault tolerance achieved and repeat the process until the design satisfies dependability specs
• Is it possible to produce robust digital circuits?
  - Yes, we can assess SEU/SET sensitivity in early stages of the design cycle
  - Yes, we can use / propose mitigation techniques at different abstraction levels
  - Yes, we must assess fault tolerance with a forced irradiation campaign

Acknowledgments
• Spanish National Research Funding Programme:
  - Project RENASER+: Efectos de radiación en sistemas aeroespaciales, investigación sobre emulación [Radiation effects in aerospace systems, research on emulation] (ESP2007-65914-C03-01)
  - Plataforma de Diseño y Análisis Integral de Circuitos para Aplicaciones Aeroespaciales. Sistemas distribuidos [Platform for Integral Design and Analysis of Circuits for Aerospace Applications. Distributed systems] (TEC2010-22095-C03-03)
• Universidad de Sevilla
• Centro Nacional de Aceleradores
• Universidad Carlos III de Madrid

Thank you for your attention