A Backward/Forward Recovery Approach for the Preconditioned Conjugate Gradient Method

Massimiliano Fasi, Julien Langou, Yves Robert, Bora Uçar


Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen (2013, pp. 167–176) has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c × d iterations. When the verification mechanism detects a silent error, one can roll back to the last checkpoint and re-execute from there. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection alone, but also for error detection and correction, allowing forward recovery (with neither rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the preconditioned conjugate gradient algorithm. Finally, we validate our new approach through a set of simulations.
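The verify-every-d, checkpoint-every-c × d schedule with rollback can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function `run_with_recovery`, the `step` and `verify` callbacks, and the toy state with a redundant `check` counter are all hypothetical names chosen here, and the toy verification stands in for the stability test or ABFT checksum.

```python
import copy


def run_with_recovery(step, verify, state, n_iters, d, c):
    """Run n_iters iterations, verifying every d and checkpointing every c*d.

    step(state, i) returns the next state; verify(state) returns True if the
    state passes the check. Returns (final_state, number_of_rollbacks).
    """
    checkpoint = (0, copy.deepcopy(state))  # (iteration index, saved state)
    rollbacks = 0
    i = 0
    while i < n_iters:
        state = step(state, i)
        i += 1
        if i % d == 0 or i == n_iters:
            if not verify(state):
                # Backward recovery: restore the last verified checkpoint.
                i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
                rollbacks += 1
                continue
            if i % (c * d) == 0:
                # Only verified states are ever checkpointed.
                checkpoint = (i, copy.deepcopy(state))
    return state, rollbacks


def demo():
    """Toy run: one silent error is injected in the step with index 6."""
    fired = {"done": False}

    def step(state, i):
        x, check = state["x"] + 1, state["check"] + 1
        if i == 6 and not fired["done"]:
            x += 100  # silent corruption of x; the redundant counter is intact
            fired["done"] = True
        return {"x": x, "check": check}

    def verify(state):
        return state["x"] == state["check"]

    final, rollbacks = run_with_recovery(
        step, verify, {"x": 0, "check": 0}, n_iters=12, d=2, c=2
    )
    return final["x"], rollbacks
```

In this toy run the corruption is caught at the next verification, the loop rolls back to the checkpoint taken at iteration 4, and the run completes with a single rollback. Forward recovery with ABFT would instead correct the state in place and skip the rollback branch.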



From Detection To Optimization: Impact Of Soft Errors On High-Performance Computing Applications



Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2017


As high-performance computing (HPC) continues to progress, constraints on HPC system design force the handling of errors to higher levels in the software stack. Of the types of errors facing HPC, soft errors that silently corrupt system or application state are among the most severe. Understanding how HPC applications behave in the presence of soft errors is critical for using HPC systems effectively. This understanding can guide the development of algorithm-based error detection informed by application characteristics from fault injection and error propagation studies. Furthermore, the realization that applications are tolerant to small errors allows optimizations such as lossy compression on high-cost data transfers. Lossy compression adds small, user-controllable amounts of error when compressing data, reducing data size before expensive data transfers and thereby saving time. This dissertation investigates and improves the resiliency of HPC applications to soft errors, and explores lossy compression as a new form of optimization for expensive, time-consuming data transfers.
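To make "user-controllable amounts of error" concrete, here is a minimal sketch of error-bounded quantization, one simple way to trade precision for size. It is an illustration only and not the compressor studied in the dissertation; the function names and the absolute error bound `eps` are assumptions introduced here.

```python
def compress(values, eps):
    """Quantize each value to an integer code with absolute error at most eps."""
    step = 2.0 * eps  # bins of width 2*eps keep every value within eps of its bin center
    return [round(v / step) for v in values]


def decompress(codes, eps):
    """Reconstruct approximate values from the integer codes."""
    step = 2.0 * eps
    return [q * step for q in codes]
```

The integer codes cover a much smaller range than the original floating-point values and can be passed to a lossless encoder before transfer. A larger `eps` yields fewer distinct codes and therefore better compression, at the cost of a larger reconstruction error.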

You can find the thesis here

GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates

Konstantinos Parasyris∗, Georgios Tziantzoulis†, Christos Antonopoulos‡, Nikolaos Bellas§

∗‡§ Dept. of Electrical and Computer Eng., University of Thessaly, Volos, Greece
∗‡§ I.RE.TE.TH., Centre for Research and Technology, Hellas, Volos, Greece
† Computer Science Dept., Northwestern University, Chicago, U.S.A.

E-mail: ∗koparasy, ‡cda, §nbellas@inf.uth.gr, †georgiostziantzioulis2011@u.nortwestern.edu


Dependable computing on unreliable substrates is the next challenge the computing community needs to overcome due to both manufacturing limitations in low geometries and the necessity to aggressively minimize power consumption. System designers often need to analyze the way hardware faults manifest as errors at the architectural level and how these errors affect application correctness. This paper introduces GemFI, a fault injection tool based on the cycle-accurate full-system simulator Gem5. GemFI provides fault injection methods and is easily extensible to support future fault models. It also supports multiple processor models and ISAs and allows fault injection in both functional and cycle-accurate simulations. GemFI offers fast-forwarding of simulation campaigns via checkpointing. Moreover, it facilitates the parallel execution of campaign experiments on a network of workstations (NoW). In order to validate and evaluate GemFI, we used it to apply fault injection on a series of real-world kernels and applications. The evaluation indicates that its overhead compared with Gem5 is minimal (up to 3.3%), whereas optimizations such as fast-forwarding via checkpointing and execution on NoWs can significantly reduce simulation time of a fault injection campaign.

Keywords: fault-injection; simulation; cycle accurate; full system

You can find the paper here

Transient hardware faults simulation in GEM5 – Study of the behavior of multithreaded applications under faults

Konstantinos Parasyris

Submitted to the Department of Computer & Communication Engineering in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer & Communication Engineering at the University of Thessaly, February 2013

Reliable computing under unreliable circumstances is the next challenge the computing community must overcome. To achieve such a difficult task, we need to perform a thorough analysis of the way hardware faults manifest as errors in architectural components and how such errors affect application behavior. In this direction, the first contribution of my diploma thesis is the addition of new concepts to an existing fault injection tool, which was created in an earlier thesis and improved in mine. The new framework uses the cycle-accurate full-system simulator Gem5 to enable fault injection. The current tool provides a variety of fault injection methods; it is not limited to models covering radiation- or timing-induced faults, and it is easily extensible to support future fault models. Extensive experimentation showed that our GEM5-based fault injection mechanism was very effective in emulating the behavior of faults in modern high-performance processors running complex workloads. An additional contribution of my thesis is the experimental analysis of two different applications: blackscholes and fluidanimate. We observed that tolerance to injected faults was highly dependent on the spatial location of the faults (e.g. registers, program counter, IF unit, etc.) and on the specific portion of the code affected. To accelerate data gathering and increase simulation speed, we made extensive use of a checkpoint mechanism called DMTCP (Distributed MultiThreaded CheckPointing), while the whole procedure was automated to execute on a distributed

You can find the thesis here






Directed by: Professor Israel Koren

Traditional fault-tolerant techniques such as hardware or time redundancy incur high overhead and are inefficient for checking arithmetic operations. Our objective is to study an alternative approach of adding new instructions to check arithmetic operations. These checking instructions either rely on error-detecting codes or calculate approximate results and, consequently, consume much less execution time. To evaluate the effectiveness of such an approach, we wish to modify several benchmarks to use checking instructions and run simulation experiments to find out their execution time and memory usage. However, the checking instructions are not included in the instruction set and, as a result, are not supported by current architecture simulators. Therefore, another objective of this thesis is to develop a method for inserting new instructions in the Gem5 simulator and cross compiler. The insertion process is integrated into a software tool called Gtool. Gtool can add an error checking capability to C programs by using the new instructions.
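As an illustration of checking an arithmetic operation with an error-detecting code, the sketch below uses a mod-3 residue check on multiplication. The abstract does not say which code the thesis uses, so the residue code, the modulus, and the function names are assumptions made here for illustration.

```python
def residue(x, m=3):
    """Residue of x modulo m; for m = 3 this fits in two bits."""
    return x % m


def checked_mul(a, b, product, m=3):
    """Return True if `product` is consistent with a * b under the mod-m residue check.

    The check multiplies two small residues instead of repeating the full
    multiplication, which is why it is much cheaper than time redundancy.
    """
    return (residue(a, m) * residue(b, m)) % m == residue(product, m)
```

A single bit flip changes the product by plus or minus a power of two, and no power of two is divisible by 3, so the mod-3 check detects every single-bit error in the result. Errors that change the result by a multiple of 3 go undetected.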

You can find the thesis here

Statistical Fault Injection-Based AVF Analysis of a GPU Architecture

N. Farazmand, R. Ubal, D. Kaeli
Department of Electrical and Computer Engineering, Northeastern University


The ever-increasing application of Graphics Processing Units (GPUs) for non-graphics general purpose computing (GPGPU) raises new challenges not found in traditional graphics processing. Reliable computing using an unreliable GPU is one such challenge. In order to guarantee a promising reliability level for GPGPU computing while avoiding significant impact on performance and hardware size, careful analysis of the GPU hardware is necessary. In this paper, we provide novel insight into the Architectural Vulnerability Factor (AVF) of GPU hardware structures, which are either absent from a CPU architecture or have different design properties than structures present on CPU architectures. Using statistical fault injection to inject faults into register files (REG), local memory (MEM), and the active mask stack (AMS), we show that the AMS, a GPU-specific structure, is highly vulnerable, with 40% AVF-util, mandating protection against faults. We also show that the AVF/AVF-util for a GPU register file and local memory are 6%/15% and 1%/3% on average, lower than their typical values in CPUs.
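A minimal sketch of how a vulnerability estimate and its sampling error could be derived from a statistical fault injection campaign. It assumes each injection is classified as either corrupting program output or being masked. The function name, the boolean outcome encoding, and the normal-approximation interval are illustrative assumptions and not the paper's method. AVF-util is not modeled here.

```python
import math


def estimate_avf(outcomes, z=1.96):
    """Estimate AVF as the fraction of injected faults that corrupt the output.

    outcomes: iterable of booleans, True if the injected fault was visible.
    z: normal quantile for the confidence level (1.96 for about 95%).
    Returns (estimate, margin_of_error).
    """
    outcomes = list(outcomes)
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one injection is required")
    p = sum(1 for visible in outcomes if visible) / n
    margin = z * math.sqrt(p * (1.0 - p) / n)  # normal approximation to the binomial
    return p, margin
```

Because the margin shrinks with the square root of the number of injections, quadrupling the campaign size halves the uncertainty.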

You can find the paper here

Understanding Error Propagation in GPGPU Applications

Guanpeng Li, Karthik Pattabiraman, Chen-Yong Cher and Pradip Bose, International Conference for High-Performance Computing, Storage and Networking (SC), 2016. [PDF | Talk] (Link to LLFI-GPU)


GPUs have emerged as general-purpose accelerators in high-performance computing (HPC) and scientific applications. However, the reliability characteristics of GPU applications have not been investigated in depth. While error propagation has been extensively investigated for non-GPU applications, GPU applications have a very different programming model which can have a significant effect on error propagation in them. We perform an empirical study to understand and characterize error propagation in GPU applications. We build a compiler-based fault-injection tool for GPU applications to track error propagation, and define metrics to characterize propagation in GPU applications. We find GPU applications exhibit significant error propagation for some kinds of errors, but not others, and the behaviour is highly application specific. We observe the GPU-CPU interaction boundary naturally limits error propagation in these applications compared to traditional non-GPU applications. We also formulate various guidelines for the design of fault-tolerance mechanisms in GPU applications based on our results.

Keywords: Fault Injection, Error Resilience, GPGPU, CUDA, Error Propagation

You can find the paper here

GitHub code here


Modeling Input Dependent Error Propagation in Programs

Guanpeng Li and Karthik Pattabiraman, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018. [PDF | Talk]


Transient hardware faults are increasing in computer systems due to shrinking feature sizes. Traditionally, such faults are mitigated through hardware duplication, which incurs huge overhead in performance and energy consumption. Therefore, researchers have explored software solutions such as selective instruction duplication, which require fine-grained analysis of instruction vulnerabilities to Silent Data Corruptions (SDCs). These are typically evaluated via Fault Injection (FI), which is often highly time-consuming. Hence, most studies confine their evaluations to a single input for each program. However, there is often significant variation in the SDC probabilities of both the overall program and individual instructions across inputs, which compromises the correctness of results obtained with a single input.

In this work, we study the variation of SDC probabilities across different inputs of a program, and identify the reasons for the variations. Based on the observations, we propose a model, VTRIDENT, which predicts the variations in programs’ SDC probabilities without any FIs, for a given set of inputs. We find that VTRIDENT is nearly as accurate as FI in identifying the variations in SDC probabilities across inputs. We demonstrate the use of VTRIDENT to bound overall SDC probability of a program under multiple inputs, while performing FI on only a single input.

One Bit is (Not) Enough: An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors

Behrooz Sangchoolie*, Karthik Pattabiraman+, Johan Karlsson* (IFIP-2017)


Recent studies have shown that technology and voltage scaling are expected to increase the likelihood that particle-induced soft errors manifest as multiple-bit errors. This raises concerns about the validity of using single bit-flips for assessing the impact of soft errors in fault injection experiments. The goal of this paper is to investigate whether multiple-bit errors could cause a higher percentage of silent data corruptions (SDCs) compared to single-bit errors. Based on 2700 fault injection campaigns with 15 benchmark programs, featuring a total of 27 million experiments, our results show that single-bit errors in most cases yield a higher percentage of SDCs compared to multiple-bit errors. However, in 8% of the campaigns we observed a higher percentage of SDCs for multiple-bit errors. For most of these campaigns, the highest percentage of SDCs was obtained by flipping at most 3 bits. Moreover, we propose three ways of pruning the error space based on the results.

Keywords: fault injection; transient hardware faults; single/multiple bit-flip errors; error space pruning

• Does the multiple bit-flip model result in significantly different error resilience results compared with the single bit-flip model?
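The single and multiple bit-flip fault models can be sketched in a few lines. This is a generic illustration of flipping bits in a register-sized integer or in an IEEE 754 double, not the injection tool used in the paper; all function names here are hypothetical.

```python
import random
import struct


def flip_bits(value, positions, width=32):
    """Flip the given bit positions of an unsigned integer of `width` bits."""
    for p in positions:
        if not 0 <= p < width:
            raise ValueError("bit position out of range")
        value ^= 1 << p
    return value & ((1 << width) - 1)


def random_flip(value, n_bits, width=32, rng=random):
    """Flip n_bits distinct, randomly chosen bits (n_bits=1 is the single bit-flip model)."""
    return flip_bits(value, rng.sample(range(width), n_bits), width)


def flip_float_bits(x, positions):
    """Flip bits in the 64-bit IEEE 754 encoding of a Python float."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", flip_bits(raw, positions, 64)))[0]
```

The effect of a flip depends strongly on its position. For example, flipping bit 63 of `1.0` changes only the sign, while flipping bit 52 (the lowest exponent bit) halves the value.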

You can find the paper here

Talk for the paper here

Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications


Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high performance and energy efficiency. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and the DNN accelerator’s architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. In this paper, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems.


You can find the paper here