The Ohio State University: College of Engineering

GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates

Konstantinos Parasyris∗, Georgios Tziantzoulis†, Christos Antonopoulos‡, Nikolaos Bellas§. ∗‡§Dept. of Electrical and Computer Eng., University of Thessaly, Volos, Greece; ∗‡§I.RE.TE.TH., Centre for Research and Technology, Hellas, Volos, Greece; †Computer Science Dept., Northwestern University, Chicago, U.S.A. E-mail: ∗koparasy, ‡cda, §nbellas@inf.uth.gr, †georgiostziantzioulis2011@u.nortwestern.edu Abstract— Dependable…

Transient hardware faults simulation in GEM5 – Study of the behavior of multithreaded applications under faults

Konstantinos Parasyris. Submitted to the Department of Computer & Communication Engineering in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer & Communication Engineering at the University of Thessaly, February 2013. Reliable computing under unreliable…

MODIFYING INSTRUCTION SETS IN THE GEM5 SIMULATOR TO SUPPORT FAULT TOLERANT DESIGNS

SEPTEMBER 2015. CHUAN ZHANG, B.S., BEIJING INSTITUTE OF TECHNOLOGY; M.S.M.E., UNIVERSITY OF MASSACHUSETTS AMHERST. Directed by: Professor Israel Koren. Traditional fault-tolerant techniques such as hardware or time redundancy incur high overhead and are inefficient for checking arithmetic operations. Our…

Statistical Fault Injection-Based AVF Analysis of a GPU Architecture

N. Farazmand, R. Ubal, D. Kaeli. Department of Electrical and Computer Engineering, Northeastern University. Abstract—The ever-increasing application of Graphics Processing Units (GPUs) for non-graphics general-purpose computing (GPGPU) raises new challenges not found in traditional graphics processing. Reliable computing using…

Understanding Error Propagation in GPGPU Applications

Guanpeng Li, Karthik Pattabiraman, Chen-Yong Cher and Pradip Bose, International Conference for High-Performance Computing, Storage and Networking (SC), 2016. Abstract— GPUs have emerged as general-purpose accelerators in high-performance computing (HPC) and scientific applications. However, the reliability characteristics…

Modeling Input Dependent Error Propagation in Programs

Guanpeng Li and Karthik Pattabiraman, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018. Abstract: Transient hardware faults are increasing in computer systems due to shrinking feature sizes. Traditional methods to mitigate such faults are through hardware duplication, which…

One Bit is (Not) Enough: An Empirical Study of the Impact of Single and Multiple Bit-Flip Errors

Behrooz Sangchoolie*, Karthik Pattabiraman+, Johan Karlsson* (IFIP-2017) Abstract— Recent studies have shown that technology and voltage scaling are expected to increase the likelihood that particle-induced soft errors manifest as multiple-bit errors. This raises concerns about the validity of using single…

Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications

ABSTRACT Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high-performance and energy efficiency. Recently, they have been…

Modeling Soft-Error Propagation in Programs

Abstract—As technology scales to lower feature sizes, devices become more susceptible to soft errors. Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability of a system. Traditional hardware-only techniques to avoid SDCs are energy hungry, and…