Papers

I have shared some papers that I am interested in and that relate to my research in some way. I am also updating this page with my own papers.

A Novel Approach For Handling Soft Error in Conjugate Gradients

Muhammed Emin Ozturk, Marissa Renardy, Yukun Li, Gagan Agrawal, Ching-Shan Chou, 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2018

Abstract—

Soft errors or bit flips have recently become an important challenge in high performance computing. In this paper, we focus on soft errors in a particular algorithm: conjugate gradients (CG). We present a series of techniques to detect soft errors in CG. We first derive a mathematical quantity that is monotonically decreasing. Next, we add a set of heuristics and combine our approach with previously established methods. We have extensively evaluated our method along three distinct dimensions. First, we show that the F-score of our detection is significantly better than that of two other methods. Second, we show that for soft errors that are not detected by our method, the resulting inaccuracy in the final results is small, and smaller than with other methods. Finally, we show that the runtime overheads of our method are lower than those of other methods.
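The abstract does not reproduce the paper's derived monotone quantity, so as an illustration only, here is a minimal Python sketch of a related, well-known detection idea (a stand-in, not the paper's method): periodically recompute the true residual b - A·x and compare it against the residual that CG maintains recursively. In exact arithmetic the two coincide, so a large gap suggests a bit flip corrupted the iteration state. The check interval and tolerances below are illustrative assumptions.

```python
import numpy as np

def cg_with_detection(A, b, tol=1e-10, check_every=10, detect_tol=1e-6, max_iter=1000):
    """Conjugate gradients with a periodic soft-error consistency check.

    Every `check_every` iterations, the recursively updated residual r
    is compared against the explicitly recomputed residual b - A @ x.
    A large gap between them signals a suspected soft error.
    """
    x = np.zeros_like(b)
    r = b - A @ x          # residual, updated recursively below
    p = r.copy()           # search direction
    rs_old = r @ r
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if k % check_every == 0:
            true_r = b - A @ x
            if np.linalg.norm(true_r - r) > detect_tol * np.linalg.norm(b):
                raise RuntimeError(f"soft error suspected at iteration {k}")
        if np.sqrt(rs_new) < tol:
            return x, k
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, max_iter

# Example: a random symmetric positive definite system
n = 100
M = np.random.rand(n, n)
A = M @ M.T + n * np.eye(n)
b = np.random.rand(n)
x, iters = cg_with_detection(A, b)
```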

Understanding Error Propagation in GPGPU Applications

Guanpeng Li, Karthik Pattabiraman, Chen-Yong Cher and Pradip Bose, International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2016. [PDF | Talk] (Link to LLFI-GPU)

Abstract—

GPUs have emerged as general-purpose accelerators in high-performance computing (HPC) and scientific applications. However, the reliability characteristics of GPU applications have not been investigated in depth. While error propagation has been extensively investigated for non-GPU applications, GPU applications have a very different programming model which can have a significant effect on error propagation in them. We perform an empirical study to understand and characterize error propagation in GPU applications. We build a compiler-based fault-injection tool for GPU applications to track error propagation, and define metrics to characterize propagation in GPU applications. We find GPU applications exhibit significant error propagation for some kinds of errors, but not others, and the behaviour is highly application specific. We observe the GPU-CPU interaction boundary naturally limits error propagation in these applications compared to traditional non-GPU applications. We also formulate various guidelines for the design of fault-tolerance mechanisms in GPU applications based on our results.

Keywords—Fault Injection, Error Resilience, GPGPU, CUDA, Error Propagation

You can find the paper here

GitHub code here
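LLFI-GPU itself works at the compiler (LLVM) level on CUDA kernels; the sketch below is a language-agnostic illustration only, not LLFI-GPU's API. It shows the core mechanic shared by such fault-injection tools: flip a single randomly chosen bit in a program value, run the computation, and classify the outcome as benign, silent data corruption (SDC), or crash. The function names and the SDC tolerance are assumptions made for this sketch.

```python
import random
import struct
import numpy as np

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 double representation of `value`."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def run_with_injection(data, kernel, golden, sdc_tol=1e-8):
    """Inject one random bit flip into one input element, then classify the run."""
    faulty = data.copy()
    idx = random.randrange(faulty.size)
    bit = random.randrange(64)
    faulty.flat[idx] = flip_bit(faulty.flat[idx], bit)
    try:
        result = kernel(faulty)
    except (FloatingPointError, ValueError):
        return "crash"
    if not np.all(np.isfinite(result)):
        return "crash"          # treat non-finite output as a detected failure
    if np.max(np.abs(result - golden)) > sdc_tol:
        return "sdc"            # silent data corruption: wrong answer, no error raised
    return "benign"             # the fault was masked by the computation

# Example: characterize a simple "kernel" over many injection trials
kernel = lambda a: np.sqrt(np.abs(a)).sum(axis=0)
data = np.random.rand(32, 32)
golden = kernel(data)
outcomes = [run_with_injection(data, kernel, golden) for _ in range(1000)]
print({o: outcomes.count(o) / len(outcomes) for o in ("benign", "sdc", "crash")})
```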

 

Improving Fault Tolerance for Extreme Scale Systems

Eduardo Berrocal, Ph.D. Thesis

Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes on exascale machines. In this thesis, a new approach for failure prediction based on the Void Search (VS) algorithm is presented. VS is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. We explore its potential for failure prediction using environmental information and compare it to well-known prediction methods.

Another important issue for the HPC community is that next-generation supercomputers are expected to have more components and consume several times less energy per operation. Hence, supercomputer designers are pushing the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect soft errors, a percentage of those errors pass unnoticed by the hardware. Techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. Results show that it is possible to detect a large number of corruptions (i.e., above 90% in some cases) with less than 100% overhead using these techniques.

Nevertheless, these data-analytic solutions are still far from fully protecting applications to a level comparable with more expensive solutions such as full replication. In this thesis, partial replication is explored to overcome this limitation. More specifically, it has been observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, one can smartly choose and replicate only those processes for which the lightweight data-analytic detectors would perform poorly. Results indicate that this new approach can protect the MPI applications analyzed with 7–70% less overhead (depending on the application) than that of full duplication with similar detection recall.
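The detectors described above exploit the smoothness of a dataset's evolution across iterations. As a hedged sketch of that general idea (the thesis's actual predictors and thresholds are not reproduced here), the code below predicts each point by linear extrapolation from its last two values and flags points whose update deviates far more than the dataset's typical prediction error.

```python
import numpy as np

def detect_silent_errors(prev2, prev1, current, slack=4.0, floor=1e-12):
    """Flag points whose update breaks the smooth evolution of the dataset.

    Predict each point by linear extrapolation from its last two values,
    then flag it if the observed value deviates from the prediction by
    more than `slack` times the typical (median) prediction error.
    """
    predicted = 2.0 * prev1 - prev2            # per-point linear extrapolation
    deviation = np.abs(current - predicted)
    scale = max(np.median(deviation), floor)   # robust estimate of normal error
    return deviation > slack * scale           # boolean mask of suspect points

# Example: a smoothly evolving field with one injected corruption
t = np.linspace(0.0, 1.0, 1000)
prev2, prev1 = np.sin(t), np.sin(t + 0.01)
current = np.sin(t + 0.02)
current[500] += 1.0                            # simulated bit-flip impact
print(np.flatnonzero(detect_silent_errors(prev2, prev1, current)))  # -> [500]
```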

 


 

Implementing Fast, Virtualized Profiling to Eliminate Cache Warming

Abstract—

Simulation is an important part of the evaluation of next-generation computing systems. Detailed, cycle-level simulation, however, can be very slow when evaluating realistic workloads on modern microarchitectures. Sampled simulation (e.g., SMARTS and SimPoint) improves simulation performance by an order of magnitude or more through the reduction of large workloads into a small but representative sample. Additionally, the execution state just prior to a simulation sample can be stored into checkpoints, allowing for fast restoration and evaluation. Unfortunately, changes in software, architecture or fundamental pieces of the microarchitecture (e.g., hardware-software co-design) require checkpoint regeneration. The end result for co-design degenerates to creating checkpoints for each modification, a task checkpointing was designed to eliminate. Therefore, a solution is needed that allows for fast and accurate simulation, without the need for checkpoints. Virtualized fast-forwarding proposals, like FSA, are an alternative to checkpoints that speed up sampled simulation by advancing the execution at near-native speed between simulation points. They rely, however, on functional simulation to warm the architectural state prior to each simulation point, a costly operation for moderately-sized last-level caches (e.g., above 8MB). Simulating future systems with DRAM caches of many GBs can require warming of billions of instructions, dominating the time for simulation and negating the benefit of virtualized fast-forwarding. This paper proposes CoolSim, an efficient simulation framework that eliminates cache warming. CoolSim advances between simulation points using virtualized fast-forwarding, while collecting sparse memory reuse information (MRI). The MRI is collected more than an order of magnitude faster than functional warming. At the simulation point, detailed simulation is used to evaluate the design while a statistical cache model uses the previously acquired MRI to estimate whether each memory request hits in the cache. The MRI is an architecturally independent metric and therefore a single profile can be used in simulations of any size cache. We describe a prototype implementation of CoolSim based on KVM and gem5 running 19x faster than the state-of-the-art sampled simulation, while it estimates the CPI of the SPEC CPU2006 benchmarks with 3.62% error on average, across a wide range of cache sizes.
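The key claim is that the memory reuse information is architecture-independent, so a single profile serves caches of any size. As an illustrative sketch only (CoolSim's own statistical model is more involved than this), the code below collects LRU stack distances once from an address trace and then estimates the hit ratio of any fully-associative LRU cache by counting distances smaller than its capacity.

```python
from collections import OrderedDict

def stack_distances(trace):
    """One pass over an address trace -> LRU stack distance per access.

    The distance is the number of distinct addresses touched since the
    previous access to the same address (None for cold misses). This
    profile depends only on the trace, not on any cache parameters.
    """
    lru = OrderedDict()           # most recently used address is last
    distances = []
    for addr in trace:
        if addr in lru:
            keys = list(lru)
            distances.append(len(keys) - 1 - keys.index(addr))
            lru.move_to_end(addr)
        else:
            distances.append(None)
            lru[addr] = True
    return distances

def hit_ratio(distances, cache_lines):
    """Estimate the hit ratio of a fully-associative LRU cache:
    an access hits iff its stack distance is below the capacity."""
    hits = sum(1 for d in distances if d is not None and d < cache_lines)
    return hits / len(distances)

# Example: profile once, then evaluate several cache sizes from the same profile
trace = [0, 1, 2, 0, 3, 1, 4, 2, 0, 5, 1, 3] * 100
profile = stack_distances(trace)
for lines in (2, 4, 8):
    print(lines, round(hit_ratio(profile, lines), 3))
```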

Paper
