– Distributed BERT Pre-Training & Fine-Tuning with Intel-Optimized TensorFlow on Intel Xeon Scalable Processors
As a Deep Learning Engineer intern, I worked on the BERT model for the Intel AI TensorFlow team under the supervision of Wei Wang. Our work was presented and published at SC20 as a research poster.
Distributed computing has become a key component in the field of Data Science, allowing for faster prototyping and accelerated time to market for numerous workloads. This work examines the distributed training performance of BERT, a state-of-the-art language model for Natural Language Processing (NLP), on the tasks of pre-training and fine-tuning on general-purpose Intel CPUs. The effects of Intel-optimized TensorFlow on Intel Architectures, with both the FP32 and BFLOAT16 floating-point formats, are included in the analysis. Results show that the distributed TensorFlow BERT model with the LAMB optimizer can maintain high accuracy while achieving good performance speedups when scaling to a larger number of Intel Xeon CPUs.
– Handling Soft Errors in Krylov Subspace Methods by Exploiting Their Numerical Properties
Krylov subspace methods are a popular means of solving sparse linear systems. In this paper, we consider three such methods: GMRES, Conjugate Gradient (CG), and Conjugate Residual (CR). We focus on the problem of efficiently and accurately detecting soft errors that lead to silent data corruption (SDC) in each of these methods. Unlike the limited amount of previous work in this area, our work is driven by an analysis of the mathematical properties of the methods. We identify a quantity we refer to as the energy norm, which is monotonically decreasing for our target class of methods. We also show other applications of the error norm and the residual value, and expand the set of algorithms to which they can be applied. We have extensively evaluated our method along three distinct dimensions: detection accuracy, the magnitude of undetected errors, and runtime overhead. First, we show that our methods have a high detection accuracy: over a 90% detection rate for GMRES in most scenarios and matrices, and over a 70% detection rate in most cases for CG and CR. Second, we show that for soft errors that are not detected by our methods, the resulting inaccuracy in the final results is small. Finally, we show that the runtime overheads of our method are low.
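The detection idea can be illustrated for CG: the method minimizes the objective f(x) = ½ xᵀAx − bᵀx over growing Krylov subspaces, so f(x_k) decreases monotonically in exact arithmetic, and an increase signals a suspected corruption. The sketch below is a minimal illustration of this monotonicity check, not the paper's exact detector; the function names and the rounding-slack threshold are illustrative assumptions.

```python
import numpy as np

def cg_objective(A, b, x):
    # CG minimizes f(x) = 1/2 x^T A x - b^T x over growing Krylov
    # subspaces, so f(x_k) decreases monotonically in exact arithmetic.
    return 0.5 * x @ (A @ x) - b @ x

def cg_with_sdc_check(A, b, tol=1e-10, max_iter=1000, slack=1e-12):
    """Conjugate Gradient (A symmetric positive definite) that flags a
    suspected silent data corruption whenever the objective increases."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rho = r @ r
    prev_f = cg_objective(A, b, x)
    for k in range(max_iter):
        Ap = A @ p
        alpha = rho / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        cur_f = cg_objective(A, b, x)
        # Heuristic slack absorbs harmless rounding noise near convergence.
        if cur_f > prev_f + slack * max(1.0, abs(prev_f)):
            return x, k, True      # monotonicity violated: suspect SDC
        prev_f = cur_f
        rho_new = r @ r
        if np.sqrt(rho_new) < tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x, k, False
```

Corrupting an entry of x or p mid-run typically raises f(x) and trips the check; the extra cost per iteration is one matrix–vector product for the objective evaluation.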
– A Novel Approach For Handling Soft Error in Conjugate Gradient
Soft errors, or bit flips, have recently become an important challenge in high-performance computing. In this project, we focus on soft errors in a particular algorithm, the Conjugate Gradient (CG) method.
We present a series of techniques to detect soft errors in CG. We first derive a mathematical quantity that is monotonically decreasing. Next, we add a set of heuristics and combine our approach with another method. We have extensively evaluated our method along three distinct dimensions. First, we show that the F-score of our detection is significantly better than that of two other methods. Second, we show that for soft errors that are not detected by our methods, the resulting inaccuracy in the final results is low and better than with the other methods. Finally, the runtime overheads of our method are also lower.
– The Evaluation of Impact of Soft Error in Cache Structure over Linear Algebra Applications
Cache structures in computer systems are vulnerable to silent data corruption, and improving their fault tolerance plays a crucial role in the reliability of High Performance Computing. In this study, we evaluate the impact of soft errors occurring in cache structures on several iterative applications, since the first step toward designing a novel error detection method is understanding an application's vulnerability to silent data corruption. We implement an injection model with gem5 and use iterative linear algebra solvers as benchmarks.
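The study's injection model is built on gem5; as a purely illustrative sketch (a hypothetical software-level helper, not the gem5-based injector used in the project), a single-bit soft error in a double-precision value can be emulated like this:

```python
import struct

def flip_bit(x, bit):
    """Flip one bit (0 = LSB of the mantissa, 63 = sign bit) in the
    IEEE-754 binary64 representation of x, emulating a soft error."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    bits ^= 1 << bit
    (y,) = struct.unpack('<d', struct.pack('<Q', bits))
    return y
```

The bit position determines the damage: flipping the lowest exponent bit of 1.0 (`flip_bit(1.0, 52)`) halves it to 0.5, while a low-order mantissa flip perturbs the value only slightly, which is why some corruptions stay silent in an iterative solver while others derail convergence.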
– Accelerating miniMD with NT-based thread parallel methods
Supervisor: Prof. Hasan Metin Aktulga, Michigan State University
In this project, we implement and analyze the performance of previously reported thread-parallel algorithms using OpenMP on the Intel Xeon Phi many-core processor, and we compare their performance against newly developed algorithms in our group based on the Neutral Territory (NT) method. We observe that the NT-based thread-parallel algorithms can outperform the existing methods on benchmarks from the miniMD code.