ECE Distinguished Lecture Series – November 3, 2022

November 1, 2022

ECE Distinguished Lecture

November 3, 2022 at 10:30 am (Simrall Hall 104)

Erasure Coded Computations: New Models for Fault Tolerance

Ananth Grama

 

Abstract: Dealing with faults is an important problem, as parallel and distributed systems scale to millions of processing cores and across wide area networks. Traditional methods for fault tolerance include checkpoint-restart, active replicas, and deterministic replay. These techniques have significant associated resource overheads and constraints. We propose a completely new approach to fault tolerant computations using an analog of a commonly used technique in storage — erasure coding. In storage, an encoding-decoding pair corresponds

to an identity function — i.e., the decoded form of erasure coded data must be identical to the original data even in the presence of (a bounded number of) erasures. In the computational analog, the decoded form of an encoded data on which an algorithm has been executed must be identical to the result of the algorithm on the original data (as opposed to the identity function in storage), in the presence of failures. Stated otherwise, erasure coded computations apply a minimally modified algorithm on a suitably augmented (coded) input to produce an augmented output. The execution of such an algorithm proceeds completely oblivious to faults in the system. In the event of one or more faults, the real solution is recovered using a rapid reconstruction method (decoding) from the augmented output. We demonstrate this approach on two important problems — solving sparse linear systems and solving eigenvalue problems. We present input augmentation and output recovery techniques for these problems. Through detailed experiments, we show that our approach can be made oblivious to a large number of faults with very low computational overhead. Specifically, we demonstrate typical cases where a single fault can be corrected with less than 10% overhead in time and space, and even in extreme cases (fault rates of 20%), our approach is able to compute a solution with reasonable overhead. These results represent a significant improvement over the state of the art and present significant new opportunities for programming models at scale that do not enforce synchrony or require periodic checkpoints/ reductions to persistent storage.

Biographical Notes: Ananth Grama is the Samuel Conte Professor of Computer Science at Purdue University. He works in broad areas of parallel and distributed computing, scientific computing, and large-scale data analytics. He received his PhD from the University of Minnesota in 1996, and has been at Purdue since then. He served as the Director of the Computational Science and Engineering and Computational Life Sciences programs at Purdue from 2012 – 16, and Chaired the Biodata Management and Analysis (BDMA) Study Section of the National Institute of Health from 2012-14. He is a recipient of the National Science Foundation CAREER award (1998), Purdue University Faculty Scholar Award (2002-07), a Fellow of the American Association for the Advancement of Sciences (2013), and a Distinguished Alumnus of the University of Minnesota (2016).

For further information contact: Ali Gurbuz | gurbuz@ece.msstate.edu| 662.325.1530 & John Ball | jeball@ece.msstate.edu