Conference proceedings article
A Robust Fault Tolerance Scheme for Lifeline-Based Taskpools



Publication Details
Authors:
Fohry, C.; Bungart, M.
Editor:
Bilof, Randall
Publisher:
IEEE
Place:
Piscataway, NJ
Publication year:
2016
Page range:
200-209
Book title:
2016 45th International Conference on Parallel Processing Workshops (ICPPW)
ISBN:
978-1-5090-2825-2

Abstract
Fault tolerance is of increasing importance for parallel computing. While often addressed at system level, application-level resilience techniques may be more efficient. In particular, it seems worthwhile to provide fault-tolerant libraries for reusable patterns such as the task pool. We consider a task pool variant that uses cooperative work stealing, called the lifeline scheme. It is implemented in the GLB library of the PGAS programming language X10. Extending our own previous work, we present a fault-tolerance scheme for this setting, which is both communication-efficient and robust. Here, robustness denotes the ability to tolerate multiple coincident failures of interrelated workers. Our algorithm keeps two copies of important data, and tolerates almost all permanent place failures that leave one of the copies intact. For that, we nest execution of restore protocols. We implemented our algorithm within the GLB library. Performance measurements show a steal-count-dependent overhead of 5 to 40% during failure-free operation and a negligible overhead for restore.

