Contribution to conference proceedings

A Robust Fault Tolerance Scheme for Lifeline-Based Taskpools



Publication details
Authors:
Fohry, C.; Bungart, M.
Editor:
Bilof, Randall
Publisher:
IEEE
Place of publication:
Piscataway, NJ

Year of publication:
2016
Page range:
200-209
Book title:
2016 45th International Conference on Parallel Processing Workshops (ICPPW)
ISBN:
978-1-5090-2825-2
DOI link of the original publication:


Abstract
Fault tolerance is of increasing importance for parallel computing. While often addressed at system level, application-level resilience techniques may be more efficient. In particular, it seems worthwhile to provide fault-tolerant libraries for reusable patterns such as the task pool. We consider a task pool variant that uses cooperative work stealing, called the lifeline scheme. It is implemented in the GLB library of the PGAS programming language X10. Extending our own previous work, we present a fault-tolerance scheme for this setting, which is both communication-efficient and robust. Here, robustness denotes the ability to tolerate multiple coincident failures of interrelated workers. Our algorithm keeps two copies of important data, and tolerates almost all permanent place failures that leave one of the copies intact. For that, we nest execution of restore protocols. We implemented our algorithm within the GLB library. Performance measurements show a steal-count-dependent overhead of 5 to 40% during failure-free operation and a negligible overhead for restore.



Last updated 2022-04-20 at 14:41