SaFirE: Scalable and accurate fault injection for parallel multithreaded applications

Giorgis Georgakoudis, Ignacio Laguna, Hans Vandierendonck, Dimitrios S. Nikolopoulos, Martin Schulz

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

8 Scopus citations

Abstract

Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique to understand the impact of faults on scientific applications. However, injecting faults in parallel applications has been prohibitively slow, inaccurate and hard to implement. In this paper, we present SAFIRE, the first fast and accurate fault injection framework for parallel, multi-threaded applications. SAFIRE uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using SAFIRE, we show that fault manifestations can be significantly different depending on whether they happen in the application itself or in the parallel runtime system. In our experimental evaluation on 15 HPC parallel programs, we show that SAFIRE is multiple factors faster and equally accurate in comparison with state-of-the-art dynamic binary instrumentation tools for fault injection.

Original language: English
Title of host publication: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 890-899
Number of pages: 10
ISBN (Electronic): 9781728112466
State: Published - May 2019
Event: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 - Rio de Janeiro, Brazil
Duration: 20 May 2019 to 24 May 2019

Publication series

Name: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Country/Territory: Brazil
City: Rio de Janeiro
Period: 20/05/19 to 24/05/19
