TY - GEN
T1 - SaFirE
T2 - 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
AU - Georgakoudis, Giorgis
AU - Laguna, Ignacio
AU - Vandierendonck, Hans
AU - Nikolopoulos, Dimitrios S.
AU - Schulz, Martin
N1 - Publisher Copyright: © 2019 IEEE
PY - 2019/5
Y1 - 2019/5
AB - Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique to understand the impact of faults on scientific applications. However, injecting faults in parallel applications has been prohibitively slow, inaccurate and hard to implement. In this paper, we present SAFIRE, the first fast and accurate fault injection framework for parallel, multi-threaded applications. SAFIRE uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using SAFIRE, we show that fault manifestations can be significantly different depending on whether they happen in the application itself or in the parallel runtime system. In our experimental evaluation on 15 HPC parallel programs, we show that SAFIRE is multiple factors faster and equally accurate in comparison with state-of-the-art dynamic binary instrumentation tools for fault injection.
UR - http://www.scopus.com/inward/record.url?scp=85072822375&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2019.00097
DO - 10.1109/IPDPS.2019.00097
M3 - Conference contribution
AN - SCOPUS:85072822375
T3 - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
SP - 890
EP - 899
BT - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 20 May 2019 through 24 May 2019
ER -