TY - GEN
T1 - Doubt and Redundancy Kill Soft Errors - Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software
AU - Samfass, Philipp
AU - Weinzierl, Tobias
AU - Reinarz, Anne
AU - Bader, Michael
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints. Resiliency must not increase the runtime, memory footprint or I/O demands too significantly. We propose a task-based soft error detection scheme that relies on error criteria per task outcome. They formalise how 'dubious' an outcome is, i.e. how likely it contains an error. Our whole simulation is replicated once, forming two teams of MPI ranks that share their task results. Thus, ideally each team handles only around half of the workload. If a task yields large error criteria values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with a lower error likeliness is accepted. We obtain a self-healing, resilient algorithm which can compensate silent floating-point errors without a significant performance, I/O or memory footprint penalty. Case studies however suggest that a careful, domain-specific tailoring of the error criteria remains essential.
KW - Correction
KW - Detection
KW - Fault resilience
KW - Fault tolerance
KW - Soft errors
UR - http://www.scopus.com/inward/record.url?scp=85124624726&partnerID=8YFLogxK
U2 - 10.1109/FTXS54580.2021.00005
DO - 10.1109/FTXS54580.2021.00005
M3 - Conference contribution
AN - SCOPUS:85124624726
T3 - Proceedings of FTXS 2021: Workshop on Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 1
EP - 10
BT - Proceedings of FTXS 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2021
Y2 - 14 November 2021
ER -