Doubt and Redundancy Kill Soft Errors - Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software

Philipp Samfass, Tobias Weinzierl, Anne Reinarz, Michael Bader

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

3 Scopus citations

Abstract

Resilient algorithms in high-performance computing are subject to rigorous non-functional constraints: resiliency must not significantly increase the runtime, memory footprint or I/O demands. We propose a task-based soft error detection scheme that relies on error criteria per task outcome. These criteria formalise how 'dubious' an outcome is, i.e. how likely it is to contain an error. Our whole simulation is replicated once, forming two teams of MPI ranks that share their task results. Thus, ideally, each team handles only around half of the workload. If a task yields large error criteria values, i.e. is dubious, we compute the task redundantly and compare the outcomes. Whenever they disagree, the task result with the lower error likelihood is accepted. We obtain a self-healing, resilient algorithm which can compensate for silent floating-point errors without a significant performance, I/O or memory footprint penalty. Case studies, however, suggest that careful, domain-specific tailoring of the error criteria remains essential.
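
The decision rule outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Outcome`, `accept_outcome`, `threshold`, `tol` and `recompute` are hypothetical, a task outcome is reduced to a single float, and the MPI exchange between the two teams is left out. The sketch assumes one team has received a task outcome from the other team together with its error criterion value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    """Hypothetical container for a task result (not from the paper)."""
    value: float   # the task's numerical result, simplified to one float
    doubt: float   # error criterion value; larger means more dubious


def accept_outcome(shared: Outcome,
                   threshold: float,
                   recompute: Callable[[], Outcome],
                   tol: float = 1e-12) -> Outcome:
    """Decide which result to use for an outcome shared by the other team.

    Assumed rule: outcomes below `threshold` are trusted as-is; dubious
    outcomes are recomputed locally and compared; on disagreement the
    outcome with the lower error criterion value wins.
    """
    if shared.doubt <= threshold:
        # Not dubious: reuse the shared result and skip the redundant work.
        return shared

    # Dubious: compute the task redundantly and compare the outcomes.
    local = recompute()
    if abs(local.value - shared.value) <= tol:
        # Both teams agree, so the shared result is kept.
        return shared

    # Disagreement: accept the outcome that looks less likely to be wrong.
    return local if local.doubt < shared.doubt else shared
```

In this sketch the redundant computation is only triggered for dubious outcomes, which is what keeps each team close to half of the total workload in the error-free case. How `doubt` is computed is deliberately left open, since the abstract stresses that the error criteria need domain-specific tailoring.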

Original language: English
Title of host publication: Proceedings of FTXS 2021
Subtitle of host publication: Workshop on Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-10
Number of pages: 10
ISBN (Electronic): 9781665420594
State: Published - 2021
Event: 2021 Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2021 - St. Louis, United States
Duration: 14 Nov 2021 → …

Publication series

Name: Proceedings of FTXS 2021: Workshop on Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2021 Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2021
Country/Territory: United States
City: St. Louis
Period: 14/11/21 → …

Keywords

  • Correction
  • Detection
  • Fault resilience
  • Fault tolerance
  • Soft errors
