FlipTracker: Understanding natural error resilience in HPC applications

Luanzheng Guo, Dong Li, Ignacio Laguna, Martin Schulz

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

24 Scopus citations

Abstract

As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain codes, this resilience can come for free, i.e., some applications are naturally resilient, but few studies have shown the code patterns - combinations or sequences of computations - that make an application naturally resilient. In this paper, we present FlipTracker, a framework designed to extract these patterns using fine-grained tracking of error propagation and resilience properties, and we use it to present a set of computation patterns that are responsible for making representative HPC applications naturally resilient to errors. This not only enables a deeper understanding of resilience properties of these codes, but also can guide future application designs towards patterns with natural resilience.

Original languageEnglish
Title of host publicationProceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages94-107
Number of pages14
ISBN (Electronic)9781538683842
DOIs
StatePublished - 2 Jul 2018
Event2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 - Dallas, United States
Duration: 11 Nov 201816 Nov 2018

Publication series

NameProceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018

Conference

Conference2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Country/TerritoryUnited States
CityDallas
Period11/11/1816/11/18

Keywords

  • Fault tolerance
  • High-Performance Computing
  • Natural Resilience
  • Resilience computation patterns

Fingerprint

Dive into the research topics of 'FlipTracker: Understanding natural error resilience in HPC applications'. Together they form a unique fingerprint.

Cite this