Mechanisms and evaluation of cross-layer fault-tolerance for supercomputing

Chen Han Ho, Marc De Kruijf, Karthikeyan Sankaralingam, Barry Rountree, Martin Schulz, Bronis R. De Supinski

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

Reliability is emerging as an important constraint for future microprocessors. Cooperative hardware and software approaches for error tolerance can solve this hardware reliability challenge. Cross-layer fault tolerance frameworks expose hardware failures to upper layers, such as the compiler, to help correct faults. Such cooperative approaches require less hardware complexity than masking all faults at the hardware level and are generally more energy efficient. This paper provides a detailed design and an implementation study of cross-layer fault tolerance for supercomputing. Since supercomputers necessarily involve large component counts, they have more frequent failures than consumer electronics and small systems. Conventionally, these systems use redundancy and checkpointing to achieve reliable computing. However, redundancy increases acquisition as well as recurring energy costs. This paper describes a simple language-level mechanism coupled with complementary compilation and lightweight hardware error detection that provides efficient reliability and cross-layer fault-tolerance for supercomputers. Our evaluation focuses on strong scaling problems for which we can trade computing power for redundancy. Our results show a range of 1.07x to 2.5x speedup when employing cross-layer error-tolerance compared to conventional full dual modular redundancy (DMR) to contain all errors within hardware. Further, we demonstrate the approach can sustain 7% to 50% lower energy. The most important result of this work is qualitative: we can use a simplified hardware design with relaxed architectural correctness guarantees.

Original language: English
Title of host publication: Proceedings - 41st International Conference on Parallel Processing, ICPP 2012
Pages: 510-519
Number of pages: 10
State: Published - 2012
Externally published: Yes
Event: 41st International Conference on Parallel Processing, ICPP 2012 - Pittsburgh, PA, United States
Duration: 10 Sep 2012 to 13 Sep 2012

Publication series

Name: Proceedings of the International Conference on Parallel Processing
ISSN (Print): 0190-3918

Conference

Conference: 41st International Conference on Parallel Processing, ICPP 2012
Country/Territory: United States
City: Pittsburgh, PA
Period: 10/09/12 to 13/09/12

Keywords

  • Cross-Layer Fault Tolerance
  • HPC
  • Reliability
  • Supercomputing
