Enabling Application-Integrated Proactive Fault Tolerance

Dai Yang, Josef Weidendorfer, Carsten Trinitis, Tilman Küstner, Sibylle Ziegler

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

Abstract

Exascale computing is the next major milestone for the HPC community. Due to a steadily increasing probability of failures, current applications must be made malleable so that they can cope with dynamic resource changes. In this paper, we show first results with LAIK, a lightweight library for dynamically re-distributable application data. This allows compute nodes to be freed from their workload before a predicted failure. For a real-world application, we show that LAIK adds negligible overhead. In addition, we show the effect of different re-distribution strategies.

Original language: English
Title of host publication: Parallel Computing is Everywhere
Editors: Gerhard R. Joubert, Patrizio Dazzi, Frans Peters, Marco Danelutto, Sanzio Bassini
Publisher: IOS Press BV
Pages: 475-484
Number of pages: 10
ISBN (Electronic): 9781614998426
State: Published - 2018
Externally published: Yes

Publication series

Name: Advances in Parallel Computing
Volume: 32
ISSN (Print): 0927-5452
ISSN (Electronic): 1879-808X

Keywords

  • Application-Integrated Fault Tolerance
  • Data Distribution
  • High Performance Computing
  • Parallel Programming Models
