
TeaMPI—replication-based resilience without the (Performance) Pain

  • Philipp Samfass
  • Tobias Weinzierl
  • Benjamin Hazelwood
  • Michael Bader
  • Technical University of Munich
  • University of Durham

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet, replication is often not employed on larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine, a task-based solver for hyperbolic equation systems, that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned “for nothing”. Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
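
The task-sharing idea in the abstract can be illustrated with a toy, single-process sketch. It is not TeaMPI or ExaHyPE code: the names `compute` and `run_replicas`, the lock-step rounds, and the instantaneous delivery of outcomes are all assumptions made for illustration, whereas the real system exchanges outcomes between MPI ranks. Each replica walks the same task set in its own shuffled order, skips any task whose outcome a peer has already shared, and shares what it computes itself.

```python
import random


def compute(task_id):
    # Hypothetical stand-in for an expensive, deterministic task.
    return task_id * task_id


def run_replicas(num_tasks, num_replicas=2, seed=0):
    """Simulate replicas that share task outcomes (toy model, lock-step rounds)."""
    # Every replica owns the full task set but shuffles its own execution order.
    orders = []
    for r in range(num_replicas):
        order = list(range(num_tasks))
        random.Random(seed + r).shuffle(order)
        orders.append(order)

    results = [dict() for _ in range(num_replicas)]  # outcomes known per replica
    computed = [0] * num_replicas                    # tasks computed locally
    positions = [0] * num_replicas                   # cursor into each order

    while any(len(res) < num_tasks for res in results):
        outbox = []
        for r in range(num_replicas):
            order = orders[r]
            # Skip tasks whose outcome is already known (computed or received).
            while positions[r] < num_tasks and order[positions[r]] in results[r]:
                positions[r] += 1
            if positions[r] == num_tasks:
                continue
            task = order[positions[r]]
            value = compute(task)
            results[r][task] = value
            computed[r] += 1
            outbox.append((r, task, value))
        # Deliver this round's outcomes to all other replicas.
        for sender, task, value in outbox:
            for r in range(num_replicas):
                if r != sender:
                    results[r].setdefault(task, value)
    return results, computed


if __name__ == "__main__":
    results, computed = run_replicas(num_tasks=16)
    print("tasks computed locally per replica:", computed)
```

Because the orders differ, the replicas tend to work on different tasks at the same time, so each one can end up computing fewer tasks than it owns while still holding every outcome. The sketch leaves out heartbeats, failures, and message latency, all of which matter in the setting the abstract describes.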

Original language: English
Title of host publication: High Performance Computing - 35th International Conference, ISC High Performance 2020, Proceedings
Editors: Ponnuswamy Sadayappan, Bradford L. Chamberlain, Guido Juckeland, Hatem Ltaief
Publisher: Springer
Pages: 455-473
Number of pages: 19
ISBN (Print): 9783030507428
DOIs
State: Published - 2020
Event: 35th International Conference on High Performance Computing, ISC High Performance 2020 - Frankfurt, Germany
Duration: 22 Jun 2020 to 25 Jun 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12151 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 35th International Conference on High Performance Computing, ISC High Performance 2020
Country/Territory: Germany
City: Frankfurt
Period: 22/06/20 to 25/06/20
