Fault resilience of the algebraic multi-grid solver

Marc Casas, Bronis R. De Supinski, Greg Bronevetsky, Martin Schulz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

58 Scopus citations

Abstract

As HPC system sizes grow to millions of cores and chip feature sizes continue to decrease, HPC applications become increasingly exposed to transient hardware faults. These faults can cause aborts and performance degradation. Most importantly, they can corrupt results. Thus, we must evaluate the fault vulnerability of key HPC algorithms to develop cost-effective techniques to improve application resilience. We present an approach that analyzes the vulnerability of applications to faults, systematically reduces it by protecting the most vulnerable components and predicts application vulnerability at large scales. We initially focus on sparse scientific applications and apply our approach in this paper to the Algebraic Multi Grid (AMG) algorithm. We empirically analyze AMG's vulnerability to hardware faults in both sequential and parallel (hybrid MPI/OpenMP) executions on up to 1,600 cores and propose and evaluate the use of targeted pointer replication to reduce it. Our techniques increase AMG's resilience to transient hardware faults by 50-80% and improve its scalability on faulty computational environments by 35%. Further, we show how to model AMG's scalability in fault-prone environments to predict execution times of large-scale runs accurately.
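The abstract names targeted pointer replication but does not say how it is implemented. The sketch below is one plausible reading, not the authors' method. It assumes three copies of each protected value, a majority vote on every read, and a sparse matrix in compressed sparse row (CSR) layout whose row offsets stand in for pointers. The names `ReplicatedIndex`, `flip_bit`, and `csr_row_sum` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicatedIndex:
    """Three copies of one index value (a stand-in for a pointer).

    Assumption: triple replication with majority voting. The abstract
    does not specify the replication scheme.
    """
    a: int
    b: int
    c: int

    @classmethod
    def of(cls, value: int) -> "ReplicatedIndex":
        # Start with three identical copies of the value.
        return cls(value, value, value)

    def read(self) -> int:
        """Majority-vote the copies, repair the odd one out, return the value."""
        if self.a == self.b or self.a == self.c:
            winner = self.a
        elif self.b == self.c:
            winner = self.b
        else:
            raise RuntimeError("no two replicas agree; fault is uncorrectable")
        # Overwrite all copies so a single corrupted replica is repaired.
        self.a = self.b = self.c = winner
        return winner


def flip_bit(value: int, bit: int) -> int:
    """Model a transient fault by flipping one bit of an integer."""
    return value ^ (1 << bit)


def csr_row_sum(values, row_ptr, row):
    """Sum one row of a CSR matrix whose row offsets are replicated."""
    start = row_ptr[row].read()
    end = row_ptr[row + 1].read()
    return sum(values[start:end])


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0]  # two rows with two nonzeros each
    row_ptr = [ReplicatedIndex.of(i) for i in (0, 2, 4)]
    # Corrupt one replica of the offset that ends row 0.
    row_ptr[1].b = flip_bit(row_ptr[1].b, 3)
    print(csr_row_sum(values, row_ptr, 0))
```

Only the index data is replicated here, not the floating-point values. That mirrors the "targeted" idea in the abstract: a corrupted offset can send a traversal to the wrong memory, so protecting a small amount of structural data may be cheaper than protecting the whole matrix. Which structures the paper actually protects is not stated in the abstract.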

Original language: English
Title of host publication: ICS'12 - Proceedings of the 2012 ACM International Conference on Supercomputing
Pages: 91-100
Number of pages: 10
State: Published - 2012
Externally published: Yes
Event: 26th ACM International Conference on Supercomputing, ICS'12 - San Servolo Island, Venice, Italy
Duration: 25 Jun 2012 to 29 Jun 2012

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 26th ACM International Conference on Supercomputing, ICS'12
Country/Territory: Italy
City: San Servolo Island, Venice
Period: 25/06/12 to 29/06/12

Keywords

  • Algebraic Multi-Grid solver
  • Resilience
  • Transient faults
