HPC Hardware Design Reliability Benchmarking With HDFIT

Patrik Omland, Alessio Netti, Yang Peng, Andrea Baldovin, Michael Paulitsch, Gustavo Espinosa, Jorge Parra, Gereon Hinz, Alois Knoll

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Chips pack ever more, ever smaller transistors. Fault rates increase in turn and become more concerning, particularly at the scale of High-Performance Computing (HPC) systems. On one hand, hardware fault protection is costly: more than 10% silicon area for floating-point units. On the other, HPC users expect correct application output after the anticipated time of computation, but workloads are seldom bit-reproducible and tolerances in output are allowed for. Benign hardware faults causing errors within these tolerances are therefore acceptable. However, with abstract reliability targets such as 'undetected failures per time,' current HPC system design does not allow for pursuing trade-offs between reliability and performance with respect to faults. To address the above, we propose a user-centric reliability benchmark to specify HPC system reliability targets, allowing for better performance optimizations in hardware design while meeting HPC user expectations. Our open-source Hardware Design Fault Injection Toolkit (HDFIT) enables, for the first time, end-to-end hardware design reliability experiments: from netlist-level fault injection to application output error. In a proof of concept, we present an HPC general matrix multiply (GEMM) reliability study, targeting a series of popular applications and using HDFIT to benchmark an open-source GEMM accelerator.
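
The idea of a benign fault, one whose effect on the output stays within the user's tolerance, can be illustrated with a small software-level sketch. This is not HDFIT and does not use its API: HDFIT injects faults at netlist level, whereas the hypothetical helpers below (`flip_bit`, `classify_fault`) simply flip one bit in a single output element of a tiny GEMM, assuming IEEE 754 binary32 values and an arbitrarily chosen relative tolerance of 1e-5.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 31 = sign) of value's binary32 encoding."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


def matmul(a, b):
    """Plain triple-loop matrix multiply on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def max_relative_error(reference, observed):
    """Largest element-wise relative error (reference entries must be nonzero)."""
    return max(abs(r - o) / abs(r)
               for ref_row, obs_row in zip(reference, observed)
               for r, o in zip(ref_row, obs_row))


def classify_fault(bit: int, tolerance: float = 1e-5) -> str:
    """Inject one bit flip into a GEMM output element and compare to tolerance.

    The matrices and the tolerance are illustrative assumptions, not values
    taken from the paper.
    """
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    golden = matmul(a, b)                       # fault-free reference output
    faulty = [row[:] for row in golden]         # copy, then corrupt one element
    faulty[0][0] = flip_bit(faulty[0][0], bit)
    error = max_relative_error(golden, faulty)
    return "within tolerance" if error <= tolerance else "exceeds tolerance"


if __name__ == "__main__":
    for bit in (0, 30):
        print(f"bit {bit}: {classify_fault(bit)}")
```

Under these assumptions, a flip in the least significant mantissa bit (bit 0) stays within tolerance, while a flip in the most significant exponent bit (bit 30) does not. An abstract target such as 'undetected failures per time' would count both cases the same way, which is the distinction a user-centric target is meant to capture.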

Original language: English
Pages (from-to): 995-1006
Number of pages: 12
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 34
Issue number: 3
State: Published - 1 Mar 2023

Keywords

  • Fault injection
  • HPC reliability
  • fault model
  • fault tolerance
  • hardware faults
  • hardware reliability
  • program vulnerability
