TY - JOUR
T1 - HPC Hardware Design Reliability Benchmarking With HDFIT
AU - Omland, Patrik
AU - Netti, Alessio
AU - Peng, Yang
AU - Baldovin, Andrea
AU - Paulitsch, Michael
AU - Espinosa, Gustavo
AU - Parra, Jorge
AU - Hinz, Gereon
AU - Knoll, Alois
N1 - Publisher Copyright: © 1990-2012 IEEE.
PY - 2023/3/1
Y1 - 2023/3/1
N2 - Chips pack ever more, ever smaller transistors. Fault rates increase in turn and become more concerning, particularly at the scale of High-Performance Computing (HPC) systems: on one hand, hardware fault protection is costly - more than 10% silicon area for floating-point units; on the other, HPC users expect correct application output after the anticipated time of computation, but workloads are seldom bit-reproducible and tolerances in output are allowed for. Benign hardware faults causing errors within these tolerances are therefore acceptable: however, with abstract reliability targets such as 'undetected failures per time,' current HPC system design does not allow for pursuing trade-offs between reliability and performance with respect to faults. To address the above, we propose a user-centric reliability benchmark to specify HPC system reliability targets, allowing for better performance optimizations in hardware design, while meeting HPC user expectations. Our open-source Hardware Design Fault Injection Toolkit (HDFIT) enables - for the first time - end-to-end hardware design reliability experiments: from netlist-level fault injection to application output error. In a proof of concept we present an HPC general matrix multiply (GEMM) reliability study, targeting a series of popular applications, and using HDFIT to benchmark an open-source GEMM accelerator.
KW - Fault injection
KW - HPC reliability
KW - fault model
KW - fault tolerance
KW - hardware faults
KW - hardware reliability
KW - program vulnerability
UR - http://www.scopus.com/inward/record.url?scp=85147307802&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2023.3237777
DO - 10.1109/TPDS.2023.3237777
M3 - Article
AN - SCOPUS:85147307802
SN - 1045-9219
VL - 34
SP - 995
EP - 1006
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 3
ER -