TY - GEN
T1 - Hardware and software innovations in energy-efficient system-reliability monitoring
AU - Tenentes, Vasileios
AU - Leech, Charles
AU - Bragg, Graeme M.
AU - Merrett, Geoff
AU - Al-Hashimi, Bashir M.
AU - Amrouch, Hussam
AU - Henkel, Jorg
AU - Das, Shidhartha
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/6/28
Y1 - 2017/6/28
N2 - Many threats that can undermine the reliability of a system can be realized at design, while others only during its online operation. As the availability of system monitoring sensors and run-time software increases in heterogeneous platforms, there is a demand for a novel platform-independent framework that can capture and deliver, in a holistic way, system level self-assessment and adaptation capabilities at run-time. In this paper, two groups from academia and one from industry present the following three contributions. First, system reliability is considered from the perspective of novel timing guardband designs for aging mitigation. Effective timing guardband models are presented from the physical to the system level, while targeting multiple wear-out mechanisms. Second, a technique for correlating complex software and micro-architectural events with power integrity loss is presented. The presented technique uses an embedded voltage noise sensor, a power-network model and a genetic algorithm for identifying workload that triggers power-network resonances which can ultimately lead to system failures. Third, the 'PRiME' cross-layer programming framework is presented that unites available sensors and dynamic-voltage and frequency scaling actuators with learning-based run-time process mapping and scheduling algorithms. Scenarios on exploring the energy efficiency and reliability of heterogeneous platforms using run-time software derived from the developed framework are also reviewed.
AB - Many threats that can undermine the reliability of a system can be realized at design, while others only during its online operation. As the availability of system monitoring sensors and run-time software increases in heterogeneous platforms, there is a demand for a novel platform-independent framework that can capture and deliver, in a holistic way, system level self-assessment and adaptation capabilities at run-time. In this paper, two groups from academia and one from industry present the following three contributions. First, system reliability is considered from the perspective of novel timing guardband designs for aging mitigation. Effective timing guardband models are presented from the physical to the system level, while targeting multiple wear-out mechanisms. Second, a technique for correlating complex software and micro-architectural events with power integrity loss is presented. The presented technique uses an embedded voltage noise sensor, a power-network model and a genetic algorithm for identifying workload that triggers power-network resonances which can ultimately lead to system failures. Third, the 'PRiME' cross-layer programming framework is presented that unites available sensors and dynamic-voltage and frequency scaling actuators with learning-based run-time process mapping and scheduling algorithms. Scenarios on exploring the energy efficiency and reliability of heterogeneous platforms using run-time software derived from the developed framework are also reviewed.
UR - http://www.scopus.com/inward/record.url?scp=85046015785&partnerID=8YFLogxK
U2 - 10.1109/DFT.2017.8244435
DO - 10.1109/DFT.2017.8244435
M3 - Conference contribution
AN - SCOPUS:85046015785
T3 - 2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017
SP - 1
EP - 5
BT - 2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017
Y2 - 23 October 2017 through 25 October 2017
ER -