TY - GEN
T1 - Implicit intermittent fault detection in distributed systems
AU - Waszecki, Peter
AU - Kauer, Matthias
AU - Lukasiewycz, Martin
AU - Chakraborty, Samarjit
PY - 2014
Y1 - 2014
N2 - This paper presents a novel approach to detect resources in distributed systems with an increased occurrence of intermittent faults that exceed the amount of unavoidable transient faults caused by environmental phenomena. Intermittent faults occur due to stressed resources and often are a precursor of permanent faults. The proposed early fault detection and diagnosis allows the use of precautionary measures before the permanent failure of a component in a distributed system occurs. In this paper, we present four methods that can implicitly detect intermittent faults by taking the distributed applications and their dependencies into account. Thus, explicit tests are not required which would lead to additional costs and resource load. On the other hand, the implicit approach may considerably reduce the number of plausibility tests compared to the conservative solution with one test per resource. We analyzed and evaluated implementations of the proposed fault detection principle. The experimental results give evidence of the feasibility of our approach and show a comparison of the implemented methods in terms of runtime and detection rate.
AB - This paper presents a novel approach to detect resources in distributed systems with an increased occurrence of intermittent faults that exceed the amount of unavoidable transient faults caused by environmental phenomena. Intermittent faults occur due to stressed resources and often are a precursor of permanent faults. The proposed early fault detection and diagnosis allows the use of precautionary measures before the permanent failure of a component in a distributed system occurs. In this paper, we present four methods that can implicitly detect intermittent faults by taking the distributed applications and their dependencies into account. Thus, explicit tests are not required which would lead to additional costs and resource load. On the other hand, the implicit approach may considerably reduce the number of plausibility tests compared to the conservative solution with one test per resource. We analyzed and evaluated implementations of the proposed fault detection principle. The experimental results give evidence of the feasibility of our approach and show a comparison of the implemented methods in terms of runtime and detection rate.
UR - http://www.scopus.com/inward/record.url?scp=84897886240&partnerID=8YFLogxK
U2 - 10.1109/ASPDAC.2014.6742964
DO - 10.1109/ASPDAC.2014.6742964
M3 - Conference contribution
AN - SCOPUS:84897886240
SN - 9781479928163
T3 - Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC
SP - 646
EP - 651
BT - 2014 19th Asia and South Pacific Design Automation Conference, ASP-DAC 2014 - Proceedings
T2 - 2014 19th Asia and South Pacific Design Automation Conference, ASP-DAC 2014
Y2 - 20 January 2014 through 23 January 2014
ER -