Implicit intermittent fault detection in distributed systems

Peter Waszecki, Matthias Kauer, Martin Lukasiewycz, Samarjit Chakraborty

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

This paper presents a novel approach to detecting resources in distributed systems whose rate of intermittent faults exceeds the level of unavoidable transient faults caused by environmental phenomena. Intermittent faults occur due to stressed resources and are often a precursor of permanent faults. The proposed early fault detection and diagnosis allows precautionary measures to be taken before a component in a distributed system fails permanently. We present four methods that implicitly detect intermittent faults by taking the distributed applications and their dependencies into account. Explicit tests, which would incur additional cost and resource load, are therefore not required. Moreover, the implicit approach may considerably reduce the number of plausibility tests compared to the conservative solution with one test per resource. We analyzed and evaluated implementations of the proposed fault detection principle. The experimental results give evidence of the feasibility of our approach and compare the implemented methods in terms of runtime and detection rate.

Original language: English
Title of host publication: 2014 19th Asia and South Pacific Design Automation Conference, ASP-DAC 2014 - Proceedings
Pages: 646-651
Number of pages: 6
State: Published - 2014
Event: 2014 19th Asia and South Pacific Design Automation Conference, ASP-DAC 2014 - Suntec, Singapore
Duration: 20 Jan 2014 → 23 Jan 2014

Publication series

Name: Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC

Conference

Conference: 2014 19th Asia and South Pacific Design Automation Conference, ASP-DAC 2014
Country/Territory: Singapore
City: Suntec
Period: 20/01/14 → 23/01/14
