A Survey of AIOps Methods for Failure Management

Paolo Notaro, Jorge Cardoso, Michael Gerndt

Research output: Contribution to journal, Article, peer-reviewed

49 Scopus citations

Abstract

Modern society is increasingly moving toward complex and distributed computing systems. The increase in scale and complexity of these systems challenges O&M teams that perform daily monitoring and repair operations, in contrast with the increasing demand for reliability and scalability of modern applications. For this reason, the study of automated and intelligent monitoring systems has recently sparked much interest across applied IT industry and academia. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to Machine Learning, AI, and Big Data. However, AIOps as a research topic is still largely unstructured and unexplored, due to missing conventions in categorizing contributions for their data requirements, target goals, and components. In this work, we focus on AIOps for Failure Management (FM), characterizing and describing 5 different categories and 14 subcategories of contributions, based on their time intervention window and the target problem being solved. We review 100 FM solutions, focusing on applicability requirements and the quantitative results achieved, to facilitate an effective application of AIOps solutions. Finally, we discuss current development problems in the areas covered by AIOps and delineate possible future trends for AI-based failure management.

Original language: English
Article number: 81
Journal: ACM Transactions on Intelligent Systems and Technology
Volume: 12
Issue number: 6
State: Published - Dec 2021

Keywords

  • AIOps
  • IT operations and maintenance
  • artificial intelligence
  • failure management
