TY - JOUR
T1 - A Survey of AIOps Methods for Failure Management
AU - Notaro, Paolo
AU - Cardoso, Jorge
AU - Gerndt, Michael
N1 - Publisher Copyright:
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2021/12
Y1 - 2021/12
N2 - Modern society is increasingly moving toward complex and distributed computing systems. The increase in scale and complexity of these systems challenges O&M teams that perform daily monitoring and repair operations, in contrast with the increasing demand for reliability and scalability of modern applications. For this reason, the study of automated and intelligent monitoring systems has recently sparked much interest across applied IT industry and academia. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to Machine Learning, AI, and Big Data. However, AIOps as a research topic is still largely unstructured and unexplored, due to missing conventions in categorizing contributions for their data requirements, target goals, and components. In this work, we focus on AIOps for Failure Management (FM), characterizing and describing 5 different categories and 14 subcategories of contributions, based on their time intervention window and the target problem being solved. We review 100 FM solutions, focusing on applicability requirements and the quantitative results achieved, to facilitate an effective application of AIOps solutions. Finally, we discuss current development problems in the areas covered by AIOps and delineate possible future trends for AI-based failure management.
AB - Modern society is increasingly moving toward complex and distributed computing systems. The increase in scale and complexity of these systems challenges O&M teams that perform daily monitoring and repair operations, in contrast with the increasing demand for reliability and scalability of modern applications. For this reason, the study of automated and intelligent monitoring systems has recently sparked much interest across applied IT industry and academia. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to Machine Learning, AI, and Big Data. However, AIOps as a research topic is still largely unstructured and unexplored, due to missing conventions in categorizing contributions for their data requirements, target goals, and components. In this work, we focus on AIOps for Failure Management (FM), characterizing and describing 5 different categories and 14 subcategories of contributions, based on their time intervention window and the target problem being solved. We review 100 FM solutions, focusing on applicability requirements and the quantitative results achieved, to facilitate an effective application of AIOps solutions. Finally, we discuss current development problems in the areas covered by AIOps and delineate possible future trends for AI-based failure management.
KW - AIOps
KW - IT operations and maintenance
KW - artificial intelligence
KW - failure management
UR - http://www.scopus.com/inward/record.url?scp=85123922295&partnerID=8YFLogxK
U2 - 10.1145/3483424
DO - 10.1145/3483424
M3 - Article
AN - SCOPUS:85123922295
SN - 2157-6904
VL - 12
JO - ACM Transactions on Intelligent Systems and Technology
JF - ACM Transactions on Intelligent Systems and Technology
IS - 6
M1 - 81
ER -