TY - GEN
T1 - Experience report
T2 - 26th IEEE International Symposium on Software Reliability Engineering, ISSRE 2015
AU - Farshchi, Mostafa
AU - Schneider, Jean Guy
AU - Weber, Ingo
AU - Grundy, John
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2016/1/13
Y1 - 2016/1/13
N2 - Failure of application operations is one of the main causes of system-wide outages in cloud environments. This particularly applies to DevOps operations, such as backup, redeployment, upgrade, customized scaling, and migration, which are exposed to frequent interference from other concurrent operations, configuration changes, and resource failures. However, current practices fail to provide a reliable assurance of correct execution of these kinds of operations. In this paper, we present an approach to address this problem that adopts a regression-based analysis technique to find the correlation between an operation's activity logs and the operation activity's effect on cloud resources. The correlation model is then used to derive assertion specifications, which can be used for runtime verification of running operations and their impact on resources. We evaluated our proposed approach on Amazon EC2 with 22 rounds of rolling upgrade operations while other types of operations were running and random faults were injected. Our experiment shows that our approach successfully managed to raise alarms for 115 random injected faults, with a precision of 92.3%.
KW - Cloud application operations
KW - Cloud monitoring
KW - DevOps
KW - anomaly detection
KW - error detection
KW - log analysis
UR - http://www.scopus.com/inward/record.url?scp=84964815259&partnerID=8YFLogxK
U2 - 10.1109/ISSRE.2015.7381796
DO - 10.1109/ISSRE.2015.7381796
M3 - Conference contribution
AN - SCOPUS:84964815259
T3 - 2015 IEEE 26th International Symposium on Software Reliability Engineering, ISSRE 2015
SP - 24
EP - 34
BT - 2015 IEEE 26th International Symposium on Software Reliability Engineering, ISSRE 2015
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 November 2015 through 5 November 2015
ER -