TY - GEN
T1 - Efficient verification of IT change operations or
T2 - 2012 IEEE Network Operations and Management Symposium, NOMS 2012
AU - Hagen, Sebastian
AU - Seibold, Michael
AU - Kemper, Alfons
PY - 2012
Y1 - 2012
N2 - On April 21st, 2011, a major outage occurred in Amazon's US east coast data center, which led to significant disruptions to customer services. The root cause of the outage was an IT change to route traffic off a router onto a redundant router in order to conduct a network upgrade. The change was wrongly executed, as a router was picked that could not handle the traffic due to capacity constraints. Consequently, network outages occurred, ultimately leading to unavailability and to temporary and even permanent loss of customer data. We propose an object-oriented verification technique to detect conflicts among IT change operations and safety constraints, such as network capacity constraints, in the verification phase before the execution of IT changes. Based on Amazon's incident report, different scenarios in static and dynamic routing environments that cause a network overload are shown to be detectable by logical verification. The verification algorithm is proven to be sound and has linear runtime complexity for Amazon's network overload scenarios. A performance analysis confirms the theoretical results and promises scalability to thousands of IT changes and safety constraints.
AB - On April 21st, 2011, a major outage occurred in Amazon's US east coast data center, which led to significant disruptions to customer services. The root cause of the outage was an IT change to route traffic off a router onto a redundant router in order to conduct a network upgrade. The change was wrongly executed, as a router was picked that could not handle the traffic due to capacity constraints. Consequently, network outages occurred, ultimately leading to unavailability and to temporary and even permanent loss of customer data. We propose an object-oriented verification technique to detect conflicts among IT change operations and safety constraints, such as network capacity constraints, in the verification phase before the execution of IT changes. Based on Amazon's incident report, different scenarios in static and dynamic routing environments that cause a network overload are shown to be detectable by logical verification. The verification algorithm is proven to be sound and has linear runtime complexity for Amazon's network overload scenarios. A performance analysis confirms the theoretical results and promises scalability to thousands of IT changes and safety constraints.
UR - http://www.scopus.com/inward/record.url?scp=84864255488&partnerID=8YFLogxK
U2 - 10.1109/NOMS.2012.6211920
DO - 10.1109/NOMS.2012.6211920
M3 - Conference contribution
AN - SCOPUS:84864255488
SN - 9781467302685
T3 - Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012
SP - 368
EP - 376
BT - Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012
Y2 - 16 April 2012 through 20 April 2012
ER -