Efficient verification of IT change operations or: How we could have prevented Amazon's cloud outage

Sebastian Hagen, Michael Seibold, Alfons Kemper

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

16 Scopus citations

Abstract

On April 21st, 2011, a major outage occurred in Amazon's US east coast data center, which led to significant disruptions of customer services. The root cause of the outage was an IT change intended to shift traffic from a router to a redundant router in order to conduct a network upgrade. The change was executed incorrectly: the router that was picked could not handle the traffic due to capacity constraints. Consequently, network outages occurred, finally leading to unavailability and to temporary and even permanent loss of customer data. We propose an object-oriented verification technique to detect conflicts among IT change operations and safety constraints, such as network capacity constraints, in the verification phase before the execution of IT changes. Based on Amazon's incident report, different scenarios in static and dynamic routing environments that cause a network overload are shown to be detectable by logical verification. The verification algorithm is proven to be sound and has linear runtime complexity for Amazon's network overload scenarios. A performance analysis confirms the theoretical results and promises scalability to thousands of IT changes and safety constraints.

Original language: English
Title of host publication: Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012
Pages: 368-376
Number of pages: 9
State: Published - 2012
Event: 2012 IEEE Network Operations and Management Symposium, NOMS 2012 - Maui, HI, United States
Duration: 16 Apr 2012 → 20 Apr 2012

Publication series

Name: Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012

Conference

Conference: 2012 IEEE Network Operations and Management Symposium, NOMS 2012
Country/Territory: United States
City: Maui, HI
Period: 16/04/12 → 20/04/12
