Efficient verification of IT change operations or: How we could have prevented Amazon's cloud outage

Sebastian Hagen, Michael Seibold, Alfons Kemper

Publication: Contribution to book/report/conference proceedings › Conference contribution › Peer-reviewed

15 citations (Scopus)

Abstract

On April 21st, 2011, a major outage occurred in Amazon's US east coast data center, which led to significant disruptions of customer services. The root cause of the outage was an IT change intended to shift traffic from one router to a redundant router so that a network upgrade could be conducted. The change was executed incorrectly: the router that was picked could not handle the traffic due to capacity constraints. Consequently, network outages occurred, ultimately leading to unavailability as well as temporary and even permanent loss of customer data. We propose an object-oriented verification technique to detect conflicts among IT change operations and safety constraints, such as network capacity constraints, in the verification phase before the execution of IT changes. Based on Amazon's incident report, different scenarios in static and dynamic routing environments that cause a network overload are shown to be detectable by logical verification. The verification algorithm is proven to be sound and has linear runtime complexity for Amazon's network overload scenarios. A performance analysis confirms the theoretical results and promises scalability to thousands of IT changes and safety constraints.

Original language: English
Title: Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012
Pages: 368-376
Number of pages: 9
DOIs
Publication status: Published - 2012
Event: 2012 IEEE Network Operations and Management Symposium, NOMS 2012 - Maui, HI, United States
Duration: 16 Apr. 2012 to 20 Apr. 2012

Publication series

Name: Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012

Conference

Conference: 2012 IEEE Network Operations and Management Symposium, NOMS 2012
Country/Territory: United States
Location: Maui, HI
Period: 16/04/12 to 20/04/12
