TY - GEN
T1 - Rollback-recovery without checkpoints in distributed event processing systems
AU - Koldehofe, Boris
AU - Mayer, Ruben
AU - Ramachandran, Umakishore
AU - Rothermel, Kurt
AU - Völz, Marco
PY - 2013
Y1 - 2013
N2 - Reliability is of critical importance to many applications involving distributed event processing systems. Especially the use of stateful operators makes it challenging to provide efficient recovery from failures and to ensure consistent event streams. Even during failure-free execution, state-of-the-art methods for achieving reliability incur significant overhead at run-time concerning computational resources, event traffic, and event detection time. This paper proposes a novel method for rollback-recovery that allows for recovery from multiple simultaneous operator failures, but eliminates the need for persistent checkpoints. Thereby, the operator state is preserved in savepoints at points in time when its execution solely depends on the state of incoming event streams which are reproducible by predecessor operators. We propose an expressive event processing model to determine savepoints and algorithms for their coordination in a distributed operator network. Evaluations show that very low overhead at failure-free execution in comparison to other approaches is achieved.
AB - Reliability is of critical importance to many applications involving distributed event processing systems. Especially the use of stateful operators makes it challenging to provide efficient recovery from failures and to ensure consistent event streams. Even during failure-free execution, state-of-the-art methods for achieving reliability incur significant overhead at run-time concerning computational resources, event traffic, and event detection time. This paper proposes a novel method for rollback-recovery that allows for recovery from multiple simultaneous operator failures, but eliminates the need for persistent checkpoints. Thereby, the operator state is preserved in savepoints at points in time when its execution solely depends on the state of incoming event streams which are reproducible by predecessor operators. We propose an expressive event processing model to determine savepoints and algorithms for their coordination in a distributed operator network. Evaluations show that very low overhead at failure-free execution in comparison to other approaches is achieved.
KW - Complex event processing
KW - Recovery
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=84881159962&partnerID=8YFLogxK
U2 - 10.1145/2488222.2488259
DO - 10.1145/2488222.2488259
M3 - Conference contribution
AN - SCOPUS:84881159962
SN - 9781450317580
T3 - DEBS 2013 - Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems
SP - 27
EP - 38
BT - DEBS 2013 - Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems
T2 - 7th ACM International Conference on Distributed Event-Based Systems, DEBS 2013
Y2 - 29 June 2013 through 3 July 2013
ER -