TY - JOUR
T1 - Response Time and Availability Study of RAFT Consensus in Distributed SDN Control Plane
AU - Sakic, Ermin
AU - Kellerer, Wolfgang
N1 - Publisher Copyright:
© 2004-2012 IEEE.
PY - 2018/3
Y1 - 2018/3
N2 - Software defined networking (SDN) promises unprecedented flexibility and ease of network operations. While flexibility is an important factor when leveraging advantages of a new technology, critical infrastructure networks also have stringent requirements on network robustness and control plane delays. Robustness in the SDN control plane is realized by deploying multiple distributed controllers, formed into clusters for durability and fast-failover purposes. However, the effect of the controller clustering on the total system response time is not well investigated in current literature. Hence, in this work we provide a detailed analytical study of the distributed consensus algorithm RAFT, implemented in OpenDaylight and ONOS SDN controller platforms. In those controllers, RAFT implements the data-store replication, leader election after controller failures and controller state recovery on successful repairs. To evaluate its performance, we introduce a framework for numerical analysis of various SDN cluster organizations w.r.t. Their response time and availability metrics. We use Stochastic Activity Networks for modeling the RAFT operations, failure injection and cluster recovery processes, and using real-world experiments, we collect the rate parameters to provide realistic inputs for a representative cluster recovery model. We also show how a fast rejuvenation mechanism for the treatment of failures induced by software errors can minimize the total response time experienced by the controller clients, while guaranteeing a higher system availability in the long-Term.
AB - Software defined networking (SDN) promises unprecedented flexibility and ease of network operations. While flexibility is an important factor when leveraging advantages of a new technology, critical infrastructure networks also have stringent requirements on network robustness and control plane delays. Robustness in the SDN control plane is realized by deploying multiple distributed controllers, formed into clusters for durability and fast-failover purposes. However, the effect of the controller clustering on the total system response time is not well investigated in current literature. Hence, in this work we provide a detailed analytical study of the distributed consensus algorithm RAFT, implemented in OpenDaylight and ONOS SDN controller platforms. In those controllers, RAFT implements the data-store replication, leader election after controller failures and controller state recovery on successful repairs. To evaluate its performance, we introduce a framework for numerical analysis of various SDN cluster organizations w.r.t. Their response time and availability metrics. We use Stochastic Activity Networks for modeling the RAFT operations, failure injection and cluster recovery processes, and using real-world experiments, we collect the rate parameters to provide realistic inputs for a representative cluster recovery model. We also show how a fast rejuvenation mechanism for the treatment of failures induced by software errors can minimize the total response time experienced by the controller clients, while guaranteeing a higher system availability in the long-Term.
KW - ONOS
KW - OpenDaylight
KW - Performance analysis
KW - RAFT
KW - SDN
KW - distributed control plane
KW - fault tolerance
KW - smart grid
KW - stochastic activity networks
KW - strong consistency
UR - http://www.scopus.com/inward/record.url?scp=85035150500&partnerID=8YFLogxK
U2 - 10.1109/TNSM.2017.2775061
DO - 10.1109/TNSM.2017.2775061
M3 - Article
AN - SCOPUS:85035150500
SN - 1932-4537
VL - 15
SP - 304
EP - 318
JO - IEEE Transactions on Network and Service Management
JF - IEEE Transactions on Network and Service Management
IS - 1
ER -