TY - GEN
T1 - An Optical Transceiver Reliability Study based on SFP Monitoring and OS-level Metric Data
AU - Notaro, Paolo
AU - Yu, Qiao
AU - Haeri, Soroush
AU - Cardoso, Jorge
AU - Gerndt, Michael
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - The increasing demand for cloud computing drives the expansion in scale of datacenters and their internal optical networks, in pursuit of higher bandwidth, high reliability, and lower latency. Optical transceivers are essential elements of optical networks, yet their reliability has not been studied as thoroughly as that of other hardware components. In this paper, we leverage large quantities of monitoring data from optical transceivers and OS-level metrics to provide statistical insights into the occurrence of optical transceiver failures. We estimate transceiver failure rates and normal operating ranges for monitored attributes, correlate early-observable patterns with known failure symptoms, and finally develop failure prediction models based on our analyses. Our results enable network administrators to deploy early-warning systems and enact predictive maintenance strategies, such as replacement or traffic re-routing, reducing the number of incidents and their associated costs.
KW - cloud computing
KW - datacenters
KW - failure study
KW - hardware reliability
KW - optical network
KW - optical transceiver
UR - http://www.scopus.com/inward/record.url?scp=85166316025&partnerID=8YFLogxK
U2 - 10.1109/CCGrid57682.2023.00011
DO - 10.1109/CCGrid57682.2023.00011
M3 - Conference contribution
AN - SCOPUS:85166316025
T3 - Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023
SP - 1
EP - 12
BT - Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023
A2 - Simmhan, Yogesh
A2 - Altintas, Ilkay
A2 - Varbanescu, Ana-Lucia
A2 - Balaji, Pavan
A2 - Prasad, Abhinandan S.
A2 - Carnevale, Lorenzo
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023
Y2 - 1 May 2023 through 4 May 2023
ER -