Network Fault Diagnostics in Data Centers & Beyond
Faults in data center networks are hard to debug because of the scale and complexity of these networks. Transient faults that are hard to reproduce are prevalent, and their root-cause analysis is extremely difficult because historical data is unavailable and data distributed across the network cannot be correlated. Often, the root cause cannot be found using switch-local information alone. To find the root cause of such transient faults, we need: 1) Visibility: fine-grained, packet-level, network-wide observability, 2) Retrospection: the ability to retrieve historical information from before the fault occurred, and 3) Correlation: the ability to correlate information across the network. In this work, we present the design and implementation of SyNDB, a tool with these capabilities that enables root-cause analysis of network faults. We implement and evaluate SyNDB on realistic topologies using large-scale simulation and programmable switches. Our evaluations show that SyNDB can capture and correlate packet records over sufficiently large time windows (∼4 ms), thus facilitating root-cause analysis of various network faults. In the later part of the talk, I will give a high-level overview of my ongoing projects on telco and multi-cloud networking at IBM Research India (IRL).
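The three requirements above (visibility, retrospection, correlation) can be illustrated with a small sketch. This is not SyNDB's implementation: the names `PacketRecord`, `SwitchRecorder`, and `correlate` are hypothetical, and the sketch assumes that switch clocks are synchronized and that each switch appends records in timestamp order. Each switch keeps recent packet records in a fixed-size ring buffer; when a fault triggers, the records from a window before the trigger are merged into one network-wide timeline.

```python
from collections import deque
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class PacketRecord:
    timestamp_ns: int  # when the switch saw the packet (synchronized clocks assumed)
    switch_id: str
    packet_id: int


class SwitchRecorder:
    """Hypothetical per-switch recorder: keeps the last `capacity` packet records."""

    def __init__(self, switch_id, capacity):
        self.switch_id = switch_id
        # A bounded deque acts as a ring buffer: the oldest record is evicted first.
        self.records = deque(maxlen=capacity)

    def observe(self, timestamp_ns, packet_id):
        # Visibility: one record per packet seen at this switch.
        self.records.append(PacketRecord(timestamp_ns, self.switch_id, packet_id))

    def snapshot(self, start_ns, end_ns):
        # Retrospection: return buffered records from before the trigger.
        return [r for r in self.records if start_ns <= r.timestamp_ns <= end_ns]


def correlate(recorders, trigger_ns, window_ns):
    """Correlation: merge per-switch snapshots into one time-ordered list."""
    snapshots = [r.snapshot(trigger_ns - window_ns, trigger_ns) for r in recorders]
    # heapq.merge requires each input to be sorted, which holds under the
    # stated assumption that every switch appends in timestamp order.
    return list(heapq.merge(*snapshots, key=lambda r: r.timestamp_ns))


# Usage: two switches, a fault detected at t = 500 ns, a 300 ns look-back window.
s1 = SwitchRecorder("s1", capacity=4)
s2 = SwitchRecorder("s2", capacity=4)
for t in (100, 300, 500):
    s1.observe(t, packet_id=t)
for t in (200, 400, 600):
    s2.observe(t, packet_id=t)

timeline = correlate([s1, s2], trigger_ns=500, window_ns=300)
for record in timeline:
    print(record.timestamp_ns, record.switch_id)
```

The ring-buffer capacity bounds how far back the look-back window can reach, which is the trade-off behind a finite recording window such as the ∼4 ms figure quoted above.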
Pravein is a research scientist at IBM Research India. He completed his PhD at the National University of Singapore in 2019. His research has been recognized with the best paper award at ACM SOSR 2019 and a Facebook research award. Prior to his PhD, he obtained a master's degree from NUS and a B.E. from the College of Engineering, Guindy in 2008, and worked at Cisco for 4 years. His research interests lie in telco, data center, and programmable networks.