CNI Seminar Series

Failure diagnosis in networked systems

Dr. Vipul Harsh, Post-doc, Conviva Networks

#240
Abstract

Failure incidents in networked systems, including distributed services, datacenter networks, and private or public cloud environments, result in significant downtime and violations of service-level agreements, incurring large financial losses. To mitigate failures quickly, operators seek to implement automated failure diagnosis, also known as Root Cause Analysis (RCA). Designing RCA solutions involves two major challenges: (1) accurately modeling the behaviour of the system from available telemetry and (2) using the model to infer the root causes accurately. Existing RCA approaches struggle to model complex environments or employ suboptimal inference algorithms that fail to extract high accuracy from the model. I will present practical solutions for RCA that tackle these challenges by leveraging powerful reasoning techniques to derive insights from available monitoring data. Finally, I will briefly touch upon the need for a holistic systems-level approach to effective failure diagnosis, beyond just enhancing modeling techniques and algorithms.


Bio

Vipul Harsh graduated with a Ph.D. from UIUC, where he worked with Brighten Godfrey. He is broadly interested in networked and distributed systems. His work spans failure diagnosis in networked systems, datacenter topology, distributed monitoring, and parallel algorithms. His work has been published in top-tier CS conferences, and one of his projects has been adopted into a product at VMware. He is currently a post-doc at Conviva Networks with Vyas Sekar, where he works on failure diagnosis in internet-scale services, among other things.