The footprint of computing resources has skyrocketed in recent years, driven by high demand for AI workloads as well as traditional web traffic. Managing these resources intelligently is essential to deliver a good user experience at lower cost. In this talk, I will cover two projects at MSR India on such intelligent resource management. First, I will talk about Atlas, which focuses on improving training time and GPU utilization for geo-distributed LLM training. Atlas uses novel temporal bandwidth sharing and several other design choices to speed up training by 17X. However, it does not eliminate the bubbles (idle GPU cycles). We extend Atlas to multiplex training and inference on the same GPU clusters: it runs prefill-as-a-service (part of LLM inference) during the bubbles, improving GPU utilization to up to 94%.

Second, I will talk about KnapsackLB, a new layer-4 load balancer (LB) that adapts to the heterogeneous and dynamic performance of backend instances (DIPs). KnapsackLB is generic (it can work with a variety of LBs), does not require agents on DIPs, LBs, or clients, and scales to large numbers of DIPs. KnapsackLB uses judicious active probes to learn a mapping from LB weights to the response latency of each DIP, and then applies Integer Linear Programming (ILP) to calculate LB weights that optimize latency, using an iterative method to scale the computation to large numbers of DIPs. Using testbed experiments and simulations, we show that KnapsackLB balances traffic according to DIP performance and cuts average latency by up to 45% compared to existing designs.
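To give a flavor of KnapsackLB's weight-optimization step, here is a toy sketch: assume each DIP has a latency-vs-weight model learned from active probes (the linear curves below are invented for illustration, not taken from the paper), and search over integer weight assignments for the one that minimizes traffic-weighted average latency. An exhaustive search over a tiny instance stands in for the ILP solver; the real system formulates this as an ILP and uses the iterative method mentioned above to scale to large numbers of DIPs.

```python
from itertools import product

# Hypothetical learned latency models: for each backend (DIP), a function
# mapping its assigned integer weight (share of traffic) to expected response
# latency in ms. KnapsackLB learns these via active probes; these are made up.
latency_models = [
    lambda w: 10 + 2.0 * w,   # fast DIP: latency grows slowly with load
    lambda w: 15 + 5.0 * w,   # medium DIP
    lambda w: 20 + 9.0 * w,   # slow DIP: latency grows quickly with load
]

TOTAL_WEIGHT = 10  # integer weights must sum to this budget

def avg_latency(weights):
    """Traffic-weighted mean latency: a DIP with weight w serves w/TOTAL_WEIGHT of traffic."""
    return sum(w * model(w) for w, model in zip(weights, latency_models)) / TOTAL_WEIGHT

# Exhaustive search over all integer weight assignments (a stand-in for the ILP).
best = min(
    (ws for ws in product(range(TOTAL_WEIGHT + 1), repeat=len(latency_models))
     if sum(ws) == TOTAL_WEIGHT),
    key=avg_latency,
)
print(best, round(avg_latency(best), 2))  # → (7, 2, 1) 24.7
```

As expected, the search shifts most of the weight to the fast DIP but still sends some traffic to the slower ones, since overloading any single backend drives its latency up.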
Dr. Rohan Gandhi is a senior research engineer at Microsoft Research, India. He received his PhD from Purdue University and completed his post-doc at Carnegie Mellon University. His work mainly focuses on improving resource management for networked systems, and has been published in top conferences including ACM SIGCOMM, ACM HotNets, ACM CoNEXT, and ACM EuroSys.