Over the past couple of years, Large Language Models (LLMs) have captured the world's imagination by demonstrating human-like skills in a wide variety of tasks spanning natural language processing, question answering, and code generation. Consequently, LLMs are being deployed at an unprecedented scale across applications such as chatbots, search, and code assistants. However, serving LLMs is expensive: each replica of the model typically requires multiple GPUs, yet resource utilization at each replica is usually low. In this talk, I will discuss how LLMs are served and why serving them efficiently is challenging, along with some of the work we have done at Microsoft Research to make LLM serving more efficient. In particular, I will talk about Sarathi-Serve [OSDI'24] and vAttention [ASPLOS'25], which address some of the fundamental challenges in scheduling and memory management in LLM serving systems.
Ashish Panwar is a Senior Researcher in the systems group at Microsoft Research India. He currently spends most of his time thinking about how to optimize the performance of LLM serving systems. Ashish obtained his PhD in 2022 from the CSA department at IISc. In his thesis, he explored various methods to optimize virtual memory management in modern operating systems, for which he received the best PhD thesis award from CSA.