CNI Seminar Series

Scaling AI Model Serving: QoS, Multimodality, and Beyond

Dr. Jayashree Mohan, Senior Researcher at Microsoft Research

#251
Abstract

As AI models grow in scale and complexity, deploying them in production environments presents significant challenges, including heterogeneous multi-stage inference pipelines, bursty and unpredictable traffic patterns, and diverse latency requirements across applications. This talk addresses these challenges through two complementary systems. I'll first present Niyama, a QoS-driven inference serving system for Large Language Models (LLMs). Niyama enables fine-grained latency classification and dynamic scheduling on shared infrastructure, leveraging predictable execution patterns to improve throughput while maintaining strict service-level objectives (SLOs). It introduces hybrid prioritization and selective request relegation to manage overload gracefully, increasing serving capacity by 32% and sharply reducing SLO violations. Next, I'll present ModServe, a modular serving framework for Large Multimodal Models (LMMs). ModServe decouples inference stages for independent optimization and employs modality-aware scheduling and autoscaling to handle bursty traffic efficiently. Evaluated on a 128-GPU cluster with production traces, ModServe achieves 3.3–5.5× higher throughput and up to 41.3% cost savings while meeting latency SLOs. Together, Niyama and ModServe showcase how systems-level innovations can unlock scalable, cost-effective, and QoS-compliant serving of advanced AI models across modalities.


Bio

Jayashree Mohan is a Senior Researcher at Microsoft Research Bangalore working primarily on building efficient AI infrastructure, with a current focus on LLM serving. Prior to joining MSR, Jayashree received a PhD in Computer Science from the University of Texas at Austin, where her dissertation focused on optimizing storage for ML workloads.