Redefining Caching for Generative AI

Abstract

Caching has long been a foundational principle in designing efficient systems. However, in the era of Generative AI, we are revisiting caching with a new perspective. Due to the inherent variability in generative models and the diverse outputs they produce, exact caching has significant limitations. To address these challenges, we propose novel approximate caching techniques to enhance the efficiency of text-to-image diffusion models and large language models (LLMs) in Generative AI workflows. In this talk, I will present two key works: The first work focuses on diffusion models for text-to-image generation. These models require several iterative denoising steps and are computationally intensive, relying on expensive GPUs and incurring considerable latency. In this research, we introduce a novel approximate-caching system, NIRVANA, which reduces the computational burden by reusing intermediate noise states generated during prior image creations. This approach yields significant savings in GPU compute and reduces generation latency. This work was published in USENIX NSDI 2024. The second work, Cache-Craft, extends the concept of approximate caching to LLMs, specifically in the context of Retrieval-Augmented Generation (RAG). Cache-Craft enables the reuse of precomputed key-value pairs (chunk-caches) for knowledge chunks, even when input text contexts differ across queries. By strategically recomputing to maintain output quality, Cache-Craft minimizes redundant computations and improves both time-to-first-token and system throughput. This work will be published in ACM SIGMOD 2025. References: [1] “Approximate Caching for Efficiently Serving {Text-to-Image} Diffusion Models” NSDI 2024 Authors: Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, Shiv Kumar Saini [2] “Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation” SIGMOD 2025. Authors: Shubham Agarwal∗, Sai Sundaresan1∗, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, Shiv Kumar Saini

Bio

Dr. Subrata Mitra, Adobe Research

Subrata Mitra is a Senior Research Scientist at Adobe Research, Bangalore. His current research lies at the intersection of computer systems and machine learning, with a focus on efficiency and scalability. Previously, his work primarily addressed improving the performance and reliability of cloud and distributed systems and improving scalability of Big-Data processing and Recommender Systems. His research contributions include approximately 40 papers published in top computer systems conferences e.g. USENIX NSDI, EuroSys, USENIX ATC, SenSys, SIGMOD, VLDB, PLDI, and top artificial intelligence conferences e.g. AAAI, ICML, NeurIPS, ACL, and ECCV. He also serves in the PC of several major conferences. He received his Ph.D. in Electrical and Computer Engineering from Purdue University, West Lafayette, MS in Computer Engineering from University of Florida, Gainesville and BE in Electronics and Telecommunication Engineering from Jadavpur University, Kolkata. Prior to Adobe Research, he had spent time as research interns at Microsoft Research – Redmond, AT&T Research – New Jersey and Lawrence Livermore National Labs - Livermore. Even prior to that he worked in Software Engineering roles at Intel, Santa Clara and Atrenta (now Synopsys) on new product development on Electronic Design Automation.

CNI Seminar Series