
Information retrieval and large language models require huge amounts of memory to operate. This talk presents two ideas for reducing the memory requirements of these applications. In the first part, we offload memory-resident data to storage devices and employ direct GPU transfers. To reduce the resulting delays, we employ prefetching. In the second part, we look at the memory demands of batched AI inference. We employ context-specific sparsity to reduce memory and computation demands, and thereby AI inference costs. These projects highlight how co-design of systems and algorithms can improve the efficiency of emerging AI workloads.
Narasimha Reddy is a professor and currently also serves as Head of the Electrical and Computer Engineering Department at Texas A&M University. His research interests are in storage systems, computer architecture, and network security. He previously served as Associate Dean for Research at the College of Engineering at Texas A&M University. He obtained his Ph.D. from the University of Illinois at Urbana-Champaign and his B.Tech. from the Indian Institute of Technology.