Controlled Spatial Sharing of GPUs for Scalable and Efficient Inference at Edge

Edge clouds are deployed to provide highly responsive services to end users. In particular, the growing demand for cloud-based inference services requires the use of Graphics Processing Units (GPUs). However, edge resources such as CPUs and GPUs are limited and must be shared across multiple concurrently running clients, so it is highly desirable to multiplex different inference tasks on the GPU and use it efficiently. Batched processing, CUDA streams, and the Multi-Process Service (MPS) help, but we find that they are not adequate for achieving scalability and do not guarantee predictable performance. Further, edge servers must process considerable amounts of streaming data and account for load variations. In this talk, I will present GSLICE, a GPU virtualization platform. GSLICE incorporates a dynamic GPU resource allocation and management framework to maximize performance and resource utilization. We virtualize the GPU by apportioning its resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop controlled spatial sharing of the GPU, along with self-learning and adaptive GPU resource allocation and batching schemes that account for network traffic characteristics while keeping inference latencies below service level objectives. GSLICE adapts quickly to the workload intensity of the streaming data and to the variability of GPU processing costs. It scales IF processing on the GPU through efficient and controlled spatial multiplexing, coupled with a GPU resource reallocation scheme that incurs minimal (< 100 microseconds) downtime. Compared to TensorRT and default MPS, GSLICE significantly improves GPU utilization efficiency and achieves a 2-13x improvement in aggregate throughput.
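
To make the idea of apportioning GPU resources and SLO-aware batching concrete, the following is a minimal Python sketch. It is not GSLICE's actual algorithm, which this abstract does not specify. The `InferenceFunction` fields, the proportional-to-load split, and the linear cost model (a fixed per-batch overhead plus a per-request cost, slowed down in inverse proportion to the GPU share) are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class InferenceFunction:
    """Hypothetical description of one IF; every field is an illustrative assumption."""
    name: str
    arrival_rate: float  # requests per second
    per_item_ms: float   # assumed per-request GPU cost with 100% of the GPU
    fixed_ms: float      # assumed fixed per-batch overhead with 100% of the GPU
    slo_ms: float        # latency objective in milliseconds


def apportion_gpu(ifs, total_pct=100.0):
    """Split the GPU percentage in proportion to each IF's offered load (rate x cost)."""
    demand = {f.name: f.arrival_rate * f.per_item_ms for f in ifs}
    total = sum(demand.values())
    return {name: total_pct * d / total for name, d in demand.items()}


def max_batch_within_slo(f, gpu_pct, max_batch=64):
    """Return the largest batch size whose estimated latency fits within the SLO.

    Estimated latency = time to accumulate the batch + processing time, where
    processing time is assumed to scale inversely with the GPU share.
    Falls back to a batch size of 1 if no batch size fits.
    """
    share = gpu_pct / 100.0
    best = 1
    for b in range(1, max_batch + 1):
        fill_ms = 1000.0 * (b - 1) / f.arrival_rate
        proc_ms = (f.fixed_ms + b * f.per_item_ms) / share
        if fill_ms + proc_ms <= f.slo_ms:
            best = b
    return best


if __name__ == "__main__":
    ifs = [
        InferenceFunction("A", arrival_rate=100.0, per_item_ms=2.0, fixed_ms=1.0, slo_ms=50.0),
        InferenceFunction("B", arrival_rate=50.0, per_item_ms=4.0, fixed_ms=1.0, slo_ms=80.0),
    ]
    shares = apportion_gpu(ifs)
    for f in ifs:
        print(f.name, shares[f.name], max_batch_within_slo(f, shares[f.name]))
```

With MPS, one way such a per-process share could be enforced is the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable, set before the client process starts. Whether GSLICE uses that mechanism is not stated here.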

Sameer G Kulkarni is an Assistant Professor in Computer Sciences at the Indian Institute of Technology, Gandhinagar. Prior to this, he worked as a postdoctoral researcher at the University of California, Riverside. He received a Ph.D. degree in Computer Science from the University of Göttingen, Germany, in July 2018. He received his M.S. degree in Computer Engineering from the University of Southern California in 2010 and his B.E. degree in Computer Science and Engineering from the National Institute of Engineering, Mysore, in 2004. He is the recipient of the IEEE TCSC Best PhD Dissertation Award 2019. His work focuses on the resource management aspects of building efficient, scalable, and resilient NFV/Edge platforms. His research interests include Software Defined Networking, Network Function Virtualization, Edge Cloud Platforms, Distributed Systems, and Disaster Management.

Sameer Kulkarni, IIT Gandhinagar