BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

CNI Seminar Series

Prof. Vincent Y. F. Tan, Professor, National University of Singapore.

#255

Abstract

Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
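To make the bandit formulation concrete, the following is a minimal sketch of the textbook UCB1 rule choosing among a few speculative decoding configurations. It is not the UCBSpec algorithm from the talk, whose details are not given in this abstract. The names `ucb1_select` and `run_bandit`, the per-configuration acceptance probabilities, and the reward (the fraction of draft tokens accepted before the first rejection) are all assumptions made for illustration, and the verifier model is replaced by a simple random simulation.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Return the arm with the largest UCB1 index (hypothetical helper)."""
    # Play every arm once before relying on confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]),
    )


def run_bandit(accept_probs, draft_len=4, rounds=2000, seed=0):
    """Simulate configuration selection over `rounds` decoding steps.

    accept_probs[i] is an assumed per-token acceptance probability for
    configuration i; a real system would observe acceptances from the
    target model instead of sampling them.
    """
    rng = random.Random(seed)
    k = len(accept_probs)
    counts = [0] * k    # number of times each configuration was used
    means = [0.0] * k   # running mean reward of each configuration
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        # Simulated verification: count draft tokens accepted before
        # the first rejection.
        accepted = 0
        for _ in range(draft_len):
            if rng.random() < accept_probs[arm]:
                accepted += 1
            else:
                break
        reward = accepted / draft_len  # normalized to [0, 1]
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    counts, means = run_bandit([0.3, 0.6, 0.9])
    print("pulls per configuration:", counts)
    print("estimated mean rewards:", [round(m, 3) for m in means])
```

In this simulation the configuration with the highest assumed acceptance probability should end up being selected most often, which is the behavior a low-regret selection rule is meant to deliver without any training of the draft model.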

Bio

Vincent Y. F. Tan received the B.A. and M.Eng. degrees in electrical and information science from Cambridge University in 2005, and the Ph.D. degree in electrical engineering and computer science (EECS) from the Massachusetts Institute of Technology (MIT) in 2011. He is currently a Professor with the Department of Mathematics and the Department of Electrical and Computer Engineering (ECE), National University of Singapore (NUS). His research interests include information theory, machine learning, and statistical signal processing. Dr. Tan is an elected member of the IEEE Information Theory Society Board of Governors. He is currently serving as a Senior Area Editor for the IEEE Transactions on Signal Processing and as an Area Editor in Shannon Theory and Information Measures for the IEEE Transactions on Information Theory. He also regularly serves as an Area Chair or Senior Area Chair of prominent machine learning conferences such as the International Conference on Learning Representations (ICLR) and the Conference on Neural Information Processing Systems (NeurIPS).