Almost-Optimal Best Restless Markov Arm Identification with Fixed Confidence
Abstract
In this talk, I will describe some recent results on the problem of identifying the best arm in a restless multi-armed bandit with fixed confidence. Formally, the setting comprises a multi-armed bandit with finitely many arms, in which each arm is a homogeneous, discrete-time Markov chain taking values in a common, finite state space, and the arms are restless. The state transitions of each arm are governed by the arm's ergodic transition probability matrix (TPM), assumed to be parameterised by an unknown, real-valued parameter. Given a reward function defined on the common state space of the arms, the goal is to find the best arm, namely the arm with the largest mean reward computed under its stationary distribution, by minimising the expected stopping time subject to an upper bound on the error probability (the fixed-confidence regime). For this problem, our results take the form of (a) an asymptotic lower bound on the growth rate of the expected stopping time, where the asymptotics is as the error probability vanishes, and (b) a policy for best arm identification whose expected stopping time grows at a rate that matches the lower bound and is hence asymptotically optimal. We use results from best policy identification in Markov decision processes (MDPs) to establish asymptotic optimality. Prior works deal with independent observations from the arms, rested Markov arms, and restless Markov arms with known arm TPMs; ours is the first to study best arm identification in restless bandits with unknown arm TPMs. This is joint work with Vincent Tan (NUS), Ali Tajer (RPI), and Arpan Mukherjee (RPI).
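To make the target quantity concrete: writing pi_k for the stationary distribution of arm k's TPM and r for the reward function on the common state space, the best arm is the one maximising the stationary mean reward pi_k . r, and lower bounds of type (a) typically take the fixed-confidence form lim inf_{delta -> 0} E[tau_delta] / log(1/delta) >= c(nu) for a problem-dependent constant c(nu) (the exact constant is characterised in the talk). The following minimal Python sketch, using hypothetical two-state TPMs P1 and P2 and a hypothetical reward vector, computes these stationary mean rewards when the TPMs are known; it illustrates only the oracle computation that defines the best arm, not the identification policy, which must find this arm from restless observations without knowing the TPMs.

```python
import numpy as np

def stationary_distribution(P: np.ndarray) -> np.ndarray:
    """Stationary distribution pi of an ergodic row-stochastic TPM P, i.e. pi @ P = pi."""
    n = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 with the normalisation
    # sum(pi) = 1, and solve the overdetermined system by least squares.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def best_arm(tpms, reward):
    """Return (index of the best arm, stationary mean rewards of all arms)."""
    means = [stationary_distribution(P) @ reward for P in tpms]
    return int(np.argmax(means)), means

# Hypothetical two-state arms; state 1 pays reward 1, state 0 pays reward 0.
P1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])  # stationary distribution (2/3, 1/3)
P2 = np.array([[0.3, 0.7],
               [0.4, 0.6]])  # stationary distribution (4/11, 7/11)
reward = np.array([0.0, 1.0])

idx, means = best_arm([P1, P2], reward)
print(idx, means)  # -> 1, [0.333..., 0.636...]
```

In this toy instance, arm 2 (index 1) spends most of its time in the high-reward state, so its stationary mean reward (7/11, about 0.64) exceeds arm 1's (1/3, about 0.33), making it the best arm.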
Dr. P. N. Karthik, NUS, Singapore
Dr. Karthik is a Research Fellow in the Institute of Data Science at the National University of Singapore (NUS), where he works with Prof. Vincent Tan. Prior to joining NUS, he obtained a dual Ph.D. and Master of Science (Engineering) degree from the Department of Electrical Communication Engineering at the Indian Institute of Science (IISc), Bengaluru, with Prof. Rajesh Sundaresan as his thesis supervisor. Before joining the dual degree programme, he worked as a Project Assistant in Prof. Chandra R. Murthy's lab at IISc. He graduated with a Bachelor of Engineering degree in Electronics and Communication from R V College of Engineering. His recent research interests include multi-armed bandits, Markov decision processes, distributed learning, and transfer learning, among other topics.