In the regret-based formulation of multi-armed Bandit (MAB) problems, except in rare instances, much of the literature focuses on arms with independent and identically distributed (i.i.d.) rewards. In this talk, we will consider the problem of obtaining regret guarantees for MAB problems, in which the rewards of each arm form a Markov chain that may not belong to a single parameter exponential family. To achieve a logarithmic regret in such problems is not difficult: a variation of standard Kullback–Leibler upper confidence bound (KL-UCB) does the job. However, the constants obtained from such an analysis are poor for the following reason: i.i.d. rewards are a special case of Markov rewards and it is difficult to design an algorithm that works well independent of whether the underlying model is truly Markovian or i.i.d. To overcome this issue, we will introduce a novel algorithm that identifies whether the rewards from each arm are truly Markovian or i.i.d. using a total variation distance-based test. Our algorithm then switches from using a standard KL-UCB to a specialized version of KL-UCB when it determines that the arm reward is Markovian, thus resulting in low regrets for both i.i.d. and Markovian settings.
Arghyadip Roy received his Ph.D. degree in electrical engineering from IIT Bombay, India, in 2019. Before that, he received the B.E. and M.Tech degrees from Jadavpur University and IIT Kharagpur, respectively. He worked as a postdoctoral researcher at the University of Illinois at Urbana-Champaign, USA from 2019 to 2021. He is currently working as an assistant professor in the Mehta Family School of Data Science and Artificial Intelligence at the Indian Institute of Technology, Guwahati, India. His research interests include resource allocation in wireless networks, optimization and control of stochastic systems, and reinforcement learning.