Continuous Time Bandits: A New Learning Model

Abstract

We consider a continuous-time multi-armed bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtains a random reward from each sample; however, increasing the sampling frequency incurs an additive penalty/cost. Thus, as a function of the sampling frequency, there is a trade-off between obtaining a large reward and incurring a large sampling cost. The goal is to design a learning algorithm that minimizes the regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm. CTMAB is fundamentally different from the usual multi-armed bandit problem (MAB); for example, even the single-arm case is non-trivial in CTMAB, since the optimal sampling frequency depends on the mean reward of the arm, which needs to be estimated. We establish lower and upper bounds on the regret that are tight up to logarithmic terms. Joint work with Manjesh Hanawal, IIT Bombay.
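To make the single-arm difficulty concrete, here is a minimal simulation sketch. It assumes, purely for illustration, a payoff rate of the form mu*k - c*k^2 (reward linear in the sampling frequency with a quadratic sampling cost), so the oracle frequency k* = mu/(2c) depends on the unknown mean mu. The constants, the Bernoulli rewards, and the plug-in frequency rule below are hypothetical choices for this sketch, not the model or algorithm from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative single-arm CTMAB model (an assumption for this sketch,
# not necessarily the paper's formulation): sampling at integer
# frequency k in a unit-length slot yields k i.i.d. Bernoulli(mu)
# rewards and incurs an assumed quadratic cost c * k**2, so the
# expected payoff rate
#     f(k) = mu * k - c * k**2
# is concave with (continuous) maximizer k_star = mu / (2 * c).

mu_true = 0.8    # unknown arm mean (hypothetical value)
c = 0.01         # known sampling-cost coefficient (hypothetical value)
horizon = 500    # number of unit-length slots

k_star = mu_true / (2 * c)                         # oracle frequency
oracle_rate = mu_true * k_star - c * k_star ** 2   # oracle payoff/slot

payoff, reward_sum, n_samples = 0.0, 0.0, 0

for t in range(horizon):
    # Plug-in rule: act as if the empirical mean were the true mean.
    # Optimistic initialization (mu_hat = 1) forces early exploration.
    mu_hat = reward_sum / n_samples if n_samples else 1.0
    k = max(int(round(mu_hat / (2 * c))), 1)

    rewards = rng.binomial(1, mu_true, size=k)
    reward_sum += rewards.sum()
    n_samples += k
    payoff += rewards.sum() - c * k ** 2

regret = horizon * oracle_rate - payoff
print(f"empirical mean ~ {reward_sum / n_samples:.3f} (true {mu_true})")
print(f"regret after {horizon} slots: {regret:.1f}")
```

In this toy model the learner's frequency approaches the oracle's once the empirical mean settles, which illustrates why estimating the mean is unavoidable even with a single arm: sampling at the wrong frequency loses payoff on every slot, not just on exploratory pulls.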

Rahul Vaze, TIFR Mumbai

Rahul Vaze obtained his Ph.D. from The University of Texas at Austin in 2009. Currently he is an Associate Professor at the School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India. His research interests are in communication networks, combinatorial resource allocation, and online algorithms. He is the author of "Random Wireless Networks" (Cambridge University Press, 2015). He is a co-recipient of the 2010 best paper award of the EURASIP Journal on Wireless Communications and Networking, the best paper award at WiOpt 2020, and the best paper award at Performance 2020.