In real-world decision-making problems one must pick, from multiple policies, the one that performs best while respecting economic constraints. This motivates the problem of constrained best-arm identification for bandit problems in which every arm is a joint distribution over reward and cost. We investigate the general case where reward and cost are dependent. The goal is to accurately identify the arm with the highest mean reward among all arms whose mean cost is below a given threshold. We prove information-theoretic lower bounds on the sample complexity for three models: Gaussian with fixed covariance, Gaussian with unknown covariance, and non-parametric distributions with rectangular support. We propose a combination of a sampling rule and a stopping rule that correctly identifies the constrained best arm and matches the optimal sample complexity for each of the three models. Our simulations demonstrate the empirical performance of our algorithms.
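To make the objective concrete, here is a minimal sketch of the constrained best-arm criterion. The arm means, covariance, threshold, and the uniform-sampling estimator are all illustrative assumptions, not the paper's algorithm or its optimal sampling and stopping rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arms: each is a correlated Gaussian over (reward, cost).
# The means, covariance, and threshold below are made up for illustration.
means = np.array([[1.0, 0.5],   # arm 0: feasible, highest feasible reward
                  [1.5, 1.2],   # arm 1: highest reward, but cost exceeds threshold
                  [0.8, 0.3]])  # arm 2: feasible, lower reward
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])    # shared reward/cost covariance (dependent case)
threshold = 1.0

def constrained_best_arm(mean_estimates, tau):
    """Index of the arm with the highest mean reward among arms
    whose mean cost is below the threshold tau."""
    feasible = mean_estimates[:, 1] < tau
    rewards = np.where(feasible, mean_estimates[:, 0], -np.inf)
    return int(np.argmax(rewards))

# Naive uniform-sampling estimates of the per-arm means; a real
# identification algorithm would allocate samples adaptively and
# stop once the answer is certain at the desired confidence level.
n = 5000
estimates = np.array([
    rng.multivariate_normal(m, cov, size=n).mean(axis=0) for m in means
])
print(constrained_best_arm(estimates, threshold))
```

On the true means above, arm 1 has the highest reward but is infeasible, so the constrained best arm is arm 0; with enough samples the plug-in estimate recovers it.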
Koolen is a senior researcher in the Machine Learning group at the Centrum Wiskunde & Informatica in the Netherlands. He is also a professor of Mathematical Machine Learning in the Statistics group at the University of Twente. Koolen earned his PhD in computer science, cum laude, from the University of Amsterdam in 2011. He was a postdoc at Royal Holloway and at QUT Brisbane. He currently works on topics in machine learning theory and related areas, including game theory, information theory, statistics, and optimisation. His interests include pure exploration in multi-armed bandit models, game tree search, provably accelerated learning in statistical and individual-sequence settings, and learning faster from easy data.