Social Learning in Multi-Agent Multi-Armed Bandits

Consider a large number of agents, N, faced with the problem of choosing amongst a large number of options, K. The problem occurs repeatedly, and every time an agent chooses an option, it receives a random reward or payoff whose distribution depends on the option but not on the agent. The goal is to maximize the long-run payoff. The problem involves a trade-off between exploitation - choosing the option currently believed to be the best - and exploration - choosing possibly sub-optimal options in order to gain more information about their payoffs. The challenge is to optimize this trade-off. If there were a single agent, then this is an instance of the multi-armed bandit problem with K arms., which has been studied extensively for decades. If no communication is allowed between agents, then it is N parallel instances of the multi-armed bandit problem. If there are no communication constraints, then the agents act in aggregate as if they were a single agent. We are interested in the intermediate case where limited communication is allowed. We show that, even with limited communication, in the long run the system behaves in aggregate as if there were a single agent, i.e., as if there were no communication constraints. Joint work with Abhishek Sankararaman, Ronshee Chawla, Sanjay Shakkottai, ConorNewton and Henry Reeve.

Ayalvadi Ganesh is an Associate Professor at the School of Mathematics, University of Bristol. His research interests include large deviations, queueing theory, random graph dynamics, and decentralised algorithms. He won the INFORMS Best Publication Award in 2005 and the ACM Sigmetrics Best Paper Prize in 2010.

Abstract

Ayalvadi Ganesh, University of Bristol