Towards a Theory of Exploration in Continuous-Time Reinforcement Learning

Abstract

Reinforcement learning in continuous time is a suitable learning model for agents interacting with a stochastic environment at ultra-high frequency. However, most of the existing literature on continuous-time RL focuses on semimartingale or Markovian environments, which do not accurately represent many real-world ultra-high-frequency applications. For example, the volatility of stock market returns is appropriately modeled by fractional Brownian motion (fBM), and controlled stochastic networks with heavy-tailed service times in heavy traffic are well approximated by reflected fBM processes. These are but two examples across a range of applications in science and engineering. This talk presents our recent results in developing a theoretical framework for modeling exploration processes in general continuous-time stochastic environments using rough path theory. Exploration is more delicate in continuous time, since randomizing the policy at each time step is no longer available as an exploration mechanism; instead, actions must be randomized continuously in time. Relaxed controls, in the sense of continuous curves of probability measures in Wasserstein space, provide a natural way to model such a randomized exploration process. We therefore propose and analyze a pathwise relaxed control framework for modeling exploration in continuous-time reinforcement learning in general stochastic environments. Specifically, we establish the existence and uniqueness of the value function as the solution of a rough Hamilton-Jacobi-Bellman equation. This is a first step towards a comprehensive theory of continuous-time reinforcement learning. As an immediate application of the analysis of the value function, we use it to characterize the optimal exploratory relaxed control for an entropy-regularized objective, which emulates maximum-entropy objectives in discrete-time reinforcement learning. This talk is based on joint work with Prakash Chakraborty (Penn State) and Samy Tindel (Purdue). This project is supported by the National Science Foundation through grant DMS-2153915.
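For orientation, the following is a minimal sketch of an entropy-regularized relaxed-control objective of the kind described above. The notation is illustrative rather than taken from the talk: the relaxed control π = (π_t) is a curve of probability measures on the action space A with density π_t(a), λ > 0 weights the exploration bonus, r is a running reward, and g a terminal reward.

```latex
% Illustrative sketch (assumed notation, not from the talk) of an
% entropy-regularized relaxed-control objective: rewards are averaged over
% the action distribution \pi_t, and differential entropy rewards exploration.
\[
  J(\pi) \;=\; \mathbb{E}\!\left[
    \int_0^T \left( \int_A r(t, X_t, a)\, \pi_t(\mathrm{d}a)
      \;+\; \lambda\, \mathcal{H}(\pi_t) \right) \mathrm{d}t
    \;+\; g(X_T) \right],
  \qquad
  \mathcal{H}(\pi_t) \;=\; -\int_A \pi_t(a)\, \log \pi_t(a)\, \mathrm{d}a .
\]
```

The differential-entropy term plays the role of the Shannon-entropy bonus in discrete-time maximum-entropy RL; as λ → 0 the objective reduces to a classical (un-relaxed) control problem.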

Prof. Harsha Honnappa, Purdue University

Harsha Honnappa is an Associate Professor in the School of Industrial Engineering at Purdue University and a J. Tinsley Oden Visiting Faculty Fellow at the Oden Institute at The University of Texas at Austin. His research interests as an applied probabilist encompass stochastic modeling, optimization, and control, with applications to machine learning, simulation, and statistical inference. His research is supported by the National Science Foundation, including an NSF CAREER award, and the Purdue Research Foundation. He is an editorial board member of the journals Operations Research, Operations Research Letters, and Queueing Systems.