Finite time analysis of temporal difference learning with linear function approximation:- Tail averaging and regularisation

# 136


We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm, when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal O(1/t) rate, both in expectation and with high probability. Our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation, and show that this variant fares favourably in problems with ill-conditioned features.

L A Prashanth, IIT Madras

Prashanth L.A. is an Associate Professor in the Department of Computer Science and Engineering at Indian Institute of Technology Madras. Prior to this, he was a postdoctoral researcher at the Institute for Systems Research, University of Maryland - College Park from 2015 to 2017 and at INRIA Lille - Team SequeL from 2012 to 2014. From 2002 to 2009, he was with Texas Instruments (India) Pvt Ltd, Bangalore, India. He received his Masters and Ph.D degrees in Computer Science and Automation from Indian Institute of Science, in 2008 and 2013, respectively. He was awarded the third prize for his Ph.D. dissertation, by the IEEE Intelligent Transportation Systems Society (ITSS). He is a coauthor of two books: `Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods', published by Springer in 2013, and `Risk-Sensitive Reinforcement Learning via Policy Gradient Search', which appeared in the `Foundations and Trends in Machine Learning' series by NOW publishers in 2022. His research interests are in reinforcement learning, simulation optimization and multi-armed bandits, with applications in transportation systems, wireless networks and recommendation systems.