Estimating Variance of Returns using Temporal Difference Methods

  • Author / Creator
    Bennett, Brendan
  • Temporal difference (TD) methods provide a powerful means of learning to make predictions in an online, model-free, and highly scalable manner. In the reinforcement learning (RL) framework, we formalize these prediction targets in terms of a (possibly discounted) sum of rewards, called the return. Historically, RL methods have mainly focused on learning to estimate the expected return, or the value, but there has been some indication that using TD methods to make more general predictions would be desirable. In this thesis, we describe an approach to making such predictions, with emphasis on estimating the variance of the return. Equipped with an estimate of the variance, a learning agent can gauge not just the mean outcome in a given situation, but also the degree to which an individual return will tend to deviate from the average. Such knowledge could be applied towards expressing more sophisticated predictions, decision making under uncertainty, or hyperparameter optimization, among other things. Previous work has shown that it is possible to construct an approximate Bellman equation for higher moments of the return using estimates of the preceding moments, which can then be used in a TD-style algorithm to learn those moments. This approach builds on the raw moments of the return, which tend to make for poor approximation targets due to the outsize effect that noise and other sources of error have on them. In contrast, the central moments generally make for more robust approximation targets. Learning to estimate the return's second central moment, i.e., the variance, would be useful on its own and as a prelude to future algorithms. However, defining a suitable prediction target for the return's variance is not straightforward. The expected return is easily expressed as a Bellman equation; variance, as a nonlinear function of the return, is hard to formulate in similar terms.
Establishing convergence for nonlinear algorithms is more difficult as well. Our main contributions concern an algorithm that attempts to navigate these issues: Direct Variance Temporal Difference Learning (DVTD). It consists of two components: the first learns the value function, while the second learns to predict the discounted sum of squared TD errors emitted by the value learner. This δ²-return is equivalent to the variance of the original return when the value function is unbiased. We provide an analysis demonstrating this equivalence, which also illuminates the relationship between the δ²-return and various alternative moment-based targets. For the more typical case where the true value function is unavailable, we provide an interpretation for what DVTD is estimating, and show that it converges to a unique fixed point under linear function approximation. We also describe how adjusting hyperparameters can yield new approximation targets, allowing us to estimate the variance of the λ-return. Finally, we report on some experiments indicating DVTD's superior performance relative to alternative methods, which also serve to validate our claims regarding DVTD's stability and practical usability.
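
The two-component structure described above can be illustrated with a minimal tabular sketch. This is not the thesis's implementation: the two-state chain, the function name `dvtd_chain`, the step sizes, and the use of one-step TD(0) without eligibility traces or function approximation are all assumptions made for illustration. Discounting the second learner by `gamma ** 2` is likewise an assumption here; it follows from writing the variance of the return recursively, since the next state's return is scaled by `gamma` before its variance is taken.

```python
import random


def dvtd_chain(num_episodes=20000, gamma=0.9, alpha=0.01, alpha_var=0.01, seed=0):
    """Tabular sketch of the two-learner scheme on a toy two-state chain.

    Hypothetical problem: state 0 -> state 1 -> terminal, with a reward of
    +1 or -1 (equal probability) on every transition.
    """
    rng = random.Random(seed)
    terminal = 2
    v = [0.0, 0.0, 0.0]  # value estimates; the terminal entry stays at 0
    u = [0.0, 0.0, 0.0]  # variance estimates; the terminal entry stays at 0

    for _ in range(num_episodes):
        s = 0
        while s != terminal:
            s_next = s + 1
            r = rng.choice((-1.0, 1.0))

            # First learner: ordinary TD(0) update of the value estimate.
            delta = r + gamma * v[s_next] - v[s]
            v[s] += alpha * delta

            # Second learner: TD(0) whose "reward" is the squared TD error
            # emitted by the first learner. The gamma**2 discount is an
            # assumption of this sketch (see the lead-in).
            delta_var = delta ** 2 + gamma ** 2 * u[s_next] - u[s]
            u[s] += alpha_var * delta_var

            s = s_next
    return v, u


if __name__ == "__main__":
    values, variances = dvtd_chain()
    print("value estimates:   ", values[:2])
    print("variance estimates:", variances[:2])
```

In this toy chain the true values are 0 in both states, and the true variances of the return are 1 for state 1 and 1 + 0.9² = 1.81 for state 0, so the learned estimates should settle near those numbers. With a constant step size they keep fluctuating slightly, and the variance estimates carry a small upward bias because the value estimates are never exactly correct.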

  • Subjects / Keywords
  • Graduation date
    Spring 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.