Interrelating Prediction and Control Objectives in Episodic Actor-Critic Reinforcement Learning

  • Author / Creator
    Chockalingam, Valliappa
  • Abstract
    The reinforcement learning framework provides a simple way to study computational intelligence as the interaction between an agent and an environment. The goal of an agent is to accrue as much reward as possible by intelligently choosing actions given states. This problem of finding a policy that maximizes the expected total reward is called the control problem. Many algorithms that solve the control problem also solve a related problem called the prediction problem. The goal in prediction is to accurately estimate the expected total reward from different states while following a given policy. Both the prediction and control problems can be formulated as optimization problems with different objectives. In this thesis, we present the first attempt at interrelating the prediction and control objectives. Prediction has often been studied independently in the past, but it is a subproblem subservient to the primary control problem that we generally care about. Interrelating the objectives is particularly important when the number of states is large and agents cannot learn about all states equally well: agents are forced to use function approximation, and we must specify how to allocate function approximation resources. The Emphatic Temporal Difference algorithm is the first prediction algorithm that allows a specification of how much we care about value function accuracy at different states. This specification is made through an interest function that changes the prediction objective, giving us a way to communicate to the agent how much we care about different states beyond how often they occur. In prediction, however, interest is strictly a problem-related concept that defines the objective, and there have been no clear insights into how to set it. We take this opportunity to use interest as a solution-related concept in control and to ground the choice of prediction objective in improving control performance.
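    The interest-weighted prediction update can be illustrated with a minimal sketch of Emphatic TD(0) on a small random-walk chain. This is not code from the thesis: the environment, the tabular features, and the hand-picked interest vector are illustrative assumptions; only the follow-on-trace update (F accumulating discounted interest) and the emphasis-scaled TD step follow the published Emphatic TD form.

    ```python
    import numpy as np

    def emphatic_td0(n_states=5, interest=None, gamma=0.9, alpha=0.1,
                     episodes=500, seed=0):
        """Sketch of Emphatic TD(0) prediction on a random-walk chain.

        interest[s] says how much we care about value accuracy at state s,
        beyond how often s is visited. Uniform interest recovers plain TD(0).
        """
        rng = np.random.default_rng(seed)
        if interest is None:
            interest = np.ones(n_states)      # uniform interest = plain TD(0)
        w = np.zeros(n_states)                # tabular value estimates
        for _ in range(episodes):
            s = n_states // 2                 # start in the middle of the chain
            F = 0.0                           # follow-on trace
            while True:
                F = gamma * F + interest[s]   # accumulate discounted interest
                s_next = s + rng.choice([-1, 1])   # uniform random-walk policy
                done = s_next < 0 or s_next >= n_states
                r = 1.0 if s_next >= n_states else 0.0   # +1 only on the right
                target = r if done else r + gamma * w[s_next]
                delta = target - w[s]         # TD error
                w[s] += alpha * F * delta     # emphasis-weighted update (lambda = 0)
                if done:
                    break
                s = s_next
        return w
    ```

    Running `emphatic_td0()` yields estimates that grow toward the rewarding right end of the chain; passing a non-uniform `interest` redistributes approximation accuracy toward the states it emphasizes.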
    As a running example, we use the episodic discounted control problem, where distal rewards are worth exponentially less and interaction terminates upon reaching a terminal state. In this setting, we study the actor-critic class of solution methods, which neatly separates the two problems: the actor solves the control problem while the critic solves the prediction subproblem. First, we discuss a recent controversy about discounting and the role it should play in the learning updates of the actor. After concluding how the actor should be updated, we move on to the critic. By analyzing the updates made by the actor, we find the first suggestion of an interest over states for the prediction problem, and hence present a choice of prediction objective motivated by a control objective. Experimental results confirm that control performance is indeed improved when the critic uses the new prediction objective.
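    One concrete way such an alignment can arise (a sketch, not necessarily the thesis's exact construction): the discount-aware episodic policy gradient weights the actor's update at time t by a gamma^t factor, and if interest is placed only on the episode's start state, the emphatic follow-on trace reproduces exactly that weighting. The choice of lambda = 0 (so the emphasis equals the follow-on trace) is an assumption made for simplicity here.

    ```python
    # With interest 1 at t = 0 and 0 afterwards, the follow-on trace
    # F_t = gamma * F_{t-1} + i_t collapses to gamma^t, matching the
    # discount factor on the actor's update at time t.
    gamma = 0.9
    F = 0.0
    trace = []
    for t in range(10):
        i_t = 1.0 if t == 0 else 0.0   # interest only at the start state
        F = gamma * F + i_t            # follow-on trace F_t
        trace.append(F)
        assert abs(F - gamma ** t) < 1e-12   # equals the actor's gamma^t factor
    ```

    Under this reading, choosing the critic's interest from the actor's update rule makes the prediction objective weight states exactly as the control update does.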

  • Subjects / Keywords
  • Graduation date
    Fall 2020
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.