
10.1 The simplest actor-critic algorithm (QAC)

This section introduces the simplest actor-critic algorithm. This algorithm can be easily obtained by extending the policy gradient algorithm in (9.32).

Recall that the idea of the policy gradient method is to search for an optimal policy by maximizing a scalar metric $J(\theta)$. The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta J(\theta_t) \\
&= \theta_t + \alpha \, \mathbb{E}_{S \sim \eta,\, A \sim \pi} \big[ \nabla_\theta \ln \pi(A \mid S, \theta_t) \, q_\pi(S, A) \big], 
\end{aligned}
\tag{10.1}
$$

where $\eta$ is a distribution of the states (see Theorem 9.1 for more information). Since the true gradient is unknown, we can use a stochastic gradient to approximate it:

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) \, q_t(s_t, a_t). \tag{10.2}
$$

This is the algorithm given in (9.32).
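
To make the stochastic update (10.2) concrete, the following is a minimal sketch of one gradient-ascent step, assuming a tabular softmax parameterization $\pi(a|s,\theta) \propto \exp(\theta[s,a])$; the dimensions, learning rate, and the value of $q_t$ are placeholders chosen for illustration, not details from the text.

```python
import numpy as np

# Sketch of the stochastic policy-gradient step (10.2) for a tabular softmax
# policy pi(a|s, theta) = exp(theta[s, a]) / sum_b exp(theta[s, b]).
n_states, n_actions, alpha = 4, 3, 0.1      # assumed sizes and step size
theta = np.zeros((n_states, n_actions))

def grad_ln_pi(theta, s, a):
    """Gradient of ln pi(a|s, theta) w.r.t. theta (nonzero only in row s)."""
    logits = theta[s] - theta[s].max()
    pi_s = np.exp(logits)
    pi_s /= pi_s.sum()
    g = np.zeros_like(theta)
    g[s] = -pi_s            # d/d theta[s, b] of ln pi(a|s) = 1{b=a} - pi(b|s)
    g[s, a] += 1.0
    return g

# One stochastic gradient-ascent step along (10.2), with a placeholder q_t value.
s_t, a_t, q_t = 2, 1, 0.5   # sampled state, sampled action, estimate of q_pi(s_t, a_t)
theta = theta + alpha * grad_ln_pi(theta, s_t, a_t) * q_t
```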

Equation (10.2) is important because it clearly shows how policy-based and value-based methods can be combined. On the one hand, it is a policy-based algorithm since it directly updates the policy parameter. On the other hand, it requires knowing $q_t(s_t, a_t)$, which is an estimate of the action value $q_\pi(s_t, a_t)$. As a result, another value-based algorithm is required to generate $q_t(s_t, a_t)$. So far, we have studied two ways to estimate action values in this book: the first is based on Monte Carlo learning, and the second is based on temporal-difference (TD) learning.

$\diamond$ If $q_t(s_t, a_t)$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient, which has already been introduced in Chapter 9.
$\diamond$ If $q_t(s_t, a_t)$ is estimated by TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be obtained by incorporating TD-based value estimation into policy gradient methods (see the sketch after this list).
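
As a concrete illustration of the two options, the snippet below sketches how $q_t(s_t, a_t)$ might be formed in each case; the discount factor and the tabular value table are assumptions made for this sketch, not details from the text.

```python
import numpy as np

gamma = 0.9  # assumed discount factor for this sketch

# (1) Monte Carlo (REINFORCE): q_t(s_t, a_t) is the discounted return of the
#     episode remainder, computed from sampled rewards r_{t+1}, r_{t+2}, ...
def mc_estimate(rewards_from_t):
    return sum(gamma**k * r for k, r in enumerate(rewards_from_t))

# (2) TD (actor-critic): q_t(s_t, a_t) is read from a value approximator
#     q(s, a, w) that is itself updated incrementally by a TD rule such as Sarsa.
q_table = np.zeros((4, 3))            # a tabular stand-in for q(s, a, w)
def td_estimate(s_t, a_t):
    return q_table[s_t, a_t]
```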

The procedure of the simplest actor-critic algorithm is summarized in Algorithm 10.1. The critic corresponds to the value update step via the Sarsa algorithm presented in (8.35). The action values are represented by a parameterized function $q(s, a, w)$. The actor corresponds to the policy update step in (10.2). This actor-critic algorithm is sometimes called Q actor-critic (QAC). Although it is simple, QAC reveals the core idea of actor-critic methods. It can be extended to generate many more advanced algorithms, as shown in the rest of this chapter.

Algorithm 10.1: The simplest actor-critic algorithm (QAC)

Initialization: A policy function $\pi(a|s, \theta_0)$, where $\theta_0$ is the initial parameter. A value function $q(s, a, w_0)$, where $w_0$ is the initial parameter. Learning rates $\alpha_w, \alpha_\theta > 0$.

Goal: Learn an optimal policy to maximize $J(\theta)$.

At time step $t$ in each episode, do

Generate $a_t$ following $\pi(a|s_t, \theta_t)$, observe $r_{t+1}, s_{t+1}$, and then generate $a_{t+1}$ following $\pi(a|s_{t+1}, \theta_t)$.

Actor (policy update):

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) \, q(s_t, a_t, w_t)
$$

Critic (value update):

$$
w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t) \big] \nabla_w q(s_t, a_t, w_t)
$$
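
To show how the actor and critic updates interlock in practice, here is a minimal sketch of the QAC loop, assuming a tabular softmax policy, a tabular value function (so that $\nabla_w q(s, a, w)$ reduces to an indicator), and a toy placeholder environment; none of these specifics come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
gamma, alpha_theta, alpha_w = 0.9, 0.05, 0.1
theta = np.zeros((n_states, n_actions))   # policy parameter (softmax logits)
w = np.zeros((n_states, n_actions))       # value parameter: q(s, a, w) = w[s, a]

def pi(s, theta):
    """Softmax policy pi(.|s, theta) over the actions of state s."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def step_env(s, a):
    """Placeholder environment: random next state, reward 1 for action 0."""
    return int(rng.integers(n_states)), 1.0 if a == 0 else 0.0

s = int(rng.integers(n_states))
a = rng.choice(n_actions, p=pi(s, theta))
for t in range(1000):
    s_next, r = step_env(s, a)
    a_next = rng.choice(n_actions, p=pi(s_next, theta))  # a_{t+1} ~ pi(.|s_{t+1}, theta_t)

    # Actor (policy update): one stochastic gradient-ascent step along (10.2),
    # using q(s_t, a_t, w_t) supplied by the critic.
    grad_ln_pi = np.zeros_like(theta)
    grad_ln_pi[s] = -pi(s, theta)
    grad_ln_pi[s, a] += 1.0
    theta = theta + alpha_theta * grad_ln_pi * w[s, a]

    # Critic (value update): Sarsa; for a table, grad_w q(s, a, w) is an indicator,
    # so only the visited entry w[s, a] changes.
    td_error = r + gamma * w[s_next, a_next] - w[s, a]
    w[s, a] += alpha_w * td_error

    s, a = s_next, a_next
```

The sketch follows the order of Algorithm 10.1: $a_{t+1}$ is sampled from the current policy $\pi(a|s_{t+1}, \theta_t)$, the actor then updates $\theta$ using the critic's current estimate $q(s_t, a_t, w_t)$, and the critic finally updates $w$ with the Sarsa TD error.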