
10.1 The simplest actor-critic algorithm (QAC)

This section introduces the simplest actor-critic algorithm. This algorithm can be easily obtained by extending the policy gradient algorithm in (9.32).

Recall that the idea of the policy gradient method is to search for an optimal policy by maximizing a scalar metric $J(\theta)$. The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta J(\theta_t) \\
&= \theta_t + \alpha \, \mathbb{E}_{S \sim \eta,\, A \sim \pi} \big[ \nabla_\theta \ln \pi(A \mid S, \theta_t) \, q_\pi(S, A) \big], 
\end{aligned}
\tag{10.1}
$$

where $\eta$ is a distribution of the states (see Theorem 9.1 for more information). Since the true gradient is unknown, we can use a stochastic gradient to approximate it:

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) \, q_t(s_t, a_t). \tag{10.2}
$$

This is the algorithm given in (9.32).
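
To make the stochastic update (10.2) concrete, the following is a minimal sketch of one gradient-ascent step, assuming a tabular softmax parameterization $\pi(a|s,\theta) \propto \exp(\theta[s,a])$; the dimensions, learning rate, and the value of $q_t$ are placeholders chosen for illustration, not details from the text.

```python
import numpy as np

# Sketch of the stochastic policy-gradient step (10.2) for a tabular softmax
# policy pi(a|s, theta) = exp(theta[s, a]) / sum_b exp(theta[s, b]).
n_states, n_actions, alpha = 4, 3, 0.1      # assumed sizes and step size
theta = np.zeros((n_states, n_actions))

def grad_ln_pi(theta, s, a):
    """Gradient of ln pi(a|s, theta) w.r.t. theta (nonzero only in row s)."""
    logits = theta[s] - theta[s].max()
    pi_s = np.exp(logits)
    pi_s /= pi_s.sum()
    g = np.zeros_like(theta)
    g[s] = -pi_s            # d/d theta[s, b] of ln pi(a|s) = 1{b=a} - pi(b|s)
    g[s, a] += 1.0
    return g

# One stochastic gradient-ascent step along (10.2), with a placeholder q_t value.
s_t, a_t, q_t = 2, 1, 0.5   # sampled state, sampled action, estimate of q_pi(s_t, a_t)
theta = theta + alpha * grad_ln_pi(theta, s_t, a_t) * q_t
```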

Equation (10.2) is important because it clearly shows how policy-based and value-based methods can be combined. On the one hand, it is a policy-based algorithm since it directly updates the policy parameter. On the other hand, it requires knowing $q_t(s_t, a_t)$, which is an estimate of the action value $q_\pi(s_t, a_t)$. As a result, another value-based algorithm is required to generate $q_t(s_t, a_t)$. So far, we have studied two ways to estimate action values in this book: the first is based on Monte Carlo learning, and the second is based on temporal-difference (TD) learning.

$\diamond$ If $q_t(s_t, a_t)$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient, which has already been introduced in Chapter 9.
$\diamond$ If $q_t(s_t, a_t)$ is estimated by TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be obtained by incorporating TD-based value estimation into policy gradient methods (see the sketch after this list).
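
As a concrete illustration of the two options, the snippet below sketches how $q_t(s_t, a_t)$ might be formed in each case; the discount factor and the tabular value table are assumptions made for this sketch, not details from the text.

```python
import numpy as np

gamma = 0.9  # assumed discount factor for this sketch

# (1) Monte Carlo (REINFORCE): q_t(s_t, a_t) is the discounted return of the
#     episode remainder, computed from sampled rewards r_{t+1}, r_{t+2}, ...
def mc_estimate(rewards_from_t):
    return sum(gamma**k * r for k, r in enumerate(rewards_from_t))

# (2) TD (actor-critic): q_t(s_t, a_t) is read from a value approximator
#     q(s, a, w) that is itself updated incrementally by a TD rule such as Sarsa.
q_table = np.zeros((4, 3))            # a tabular stand-in for q(s, a, w)
def td_estimate(s_t, a_t):
    return q_table[s_t, a_t]
```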

The procedure of the simplest actor-critic algorithm is summarized in Algorithm 10.1. The critic corresponds to the value update step via the Sarsa algorithm presented in (8.35). The action values are represented by a parameterized function $q(s, a, w)$. The actor corresponds to the policy update step in (10.2). This actor-critic algorithm is sometimes called Q actor-critic (QAC). Although it is simple, QAC reveals the core idea of actor-critic methods. It can be extended to generate many more advanced algorithms, as shown in the rest of this chapter.

Algorithm 10.1: The simplest actor-critic algorithm (QAC)

Initialization: A policy function $\pi(a|s, \theta_0)$, where $\theta_0$ is the initial parameter. A value function $q(s, a, w_0)$, where $w_0$ is the initial parameter. Learning rates $\alpha_w, \alpha_\theta > 0$.

Goal: Learn an optimal policy to maximize $J(\theta)$.

At time step $t$ in each episode, do

Generate $a_t$ following $\pi(a|s_t, \theta_t)$, observe $r_{t+1}, s_{t+1}$, and then generate $a_{t+1}$ following $\pi(a|s_{t+1}, \theta_t)$.

Actor (policy update):

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) \, q(s_t, a_t, w_t)
$$

Critic (value update):

$$
w_{t+1} = w_t + \alpha_w \big[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t) \big] \nabla_w q(s_t, a_t, w_t)
$$
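
To show how the actor and critic updates interlock in practice, here is a minimal sketch of the QAC loop, assuming a tabular softmax policy, a tabular value function (so that $\nabla_w q(s, a, w)$ reduces to an indicator), and a toy placeholder environment; none of these specifics come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
gamma, alpha_theta, alpha_w = 0.9, 0.05, 0.1
theta = np.zeros((n_states, n_actions))   # policy parameter (softmax logits)
w = np.zeros((n_states, n_actions))       # value parameter: q(s, a, w) = w[s, a]

def pi(s, theta):
    """Softmax policy pi(.|s, theta) over the actions of state s."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def step_env(s, a):
    """Placeholder environment: random next state, reward 1 for action 0."""
    return int(rng.integers(n_states)), 1.0 if a == 0 else 0.0

s = int(rng.integers(n_states))
a = rng.choice(n_actions, p=pi(s, theta))
for t in range(1000):
    s_next, r = step_env(s, a)
    a_next = rng.choice(n_actions, p=pi(s_next, theta))  # a_{t+1} ~ pi(.|s_{t+1}, theta_t)

    # Actor (policy update): one stochastic gradient-ascent step along (10.2),
    # using q(s_t, a_t, w_t) supplied by the critic.
    grad_ln_pi = np.zeros_like(theta)
    grad_ln_pi[s] = -pi(s, theta)
    grad_ln_pi[s, a] += 1.0
    theta = theta + alpha_theta * grad_ln_pi * w[s, a]

    # Critic (value update): Sarsa; for a table, grad_w q(s, a, w) is an indicator,
    # so only the visited entry w[s, a] changes.
    td_error = r + gamma * w[s_next, a_next] - w[s, a]
    w[s, a] += alpha_w * td_error

    s, a = s_next, a_next
```

The sketch follows the order of Algorithm 10.1: $a_{t+1}$ is sampled from the current policy $\pi(a|s_{t+1}, \theta_t)$, the actor then updates $\theta$ using the critic's current estimate $q(s_t, a_t, w_t)$, and the critic finally updates $w$ with the Sarsa TD error.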