10.2 Advantage actor-critic (A2C)

We now introduce the advantage actor-critic algorithm. Its core idea is to subtract a baseline from the action value so as to reduce estimation variance.

10.2.1 Baseline invariance

One interesting property of the policy gradient is that it is invariant to an additive baseline. That is,

$$\mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\Big] = \mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\,\big(q_\pi(S,A) - b(S)\big)\Big], \tag{10.3}$$

where the additional baseline $b(S)$ is a scalar function of $S$. We next answer two questions about the baseline.

First, why is (10.3) valid?

Equation (10.3) holds if and only if

$$\mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\Big] = 0.$$

This equation is valid because

$$\begin{aligned}
\mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\Big]
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t)\, \nabla_\theta \ln \pi(a|s,\theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s)\, \nabla_\theta \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s)\, \nabla_\theta 1 = 0.
\end{aligned}$$
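To see this derivation concretely, the following minimal check (a sketch with assumed, illustrative numbers, not from the text) verifies that the expectation of $\nabla_\theta \ln \pi(A|s,\theta)\, b(s)$ over $A \sim \pi$ is zero for a softmax policy at a single state:

```python
import numpy as np

# Assumed toy setup: a softmax policy over K actions at a single state s.
np.random.seed(0)
K = 4
theta = np.random.randn(K)            # policy parameters (logits)
pi = np.exp(theta) / np.exp(theta).sum()

# For a softmax policy, grad_theta ln pi(a|s) = e_a - pi.
grad_ln_pi = np.eye(K) - pi           # row a holds grad_theta ln pi(a|s)

b = 3.7                               # any scalar baseline b(s)

# E_{A~pi}[ grad ln pi(A|s) * b(s) ] = sum_a pi(a) * grad ln pi(a|s) * b(s)
expectation = (pi[:, None] * grad_ln_pi * b).sum(axis=0)
print(expectation)                    # ~[0, 0, 0, 0] up to floating-point error
```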

Second, why is the baseline useful?

The baseline is useful because it can reduce the approximation variance when we use samples to approximate the true gradient. In particular, let

$$X(S,A) \doteq \nabla_\theta \ln \pi(A|S,\theta_t)\,\big[q_\pi(S,A) - b(S)\big]. \tag{10.4}$$

Then, the true gradient is $\mathbb{E}[X(S,A)]$. Since we need to use a stochastic sample $x$ to approximate $\mathbb{E}[X]$, it would be favorable if the variance $\operatorname{var}(X)$ is small. For example, if $\operatorname{var}(X)$ is close to zero, then any sample $x$ can accurately approximate $\mathbb{E}[X]$. On the contrary, if $\operatorname{var}(X)$ is large, the value of a sample may be far from $\mathbb{E}[X]$.

Although $\mathbb{E}[X]$ is invariant to the baseline, the variance $\operatorname{var}(X)$ is not. Our goal is to design a good baseline to minimize $\operatorname{var}(X)$. In the algorithms of REINFORCE and QAC, we set $b = 0$, which is not guaranteed to be a good baseline.
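The sketch below illustrates this point empirically, assuming a toy single-state problem with a softmax policy and hand-picked action values (these numbers are illustrative assumptions, not from the text): the sample mean of $X$ is the same with and without the baseline, up to sampling error, while the variance differs substantially.

```python
import numpy as np

# Assumed toy single-state example: softmax policy over 3 actions with known
# action values q(s, a). Compare sample estimates of
# X = grad ln pi(A|s) * (q(s,A) - b) for b = 0 and b = v(s).
np.random.seed(1)
theta = np.array([0.5, -0.2, 0.1])
pi = np.exp(theta) / np.exp(theta).sum()
q = np.array([10.0, 11.0, 12.0])
v = pi @ q                                   # v(s) = E_{A~pi}[q(s,A)]
grad_ln_pi = np.eye(3) - pi                  # softmax score function, one row per action

def samples(b, n=100_000):
    a = np.random.choice(3, size=n, p=pi)    # A ~ pi
    return grad_ln_pi[a] * (q[a] - b)[:, None]

for b, name in [(0.0, "b = 0"), (v, "b = v(s)")]:
    x = samples(b)
    print(name, "mean:", x.mean(axis=0).round(3),
          "trace of var:", x.var(axis=0).sum().round(3))
# The means agree up to sampling error (baseline invariance),
# but the variance is much smaller with b = v(s).
```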

In fact, the optimal baseline that minimizes $\operatorname{var}(X)$ is

$$b^{*}(s) = \frac{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\big]}, \quad s \in \mathcal{S}. \tag{10.5}$$

The proof is given in Box 10.1.

Although the baseline in (10.5) is optimal, it is too complex to be useful in practice. If the weight $\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2$ is removed from (10.5), we can obtain a suboptimal baseline that has a concise expression:

$$b^{\dagger}(s) = \mathbb{E}_{A \sim \pi}\big[q_\pi(s,A)\big] = v_\pi(s), \quad s \in \mathcal{S}.$$

Interestingly, this suboptimal baseline is the state value.

Box 10.1: Showing that $b^{*}(s)$ in (10.5) is the optimal baseline

Let $\bar{x} \doteq \mathbb{E}[X]$, which is invariant for any $b(s)$. If $X$ is a vector, its variance is a matrix. It is common to select the trace of $\operatorname{var}(X)$ as a scalar objective function for optimization:

$$\begin{aligned}
\operatorname{tr}[\operatorname{var}(X)] &= \operatorname{tr}\mathbb{E}\big[(X - \bar{x})(X - \bar{x})^T\big] \\
&= \operatorname{tr}\mathbb{E}\big[XX^T - \bar{x}X^T - X\bar{x}^T + \bar{x}\bar{x}^T\big] \\
&= \mathbb{E}\big[X^T X - X^T\bar{x} - \bar{x}^T X + \bar{x}^T\bar{x}\big] \\
&= \mathbb{E}\big[X^T X\big] - \bar{x}^T\bar{x}.
\end{aligned} \tag{10.6}$$

When deriving the above equation, we use the trace property $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for any square matrices $A, B$ with appropriate dimensions. Since $\bar{x}$ is invariant, equation (10.6) suggests that we only need to minimize $\mathbb{E}[X^T X]$. With $X$ defined in (10.4), we have

$$\begin{aligned}
\mathbb{E}\big[X^T X\big] &= \mathbb{E}\Big[(\nabla_\theta \ln \pi)^T (\nabla_\theta \ln \pi)\, \big(q_\pi(S,A) - b(S)\big)^2\Big] \\
&= \mathbb{E}\Big[\|\nabla_\theta \ln \pi\|^2\, \big(q_\pi(S,A) - b(S)\big)^2\Big],
\end{aligned}$$

where $\pi(A|S,\theta)$ is written as $\pi$ for short. Since $S \sim \eta$ and $A \sim \pi$, the above equation can be rewritten as

$$\mathbb{E}\big[X^T X\big] = \sum_{s \in \mathcal{S}} \eta(s)\, \mathbb{E}_{A \sim \pi}\Big[\|\nabla_\theta \ln \pi\|^2\, \big(q_\pi(s,A) - b(s)\big)^2\Big].$$

To ensure $\nabla_b \mathbb{E}[X^T X] = 0$, $b(s)$ for any $s \in \mathcal{S}$ should satisfy

$$\mathbb{E}_{A \sim \pi}\Big[\|\nabla_\theta \ln \pi\|^2\, \big(b(s) - q_\pi(s,A)\big)\Big] = 0, \qquad s \in \mathcal{S}.$$

The above equation can be easily solved to obtain the optimal baseline:

$$b^{*}(s) = \frac{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi\|^2\big]}, \qquad s \in \mathcal{S}.$$

More discussions on optimal baselines in policy gradient methods can be found in [69, 70].
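As a rough numerical illustration (the policy and action values below are assumptions reused from the earlier sketch, not from the text), we can compute $b^*(s)$ from (10.5) in closed form for a toy single-state problem and compare $\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi\|^2 (q_\pi(s,A) - b)^2\big]$, the term minimized in Box 10.1, under $b = 0$, $b = v_\pi(s)$, and $b = b^*(s)$:

```python
import numpy as np

# Assumed toy single-state setup: softmax policy over 3 actions, known q values.
theta = np.array([0.5, -0.2, 0.1])
pi = np.exp(theta) / np.exp(theta).sum()
q = np.array([10.0, 11.0, 12.0])
grad_ln_pi = np.eye(3) - pi                   # softmax score function per action
w = np.sum(grad_ln_pi**2, axis=1)             # ||grad ln pi(a|s)||^2 for each action

v = pi @ q                                    # suboptimal baseline b† = v(s)
b_star = (pi * w * q).sum() / (pi * w).sum()  # optimal baseline (10.5)

def weighted_second_moment(b):
    # E_{A~pi}[ ||grad ln pi(A|s)||^2 (q(s,A) - b)^2 ]
    return (pi * w * (q - b) ** 2).sum()

for b, name in [(0.0, "b = 0"), (v, "b = v(s)"), (b_star, "b = b*(s)")]:
    print(f"{name:10s}  E[X^T X] = {weighted_second_moment(b):.4f}")
# b*(s) gives the smallest value; b = v(s) is close to optimal and far better than b = 0.
```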

10.2.2 Algorithm description

When $b(s) = v_\pi(s)$, the gradient-ascent algorithm in (10.1) becomes

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\,\big[q_\pi(S,A) - v_\pi(S)\big]\Big] \\
&\doteq \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, \delta_\pi(S,A)\Big].
\end{aligned} \tag{10.7}$$

Here,

$$\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$$

is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a) > 0$, it means that the corresponding action has a greater value than the mean value.
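For instance, with assumed numbers: if $q_\pi(s,\cdot) = [1, 2, 4]$ and $\pi(\cdot|s) = [0.2, 0.5, 0.3]$, then $v_\pi(s) = 2.4$ and the advantages are $[-1.4, -0.4, 1.6]$, so only the last action is better than average. The short check below reproduces this and also verifies that the $\pi$-weighted advantages sum to zero.

```python
import numpy as np

# Assumed toy numbers: the advantage of an action is its action value minus
# the policy-weighted mean of the action values.
pi = np.array([0.2, 0.5, 0.3])        # pi(a|s) for three actions
q = np.array([1.0, 2.0, 4.0])         # q_pi(s, a)

v = pi @ q                            # v_pi(s) = sum_a pi(a|s) q_pi(s, a) = 2.4
delta = q - v                         # advantages: [-1.4, -0.4, 1.6]
print(v, delta, (pi * delta).sum())   # the pi-weighted advantages sum to zero
```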

The stochastic version of (10.7) is

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\,\big[q_t(s_t,a_t) - v_t(s_t)\big] \\
&= \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, \delta_t(s_t,a_t),
\end{aligned} \tag{10.8}$$

where $s_t, a_t$ are samples of $S, A$ at time $t$. Here, $q_t(s_t,a_t)$ and $v_t(s_t)$ are approximations of $q_{\pi(\theta_t)}(s_t,a_t)$ and $v_{\pi(\theta_t)}(s_t)$, respectively. The algorithm in (10.8) updates the policy based on the relative value of $q_t$ with respect to $v_t$ rather than the absolute value of $q_t$. This is intuitively reasonable because, when we attempt to select an action at a state, we only care about which action has the greatest value relative to the others.

If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte Carlo learning, the algorithm in (10.8) is called REINFORCE with a baseline. If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C). The implementation of A2C is summarized in Algorithm 10.2. It should be noted that the advantage function in this implementation is approximated by the TD error:

$$q_t(s_t,a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t).$$

This approximation is reasonable because

$$q_\pi(s_t,a_t) - v_\pi(s_t) = \mathbb{E}\Big[R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \,\Big|\, S_t = s_t, A_t = a_t\Big],$$

which is valid due to the definition of $q_\pi(s_t,a_t)$. One merit of using the TD error is that we only need to use a single neural network to represent $v_\pi(s)$. Otherwise, if $\delta_t = q_t(s_t,a_t) - v_t(s_t)$, we need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$, respectively. When we use the TD error, the algorithm may also be called TD actor-critic. In addition, it is notable that the policy $\pi(\theta_t)$ is stochastic and hence exploratory. Therefore, it can be directly used to generate experience samples without relying on techniques such as $\varepsilon$-greedy.

Algorithm 10.2: Advantage actor-critic (A2C) or TD actor-critic

Initialization: A policy function $\pi(a|s,\theta_0)$, where $\theta_0$ is the initial parameter; a value function $v(s,w_0)$, where $w_0$ is the initial parameter; learning rates $\alpha_w, \alpha_\theta > 0$.
Goal: Learn an optimal policy to maximize $J(\theta)$.
At time step $t$ in each episode, do
- Generate $a_t$ following $\pi(a|s_t,\theta_t)$ and then observe $r_{t+1}, s_{t+1}$.
- Advantage (TD error): $\delta_t = r_{t+1} + \gamma v(s_{t+1},w_t) - v(s_t,w_t)$
- Actor (policy update): $\theta_{t+1} = \theta_t + \alpha_\theta\, \delta_t\, \nabla_\theta \ln \pi(a_t|s_t,\theta_t)$
- Critic (value update): $w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w v(s_t,w_t)$

There are some variants of A2C, such as asynchronous advantage actor-critic (A3C). Interested readers may check [71, 72].
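To make Algorithm 10.2 concrete, here is a minimal sketch in Python. The environment (a five-state chain), the tabular softmax policy, the tabular value function, and all hyperparameters are illustrative assumptions, not part of the text.

```python
import numpy as np

# A minimal sketch of Algorithm 10.2 on an assumed toy 5-state chain MDP.
# Policy: softmax over logits theta[s, a]; value function: one parameter w[s] per state.
np.random.seed(0)
n_states, n_actions = 5, 2                # actions: 0 = left, 1 = right
gamma, alpha_theta, alpha_w = 0.9, 0.05, 0.1
theta = np.zeros((n_states, n_actions))   # policy parameters
w = np.zeros(n_states)                    # value-function parameters

def step(s, a):
    """Move left/right on the chain; reaching the right end gives reward 1 and ends the episode."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

def policy(s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

for episode in range(500):
    s, done = 0, False
    while not done:
        p = policy(s)
        a = np.random.choice(n_actions, p=p)      # generate a_t following pi(a|s_t, theta_t)
        s_next, r, done = step(s, a)

        # Advantage (TD error): delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t)
        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]

        # Actor: theta <- theta + alpha_theta * delta * grad_theta ln pi(a_t|s_t, theta_t)
        grad_ln_pi = -p
        grad_ln_pi[a] += 1.0                      # softmax score function: e_a - pi
        theta[s] += alpha_theta * delta * grad_ln_pi

        # Critic: w <- w + alpha_w * delta * grad_w v(s_t, w)
        w[s] += alpha_w * delta                   # tabular case: grad_w v(s_t, w) is the indicator of s_t

        s = s_next

print(np.round(w, 2))                                           # learned state values
print(np.round([policy(s)[1] for s in range(n_states)], 2))     # probability of moving right
```

Because the value function here is tabular, $\nabla_w v(s_t,w)$ reduces to the indicator vector of $s_t$, so the critic update only adjusts the entry $w[s_t]$; with a neural-network critic, the same update would instead backpropagate $\delta_t$ through the network.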