10.2 Advantage actor-critic (A2C)

We now introduce the advantage actor-critic algorithm. Its core idea is to subtract a baseline from the action value so as to reduce estimation variance.

10.2.1 Baseline invariance

One interesting property of the policy gradient is that it is invariant to an additive baseline. That is,

$$\mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\Big] = \mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\,\big(q_\pi(S,A) - b(S)\big)\Big], \tag{10.3}$$

where the additional baseline $b(S)$ is a scalar function of $S$. We next answer two questions about the baseline.

First, why is (10.3) valid?

Equation (10.3) holds if and only if

$$\mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\Big] = 0.$$

This equation is valid because

$$\begin{aligned}
\mathbb{E}_{S \sim \eta, A \sim \pi}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\Big]
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t)\, \nabla_\theta \ln \pi(a|s,\theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi(a|s,\theta_t) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s)\, \nabla_\theta \sum_{a \in \mathcal{A}} \pi(a|s,\theta_t) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s)\, \nabla_\theta 1 = 0.
\end{aligned}$$
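To see this derivation concretely, the following minimal check (a sketch with assumed, illustrative numbers, not from the text) verifies that the expectation of $\nabla_\theta \ln \pi(A|s,\theta)\, b(s)$ over $A \sim \pi$ is zero for a softmax policy at a single state:

```python
import numpy as np

# Assumed toy setup: a softmax policy over K actions at a single state s.
np.random.seed(0)
K = 4
theta = np.random.randn(K)            # policy parameters (logits)
pi = np.exp(theta) / np.exp(theta).sum()

# For a softmax policy, grad_theta ln pi(a|s) = e_a - pi.
grad_ln_pi = np.eye(K) - pi           # row a holds grad_theta ln pi(a|s)

b = 3.7                               # any scalar baseline b(s)

# E_{A~pi}[ grad ln pi(A|s) * b(s) ] = sum_a pi(a) * grad ln pi(a|s) * b(s)
expectation = (pi[:, None] * grad_ln_pi * b).sum(axis=0)
print(expectation)                    # ~[0, 0, 0, 0] up to floating-point error
```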

Second, why is the baseline useful?

The baseline is useful because it can reduce the approximation variance when we use samples to approximate the true gradient. In particular, let

$$X(S,A) \doteq \nabla_\theta \ln \pi(A|S,\theta_t)\,\big[q_\pi(S,A) - b(S)\big]. \tag{10.4}$$

Then, the true gradient is $\mathbb{E}[X(S,A)]$. Since we need to use a stochastic sample $x$ to approximate $\mathbb{E}[X]$, it would be favorable if the variance $\operatorname{var}(X)$ is small. For example, if $\operatorname{var}(X)$ is close to zero, then any sample $x$ can accurately approximate $\mathbb{E}[X]$. On the contrary, if $\operatorname{var}(X)$ is large, the value of a sample may be far from $\mathbb{E}[X]$.

Although $\mathbb{E}[X]$ is invariant to the baseline, the variance $\operatorname{var}(X)$ is not. Our goal is to design a good baseline to minimize $\operatorname{var}(X)$. In the algorithms of REINFORCE and QAC, we set $b = 0$, which is not guaranteed to be a good baseline.
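The sketch below illustrates this point empirically, assuming a toy single-state problem with a softmax policy and hand-picked action values (these numbers are illustrative assumptions, not from the text): the sample mean of $X$ is the same with and without the baseline, up to sampling error, while the variance differs substantially.

```python
import numpy as np

# Assumed toy single-state example: softmax policy over 3 actions with known
# action values q(s, a). Compare sample estimates of
# X = grad ln pi(A|s) * (q(s,A) - b) for b = 0 and b = v(s).
np.random.seed(1)
theta = np.array([0.5, -0.2, 0.1])
pi = np.exp(theta) / np.exp(theta).sum()
q = np.array([10.0, 11.0, 12.0])
v = pi @ q                                   # v(s) = E_{A~pi}[q(s,A)]
grad_ln_pi = np.eye(3) - pi                  # softmax score function, one row per action

def samples(b, n=100_000):
    a = np.random.choice(3, size=n, p=pi)    # A ~ pi
    return grad_ln_pi[a] * (q[a] - b)[:, None]

for b, name in [(0.0, "b = 0"), (v, "b = v(s)")]:
    x = samples(b)
    print(name, "mean:", x.mean(axis=0).round(3),
          "trace of var:", x.var(axis=0).sum().round(3))
# The means agree up to sampling error (baseline invariance),
# but the variance is much smaller with b = v(s).
```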

In fact, the optimal baseline that minimizes $\operatorname{var}(X)$ is

$$b^{*}(s) = \frac{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\big]}, \quad s \in \mathcal{S}. \tag{10.5}$$

The proof is given in Box 10.1.

Although the baseline in (10.5) is optimal, it is too complex to be useful in practice. If the weight $\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2$ is removed from (10.5), we can obtain a suboptimal baseline that has a concise expression:

$$b^{\dagger}(s) = \mathbb{E}_{A \sim \pi}\big[q_\pi(s,A)\big] = v_\pi(s), \quad s \in \mathcal{S}.$$

Interestingly, this suboptimal baseline is the state value.

Box 10.1: Showing that $b^{*}(s)$ in (10.5) is the optimal baseline

Let $\bar{x} \doteq \mathbb{E}[X]$, which is invariant for any $b(s)$. If $X$ is a vector, its variance is a matrix. It is common to select the trace of $\operatorname{var}(X)$ as a scalar objective function for optimization:

$$\begin{aligned}
\operatorname{tr}[\operatorname{var}(X)] &= \operatorname{tr}\mathbb{E}\big[(X - \bar{x})(X - \bar{x})^T\big] \\
&= \operatorname{tr}\mathbb{E}\big[XX^T - \bar{x}X^T - X\bar{x}^T + \bar{x}\bar{x}^T\big] \\
&= \mathbb{E}\big[X^T X - X^T\bar{x} - \bar{x}^T X + \bar{x}^T\bar{x}\big] \\
&= \mathbb{E}\big[X^T X\big] - \bar{x}^T\bar{x}.
\end{aligned} \tag{10.6}$$

When deriving the above equation, we use the trace property $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for any square matrices $A, B$ with appropriate dimensions. Since $\bar{x}$ is invariant, equation (10.6) suggests that we only need to minimize $\mathbb{E}[X^T X]$. With $X$ defined in (10.4), we have

$$\begin{aligned}
\mathbb{E}\big[X^T X\big] &= \mathbb{E}\Big[(\nabla_\theta \ln \pi)^T (\nabla_\theta \ln \pi)\, \big(q_\pi(S,A) - b(S)\big)^2\Big] \\
&= \mathbb{E}\Big[\|\nabla_\theta \ln \pi\|^2\, \big(q_\pi(S,A) - b(S)\big)^2\Big],
\end{aligned}$$

where $\pi(A|S,\theta)$ is written as $\pi$ for short. Since $S \sim \eta$ and $A \sim \pi$, the above equation can be rewritten as

$$\mathbb{E}\big[X^T X\big] = \sum_{s \in \mathcal{S}} \eta(s)\, \mathbb{E}_{A \sim \pi}\Big[\|\nabla_\theta \ln \pi\|^2\, \big(q_\pi(s,A) - b(s)\big)^2\Big].$$

To ensure $\nabla_b \mathbb{E}[X^T X] = 0$, $b(s)$ for any $s \in \mathcal{S}$ should satisfy

$$\mathbb{E}_{A \sim \pi}\Big[\|\nabla_\theta \ln \pi\|^2\, \big(b(s) - q_\pi(s,A)\big)\Big] = 0, \qquad s \in \mathcal{S}.$$

The above equation can be easily solved to obtain the optimal baseline:

$$b^{*}(s) = \frac{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi\|^2\big]}, \qquad s \in \mathcal{S}.$$

More discussions on optimal baselines in policy gradient methods can be found in [69, 70].
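As a rough numerical illustration (the policy and action values below are assumptions reused from the earlier sketch, not from the text), we can compute $b^*(s)$ from (10.5) in closed form for a toy single-state problem and compare $\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi\|^2 (q_\pi(s,A) - b)^2\big]$, the term minimized in Box 10.1, under $b = 0$, $b = v_\pi(s)$, and $b = b^*(s)$:

```python
import numpy as np

# Assumed toy single-state setup: softmax policy over 3 actions, known q values.
theta = np.array([0.5, -0.2, 0.1])
pi = np.exp(theta) / np.exp(theta).sum()
q = np.array([10.0, 11.0, 12.0])
grad_ln_pi = np.eye(3) - pi                   # softmax score function per action
w = np.sum(grad_ln_pi**2, axis=1)             # ||grad ln pi(a|s)||^2 for each action

v = pi @ q                                    # suboptimal baseline b† = v(s)
b_star = (pi * w * q).sum() / (pi * w).sum()  # optimal baseline (10.5)

def weighted_second_moment(b):
    # E_{A~pi}[ ||grad ln pi(A|s)||^2 (q(s,A) - b)^2 ]
    return (pi * w * (q - b) ** 2).sum()

for b, name in [(0.0, "b = 0"), (v, "b = v(s)"), (b_star, "b = b*(s)")]:
    print(f"{name:10s}  E[X^T X] = {weighted_second_moment(b):.4f}")
# b*(s) gives the smallest value; b = v(s) is close to optimal and far better than b = 0.
```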

10.2.2 Algorithm description

When $b(s) = v_\pi(s)$, the gradient-ascent algorithm in (10.1) becomes

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\,\big[q_\pi(S,A) - v_\pi(S)\big]\Big] \\
&\doteq \theta_t + \alpha\, \mathbb{E}\Big[\nabla_\theta \ln \pi(A|S,\theta_t)\, \delta_\pi(S,A)\Big].
\end{aligned} \tag{10.7}$$

Here,

$$\delta_\pi(S,A) \doteq q_\pi(S,A) - v_\pi(S)$$

is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a) > 0$, it means that the corresponding action has a greater value than the mean value.
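For instance, with assumed numbers: if $q_\pi(s,\cdot) = [1, 2, 4]$ and $\pi(\cdot|s) = [0.2, 0.5, 0.3]$, then $v_\pi(s) = 2.4$ and the advantages are $[-1.4, -0.4, 1.6]$, so only the last action is better than average. The short check below reproduces this and also verifies that the $\pi$-weighted advantages sum to zero.

```python
import numpy as np

# Assumed toy numbers: the advantage of an action is its action value minus
# the policy-weighted mean of the action values.
pi = np.array([0.2, 0.5, 0.3])        # pi(a|s) for three actions
q = np.array([1.0, 2.0, 4.0])         # q_pi(s, a)

v = pi @ q                            # v_pi(s) = sum_a pi(a|s) q_pi(s, a) = 2.4
delta = q - v                         # advantages: [-1.4, -0.4, 1.6]
print(v, delta, (pi * delta).sum())   # the pi-weighted advantages sum to zero
```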

The stochastic version of (10.7) is

$$\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\,\big[q_t(s_t,a_t) - v_t(s_t)\big] \\
&= \theta_t + \alpha\, \nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, \delta_t(s_t,a_t),
\end{aligned} \tag{10.8}$$

where $s_t, a_t$ are samples of $S, A$ at time $t$. Here, $q_t(s_t,a_t)$ and $v_t(s_t)$ are approximations of $q_{\pi(\theta_t)}(s_t,a_t)$ and $v_{\pi(\theta_t)}(s_t)$, respectively. The algorithm in (10.8) updates the policy based on the relative value of $q_t$ with respect to $v_t$ rather than the absolute value of $q_t$. This is intuitively reasonable because, when we attempt to select an action at a state, we only care about which action has the greatest value relative to the others.

If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte Carlo learning, the algorithm in (10.8) is called REINFORCE with a baseline. If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C). The implementation of A2C is summarized in Algorithm 10.2. It should be noted that the advantage function in this implementation is approximated by the TD error:

$$q_t(s_t,a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t).$$

This approximation is reasonable because

$$q_\pi(s_t,a_t) - v_\pi(s_t) = \mathbb{E}\Big[R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \,\Big|\, S_t = s_t, A_t = a_t\Big],$$

which is valid due to the definition of $q_\pi(s_t,a_t)$. One merit of using the TD error is that we only need to use a single neural network to represent $v_\pi(s)$. Otherwise, if $\delta_t = q_t(s_t,a_t) - v_t(s_t)$, we need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$, respectively. When we use the TD error, the algorithm may also be called TD actor-critic. In addition, it is notable that the policy $\pi(\theta_t)$ is stochastic and hence exploratory. Therefore, it can be directly used to generate experience samples without relying on techniques such as $\varepsilon$-greedy.

Algorithm 10.2: Advantage actor-critic (A2C) or TD actor-critic

Initialization: A policy function $\pi(a|s,\theta_0)$, where $\theta_0$ is the initial parameter; a value function $v(s,w_0)$, where $w_0$ is the initial parameter; learning rates $\alpha_w, \alpha_\theta > 0$.
Goal: Learn an optimal policy to maximize $J(\theta)$.
At time step $t$ in each episode, do
- Generate $a_t$ following $\pi(a|s_t,\theta_t)$ and then observe $r_{t+1}, s_{t+1}$.
- Advantage (TD error): $\delta_t = r_{t+1} + \gamma v(s_{t+1},w_t) - v(s_t,w_t)$
- Actor (policy update): $\theta_{t+1} = \theta_t + \alpha_\theta\, \delta_t\, \nabla_\theta \ln \pi(a_t|s_t,\theta_t)$
- Critic (value update): $w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w v(s_t,w_t)$

There are some variants of A2C, such as asynchronous advantage actor-critic (A3C). Interested readers may check [71, 72].
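To make Algorithm 10.2 concrete, here is a minimal sketch in Python. The environment (a five-state chain), the tabular softmax policy, the tabular value function, and all hyperparameters are illustrative assumptions, not part of the text.

```python
import numpy as np

# A minimal sketch of Algorithm 10.2 on an assumed toy 5-state chain MDP.
# Policy: softmax over logits theta[s, a]; value function: one parameter w[s] per state.
np.random.seed(0)
n_states, n_actions = 5, 2                # actions: 0 = left, 1 = right
gamma, alpha_theta, alpha_w = 0.9, 0.05, 0.1
theta = np.zeros((n_states, n_actions))   # policy parameters
w = np.zeros(n_states)                    # value-function parameters

def step(s, a):
    """Move left/right on the chain; reaching the right end gives reward 1 and ends the episode."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

def policy(s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

for episode in range(500):
    s, done = 0, False
    while not done:
        p = policy(s)
        a = np.random.choice(n_actions, p=p)      # generate a_t following pi(a|s_t, theta_t)
        s_next, r, done = step(s, a)

        # Advantage (TD error): delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t)
        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]

        # Actor: theta <- theta + alpha_theta * delta * grad_theta ln pi(a_t|s_t, theta_t)
        grad_ln_pi = -p
        grad_ln_pi[a] += 1.0                      # softmax score function: e_a - pi
        theta[s] += alpha_theta * delta * grad_ln_pi

        # Critic: w <- w + alpha_w * delta * grad_w v(s_t, w)
        w[s] += alpha_w * delta                   # tabular case: grad_w v(s_t, w) is the indicator of s_t

        s = s_next

print(np.round(w, 2))                                           # learned state values
print(np.round([policy(s)[1] for s in range(n_states)], 2))     # probability of moving right
```

Because the value function here is tabular, $\nabla_w v(s_t,w)$ reduces to the indicator vector of $s_t$, so the critic update only adjusts the entry $w[s_t]$; with a neural-network critic, the same update would instead backpropagate $\delta_t$ through the network.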