10.2 Advantage actor-critic (A2C)
We now introduce the advantage actor-critic algorithm. The core idea of this algorithm is to introduce a baseline into the policy gradient in order to reduce the estimation variance.
10.2.1 Baseline invariance
One interesting property of the policy gradient is that it is invariant to an additional baseline. That is
$$
\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln\pi(A|S,\theta_t)\,q_\pi(S,A)\big]
=\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln\pi(A|S,\theta_t)\big(q_\pi(S,A)-b(S)\big)\big],
\tag{10.3}
$$
where the additional baseline $b(S)$ is a scalar function of $S$. We next answer two questions about the baseline.
First, why is (10.3) valid?
Equation (10.3) holds if and only if
$$
\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln\pi(A|S,\theta_t)\,b(S)\big]=0.
$$
This equation is valid because
$$
\begin{aligned}
\mathbb{E}_{S\sim\eta,\,A\sim\pi}\big[\nabla_\theta \ln\pi(A|S,\theta_t)\,b(S)\big]
&=\sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s,\theta_t)\,\nabla_\theta \ln\pi(a|s,\theta_t)\,b(s)\\
&=\sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\nabla_\theta \pi(a|s,\theta_t)\,b(s)\\
&=\sum_{s\in\mathcal{S}}\eta(s)\,b(s)\sum_{a\in\mathcal{A}}\nabla_\theta \pi(a|s,\theta_t)\\
&=\sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta \sum_{a\in\mathcal{A}}\pi(a|s,\theta_t)\\
&=\sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta 1=0.
\end{aligned}
$$
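This invariance can also be checked numerically. Below is a minimal sketch, assuming a softmax policy over a small discrete action set at a single state; the number of actions, the parameters, and the baseline value are made up for illustration. It verifies that the baseline term has zero expectation.

```python
# Minimal numerical check: E_{A~pi}[ grad_theta ln pi(A|s) * b(s) ] = 0
# for a hypothetical softmax policy (all quantities are illustrative).
import numpy as np

rng = np.random.default_rng(0)
num_actions = 4
theta = rng.normal(size=num_actions)           # policy parameters (logits) for one state
pi = np.exp(theta) / np.exp(theta).sum()       # pi(a | s, theta), softmax

def grad_log_pi(a):
    # For a softmax policy, grad_theta ln pi(a | s, theta) = e_a - pi.
    one_hot = np.zeros(num_actions)
    one_hot[a] = 1.0
    return one_hot - pi

b = 3.7                                        # any scalar baseline b(s)
# Exact expectation over the finite action set
expectation = sum(pi[a] * grad_log_pi(a) * b for a in range(num_actions))
print(expectation)                             # ~[0, 0, 0, 0]
```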
Second, why is the baseline useful?
The baseline is useful because it can reduce the approximation variance when we use samples to approximate the true gradient. In particular, let
$$
X(S,A)\doteq\nabla_\theta \ln\pi(A|S,\theta_t)\big[q_\pi(S,A)-b(S)\big].
\tag{10.4}
$$
Then, the true gradient is $\mathbb{E}[X(S,A)]$. Since we need to use a stochastic sample $x$ to approximate $\mathbb{E}[X]$, it would be favorable if the variance $\mathrm{var}(X)$ is small. For example, if $\mathrm{var}(X)$ is close to zero, then any sample $x$ can accurately approximate $\mathbb{E}[X]$. On the contrary, if $\mathrm{var}(X)$ is large, the value of a sample may be far from $\mathbb{E}[X]$.
Although $\mathbb{E}[X]$ is invariant to the baseline, the variance $\mathrm{var}(X)$ is not. Our goal is to design a good baseline that minimizes $\mathrm{var}(X)$. In the REINFORCE and QAC algorithms, we set $b=0$, which is not guaranteed to be a good baseline.
In fact, the optimal baseline that minimizes $\mathrm{var}(X)$ is
$$
b^*(s)=\frac{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi(A|s,\theta_t)\|^2\,q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi(A|s,\theta_t)\|^2\big]},\qquad s\in\mathcal{S}.
\tag{10.5}
$$
The proof is given in Box 10.1.
Although the baseline in (10.5) is optimal, it is too complex to be useful in practice. If the weight $\|\nabla_\theta \ln\pi(A|s,\theta_t)\|^2$ is removed from (10.5), we can obtain a suboptimal baseline that has a concise expression:
$$
b^\dagger(s)=\mathbb{E}_{A\sim\pi}\big[q_\pi(s,A)\big]=v_\pi(s),\qquad s\in\mathcal{S}.
$$
Interestingly, this suboptimal baseline is the state value.
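To see the variance reduction concretely, the following sketch compares $\mathrm{tr}[\mathrm{var}(X)]$ for $b=0$ and $b=v_\pi(s)$ at a single state, assuming a softmax policy and made-up action values; the expectations are computed exactly over the finite action set, so no sampling is involved.

```python
# Sketch: trace of var(X) with and without the state-value baseline,
# for one state with a hypothetical softmax policy (all numbers are made up).
import numpy as np

rng = np.random.default_rng(1)
num_actions = 5
theta = rng.normal(size=num_actions)
pi = np.exp(theta) / np.exp(theta).sum()               # pi(a | s, theta)
q = rng.normal(loc=1.0, scale=2.0, size=num_actions)   # q_pi(s, .), made up

def grad_log_pi(a):
    one_hot = np.zeros(num_actions)
    one_hot[a] = 1.0
    return one_hot - pi                                 # softmax score function

def trace_var(b):
    # X(a) = grad ln pi(a) * (q(a) - b); tr(var(X)) computed exactly over A ~ pi.
    xs = np.array([grad_log_pi(a) * (q[a] - b) for a in range(num_actions)])
    mean = pi @ xs                                      # E[X], independent of b
    second = sum(pi[a] * xs[a] @ xs[a] for a in range(num_actions))  # E[X^T X]
    return second - mean @ mean

v = pi @ q                                              # v_pi(s) = E_{A~pi}[q_pi(s, A)]
print("tr var, b = 0      :", trace_var(0.0))
print("tr var, b = v_pi(s):", trace_var(v))             # typically smaller
```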
Box 10.1: Showing that $b^*(s)$ in (10.5) is the optimal baseline
Let $\bar{x}\doteq\mathbb{E}[X]$, which is invariant for any $b(s)$. If $X$ is a vector, its variance is a matrix. It is common to select the trace of $\mathrm{var}(X)$ as a scalar objective function for optimization:
$$
\begin{aligned}
\mathrm{tr}[\mathrm{var}(X)]&=\mathrm{tr}\,\mathbb{E}\big[(X-\bar{x})(X-\bar{x})^T\big]\\
&=\mathrm{tr}\,\mathbb{E}\big[XX^T-\bar{x}X^T-X\bar{x}^T+\bar{x}\bar{x}^T\big]\\
&=\mathbb{E}\big[X^TX-X^T\bar{x}-\bar{x}^TX+\bar{x}^T\bar{x}\big]\\
&=\mathbb{E}[X^TX]-\bar{x}^T\bar{x}.
\end{aligned}
\tag{10.6}
$$
When deriving the above equation, we use the trace property $\mathrm{tr}(AB)=\mathrm{tr}(BA)$ for any square matrices $A,B$ with appropriate dimensions. Since $\bar{x}$ is invariant, equation (10.6) suggests that we only need to minimize $\mathbb{E}[X^TX]$. With $X$ defined in (10.4), we have
$$
\mathbb{E}[X^TX]=\mathbb{E}\big[(\nabla_\theta \ln\pi)^T(\nabla_\theta \ln\pi)\big(q_\pi(S,A)-b(S)\big)^2\big]
=\mathbb{E}\big[\|\nabla_\theta \ln\pi\|^2\big(q_\pi(S,A)-b(S)\big)^2\big],
$$
where $\pi(A|S,\theta_t)$ is written as $\pi$ for short. Since $S\sim\eta$ and $A\sim\pi$, the above equation can be rewritten as
$$
\mathbb{E}[X^TX]=\sum_{s\in\mathcal{S}}\eta(s)\,\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi\|^2\big(q_\pi(s,A)-b(s)\big)^2\big].
$$
To ensure $\nabla_b\mathbb{E}[X^TX]=0$, $b(s)$ for any $s\in\mathcal{S}$ should satisfy
$$
\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi\|^2\big(b(s)-q_\pi(s,A)\big)\big]=0,\qquad s\in\mathcal{S}.
$$
The above equation can be easily solved to obtain the optimal baseline:
$$
b^*(s)=\frac{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi\|^2\,q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi\|^2\big]},\qquad s\in\mathcal{S}.
$$
More discussions on optimal baselines in policy gradient methods can be found in [69, 70].
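As a quick numerical sanity check of this result, the sketch below computes $b^*(s)$ from (10.5) for the same kind of toy setup as before (one state, a hypothetical softmax policy, made-up action values) and confirms that it yields a trace variance no larger than either $b=0$ or $b=v_\pi(s)$.

```python
# Sanity check of the optimal baseline (10.5) on a toy single-state example.
# All quantities are illustrative assumptions, not from the book.
import numpy as np

rng = np.random.default_rng(2)
n = 5
theta = rng.normal(size=n)
pi = np.exp(theta) / np.exp(theta).sum()
q = rng.normal(loc=1.0, scale=2.0, size=n)    # q_pi(s, .), made up

grads = np.eye(n) - pi                        # row a: grad_theta ln pi(a) for softmax
w = np.array([g @ g for g in grads])          # ||grad_theta ln pi(a)||^2

b_star = (pi * w) @ q / (pi @ w)              # optimal baseline (10.5)
v = pi @ q                                    # suboptimal baseline v_pi(s)

def trace_var(b):
    # tr(var(X)) with X(a) = grad ln pi(a) * (q(a) - b), exact over A ~ pi.
    xs = grads * (q - b)[:, None]
    mean = pi @ xs
    return sum(pi[a] * xs[a] @ xs[a] for a in range(n)) - mean @ mean

for name, b in [("b = 0", 0.0), ("b = v_pi(s)", v), ("b = b*(s)", b_star)]:
    print(f"{name:12s} tr var = {trace_var(b):.4f}")   # b*(s) attains the minimum
```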
10.2.2 Algorithm description
When $b(s)=v_\pi(s)$, the gradient-ascent algorithm in (10.1) becomes
$$
\begin{aligned}
\theta_{t+1}&=\theta_t+\alpha\,\mathbb{E}\Big[\nabla_\theta \ln\pi(A|S,\theta_t)\big[q_\pi(S,A)-v_\pi(S)\big]\Big]\\
&\doteq\theta_t+\alpha\,\mathbb{E}\big[\nabla_\theta \ln\pi(A|S,\theta_t)\,\delta_\pi(S,A)\big].
\end{aligned}
\tag{10.7}
$$
Here,
$$
\delta_\pi(S,A)\doteq q_\pi(S,A)-v_\pi(S)
$$
is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a)>0$, the corresponding action has a greater value than the mean value.
The stochastic version of (10.7) is
$$
\begin{aligned}
\theta_{t+1}&=\theta_t+\alpha\,\nabla_\theta \ln\pi(a_t|s_t,\theta_t)\big[q_t(s_t,a_t)-v_t(s_t)\big]\\
&=\theta_t+\alpha\,\nabla_\theta \ln\pi(a_t|s_t,\theta_t)\,\delta_t(s_t,a_t),
\end{aligned}
\tag{10.8}
$$
where $s_t,a_t$ are samples of $S,A$ at time $t$. Here, $q_t(s_t,a_t)$ and $v_t(s_t)$ are approximations of $q_{\pi(\theta_t)}(s_t,a_t)$ and $v_{\pi(\theta_t)}(s_t)$, respectively. The algorithm in (10.8) updates the policy based on the relative value of $q_t$ with respect to $v_t$ rather than the absolute value of $q_t$. This is intuitively reasonable because, when we attempt to select an action at a state, we only care about which action has the greatest value relative to the others.
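As a concrete illustration of (10.8), the sketch below performs one stochastic actor update for a hypothetical softmax policy at a single state; the estimates $q_t$ and $v_t$ are made-up numbers. The probability of the sampled action increases because its advantage $q_t(s_t,a_t)-v_t(s_t)$ is positive, which is exactly the relative-value point above.

```python
# One stochastic update of the policy parameter as in (10.8), toy example.
import numpy as np

n = 3
theta = np.zeros(n)                                 # uniform initial policy (logits)
pi = np.exp(theta) / np.exp(theta).sum()

a_t = 2                                             # sampled action
q_t = np.array([1.0, 2.0, 4.0])                     # estimated q_t(s_t, .), made up
v_t = pi @ q_t                                      # estimated v_t(s_t)
delta_t = q_t[a_t] - v_t                            # advantage estimate

grad_log_pi = -pi                                   # grad_theta ln pi(a_t | s_t) for softmax
grad_log_pi[a_t] += 1.0
alpha = 0.1
theta = theta + alpha * grad_log_pi * delta_t       # the update in (10.8)

new_pi = np.exp(theta) / np.exp(theta).sum()
print(np.round(pi, 3), "->", np.round(new_pi, 3))   # P(a_t) rises: its advantage is positive
```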
If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte Carlo learning, the algorithm in (10.8) is called REINFORCE with a baseline. If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C). The implementation of A2C is summarized in Algorithm 10.2. It should be noted that the advantage function in this implementation is approximated by the TD error:
$$
q_t(s_t,a_t)-v_t(s_t)\approx r_{t+1}+\gamma v_t(s_{t+1})-v_t(s_t).
$$
This approximation is reasonable because
$$
q_\pi(s_t,a_t)-v_\pi(s_t)=\mathbb{E}\big[R_{t+1}+\gamma v_\pi(S_{t+1})-v_\pi(S_t)\,\big|\,S_t=s_t,A_t=a_t\big],
$$
which is valid due to the definition of $q_\pi(s_t,a_t)$. One merit of using the TD error is that we only need a single neural network to represent $v_\pi(s)$. Otherwise, if $\delta_t=q_t(s_t,a_t)-v_t(s_t)$, we would need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$, respectively. When we use the TD error, the algorithm may also be called TD actor-critic. In addition, it is notable that the policy $\pi(\theta_t)$ is stochastic and hence exploratory. Therefore, it can be directly used to generate experience samples without relying on techniques such as $\varepsilon$-greedy.
Algorithm 10.2: Advantage actor-critic (A2C) or TD actor-critic
Initialization: A policy function $\pi(a|s,\theta_0)$, where $\theta_0$ is the initial parameter; a value function $v(s,w_0)$, where $w_0$ is the initial parameter; learning rates $\alpha_w,\alpha_\theta>0$.
Goal: Learn an optimal policy to maximize $J(\theta)$.
At time step $t$ in each episode, do
    Generate $a_t$ following $\pi(a|s_t,\theta_t)$ and then observe $r_{t+1},s_{t+1}$.
    Advantage (TD error): $\delta_t=r_{t+1}+\gamma v(s_{t+1},w_t)-v(s_t,w_t)$
    Actor (policy update): $\theta_{t+1}=\theta_t+\alpha_\theta\,\delta_t\,\nabla_\theta \ln\pi(a_t|s_t,\theta_t)$
    Critic (value update): $w_{t+1}=w_t+\alpha_w\,\delta_t\,\nabla_w v(s_t,w_t)$
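To make the pseudocode concrete, here is a minimal Python sketch of Algorithm 10.2 with a tabular softmax actor and a tabular critic, run on a made-up chain environment; the environment, its rewards, and the hyperparameters are illustrative assumptions rather than anything specified in the book.

```python
# Minimal sketch of Algorithm 10.2 (A2C / TD actor-critic) on a toy chain MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                 # made-up chain: action 1 moves right, 0 moves left
gamma, alpha_theta, alpha_w = 0.9, 0.05, 0.1

def step(s, a):
    # Hypothetical dynamics: reaching the right end gives reward 1 and ends the episode.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return (1.0 if done else 0.0), s_next, done

theta = np.zeros((n_states, n_actions))    # actor parameters: softmax logits per state
w = np.zeros(n_states)                     # critic parameters: tabular v(s, w)

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

for episode in range(500):
    s = 0
    for t in range(100):
        pi_s = policy(s)
        a = rng.choice(n_actions, p=pi_s)              # generate a_t following pi(a | s_t, theta_t)
        r, s_next, done = step(s, a)                   # observe r_{t+1}, s_{t+1}
        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]              # advantage (TD error)
        grad_log_pi = -pi_s                            # grad_theta ln pi(a_t | s_t) for softmax
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log_pi  # actor (policy update)
        w[s] += alpha_w * delta                        # critic (value update)
        s = s_next
        if done:
            break

print("learned state values  :", np.round(w, 2))
print("P(move right) by state:", np.round([policy(s)[1] for s in range(n_states)], 2))
```

Note that the exploration in this sketch comes entirely from sampling $a_t$ from the stochastic policy itself, which is the point made above.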
There are some variants of A2C, such as asynchronous advantage actor-critic (A3C). Interested readers may check [71, 72].