where α>0 is a constant learning rate. Since the true gradient in (9.31) is unknown, we
can replace the true gradient with a stochastic gradient to obtain the following algorithm:
\[
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t), \qquad (9.32)
\]
where qt(st,at) is an approximation of qπ(st,at). If qt(st,at) is obtained by Monte Carlo estimation, the algorithm is called REINFORCE [68] or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms.
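To make the update in (9.32) concrete, below is a minimal sketch of a single stochastic-gradient step. It assumes a tabular softmax parameterization of the policy (θ stored as an |S|×|A| matrix of action preferences); this parameterization and the names softmax_policy, grad_log_pi, reinforce_step, q_sa, and alpha are illustrative choices, not prescribed by the text.

    import numpy as np

    def softmax_policy(theta, s):
        """pi(.|s, theta) for an assumed tabular softmax parameterization."""
        prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def grad_log_pi(theta, s, a):
        """grad_theta ln pi(a|s, theta); nonzero only in row s."""
        g = np.zeros_like(theta)
        g[s] = -softmax_policy(theta, s)
        g[s, a] += 1.0
        return g

    def reinforce_step(theta, s, a, q_sa, alpha=0.1):
        """One update of (9.32): theta <- theta + alpha * grad ln pi(a|s, theta) * q(s, a)."""
        return theta + alpha * q_sa * grad_log_pi(theta, s, a)

    theta = np.zeros((5, 3))                          # 5 states, 3 actions (arbitrary sizes)
    theta = reinforce_step(theta, s=2, a=1, q_sa=4.0) # one sampled update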
The algorithm in (9.32) is important since many other policy gradient algorithms can be obtained by extending it. We next examine the interpretation of (9.32) more closely. Since
\[
\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)},
\]
we can rewrite (9.32) as
\[
\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t \mid s_t, \theta_t), \qquad (9.33)
\]
where
\[
\beta_t = \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}.
\]
Two important interpretations can be seen from this equation.
First, since (9.33) is a simple gradient-ascent algorithm, the following observations can be made.

⋄ If βt≥0, the probability of choosing (st,at) is enhanced. That is,
π(at∣st,θt+1)≥π(at∣st,θt).
The greater βt is, the stronger the enhancement is.

⋄ If βt<0, the probability of choosing (st,at) decreases. That is,
π(at∣st,θt+1)<π(at∣st,θt).
The above observations can be proven as follows. When θt+1−θt is sufficiently small, it follows from the Taylor expansion that
\[
\begin{aligned}
\pi(a_t \mid s_t, \theta_{t+1}) &\approx \pi(a_t \mid s_t, \theta_t) + (\nabla_\theta \pi(a_t \mid s_t, \theta_t))^T (\theta_{t+1} - \theta_t) \\
&= \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t (\nabla_\theta \pi(a_t \mid s_t, \theta_t))^T (\nabla_\theta \pi(a_t \mid s_t, \theta_t)) \qquad \text{(substituting (9.33))} \\
&= \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \|\nabla_\theta \pi(a_t \mid s_t, \theta_t)\|_2^2.
\end{aligned}
\]
It is clear that π(at∣st,θt+1)≥π(at∣st,θt) when βt≥0 and π(at∣st,θt+1)<π(at∣st,θt) when βt<0.
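This claim can also be checked numerically. The snippet below is an illustrative sketch under the same assumed tabular softmax parameterization as before: it applies one update of the form (9.33) with a positive and a negative βt and prints the probability of (st,at) before and after. The sizes, seed, step size, and βt values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(size=(3, 4))          # 3 states, 4 actions (arbitrary sizes)
    s, a, alpha = 1, 2, 0.05

    def pi(theta, s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    def grad_pi(theta, s, a):
        # grad_theta pi(a|s, theta) = pi(a|s, theta) * grad_theta ln pi(a|s, theta)
        g = np.zeros_like(theta)
        p = pi(theta, s)
        g[s] = p[a] * ((np.arange(len(p)) == a) - p)
        return g

    for beta in (2.0, -2.0):                 # beta_t >= 0 versus beta_t < 0
        theta_next = theta + alpha * beta * grad_pi(theta, s, a)   # update (9.33)
        print(beta, pi(theta, s)[a], pi(theta_next, s)[a])
    # With beta = 2.0 the probability of (s, a) increases; with beta = -2.0 it decreases.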
Algorithm 9.1: Policy Gradient by Monte Carlo (REINFORCE)

    Initialization: Initial parameter θ; γ∈(0,1); α>0.
    Goal: Learn an optimal policy for maximizing J(θ).

    For each episode, do
        Generate an episode {s0,a0,r1,…,sT−1,aT−1,rT} following π(θ).
        For t=0,1,…,T−1:
            Value update: qt(st,at) = ∑_{k=t+1}^{T} γ^{k−t−1} rk
            Policy update: θ ← θ + α∇θlnπ(at∣st,θ)qt(st,at)

Second, the algorithm can strike a balance between exploration and exploitation to a certain extent due to the expression of
\[
\beta_t = \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}.
\]
On the one hand, βt is proportional to qt(st,at). As a result, if the action value of (st,at) is large, then βt is large, so the probability of selecting at is increased. Therefore, the algorithm attempts to exploit actions with greater values. On the other hand, βt is inversely proportional to π(at∣st,θt) when qt(st,at)>0. As a result, if the probability of selecting at is small, then βt is large, so the probability of selecting at is increased. Therefore, the algorithm attempts to explore actions with low probabilities.
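To make this balance concrete, the toy numbers below (purely illustrative, not from the text) compare βt = qt(st,at)/π(at∣st,θt) across a few hypothetical state-action pairs; the action with a high estimated value but a low current selection probability receives the largest βt and hence the strongest push.

    # Illustrative numbers only: beta_t = q_t(s_t, a_t) / pi(a_t | s_t, theta_t).
    cases = {
        "high value, already likely":  {"q": 10.0, "pi": 0.8},
        "high value, rarely selected": {"q": 10.0, "pi": 0.1},
        "low value, rarely selected":  {"q": 1.0,  "pi": 0.1},
    }
    for name, c in cases.items():
        beta = c["q"] / c["pi"]
        print(f"{name:30s} beta_t = {beta:6.1f}")
    # Output: 12.5, 100.0, and 10.0, respectively: exploitation of large q_t
    # combined with exploration of actions that currently have low probability.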
Moreover, since (9.32) uses samples to approximate the true gradient in (9.31), it is important to understand how the samples should be obtained.
⋄ How to sample S? The state S in the true gradient E[∇θlnπ(A∣S,θt)qπ(S,A)] should obey the distribution η, which is either the stationary distribution dπ or the discounted total probability distribution ρπ in (9.19). Either dπ or ρπ represents the long-term behavior exhibited under π.

⋄ How to sample A? The action A in E[∇θlnπ(A∣S,θt)qπ(S,A)] should obey the distribution π(A∣S,θ). The ideal way to sample A is to select at following π(a∣st,θt). Therefore, the policy gradient algorithm is on-policy.
Unfortunately, these ideal ways of sampling S and A are not strictly followed in practice due to their low sample efficiency. A more sample-efficient implementation of (9.32) is given in Algorithm 9.1. In this implementation, an episode is first generated by following π(θ). Then, θ is updated multiple times, once for each experience sample in the episode.
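The following is a minimal, self-contained sketch of Algorithm 9.1 in Python. It again assumes a tabular softmax policy; the randomly generated MDP (the tables P and R), the episode length T (episodes are simply truncated after T steps), and all hyperparameters are arbitrary illustrative choices rather than anything specified by the text.

    import numpy as np

    rng = np.random.default_rng(1)

    # A tiny random MDP so the sketch runs on its own; the tables are arbitrary.
    n_states, n_actions, gamma, alpha, T = 4, 3, 0.9, 0.05, 20
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = dist. over next state
    R = rng.normal(size=(n_states, n_actions))                        # reward r(s, a)

    def pi(theta, s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    theta = np.zeros((n_states, n_actions))
    for episode in range(500):
        # Generate an episode {s0, a0, r1, ..., s_{T-1}, a_{T-1}, r_T} following pi(theta).
        s = rng.integers(n_states)
        states, actions, rewards = [], [], []
        for _ in range(T):
            a = rng.choice(n_actions, p=pi(theta, s))     # on-policy action sampling
            states.append(s); actions.append(a); rewards.append(R[s, a])
            s = rng.choice(n_states, p=P[s, a])
        # Update theta once per visited (s_t, a_t), as in Algorithm 9.1.
        g = 0.0
        for t in reversed(range(T)):
            g = gamma * g + rewards[t]                    # value update: Monte Carlo return q_t(s_t, a_t)
            s_t, a_t = states[t], actions[t]
            grad_log = -pi(theta, s_t)
            grad_log[a_t] += 1.0                          # row s_t of grad_theta ln pi(a_t|s_t, theta)
            theta[s_t] += alpha * g * grad_log            # policy update of Algorithm 9.1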