
9.4 Monte Carlo policy gradient (REINFORCE)

With the gradient given in Theorem 9.1, we next show how to apply gradient-based methods to maximize the metrics and thereby obtain optimal policies.

The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta J(\theta_t) \\
&= \theta_t + \alpha\, \mathbb{E}\big[ \nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A) \big],
\end{aligned} \tag{9.31}
$$

where $\alpha > 0$ is a constant learning rate. Since the true gradient in (9.31) is unknown, we can replace it with a stochastic gradient to obtain the following algorithm:

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t), \tag{9.32}
$$

where $q_t(s_t, a_t)$ is an approximation of $q_\pi(s_t, a_t)$. If $q_t(s_t, a_t)$ is obtained by Monte Carlo estimation, the algorithm is called REINFORCE [68] or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms.
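To make the update (9.32) concrete, the following sketch implements one stochastic gradient-ascent step for a tabular softmax parameterization of the policy. The softmax form, the parameter array `theta`, and the externally supplied estimate `q_t` are assumptions made for this illustration only; the result does not depend on a particular parameterization.

```python
import numpy as np

def softmax_policy(theta, s):
    # pi(.|s, theta): softmax over the preferences of state s (tabular parameterization).
    prefs = theta[s] - theta[s].max()      # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # Gradient of ln pi(a|s, theta): 1{a'=a} - pi(a'|s) for the entries of state s, zero elsewhere.
    grad = np.zeros_like(theta)
    grad[s] = -softmax_policy(theta, s)
    grad[s, a] += 1.0
    return grad

def policy_gradient_step(theta, s_t, a_t, q_t, alpha=0.1):
    # One stochastic gradient-ascent step of (9.32); q_t is supplied by the caller.
    return theta + alpha * grad_log_pi(theta, s_t, a_t) * q_t
```

In REINFORCE, the caller obtains $q_t(s_t, a_t)$ as the discounted return collected from time $t$ onward, as in Algorithm 9.1 below.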

The algorithm in (9.32) is important since many other policy gradient algorithms can be obtained by extending it. We next examine the interpretation of (9.32) more closely. Since $\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)}$, we can rewrite (9.32) as

$$
\theta_{t+1} = \theta_t + \alpha \underbrace{\left( \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)} \right)}_{\beta_t} \nabla_\theta \pi(a_t \mid s_t, \theta_t),
$$

which can be further written concisely as

$$
\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t \mid s_t, \theta_t). \tag{9.33}
$$

Two important interpretations can be seen from this equation.

First, since (9.33) takes a gradient-ascent step along $\nabla_\theta \pi(a_t \mid s_t, \theta_t)$, the following observations can be made.

  • If $\beta_t \geq 0$, the probability of choosing $(s_t, a_t)$ is enhanced. That is,

$$
\pi(a_t \mid s_t, \theta_{t+1}) \geq \pi(a_t \mid s_t, \theta_t).
$$

The greater $\beta_t$ is, the stronger the enhancement is.

  • If $\beta_t < 0$, the probability of choosing $(s_t, a_t)$ decreases. That is,

$$
\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t).
$$

The above observations can be proven as follows. When $\theta_{t+1} - \theta_t$ is sufficiently small, it follows from the Taylor expansion that

$$
\begin{aligned}
\pi(a_t \mid s_t, \theta_{t+1}) &\approx \pi(a_t \mid s_t, \theta_t) + \big( \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big)^T (\theta_{t+1} - \theta_t) \\
&= \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \big( \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big)^T \big( \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big) \quad \text{(substituting (9.33))} \\
&= \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \big\| \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big\|_2^2.
\end{aligned}
$$

It is clear that $\pi(a_t \mid s_t, \theta_{t+1}) \geq \pi(a_t \mid s_t, \theta_t)$ when $\beta_t \geq 0$ and $\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t)$ when $\beta_t < 0$.
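This monotonicity can also be checked numerically. The snippet below is a small sketch using a softmax policy over three actions in a single state (all numbers chosen arbitrarily); it performs one step of (9.32) with a positive and a negative value estimate and prints the probability of $a_t$ before and after the update. Since $\pi > 0$, the sign of $\beta_t$ equals the sign of $q_t$.

```python
import numpy as np

theta = np.array([[0.2, -0.1, 0.5]])   # one state, three actions (arbitrary values)
s, a, alpha = 0, 1, 0.1

def pi(th):
    e = np.exp(th[s] - th[s].max())
    return e / e.sum()

def grad_ln_pi(th):
    g = np.zeros_like(th)
    g[s] = -pi(th)
    g[s, a] += 1.0
    return g

for q_t in (2.0, -2.0):                # positive vs. negative value estimate
    theta_next = theta + alpha * grad_ln_pi(theta) * q_t
    print(f"q_t = {q_t:+.1f}: pi(a_t|s_t) before = {pi(theta)[a]:.4f}, after = {pi(theta_next)[a]:.4f}")
```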

Algorithm 9.1: Policy Gradient by Monte Carlo (REINFORCE)
Initialization: Initial parameter $\theta$; $\gamma \in (0,1)$; $\alpha > 0$.
Goal: Learn an optimal policy for maximizing $J(\theta)$.
For each episode, do
    Generate an episode $\{s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\}$ following $\pi(\theta)$.
    For $t = 0, 1, \ldots, T-1$:
        Value update: $q_t(s_t, a_t) = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$
        Policy update: $\theta \leftarrow \theta + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta)\, q_t(s_t, a_t)$

Second, the algorithm can strike a balance between exploration and exploitation to a certain extent due to the expression of

$$
\beta_t = \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}.
$$

On the one hand, $\beta_t$ is proportional to $q_t(s_t, a_t)$. As a result, if the action value of $(s_t, a_t)$ is large, then $\pi(a_t \mid s_t, \theta_t)$ is enhanced so that the probability of selecting $a_t$ increases. Therefore, the algorithm attempts to exploit actions with greater values. On the other hand, $\beta_t$ is inversely proportional to $\pi(a_t \mid s_t, \theta_t)$ when $q_t(s_t, a_t) > 0$. As a result, if the probability of selecting $a_t$ is small, then $\beta_t$ is large and $\pi(a_t \mid s_t, \theta_t)$ is enhanced more strongly, so that the probability of selecting $a_t$ increases. Therefore, the algorithm attempts to explore actions with low probabilities.
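The following toy computation (the numbers are arbitrary) illustrates both effects: for a fixed selection probability, a larger value estimate yields a larger $\beta_t$ (exploitation), while for a fixed positive value estimate, a smaller selection probability yields a larger $\beta_t$ (exploration).

```python
# Exploitation: fixed pi(a_t|s_t), larger q_t  -> larger beta_t.
# Exploration:  fixed q_t > 0, smaller pi(a_t|s_t) -> larger beta_t.
for q_t, pi_a in [(2.0, 0.5), (0.5, 0.5), (1.0, 0.8), (1.0, 0.1)]:
    beta_t = q_t / pi_a                 # coefficient in (9.33)
    print(f"q_t = {q_t:.1f}, pi(a_t|s_t) = {pi_a:.1f} -> beta_t = {beta_t:.2f}")
```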

Moreover, since (9.32) uses samples to approximate the true gradient in (9.31), it is important to understand how the samples should be obtained.

\diamond How to sample $S$? The state $S$ in the true gradient $\mathbb{E}[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)]$ should obey the distribution $\eta$, which is either the stationary distribution $d_\pi$ or the discounted total probability distribution $\rho_\pi$ in (9.19). Either $d_\pi$ or $\rho_\pi$ represents the long-term behavior exhibited under $\pi$.
\diamond How to sample $A$? The action $A$ in $\mathbb{E}[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)]$ should obey the distribution $\pi(A \mid S, \theta)$. The ideal way to sample $A$ is to select $a_t$ following $\pi(a \mid s_t, \theta_t)$. Therefore, the policy gradient algorithm is on-policy.

Unfortunately, these ideal ways of sampling $S$ and $A$ are not strictly followed in practice because they use samples inefficiently. A more sample-efficient implementation of (9.32) is given in Algorithm 9.1. In this implementation, an episode is first generated by following $\pi(\theta)$. Then, $\theta$ is updated multiple times, once for every experience sample in the episode.
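Below is a minimal Python sketch of Algorithm 9.1 for a tabular softmax policy. The environment interface (`reset()` returning an integer state and `step(a)` returning `(next_state, reward, done)`) is a placeholder assumption rather than any specific library API, and the hyperparameter values are illustrative only.

```python
import numpy as np

def reinforce(env, num_states, num_actions, gamma=0.99, alpha=0.01, num_episodes=1000):
    # Tabular softmax policy: theta[s, a] is the preference of action a in state s.
    theta = np.zeros((num_states, num_actions))

    def pi(s):
        e = np.exp(theta[s] - theta[s].max())
        return e / e.sum()

    for _ in range(num_episodes):
        # Generate an episode {s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T} following pi(theta).
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=pi(s))
            s_next, r, done = env.step(a)      # assumed interface: (next_state, reward, done)
            states.append(s)
            actions.append(a)
            rewards.append(r)                  # rewards[t] corresponds to r_{t+1}
            s = s_next

        # Value update: q_t(s_t, a_t) = sum_{k=t+1}^{T} gamma^{k-t-1} r_k, computed backward in time.
        T = len(states)
        returns = np.zeros(T)
        g = 0.0
        for t in reversed(range(T)):
            g = gamma * g + rewards[t]
            returns[t] = g

        # Policy update: one stochastic gradient-ascent step (9.32) per time step t = 0, ..., T-1.
        for t in range(T):
            s_t, a_t, q_t = states[t], actions[t], returns[t]
            grad = np.zeros_like(theta)
            grad[s_t] = -pi(s_t)               # gradient of ln pi(a_t|s_t, theta) ...
            grad[s_t, a_t] += 1.0              # ... for the softmax parameterization
            theta += alpha * grad * q_t

    return theta
```

As described above, the episode is generated first, and $\theta$ is then updated once for every time step of that episode.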