
9.4 Monte Carlo policy gradient (REINFORCE)

With the gradient given in Theorem 9.1, we next show how to apply gradient-based methods to maximize the metrics and thereby obtain optimal policies.

The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta J(\theta_t) \\
&= \theta_t + \alpha\, \mathbb{E}\big[ \nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A) \big],
\end{aligned} \tag{9.31}
$$

where $\alpha > 0$ is a constant learning rate. Since the true gradient in (9.31) is unknown, we can replace it with a stochastic gradient to obtain the following algorithm:

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q_t(s_t, a_t), \tag{9.32}
$$

where $q_t(s_t, a_t)$ is an approximation of $q_\pi(s_t, a_t)$. If $q_t(s_t, a_t)$ is obtained by Monte Carlo estimation, the algorithm is called REINFORCE [68] or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms.
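To make the update (9.32) concrete, the following sketch implements one stochastic gradient-ascent step for a tabular softmax parameterization of the policy. The softmax form, the parameter array `theta`, and the externally supplied estimate `q_t` are assumptions made for this illustration only; the result does not depend on a particular parameterization.

```python
import numpy as np

def softmax_policy(theta, s):
    # pi(.|s, theta): softmax over the preferences of state s (tabular parameterization).
    prefs = theta[s] - theta[s].max()      # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # Gradient of ln pi(a|s, theta): 1{a'=a} - pi(a'|s) for the entries of state s, zero elsewhere.
    grad = np.zeros_like(theta)
    grad[s] = -softmax_policy(theta, s)
    grad[s, a] += 1.0
    return grad

def policy_gradient_step(theta, s_t, a_t, q_t, alpha=0.1):
    # One stochastic gradient-ascent step of (9.32); q_t is supplied by the caller.
    return theta + alpha * grad_log_pi(theta, s_t, a_t) * q_t
```

In REINFORCE, the caller obtains $q_t(s_t, a_t)$ as the discounted return collected from time $t$ onward, as in Algorithm 9.1 below.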

The algorithm in (9.32) is important since many other policy gradient algorithms can be obtained by extending it. We next examine the interpretation of (9.32) more closely. Since $\nabla_\theta \ln \pi(a_t \mid s_t, \theta_t) = \frac{\nabla_\theta \pi(a_t \mid s_t, \theta_t)}{\pi(a_t \mid s_t, \theta_t)}$, we can rewrite (9.32) as

$$
\theta_{t+1} = \theta_t + \alpha \underbrace{\left( \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)} \right)}_{\beta_t} \nabla_\theta \pi(a_t \mid s_t, \theta_t),
$$

which can be further written concisely as

$$
\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_\theta \pi(a_t \mid s_t, \theta_t). \tag{9.33}
$$

Two important interpretations can be seen from this equation.

First, since (9.33) takes a gradient-ascent step along $\nabla_\theta \pi(a_t \mid s_t, \theta_t)$, the following observations can be made.

  • If $\beta_t \geq 0$, the probability of choosing $(s_t, a_t)$ is enhanced. That is,

$$
\pi(a_t \mid s_t, \theta_{t+1}) \geq \pi(a_t \mid s_t, \theta_t).
$$

The greater $\beta_t$ is, the stronger the enhancement is.

  • If $\beta_t < 0$, the probability of choosing $(s_t, a_t)$ decreases. That is,

$$
\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t).
$$

The above observations can be proven as follows. When $\theta_{t+1} - \theta_t$ is sufficiently small, it follows from the Taylor expansion that

$$
\begin{aligned}
\pi(a_t \mid s_t, \theta_{t+1}) &\approx \pi(a_t \mid s_t, \theta_t) + \big( \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big)^T (\theta_{t+1} - \theta_t) \\
&= \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \big( \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big)^T \big( \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big) \quad \text{(substituting (9.33))} \\
&= \pi(a_t \mid s_t, \theta_t) + \alpha \beta_t \big\| \nabla_\theta \pi(a_t \mid s_t, \theta_t) \big\|_2^2.
\end{aligned}
$$

It is clear that $\pi(a_t \mid s_t, \theta_{t+1}) \geq \pi(a_t \mid s_t, \theta_t)$ when $\beta_t \geq 0$ and $\pi(a_t \mid s_t, \theta_{t+1}) < \pi(a_t \mid s_t, \theta_t)$ when $\beta_t < 0$.
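This monotonicity can also be checked numerically. The snippet below is a small sketch using a softmax policy over three actions in a single state (all numbers chosen arbitrarily); it performs one step of (9.32) with a positive and a negative value estimate and prints the probability of $a_t$ before and after the update. Since $\pi > 0$, the sign of $\beta_t$ equals the sign of $q_t$.

```python
import numpy as np

theta = np.array([[0.2, -0.1, 0.5]])   # one state, three actions (arbitrary values)
s, a, alpha = 0, 1, 0.1

def pi(th):
    e = np.exp(th[s] - th[s].max())
    return e / e.sum()

def grad_ln_pi(th):
    g = np.zeros_like(th)
    g[s] = -pi(th)
    g[s, a] += 1.0
    return g

for q_t in (2.0, -2.0):                # positive vs. negative value estimate
    theta_next = theta + alpha * grad_ln_pi(theta) * q_t
    print(f"q_t = {q_t:+.1f}: pi(a_t|s_t) before = {pi(theta)[a]:.4f}, after = {pi(theta_next)[a]:.4f}")
```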

Algorithm 9.1: Policy Gradient by Monte Carlo (REINFORCE)
Initialization: Initial parameter $\theta$; $\gamma \in (0,1)$; $\alpha > 0$.
Goal: Learn an optimal policy for maximizing $J(\theta)$.
For each episode, do
    Generate an episode $\{s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\}$ following $\pi(\theta)$.
    For $t = 0, 1, \ldots, T-1$:
        Value update: $q_t(s_t, a_t) = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$
        Policy update: $\theta \leftarrow \theta + \alpha \nabla_\theta \ln \pi(a_t \mid s_t, \theta)\, q_t(s_t, a_t)$

Second, the algorithm can strike a balance between exploration and exploitation to a certain extent due to the expression of

$$
\beta_t = \frac{q_t(s_t, a_t)}{\pi(a_t \mid s_t, \theta_t)}.
$$

On the one hand, $\beta_t$ is proportional to $q_t(s_t, a_t)$. As a result, if the action value of $(s_t, a_t)$ is large, then $\pi(a_t \mid s_t, \theta_t)$ is enhanced so that the probability of selecting $a_t$ increases. Therefore, the algorithm attempts to exploit actions with greater values. On the other hand, $\beta_t$ is inversely proportional to $\pi(a_t \mid s_t, \theta_t)$ when $q_t(s_t, a_t) > 0$. As a result, if the probability of selecting $a_t$ is small, then $\beta_t$ is large and $\pi(a_t \mid s_t, \theta_t)$ is enhanced more strongly, so that the probability of selecting $a_t$ increases. Therefore, the algorithm attempts to explore actions with low probabilities.
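The following toy computation (the numbers are arbitrary) illustrates both effects: for a fixed selection probability, a larger value estimate yields a larger $\beta_t$ (exploitation), while for a fixed positive value estimate, a smaller selection probability yields a larger $\beta_t$ (exploration).

```python
# Exploitation: fixed pi(a_t|s_t), larger q_t  -> larger beta_t.
# Exploration:  fixed q_t > 0, smaller pi(a_t|s_t) -> larger beta_t.
for q_t, pi_a in [(2.0, 0.5), (0.5, 0.5), (1.0, 0.8), (1.0, 0.1)]:
    beta_t = q_t / pi_a                 # coefficient in (9.33)
    print(f"q_t = {q_t:.1f}, pi(a_t|s_t) = {pi_a:.1f} -> beta_t = {beta_t:.2f}")
```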

Moreover, since (9.32) uses samples to approximate the true gradient in (9.31), it is important to understand how the samples should be obtained.

\diamond How to sample $S$? The state $S$ in the true gradient $\mathbb{E}[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)]$ should obey the distribution $\eta$, which is either the stationary distribution $d_\pi$ or the discounted total probability distribution $\rho_\pi$ in (9.19). Either $d_\pi$ or $\rho_\pi$ represents the long-term behavior exhibited under $\pi$.
\diamond How to sample $A$? The action $A$ in $\mathbb{E}[\nabla_\theta \ln \pi(A \mid S, \theta_t)\, q_\pi(S, A)]$ should obey the distribution $\pi(A \mid S, \theta)$. The ideal way to sample $A$ is to select $a_t$ following $\pi(a \mid s_t, \theta_t)$. Therefore, the policy gradient algorithm is on-policy.

Unfortunately, these ideal ways of sampling $S$ and $A$ are not strictly followed in practice because they use samples inefficiently. A more sample-efficient implementation of (9.32) is given in Algorithm 9.1. In this implementation, an episode is first generated by following $\pi(\theta)$. Then, $\theta$ is updated multiple times, once for every experience sample in the episode.
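Below is a minimal Python sketch of Algorithm 9.1 for a tabular softmax policy. The environment interface (`reset()` returning an integer state and `step(a)` returning `(next_state, reward, done)`) is a placeholder assumption rather than any specific library API, and the hyperparameter values are illustrative only.

```python
import numpy as np

def reinforce(env, num_states, num_actions, gamma=0.99, alpha=0.01, num_episodes=1000):
    # Tabular softmax policy: theta[s, a] is the preference of action a in state s.
    theta = np.zeros((num_states, num_actions))

    def pi(s):
        e = np.exp(theta[s] - theta[s].max())
        return e / e.sum()

    for _ in range(num_episodes):
        # Generate an episode {s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T} following pi(theta).
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(num_actions, p=pi(s))
            s_next, r, done = env.step(a)      # assumed interface: (next_state, reward, done)
            states.append(s)
            actions.append(a)
            rewards.append(r)                  # rewards[t] corresponds to r_{t+1}
            s = s_next

        # Value update: q_t(s_t, a_t) = sum_{k=t+1}^{T} gamma^{k-t-1} r_k, computed backward in time.
        T = len(states)
        returns = np.zeros(T)
        g = 0.0
        for t in reversed(range(T)):
            g = gamma * g + rewards[t]
            returns[t] = g

        # Policy update: one stochastic gradient-ascent step (9.32) per time step t = 0, ..., T-1.
        for t in range(T):
            s_t, a_t, q_t = states[t], actions[t], returns[t]
            grad = np.zeros_like(theta)
            grad[s_t] = -pi(s_t)               # gradient of ln pi(a_t|s_t, theta) ...
            grad[s_t, a_t] += 1.0              # ... for the softmax parameterization
            theta += alpha * grad * q_t

    return theta
```

As described above, the episode is generated first, and $\theta$ is then updated once for every time step of that episode.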