
7.3 TD learning of action values: $n$-step Sarsa

This section introduces $n$-step Sarsa, an extension of Sarsa. We will see that Sarsa and MC learning are two extreme cases of $n$-step Sarsa.

Recall that the definition of the action value is

$$
q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a], \tag{7.16}
$$

where $G_t$ is the discounted return satisfying

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots.
$$

In fact, $G_t$ can also be decomposed into different forms:

$$
\begin{aligned}
\text{Sarsa} \longleftarrow \quad G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), \\
G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\
&\;\;\vdots \\
n\text{-step Sarsa} \longleftarrow \quad G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n}), \\
&\;\;\vdots \\
\text{MC} \longleftarrow \quad G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots
\end{aligned}
$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$.
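To see why such decompositions are valid, recall that the return satisfies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. Unrolling this recursion $n$ times and taking the conditional expectation allows the tail return to be replaced by an action value, since $\mathbb{E}[G_{t+n} \mid S_{t+n}, A_{t+n}] = q_\pi(S_{t+n}, A_{t+n})$ by definition. For example, with $n = 2$,

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma G_{t+1} = R_{t+1} + \gamma R_{t+2} + \gamma^2 G_{t+2}, \\
\mathbb{E}[G_t \mid S_t = s, A_t = a] &= \mathbb{E}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}) \,\big|\, S_t = s, A_t = a \big],
\end{aligned}
$$

where the second equality uses the Markov property.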

Substituting the different decompositions $G_t^{(n)}$ of $G_t$ into $q_\pi(s, a)$ in (7.16) results in different algorithms.

$\diamond$ When $n = 1$, we have

$$
q_\pi(s, a) = \mathbb{E}[G_t^{(1)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a].
$$

The corresponding stochastic approximation algorithm for solving this equation is

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big) \Big],
$$

which is the Sarsa algorithm in (7.12).
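As an illustration beyond the text, the following is a minimal Python sketch of this one-step update; the q-table `q` (a dictionary keyed by state-action pairs), the step size `alpha`, and the discount factor `gamma` are assumed to be supplied by the surrounding training loop.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One-step Sarsa: move q(s, a) toward the TD target r + gamma * q(s', a')."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] = q[(s, a)] - alpha * (q[(s, a)] - td_target)
```

The update is written in the same "estimate minus correction" form as the equation above, rather than the equivalent form $q + \alpha(\text{target} - q)$.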

$\diamond$ When $n = \infty$, we have

$$
q_\pi(s, a) = \mathbb{E}[G_t^{(\infty)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s, a].
$$

The corresponding algorithm for solving this equation is

$$
q_{t+1}(s_t, a_t) = g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots,
$$

where $g_t$ is a sample of $G_t$. In fact, this is the MC learning algorithm, which approximates the action value of $(s_t, a_t)$ using the discounted return of an episode starting from $(s_t, a_t)$.
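For comparison, a minimal sketch of how the sampled return $g_t$ could be computed is given below; the list `rewards`, assumed to hold the sampled rewards $r_{t+1}, r_{t+2}, \ldots$ of one episode starting from $(s_t, a_t)$, is an illustrative data structure rather than something defined in the text.

```python
def mc_return(rewards, gamma):
    """Sampled discounted return g_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backward so each step multiplies by gamma once
        g = r + gamma * g
    return g
```

The estimate $q_{t+1}(s_t, a_t)$ is then set directly to this value.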

$\diamond$ For a general value of $n$, we have

$$
q_\pi(s, a) = \mathbb{E}[G_t^{(n)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a].
$$

The corresponding algorithm for solving the above equation is

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n q_t(s_{t+n}, a_{t+n}) \big) \Big]. \tag{7.17}
$$

This algorithm is called $n$-step Sarsa.

In summary, $n$-step Sarsa is a more general algorithm because it becomes the (one-step) Sarsa algorithm when $n = 1$ and the MC learning algorithm when $n = \infty$ (by setting $\alpha_t = 1$).

To implement the $n$-step Sarsa algorithm in (7.17), we need the experience samples $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$. Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not yet been collected at time $t$, we have to wait until time $t + n$ to update the q-value of $(s_t, a_t)$. To that end, (7.17) can be rewritten as

$$
q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t) \Big[ q_{t+n-1}(s_t, a_t) - \big( r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n}) \big) \Big],
$$

where $q_{t+n}(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$ at time $t + n$.
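As a sketch of how this delayed update might be organized (the buffer layout and names below are illustrative assumptions, not part of the text), one can keep the last $n$ transitions together with the pair $(s_{t+n}, a_{t+n})$ in a small buffer and update $q(s_t, a_t)$ only once the buffer is full:

```python
def n_step_sarsa_update(q, buffer, n, alpha, gamma):
    """n-step Sarsa update of q(s_t, a_t) using n stored transitions.

    `buffer` is assumed to contain n tuples (s_k, a_k, r_{k+1}) for k = t, ..., t+n-1,
    followed by the pair (s_{t+n}, a_{t+n}) stored as (s, a, None), so len(buffer) == n + 1.
    The q-table `q` is a dictionary keyed by state-action pairs.
    """
    s_t, a_t, _ = buffer[0]
    s_tn, a_tn, _ = buffer[-1]
    # n-step target: r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n})
    target = sum(gamma**k * buffer[k][2] for k in range(n)) + gamma**n * q[(s_tn, a_tn)]
    q[(s_t, a_t)] = q[(s_t, a_t)] - alpha * (q[(s_t, a_t)] - target)
```

In a rollout, such a buffer can be maintained with a fixed-length structure such as `collections.deque(maxlen=n + 1)`, appending the most recent transition at every step and firing the update whenever the buffer is full, which reproduces the time indexing in the equation above.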

Since $n$-step Sarsa includes Sarsa and MC learning as two extreme cases, it is not surprising that its performance lies between that of Sarsa and MC learning. In particular, if $n$ is large, $n$-step Sarsa is close to MC learning: the estimate has a relatively high variance but a small bias. If $n$ is small, $n$-step Sarsa is close to Sarsa: the estimate has a relatively large bias but a low variance. Finally, the $n$-step Sarsa algorithm presented here merely performs policy evaluation. It must be combined with a policy improvement step to learn optimal policies. The implementation is similar to that of Sarsa and is omitted here. Interested readers may check [3, Chapter 7] for a detailed analysis of multi-step TD learning.