
7.3 TD learning of action values: $n$-step Sarsa

This section introduces $n$-step Sarsa, an extension of Sarsa. We will see that Sarsa and MC learning are two extreme cases of $n$-step Sarsa.

Recall that the definition of the action value is

$$
q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a], \tag{7.16}
$$

where $G_t$ is the discounted return satisfying

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots.
$$

In fact, $G_t$ can also be decomposed into different forms:

$$
\begin{aligned}
\text{Sarsa} \longleftarrow \quad G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), \\
G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\
&\;\;\vdots \\
n\text{-step Sarsa} \longleftarrow \quad G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n}), \\
&\;\;\vdots \\
\text{MC} \longleftarrow \quad G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots
\end{aligned}
$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$.
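To see why such decompositions are valid, recall that the return satisfies the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. Unrolling this recursion $n$ times and taking the conditional expectation allows the tail return to be replaced by an action value, since $\mathbb{E}[G_{t+n} \mid S_{t+n}, A_{t+n}] = q_\pi(S_{t+n}, A_{t+n})$ by definition. For example, with $n = 2$,

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma G_{t+1} = R_{t+1} + \gamma R_{t+2} + \gamma^2 G_{t+2}, \\
\mathbb{E}[G_t \mid S_t = s, A_t = a] &= \mathbb{E}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}) \,\big|\, S_t = s, A_t = a \big],
\end{aligned}
$$

where the second equality uses the Markov property.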

Substituting the different decompositions $G_t^{(n)}$ of $G_t$ into $q_\pi(s, a)$ in (7.16) results in different algorithms.

$\diamond$ When $n = 1$, we have

$$
q_\pi(s, a) = \mathbb{E}[G_t^{(1)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a].
$$

The corresponding stochastic approximation algorithm for solving this equation is

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}) \big) \Big],
$$

which is the Sarsa algorithm in (7.12).
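As an illustration beyond the text, the following is a minimal Python sketch of this one-step update; the q-table `q` (a dictionary keyed by state-action pairs), the step size `alpha`, and the discount factor `gamma` are assumed to be supplied by the surrounding training loop.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """One-step Sarsa: move q(s, a) toward the TD target r + gamma * q(s', a')."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] = q[(s, a)] - alpha * (q[(s, a)] - td_target)
```

The update is written in the same "estimate minus correction" form as the equation above, rather than the equivalent form $q + \alpha(\text{target} - q)$.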

$\diamond$ When $n = \infty$, we have

$$
q_\pi(s, a) = \mathbb{E}[G_t^{(\infty)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s, a].
$$

The corresponding algorithm for solving this equation is

$$
q_{t+1}(s_t, a_t) = g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots,
$$

where $g_t$ is a sample of $G_t$. In fact, this is the MC learning algorithm, which approximates the action value of $(s_t, a_t)$ using the discounted return of an episode starting from $(s_t, a_t)$.
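For comparison, a minimal sketch of how the sampled return $g_t$ could be computed is given below; the list `rewards`, assumed to hold the sampled rewards $r_{t+1}, r_{t+2}, \ldots$ of one episode starting from $(s_t, a_t)$, is an illustrative data structure rather than something defined in the text.

```python
def mc_return(rewards, gamma):
    """Sampled discounted return g_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backward so each step multiplies by gamma once
        g = r + gamma * g
    return g
```

The estimate $q_{t+1}(s_t, a_t)$ is then set directly to this value.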

$\diamond$ For a general value of $n$, we have

$$
q_\pi(s, a) = \mathbb{E}[G_t^{(n)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a].
$$

The corresponding algorithm for solving the above equation is

$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - \big( r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n q_t(s_{t+n}, a_{t+n}) \big) \Big]. \tag{7.17}
$$

This algorithm is called $n$-step Sarsa.

In summary, $n$-step Sarsa is a more general algorithm because it becomes the (one-step) Sarsa algorithm when $n = 1$ and the MC learning algorithm when $n = \infty$ (by setting $\alpha_t = 1$).

To implement the $n$-step Sarsa algorithm in (7.17), we need the experience samples $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$. Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not yet been collected at time $t$, we have to wait until time $t + n$ to update the q-value of $(s_t, a_t)$. To that end, (7.17) can be rewritten as

$$
q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t) \Big[ q_{t+n-1}(s_t, a_t) - \big( r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n}) \big) \Big],
$$

where $q_{t+n}(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$ at time $t + n$.
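As a sketch of how this delayed update might be organized (the buffer layout and names below are illustrative assumptions, not part of the text), one can keep the last $n$ transitions together with the pair $(s_{t+n}, a_{t+n})$ in a small buffer and update $q(s_t, a_t)$ only once the buffer is full:

```python
def n_step_sarsa_update(q, buffer, n, alpha, gamma):
    """n-step Sarsa update of q(s_t, a_t) using n stored transitions.

    `buffer` is assumed to contain n tuples (s_k, a_k, r_{k+1}) for k = t, ..., t+n-1,
    followed by the pair (s_{t+n}, a_{t+n}) stored as (s, a, None), so len(buffer) == n + 1.
    The q-table `q` is a dictionary keyed by state-action pairs.
    """
    s_t, a_t, _ = buffer[0]
    s_tn, a_tn, _ = buffer[-1]
    # n-step target: r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q(s_{t+n}, a_{t+n})
    target = sum(gamma**k * buffer[k][2] for k in range(n)) + gamma**n * q[(s_tn, a_tn)]
    q[(s_t, a_t)] = q[(s_t, a_t)] - alpha * (q[(s_t, a_t)] - target)
```

In a rollout, such a buffer can be maintained with a fixed-length structure such as `collections.deque(maxlen=n + 1)`, appending the most recent transition at every step and firing the update whenever the buffer is full, which reproduces the time indexing in the equation above.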

Since $n$-step Sarsa includes Sarsa and MC learning as two extreme cases, it is not surprising that its performance lies between that of Sarsa and MC learning. In particular, if $n$ is large, $n$-step Sarsa is close to MC learning: the estimate has a relatively high variance but a small bias. If $n$ is small, $n$-step Sarsa is close to Sarsa: the estimate has a relatively large bias but a low variance. Finally, the $n$-step Sarsa algorithm presented here merely performs policy evaluation. It must be combined with a policy improvement step to learn optimal policies. The implementation is similar to that of Sarsa and is omitted here. Interested readers may check [3, Chapter 7] for a detailed analysis of multi-step TD learning.