7.3 TD learning of action values: n-step Sarsa
This section introduces n-step Sarsa, an extension of Sarsa. We will see that Sarsa and MC learning are two extreme cases of n-step Sarsa.
Recall that the definition of the action value is
$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a], \tag{7.16}$$

where $G_t$ is the discounted return satisfying

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

In fact, $G_t$ can also be decomposed into different forms:

$$\begin{aligned}
\text{Sarsa} \longleftarrow \quad G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),\\
G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),\\
&\;\;\vdots\\
n\text{-step Sarsa} \longleftarrow \quad G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n}),\\
&\;\;\vdots\\
\text{MC} \longleftarrow \quad G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots
\end{aligned}$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = \dots = G_t^{(n)} = \dots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$.
Substituting the different decompositions $G_t^{(n)}$ into $q_\pi(s,a)$ in (7.16) results in different algorithms.
⋄ When $n = 1$, we have

$$q_\pi(s,a) = \mathbb{E}\big[G_t^{(1)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a\big].$$

The corresponding stochastic approximation algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big],$$

which is the Sarsa algorithm in (7.12).
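As a concrete illustration, here is a minimal Python sketch of this update rule; the function name, the dictionary-based q-table, and the argument names are illustrative assumptions rather than notation from the text.

```python
# A minimal sketch of the one-step Sarsa update; q is assumed to map
# (state, action) pairs to current value estimates (e.g., a defaultdict(float)).
def sarsa_update(q, s, a, r_next, s_next, a_next, alpha, gamma):
    """Move q(s, a) toward the one-step TD target r_{t+1} + gamma * q(s_{t+1}, a_{t+1})."""
    td_target = r_next + gamma * q[(s_next, a_next)]
    q[(s, a)] = q[(s, a)] - alpha * (q[(s, a)] - td_target)
```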
⋄ When $n = \infty$, we have

$$q_\pi(s,a) = \mathbb{E}\big[G_t^{(\infty)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid s, a\big].$$

The corresponding algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots,$$

where $g_t$ is a sample of $G_t$. In fact, this is the MC learning algorithm, which approximates the action value of $(s_t, a_t)$ by the discounted return of an episode starting from $(s_t, a_t)$.
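The sample return $g_t$ can be computed by accumulating the episode's rewards backwards. The sketch below is illustrative only: it assumes `rewards` holds the sampled rewards $r_{t+1}, r_{t+2}, \dots$ of one episode starting from $(s_t, a_t)$.

```python
# A minimal sketch of the MC estimate: q(s_t, a_t) is approximated by the sampled
# discounted return g_t of an episode that starts from (s_t, a_t).
def mc_estimate(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: g_k = r_{k+1} + gamma * g_{k+1}
        g = r + gamma * g
    return g                      # this value is assigned directly to q(s_t, a_t)
```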
⋄ For a general value of $n$, we have

$$q_\pi(s,a) = \mathbb{E}\big[G_t^{(n)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a\big].$$

The corresponding algorithm for solving the above equation is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big]. \tag{7.17}$$

This algorithm is called n-step Sarsa.
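For illustration, a minimal Python sketch of update (7.17) follows; `rewards` is assumed to contain the $n$ sampled rewards $r_{t+1}, \dots, r_{t+n}$, and the dict-based q-table is again a hypothetical choice.

```python
# A minimal sketch of the n-step Sarsa update (7.17).
def n_step_sarsa_update(q, s, a, rewards, s_n, a_n, alpha, gamma):
    """rewards = [r_{t+1}, ..., r_{t+n}]; (s_n, a_n) is the pair visited at time t+n."""
    n = len(rewards)
    # n-step target: discounted rewards plus the bootstrapped tail gamma^n * q(s_{t+n}, a_{t+n}).
    target = sum(gamma**i * r for i, r in enumerate(rewards)) + gamma**n * q[(s_n, a_n)]
    q[(s, a)] = q[(s, a)] - alpha * (q[(s, a)] - target)
```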
In summary, n-step Sarsa is a more general algorithm because it becomes the (one-step) Sarsa algorithm when $n = 1$ and the MC learning algorithm when $n = \infty$ (by setting $\alpha_t = 1$).
To implement the n-step Sarsa algorithm in (7.17), we need the experience samples $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \dots, r_{t+n}, s_{t+n}, a_{t+n})$. Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we must wait until time $t+n$ to update the q-value of $(s_t, a_t)$. To that end, (7.17) can be rewritten as

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t)\Big[q_{t+n-1}(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big],$$

where $q_{t+n}(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$ at time $t+n$.
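In practice, this delayed update amounts to keeping a sliding window of the most recent $n$ transitions and updating the q-value of the oldest pair once the window is full. The following sketch illustrates that bookkeeping for policy evaluation under a fixed policy; the environment interface (`env.step`), the `policy` function, and the omission of episode termination are simplifying assumptions, not part of the text.

```python
from collections import defaultdict, deque

# A bookkeeping sketch of n-step Sarsa for policy evaluation: q(s_t, a_t) is updated
# at time t+n, once r_{t+1}, ..., r_{t+n} and (s_{t+n}, a_{t+n}) have been collected.
# Assumed interfaces: env.step(a) -> (next_state, reward) and policy(s) -> action;
# terminal-state handling is omitted for brevity.
def n_step_sarsa_evaluation(env, policy, s0, n, alpha, gamma, num_steps):
    q = defaultdict(float)
    window = deque()                         # the last n transitions (s_k, a_k, r_{k+1})
    s, a = s0, policy(s0)
    for _ in range(num_steps):
        s_next, r = env.step(a)
        a_next = policy(s_next)
        window.append((s, a, r))
        if len(window) == n:                 # time t+n reached for the oldest pair (s_t, a_t)
            s_t, a_t, _ = window[0]
            target = sum(gamma**i * r_i for i, (_, _, r_i) in enumerate(window))
            target += gamma**n * q[(s_next, a_next)]
            q[(s_t, a_t)] -= alpha * (q[(s_t, a_t)] - target)
            window.popleft()
        s, a = s_next, a_next
    return q
```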
Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, it is not surprising that its performance lies between that of Sarsa and MC learning. In particular, if $n$ is large, n-step Sarsa is close to MC learning: the estimate has a relatively high variance but a small bias. If $n$ is small, n-step Sarsa is close to Sarsa: the estimate has a relatively large bias but a low variance. Finally, the n-step Sarsa algorithm presented here is merely for policy evaluation. It must be combined with a policy improvement step to learn optimal policies. The implementation is similar to that of Sarsa and is omitted here. Interested readers may check [3, Chapter 7] for a detailed analysis of multi-step TD learning.