
7.2 TD learning of action values: Sarsa

The TD algorithm introduced in Section 7.1 can only estimate state values. This section introduces another TD algorithm called Sarsa that can directly estimate action values. Estimating action values is important because it can be combined with a policy improvement step to learn optimal policies.

7.2.1 Algorithm description

Given a policy $\pi$, our goal is to estimate the action values. Suppose that we have some experience samples generated by following $\pi$: $(s_0, a_0, r_1, s_1, a_1, \ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \ldots)$. We can use the following Sarsa algorithm to estimate the action values:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big], \tag{7.12}$$
$$q_{t+1}(s, a) = q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t),$$

where $t = 0, 1, 2, \ldots$ and $\alpha_t(s_t, a_t)$ is the learning rate. Here, $q_t(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$. At time $t$, only the q-value of $(s_t, a_t)$ is updated, whereas the q-values of all the other state-action pairs remain the same.
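To make the update rule concrete, the following is a minimal Python sketch of (7.12) for the tabular case; the dictionary `q` and the default parameter values are our own illustrative choices rather than notation from this chapter.

```python
# A minimal sketch of the Sarsa update (7.12) for a tabular q-value estimate.
# q maps (state, action) pairs to their current estimates; only the visited
# pair (s_t, a_t) is modified, and all other entries are left unchanged.
def sarsa_update(q, s_t, a_t, r_next, s_next, a_next, alpha=0.1, gamma=0.9):
    """Move q[(s_t, a_t)] toward the TD target r_{t+1} + gamma * q_t(s_{t+1}, a_{t+1})."""
    td_target = r_next + gamma * q[(s_next, a_next)]
    q[(s_t, a_t)] -= alpha * (q[(s_t, a_t)] - td_target)
```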

Some important properties of the Sarsa algorithm are discussed as follows.

\diamond Why is this algorithm called "Sarsa"? That is because each iteration of the algorithm requires $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$. Sarsa is an abbreviation for state-action-reward-state-action. The Sarsa algorithm was first proposed in [35] and its name was coined by [3].
\diamond Why is Sarsa designed in this way? One may have noticed that Sarsa is similar to the TD algorithm in (7.1). In fact, Sarsa can be easily obtained from the TD algorithm by replacing state value estimation with action value estimation.
\diamond What does Sarsa do mathematically? Similar to the TD algorithm in (7.1), Sarsa is a stochastic approximation algorithm for solving the Bellman equation of a given policy:

$$q_\pi(s, a) = \mathbb{E}\big[R + \gamma q_\pi(S', A') \mid s, a\big], \quad \text{for all } (s, a). \tag{7.13}$$

Equation (7.13) is the Bellman equation expressed in terms of action values. A proof is given in Box 7.3.
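To see in what sense Sarsa is a stochastic approximation algorithm for (7.13), one can argue as in Section 7.1 (a brief sketch; the notation $g$ and $\tilde{g}$ below is ours). For a fixed pair $(s, a)$, define

$$g\big(q(s, a)\big) \doteq q(s, a) - \mathbb{E}\big[R + \gamma q_\pi(S', A') \mid s, a\big],$$

so that (7.13) is the root-finding problem $g(q_\pi(s, a)) = 0$. The Sarsa update in (7.12) is a Robbins-Monro-type iteration for this problem: it replaces the unavailable expectation with the noisy observation $\tilde{g} = q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)$ and performs $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\,\tilde{g}$, with the caveat that the observation uses the current estimate $q_t$ rather than $q_\pi$, just as in the analysis of the TD algorithm in Section 7.1.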

Box 7.3: Showing that (7.13) is the Bellman equation

As introduced in Section 2.8.2, the Bellman equation expressed in terms of action values is

$$\begin{aligned}
q_\pi(s, a) &= \sum_r r\, p(r \mid s, a) + \gamma \sum_{s'} \sum_{a'} q_\pi(s', a')\, p(s' \mid s, a)\, \pi(a' \mid s') \\
&= \sum_r r\, p(r \mid s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} q_\pi(s', a')\, \pi(a' \mid s').
\end{aligned} \tag{7.14}$$

This equation establishes the relationships among the action values. Since

$$\begin{aligned}
p(s', a' \mid s, a) &= p(s' \mid s, a)\, p(a' \mid s', s, a) \\
&= p(s' \mid s, a)\, p(a' \mid s') \quad (\text{due to conditional independence}) \\
&\doteq p(s' \mid s, a)\, \pi(a' \mid s'),
\end{aligned}$$

(7.14) can be rewritten as

$$q_\pi(s, a) = \sum_r r\, p(r \mid s, a) + \gamma \sum_{s'} \sum_{a'} q_\pi(s', a')\, p(s', a' \mid s, a).$$

By the definition of the expected value, the above equation is equivalent to (7.13). Hence, (7.13) is the Bellman equation.

\diamond Is Sarsa convergent? Since Sarsa is the action-value version of the TD algorithm in (7.1), the convergence result is similar to Theorem 7.1 and is given below.

Theorem 7.2 (Convergence of Sarsa). Given a policy $\pi$, by the Sarsa algorithm in (7.12), $q_t(s, a)$ converges almost surely to the action value $q_\pi(s, a)$ as $t \to \infty$ for all $(s, a)$ if $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ for all $(s, a)$.

The proof is similar to that of Theorem 7.1 and is omitted here. The conditions $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ must hold for all $(s, a)$. In particular, $\sum_t \alpha_t(s, a) = \infty$ requires that every state-action pair be visited an infinite number of (or sufficiently many) times. At time $t$, if $(s, a) = (s_t, a_t)$, then $\alpha_t(s, a) > 0$; otherwise, $\alpha_t(s, a) = 0$.
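As an illustration of these conditions (our own example, not part of the Sarsa algorithm itself), a per-pair schedule such as $\alpha_t(s, a) = 1/N_t(s, a)$, where $N_t(s, a)$ counts visits to $(s, a)$, satisfies both requirements for every pair that is visited infinitely often, since the harmonic series diverges while the sum of its squares converges:

```python
# A minimal sketch of a visit-count learning-rate schedule alpha_t(s, a) = 1 / N_t(s, a).
# For every pair visited infinitely often, sum_t alpha_t(s, a) diverges (harmonic series)
# while sum_t alpha_t(s, a)^2 converges, matching the conditions of Theorem 7.2.
from collections import defaultdict

visit_count = defaultdict(int)  # N(s, a)

def learning_rate(s, a):
    """Return alpha_t(s, a) for the pair (s, a) visited at time t."""
    visit_count[(s, a)] += 1
    return 1.0 / visit_count[(s, a)]
```

Note that Algorithm 7.1 below instead uses a constant learning rate, which does not satisfy the second condition but is common in practice because it keeps adapting as the policy changes.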

7.2.2 Optimal policy learning via Sarsa

The Sarsa algorithm in (7.12) can only estimate the action values of a given policy. To find optimal policies, we can combine it with a policy improvement step. The combination is also often called Sarsa, and its implementation procedure is given in Algorithm 7.1.

Algorithm 7.1: Optimal policy learning by Sarsa

Initialization: $\alpha_t(s, a) = \alpha > 0$ for all $(s, a)$ and all $t$. $\epsilon \in (0, 1)$. Initial $q_0(s, a)$ for all $(s, a)$. Initial $\epsilon$-greedy policy $\pi_0$ derived from $q_0$.

Goal: Learn an optimal policy that can lead the agent to the target state from an initial state $s_0$.

For each episode, do

Generate $a_0$ at $s_0$ following $\pi_0(s_0)$

If $s_t$ ($t = 0, 1, 2, \ldots$) is not the target state, do

Collect an experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}, s_{t+1}$ by interacting with the environment; generate $a_{t+1}$ following $\pi_t(s_{t+1})$.

Update $q$-value for $(s_t, a_t)$:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big]$$

Update policy for $s_t$:

$$\pi_{t+1}(a \mid s_t) = \begin{cases} 1 - \dfrac{\epsilon}{|\mathcal{A}(s_t)|}\big(|\mathcal{A}(s_t)| - 1\big), & \text{if } a = \arg\max_a q_{t+1}(s_t, a), \\[1ex] \dfrac{\epsilon}{|\mathcal{A}(s_t)|}, & \text{otherwise}. \end{cases}$$

$s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$

As shown in Algorithm 7.1, each iteration has two steps. The first step is to update the q-value of the visited state-action pair. The second step is to update the policy to an ϵ\epsilon -greedy one. The q-value update step only updates the single state-action pair visited at time tt . Afterward, the policy of sts_t is immediately updated. Therefore, we do not evaluate a given policy sufficiently well before updating the policy. This is based on the idea of generalized policy iteration. Moreover, after the policy is updated, the policy is immediately used to generate the next experience sample. The policy here is ϵ\epsilon -greedy so that it is exploratory.
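To show how the two steps fit together in code, here is a minimal Python sketch of Algorithm 7.1 for a tabular problem. The environment interface (`env.reset()` returning the starting state and `env.step(a)` returning `(next_state, reward, done)`), the action list, and the parameter values are assumptions made for illustration; they are not specified by the algorithm itself. Sampling actions $\epsilon$-greedily from the current q-values induces the same distribution as the explicit policy-update formula in Algorithm 7.1.

```python
# A minimal sketch of Algorithm 7.1: Sarsa combined with epsilon-greedy policy improvement.
# The environment API (env.reset, env.step) and the hyperparameter values are illustrative
# assumptions; only the update logic follows the algorithm in the text.
import random
from collections import defaultdict

def epsilon_greedy_action(q, state, actions, epsilon):
    """Greedy action w.p. 1 - eps*(|A|-1)/|A|, every other action w.p. eps/|A|."""
    if random.random() < epsilon:
        return random.choice(actions)                      # exploratory branch
    return max(actions, key=lambda a: q[(state, a)])       # greedy branch

def sarsa_control(env, actions, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    q = defaultdict(float)                                 # q_0(s, a) = 0 for all (s, a)
    for _ in range(num_episodes):
        s = env.reset()                                    # each episode starts from s_0
        a = epsilon_greedy_action(q, s, actions, epsilon)  # a_0 ~ pi_0(s_0)
        done = False
        while not done:                                    # until the target state is reached
            s_next, r, done = env.step(a)                  # collect r_{t+1}, s_{t+1}
            a_next = epsilon_greedy_action(q, s_next, actions, epsilon)  # a_{t+1} ~ pi_t(s_{t+1})
            # q-value update for the visited pair (s_t, a_t), as in (7.12)
            td_target = r + gamma * q[(s_next, a_next)]
            q[(s, a)] -= alpha * (q[(s, a)] - td_target)
            # The policy update is implicit: future actions are drawn epsilon-greedily
            # from the updated q, which realizes pi_{t+1}(. | s_t).
            s, a = s_next, a_next
    return q
```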

A simulation example is shown in Figure 7.2 to demonstrate the Sarsa algorithm. Unlike all the tasks we have seen in this book, the task here aims to find an optimal path from a specific starting state to a target state. It does not aim to find the optimal policies for all states. This task is often encountered in practice where the starting state (e.g., home) and the target state (e.g., workplace) are fixed, and we only need to find an optimal path connecting them. This task is relatively simple because we only need to explore the states that are close to the path and do not need to explore all the states. However, if we do not explore all the states, the final path may be locally optimal rather than globally optimal.

The simulation setup and simulation results are discussed below.

Figure 7.2: An example for demonstrating Sarsa. All the episodes start from the top-left state and terminate when reaching the target state (the blue cell). The goal is to find an optimal path from the starting state to the target state. The reward settings are $r_{\text{target}} = 0$, $r_{\text{forbidden}} = r_{\text{boundary}} = -10$, and $r_{\text{other}} = -1$. The learning rate is $\alpha = 0.1$ and the value of $\epsilon$ is 0.1. The left figure shows the final policy obtained by the algorithm. The right figures show the total reward and length of every episode.

\diamond Simulation setup: In this example, all the episodes start from the top-left state and terminate at the target state. The reward settings are $r_{\text{target}} = 0$, $r_{\text{forbidden}} = r_{\text{boundary}} = -10$, and $r_{\text{other}} = -1$. Moreover, $\alpha_t(s, a) = 0.1$ for all $t$ and $\epsilon = 0.1$. The initial guesses of the action values are $q_0(s, a) = 0$ for all $(s, a)$. The initial policy has a uniform distribution: $\pi_0(a \mid s) = 0.2$ for all $s, a$.

\diamond Learned policy: The left figure in Figure 7.2 shows the final policy learned by Sarsa. As can be seen, this policy can successfully lead to the target state from the starting state. However, the policies of some other states may not be optimal. That is because the other states are not well explored.
\diamond Total reward of each episode: The top-right subfigure in Figure 7.2 shows the total reward of each episode. Here, the total reward is the non-discounted sum of all immediate rewards. As can be seen, the total reward of each episode increases gradually. That is because the initial policy is not good and hence negative rewards are frequently obtained. As the policy becomes better, the total reward increases.
\diamond Length of each episode: The bottom-right subfigure in Figure 7.2 shows that the length of each episode drops gradually. That is because the initial policy is not good and may take many detours before reaching the target. As the policy becomes better, the length of the trajectory becomes shorter. Notably, the length of an episode may increase abruptly (e.g., the 460th episode), and the corresponding total reward also drops sharply. That is because the policy is $\epsilon$-greedy, and there is a chance for it to take non-optimal actions. One way to resolve this problem is to use a decaying $\epsilon$ whose value gradually converges to zero (a simple schedule is sketched below).
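A simple way to implement such a decaying $\epsilon$ is sketched below; the exponential decay rate and the floor value are illustrative assumptions, not values used in the experiment above.

```python
# A minimal sketch of a decaying-epsilon schedule; 0.999 and eps_min are illustrative choices.
def decayed_epsilon(episode, eps_start=0.1, decay=0.999, eps_min=0.0):
    """Exponentially shrink epsilon toward eps_min as the episode index grows."""
    return max(eps_min, eps_start * decay ** episode)
```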

Finally, Sarsa also has some variants such as Expected Sarsa. Interested readers may check Box 7.4.

Box 7.4: Expected Sarsa

Given a policy $\pi$, its action values can be evaluated by Expected Sarsa, which is a variant of Sarsa. The Expected Sarsa algorithm is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\, \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big],$$
$$q_{t+1}(s, a) = q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t),$$

where

$$\mathbb{E}\big[q_t(s_{t+1}, A)\big] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$

is the expected value of $q_t(s_{t+1}, a)$ under policy $\pi_t$. The expression of the Expected Sarsa algorithm is very similar to that of Sarsa. The two algorithms differ only in their TD targets: the TD target in Expected Sarsa is $r_{t+1} + \gamma\, \mathbb{E}[q_t(s_{t+1}, A)]$, while that of Sarsa is $r_{t+1} + \gamma\, q_t(s_{t+1}, a_{t+1})$. Since the algorithm involves an expected value, it is called Expected Sarsa. Although calculating the expected value slightly increases the computational complexity, it is beneficial in the sense that it reduces the estimation variance: the random variables involved are reduced from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ in Sarsa to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$.
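For comparison with the Sarsa sketch given earlier, a minimal sketch of the Expected Sarsa update for the tabular case is shown below; the dictionaries `q` and `pi` (the policy's action probabilities) and the parameter values are illustrative assumptions.

```python
# A minimal sketch of the Expected Sarsa update for a tabular setting.
# q maps (state, action) to value estimates; pi maps (state, action) to pi_t(a | s).
def expected_sarsa_update(q, pi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Move q[(s, a)] toward the TD target r + gamma * E_{A ~ pi(.|s_next)}[q(s_next, A)]."""
    expected_q = sum(pi[(s_next, b)] * q[(s_next, b)] for b in actions)
    td_target = r + gamma * expected_q
    q[(s, a)] -= alpha * (q[(s, a)] - td_target)
```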

Similar to the TD learning algorithm in (7.1), Expected Sarsa can be viewed as a stochastic approximation algorithm for solving the following equation:

$$q_\pi(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma\, \mathbb{E}\big[q_\pi(S_{t+1}, A_{t+1}) \mid S_{t+1}\big] \,\Big|\, S_t = s, A_t = a\Big], \quad \text{for all } s, a. \tag{7.15}$$

The above equation may look strange at first glance. In fact, it is another expression of the Bellman equation. To see that, substituting

$$\mathbb{E}\big[q_\pi(S_{t+1}, A_{t+1}) \mid S_{t+1}\big] = \sum_{A'} q_\pi(S_{t+1}, A')\, \pi(A' \mid S_{t+1}) = v_\pi(S_{t+1})$$

into (7.15) gives

$$q_\pi(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma v_\pi(S_{t+1}) \,\Big|\, S_t = s, A_t = a\Big],$$

which is clearly the Bellman equation.

The implementation of Expected Sarsa is similar to that of Sarsa. More details can be found in [3, 36, 37].