
8.3 TD learning of action values based on function approximation

While Section 8.2 addressed the problem of state value estimation, the present section shows how to estimate action values. The tabular Sarsa and tabular Q-learning algorithms are extended to the case of value function approximation. Readers will see that the extension is straightforward.

8.3.1 Sarsa with function approximation

The Sarsa algorithm with function approximation can be readily obtained from (8.13) by replacing the state values with action values. In particular, suppose that $q_\pi(s,a)$ is approximated by $\hat{q}(s,a,w)$. Replacing $\hat{v}(s,w)$ in (8.13) by $\hat{q}(s,a,w)$ gives

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t). \tag{8.35}$$

The analysis of (8.35) is similar to that of (8.13) and is omitted here. When linear functions are used, we have

$$\hat{q}(s,a,w) = \phi^T(s,a)\, w,$$

where $\phi(s,a)$ is a feature vector. In this case, $\nabla_w \hat{q}(s,a,w) = \phi(s,a)$.
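
To make the linear case concrete, the following Python sketch implements one parameter update of (8.35) under the assumption that a user-supplied feature function `phi(s, a)` returns $\phi(s,a)$ as a NumPy array; the function name and signature are illustrative, not prescribed by the text.

```python
import numpy as np

def sarsa_linear_update(w, phi, s, a, r, s_next, a_next, alpha=0.001, gamma=0.9):
    """One parameter update of (8.35) with the linear approximator
    q_hat(s, a, w) = phi(s, a)^T w, so that grad_w q_hat(s, a, w) = phi(s, a).
    `phi(s, a)` is assumed to return a 1-D NumPy feature vector."""
    td_target = r + gamma * (phi(s_next, a_next) @ w)  # r_{t+1} + gamma * q_hat(s', a', w)
    td_error = td_target - phi(s, a) @ w               # TD error in the brackets of (8.35)
    return w + alpha * td_error * phi(s, a)            # gradient step along phi(s, a)
```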

The value estimation step in (8.35) can be combined with a policy improvement step to learn optimal policies. The procedure is summarized in Algorithm 8.2, and a code sketch of the episode loop is given after the algorithm. It should be noted that accurately estimating the action values of a given policy requires running (8.35) sufficiently many times. Here, however, (8.35) is executed only once before switching to the policy improvement step, just as in the tabular Sarsa algorithm. Moreover, the implementation in Algorithm 8.2 aims to find a good path from a prespecified starting state to the target state. As a result, it cannot find the optimal policy for every state. If sufficient experience data are available, however, the implementation can easily be adapted to find optimal policies for every state.

An illustrative example is shown in Figure 8.9. In this example, the task is to find a good policy that can lead the agent to the target when starting from the top-left state. Both the total reward and the length of each episode gradually converge to steady values. Here, the feature vector is chosen as a Fourier basis of order 5; the expression of the Fourier feature vector is given in (8.18).
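
For reference, the sketch below shows one common way to build such a feature vector: Fourier features of the form $\cos(\pi c^T s)$ over a normalized two-dimensional state, stacked into one block per action so that $\hat{q}(s,a,w) = \phi^T(s,a)w$. The exact construction used in the experiments is the one in (8.18); the block layout over actions and the action count of five are assumptions made here for illustration.

```python
import itertools
import numpy as np

def fourier_state_features(s, order=5):
    """Fourier features of a 2-D state normalized to [0, 1]^2:
    phi_i(s) = cos(pi * c_i . s) with c_i ranging over {0, ..., order}^2.
    This is one common construction; the exact form used in the text is (8.18)."""
    s = np.asarray(s, dtype=float)            # e.g. (x, y) already scaled to [0, 1]
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=2)))
    return np.cos(np.pi * coeffs @ s)         # (order + 1)^2 features

def fourier_state_action_features(s, a, num_actions=5, order=5):
    """Stack one copy of the state features per action (block one-hot layout),
    giving a state-action feature vector for q_hat(s, a, w) = phi(s, a)^T w.
    num_actions = 5 assumes the usual grid-world action set and is illustrative."""
    state_feat = fourier_state_features(s, order)
    feat = np.zeros(num_actions * state_feat.size)
    feat[a * state_feat.size:(a + 1) * state_feat.size] = state_feat
    return feat
```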


Figure 8.9: Sarsa with linear function approximation. Here, $\gamma = 0.9$, $\epsilon = 0.1$, $r_{\text{boundary}} = r_{\text{forbidden}} = -10$, $r_{\text{target}} = 1$, and $\alpha = 0.001$.

Algorithm 8.2: Sarsa with function approximation

Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0,1)$.

Goal: Learn an optimal policy that can lead the agent to the target state from an initial state $s_0$.

For each episode, do

Generate $a_0$ at $s_0$ following $\pi_0(s_0)$.

If $s_t$ ($t = 0, 1, 2, \ldots$) is not the target state, do

Collect the experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}, s_{t+1}$ by interacting with the environment; generate $a_{t+1}$ following $\pi_t(s_{t+1})$.

Update q-value:

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t)$$

Update policy:

$$\pi_{t+1}(a|s_t) = 1 - \frac{\epsilon}{|\mathcal{A}(s_t)|}\left(|\mathcal{A}(s_t)| - 1\right) \quad \text{if } a = \arg\max_{a \in \mathcal{A}(s_t)} \hat{q}(s_t, a, w_{t+1})$$

$$\pi_{t+1}(a|s_t) = \frac{\epsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$

$s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
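
As a companion to Algorithm 8.2, here is a minimal Python sketch of one episode. It assumes a hypothetical environment interface `env.step(s, a)` returning `(r, s_next, done)`, where `done` indicates that the target state has been reached, and a feature function `phi(s, a)` such as the one sketched earlier; these names are illustrative rather than part of the algorithm.

```python
import numpy as np

def epsilon_greedy_action(w, phi, s, actions, epsilon, rng):
    """Sample an action from the epsilon-greedy policy induced by q_hat(s, ., w).
    Picking uniformly with probability epsilon yields the same distribution as the
    policy-update step of Algorithm 8.2: the greedy action gets
    1 - eps * (|A| - 1) / |A|, every other action gets eps / |A|."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q_values = [phi(s, a) @ w for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_episode(env, w, phi, s0, actions, alpha=0.001, gamma=0.9, epsilon=0.1, seed=0):
    """One episode of Sarsa with function approximation (Algorithm 8.2).
    env.step(s, a) is assumed to return (r, s_next, done)."""
    rng = np.random.default_rng(seed)
    s = s0
    a = epsilon_greedy_action(w, phi, s, actions, epsilon, rng)
    done = False
    while not done:
        r, s_next, done = env.step(s, a)
        a_next = epsilon_greedy_action(w, phi, s_next, actions, epsilon, rng)
        # Value update (8.35) with linear features: grad_w q_hat(s, a, w) = phi(s, a).
        td_error = r + gamma * (phi(s_next, a_next) @ w) - phi(s, a) @ w
        w = w + alpha * td_error * phi(s, a)
        s, a = s_next, a_next
    return w
```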

8.3.2 Q-learning with function approximation

Tabular Q-learning can also be extended to the case of function approximation. The update rule is

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t). \tag{8.36}$$

The above update rule is similar to (8.35) except that $\hat{q}(s_{t+1}, a_{t+1}, w_t)$ in (8.35) is replaced with $\max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t)$.
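
Under the same linear-feature assumptions as before (a hypothetical `phi(s, a)` returning a NumPy array), a sketch of the update (8.36) differs from the Sarsa sketch only in how the TD target is formed:

```python
def q_learning_linear_update(w, phi, s, a, r, s_next, actions, alpha=0.001, gamma=0.9):
    """One parameter update of (8.36) with linear features, where
    grad_w q_hat(s, a, w) = phi(s, a). The TD target bootstraps with the
    greedy value max_a q_hat(s_next, a, w) instead of q_hat(s_next, a_next, w)."""
    max_q_next = max(phi(s_next, b) @ w for b in actions)  # greedy bootstrap value
    td_error = r + gamma * max_q_next - phi(s, a) @ w
    return w + alpha * td_error * phi(s, a)
```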

Similar to the tabular case, (8.36) can be implemented in either an on-policy or an off-policy fashion. An on-policy version is given in Algorithm 8.3, and an example demonstrating it is shown in Figure 8.10. In this example, the task is to find a good policy that can lead the agent to the target state from the top-left state.

Algorithm 8.3: Q-learning with function approximation (on-policy version)

Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0,1)$.

Goal: Learn an optimal path that can lead the agent to the target state from an initial state $s_0$.

For each episode, do

If $s_t$ ($t = 0, 1, 2, \ldots$) is not the target state, do

Collect the experience sample $(a_t, r_{t+1}, s_{t+1})$ given $s_t$: generate $a_t$ following $\pi_t(s_t)$; generate $r_{t+1}, s_{t+1}$ by interacting with the environment.

Update q-value:

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t)$$

Update policy:

$$\pi_{t+1}(a|s_t) = 1 - \frac{\epsilon}{|\mathcal{A}(s_t)|}\left(|\mathcal{A}(s_t)| - 1\right) \quad \text{if } a = \arg\max_{a \in \mathcal{A}(s_t)} \hat{q}(s_t, a, w_{t+1})$$

$$\pi_{t+1}(a|s_t) = \frac{\epsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$

$s_t \leftarrow s_{t+1}$
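
A matching episode-level sketch of Algorithm 8.3 is given below, under the same assumed `env.step(s, a) -> (r, s_next, done)` interface and feature function `phi(s, a)` as before. Compared with the Sarsa loop, the action is generated at the current state inside the loop and the TD target uses the greedy action value.

```python
import numpy as np

def q_learning_episode(env, w, phi, s0, actions, alpha=0.001, gamma=0.9, epsilon=0.1, seed=0):
    """One episode of on-policy Q-learning with function approximation (Algorithm 8.3).
    env.step(s, a) is assumed to return (r, s_next, done)."""
    rng = np.random.default_rng(seed)
    s, done = s0, False
    while not done:
        # epsilon-greedy action from the current estimate q_hat(s, ., w)
        if rng.random() < epsilon:
            a = actions[rng.integers(len(actions))]
        else:
            a = actions[int(np.argmax([phi(s, b) @ w for b in actions]))]
        r, s_next, done = env.step(s, a)
        # Q-learning update (8.36): bootstrap with the greedy action value.
        max_q_next = max(phi(s_next, b) @ w for b in actions)
        w = w + alpha * (r + gamma * max_q_next - phi(s, a) @ w) * phi(s, a)
        s = s_next
    return w
```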

As can be seen, Q-learning with linear function approximation can successfully learn an optimal policy. Here, Fourier basis functions of order five are again used as the linear features. The off-policy version will be demonstrated when we introduce deep Q-learning in Section 8.4.


Figure 8.10: Q-learning with linear function approximation. Here, $\gamma = 0.9$, $\epsilon = 0.1$, $r_{\text{boundary}} = r_{\text{forbidden}} = -10$, $r_{\text{target}} = 1$, and $\alpha = 0.001$.

One may notice in Algorithm 8.2 and Algorithm 8.3 that, although the values are represented as functions, the policy $\pi(a|s)$ is still represented as a table. Thus, it still assumes finite numbers of states and actions. In Chapter 9, we will see that the policies can be represented as functions so that continuous state and action spaces can be handled.