
8.3 TD learning of action values based on function approximation

While Section 8.2 addressed the problem of state value estimation, the present section shows how to estimate action values. The tabular Sarsa and tabular Q-learning algorithms are extended to the case of value function approximation. Readers will see that the extension is straightforward.

8.3.1 Sarsa with function approximation

The Sarsa algorithm with function approximation can be readily obtained from (8.13) by replacing the state values with action values. In particular, suppose that $q_\pi(s,a)$ is approximated by $\hat{q}(s,a,w)$. Replacing $\hat{v}(s,w)$ in (8.13) by $\hat{q}(s,a,w)$ gives

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t). \tag{8.35}$$

The analysis of (8.35) is similar to that of (8.13) and is omitted here. When linear functions are used, we have

$$\hat{q}(s,a,w) = \phi^T(s,a)\, w,$$

where $\phi(s,a)$ is a feature vector. In this case, $\nabla_w \hat{q}(s,a,w) = \phi(s,a)$.
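
To make the linear case concrete, the following Python sketch implements one parameter update of (8.35) under the assumption that a user-supplied feature function `phi(s, a)` returns $\phi(s,a)$ as a NumPy array; the function name and signature are illustrative, not prescribed by the text.

```python
import numpy as np

def sarsa_linear_update(w, phi, s, a, r, s_next, a_next, alpha=0.001, gamma=0.9):
    """One parameter update of (8.35) with the linear approximator
    q_hat(s, a, w) = phi(s, a)^T w, so that grad_w q_hat(s, a, w) = phi(s, a).
    `phi(s, a)` is assumed to return a 1-D NumPy feature vector."""
    td_target = r + gamma * (phi(s_next, a_next) @ w)  # r_{t+1} + gamma * q_hat(s', a', w)
    td_error = td_target - phi(s, a) @ w               # TD error in the brackets of (8.35)
    return w + alpha * td_error * phi(s, a)            # gradient step along phi(s, a)
```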

The value estimation step in (8.35) can be combined with a policy improvement step to learn optimal policies. The procedure is summarized in Algorithm 8.2, and a code sketch of the episode loop is given after the algorithm. It should be noted that accurately estimating the action values of a given policy requires running (8.35) sufficiently many times. Here, however, (8.35) is executed only once before switching to the policy improvement step, just as in the tabular Sarsa algorithm. Moreover, the implementation in Algorithm 8.2 aims to find a good path from a prespecified starting state to the target state. As a result, it cannot find the optimal policy for every state. If sufficient experience data are available, however, the implementation can easily be adapted to find optimal policies for every state.

An illustrative example is shown in Figure 8.9. In this example, the task is to find a good policy that can lead the agent to the target when starting from the top-left state. Both the total reward and the length of each episode gradually converge to steady values. Here, the feature vector is chosen as a Fourier basis of order 5; the expression of the Fourier feature vector is given in (8.18).
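
For reference, the sketch below shows one common way to build such a feature vector: Fourier features of the form $\cos(\pi c^T s)$ over a normalized two-dimensional state, stacked into one block per action so that $\hat{q}(s,a,w) = \phi^T(s,a)w$. The exact construction used in the experiments is the one in (8.18); the block layout over actions and the action count of five are assumptions made here for illustration.

```python
import itertools
import numpy as np

def fourier_state_features(s, order=5):
    """Fourier features of a 2-D state normalized to [0, 1]^2:
    phi_i(s) = cos(pi * c_i . s) with c_i ranging over {0, ..., order}^2.
    This is one common construction; the exact form used in the text is (8.18)."""
    s = np.asarray(s, dtype=float)            # e.g. (x, y) already scaled to [0, 1]
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=2)))
    return np.cos(np.pi * coeffs @ s)         # (order + 1)^2 features

def fourier_state_action_features(s, a, num_actions=5, order=5):
    """Stack one copy of the state features per action (block one-hot layout),
    giving a state-action feature vector for q_hat(s, a, w) = phi(s, a)^T w.
    num_actions = 5 assumes the usual grid-world action set and is illustrative."""
    state_feat = fourier_state_features(s, order)
    feat = np.zeros(num_actions * state_feat.size)
    feat[a * state_feat.size:(a + 1) * state_feat.size] = state_feat
    return feat
```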


Figure 8.9: Sarsa with linear function approximation. Here, $\gamma = 0.9$, $\epsilon = 0.1$, $r_{\text{boundary}} = r_{\text{forbidden}} = -10$, $r_{\text{target}} = 1$, and $\alpha = 0.001$.

Algorithm 8.2: Sarsa with function approximation

Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0,1)$.

Goal: Learn an optimal policy that can lead the agent to the target state from an initial state $s_0$.

For each episode, do

Generate $a_0$ at $s_0$ following $\pi_0(s_0)$.

If $s_t$ ($t = 0, 1, 2, \ldots$) is not the target state, do

Collect the experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}, s_{t+1}$ by interacting with the environment; generate $a_{t+1}$ following $\pi_t(s_{t+1})$.

Update q-value:

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t)$$

Update policy:

$$\pi_{t+1}(a|s_t) = 1 - \frac{\epsilon}{|\mathcal{A}(s_t)|}\left(|\mathcal{A}(s_t)| - 1\right) \quad \text{if } a = \arg\max_{a \in \mathcal{A}(s_t)} \hat{q}(s_t, a, w_{t+1})$$

$$\pi_{t+1}(a|s_t) = \frac{\epsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$

$s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
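
As a companion to Algorithm 8.2, here is a minimal Python sketch of one episode. It assumes a hypothetical environment interface `env.step(s, a)` returning `(r, s_next, done)`, where `done` indicates that the target state has been reached, and a feature function `phi(s, a)` such as the one sketched earlier; these names are illustrative rather than part of the algorithm.

```python
import numpy as np

def epsilon_greedy_action(w, phi, s, actions, epsilon, rng):
    """Sample an action from the epsilon-greedy policy induced by q_hat(s, ., w).
    Picking uniformly with probability epsilon yields the same distribution as the
    policy-update step of Algorithm 8.2: the greedy action gets
    1 - eps * (|A| - 1) / |A|, every other action gets eps / |A|."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q_values = [phi(s, a) @ w for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_episode(env, w, phi, s0, actions, alpha=0.001, gamma=0.9, epsilon=0.1, seed=0):
    """One episode of Sarsa with function approximation (Algorithm 8.2).
    env.step(s, a) is assumed to return (r, s_next, done)."""
    rng = np.random.default_rng(seed)
    s = s0
    a = epsilon_greedy_action(w, phi, s, actions, epsilon, rng)
    done = False
    while not done:
        r, s_next, done = env.step(s, a)
        a_next = epsilon_greedy_action(w, phi, s_next, actions, epsilon, rng)
        # Value update (8.35) with linear features: grad_w q_hat(s, a, w) = phi(s, a).
        td_error = r + gamma * (phi(s_next, a_next) @ w) - phi(s, a) @ w
        w = w + alpha * td_error * phi(s, a)
        s, a = s_next, a_next
    return w
```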

8.3.2 Q-learning with function approximation

Tabular Q-learning can also be extended to the case of function approximation. The update rule is

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t). \tag{8.36}$$

The above update rule is similar to (8.35) except that $\hat{q}(s_{t+1}, a_{t+1}, w_t)$ in (8.35) is replaced with $\max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t)$.
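
Under the same linear-feature assumptions as before (a hypothetical `phi(s, a)` returning a NumPy array), a sketch of the update (8.36) differs from the Sarsa sketch only in how the TD target is formed:

```python
def q_learning_linear_update(w, phi, s, a, r, s_next, actions, alpha=0.001, gamma=0.9):
    """One parameter update of (8.36) with linear features, where
    grad_w q_hat(s, a, w) = phi(s, a). The TD target bootstraps with the
    greedy value max_a q_hat(s_next, a, w) instead of q_hat(s_next, a_next, w)."""
    max_q_next = max(phi(s_next, b) @ w for b in actions)  # greedy bootstrap value
    td_error = r + gamma * max_q_next - phi(s, a) @ w
    return w + alpha * td_error * phi(s, a)
```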

Similar to the tabular case, (8.36) can be implemented in either an on-policy or an off-policy fashion. An on-policy version is given in Algorithm 8.3, and an example demonstrating it is shown in Figure 8.10. In this example, the task is to find a good policy that can lead the agent to the target state from the top-left state.

Algorithm 8.3: Q-learning with function approximation (on-policy version)

Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0,1)$.

Goal: Learn an optimal path that can lead the agent to the target state from an initial state $s_0$.

For each episode, do

If $s_t$ ($t = 0, 1, 2, \ldots$) is not the target state, do

Collect the experience sample $(a_t, r_{t+1}, s_{t+1})$ given $s_t$: generate $a_t$ following $\pi_t(s_t)$; generate $r_{t+1}, s_{t+1}$ by interacting with the environment.

Update q-value:

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t)$$

Update policy:

$$\pi_{t+1}(a|s_t) = 1 - \frac{\epsilon}{|\mathcal{A}(s_t)|}\left(|\mathcal{A}(s_t)| - 1\right) \quad \text{if } a = \arg\max_{a \in \mathcal{A}(s_t)} \hat{q}(s_t, a, w_{t+1})$$

$$\pi_{t+1}(a|s_t) = \frac{\epsilon}{|\mathcal{A}(s_t)|} \quad \text{otherwise}$$

$s_t \leftarrow s_{t+1}$
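
A matching episode-level sketch of Algorithm 8.3 is given below, under the same assumed `env.step(s, a) -> (r, s_next, done)` interface and feature function `phi(s, a)` as before. Compared with the Sarsa loop, the action is generated at the current state inside the loop and the TD target uses the greedy action value.

```python
import numpy as np

def q_learning_episode(env, w, phi, s0, actions, alpha=0.001, gamma=0.9, epsilon=0.1, seed=0):
    """One episode of on-policy Q-learning with function approximation (Algorithm 8.3).
    env.step(s, a) is assumed to return (r, s_next, done)."""
    rng = np.random.default_rng(seed)
    s, done = s0, False
    while not done:
        # epsilon-greedy action from the current estimate q_hat(s, ., w)
        if rng.random() < epsilon:
            a = actions[rng.integers(len(actions))]
        else:
            a = actions[int(np.argmax([phi(s, b) @ w for b in actions]))]
        r, s_next, done = env.step(s, a)
        # Q-learning update (8.36): bootstrap with the greedy action value.
        max_q_next = max(phi(s_next, b) @ w for b in actions)
        w = w + alpha * (r + gamma * max_q_next - phi(s, a) @ w) * phi(s, a)
        s = s_next
    return w
```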

As can be seen, Q-learning with linear function approximation can successfully learn an optimal policy. Here, Fourier basis functions of order five are again used as the linear features. The off-policy version will be demonstrated when we introduce deep Q-learning in Section 8.4.


Figure 8.10: Q-learning with linear function approximation. Here, $\gamma = 0.9$, $\epsilon = 0.1$, $r_{\text{boundary}} = r_{\text{forbidden}} = -10$, $r_{\text{target}} = 1$, and $\alpha = 0.001$.

One may notice in Algorithm 8.2 and Algorithm 8.3 that, although the values are represented as functions, the policy $\pi(a|s)$ is still represented as a table. Thus, it still assumes finite numbers of states and actions. In Chapter 9, we will see that the policies can be represented as functions so that continuous state and action spaces can be handled.