
7.1 TD learning of state values

TD learning often refers to a broad class of reinforcement learning algorithms. For example, all the algorithms introduced in this chapter fall into the scope of TD learning. However, TD learning in this section specifically refers to a classic algorithm for estimating state values.

7.1.1 Algorithm description

Given a policy $\pi$, our goal is to estimate $v_\pi(s)$ for all $s \in \mathcal{S}$. Suppose that we have some experience samples $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ generated following $\pi$. Here, $t$ denotes the time step. The following TD algorithm can estimate the state values using these samples:

v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\left[v_t(s_t) - \left(r_{t+1} + \gamma v_t(s_{t+1})\right)\right], \tag{7.1}
v_{t+1}(s) = v_t(s), \quad \text{for all } s \neq s_t, \tag{7.2}

where $t = 0, 1, 2, \ldots$. Here, $v_t(s_t)$ is the estimate of $v_\pi(s_t)$ at time $t$; $\alpha_t(s_t)$ is the learning rate for $s_t$ at time $t$.

It should be noted that, at time $t$, only the value of the visited state $s_t$ is updated. The values of the unvisited states $s \neq s_t$ remain unchanged, as shown in (7.2). Equation (7.2) is often omitted for simplicity, but it should be kept in mind because the algorithm would be mathematically incomplete without this equation.
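
As a concrete illustration, the update in (7.1)-(7.2) takes only a few lines of code in a tabular setting. The following is a minimal sketch (not from the book); the function name td0_state_values, the trajectory format of $(s_t, r_{t+1}, s_{t+1})$ tuples, and the use of a constant learning rate are all assumptions made for illustration.

import numpy as np

def td0_state_values(trajectories, num_states, gamma=0.9, alpha=0.1):
    """Tabular TD(0): estimate v_pi(s) from sampled transitions (s_t, r_{t+1}, s_{t+1}).

    `trajectories` is assumed to be an iterable of trajectories, each a list of
    (s, r, s_next) tuples obtained by following the policy pi.
    """
    v = np.zeros(num_states)                      # initial estimate v_0
    for trajectory in trajectories:
        for s, r, s_next in trajectory:
            td_target = r + gamma * v[s_next]     # r_{t+1} + gamma * v_t(s_{t+1})
            td_error = v[s] - td_target           # v_t(s_t) minus the TD target
            v[s] -= alpha * td_error              # update (7.1); all other entries stay unchanged, as in (7.2)
    return v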

Readers encountering the TD algorithm for the first time may wonder why it is designed this way. In fact, it can be viewed as a special stochastic approximation algorithm for solving the Bellman equation. To see this, first recall the definition of the state value:

v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right], \quad s \in \mathcal{S}. \tag{7.3}

We can rewrite (7.3) as

v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right], \quad s \in \mathcal{S}. \tag{7.4}

That is because $\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_a \pi(a|s) \sum_{s'} p(s'|s,a) v_\pi(s') = \mathbb{E}[v_\pi(S_{t+1}) \mid S_t = s]$. Equation (7.4) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation.

The TD algorithm can be derived by applying the Robbins-Monro algorithm (Chapter 6) to solve the Bellman equation in (7.4). Interested readers can check the details in Box 7.1.

Box 7.1: Derivation of the TD algorithm

We next show that the TD algorithm in (7.1) can be obtained by applying the Robbins-Monro algorithm to solve (7.4).

For state $s_t$, we define a function as

g(v_\pi(s_t)) \doteq v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big].

Then, (7.4) is equivalent to

g(v_\pi(s_t)) = 0.

Our goal is to solve the above equation for $v_\pi(s_t)$ using the Robbins-Monro algorithm. Since we can obtain $r_{t+1}$ and $s_{t+1}$, which are samples of $R_{t+1}$ and $S_{t+1}$, the noisy observation of $g(v_\pi(s_t))$ available to us is

\begin{aligned}
\tilde{g}(v_\pi(s_t)) &= v_\pi(s_t) - \left[r_{t+1} + \gamma v_\pi(s_{t+1})\right] \\
&= \underbrace{\left(v_\pi(s_t) - \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\right]\right)}_{g(v_\pi(s_t))} + \underbrace{\left(\mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\right] - \left[r_{t+1} + \gamma v_\pi(s_{t+1})\right]\right)}_{\eta}.
\end{aligned}

Therefore, the Robbins-Monro algorithm (Section 6.2) for solving $g(v_\pi(s_t)) = 0$ is

\begin{aligned}
v_{t+1}(s_t) &= v_t(s_t) - \alpha_t(s_t)\,\tilde{g}(v_t(s_t)) \\
&= v_t(s_t) - \alpha_t(s_t)\left(v_t(s_t) - \left[r_{t+1} + \gamma v_\pi(s_{t+1})\right]\right),
\end{aligned} \tag{7.5}

where $v_t(s_t)$ is the estimate of $v_\pi(s_t)$ at time $t$, and $\alpha_t(s_t)$ is the learning rate.

The algorithm in (7.5) has a similar expression to the TD algorithm in (7.1). The only difference is that the right-hand side of (7.5) contains $v_\pi(s_{t+1})$, whereas (7.1) contains $v_t(s_{t+1})$. That is because (7.5) is designed to estimate only the state value of $s_t$, under the assumption that the state values of the other states are already known. If we would like to estimate the state values of all the states, then $v_\pi(s_{t+1})$ on the right-hand side should be replaced with $v_t(s_{t+1})$. Then, (7.5) becomes exactly the same as (7.1). Can such a replacement still ensure convergence? The answer is yes, as will be proven in Theorem 7.1.
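
To make the Robbins-Monro view concrete, the following small sketch iterates (7.5) for a single state whose successor value $v_\pi(s_{t+1})$ is assumed known. The numbers (reward distribution, successor value, discount rate) are purely illustrative and not from the book; the iterate converges to $\mathbb{E}[R_{t+1}] + \gamma v_\pi(s_{t+1})$.

import numpy as np

# Illustrative setup: from the state of interest, the reward is R ~ Normal(1.0, 0.5)
# and the (single) successor state has known value v_pi(s') = 2.0.
rng = np.random.default_rng(0)
gamma, v_pi_next = 0.9, 2.0
true_value = 1.0 + gamma * v_pi_next             # E[R] + gamma * v_pi(s') = 2.8

v = 0.0                                          # initial guess for v_pi(s_t)
for k in range(1, 10001):
    r = rng.normal(1.0, 0.5)                     # sample r_{t+1} of R_{t+1}
    g_tilde = v - (r + gamma * v_pi_next)        # noisy observation of g(v)
    v -= (1.0 / k) * g_tilde                     # Robbins-Monro step (7.5) with alpha_k = 1/k

print(f"estimate: {v:.3f}, true value: {true_value:.3f}")   # the estimate approaches 2.8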

7.1.2 Property analysis

Some important properties of the TD algorithm are discussed as follows.

First, we examine the expression of the TD algorithm more closely. In particular, (7.1) can be described as

\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Bigg[\overbrace{v_t(s_t) - \Big(\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar{v}_t}\Big)}^{\text{TD error } \delta_t}\Bigg], \tag{7.6}

where

\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})

is called the TD target and

\delta_t \doteq v_t(s_t) - \bar{v}_t = v_t(s_t) - \left(r_{t+1} + \gamma v_t(s_{t+1})\right)

is called the TD error. It can be seen that the new estimate $v_{t+1}(s_t)$ is a combination of the current estimate $v_t(s_t)$ and the TD error $\delta_t$.
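
Expanding the bracket in (7.6) makes this combination explicit: the new estimate is a weighted average of the current estimate and the TD target,

v_{t+1}(s_t) = \big(1 - \alpha_t(s_t)\big) v_t(s_t) + \alpha_t(s_t)\, \bar{v}_t,

so a small learning rate moves $v_{t+1}(s_t)$ only slightly away from $v_t(s_t)$ toward $\bar{v}_t$ at each step.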

$\diamond$ Why is $\bar{v}_t$ called the TD target?

This is because $\bar{v}_t$ is the target value that the algorithm attempts to drive $v(s_t)$ to. To see that, subtracting $\bar{v}_t$ from both sides of (7.6) gives

\begin{aligned}
v_{t+1}(s_t) - \bar{v}_t &= \left[v_t(s_t) - \bar{v}_t\right] - \alpha_t(s_t)\left[v_t(s_t) - \bar{v}_t\right] \\
&= \left[1 - \alpha_t(s_t)\right]\left[v_t(s_t) - \bar{v}_t\right].
\end{aligned}

Taking the absolute values of both sides of the above equation gives

\left|v_{t+1}(s_t) - \bar{v}_t\right| = \left|1 - \alpha_t(s_t)\right|\left|v_t(s_t) - \bar{v}_t\right|.

Since $\alpha_t(s_t)$ is a small positive number, we have $0 < 1 - \alpha_t(s_t) < 1$. It then follows that

\left|v_{t+1}(s_t) - \bar{v}_t\right| < \left|v_t(s_t) - \bar{v}_t\right|.

The above inequality is important because it indicates that the new value $v_{t+1}(s_t)$ is closer to $\bar{v}_t$ than the old value $v_t(s_t)$. Therefore, this algorithm mathematically drives $v_t(s_t)$ toward $\bar{v}_t$. This is why $\bar{v}_t$ is called the TD target.

$\diamond$ What is the interpretation of the TD error?

First, this error is called temporal-difference because $\delta_t = v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))$ reflects the discrepancy between two time steps $t$ and $t+1$. Second, the TD error is zero in the expectation sense when the state value estimate is accurate. To see that, when $v_t = v_\pi$, the expected value of the TD error is

\begin{aligned}
\mathbb{E}[\delta_t \mid S_t = s_t] &= \mathbb{E}\big[v_\pi(S_t) - (R_{t+1} + \gamma v_\pi(S_{t+1})) \mid S_t = s_t\big] \\
&= v_\pi(s_t) - \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\right] \\
&= 0. \qquad (\text{due to } (7.4))
\end{aligned}

Therefore, the TD error reflects not only the discrepancy between two time steps but also, more importantly, the discrepancy between the estimate $v_t$ and the true state value $v_\pi$.
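
This zero-mean property is easy to check numerically. The following sketch uses a randomly generated Markov reward process (purely illustrative, not from the book), solves the Bellman equation exactly for $v_\pi$, and then estimates $\mathbb{E}[\delta_t \mid S_t = s]$ by sampling; the estimate is close to zero when $v_\pi$ is plugged in and clearly nonzero when a perturbed value function is used.

import numpy as np

rng = np.random.default_rng(1)
n_states, gamma = 5, 0.9

# Illustrative Markov reward process under a fixed policy:
# P[s, s'] is the state-transition probability and R[s] the expected reward from s.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n_states)

# True state values from the Bellman equation v = R + gamma * P v.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P, R)

def mean_td_error(v, s, n_samples=200_000):
    """Monte Carlo estimate of E[delta_t | S_t = s] for a given value estimate v."""
    s_next = rng.choice(n_states, size=n_samples, p=P[s])
    return np.mean(v[s] - (R[s] + gamma * v[s_next]))

v_wrong = v_pi.copy()
v_wrong[0] += 1.0                                # perturb the estimate at state 0
print(mean_td_error(v_pi, s=0))                  # approximately 0
print(mean_td_error(v_wrong, s=0))               # clearly nonzero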

On a more abstract level, the TD error can be interpreted as the innovation, which indicates the new information obtained from the experience sample $(s_t, r_{t+1}, s_{t+1})$. The fundamental idea of TD learning is to correct our current estimate of the state value based on the newly obtained information. Innovation is fundamental in many estimation problems such as Kalman filtering [33, 34].

Second, the TD algorithm in (7.1) can only estimate the state values of a given policy. To find optimal policies, we still need to further calculate the action values and then conduct policy improvement. This will be introduced in Section 7.2. Nevertheless, the TD algorithm introduced in this section is very basic and important for understanding the other algorithms in this chapter.

Third, while both TD learning and MC learning are model-free, what are their advantages and disadvantages? The answers are summarized in Table 7.1.

Table 7.1: A comparison between TD learning and MC learning.
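
To complement the comparison, the following toy experiment (an illustrative random-walk chain, not an example from the book) contrasts the two update styles: TD updates online from each individual transition by bootstrapping on $v_t(s_{t+1})$, whereas MC must wait for an episode to terminate and then uses the full discounted return.

import numpy as np

rng = np.random.default_rng(2)
gamma, n_states = 0.9, 5
TERMINAL = n_states                  # absorbing terminal state with value fixed at 0

def run_episode():
    """One episode of an illustrative random-walk chain under a fixed policy:
    from state s, move left or right with equal probability; reaching the right
    end gives reward 1, leaving the left end gives reward 0; both terminate."""
    s, traj = n_states // 2, []
    while s != TERMINAL:
        s_next = s + rng.choice([-1, 1])
        if s_next < 0:
            r, s_next = 0.0, TERMINAL
        elif s_next == n_states:
            r, s_next = 1.0, TERMINAL
        else:
            r = 0.0
        traj.append((s, r, s_next))
        s = s_next
    return traj

alpha = 0.05
v_td = np.zeros(n_states + 1)        # TD estimates (last entry is the terminal state)
v_mc = np.zeros(n_states + 1)        # MC estimates
for _ in range(5000):
    traj = run_episode()
    # TD: update online from every single transition, bootstrapping on v_td[s_next].
    for s, r, s_next in traj:
        v_td[s] -= alpha * (v_td[s] - (r + gamma * v_td[s_next]))
    # MC: wait until the episode ends, then update each visited state with its full return.
    g = 0.0
    for s, r, _ in reversed(traj):
        g = r + gamma * g
        v_mc[s] -= alpha * (v_mc[s] - g)

print("TD estimates:", np.round(v_td[:n_states], 3))
print("MC estimates:", np.round(v_mc[:n_states], 3))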

7.1.3 Convergence analysis

The convergence analysis of the TD algorithm in (7.1) is given below.

Theorem 7.1 (Convergence of TD learning). Given a policy $\pi$, by the TD algorithm in (7.1), $v_t(s)$ converges almost surely to $v_\pi(s)$ as $t \to \infty$ for all $s \in \mathcal{S}$ if $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ for all $s \in \mathcal{S}$.

Some remarks about $\alpha_t$ are given below. First, the conditions $\sum_t \alpha_t(s) = \infty$ and $\sum_t \alpha_t^2(s) < \infty$ must be valid for all $s \in \mathcal{S}$. Note that, at time $t$, $\alpha_t(s) > 0$ if $s$ is being visited and $\alpha_t(s) = 0$ otherwise. The condition $\sum_t \alpha_t(s) = \infty$ requires the state $s$ to be visited an infinite (or sufficiently large) number of times. This requires either the condition of exploring starts or an exploratory policy so that every state-action pair can possibly be visited many times. Second, the learning rate $\alpha_t$ is often selected as a small positive constant in practice. In this case, the condition $\sum_t \alpha_t^2(s) < \infty$ is no longer valid. When $\alpha$ is constant, it can still be shown that the algorithm converges in the sense of expectation [24, Section 1.5].
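
As a concrete example (an illustration rather than a prescription), the learning rate of each state can be set to the reciprocal of its visit count:

\alpha_t(s) = \frac{1}{n_t(s)}, \quad \text{where } n_t(s) \text{ is the number of times that } s \text{ has been visited up to time } t.

Along the visits to $s$, the learning rates used are $1, \frac{1}{2}, \frac{1}{3}, \ldots$, so $\sum_t \alpha_t(s) = \sum_k \frac{1}{k} = \infty$ and $\sum_t \alpha_t^2(s) = \sum_k \frac{1}{k^2} < \infty$, provided that $s$ is visited infinitely often.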

Box 7.2: Proof of Theorem 7.1

We prove the convergence based on Theorem 6.3 in Chapter 6. To do that, we first need to construct a stochastic process of the form considered in Theorem 6.3. Consider an arbitrary state $s \in \mathcal{S}$. At time $t$, it follows from the TD algorithm in (7.1) that

v_{t+1}(s) = v_t(s) - \alpha_t(s)\Big(v_t(s) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big), \quad \text{if } s = s_t, \tag{7.7}

or

v_{t+1}(s) = v_t(s), \quad \text{if } s \neq s_t. \tag{7.8}

The estimation error is defined as

\Delta_t(s) \doteq v_t(s) - v_\pi(s),

where $v_\pi(s)$ is the state value of $s$ under policy $\pi$. Subtracting $v_\pi(s)$ from both sides of (7.7) gives

\begin{aligned}
\Delta_{t+1}(s) &= (1 - \alpha_t(s))\Delta_t(s) + \alpha_t(s)\big(\underbrace{r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s)}_{\eta_t(s)}\big) \\
&= (1 - \alpha_t(s))\Delta_t(s) + \alpha_t(s)\eta_t(s), \quad s = s_t.
\end{aligned} \tag{7.9}

Subtracting $v_\pi(s)$ from both sides of (7.8) gives

\Delta_{t+1}(s) = \Delta_t(s) = (1 - \alpha_t(s))\Delta_t(s) + \alpha_t(s)\eta_t(s), \qquad s \neq s_t,

whose expression is the same as that of (7.9) except that $\alpha_t(s) = 0$ and $\eta_t(s) = 0$. Therefore, regardless of whether $s = s_t$, we obtain the following unified expression:

\Delta_{t+1}(s) = (1 - \alpha_t(s))\Delta_t(s) + \alpha_t(s)\eta_t(s).

This is the process in Theorem 6.3. Our goal is to show that the three conditions in Theorem 6.3 are satisfied and hence the process converges.

The first condition is valid as assumed in Theorem 7.1. We next show that the second condition is valid, that is, $\|\mathbb{E}[\eta_t(s) \mid \mathcal{H}_t]\|_\infty \leq \gamma \|\Delta_t(s)\|_\infty$ for all $s \in \mathcal{S}$. Here, $\mathcal{H}_t$ represents the historical information (see the definition in Theorem 6.3). Due to the Markov property, $\eta_t(s) = r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s)$ or $\eta_t(s) = 0$ does not depend on the historical information once $s$ is given. As a result, we have $\mathbb{E}[\eta_t(s) \mid \mathcal{H}_t] = \mathbb{E}[\eta_t(s)]$. For $s \neq s_t$, we have $\eta_t(s) = 0$. Then, it is trivial to see that

\left|\mathbb{E}\left[\eta_t(s)\right]\right| = 0 \leq \gamma \|\Delta_t(s)\|_\infty. \tag{7.10}

For $s = s_t$, we have

\begin{aligned}
\mathbb{E}\left[\eta_t(s)\right] &= \mathbb{E}\left[\eta_t(s_t)\right] \\
&= \mathbb{E}\left[r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s_t) \mid s_t\right] \\
&= \mathbb{E}\left[r_{t+1} + \gamma v_t(s_{t+1}) \mid s_t\right] - v_\pi(s_t).
\end{aligned}

Since $v_\pi(s_t) = \mathbb{E}[r_{t+1} + \gamma v_\pi(s_{t+1}) \mid s_t]$, the above equation implies that

\begin{aligned}
\mathbb{E}\left[\eta_t(s)\right] &= \gamma \mathbb{E}\left[v_t(s_{t+1}) - v_\pi(s_{t+1}) \mid s_t\right] \\
&= \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s_t)\left[v_t(s') - v_\pi(s')\right].
\end{aligned}

It follows that

\begin{aligned}
\left|\mathbb{E}[\eta_t(s)]\right| &= \gamma \left|\sum_{s' \in \mathcal{S}} p(s' \mid s_t)\left[v_t(s') - v_\pi(s')\right]\right| \\
&\leq \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s_t) \max_{s' \in \mathcal{S}}\left|v_t(s') - v_\pi(s')\right| \\
&= \gamma \max_{s' \in \mathcal{S}}\left|v_t(s') - v_\pi(s')\right| \\
&= \gamma \|v_t - v_\pi\|_\infty \\
&= \gamma \|\Delta_t(s)\|_\infty.
\end{aligned} \tag{7.11}

Therefore, at time $t$, we know from (7.10) and (7.11) that $|\mathbb{E}[\eta_t(s)]| \leq \gamma \|\Delta_t(s)\|_\infty$ for all $s \in \mathcal{S}$, regardless of whether $s = s_t$. Thus,

\|\mathbb{E}[\eta_t(s)]\|_\infty \leq \gamma \|\Delta_t(s)\|_\infty,

which is the second condition in Theorem 6.3. Finally, regarding the third condition, we have $\operatorname{var}[\eta_t(s) \mid \mathcal{H}_t] = \operatorname{var}[r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s_t) \mid s_t] = \operatorname{var}[r_{t+1} + \gamma v_t(s_{t+1}) \mid s_t]$ for $s = s_t$ and $\operatorname{var}[\eta_t(s) \mid \mathcal{H}_t] = 0$ for $s \neq s_t$. Since $r_{t+1}$ is bounded, the third condition can be proven without difficulty.

The above proof is inspired by [32].