TD learning often refers to a broad class of reinforcement learning algorithms. For example, all the algorithms introduced in this chapter fall into the scope of TD learning. However, TD learning in this section specifically refers to a classic algorithm for estimating state values.
7.1.1 Algorithm description
Given a policy π, our goal is to estimate vπ(s) for all s∈S. Suppose that we have some experience samples (s0,r1,s1,…,st,rt+1,st+1,…) generated following π. Here, t denotes the time step. The following TD algorithm can estimate the state values using these samples:

v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \big],  (7.1)

v_{t+1}(s) = v_t(s),  for all s \neq s_t,  (7.2)
where t=0,1,2,… . Here, vt(st) is the estimate of vπ(st) at time t ; αt(st) is the learning rate for st at time t .
It should be noted that, at time t, only the value of the visited state st is updated. The values of the unvisited states s≠st remain unchanged, as shown in (7.2). Equation (7.2) is often omitted for simplicity, but it should be kept in mind because the algorithm would be mathematically incomplete without this equation.
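To make the update rule concrete, the following Python sketch implements tabular TD learning for policy evaluation. It is only an illustration of (7.1)-(7.2), not code from this book: the environment interface (env.reset, env.step), the policy function, and the constant learning rate are assumptions made for the sketch.

```python
import collections

def td0_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, num_episodes=1000):
    """Tabular TD learning for estimating v_pi, following (7.1)-(7.2).

    Assumed (hypothetical) interface: env.reset() -> s0 and
    env.step(a) -> (s_next, r, done); policy(s) samples an action from pi(.|s).
    """
    v = collections.defaultdict(float)   # v_t(s), initialized to zero for every state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target and TD error for the visited state s_t (see (7.6)).
            td_target = r + gamma * (0.0 if done else v[s_next])
            td_error = v[s] - td_target
            # Update only the visited state, as in (7.1); all other entries
            # of v are carried over unchanged, which is the role of (7.2).
            v[s] = v[s] - alpha * td_error
            s = s_next
    return v
```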
Readers who see the TD learning algorithm for the first time may wonder why it is designed like this. In fact, it can be viewed as a special stochastic approximation algorithm for solving the Bellman equation. To see that, first recall that the definition of the state value is
vπ(s)=E[Rt+1+γGt+1∣St=s],s∈S.(7.3)
We can rewrite (7.3) as
vπ(s)=E[Rt+1+γvπ(St+1)∣St=s],s∈S.(7.4)
That is because E[Gt+1∣St=s]=∑aπ(a∣s)∑s′p(s′∣s,a)vπ(s′)=E[vπ(St+1)∣St=s] . Equation (7.4) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation.
The TD algorithm can be derived by applying the Robbins-Monro algorithm (Chapter 6) to solve the Bellman equation in (7.4). Interested readers can check the details in Box 7.1.
Box 7.1: Derivation of the TD algorithm
We next show that the TD algorithm in (7.1) can be obtained by applying the Robbins-Monro algorithm to solve (7.4). For the state st visited at time t, (7.4) can be written as

g\big(v_\pi(s_t)\big) \doteq v_\pi(s_t) - \mathbb{E}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t \big] = 0.

Our goal is to solve the above equation to obtain vπ(st) using the Robbins-Monro algorithm. Since we can obtain rt+1 and st+1, which are samples of Rt+1 and St+1, the noisy observation of g(vπ(st)) that we can obtain is

\tilde{g}\big(v_\pi(s_t)\big) = v_\pi(s_t) - \big( r_{t+1} + \gamma v_\pi(s_{t+1}) \big).

Then, the Robbins-Monro algorithm for solving g(vπ(st))=0 is

v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\, \tilde{g}\big(v_t(s_t)\big) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - \big( r_{t+1} + \gamma v_\pi(s_{t+1}) \big) \big],  (7.5)
where vt(st) is the estimate of vπ(st) at time t , and αt(st) is the learning rate.
The algorithm in (7.5) has a similar expression to that of the TD algorithm in (7.1). The only difference is that the right-hand side of (7.5) contains vπ(st+1), whereas (7.1) contains vt(st+1). That is because (7.5) is designed merely to estimate the state value of st, assuming that the state values of the other states are already known. If we would like to estimate the state values of all the states, then vπ(st+1) on the right-hand side should be replaced with vt(st+1). Then, (7.5) becomes exactly the same as (7.1). However, can such a replacement still ensure convergence? The answer is yes, and it will be proven later in Theorem 7.1.
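To make the Robbins-Monro viewpoint concrete, here is a small Python sketch (an illustration, not part of the original derivation) of the single-state iteration in (7.5). It assumes that the values of the successor states are exactly known, and the function sample_transition(), which draws a sample (r_{t+1}, s_{t+1}) under π, is a hypothetical stand-in.

```python
def rm_estimate_single_state(sample_transition, v_pi_known, gamma=0.9, num_iters=10000):
    """Robbins-Monro iteration (7.5): solve g(v) = v - E[R + gamma * v_pi(S')] = 0
    for a single state, using the noisy observations v - (r + gamma * v_pi(s'))."""
    v = 0.0                                       # initial guess of v_pi(s_t)
    for k in range(1, num_iters + 1):
        r, s_next = sample_transition()           # a sample of (R_{t+1}, S_{t+1})
        g_noisy = v - (r + gamma * v_pi_known[s_next])
        alpha = 1.0 / k                           # satisfies the Robbins-Monro conditions
        v = v - alpha * g_noisy
    return v
```

Replacing v_pi_known by the current estimates of the other states turns this single-state iteration into the TD algorithm in (7.1), which is precisely the replacement discussed above.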
7.1.2 Property analysis
Some important properties of the TD algorithm are discussed as follows.
First, we examine the expression of the TD algorithm more closely. In particular, (7.1) can be rewritten and annotated as
\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t) \Big[ \underbrace{ v_t(s_t) - \underbrace{\big( r_{t+1} + \gamma v_t(s_{t+1}) \big)}_{\text{TD target } \bar{v}_t} }_{\text{TD error } \delta_t} \Big],  (7.6)
where
vˉt≐rt+1+γvt(st+1)
is called the TD target and
δt≐vt(st)−vˉt=vt(st)−(rt+1+γvt(st+1))
is called the TD error. It can be seen that the new estimate vt+1(st) is a combination of the current estimate vt(st) and the TD error δt .
⋄ Why is vˉt called the TD target?
This is because vˉt is the target value that the algorithm attempts to drive vt(st) to. To see that, subtracting vˉt from both sides of (7.6) gives

v_{t+1}(s_t) - \bar{v}_t = \big( 1 - \alpha_t(s_t) \big) \big( v_t(s_t) - \bar{v}_t \big).
Taking the absolute values of both sides of the above equation gives
∣vt+1(st)−vˉt∣=∣1−αt(st)∣∣vt(st)−vˉt∣.
Since αt(st) is a small positive number, we have 0<1−αt(st)<1 . It then follows that
∣vt+1(st)−vˉt∣<∣vt(st)−vˉt∣.
The above inequality is important because it indicates that the new value vt+1(st) is closer to vˉt than the old value vt(st) . Therefore, this algorithm mathematically drives vt(st) toward vˉt . This is why vˉt is called the TD target.
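As a quick numerical illustration of this contraction, suppose (purely hypothetically) that αt(st)=0.1, vt(st)=5, and the TD target is held fixed at vˉt=2:

```python
alpha, v, td_target = 0.1, 5.0, 2.0      # hypothetical values chosen only for illustration
for step in range(3):
    v = v - alpha * (v - td_target)      # the update in (7.6) with the TD target held fixed
    print(step, abs(v - td_target))      # the gap shrinks by the factor 1 - alpha = 0.9
# The initial gap of 3.0 becomes 2.7, 2.43, and then 2.187.
```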
⋄ What is the interpretation of the TD error?
First, this error is called temporal-difference because δt=vt(st)−(rt+1+γvt(st+1)) reflects the discrepancy between two time steps t and t+1 . Second, the TD error is zero in the expectation sense when the state value estimate is accurate. To see that, when vt=vπ , the expected value of the TD error is
\mathbb{E}[\delta_t \mid S_t = s_t] = \mathbb{E}\big[ v_\pi(S_t) - \big( R_{t+1} + \gamma v_\pi(S_{t+1}) \big) \mid S_t = s_t \big] = v_\pi(s_t) - \mathbb{E}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t \big] = 0,

where the last equality is due to (7.4).
Therefore, the TD error reflects not only the discrepancy between two time steps but also, more importantly, the discrepancy between the estimate vt and the true state value vπ .
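This property can be checked numerically. In the following sketch, the three-state Markov chain induced by the policy and its rewards are invented purely for illustration; the true values vπ are obtained by solving the Bellman equation (7.4) as a linear system, and the sampled TD errors then average to approximately zero for every state.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A hypothetical 3-state chain under some policy pi (numbers made up for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
R = np.array([[1.0, 0.0, 0.0],    # R[s, s']: reward received when moving from s to s'
              [0.0, 2.0, -1.0],
              [0.5, 0.0, 1.0]])

# True state values from the Bellman equation (7.4): v_pi = (I - gamma * P)^{-1} r_bar.
r_bar = (P * R).sum(axis=1)
v_pi = np.linalg.solve(np.eye(3) - gamma * P, r_bar)

# With v_t = v_pi, the TD error delta_t has (approximately) zero sample mean.
for s in range(3):
    s_next = rng.choice(3, size=100_000, p=P[s])
    delta = v_pi[s] - (R[s, s_next] + gamma * v_pi[s_next])
    print(f"state {s}: mean TD error = {delta.mean():+.4f}")
```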
On a more abstract level, the TD error can be interpreted as the innovation, which indicates new information obtained from the experience sample (st,rt+1,st+1) . The fundamental idea of TD learning is to correct our current estimate of the state value based on the newly obtained information. Innovation is fundamental in many estimation problems such as Kalman filtering [33,34].
Second, the TD algorithm in (7.1) can only estimate the state values of a given policy. To find optimal policies, we still need to further calculate the action values and then conduct policy improvement. This will be introduced in Section 7.2. Nevertheless, the TD algorithm introduced in this section is very basic and important for understanding the other algorithms in this chapter.
Third, while both TD learning and MC learning are model-free, what are their advantages and disadvantages? The answers are summarized in Table 7.1.
Table 7.1: A comparison between TD learning and MC learning.
7.1.3 Convergence analysis
The convergence analysis of the TD algorithm in (7.1) is given below.
Theorem 7.1 (Convergence of TD learning). Given a policy π, by the TD algorithm in (7.1), vt(s) converges almost surely to vπ(s) as t→∞ for all s∈S if ∑tαt(s)=∞ and ∑tαt²(s)<∞ for all s∈S.
Some remarks about αt are given below. First, the condition that ∑tαt(s)=∞ and ∑tαt²(s)<∞ must be valid for all s∈S. Note that, at time t, αt(s)>0 if s is being visited and αt(s)=0 otherwise. The condition ∑tαt(s)=∞ requires the state s to be visited an infinite (or sufficiently large) number of times. This requires either the condition of exploring starts or an exploratory policy so that every state-action pair can possibly be visited many times. Second, the learning rate αt is often selected as a small positive constant in practice. In this case, the condition ∑tαt²(s)<∞ is no longer valid. When αt is constant, it can still be shown that the algorithm converges in the sense of expectation [24, Section 1.5].
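As a rough empirical illustration of Theorem 7.1 (again on an invented three-state chain, so all of the numbers below are assumptions rather than anything from the text), running (7.1) with the visit-dependent learning rate αt(s) = 1/(number of visits to s), which satisfies both conditions, drives vt toward the vπ obtained by solving (7.4) directly:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9

# A hypothetical 3-state chain under a fixed policy (numbers made up for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r_bar = np.array([0.5, 0.9, 0.6])                     # expected immediate reward of each state
v_pi = np.linalg.solve(np.eye(3) - gamma * P, r_bar)  # ground truth from (7.4)

v = np.zeros(3)        # v_0(s) = 0 for all states
visits = np.zeros(3)   # visit counts, used for the diminishing learning rate
s = 0
for t in range(100_000):
    s_next = rng.choice(3, p=P[s])
    reward = r_bar[s] + rng.normal(scale=0.1)         # noisy reward sample
    visits[s] += 1
    alpha = 1.0 / visits[s]   # sum_t alpha_t = inf and sum_t alpha_t^2 < inf, as required
    v[s] -= alpha * (v[s] - (reward + gamma * v[s_next]))   # the TD update in (7.1)
    s = s_next

print(np.abs(v - v_pi).max())   # the maximum error shrinks toward zero as t grows
```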
Box 7.2: Proof of Theorem 7.1
We prove the convergence based on Theorem 6.3 in Chapter 6. To do that, we need first to construct a stochastic process as that in Theorem 6.3. Consider an arbitrary state s∈S. At time t, it follows from the TD algorithm in (7.1) that, if s=st,

v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \big].

Subtracting vπ(st) from both sides and defining the estimation error \Delta_t(s) \doteq v_t(s) - v_\pi(s) gives

\Delta_{t+1}(s_t) = \big( 1 - \alpha_t(s_t) \big) \Delta_t(s_t) + \alpha_t(s_t) \eta_t(s_t),  (7.9)

where \eta_t(s_t) \doteq r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s_t). If s≠st, it follows from (7.2) that vt+1(s)=vt(s) and hence Δt+1(s)=Δt(s),
whose expression is the same as that of (7.9) except that αt(s)=0 and ηt(s)=0 . Therefore, regardless of whether s=st , we obtain the following unified expression:
Δt+1(s)=(1−αt(s))Δt(s)+αt(s)ηt(s).
This is the process in Theorem 6.3. Our goal is to show that the three conditions in Theorem 6.3 are satisfied and hence the process converges.
The first condition is valid as assumed in Theorem 7.1. We next show that the second condition is valid. That is, ∥E[ηt(s)∣Ht]∥∞≤γ∥Δt(s)∥∞ for all s∈S . Here, Ht represents the historical information (see the definition in Theorem 6.3). Due to the Markovian property, ηt(s)=rt+1+γvt(st+1)−vπ(s) or ηt(s)=0 does not depend
on the historical information once s is given. As a result, we have E[ηt(s)∣Ht]=E[ηt(s)]. For s≠st, we have ηt(s)=0. Then, it is trivial to see that

\big| \mathbb{E}[\eta_t(s)] \big| = 0 \le \gamma \|\Delta_t(s)\|_\infty.  (7.10)

For s=st, it follows from the Bellman equation in (7.4) that

\mathbb{E}[\eta_t(s_t)] = \mathbb{E}\big[ r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s_t) \mid s_t \big] = \gamma \mathbb{E}\big[ v_t(s_{t+1}) - v_\pi(s_{t+1}) \mid s_t \big] = \gamma \mathbb{E}\big[ \Delta_t(s_{t+1}) \mid s_t \big],

and hence

\big| \mathbb{E}[\eta_t(s_t)] \big| \le \gamma \max_{s' \in \mathcal{S}} |\Delta_t(s')| = \gamma \|\Delta_t(s)\|_\infty.  (7.11)
Therefore, at time t , we know from (7.10) and (7.11) that ∣E[ηt(s)]∣≤γ∥Δt(s)∥∞ for all s∈S regardless of whether s=st . Thus,
∥E[ηt(s)]∥∞≤γ∥Δt(s)∥∞,
which is the second condition in Theorem 6.3. Finally, regarding the third condition, we have var[ηt(s)∣Ht]=var[rt+1+γvt(st+1)−vπ(st)∣st]=var[rt+1+γvt(st+1)∣st] for s=st and var[ηt(s)∣Ht]=0 for s≠st. Since rt+1 is bounded, the third condition can be proven without difficulty.