TD learning often refers to a broad class of reinforcement learning algorithms. For example, all the algorithms introduced in this chapter fall into the scope of TD learning. However, TD learning in this section specifically refers to a classic algorithm for estimating state values.
7.1.1 Algorithm description
Given a policy π, our goal is to estimate vπ(s) for all s∈S. Suppose that we have some experience samples (s0,r1,s1,…,st,rt+1,st+1,…) generated following π. Here, t denotes the time step. The following TD algorithm can estimate the state values using these samples:

v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \big],  (7.1)

v_{t+1}(s) = v_t(s),  for all s \neq s_t,  (7.2)
where t=0,1,2,… . Here, vt(st) is the estimate of vπ(st) at time t ; αt(st) is the learning rate for st at time t .
It should be noted that, at time t, only the value of the visited state st is updated. The values of the unvisited states s≠st remain unchanged, as shown in (7.2). Equation (7.2) is often omitted for simplicity, but it should be kept in mind because the algorithm would be mathematically incomplete without this equation.
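To make the update rule concrete, the following Python sketch implements tabular TD learning for policy evaluation. It is only an illustration of (7.1)-(7.2), not code from this book: the environment interface (env.reset, env.step), the policy function, and the constant learning rate are assumptions made for the sketch.

```python
import collections

def td0_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, num_episodes=1000):
    """Tabular TD learning for estimating v_pi, following (7.1)-(7.2).

    Assumed (hypothetical) interface: env.reset() -> s0 and
    env.step(a) -> (s_next, r, done); policy(s) samples an action from pi(.|s).
    """
    v = collections.defaultdict(float)   # v_t(s), initialized to zero for every state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target and TD error for the visited state s_t (see (7.6)).
            td_target = r + gamma * (0.0 if done else v[s_next])
            td_error = v[s] - td_target
            # Update only the visited state, as in (7.1); all other entries
            # of v are carried over unchanged, which is the role of (7.2).
            v[s] = v[s] - alpha * td_error
            s = s_next
    return v
```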
Readers who see the TD learning algorithm for the first time may wonder why it is designed like this. In fact, it can be viewed as a special stochastic approximation algorithm for solving the Bellman equation. To see that, first recall that the definition of the state value is
vπ(s)=E[Rt+1+γGt+1∣St=s],s∈S.(7.3)
We can rewrite (7.3) as
vπ(s)=E[Rt+1+γvπ(St+1)∣St=s],s∈S.(7.4)
That is because E[Gt+1∣St=s]=∑aπ(a∣s)∑s′p(s′∣s,a)vπ(s′)=E[vπ(St+1)∣St=s] . Equation (7.4) is another expression of the Bellman equation. It is sometimes called the Bellman expectation equation.
The TD algorithm can be derived by applying the Robbins-Monro algorithm (Chapter 6) to solve the Bellman equation in (7.4). Interested readers can check the details in Box 7.1.
Box 7.1: Derivation of the TD algorithm
We next show that the TD algorithm in (7.1) can be obtained by applying the Robbins-Monro algorithm to solve (7.4). For the state st visited at time t, (7.4) can be written as

g\big(v_\pi(s_t)\big) \doteq v_\pi(s_t) - \mathbb{E}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t \big] = 0.

Our goal is to solve the above equation to obtain vπ(st) using the Robbins-Monro algorithm. Since we can obtain rt+1 and st+1, which are samples of Rt+1 and St+1, the noisy observation of g(vπ(st)) that we can obtain is

\tilde{g}\big(v_\pi(s_t)\big) = v_\pi(s_t) - \big( r_{t+1} + \gamma v_\pi(s_{t+1}) \big).

Then, the Robbins-Monro algorithm for solving g(vπ(st))=0 is

v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\, \tilde{g}\big(v_t(s_t)\big) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - \big( r_{t+1} + \gamma v_\pi(s_{t+1}) \big) \big],  (7.5)
where vt(st) is the estimate of vπ(st) at time t , and αt(st) is the learning rate.
The algorithm in (7.5) has a similar expression to that of the TD algorithm in (7.1). The only difference is that the right-hand side of (7.5) contains vπ(st+1), whereas (7.1) contains vt(st+1). That is because (7.5) is designed merely to estimate the state value of st, assuming that the state values of the other states are already known. If we would like to estimate the state values of all the states, then vπ(st+1) on the right-hand side should be replaced with vt(st+1). Then, (7.5) becomes exactly the same as (7.1). However, can such a replacement still ensure convergence? The answer is yes, and it will be proven later in Theorem 7.1.
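To make the Robbins-Monro viewpoint concrete, here is a small Python sketch (an illustration, not part of the original derivation) of the single-state iteration in (7.5). It assumes that the values of the successor states are exactly known, and the function sample_transition(), which draws a sample (r_{t+1}, s_{t+1}) under π, is a hypothetical stand-in.

```python
def rm_estimate_single_state(sample_transition, v_pi_known, gamma=0.9, num_iters=10000):
    """Robbins-Monro iteration (7.5): solve g(v) = v - E[R + gamma * v_pi(S')] = 0
    for a single state, using the noisy observations v - (r + gamma * v_pi(s'))."""
    v = 0.0                                       # initial guess of v_pi(s_t)
    for k in range(1, num_iters + 1):
        r, s_next = sample_transition()           # a sample of (R_{t+1}, S_{t+1})
        g_noisy = v - (r + gamma * v_pi_known[s_next])
        alpha = 1.0 / k                           # satisfies the Robbins-Monro conditions
        v = v - alpha * g_noisy
    return v
```

Replacing v_pi_known by the current estimates of the other states turns this single-state iteration into the TD algorithm in (7.1), which is precisely the replacement discussed above.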
7.1.2 Property analysis
Some important properties of the TD algorithm are discussed as follows.
First, we examine the expression of the TD algorithm more closely. In particular, (7.1) can be rewritten and annotated as
\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t) \Big[ \underbrace{ v_t(s_t) - \underbrace{\big( r_{t+1} + \gamma v_t(s_{t+1}) \big)}_{\text{TD target } \bar{v}_t} }_{\text{TD error } \delta_t} \Big],  (7.6)
where
vˉt≐rt+1+γvt(st+1)
is called the TD target and
δt≐vt(st)−vˉt=vt(st)−(rt+1+γvt(st+1))
is called the TD error. It can be seen that the new estimate vt+1(st) is a combination of the current estimate vt(st) and the TD error δt .
⋄ Why is vˉt called the TD target?
This is because vˉt is the target value that the algorithm attempts to drive vt(st) to. To see that, subtracting vˉt from both sides of (7.6) gives

v_{t+1}(s_t) - \bar{v}_t = \big( 1 - \alpha_t(s_t) \big) \big( v_t(s_t) - \bar{v}_t \big).
Taking the absolute values of both sides of the above equation gives
∣vt+1(st)−vˉt∣=∣1−αt(st)∣∣vt(st)−vˉt∣.
Since αt(st) is a small positive number, we have 0<1−αt(st)<1 . It then follows that
∣vt+1(st)−vˉt∣<∣vt(st)−vˉt∣.
The above inequality is important because it indicates that the new value vt+1(st) is closer to vˉt than the old value vt(st) . Therefore, this algorithm mathematically drives vt(st) toward vˉt . This is why vˉt is called the TD target.
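As a quick numerical illustration of this contraction, suppose (purely hypothetically) that αt(st)=0.1, vt(st)=5, and the TD target is held fixed at vˉt=2:

```python
alpha, v, td_target = 0.1, 5.0, 2.0      # hypothetical values chosen only for illustration
for step in range(3):
    v = v - alpha * (v - td_target)      # the update in (7.6) with the TD target held fixed
    print(step, abs(v - td_target))      # the gap shrinks by the factor 1 - alpha = 0.9
# The initial gap of 3.0 becomes 2.7, 2.43, and then 2.187.
```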
⋄ What is the interpretation of the TD error?
First, this error is called temporal-difference because δt=vt(st)−(rt+1+γvt(st+1)) reflects the discrepancy between two time steps t and t+1 . Second, the TD error is zero in the expectation sense when the state value estimate is accurate. To see that, when vt=vπ , the expected value of the TD error is
\mathbb{E}[\delta_t \mid S_t = s_t] = \mathbb{E}\big[ v_\pi(S_t) - \big( R_{t+1} + \gamma v_\pi(S_{t+1}) \big) \mid S_t = s_t \big] = v_\pi(s_t) - \mathbb{E}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t \big] = 0,

where the last equality is due to (7.4).
Therefore, the TD error reflects not only the discrepancy between two time steps but also, more importantly, the discrepancy between the estimate vt and the true state value vπ .
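This property can be checked numerically. In the following sketch, the three-state Markov chain induced by the policy and its rewards are invented purely for illustration; the true values vπ are obtained by solving the Bellman equation (7.4) as a linear system, and the sampled TD errors then average to approximately zero for every state.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A hypothetical 3-state chain under some policy pi (numbers made up for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
R = np.array([[1.0, 0.0, 0.0],    # R[s, s']: reward received when moving from s to s'
              [0.0, 2.0, -1.0],
              [0.5, 0.0, 1.0]])

# True state values from the Bellman equation (7.4): v_pi = (I - gamma * P)^{-1} r_bar.
r_bar = (P * R).sum(axis=1)
v_pi = np.linalg.solve(np.eye(3) - gamma * P, r_bar)

# With v_t = v_pi, the TD error delta_t has (approximately) zero sample mean.
for s in range(3):
    s_next = rng.choice(3, size=100_000, p=P[s])
    delta = v_pi[s] - (R[s, s_next] + gamma * v_pi[s_next])
    print(f"state {s}: mean TD error = {delta.mean():+.4f}")
```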
On a more abstract level, the TD error can be interpreted as the innovation, which indicates new information obtained from the experience sample (st,rt+1,st+1) . The fundamental idea of TD learning is to correct our current estimate of the state value based on the newly obtained information. Innovation is fundamental in many estimation problems such as Kalman filtering [33,34].
Second, the TD algorithm in (7.1) can only estimate the state values of a given policy. To find optimal policies, we still need to further calculate the action values and then conduct policy improvement. This will be introduced in Section 7.2. Nevertheless, the TD algorithm introduced in this section is very basic and important for understanding the other algorithms in this chapter.
Third, while both TD learning and MC learning are model-free, what are their advantages and disadvantages? The answers are summarized in Table 7.1.
Table 7.1: A comparison between TD learning and MC learning.
7.1.3 Convergence analysis
The convergence analysis of the TD algorithm in (7.1) is given below.
Theorem 7.1 (Convergence of TD learning). Given a policy π, by the TD algorithm in (7.1), vt(s) converges almost surely to vπ(s) as t→∞ for all s∈S if ∑tαt(s)=∞ and ∑tαt²(s)<∞ for all s∈S.
Some remarks about αt are given below. First, the condition that ∑tαt(s)=∞ and ∑tαt²(s)<∞ must be valid for all s∈S. Note that, at time t, αt(s)>0 if s is being visited and αt(s)=0 otherwise. The condition ∑tαt(s)=∞ requires the state s to be visited an infinite (or sufficiently large) number of times. This requires either the condition of exploring starts or an exploratory policy so that every state-action pair can possibly be visited many times. Second, the learning rate αt is often selected as a small positive constant in practice. In this case, the condition ∑tαt²(s)<∞ is no longer valid. When αt is constant, it can still be shown that the algorithm converges in the sense of expectation [24, Section 1.5].
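As a rough empirical illustration of Theorem 7.1 (again on an invented three-state chain, so all of the numbers below are assumptions rather than anything from the text), running (7.1) with the visit-dependent learning rate αt(s) = 1/(number of visits to s), which satisfies both conditions, drives vt toward the vπ obtained by solving (7.4) directly:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9

# A hypothetical 3-state chain under a fixed policy (numbers made up for illustration).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r_bar = np.array([0.5, 0.9, 0.6])                     # expected immediate reward of each state
v_pi = np.linalg.solve(np.eye(3) - gamma * P, r_bar)  # ground truth from (7.4)

v = np.zeros(3)        # v_0(s) = 0 for all states
visits = np.zeros(3)   # visit counts, used for the diminishing learning rate
s = 0
for t in range(100_000):
    s_next = rng.choice(3, p=P[s])
    reward = r_bar[s] + rng.normal(scale=0.1)         # noisy reward sample
    visits[s] += 1
    alpha = 1.0 / visits[s]   # sum_t alpha_t = inf and sum_t alpha_t^2 < inf, as required
    v[s] -= alpha * (v[s] - (reward + gamma * v[s_next]))   # the TD update in (7.1)
    s = s_next

print(np.abs(v - v_pi).max())   # the maximum error shrinks toward zero as t grows
```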
Box 7.2: Proof of Theorem 7.1
We prove the convergence based on Theorem 6.3 in Chapter 6. To do that, we need first to construct a stochastic process as that in Theorem 6.3. Consider an arbitrary state s∈S. At time t, it follows from the TD algorithm in (7.1) that, if s=st,

v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - \big( r_{t+1} + \gamma v_t(s_{t+1}) \big) \big].

Subtracting vπ(st) from both sides and defining the estimation error \Delta_t(s) \doteq v_t(s) - v_\pi(s) gives

\Delta_{t+1}(s_t) = \big( 1 - \alpha_t(s_t) \big) \Delta_t(s_t) + \alpha_t(s_t) \eta_t(s_t),  (7.9)

where \eta_t(s_t) \doteq r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s_t). If s≠st, it follows from (7.2) that vt+1(s)=vt(s) and hence Δt+1(s)=Δt(s),
whose expression is the same as that of (7.9) except that αt(s)=0 and ηt(s)=0 . Therefore, regardless of whether s=st , we obtain the following unified expression:
Δt+1(s)=(1−αt(s))Δt(s)+αt(s)ηt(s).
This is the process in Theorem 6.3. Our goal is to show that the three conditions in Theorem 6.3 are satisfied and hence the process converges.
The first condition is valid as assumed in Theorem 7.1. We next show that the second condition is valid. That is, ∥E[ηt(s)∣Ht]∥∞≤γ∥Δt(s)∥∞ for all s∈S . Here, Ht represents the historical information (see the definition in Theorem 6.3). Due to the Markovian property, ηt(s)=rt+1+γvt(st+1)−vπ(s) or ηt(s)=0 does not depend
on the historical information once s is given. As a result, we have E[ηt(s)∣Ht]=E[ηt(s)]. For s≠st, we have ηt(s)=0. Then, it is trivial to see that

\big| \mathbb{E}[\eta_t(s)] \big| = 0 \le \gamma \|\Delta_t(s)\|_\infty.  (7.10)

For s=st, it follows from the Bellman equation in (7.4) that

\mathbb{E}[\eta_t(s_t)] = \mathbb{E}\big[ r_{t+1} + \gamma v_t(s_{t+1}) - v_\pi(s_t) \mid s_t \big] = \gamma \mathbb{E}\big[ v_t(s_{t+1}) - v_\pi(s_{t+1}) \mid s_t \big] = \gamma \mathbb{E}\big[ \Delta_t(s_{t+1}) \mid s_t \big],

and hence

\big| \mathbb{E}[\eta_t(s_t)] \big| \le \gamma \max_{s' \in \mathcal{S}} |\Delta_t(s')| = \gamma \|\Delta_t(s)\|_\infty.  (7.11)
Therefore, at time t , we know from (7.10) and (7.11) that ∣E[ηt(s)]∣≤γ∥Δt(s)∥∞ for all s∈S regardless of whether s=st . Thus,
∥E[ηt(s)]∥∞≤γ∥Δt(s)∥∞,
which is the second condition in Theorem 6.3. Finally, regarding the third condition, we have var[ηt(s)∣Ht]=var[rt+1+γvt(st+1)−vπ(st)∣st]=var[rt+1+γvt(st+1)∣st] for s=st and var[ηt(s)∣Ht]=0 for s≠st. Since rt+1 is bounded, the third condition can be proven without difficulty.