
8.2 TD learning of state values based on function approximation

In this section, we show how to integrate the function approximation method into TD learning to estimate the state values of a given policy. This algorithm will be extended to learn action values and optimal policies in Section 8.3.

This section contains several closely related subsections. It is helpful to review their contents before diving into the details.

The function approximation method is formulated as an optimization problem. The objective function of this problem is introduced in Section 8.2.1. The TD learning algorithm for optimizing this objective function is introduced in Section 8.2.2.

To apply the TD learning algorithm, we need to select appropriate feature vectors. Section 8.2.3 discusses this problem.
Examples are given in Section 8.2.4 to demonstrate the TD algorithm and the impacts of different feature vectors.
A theoretical analysis of the TD algorithm is given in Section 8.2.5. This subsection is mathematically intensive. Readers may read it selectively based on their interests.

8.2.1 Objective function

Let $v_\pi(s)$ and $\hat{v}(s, w)$ be the true state value and the approximated state value of $s \in \mathcal{S}$, respectively. The problem to be solved is to find an optimal $w$ so that $\hat{v}(s, w)$ can best approximate $v_\pi(s)$ for every $s$. In particular, the objective function is

J(w) = \mathbb{E}\left[ \left( v_\pi(S) - \hat{v}(S, w) \right)^2 \right], \tag{8.3}

where the expectation is calculated with respect to the random variable $S \in \mathcal{S}$. Since $S$ is a random variable, what is its probability distribution? This question is important for understanding the objective function. There are several ways to define the probability distribution of $S$.

\diamond The first way is to use a uniform distribution. That is, to treat all the states as equally important by setting the probability of each state to $1/n$, where $n$ is the number of states. In this case, the objective function in (8.3) becomes

J(w) = \frac{1}{n} \sum_{s \in \mathcal{S}} \left( v_\pi(s) - \hat{v}(s, w) \right)^2, \tag{8.4}

which is the average value of the approximation errors of all the states. However, this way does not consider the real dynamics of the Markov process under the given policy. Since some states may be rarely visited by a policy, it may be unreasonable to treat all the states as equally important.
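For illustration, the sketch below uses a made-up five-state example: the true values $v_\pi$, the feature vector $\phi(s) = [1, s]^\top$, and the weight vector $w$ are all assumptions introduced here, not part of the text. It simply evaluates the uniform-weight objective (8.4) for a linear approximator $\hat{v}(s, w) = \phi(s)^\top w$.

```python
import numpy as np

# A made-up example with n = 5 states; the true values v_pi and the feature
# map phi are illustrative assumptions, not taken from the text.
n = 5
v_pi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # true state values (assumed known)

def phi(s):
    # A simple feature vector [1, s]; other choices are discussed in Section 8.2.3.
    return np.array([1.0, float(s)])

def v_hat(s, w):
    # Linear function approximation of the state value.
    return phi(s) @ w

def J_uniform(w):
    # Objective (8.4): the average squared approximation error over all states.
    return np.mean([(v_pi[s] - v_hat(s, w)) ** 2 for s in range(n)])

w = np.array([0.5, 0.9])                     # an arbitrary weight vector
print(J_uniform(w))
```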

\diamond The second way, which is the focus of this chapter, is to use the stationary distribution. The stationary distribution describes the long-term behavior of a Markov decision process. More specifically, after the agent executes a given policy for a sufficiently long period, the probability of the agent being located at any state can be described by this stationary distribution. Interested readers may see the details in Box 8.1.

Let $\{d_\pi(s)\}_{s \in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$. That is, $d_\pi(s)$ is the probability of the agent visiting $s$ after a long period of time. By definition, $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$. Then, the objective function in (8.3) can be rewritten as

J(w) = \sum_{s \in \mathcal{S}} d_\pi(s) \left( v_\pi(s) - \hat{v}(s, w) \right)^2, \tag{8.5}

which is a weighted average of the approximation errors. The states that have higher probabilities of being visited are given greater weights.
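Under the same made-up five-state example as before, the only change needed for the weighted objective (8.5) is how the per-state errors are averaged: they are weighted by the stationary distribution $d_\pi$ (here an arbitrary vector that sums to one, not one computed from a real policy).

```python
import numpy as np

# Same hypothetical five-state example; d_pi is an arbitrary distribution
# that sums to one, not one obtained from an actual Markov process.
v_pi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # true state values (assumed known)
d_pi = np.array([0.1, 0.1, 0.2, 0.3, 0.3])   # assumed stationary distribution

def v_hat(s, w):
    # Linear approximator with feature vector phi(s) = [1, s].
    return np.array([1.0, float(s)]) @ w

def J_weighted(w):
    # Objective (8.5): stationary-distribution-weighted squared error.
    errors = np.array([(v_pi[s] - v_hat(s, w)) ** 2 for s in range(len(v_pi))])
    return d_pi @ errors

print(J_weighted(np.array([0.5, 0.9])))
```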

It is notable that the value of $d_\pi(s)$ is nontrivial to obtain because it requires knowing the state transition probability matrix $P_\pi$ (see Box 8.1). Fortunately, as shown in the next subsection, we do not need to calculate the specific value of $d_\pi(s)$ to minimize this objective function. In addition, it was assumed that the number of states is finite when we introduced (8.4) and (8.5). When the state space is continuous, we can replace the summations with integrals.
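Box 8.1 explains how $d_\pi$ is defined through $P_\pi$. As a rough numerical sketch (with a made-up $3 \times 3$ transition matrix, assuming the Markov chain is ergodic), $d_\pi$ can be approximated by repeatedly applying $d \leftarrow d P_\pi$ until convergence:

```python
import numpy as np

# A made-up state transition matrix P_pi under some policy pi (rows sum to one).
# In practice P_pi is often unknown, which is why the TD algorithm in the next
# subsection avoids computing d_pi explicitly.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.4, 0.6]])

# Power iteration: start from any distribution and iterate d <- d P_pi.
# Assuming the chain is ergodic, d converges to the stationary distribution,
# i.e., the left eigenvector of P_pi associated with eigenvalue 1.
d = np.ones(P_pi.shape[0]) / P_pi.shape[0]
for _ in range(1000):
    d = d @ P_pi
print(d, d.sum())   # approximately d_pi; the entries sum to one
```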

Box 8.1: Stationary distribution of a Markov decision process