8.2 TD learning of state values based on function approximation
In this section, we show how to integrate the function approximation method into TD learning to estimate the state values of a given policy. This algorithm will be extended to learn action values and optimal policies in Section 8.3.
This section contains quite a few subsections whose contents are closely related. It is helpful to review these contents before diving into the details.
The function approximation method is formulated as an optimization problem. The objective function of this problem is introduced in Section 8.2.1. The TD learning algorithm for optimizing this objective function is introduced in Section 8.2.2.
To apply the TD learning algorithm, we need to select appropriate feature vectors. Section 8.2.3 discusses this problem.
Examples are given in Section 8.2.4 to demonstrate the TD algorithm and the impacts of different feature vectors.
A theoretical analysis of the TD algorithm is given in Section 8.2.5. This subsection is mathematically intensive. Readers may read it selectively based on their interests.
8.2.1 Objective function
Let $v_\pi(s)$ and $\hat{v}(s, w)$ be the true state value and the approximated state value of $s \in \mathcal{S}$, respectively. The problem to be solved is to find an optimal $w$ so that $\hat{v}(s, w)$ can best approximate $v_\pi(s)$ for every $s$. In particular, the objective function is
$$J(w) = \mathbb{E}\Big[ \big( v_\pi(S) - \hat{v}(S, w) \big)^2 \Big], \qquad (8.3)$$
where the expectation is calculated with respect to the random variable $S \in \mathcal{S}$. While $S$ is a random variable, what is its probability distribution? This question is important for understanding this objective function. There are several ways to define the probability distribution of $S$.
The first way is to use a uniform distribution. That is, to treat all the states as equally important by setting the probability of each state to $1/|\mathcal{S}|$. In this case, the objective function in (8.3) becomes
$$J(w) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \big( v_\pi(s) - \hat{v}(s, w) \big)^2, \qquad (8.4)$$
which is the average of the approximation errors over all the states. However, this way does not consider the real dynamics of the Markov process under the given policy. Since some states may be rarely visited by the policy, it may be unreasonable to treat all the states as equally important.
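To make (8.4) concrete, the following is a minimal sketch that evaluates the uniformly weighted objective for a linear approximator $\hat{v}(s, w) = \phi(s)^T w$. The "true" state values, the feature map phi, and the parameter vector are all illustrative assumptions, not quantities taken from this chapter's examples.

```python
import numpy as np

# Minimal sketch of the uniformly weighted objective (8.4), assuming a small
# finite state space, illustrative "true" values v_pi, and a hypothetical
# linear approximator v_hat(s, w) = phi(s)^T w.

num_states = 5
v_pi = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # assumed true state values (illustrative)

def phi(s):
    """Hypothetical feature vector of state s: [1, s]."""
    return np.array([1.0, float(s)])

def v_hat(s, w):
    """Linear approximation of the state value."""
    return phi(s) @ w

def J_uniform(w):
    """Average squared error over all states, each weighted by 1/|S|."""
    errors = [(v_pi[s] - v_hat(s, w)) ** 2 for s in range(num_states)]
    return float(np.mean(errors))

print(J_uniform(np.zeros(2)))  # objective value for an untrained w
```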
The second way, which is the focus of this chapter, is to use the stationary distribution. The stationary distribution describes the long-term behavior of a Markov decision process. More specifically, after the agent executes a given policy for a sufficiently long period, the probability of the agent being located at any state can be described by this stationary distribution. Interested readers may see the details in Box 8.1.
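As a rough illustration of this long-run behavior (a sketch under assumed dynamics, not the construction in Box 8.1), the snippet below estimates the stationary distribution by simulating the Markov chain induced by a policy for many steps and counting visit frequencies. The transition matrix P_pi here is made up for illustration.

```python
import numpy as np

# Minimal sketch: estimate the stationary distribution d_pi empirically by
# running the Markov chain induced by a policy for a long time and counting
# how often each state is visited. P_pi is an assumed, illustrative
# transition matrix, not one taken from the book.

rng = np.random.default_rng(0)
P_pi = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0, 0.0],
    [0.0, 0.2, 0.3, 0.5, 0.0],
    [0.0, 0.0, 0.2, 0.3, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5],
])

num_steps = 100_000
visits = np.zeros(P_pi.shape[0])
s = 0
for _ in range(num_steps):
    visits[s] += 1
    s = rng.choice(P_pi.shape[0], p=P_pi[s])

d_pi_hat = visits / num_steps   # empirical estimate of d_pi
print(d_pi_hat)
```

If the transition matrix is known, the stationary distribution can also be obtained directly as the normalized left eigenvector of $P_\pi$ associated with eigenvalue 1, i.e., the solution of $d_\pi^T P_\pi = d_\pi^T$.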
Let $\{d_\pi(s)\}_{s \in \mathcal{S}}$ denote the stationary distribution of the Markov process under policy $\pi$. That is, the probability of the agent visiting $s$ after a long period of time is $d_\pi(s)$. By definition, $d_\pi(s) \geq 0$ and $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$. Then, the objective function in (8.3) can be rewritten as
$$J(w) = \sum_{s \in \mathcal{S}} d_\pi(s) \big( v_\pi(s) - \hat{v}(s, w) \big)^2, \qquad (8.5)$$
which is a weighted average of the approximation errors. The states that have higher probabilities of being visited are given greater weights.
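The sketch below evaluates the weighted objective in (8.5) for an assumed stationary distribution (for example, the empirical estimate from the previous sketch) and the same hypothetical linear approximator as before; all numerical values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the stationary-distribution-weighted objective (8.5).
# v_pi, d_pi, and the feature map are illustrative assumptions; d_pi could be
# replaced by the empirical estimate computed in the previous sketch.

v_pi = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # assumed true state values
d_pi = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # assumed stationary distribution (sums to 1)

def phi(s):
    """Hypothetical feature vector of state s: [1, s]."""
    return np.array([1.0, float(s)])

def J_stationary(w):
    """Weighted average of squared errors; frequently visited states weigh more."""
    return float(sum(d_pi[s] * (v_pi[s] - phi(s) @ w) ** 2 for s in range(len(v_pi))))

print(J_stationary(np.zeros(2)))
```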
It is notable that the value of $d_\pi(s)$ is nontrivial to obtain because it requires knowing the state transition probability matrix $P_\pi$ (see Box 8.1). Fortunately, as shown in the next subsection, we do not need to calculate the specific value of $d_\pi(s)$ to minimize this objective function. In addition, it was assumed that the number of states is finite when we introduced (8.4) and (8.5). When the state space is continuous, we can replace the summations with integrals.