9.1 Policy representation: From table to function
When the representation of a policy is switched from a table to a function, it is necessary to clarify the differences between the two representations.
First, how to define optimal policies? When represented as a table, a policy is defined as optimal if it can maximize every state value. When represented by a function, a policy is defined as optimal if it can maximize certain scalar metrics.
Second, how to update a policy? When represented by a table, a policy can be updated by directly changing the entries in the table. When represented by a parameterized function, a policy can no longer be updated in this way. Instead, it can only be updated by changing the parameter $\theta$.
Third, how to retrieve the probability of an action? In the tabular case, the probability of an action can be directly obtained by looking up the corresponding entry in the table. In the case of function representation, we need to input $(s, a)$ into the function to calculate its probability (see Figure 9.2(a)). Depending on the structure of the function, we can also input a state and then output the probabilities of all actions (see Figure 9.2(b)).
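The second and third differences can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the text: the toy sizes (3 states, 2 actions), the one-hot feature map `features`, the softmax form of the function policy, and all function names.

```python
import math

# Hypothetical toy problem: 3 states, 2 actions (sizes chosen for illustration).
NUM_STATES, NUM_ACTIONS = 3, 2

# Tabular policy: one stored probability per (state, action) pair.
table_policy = [[0.5, 0.5] for _ in range(NUM_STATES)]

def table_prob(s, a):
    """Retrieve pi(a|s) by looking up the table entry."""
    return table_policy[s][a]

def features(s, a):
    """Assumed one-hot feature vector for the pair (s, a)."""
    phi = [0.0] * (NUM_STATES * NUM_ACTIONS)
    phi[s * NUM_ACTIONS + a] = 1.0
    return phi

def action_probs(theta, s):
    """Input a state, output the probabilities of all actions (softmax)."""
    prefs = [sum(t * f for t, f in zip(theta, features(s, a)))
             for a in range(NUM_ACTIONS)]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def function_prob(theta, s, a):
    """Input (s, a), output the probability of taking a in s."""
    return action_probs(theta, s)[a]

# Updating the tabular policy: overwrite the entries directly.
table_policy[0] = [0.9, 0.1]

# Updating the function policy: change the parameter vector only.
theta = [0.0] * (NUM_STATES * NUM_ACTIONS)
theta[0] = 2.0  # raises the preference for action 0 in state 0

print(table_prob(0, 0))
print(function_prob(theta, 0, 0))
print(action_probs(theta, 1))
```

With one-hot features every state-action pair has its own parameter, so this particular function is no more compact than a table. A feature map or network that shares parameters across states would be the more typical choice; the one-hot form is used here only to keep the sketch short.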
The basic idea of the policy gradient method is summarized below. Suppose that $J(\theta)$ is a scalar metric. Optimal policies can be obtained by optimizing this metric via the gradient-based algorithm:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t),$$
where $\nabla_\theta J$ is the gradient of $J$ with respect to $\theta$, $t$ is the time step, and $\alpha$ is the optimization rate.
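The gradient-based iteration described above, which repeatedly moves the parameter in the direction of the gradient of the metric, can be sketched on a toy problem. The setup is hypothetical and chosen only so that the gradient has a closed form: a single state, three actions with fixed rewards `REWARDS`, a softmax policy, and the expected reward as a stand-in metric. The metrics actually used in this chapter are the subject of Section 9.2.

```python
import math

REWARDS = [1.0, 2.0, 0.5]  # hypothetical rewards of three actions

def softmax(theta):
    """Action probabilities of a softmax policy with preferences theta."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def metric(theta):
    """Assumed scalar metric: expected reward under the softmax policy."""
    return sum(p * r for p, r in zip(softmax(theta), REWARDS))

def gradient(theta):
    """Analytic gradient of the metric: pi(a) * (r(a) - metric) per action."""
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, REWARDS))
    return [p * (r - j) for p, r in zip(pi, REWARDS)]

def gradient_ascent(theta, alpha=0.5, steps=200):
    """Repeat: theta <- theta + alpha * gradient(theta)."""
    for _ in range(steps):
        grad = gradient(theta)
        theta = [th + alpha * g for th, g in zip(theta, grad)]
    return theta

theta0 = [0.0, 0.0, 0.0]
theta_final = gradient_ascent(theta0)
print(metric(theta0), metric(theta_final))
print(softmax(theta_final))
```

In this sketch the gradient is computed exactly because the rewards are known. How to obtain the gradient in general, and how to estimate it from experience samples, are separate questions.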
With this basic idea, we will answer the following three questions in the remainder of this chapter.
What metrics should be used? (Section 9.2)
How to calculate the gradients of the metrics? (Section 9.3)
How to use experience samples to calculate the gradients? (Section 9.4)