9.1 Policy representation: From table to function

When the representation of a policy is switched from a table to a function, the differences between the two representations must be clarified.

\diamond First, how to define optimal policies? When represented as a table, a policy is defined as optimal if it maximizes every state value. When represented by a function, a policy is defined as optimal if it maximizes a certain scalar metric.
\diamond Second, how to update a policy? When represented by a table, a policy can be updated by directly changing the entries in the table. When represented by a parameterized function, a policy can no longer be updated in this way. Instead, it can only be updated by changing the parameter \theta.
\diamond Third, how to retrieve the probability of an action? In the tabular case, the probability of an action can be obtained directly by looking up the corresponding entry in the table. In the case of function representation, we need to input (s,a) into the function to calculate its probability (see Figure 9.2(a)). Depending on the structure of the function, we can also input a state and output the probabilities of all actions (see Figure 9.2(b)), as sketched in the code example after this list.
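
The following is a minimal sketch, not taken from the book, that contrasts the two representations. The state and action space sizes and the softmax-linear form of the parameterized policy are illustrative assumptions; the function here outputs the probabilities of all actions for a given state, as in Figure 9.2(b).

```python
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)

# Tabular representation: pi(a|s) is stored explicitly as an entry of a table.
policy_table = np.full((n_states, n_actions), 1.0 / n_actions)
prob_tabular = policy_table[2, 1]            # look up pi(a=1 | s=2) directly

# Function representation: pi(a|s, theta) is computed from a parameter theta.
# The softmax-over-logits form below is an illustrative choice, not the book's.
theta = rng.normal(size=(n_states, n_actions))

def pi(s, theta):
    """Return the probabilities of all actions in state s (cf. Figure 9.2(b))."""
    logits = theta[s]                        # a simple linear-in-theta form
    exps = np.exp(logits - logits.max())     # softmax gives a valid distribution
    return exps / exps.sum()

prob_function = pi(2, theta)[1]              # compute pi(a=1 | s=2, theta)
print(prob_tabular, prob_function)
```

Updating the table changes an entry directly, whereas updating the parameterized policy changes \theta, which in turn changes the probabilities of all actions.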

The basic idea of the policy gradient method is summarized below. Suppose that J(\theta) is a scalar metric. Optimal policies can be obtained by optimizing this metric via the gradient-based algorithm:

\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t),

where \nabla_\theta J is the gradient of J with respect to \theta, t is the time step, and \alpha is the optimization rate.
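
The following is a minimal numerical sketch, not from the book, of the update \theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t). The metric J and its gradient used here are placeholders chosen purely for illustration; the actual metrics and their gradients are the subject of Sections 9.2 and 9.3.

```python
import numpy as np

alpha = 0.1                                   # optimization rate

def grad_J(theta):
    """Placeholder gradient of the metric J with respect to theta.
    Here J(theta) = -||theta - 1||^2 is used purely for illustration,
    so its gradient is -2 * (theta - 1)."""
    return -2.0 * (theta - 1.0)

theta = np.zeros(3)                           # initial parameter theta_0
for t in range(100):
    theta = theta + alpha * grad_J(theta)     # gradient-based (ascent) update

print(theta)                                  # approaches the maximizer of J
```

Since the goal is to maximize J, the gradient is added to \theta (gradient ascent) rather than subtracted as in gradient descent.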

With this basic idea, we will answer the following three questions in the remainder of this chapter.

\diamond What metrics should be used? (Section 9.2)
\diamond How to calculate the gradients of the metrics? (Section 9.3)
\diamond How to use experience samples to calculate the gradients? (Section 9.4)