
9.2 Metrics for defining optimal policies

When a policy is represented by a parameterized function, optimal policies can be defined by maximizing scalar metrics. There are two types of such metrics: one is based on state values and the other is based on immediate rewards.

Metric 1: Average state value

The first metric is the average state value, or simply the average value. It is defined as

$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s),$$

where $d(s)$ is the weight of state $s$. It satisfies $d(s) \geq 0$ for any $s \in \mathcal{S}$ and $\sum_{s \in \mathcal{S}} d(s) = 1$. Therefore, we can interpret $d$ as a probability distribution of $s$. Then, the metric can be written as

$$\bar{v}_\pi = \mathbb{E}_{S \sim d}\left[v_\pi(S)\right].$$

How should the distribution $d$ be selected? This is an important question. There are two cases.

The first and simplest case is that $d$ is independent of the policy $\pi$. In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$ to indicate that the distribution is independent of the policy. One option is to treat all the states as equally important and select $d_0(s) = 1/|\mathcal{S}|$. Another option arises when we are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$). In this case, we can design

$$d_0(s_0) = 1, \quad d_0(s \neq s_0) = 0.$$
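As a concrete illustration, here is a minimal sketch (not from the original text) that builds both choices of $d_0$ for a hypothetical three-state problem and evaluates the resulting weighted average $\bar{v}_\pi^0 = d_0^T v_\pi$; the state values in `v_pi` are made-up numbers used only for illustration.

```python
import numpy as np

# Hypothetical state values v_pi(s) for a three-state problem (illustrative numbers).
v_pi = np.array([1.0, 2.0, 3.0])
num_states = len(v_pi)

# Choice 1: treat all states as equally important.
d0_uniform = np.ones(num_states) / num_states   # d0(s) = 1/|S|

# Choice 2: only a specific starting state s0 (here, index 0) matters.
d0_onehot = np.zeros(num_states)
d0_onehot[0] = 1.0                              # d0(s0) = 1, d0(s != s0) = 0

# The metric is the weighted average of the state values: v_bar = d0^T v_pi.
print(d0_uniform @ v_pi)   # 2.0 (plain average of the state values)
print(d0_onehot @ v_pi)    # 1.0 (just v_pi(s0))
```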

The second case is that $d$ depends on the policy $\pi$. In this case, it is common to select $d$ as $d_\pi$, which is the stationary distribution under $\pi$. One basic property of $d_\pi$ is that it satisfies

$$d_\pi^T P_\pi = d_\pi^T,$$

where $P_\pi$ is the state transition probability matrix. More information about the stationary distribution can be found in Box 8.1.

The interpretation of selecting $d_\pi$ is as follows. The stationary distribution reflects the long-term behavior of a Markov decision process under a given policy. If one state is frequently visited in the long run, it is more important and deserves a higher weight; if a state is rarely visited, its importance is lower and it deserves a lower weight.
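To make the role of $d_\pi$ concrete, the following sketch computes the stationary distribution of a hypothetical three-state transition matrix $P_\pi$ (the numbers are made up for illustration) in two standard ways: by repeatedly applying $d^T \leftarrow d^T P_\pi$ and by extracting the left eigenvector of $P_\pi$ associated with eigenvalue $1$.

```python
import numpy as np

# Hypothetical state transition matrix P_pi under a fixed policy.
# Row s gives the probabilities of moving from state s to each next state.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])

# Method 1: power iteration on d^T <- d^T P_pi.
d = np.ones(3) / 3                      # start from the uniform distribution
for _ in range(1000):
    d = d @ P_pi
print(d)

# Method 2: left eigenvector of P_pi with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
idx = np.argmin(np.abs(eigvals - 1.0))
d_pi = np.real(eigvecs[:, idx])
d_pi = d_pi / d_pi.sum()
print(d_pi)                             # matches the power-iteration result

# Sanity check of the defining property d_pi^T P_pi = d_pi^T.
print(np.allclose(d_pi @ P_pi, d_pi))   # True
```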

As its name suggests, $\bar{v}_\pi$ is a weighted average of the state values. Since the policy is parameterized by $\theta$, different values of $\theta$ lead to different values of $\bar{v}_\pi$. Our ultimate goal is to find an optimal policy (or equivalently, an optimal $\theta$) that maximizes $\bar{v}_\pi$.

We next introduce two other important equivalent expressions of $\bar{v}_\pi$.

$\diamond$ Suppose that an agent collects rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. Readers may often see the following metric in the literature:

$$J(\theta) = \lim_{n \rightarrow \infty} \mathbb{E}\left[\sum_{t=0}^{n} \gamma^t R_{t+1}\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]. \tag{9.1}$$

This metric may be nontrivial to interpret at first glance. In fact, it is equal to $\bar{v}_\pi$. To see this, we have

$$\begin{aligned} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right] &= \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right] \\ &= \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s) \\ &= \bar{v}_\pi. \end{aligned}$$

The first equality in the above equation is due to the law of total expectation, with the starting state $S_0$ distributed according to $d$. The second equality is by the definition of state values.

The metric $\bar{v}_\pi$ can also be rewritten as the inner product of two vectors. In particular, let

$$v_\pi = [\dots, v_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d = [\dots, d(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}.$$

Then, we have

$$\bar{v}_\pi = d^T v_\pi.$$

This expression will be useful when we analyze its gradient.
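The inner-product form also makes (9.1) easy to check numerically. The sketch below is a toy example with made-up quantities (not from the text): it solves the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ in closed form, computes $\bar{v}_\pi = d^T v_\pi$, and compares it with a Monte Carlo estimate of $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$ with $S_0 \sim d$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical three-state chain under a fixed policy: transition matrix and
# expected immediate reward r_pi(s) for each state (illustrative numbers).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])

# State values from the Bellman equation: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Weighted average with a policy-independent uniform distribution d.
d = np.ones(3) / 3
print("d^T v_pi    =", d @ v_pi)

# Monte Carlo estimate of E[sum_t gamma^t R_{t+1}] with S_0 ~ d, as in (9.1).
returns = []
for _ in range(5000):
    s = rng.choice(3, p=d)
    g, discount = 0.0, 1.0
    for _ in range(200):                  # truncate the infinite sum
        g += discount * r_pi[s]           # reward collected in the current state
        discount *= gamma
        s = rng.choice(3, p=P_pi[s])
    returns.append(g)
print("MC estimate =", np.mean(returns))  # close to d^T v_pi
```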

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward [2, 64, 65]. In particular, it is defined as

$$\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S \sim d_\pi}\left[r_\pi(S)\right], \tag{9.2}$$

where $d_\pi$ is the stationary distribution and

$$r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s, \theta)\, r(s, a) = \mathbb{E}_{A \sim \pi(s, \theta)}\left[r(s, A) \,\middle|\, s\right] \tag{9.3}$$

is the expectation of the immediate rewards. Here, $r(s, a) \doteq \mathbb{E}[R|s, a] = \sum_r r\, p(r|s, a)$. We next present two other important equivalent expressions of $\bar{r}_\pi$.

$\diamond$ Suppose that the agent collects rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. A common metric that readers may often see in the literature is

$$J(\theta) = \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right]. \tag{9.4}$$

It may seem nontrivial to interpret this metric at first glance. In fact, it is equal to $\bar{r}_\pi$:

$$\lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \bar{r}_\pi. \tag{9.5}$$

The proof of (9.5) is given in Box 9.1.

The average reward $\bar{r}_\pi$ in (9.2) can also be written as the inner product of two vectors. In particular, let

$$r_\pi = [\dots, r_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d_\pi = [\dots, d_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|},$$

where $r_\pi(s)$ is defined in (9.3). Then, it is clear that

$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi.$$

This expression will be useful when we derive its gradient.
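The following sketch shows how the pieces fit together for a hypothetical example with three states and two actions (all numbers made up for illustration): it builds $r_\pi(s) = \sum_a \pi(a|s)\, r(s,a)$ as in (9.3), forms the induced transition matrix $P_\pi$, computes the stationary distribution, and evaluates $\bar{r}_\pi = d_\pi^T r_\pi$.

```python
import numpy as np

# Hypothetical tables for 3 states and 2 actions (illustrative numbers).
pi = np.array([[0.7, 0.3],           # pi(a|s): rows are states, columns are actions
               [0.5, 0.5],
               [0.2, 0.8]])
r_sa = np.array([[1.0, 0.0],         # r(s,a) = E[R|s,a]
                 [0.0, 2.0],
                 [3.0, 1.0]])
P_sa = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],   # p(s'|s,a), shape (s, a, s')
                 [[0.2, 0.7, 0.1], [0.0, 0.1, 0.9]],
                 [[0.5, 0.0, 0.5], [0.1, 0.1, 0.8]]])

# r_pi(s) = sum_a pi(a|s) r(s,a), as in (9.3).
r_pi = (pi * r_sa).sum(axis=1)

# Induced transition matrix P_pi(s'|s) = sum_a pi(a|s) p(s'|s,a).
P_pi = np.einsum('sa,sat->st', pi, P_sa)

# Stationary distribution d_pi via power iteration on d^T P_pi = d^T.
d_pi = np.ones(3) / 3
for _ in range(1000):
    d_pi = d_pi @ P_pi

# Average reward as the inner product d_pi^T r_pi.
print("r_bar =", d_pi @ r_pi)
```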

Box 9.1: Proof of (9.5)

Step 1: We first prove that the following equation is valid for any starting state $s_0 \in \mathcal{S}$:

$$\bar{r}_\pi = \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s_0\right]. \tag{9.6}$$

To do that, we notice

$$\begin{aligned} \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s_0\right] &= \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right] \\ &= \lim_{t \rightarrow \infty} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right], \end{aligned} \tag{9.7}$$

where the last equality is due to the property of the Cesàro mean (also called the Cesàro summation). In particular, if $\{a_k\}_{k=1}^{\infty}$ is a convergent sequence such that $\lim_{k \to \infty} a_k$ exists, then $\left\{\frac{1}{n}\sum_{k=1}^{n} a_k\right\}_{n=1}^{\infty}$ is also a convergent sequence and $\lim_{n \to \infty} \frac{1}{n}\sum_{k=1}^{n} a_k = \lim_{k \to \infty} a_k$.
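As a quick numerical illustration of the Cesàro mean property (a made-up example, not from the text), the sequence $a_k = 1 + 1/k$ converges to $1$, and its running averages converge to the same limit:

```python
import numpy as np

k = np.arange(1, 100001)
a = 1.0 + 1.0 / k                  # a convergent sequence with limit 1
running_avg = np.cumsum(a) / k     # (1/n) * sum_{k=1}^{n} a_k
print(a[-1], running_avg[-1])      # both are close to 1
```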

We next examine $\mathbb{E}[R_{t+1}|S_0 = s_0]$ in (9.7) more closely. By the law of total expectation, we have

$$\begin{aligned} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right] &= \sum_{s \in \mathcal{S}} \mathbb{E}\left[R_{t+1} \,\middle|\, S_t = s, S_0 = s_0\right] p^{(t)}(s|s_0) \\ &= \sum_{s \in \mathcal{S}} \mathbb{E}\left[R_{t+1} \,\middle|\, S_t = s\right] p^{(t)}(s|s_0) \\ &= \sum_{s \in \mathcal{S}} r_\pi(s)\, p^{(t)}(s|s_0), \end{aligned}$$

where $p^{(t)}(s|s_0)$ denotes the probability of transitioning from $s_0$ to $s$ in exactly $t$ steps. The second equality in the above equation is due to the Markov memoryless property: the reward obtained at the next time step depends only on the current state rather than the previous ones.

Note that

$$\lim_{t \to \infty} p^{(t)}(s|s_0) = d_\pi(s)$$

by the definition of the stationary distribution. As a result, the limit does not depend on the starting state $s_0$. Then, we have

$$\lim_{t \to \infty} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right] = \lim_{t \to \infty} \sum_{s \in \mathcal{S}} r_\pi(s)\, p^{(t)}(s|s_0) = \sum_{s \in \mathcal{S}} r_\pi(s)\, d_\pi(s) = \bar{r}_\pi.$$

Substituting the above equation into (9.7) gives (9.6).

Step 2: Consider an arbitrary state distribution $d$. By the law of total expectation, we have

$$\begin{aligned} \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] &= \lim_{n \to \infty} \frac{1}{n} \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s\right] \\ &= \sum_{s \in \mathcal{S}} d(s) \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s\right]. \end{aligned}$$

Since (9.6) is valid for any starting state, substituting (9.6) into the above equation yields

$$\lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] = \sum_{s \in \mathcal{S}} d(s)\, \bar{r}_\pi = \bar{r}_\pi.$$

The proof is complete.
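To make the conclusion of Box 9.1 tangible, the following sketch simulates long trajectories of the hypothetical three-state chain used in the earlier sketches (made-up numbers): the empirical average of the collected rewards approaches $\bar{r}_\pi = d_\pi^T r_\pi$ regardless of the starting state, as claimed by (9.5) and (9.6).

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical chain as before: transition matrix and expected reward per state.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])

# Closed-form average reward: d_pi^T r_pi.
d_pi = np.ones(3) / 3
for _ in range(1000):
    d_pi = d_pi @ P_pi
print("d_pi^T r_pi =", d_pi @ r_pi)

# Empirical long-run average (1/n) sum_{t=0}^{n-1} R_{t+1}, as in (9.4)-(9.6).
n = 100_000
for s0 in range(3):                     # any starting state gives the same limit
    s, total = s0, 0.0
    for _ in range(n):
        total += r_pi[s]                # reward collected in the current state
        s = rng.choice(3, p=P_pi[s])
    print(f"start s0={s0}: average reward =", total / n)
```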

Some remarks

Table 9.2: Summary of the different but equivalent expressions of $\bar{v}_\pi$ and $\bar{r}_\pi$.

| Metric 1: average value | Metric 2: average reward |
| --- | --- |
| $\bar{v}_\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$ | $\bar{r}_\pi = \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right]$ |
| $\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$ | $\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s)$ |
| $\bar{v}_\pi = d^T v_\pi$ | $\bar{r}_\pi = d_\pi^T r_\pi$ |

Up to now, we have introduced two types of metrics: $\bar{v}_\pi$ and $\bar{r}_\pi$. Each metric has several different but equivalent expressions, which are summarized in Table 9.2. We sometimes use $\bar{v}_\pi$ to specifically refer to the case where the state distribution is the stationary distribution $d_\pi$, and use $\bar{v}_\pi^0$ to refer to the case where the distribution $d_0$ is independent of $\pi$. Some remarks about the metrics are given below.

$\diamond$ All these metrics are functions of $\pi$. Since $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$. In other words, different values of $\theta$ can generate different metric values. Therefore, we can search for the optimal values of $\theta$ to maximize these metrics. This is the basic idea of policy gradient methods.

$\diamond$ The two metrics $\bar{v}_\pi$ and $\bar{r}_\pi$ are equivalent in the discounted case where $\gamma < 1$. In particular, it can be shown that

$$\bar{r}_\pi = (1 - \gamma)\bar{v}_\pi.$$

The above equation indicates that these two metrics can be simultaneously maximized. The proof of this equation is given later in Lemma 9.1.
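Before the formal proof in Lemma 9.1, the identity can be checked numerically; note that it concerns $\bar{v}_\pi$ with the state weight chosen as the stationary distribution $d_\pi$. The sketch below reuses the hypothetical three-state chain from the earlier examples (made-up numbers).

```python
import numpy as np

gamma = 0.9

# Hypothetical chain: transition matrix and expected reward per state.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])

# Stationary distribution d_pi and state values v_pi.
d_pi = np.ones(3) / 3
for _ in range(1000):
    d_pi = d_pi @ P_pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

r_bar = d_pi @ r_pi        # average reward
v_bar = d_pi @ v_pi        # average value with d = d_pi

print(r_bar, (1 - gamma) * v_bar)               # the two numbers coincide
print(np.isclose(r_bar, (1 - gamma) * v_bar))   # True
```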
