
9.2 Metrics for defining optimal policies

When a policy is represented by a parameterized function, optimal policies can be defined by maximizing scalar metrics. There are two types of such metrics: one is based on state values and the other is based on immediate rewards.

Metric 1: Average state value

The first metric is the average state value, or simply the average value. It is defined as

$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s),$$

where $d(s)$ is the weight of state $s$. It satisfies $d(s) \geq 0$ for any $s \in \mathcal{S}$ and $\sum_{s \in \mathcal{S}} d(s) = 1$. Therefore, we can interpret $d$ as a probability distribution of $s$. Then, the metric can be written as

$$\bar{v}_\pi = \mathbb{E}_{S \sim d}\left[v_\pi(S)\right].$$

How should the distribution $d$ be selected? This is an important question. There are two cases.

The first and simplest case is that $d$ is independent of the policy $\pi$. In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$ to indicate that the distribution is independent of the policy. One option is to treat all the states as equally important and select $d_0(s) = 1/|\mathcal{S}|$. Another option arises when we are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$). In this case, we can design

$$d_0(s_0) = 1, \quad d_0(s \neq s_0) = 0.$$
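As a concrete illustration, here is a minimal sketch (not from the original text) that builds both choices of $d_0$ for a hypothetical three-state problem and evaluates the resulting weighted average $\bar{v}_\pi^0 = d_0^T v_\pi$; the state values in `v_pi` are made-up numbers used only for illustration.

```python
import numpy as np

# Hypothetical state values v_pi(s) for a three-state problem (illustrative numbers).
v_pi = np.array([1.0, 2.0, 3.0])
num_states = len(v_pi)

# Choice 1: treat all states as equally important.
d0_uniform = np.ones(num_states) / num_states   # d0(s) = 1/|S|

# Choice 2: only a specific starting state s0 (here, index 0) matters.
d0_onehot = np.zeros(num_states)
d0_onehot[0] = 1.0                              # d0(s0) = 1, d0(s != s0) = 0

# The metric is the weighted average of the state values: v_bar = d0^T v_pi.
print(d0_uniform @ v_pi)   # 2.0 (plain average of the state values)
print(d0_onehot @ v_pi)    # 1.0 (just v_pi(s0))
```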

The second case is that $d$ depends on the policy $\pi$. In this case, it is common to select $d$ as $d_\pi$, which is the stationary distribution under $\pi$. One basic property of $d_\pi$ is that it satisfies

$$d_\pi^T P_\pi = d_\pi^T,$$

where $P_\pi$ is the state transition probability matrix. More information about the stationary distribution can be found in Box 8.1.

The interpretation of selecting $d_\pi$ is as follows. The stationary distribution reflects the long-term behavior of a Markov decision process under a given policy. If one state is frequently visited in the long run, it is more important and deserves a higher weight; if a state is rarely visited, its importance is lower and it deserves a lower weight.
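To make the role of $d_\pi$ concrete, the following sketch computes the stationary distribution of a hypothetical three-state transition matrix $P_\pi$ (the numbers are made up for illustration) in two standard ways: by repeatedly applying $d^T \leftarrow d^T P_\pi$ and by extracting the left eigenvector of $P_\pi$ associated with eigenvalue $1$.

```python
import numpy as np

# Hypothetical state transition matrix P_pi under a fixed policy.
# Row s gives the probabilities of moving from state s to each next state.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])

# Method 1: power iteration on d^T <- d^T P_pi.
d = np.ones(3) / 3                      # start from the uniform distribution
for _ in range(1000):
    d = d @ P_pi
print(d)

# Method 2: left eigenvector of P_pi with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
idx = np.argmin(np.abs(eigvals - 1.0))
d_pi = np.real(eigvecs[:, idx])
d_pi = d_pi / d_pi.sum()
print(d_pi)                             # matches the power-iteration result

# Sanity check of the defining property d_pi^T P_pi = d_pi^T.
print(np.allclose(d_pi @ P_pi, d_pi))   # True
```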

As its name suggests, $\bar{v}_\pi$ is a weighted average of the state values. Since the policy is parameterized by $\theta$, different values of $\theta$ lead to different values of $\bar{v}_\pi$. Our ultimate goal is to find an optimal policy (or equivalently, an optimal $\theta$) that maximizes $\bar{v}_\pi$.

We next introduce two other important equivalent expressions of $\bar{v}_\pi$.

$\diamond$ Suppose that an agent collects rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. Readers may often see the following metric in the literature:

$$J(\theta) = \lim_{n \rightarrow \infty} \mathbb{E}\left[\sum_{t=0}^{n} \gamma^t R_{t+1}\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]. \tag{9.1}$$

This metric may be nontrivial to interpret at first glance. In fact, it is equal to $\bar{v}_\pi$. To see this, we have

$$\begin{aligned} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right] &= \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_0 = s\right] \\ &= \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s) \\ &= \bar{v}_\pi. \end{aligned}$$

The first equality in the above equation is due to the law of total expectation, with the starting state $S_0$ distributed according to $d$. The second equality is by the definition of state values.

The metric $\bar{v}_\pi$ can also be rewritten as the inner product of two vectors. In particular, let

$$v_\pi = [\dots, v_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d = [\dots, d(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}.$$

Then, we have

$$\bar{v}_\pi = d^T v_\pi.$$

This expression will be useful when we analyze its gradient.
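The inner-product form also makes (9.1) easy to check numerically. The sketch below is a toy example with made-up quantities (not from the text): it solves the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ in closed form, computes $\bar{v}_\pi = d^T v_\pi$, and compares it with a Monte Carlo estimate of $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$ with $S_0 \sim d$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical three-state chain under a fixed policy: transition matrix and
# expected immediate reward r_pi(s) for each state (illustrative numbers).
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])

# State values from the Bellman equation: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Weighted average with a policy-independent uniform distribution d.
d = np.ones(3) / 3
print("d^T v_pi    =", d @ v_pi)

# Monte Carlo estimate of E[sum_t gamma^t R_{t+1}] with S_0 ~ d, as in (9.1).
returns = []
for _ in range(5000):
    s = rng.choice(3, p=d)
    g, discount = 0.0, 1.0
    for _ in range(200):                  # truncate the infinite sum
        g += discount * r_pi[s]           # reward collected in the current state
        discount *= gamma
        s = rng.choice(3, p=P_pi[s])
    returns.append(g)
print("MC estimate =", np.mean(returns))  # close to d^T v_pi
```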

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward [2, 64, 65]. In particular, it is defined as

$$\bar{r}_\pi \doteq \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S \sim d_\pi}\left[r_\pi(S)\right], \tag{9.2}$$

where $d_\pi$ is the stationary distribution and

$$r_\pi(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s, \theta)\, r(s, a) = \mathbb{E}_{A \sim \pi(s, \theta)}\left[r(s, A) \,\middle|\, s\right] \tag{9.3}$$

is the expectation of the immediate rewards. Here, $r(s, a) \doteq \mathbb{E}[R|s, a] = \sum_r r\, p(r|s, a)$. We next present two other important equivalent expressions of $\bar{r}_\pi$.

$\diamond$ Suppose that the agent collects rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. A common metric that readers may often see in the literature is

$$J(\theta) = \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right]. \tag{9.4}$$

It may seem nontrivial to interpret this metric at first glance. In fact, it is equal to $\bar{r}_\pi$:

$$\lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \bar{r}_\pi. \tag{9.5}$$

The proof of (9.5) is given in Box 9.1.

The average reward $\bar{r}_\pi$ in (9.2) can also be written as the inner product of two vectors. In particular, let

$$r_\pi = [\dots, r_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d_\pi = [\dots, d_\pi(s), \dots]^T \in \mathbb{R}^{|\mathcal{S}|},$$

where $r_\pi(s)$ is defined in (9.3). Then, it is clear that

$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = d_\pi^T r_\pi.$$

This expression will be useful when we derive its gradient.
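The following sketch shows how the pieces fit together for a hypothetical example with three states and two actions (all numbers made up for illustration): it builds $r_\pi(s) = \sum_a \pi(a|s)\, r(s,a)$ as in (9.3), forms the induced transition matrix $P_\pi$, computes the stationary distribution, and evaluates $\bar{r}_\pi = d_\pi^T r_\pi$.

```python
import numpy as np

# Hypothetical tables for 3 states and 2 actions (illustrative numbers).
pi = np.array([[0.7, 0.3],           # pi(a|s): rows are states, columns are actions
               [0.5, 0.5],
               [0.2, 0.8]])
r_sa = np.array([[1.0, 0.0],         # r(s,a) = E[R|s,a]
                 [0.0, 2.0],
                 [3.0, 1.0]])
P_sa = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],   # p(s'|s,a), shape (s, a, s')
                 [[0.2, 0.7, 0.1], [0.0, 0.1, 0.9]],
                 [[0.5, 0.0, 0.5], [0.1, 0.1, 0.8]]])

# r_pi(s) = sum_a pi(a|s) r(s,a), as in (9.3).
r_pi = (pi * r_sa).sum(axis=1)

# Induced transition matrix P_pi(s'|s) = sum_a pi(a|s) p(s'|s,a).
P_pi = np.einsum('sa,sat->st', pi, P_sa)

# Stationary distribution d_pi via power iteration on d^T P_pi = d^T.
d_pi = np.ones(3) / 3
for _ in range(1000):
    d_pi = d_pi @ P_pi

# Average reward as the inner product d_pi^T r_pi.
print("r_bar =", d_pi @ r_pi)
```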

Box 9.1: Proof of (9.5)

Step 1: We first prove that the following equation is valid for any starting state $s_0 \in \mathcal{S}$:

$$\bar{r}_\pi = \lim_{n \rightarrow \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s_0\right]. \tag{9.6}$$

To do that, we notice

$$\begin{aligned} \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s_0\right] &= \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right] \\ &= \lim_{t \rightarrow \infty} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right], \end{aligned} \tag{9.7}$$

where the last equality is due to the property of the Cesàro mean (also called the Cesàro summation). In particular, if $\{a_k\}_{k=1}^{\infty}$ is a convergent sequence such that $\lim_{k \to \infty} a_k$ exists, then $\left\{\frac{1}{n}\sum_{k=1}^{n} a_k\right\}_{n=1}^{\infty}$ is also a convergent sequence and $\lim_{n \to \infty} \frac{1}{n}\sum_{k=1}^{n} a_k = \lim_{k \to \infty} a_k$.
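As a quick numerical illustration of the Cesàro mean property (a made-up example, not from the text), the sequence $a_k = 1 + 1/k$ converges to $1$, and its running averages converge to the same limit:

```python
import numpy as np

k = np.arange(1, 100001)
a = 1.0 + 1.0 / k                  # a convergent sequence with limit 1
running_avg = np.cumsum(a) / k     # (1/n) * sum_{k=1}^{n} a_k
print(a[-1], running_avg[-1])      # both are close to 1
```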

We next examine $\mathbb{E}[R_{t+1}|S_0 = s_0]$ in (9.7) more closely. By the law of total expectation, we have

$$\begin{aligned} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right] &= \sum_{s \in \mathcal{S}} \mathbb{E}\left[R_{t+1} \,\middle|\, S_t = s, S_0 = s_0\right] p^{(t)}(s|s_0) \\ &= \sum_{s \in \mathcal{S}} \mathbb{E}\left[R_{t+1} \,\middle|\, S_t = s\right] p^{(t)}(s|s_0) \\ &= \sum_{s \in \mathcal{S}} r_\pi(s)\, p^{(t)}(s|s_0), \end{aligned}$$

where $p^{(t)}(s|s_0)$ denotes the probability of transitioning from $s_0$ to $s$ in exactly $t$ steps. The second equality in the above equation is due to the Markov memoryless property: the reward obtained at the next time step depends only on the current state rather than the previous ones.

Note that

$$\lim_{t \to \infty} p^{(t)}(s|s_0) = d_\pi(s)$$

by the definition of the stationary distribution. As a result, the limit does not depend on the starting state $s_0$. Then, we have

$$\lim_{t \to \infty} \mathbb{E}\left[R_{t+1} \,\middle|\, S_0 = s_0\right] = \lim_{t \to \infty} \sum_{s \in \mathcal{S}} r_\pi(s)\, p^{(t)}(s|s_0) = \sum_{s \in \mathcal{S}} r_\pi(s)\, d_\pi(s) = \bar{r}_\pi.$$

Substituting the above equation into (9.7) gives (9.6).

Step 2: Consider an arbitrary state distribution $d$. By the law of total expectation, we have

$$\begin{aligned} \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] &= \lim_{n \to \infty} \frac{1}{n} \sum_{s \in \mathcal{S}} d(s)\, \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s\right] \\ &= \sum_{s \in \mathcal{S}} d(s) \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1} \,\middle|\, S_0 = s\right]. \end{aligned}$$

Since (9.6) is valid for any starting state, substituting (9.6) into the above equation yields

$$\lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] = \sum_{s \in \mathcal{S}} d(s)\, \bar{r}_\pi = \bar{r}_\pi.$$

The proof is complete.
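To make the conclusion of Box 9.1 tangible, the following sketch simulates long trajectories of the hypothetical three-state chain used in the earlier sketches (made-up numbers): the empirical average of the collected rewards approaches $\bar{r}_\pi = d_\pi^T r_\pi$ regardless of the starting state, as claimed by (9.5) and (9.6).

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical chain as before: transition matrix and expected reward per state.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])

# Closed-form average reward: d_pi^T r_pi.
d_pi = np.ones(3) / 3
for _ in range(1000):
    d_pi = d_pi @ P_pi
print("d_pi^T r_pi =", d_pi @ r_pi)

# Empirical long-run average (1/n) sum_{t=0}^{n-1} R_{t+1}, as in (9.4)-(9.6).
n = 100_000
for s0 in range(3):                     # any starting state gives the same limit
    s, total = s0, 0.0
    for _ in range(n):
        total += r_pi[s]                # reward collected in the current state
        s = rng.choice(3, p=P_pi[s])
    print(f"start s0={s0}: average reward =", total / n)
```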

Some remarks

Table 9.2: Summary of the different but equivalent expressions of $\bar{v}_\pi$ and $\bar{r}_\pi$.

| Metric 1: average value | Metric 2: average reward |
| --- | --- |
| $\bar{v}_\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$ | $\bar{r}_\pi = \lim_{n \to \infty} \frac{1}{n} \mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right]$ |
| $\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$ | $\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s)$ |
| $\bar{v}_\pi = d^T v_\pi$ | $\bar{r}_\pi = d_\pi^T r_\pi$ |

Up to now, we have introduced two types of metrics: $\bar{v}_\pi$ and $\bar{r}_\pi$. Each metric has several different but equivalent expressions, which are summarized in Table 9.2. We sometimes use $\bar{v}_\pi$ to specifically refer to the case where the state distribution is the stationary distribution $d_\pi$, and use $\bar{v}_\pi^0$ to refer to the case where the distribution $d_0$ is independent of $\pi$. Some remarks about the metrics are given below.

$\diamond$ All these metrics are functions of $\pi$. Since $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$. In other words, different values of $\theta$ can generate different metric values. Therefore, we can search for the optimal values of $\theta$ to maximize these metrics. This is the basic idea of policy gradient methods.

$\diamond$ The two metrics $\bar{v}_\pi$ and $\bar{r}_\pi$ are equivalent in the discounted case where $\gamma < 1$. In particular, it can be shown that

$$\bar{r}_\pi = (1 - \gamma)\bar{v}_\pi.$$

The above equation indicates that these two metrics can be simultaneously maximized. The proof of this equation is given later in Lemma 9.1.
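Before the formal proof in Lemma 9.1, the identity can be checked numerically; note that it concerns $\bar{v}_\pi$ with the state weight chosen as the stationary distribution $d_\pi$. The sketch below reuses the hypothetical three-state chain from the earlier examples (made-up numbers).

```python
import numpy as np

gamma = 0.9

# Hypothetical chain: transition matrix and expected reward per state.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 2.0])

# Stationary distribution d_pi and state values v_pi.
d_pi = np.ones(3) / 3
for _ in range(1000):
    d_pi = d_pi @ P_pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

r_bar = d_pi @ r_pi        # average reward
v_bar = d_pi @ v_pi        # average value with d = d_pi

print(r_bar, (1 - gamma) * v_bar)               # the two numbers coincide
print(np.isclose(r_bar, (1 - gamma) * v_bar))   # True
```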
