9.2 Metrics for defining optimal policies
If a policy is represented by a function, there are two types of metrics for defining optimal policies: one is based on state values, and the other is based on immediate rewards.
Metric 1: Average state value
The first metric is the average state value, or simply the average value. It is defined as
$$\bar{v}_\pi = \sum_{s\in\mathcal{S}} d(s) v_\pi(s),$$
where $d(s)$ is the weight of state $s$. It satisfies $d(s)\geq 0$ for any $s\in\mathcal{S}$ and $\sum_{s\in\mathcal{S}} d(s)=1$. Therefore, we can interpret $d(s)$ as a probability distribution of $s$. Then, the metric can be written as
$$\bar{v}_\pi = \mathbb{E}_{S\sim d}[v_\pi(S)].$$
How should the distribution $d$ be selected? This is an important question. There are two cases.
The first and simplest case is that $d$ is independent of the policy $\pi$. In this case, we specifically denote $d$ as $d_0$ and $\bar{v}_\pi$ as $\bar{v}_\pi^0$ to indicate that the distribution is independent of the policy. One option is to treat all the states as equally important and select $d_0(s)=1/|\mathcal{S}|$. Another option is that we are only interested in a specific state $s_0$ (e.g., the agent always starts from $s_0$). In this case, we can design
$$d_0(s_0)=1, \qquad d_0(s)=0 \ \text{ for all } s\neq s_0.$$
The second case is that $d$ is dependent on the policy $\pi$. In this case, it is common to select $d$ as $d_\pi$, which is the stationary distribution under $\pi$. One basic property of $d_\pi$ is that it satisfies
$$d_\pi^T P_\pi = d_\pi^T,$$
where $P_\pi$ is the state transition probability matrix. More information about the stationary distribution can be found in Box 8.1.
The interpretation of selecting $d_\pi$ is as follows. The stationary distribution reflects the long-run behavior of a Markov decision process under a given policy. If a state is frequently visited in the long run, it is more important and deserves a higher weight; if a state is rarely visited, it is less important and deserves a lower weight.
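To make the stationary distribution concrete, below is a minimal Python sketch for a hypothetical two-state, two-action MDP; the policy, the transition probabilities, and all variable names are invented purely for illustration. It builds $P_\pi$ from $\pi(a|s)$ and $p(s'|s,a)$, obtains $d_\pi$ by power iteration, and checks that $d_\pi^T P_\pi = d_\pi^T$.

```python
import numpy as np

# Hypothetical two-state, two-action MDP, used only for illustration.
# pi[s, a] = pi(a|s);  P[a, s, s'] = p(s'|s, a).
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
P = np.array([[[0.9, 0.1],   # transitions under action 0
               [0.2, 0.8]],
              [[0.3, 0.7],   # transitions under action 1
               [0.6, 0.4]]])

# State transition matrix under the policy: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a).
P_pi = np.einsum('sa,ast->st', pi, P)

# Power iteration: repeatedly apply d^T <- d^T P_pi until it stops changing.
d_pi = np.full(2, 0.5)
for _ in range(1000):
    d_pi = d_pi @ P_pi

print(d_pi)                            # stationary distribution d_pi
print(np.allclose(d_pi @ P_pi, d_pi))  # True: d_pi^T P_pi = d_pi^T
```

Equivalently, $d_\pi$ is the left eigenvector of $P_\pi$ associated with eigenvalue $1$, normalized so that its entries sum to one.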
As its name suggests, $\bar{v}_\pi$ is a weighted average of the state values. Different values of $\theta$ lead to different values of $\bar{v}_\pi$. Our ultimate goal is to find an optimal policy (or equivalently, an optimal $\theta$) to maximize $\bar{v}_\pi$.
We next introduce two other important equivalent expressions of $\bar{v}_\pi$.
⋄ Suppose that an agent collects the rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. Readers may often see the following metric in the literature:
$$J(\theta) = \lim_{n\to\infty} \mathbb{E}\left[\sum_{t=0}^{n} \gamma^t R_{t+1}\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]. \tag{9.1}$$
This metric may be nontrivial to interpret at first glance. In fact, it is equal to $\bar{v}_\pi$. To see that, we have
$$\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\right] = \sum_{s\in\mathcal{S}} d(s)\,\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\,\Big|\,S_0=s\right] = \sum_{s\in\mathcal{S}} d(s) v_\pi(s) = \bar{v}_\pi.$$
The first equality in the above equation is due to the law of total expectation, with the starting state distributed as $S_0\sim d$. The second equality is by the definition of state values.
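The expectation in (9.1) can also be estimated by simulation. The following Monte Carlo sketch assumes a hypothetical two-state, two-action MDP (the policy, transition, and reward tables are invented for illustration) and a uniform, policy-independent distribution $d_0$: it samples $S_0\sim d_0$, rolls out the policy for a long but finite horizon, and averages the truncated discounted returns, which approximates $\sum_{s\in\mathcal{S}} d_0(s) v_\pi(s) = \bar{v}_\pi^0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state, two-action MDP, used only for illustration.
pi = np.array([[0.7, 0.3], [0.4, 0.6]])      # pi[s, a] = pi(a|s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[a, s, s'] = p(s'|s, a)
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])       # r[s, a], deterministic for simplicity
d0 = np.array([0.5, 0.5])                    # policy-independent distribution of S_0
gamma, horizon, episodes = 0.9, 200, 2000    # gamma**200 is negligible, so truncation is safe

# Monte Carlo estimate of E[sum_t gamma^t R_{t+1}] with S_0 ~ d0.
returns = []
for _ in range(episodes):
    s = rng.choice(2, p=d0)
    g = 0.0
    for t in range(horizon):
        a = rng.choice(2, p=pi[s])
        g += gamma**t * r[s, a]
        s = rng.choice(2, p=P[a, s])
    returns.append(g)

print(np.mean(returns))                      # approximates sum_s d0(s) v_pi(s)
```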
The metric $\bar{v}_\pi$ can also be rewritten as the inner product of two vectors. In particular, let
$$v_\pi = [\ldots, v_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d = [\ldots, d(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}.$$
Then, we have
$$\bar{v}_\pi = d^T v_\pi.$$
This expression will be useful when we analyze its gradient.
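For a small state space, this inner product can be evaluated in closed form. The sketch below reuses the illustrative numbers from the previous sketches ($P_\pi$ and $r_\pi$ are assumed to be given): it solves the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$ for the state values and then forms $d^T v_\pi$ with the uniform weight $d_0(s)=1/|\mathcal{S}|$; the result should agree with the Monte Carlo estimate above.

```python
import numpy as np

# Illustrative two-state MDP under a fixed policy (same numbers as the earlier sketches).
gamma = 0.9
P_pi = np.array([[0.72, 0.28],        # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
                 [0.44, 0.56]])
r_pi = np.array([0.7, 1.4])           # r_pi(s) = sum_a pi(a|s) r(s, a)

# Solve the Bellman equation v_pi = r_pi + gamma * P_pi v_pi for the state values.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Average state value as an inner product, here with uniform weights d0(s) = 1/|S|.
d0 = np.full(2, 0.5)
print(v_pi, d0 @ v_pi)                # v_pi and bar{v}_pi = d0^T v_pi
```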
Metric 2: Average reward
The second metric is the average one-step reward, or simply the average reward [2, 64, 65]. In particular, it is defined as
$$\bar{r}_\pi \doteq \sum_{s\in\mathcal{S}} d_\pi(s) r_\pi(s) = \mathbb{E}_{S\sim d_\pi}[r_\pi(S)], \tag{9.2}$$
where $d_\pi$ is the stationary distribution and
$$r_\pi(s) \doteq \sum_{a\in\mathcal{A}} \pi(a|s,\theta) r(s,a) = \mathbb{E}_{A\sim\pi(s,\theta)}[r(s,A)|s] \tag{9.3}$$
is the expectation of the immediate rewards. Here, $r(s,a) \doteq \mathbb{E}[R|s,a] = \sum_r r\,p(r|s,a)$. We next present two other important equivalent expressions of $\bar{r}_\pi$.
⋄ Suppose that the agent collects the rewards $\{R_{t+1}\}_{t=0}^{\infty}$ by following a given policy $\pi(\theta)$. Another metric that readers may often see in the literature is
$$J(\theta) = \lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right]. \tag{9.4}$$
It may seem nontrivial to interpret this metric at first glance. In fact, it is equal to $\bar{r}_\pi$:
$$\lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] = \sum_{s\in\mathcal{S}} d_\pi(s) r_\pi(s) = \bar{r}_\pi. \tag{9.5}$$
The proof of (9.5) is given in Box 9.1.
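Before the proof, (9.5) can be made plausible by simulation: run one long trajectory under the policy and track the running average of the rewards. A minimal sketch, again using the illustrative two-state MDP from the earlier snippets, is given below; the running average settles near $\bar{r}_\pi$ regardless of the starting state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state, two-action MDP (same numbers as the earlier sketches).
pi = np.array([[0.7, 0.3], [0.4, 0.6]])      # pi[s, a] = pi(a|s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[a, s, s'] = p(s'|s, a)
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])       # r[s, a], deterministic for simplicity

# One long trajectory; (1/n) * sum_{t=0}^{n-1} R_{t+1} approaches bar{r}_pi.
n, s, total = 100_000, 0, 0.0
for _ in range(n):
    a = rng.choice(2, p=pi[s])
    total += r[s, a]
    s = rng.choice(2, p=P[a, s])

print(total / n)                             # close to bar{r}_pi (roughly 0.97 for these numbers)
```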
The average reward $\bar{r}_\pi$ in (9.2) can also be written as the inner product of two vectors. In particular, let
$$r_\pi = [\ldots, r_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d_\pi = [\ldots, d_\pi(s), \ldots]^T \in \mathbb{R}^{|\mathcal{S}|},$$
where $r_\pi(s)$ is defined in (9.3). Then, it is clear that
$$\bar{r}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s) r_\pi(s) = d_\pi^T r_\pi.$$
This expression will be useful when we derive its gradient.
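The same quantity can also be computed in closed form from the inner-product expression: build $r_\pi$ from $\pi$ and $r(s,a)$ as in (9.3), obtain $d_\pi$ by power iteration, and take $d_\pi^T r_\pi$. The sketch below uses the same illustrative numbers as before and should agree with the simulation above.

```python
import numpy as np

# Illustrative two-state, two-action MDP (same numbers as the earlier sketches).
pi = np.array([[0.7, 0.3], [0.4, 0.6]])      # pi[s, a] = pi(a|s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[a, s, s'] = p(s'|s, a)
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])       # r[s, a] = E[R | s, a]

P_pi = np.einsum('sa,ast->st', pi, P)        # state transition matrix under pi
r_pi = (pi * r).sum(axis=1)                  # r_pi(s) = sum_a pi(a|s) r(s, a), as in (9.3)

d_pi = np.full(2, 0.5)
for _ in range(1000):                        # power iteration for the stationary distribution
    d_pi = d_pi @ P_pi

print(d_pi @ r_pi)                           # bar{r}_pi = d_pi^T r_pi
```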
Box 9.1: Proof of (9.5)
Step 1: We first prove that the following equation is valid for any starting state $s_0\in\mathcal{S}$:
$$\bar{r}_\pi = \lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\,\Big|\,S_0=s_0\right]. \tag{9.6}$$
To do that, we notice that
$$\lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\,\Big|\,S_0=s_0\right] = \lim_{n\to\infty} \frac{1}{n}\sum_{t=0}^{n-1}\mathbb{E}[R_{t+1}|S_0=s_0] = \lim_{t\to\infty}\mathbb{E}[R_{t+1}|S_0=s_0], \tag{9.7}$$
where the last equality is due to the property of the Cesàro mean (also called the Cesàro summation). In particular, if $\{a_k\}_{k=1}^{\infty}$ is a convergent sequence such that $\lim_{k\to\infty} a_k$ exists, then $\{\frac{1}{n}\sum_{k=1}^{n} a_k\}_{n=1}^{\infty}$ is also a convergent sequence and $\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n} a_k = \lim_{k\to\infty} a_k$.
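As a quick numerical illustration of the Cesàro-mean property (with an arbitrarily chosen toy sequence): a sequence converging to $1$ has running averages that also converge to $1$, only more slowly.

```python
import numpy as np

# Cesaro mean: if a_k converges to L, the running averages (1/n) * sum_{k=1}^n a_k converge to L too.
k = np.arange(1, 100_001)
a = 1.0 + 1.0 / k             # a convergent sequence with limit L = 1
cesaro = np.cumsum(a) / k     # running averages (Cesaro means)
print(a[-1], cesaro[-1])      # both are close to 1; the Cesaro mean converges more slowly
```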
We next examine $\mathbb{E}[R_{t+1}|S_0=s_0]$ in (9.7) more closely. By the law of total expectation, we have
$$\mathbb{E}[R_{t+1}|S_0=s_0] = \sum_{s\in\mathcal{S}}\mathbb{E}[R_{t+1}|S_t=s, S_0=s_0]\,p^{(t)}(s|s_0) = \sum_{s\in\mathcal{S}}\mathbb{E}[R_{t+1}|S_t=s]\,p^{(t)}(s|s_0) = \sum_{s\in\mathcal{S}} r_\pi(s)\,p^{(t)}(s|s_0),$$
where $p^{(t)}(s|s_0)$ denotes the probability of transitioning from $s_0$ to $s$ using exactly $t$ steps. The second equality in the above equation is due to the Markov memoryless property: the reward obtained at the next time step depends only on the current state rather than the previous ones.
Note that
$$\lim_{t\to\infty} p^{(t)}(s|s_0) = d_\pi(s)$$
by the definition of the stationary distribution. As a result, the starting state $s_0$ does not matter. Then, we have
$$\lim_{t\to\infty}\mathbb{E}[R_{t+1}|S_0=s_0] = \lim_{t\to\infty}\sum_{s\in\mathcal{S}} r_\pi(s)\,p^{(t)}(s|s_0) = \sum_{s\in\mathcal{S}} r_\pi(s) d_\pi(s) = \bar{r}_\pi.$$
Substituting the above equation into (9.7) gives (9.6).
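The limit $\lim_{t\to\infty} p^{(t)}(s|s_0)=d_\pi(s)$ can also be observed numerically: the $s_0$-th row of $P_\pi^t$ is the distribution of $S_t$ given $S_0=s_0$, and for the illustrative two-state transition matrix used in the earlier sketches, all rows of $P_\pi^t$ quickly become identical.

```python
import numpy as np

# Illustrative state transition matrix under a fixed policy (same numbers as before).
P_pi = np.array([[0.72, 0.28],
                 [0.44, 0.56]])

# The s0-th row of P_pi^t is p^(t)(. | s0); for large t every row approaches d_pi.
Pt = np.linalg.matrix_power(P_pi, 50)
print(Pt)        # the two rows are numerically identical: the starting state no longer matters
print(Pt[0])     # each row approximates the stationary distribution d_pi
```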
Step 2: Consider an arbitrary distribution $d$ of the starting state $S_0$. By the law of total expectation, we have
$$\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] = \lim_{n\to\infty}\frac{1}{n}\sum_{s\in\mathcal{S}} d(s)\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\,\Big|\,S_0=s\right] = \sum_{s\in\mathcal{S}} d(s)\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\,\Big|\,S_0=s\right].$$
Since (9.6) is valid for any starting state, substituting (9.6) into the above equation yields
$$\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\left[\sum_{t=0}^{n-1} R_{t+1}\right] = \sum_{s\in\mathcal{S}} d(s)\,\bar{r}_\pi = \bar{r}_\pi.$$
The proof is complete.
Some remarks
Table 9.2: Summary of the different but equivalent expressions of $\bar{v}_\pi$ and $\bar{r}_\pi$.
$\bar{v}_\pi$: $\ \sum_{s\in\mathcal{S}} d(s) v_\pi(s) = \mathbb{E}_{S\sim d}[v_\pi(S)] \ = \ d^T v_\pi \ = \ \mathbb{E}\big[\sum_{t=0}^{\infty}\gamma^t R_{t+1}\big]$
$\bar{r}_\pi$: $\ \sum_{s\in\mathcal{S}} d_\pi(s) r_\pi(s) = \mathbb{E}_{S\sim d_\pi}[r_\pi(S)] \ = \ d_\pi^T r_\pi \ = \ \lim_{n\to\infty}\frac{1}{n}\mathbb{E}\big[\sum_{t=0}^{n-1} R_{t+1}\big]$
Up to now, we have introduced two types of metrics: $\bar{v}_\pi$ and $\bar{r}_\pi$. Each metric has several different but equivalent expressions. They are summarized in Table 9.2. We sometimes use $\bar{v}_\pi$ to specifically refer to the case where the state distribution is the stationary distribution $d_\pi$, and use $\bar{v}_\pi^0$ to refer to the case where the distribution $d_0$ is independent of $\pi$. Some remarks about the metrics are given below.
⋄ All these metrics are functions of $\pi$. Since $\pi$ is parameterized by $\theta$, these metrics are functions of $\theta$. In other words, different values of $\theta$ can generate different metric values. Therefore, we can search for the optimal values of $\theta$ to maximize these metrics. This is the basic idea of policy gradient methods.
⋄ The two metrics $\bar{v}_\pi$ and $\bar{r}_\pi$ are equivalent in the discounted case where $\gamma<1$. In particular, it can be shown that
$$\bar{r}_\pi = (1-\gamma)\bar{v}_\pi.$$
The above equation indicates that these two metrics can be simultaneously maximized. The proof of this equation is given later in Lemma 9.1.
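For intuition, this relation can be checked with the illustrative numbers used in this section's sketches: weight the state values by the stationary distribution (so that $\bar{v}_\pi = d_\pi^T v_\pi$), compute $\bar{r}_\pi = d_\pi^T r_\pi$, and compare $(1-\gamma)\bar{v}_\pi$ with $\bar{r}_\pi$.

```python
import numpy as np

# Illustrative two-state MDP under a fixed policy (same numbers as the earlier sketches).
gamma = 0.9
P_pi = np.array([[0.72, 0.28],
                 [0.44, 0.56]])
r_pi = np.array([0.7, 1.4])

# Stationary distribution by power iteration.
d_pi = np.full(2, 0.5)
for _ in range(1000):
    d_pi = d_pi @ P_pi

# bar{v}_pi weighted by d_pi, and bar{r}_pi.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
v_bar = d_pi @ v_pi
r_bar = d_pi @ r_pi

print(r_bar, (1 - gamma) * v_bar)   # the two values coincide: bar{r}_pi = (1 - gamma) * bar{v}_pi
```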