
9.3 Gradients of the metrics

Given the metrics introduced in the last section, we can use gradient-based methods to maximize them. To do that, we need to first calculate the gradients of these metrics. The most important theoretical result in this chapter is the following theorem.

Theorem 9.1 (Policy gradient theorem). The gradient of $J(\theta)$ is

\nabla_{\theta} J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a), \tag{9.8}

where $\eta$ is a state distribution and $\nabla_{\theta}\pi$ is the gradient of $\pi$ with respect to $\theta$. Moreover, (9.8) has a compact form expressed in terms of expectation:

\nabla_{\theta} J(\theta) = \mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)} \Big[ \nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A) \Big], \tag{9.9}

where $\ln$ is the natural logarithm.

Some important remarks about Theorem 9.1 are given below.

\diamond It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases. The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of $J(\theta)$ and $\eta$ are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, $J(\theta)$ could be $\bar{v}_{\pi}^{0}$, $\bar{v}_{\pi}$, or $\bar{r}_{\pi}$. Depending on the scenario, (9.8) may hold with strict equality or only as an approximation. The distribution $\eta$ also varies across scenarios.

\diamond The derivation of the gradients is the most complicated part of the policy gradient method. For many readers, it is sufficient to be familiar with the result in Theorem 9.1 without knowing the proof. The derivation details presented in the rest of this section are mathematically intensive; readers are encouraged to study them selectively based on their interests.

\diamond The expression in (9.9) is more favorable than (9.8) because it is expressed as an expectation. We will show in Section 9.4 that this true gradient can be approximated by a stochastic gradient.

Why can (9.8) be expressed as (9.9)? The proof is given below. By the definition of expectation, (9.8) can be rewritten as

\begin{array}{l} \nabla_{\theta} J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) \\ = \mathbb{E}_{S \sim \eta} \left[ \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|S, \theta) q_{\pi}(S, a) \right]. \tag{9.10} \\ \end{array}

Furthermore, the gradient of $\ln \pi(a|s, \theta)$ is

\nabla_{\theta} \ln \pi(a|s, \theta) = \frac{\nabla_{\theta} \pi(a|s, \theta)}{\pi(a|s, \theta)}.

It follows that

\nabla_{\theta} \pi(a|s, \theta) = \pi(a|s, \theta) \nabla_{\theta} \ln \pi(a|s, \theta). \tag{9.11}

Substituting (9.11) into (9.10) gives

\begin{array}{l} \nabla_{\theta} J(\theta) = \mathbb{E}_{S \sim \eta} \left[ \sum_{a \in \mathcal{A}} \pi(a|S, \theta) \nabla_{\theta} \ln \pi(a|S, \theta) q_{\pi}(S, a) \right] \\ = \mathbb{E}_{S \sim \eta, A \sim \pi(S, \theta)} \left[ \nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A) \right]. \\ \end{array}
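
To make the equivalence between (9.8) and (9.9) concrete, the following minimal sketch (ours, not from the text; it assumes NumPy and a tabular softmax policy, and all variable names are illustrative) compares the exact double sum with a Monte Carlo estimate obtained by sampling $S \sim \eta$ and $A \sim \pi(\cdot|S, \theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 4, 3                       # |S| and |A| for a toy problem
theta = rng.normal(size=(n_s, n_a))   # tabular preferences h(s, a, theta)
q = rng.normal(size=(n_s, n_a))       # fixed action values q_pi(s, a)
eta = rng.dirichlet(np.ones(n_s))     # some state distribution eta

def pi_s(s):                          # softmax policy (9.12) at state s
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def grad_ln_pi(s, a):
    # For the tabular softmax: d ln pi(a|s) / d theta[s, b] = 1[a=b] - pi(b|s).
    g = np.zeros_like(theta)
    g[s] = -pi_s(s)
    g[s, a] += 1.0
    return g

# Exact gradient (9.8): sum_s eta(s) sum_a grad pi(a|s) q(s, a),
# using grad pi = pi * grad ln pi from (9.11).
exact = sum(eta[s] * pi_s(s)[a] * q[s, a] * grad_ln_pi(s, a)
            for s in range(n_s) for a in range(n_a))

# Monte Carlo estimate of (9.9): sample S ~ eta, A ~ pi(.|S, theta).
N, mc = 100_000, np.zeros_like(theta)
for _ in range(N):
    s = rng.choice(n_s, p=eta)
    a = rng.choice(n_a, p=pi_s(s))
    mc += q[s, a] * grad_ln_pi(s, a)
mc /= N

print(np.abs(exact - mc).max())       # should shrink as N grows, O(1/sqrt(N))
```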

It is notable that $\pi(a|s, \theta)$ must be positive for all $(s, a)$ to ensure that $\ln \pi(a|s, \theta)$ is well defined. This can be achieved by using softmax functions:

\pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{a' \in \mathcal{A}} e^{h(s, a', \theta)}}, \quad a \in \mathcal{A}, \tag{9.12}

where $h(s, a, \theta)$ is a function indicating the preference for selecting $a$ at $s$. The policy in (9.12) satisfies $\pi(a|s, \theta) \in (0, 1)$ and $\sum_{a \in \mathcal{A}} \pi(a|s, \theta) = 1$ for any $s \in \mathcal{S}$. This policy can be realized by a neural network whose input is $s$ and whose output layer is a softmax layer, so that the network outputs $\pi(a|s, \theta)$ for all $a$ and the outputs sum to 1. See Figure 9.2(b) for an illustration.

Since $\pi(a|s, \theta) > 0$ for all $a$, the policy is stochastic and hence exploratory. The policy does not directly tell which action to take; instead, actions should be sampled according to the probability distribution of the policy.
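
Below is a minimal sketch of the softmax parameterization (9.12) for a single state (ours, assuming NumPy; the function name and preference values are illustrative):

```python
import numpy as np

def softmax_policy(h):
    """Policy (9.12): pi(a|s) = exp(h(s,a)) / sum_a' exp(h(s,a'))."""
    z = h - h.max()                  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

h = np.array([2.0, 0.5, -1.0])       # preferences h(s, a, theta) at a fixed s
p = softmax_policy(h)
assert np.all(p > 0) and np.isclose(p.sum(), 1.0)   # pi in (0,1), sums to 1

# The policy does not directly tell which action to take;
# the action is sampled from the distribution.
a = np.random.default_rng(0).choice(len(p), p=p)
print(p, a)
```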

9.3.1 Derivation of the gradients in the discounted case

We next derive the gradients of the metrics in the discounted case where $\gamma \in (0,1)$. The state value and action value in the discounted case are defined as

v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s],
q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a].

It holds that $v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s, \theta) q_{\pi}(s, a)$ and the state value satisfies the Bellman equation.

First, we show that $\bar{v}_{\pi}(\theta)$ and $\bar{r}_{\pi}(\theta)$ are equivalent metrics.

Lemma 9.1 (Equivalence between $\bar{v}_{\pi}(\theta)$ and $\bar{r}_{\pi}(\theta)$). In the discounted case where $\gamma \in (0,1)$, it holds that

\bar{r}_{\pi} = (1 - \gamma) \bar{v}_{\pi}. \tag{9.13}

Proof. Note that $\bar{v}_{\pi}(\theta) = d_{\pi}^{T} v_{\pi}$ and $\bar{r}_{\pi}(\theta) = d_{\pi}^{T} r_{\pi}$, where $v_{\pi}$ and $r_{\pi}$ satisfy the Bellman equation $v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi}$. Multiplying both sides of the Bellman equation on the left by $d_{\pi}^{T}$ and using the stationarity property $d_{\pi}^{T} P_{\pi} = d_{\pi}^{T}$ yields

\bar{v}_{\pi} = \bar{r}_{\pi} + \gamma d_{\pi}^{T} P_{\pi} v_{\pi} = \bar{r}_{\pi} + \gamma d_{\pi}^{T} v_{\pi} = \bar{r}_{\pi} + \gamma \bar{v}_{\pi},

which implies (9.13).
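
Lemma 9.1 is easy to check numerically. The sketch below (ours, assuming NumPy) builds a random $P_{\pi}$ and $r_{\pi}$, obtains $d_{\pi}$ by power iteration, solves the Bellman equation for $v_{\pi}$, and confirms (9.13):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 5, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # a random P_pi
r = rng.normal(size=n)                                      # a random r_pi

d = np.ones(n) / n                       # stationary distribution: d^T P = d^T
for _ in range(2000):                    # power iteration
    d = d @ P

v = np.linalg.solve(np.eye(n) - gamma * P, r)   # Bellman: v = r + gamma P v
print(d @ r, (1 - gamma) * (d @ v))             # equal, confirming (9.13)
```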

Second, the following lemma gives the gradient of $v_{\pi}(s)$ for any $s$.

Lemma 9.2 (Gradient of $v_{\pi}(s)$). In the discounted case, it holds for any $s \in \mathcal{S}$ that

\nabla_{\theta} v_{\pi}(s) = \sum_{s' \in \mathcal{S}} \Pr_{\pi}(s'|s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s', \theta) q_{\pi}(s', a), \tag{9.14}

where

\Pr_{\pi}(s'|s) \doteq \sum_{k=0}^{\infty} \gamma^{k} [P_{\pi}^{k}]_{ss'} = \left[ (I_{n} - \gamma P_{\pi})^{-1} \right]_{ss'}

is the discounted total probability of transitioning from $s$ to $s'$ under policy $\pi$. Here, $[\cdot]_{ss'}$ denotes the entry in the $s$th row and $s'$th column, and $[P_{\pi}^{k}]_{ss'}$ is the probability of transitioning from $s$ to $s'$ using exactly $k$ steps under $\pi$.

Box 9.2: Proof of Lemma 9.2

First, for any $s \in \mathcal{S}$, it holds that

\begin{array}{l} \nabla_{\theta} v_{\pi}(s) = \nabla_{\theta} \left[ \sum_{a \in \mathcal{A}} \pi(a|s, \theta) q_{\pi}(s, a) \right] \\ = \sum_{a \in \mathcal{A}} \left[ \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) + \pi(a|s, \theta) \nabla_{\theta} q_{\pi}(s, a) \right], \tag{9.15} \\ \end{array}

where $q_{\pi}(s, a)$ is the action value given by

q_{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) v_{\pi}(s').

Since $r(s, a) = \sum_{r} r p(r|s, a)$ is independent of $\theta$, we have

\nabla_{\theta} q_{\pi}(s, a) = 0 + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s').

Substituting this result into (9.15) yields

\begin{array}{l} \nabla_{\theta} v_{\pi}(s) = \sum_{a \in \mathcal{A}} \left[ \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) + \pi(a|s, \theta) \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s') \right] \\ = \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) + \gamma \sum_{a \in \mathcal{A}} \pi(a|s, \theta) \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s'). \tag{9.16} \\ \end{array}

It is notable that $\nabla_{\theta} v_{\pi}$ appears on both sides of the above equation. One way to calculate it is to use the unrolling technique [64]. Here, we use another way based on the matrix-vector form, which we believe is more straightforward to understand. In particular, let

u(s) \doteq \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a).

Since

\sum_{a \in \mathcal{A}} \pi(a|s, \theta) \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s') = \sum_{s' \in \mathcal{S}} p(s'|s) \nabla_{\theta} v_{\pi}(s') = \sum_{s' \in \mathcal{S}} [P_{\pi}]_{ss'} \nabla_{\theta} v_{\pi}(s'),

equation (9.16) can be written in matrix-vector form as

\underbrace{\left[ \begin{array}{c} \vdots \\ \nabla_{\theta} v_{\pi}(s) \\ \vdots \end{array} \right]}_{\nabla_{\theta} v_{\pi} \in \mathbb{R}^{mn}} = \underbrace{\left[ \begin{array}{c} \vdots \\ u(s) \\ \vdots \end{array} \right]}_{u \in \mathbb{R}^{mn}} + \gamma (P_{\pi} \otimes I_{m}) \underbrace{\left[ \begin{array}{c} \vdots \\ \nabla_{\theta} v_{\pi}(s') \\ \vdots \end{array} \right]}_{\nabla_{\theta} v_{\pi} \in \mathbb{R}^{mn}},

which can be written concisely as

\nabla_{\theta} v_{\pi} = u + \gamma (P_{\pi} \otimes I_{m}) \nabla_{\theta} v_{\pi}.

Here, $n = |\mathcal{S}|$ and $m$ is the dimension of the parameter vector $\theta$. The Kronecker product $\otimes$ emerges in the equation because $\nabla_{\theta} v_{\pi}(s)$ is a vector. The above equation is linear in $\nabla_{\theta} v_{\pi}$ and can be solved as

\begin{array}{l} \nabla_{\theta} v_{\pi} = \left(I_{nm} - \gamma P_{\pi} \otimes I_{m}\right)^{-1} u \\ = \left(I_{n} \otimes I_{m} - \gamma P_{\pi} \otimes I_{m}\right)^{-1} u \\ = \left[ \left(I_{n} - \gamma P_{\pi}\right)^{-1} \otimes I_{m} \right] u. \tag{9.17} \\ \end{array}

For any state $s$, it follows from (9.17) that

\begin{array}{l} \nabla_{\theta} v_{\pi}(s) = \sum_{s' \in \mathcal{S}} \left[ \left(I_{n} - \gamma P_{\pi}\right)^{-1} \right]_{ss'} u(s') \\ = \sum_{s' \in \mathcal{S}} \left[ \left(I_{n} - \gamma P_{\pi}\right)^{-1} \right]_{ss'} \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s', \theta) q_{\pi}(s', a). \tag{9.18} \\ \end{array}

The quantity $[(I_{n} - \gamma P_{\pi})^{-1}]_{ss'}$ has a clear probabilistic interpretation. In particular, since $(I_{n} - \gamma P_{\pi})^{-1} = I + \gamma P_{\pi} + \gamma^{2} P_{\pi}^{2} + \dots$, we have

\left[ \left(I_{n} - \gamma P_{\pi}\right)^{-1} \right]_{ss'} = [I]_{ss'} + \gamma [P_{\pi}]_{ss'} + \gamma^{2} [P_{\pi}^{2}]_{ss'} + \dots = \sum_{k=0}^{\infty} \gamma^{k} [P_{\pi}^{k}]_{ss'}.

Note that $[P_{\pi}^{k}]_{ss'}$ is the probability of transitioning from $s$ to $s'$ using exactly $k$ steps (see Box 8.1). Therefore, $[(I_{n} - \gamma P_{\pi})^{-1}]_{ss'}$ is the discounted total probability of transitioning from $s$ to $s'$ using any number of steps. By denoting $[(I_{n} - \gamma P_{\pi})^{-1}]_{ss'} \doteq \Pr_{\pi}(s'|s)$, equation (9.18) becomes (9.14).
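
Both key identities in this proof, the Kronecker-product solve in (9.17) and the Neumann-series interpretation of $(I_{n} - \gamma P_{\pi})^{-1}$, can be verified numerically. Below is a small sketch (ours, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, gamma = 4, 3, 0.9                # n = |S|, m = dim(theta)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # a random P_pi
u = rng.normal(size=n * m)             # a stacked vector u in R^{nm}

# (9.17): (I_{nm} - gamma P (x) I_m)^{-1} u = [(I_n - gamma P)^{-1} (x) I_m] u.
lhs = np.linalg.solve(np.eye(n * m) - gamma * np.kron(P, np.eye(m)), u)
rhs = np.kron(np.linalg.inv(np.eye(n) - gamma * P), np.eye(m)) @ u
print(np.abs(lhs - rhs).max())         # ~0

# Pr_pi(s'|s): (I_n - gamma P)^{-1} = sum_k gamma^k P^k (truncated here).
series, Pk = np.zeros((n, n)), np.eye(n)
for _ in range(500):
    series += Pk
    Pk = gamma * (Pk @ P)
print(np.abs(np.linalg.inv(np.eye(n) - gamma * P) - series).max())  # ~0
```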

With the results in Lemma 9.2, we are ready to derive the gradient of $\bar{v}_{\pi}^{0}$.

Theorem 9.2 (Gradient of $\bar{v}_{\pi}^{0}$ in the discounted case). In the discounted case where $\gamma \in (0,1)$, the gradient of $\bar{v}_{\pi}^{0} = d_{0}^{T} v_{\pi}$ is

\nabla_{\theta} \bar{v}_{\pi}^{0} = \mathbb{E} \left[ \nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A) \right],

where $S \sim \rho_{\pi}$ and $A \sim \pi(S, \theta)$. Here, the state distribution $\rho_{\pi}$ is

\rho_{\pi}(s) = \sum_{s' \in \mathcal{S}} d_{0}(s') \Pr_{\pi}(s|s'), \quad s \in \mathcal{S}, \tag{9.19}

where $\Pr_{\pi}(s|s') = \sum_{k=0}^{\infty} \gamma^{k} [P_{\pi}^{k}]_{s's} = [(I - \gamma P_{\pi})^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.

Box 9.3: Proof of Theorem 9.2

Since $d_{0}(s)$ is independent of $\pi$, we have

\nabla_{\theta} \bar{v}_{\pi}^{0} = \nabla_{\theta} \sum_{s \in \mathcal{S}} d_{0}(s) v_{\pi}(s) = \sum_{s \in \mathcal{S}} d_{0}(s) \nabla_{\theta} v_{\pi}(s).

Substituting the expression of $\nabla_{\theta} v_{\pi}(s)$ given in Lemma 9.2 into the above equation yields

\begin{array}{l} \nabla_{\theta} \bar{v}_{\pi}^{0} = \sum_{s \in \mathcal{S}} d_{0}(s) \nabla_{\theta} v_{\pi}(s) = \sum_{s \in \mathcal{S}} d_{0}(s) \sum_{s' \in \mathcal{S}} \Pr_{\pi}(s'|s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s', \theta) q_{\pi}(s', a) \\ = \sum_{s' \in \mathcal{S}} \left( \sum_{s \in \mathcal{S}} d_{0}(s) \Pr_{\pi}(s'|s) \right) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s', \theta) q_{\pi}(s', a) \\ \doteq \sum_{s' \in \mathcal{S}} \rho_{\pi}(s') \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s', \theta) q_{\pi}(s', a) \\ = \sum_{s \in \mathcal{S}} \rho_{\pi}(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) \qquad (\text{change } s' \text{ to } s) \\ = \sum_{s \in \mathcal{S}} \rho_{\pi}(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta) \nabla_{\theta} \ln \pi(a|s, \theta) q_{\pi}(s, a) \\ = \mathbb{E} \left[ \nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A) \right], \\ \end{array}

where $S \sim \rho_{\pi}$ and $A \sim \pi(S, \theta)$. The proof is complete.

With Lemma 9.1 and Lemma 9.2, we can derive the gradients of $\bar{r}_{\pi}$ and $\bar{v}_{\pi}$.

Theorem 9.3 (Gradients of $\bar{r}_{\pi}$ and $\bar{v}_{\pi}$ in the discounted case). In the discounted case where $\gamma \in (0,1)$, the gradients of $\bar{r}_{\pi}$ and $\bar{v}_{\pi}$ are

\begin{array}{l} \nabla_{\theta} \bar{r}_{\pi} = (1 - \gamma) \nabla_{\theta} \bar{v}_{\pi} \approx \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) \\ = \mathbb{E} \left[ \nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A) \right], \\ \end{array}

where $S \sim d_{\pi}$ and $A \sim \pi(S, \theta)$. Here, the approximation is more accurate when $\gamma$ is closer to 1.

Box 9.4: Proof of Theorem 9.3

It follows from the definition of $\bar{v}_{\pi}$ that

\begin{array}{l} \nabla_{\theta} \bar{v}_{\pi} = \nabla_{\theta} \sum_{s \in \mathcal{S}} d_{\pi}(s) v_{\pi}(s) \\ = \sum_{s \in \mathcal{S}} \nabla_{\theta} d_{\pi}(s) v_{\pi}(s) + \sum_{s \in \mathcal{S}} d_{\pi}(s) \nabla_{\theta} v_{\pi}(s). \tag{9.20} \\ \end{array}

This equation contains two terms. On the one hand, substituting the expression of $\nabla_{\theta} v_{\pi}$ given in (9.17) into the second term gives

\begin{array}{l} \sum_{s \in \mathcal{S}} d_{\pi}(s) \nabla_{\theta} v_{\pi}(s) = (d_{\pi}^{T} \otimes I_{m}) \nabla_{\theta} v_{\pi} \\ = \left(d_{\pi}^{T} \otimes I_{m}\right) \left[ \left(I_{n} - \gamma P_{\pi}\right)^{-1} \otimes I_{m} \right] u \\ = \left\{ \left[ d_{\pi}^{T} \left(I_{n} - \gamma P_{\pi}\right)^{-1} \right] \otimes I_{m} \right\} u. \tag{9.21} \\ \end{array}

It is noted that

d_{\pi}^{T} (I_{n} - \gamma P_{\pi})^{-1} = \frac{1}{1 - \gamma} d_{\pi}^{T},

which can easily be verified by multiplying both sides by $(I_{n} - \gamma P_{\pi})$ on the right. Therefore, (9.21) becomes

\begin{array}{l} \sum_{s \in \mathcal{S}} d_{\pi}(s) \nabla_{\theta} v_{\pi}(s) = \frac{1}{1 - \gamma} \left(d_{\pi}^{T} \otimes I_{m}\right) u \\ = \frac{1}{1 - \gamma} \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a). \\ \end{array}

On the other hand, the first term of (9.20) involves $\nabla_{\theta} d_{\pi}$. However, since the second term is scaled by $\frac{1}{1-\gamma}$, it dominates as $\gamma \to 1$, and the first term becomes negligible in comparison. Therefore,

\nabla_{\theta} \bar{v}_{\pi} \approx \frac{1}{1 - \gamma} \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a).

Furthermore, it follows from $\bar{r}_{\pi} = (1 - \gamma) \bar{v}_{\pi}$ that

\begin{array}{l} \nabla_{\theta} \bar{r}_{\pi} = (1 - \gamma) \nabla_{\theta} \bar{v}_{\pi} \approx \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) \\ = \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta) \nabla_{\theta} \ln \pi(a|s, \theta) q_{\pi}(s, a) \\ = \mathbb{E} \left[ \nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A) \right]. \\ \end{array}

The approximation in the above equation requires that the first term of (9.20) remains bounded as $\gamma \to 1$. More information can be found in [66, Section 4].
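
The quality of this approximation can be probed numerically by comparing the right-hand side with a finite-difference estimate of $\nabla_{\theta} \bar{v}_{\pi}$ as $\gamma \to 1$. The sketch below (ours, assuming NumPy and a tabular softmax policy; all names are illustrative) should show the relative error shrinking as $\gamma$ approaches 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a = 4, 3
P_sa = rng.random((n_s, n_a, n_s)); P_sa /= P_sa.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_s, n_a))                 # r(s, a)

def v_bar(theta, gamma):
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)         # tabular softmax policy
    P = np.einsum('sa,sat->st', pi, P_sa)       # P_pi
    r = np.einsum('sa,sa->s', pi, R)            # r_pi
    d = np.ones(n_s) / n_s
    for _ in range(2000):                       # power iteration for d_pi
        d = d @ P
    v = np.linalg.solve(np.eye(n_s) - gamma * P, r)
    q = R + gamma * P_sa @ v                    # q_pi(s, a)
    return d @ v, pi, d, q

theta = rng.normal(size=(n_s, n_a))
for gamma in (0.9, 0.99, 0.999):
    vb, pi, d, q = v_bar(theta, gamma)
    # Right-hand side of the approximation above, for the softmax policy:
    # d v_bar / d theta[s,b] ~ d(s) pi(b|s) (q(s,b) - v(s)) / (1 - gamma).
    approx = d[:, None] * pi * (q - (pi * q).sum(1, keepdims=True)) / (1 - gamma)
    eps, fd = 1e-6, np.zeros_like(theta)        # finite-difference gradient
    for i in range(n_s):
        for j in range(n_a):
            t = theta.copy(); t[i, j] += eps
            fd[i, j] = (v_bar(t, gamma)[0] - vb) / eps
    print(gamma, np.abs(fd - approx).max() / np.abs(fd).max())
```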

9.3.2 Derivation of the gradients in the undiscounted case

We next show how to calculate the gradients of the metrics in the undiscounted case where $\gamma = 1$. Readers may wonder why we suddenly consider the undiscounted case when only the discounted case has been considered so far in this book. In fact, the definition of the average reward $\bar{r}_{\pi}$ is valid for both the discounted and undiscounted cases. While the gradient of $\bar{r}_{\pi}$ in the discounted case is an approximation, we will see that its gradient in the undiscounted case is more elegant.

State values and the Poisson equation

In the undiscounted case, it is necessary to redefine the state and action values. Since the undiscounted sum of the rewards, $\mathbb{E}[R_{t+1} + R_{t+2} + R_{t+3} + \dots \mid S_t = s]$, may diverge, the state and action values are defined in a special way [64]:

v_{\pi}(s) \doteq \mathbb{E}\left[ (R_{t+1} - \bar{r}_{\pi}) + (R_{t+2} - \bar{r}_{\pi}) + (R_{t+3} - \bar{r}_{\pi}) + \dots \mid S_t = s \right],
q_{\pi}(s, a) \doteq \mathbb{E}\left[ (R_{t+1} - \bar{r}_{\pi}) + (R_{t+2} - \bar{r}_{\pi}) + (R_{t+3} - \bar{r}_{\pi}) + \dots \mid S_t = s, A_t = a \right],

where $\bar{r}_{\pi}$ is the average reward, which is determined once $\pi$ is given. The value $v_{\pi}(s)$ goes by different names in the literature, such as the differential reward [65] or the bias [2, Section 8.2.1]. It can be verified that the state value defined above satisfies the following Bellman-like equation:

v_{\pi}(s) = \sum_{a} \pi(a|s, \theta) \left[ \sum_{r} p(r|s, a)(r - \bar{r}_{\pi}) + \sum_{s'} p(s'|s, a) v_{\pi}(s') \right]. \tag{9.22}

Since $v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s, \theta) q_{\pi}(s, a)$, it holds that $q_{\pi}(s, a) = \sum_{r} p(r|s, a)(r - \bar{r}_{\pi}) + \sum_{s'} p(s'|s, a) v_{\pi}(s')$. The matrix-vector form of (9.22) is

v_{\pi} = r_{\pi} - \bar{r}_{\pi} \mathbf{1}_{n} + P_{\pi} v_{\pi}, \tag{9.23}

where $\mathbf{1}_{n} = [1, \dots, 1]^{T} \in \mathbb{R}^{n}$. Equation (9.23) is similar to the Bellman equation and is known as the Poisson equation [65, 67].

How can we solve $v_{\pi}$ from the Poisson equation? The answer is given in the following theorem.

Theorem 9.4 (Solution of the Poisson equation). Let

v_{\pi}^{*} = \left(I_{n} - P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T}\right)^{-1} r_{\pi}. \tag{9.24}

Then, $v_{\pi}^{*}$ is a solution of the Poisson equation in (9.23). Moreover, any solution of the Poisson equation has the following form:

v_{\pi} = v_{\pi}^{*} + c \mathbf{1}_{n},

where $c \in \mathbb{R}$.

This theorem indicates that the solution of the Poisson equation may not be unique.
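
The sketch below (ours, assuming NumPy) verifies Theorem 9.4 numerically: $v_{\pi}^{*}$ from (9.24) satisfies the Poisson equation (9.23), and so does $v_{\pi}^{*} + c \mathbf{1}_{n}$ for an arbitrary $c$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # a random P_pi
r = rng.normal(size=n)                                      # a random r_pi

d = np.ones(n) / n                        # stationary distribution d_pi
for _ in range(2000):                     # power iteration
    d = d @ P
r_bar, one = d @ r, np.ones(n)

v_star = np.linalg.solve(np.eye(n) - P + np.outer(one, d), r)   # (9.24)

# Both v_star and v_star + c * 1_n satisfy (9.23): v = r - r_bar 1 + P v.
for v in (v_star, v_star + 7.3 * one):    # c = 7.3 is arbitrary
    print(np.abs(v - (r - r_bar * one + P @ v)).max())          # ~0 for both
```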

Box 9.5: Proof of Theorem 9.4

The proof consists of three steps.

\diamond Step 1: Show that $v_{\pi}^{*}$ in (9.24) is a solution of the Poisson equation.

For the sake of simplicity, let

A \doteq I_{n} - P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T}.

Then, $v_{\pi}^{*} = A^{-1} r_{\pi}$. The fact that $A$ is invertible will be proven in Step 3. Substituting $v_{\pi}^{*} = A^{-1} r_{\pi}$ into (9.23) gives

A^{-1} r_{\pi} = r_{\pi} - \mathbf{1}_{n} d_{\pi}^{T} r_{\pi} + P_{\pi} A^{-1} r_{\pi}.

This equation is valid, as shown below. Rearranging it gives $(-A^{-1} + I_{n} - \mathbf{1}_{n} d_{\pi}^{T} + P_{\pi} A^{-1}) r_{\pi} = 0$, and consequently,

\left(-I_{n} + A - \mathbf{1}_{n} d_{\pi}^{T} A + P_{\pi}\right) A^{-1} r_{\pi} = 0.

The term in the brackets in the above equation is zero because $-I_{n} + A - \mathbf{1}_{n} d_{\pi}^{T} A + P_{\pi} = -I_{n} + (I_{n} - P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T}) - \mathbf{1}_{n} d_{\pi}^{T} (I_{n} - P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T}) + P_{\pi} = 0$. Therefore, $v_{\pi}^{*}$ in (9.24) is a solution.

\diamond Step 2: General expression of the solutions.

Substituting $\bar{r}_{\pi} = d_{\pi}^{T} r_{\pi}$ into (9.23) gives

v_{\pi} = r_{\pi} - \mathbf{1}_{n} d_{\pi}^{T} r_{\pi} + P_{\pi} v_{\pi} \tag{9.25}

and consequently

\left(I_{n} - P_{\pi}\right) v_{\pi} = \left(I_{n} - \mathbf{1}_{n} d_{\pi}^{T}\right) r_{\pi}. \tag{9.26}

It is noted that $I_{n} - P_{\pi}$ is singular because $(I_{n} - P_{\pi}) \mathbf{1}_{n} = 0$ for any $\pi$. Therefore, the solution of (9.26) is not unique: if $v_{\pi}^{*}$ is a solution, then $v_{\pi}^{*} + x$ is also a solution for any $x \in \mathrm{Null}(I_{n} - P_{\pi})$. When $P_{\pi}$ is irreducible, $\mathrm{Null}(I_{n} - P_{\pi}) = \mathrm{span}\{\mathbf{1}_{n}\}$. Then, any solution of the Poisson equation has the expression $v_{\pi}^{*} + c \mathbf{1}_{n}$, where $c \in \mathbb{R}$.

\diamond Step 3: Show that $A = I_{n} - P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T}$ is invertible.

Since $v_{\pi}^{*}$ involves $A^{-1}$, it is necessary to show that $A$ is invertible. The analysis is summarized in the following lemma.

Lemma 9.3. The matrix $I_{n} - P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T}$ is invertible and its inverse is

\left[ I_{n} - \left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right) \right]^{-1} = \sum_{k=1}^{\infty} \left(P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T}\right) + I_{n}.

Proof. First of all, we state some preliminary facts without proof. Let $\rho(M)$ be the spectral radius of a matrix $M$. Then, $I - M$ is invertible if $\rho(M) < 1$. Moreover, $\rho(M) < 1$ if and only if $\lim_{k \to \infty} M^{k} = 0$.

Based on the above facts, we next show that $\lim_{k \to \infty} (P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T})^{k} = 0$, from which the invertibility of $I_{n} - (P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T})$ immediately follows. To do that, we notice that

\left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right)^{k} = P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T}, \quad k \geq 1, \tag{9.27}

which can be proven by induction. When $k = 1$, the equation is trivially valid. When $k = 2$, we have

\begin{array}{l} \left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right)^{2} = \left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right) \left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right) \\ = P_{\pi}^{2} - P_{\pi} \mathbf{1}_{n} d_{\pi}^{T} - \mathbf{1}_{n} d_{\pi}^{T} P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T} \mathbf{1}_{n} d_{\pi}^{T} \\ = P_{\pi}^{2} - \mathbf{1}_{n} d_{\pi}^{T}, \\ \end{array}

where the last equality is due to $P_{\pi} \mathbf{1}_{n} = \mathbf{1}_{n}$, $d_{\pi}^{T} P_{\pi} = d_{\pi}^{T}$, and $d_{\pi}^{T} \mathbf{1}_{n} = 1$. The case of $k \geq 3$ can be proven similarly.

Since $d_{\pi}$ is the stationary distribution of the state, it holds that $\lim_{k \to \infty} P_{\pi}^{k} = \mathbf{1}_{n} d_{\pi}^{T}$ (see Box 8.1). Therefore, (9.27) implies that

\lim_{k \to \infty} (P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T})^{k} = \lim_{k \to \infty} P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T} = 0.

As a result, $\rho(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}) < 1$ and hence $I_{n} - (P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T})$ is invertible. Furthermore, the inverse of this matrix is given by

\begin{array}{l} \left(I_{n} - \left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right)\right)^{-1} = \sum_{k=0}^{\infty} \left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right)^{k} \\ = I_{n} + \sum_{k=1}^{\infty} \left(P_{\pi} - \mathbf{1}_{n} d_{\pi}^{T}\right)^{k} \\ = I_{n} + \sum_{k=1}^{\infty} \left(P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T}\right) \\ = \sum_{k=0}^{\infty} \left(P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T}\right) + \mathbf{1}_{n} d_{\pi}^{T}. \\ \end{array}

The proof is complete.

The proof of Lemma 9.3 is inspired by [66]. However, the result $(I_{n} - P_{\pi} + \mathbf{1}_{n} d_{\pi}^{T})^{-1} = \sum_{k=0}^{\infty} (P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T})$ given in [66] (the statement above equation (16) therein) is inaccurate, because $\sum_{k=0}^{\infty} (P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T})$ is singular: $\sum_{k=0}^{\infty} (P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T}) \mathbf{1}_{n} = 0$. Lemma 9.3 corrects this inaccuracy.
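
A quick numerical check of Lemma 9.3 (ours, assuming NumPy): the truncated series $I_{n} + \sum_{k \geq 1} (P_{\pi}^{k} - \mathbf{1}_{n} d_{\pi}^{T})$ matches the matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # a random P_pi
d = np.ones(n) / n
for _ in range(2000):                    # stationary distribution d_pi
    d = d @ P
one_d = np.outer(np.ones(n), d)          # 1_n d_pi^T

A = np.eye(n) - P + one_d
series, Pk = np.eye(n), np.eye(n)        # I_n + sum_{k>=1} (P^k - 1 d^T)
for _ in range(1000):
    Pk = Pk @ P
    series += Pk - one_d
print(np.abs(np.linalg.inv(A) - series).max())   # ~0, matching Lemma 9.3
```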

Derivation of gradients

Although the value of $v_{\pi}$ is not unique in the undiscounted case, as shown in Theorem 9.4, the value of $\bar{r}_{\pi}$ is unique. In particular, it follows from the Poisson equation that

\begin{array}{l} \bar{r}_{\pi} \mathbf{1}_{n} = r_{\pi} + (P_{\pi} - I_{n}) v_{\pi} \\ = r_{\pi} + \left(P_{\pi} - I_{n}\right) \left(v_{\pi}^{*} + c \mathbf{1}_{n}\right) \\ = r_{\pi} + \left(P_{\pi} - I_{n}\right) v_{\pi}^{*}. \\ \end{array}

Notably, the undetermined constant $c$ cancels, and hence $\bar{r}_{\pi}$ is unique. Therefore, we can calculate the gradient of $\bar{r}_{\pi}$ in the undiscounted case. In addition, since $v_{\pi}$ is not unique, $\bar{v}_{\pi}$ is not unique either; we do not study the gradient of $\bar{v}_{\pi}$ in the undiscounted case. For interested readers, it is worth mentioning that more constraints can be added to uniquely solve $v_{\pi}$ from the Poisson equation. For example, by assuming that a recurrent state exists, the state value of this recurrent state can be determined [65, Section II], and hence $c$ can be determined. There are also other ways to uniquely determine $v_{\pi}$; see, for example, equations (8.6.5)-(8.6.7) in [2].

The gradient of $\bar{r}_{\pi}$ in the undiscounted case is given below.

Theorem 9.5 (Gradient of $\bar{r}_{\pi}$ in the undiscounted case). In the undiscounted case, the gradient of the average reward $\bar{r}_{\pi}$ is

\begin{array}{l} \nabla_{\theta} \bar{r}_{\pi} = \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) \\ = \mathbb{E} \left[ \nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A) \right], \tag{9.28} \\ \end{array}

where $S \sim d_{\pi}$ and $A \sim \pi(S, \theta)$.

Compared to the discounted case shown in Theorem 9.3, the gradient of $\bar{r}_{\pi}$ in the undiscounted case is more elegant in the sense that (9.28) is strictly valid and $S$ obeys the stationary distribution.

Box 9.6: Proof of Theorem 9.5

First of all, it follows from $v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s, \theta) q_{\pi}(s, a)$ that

\begin{array}{l} \nabla_{\theta} v_{\pi}(s) = \nabla_{\theta} \left[ \sum_{a \in \mathcal{A}} \pi(a|s, \theta) q_{\pi}(s, a) \right] \\ = \sum_{a \in \mathcal{A}} \left[ \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) + \pi(a|s, \theta) \nabla_{\theta} q_{\pi}(s, a) \right], \tag{9.29} \\ \end{array}

where $q_{\pi}(s, a)$ is the action value satisfying

\begin{array}{l} q_{\pi}(s, a) = \sum_{r} p(r|s, a)(r - \bar{r}_{\pi}) + \sum_{s'} p(s'|s, a) v_{\pi}(s') \\ = r(s, a) - \bar{r}_{\pi} + \sum_{s'} p(s'|s, a) v_{\pi}(s'). \\ \end{array}

Since $r(s, a) = \sum_{r} r p(r|s, a)$ is independent of $\theta$, we have

\nabla_{\theta} q_{\pi}(s, a) = 0 - \nabla_{\theta} \bar{r}_{\pi} + \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s').

Substituting this result into (9.29) yields

\begin{array}{l} \nabla_{\theta} v_{\pi}(s) = \sum_{a \in \mathcal{A}} \left[ \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) + \pi(a|s, \theta) \left(-\nabla_{\theta} \bar{r}_{\pi} + \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s')\right) \right] \\ = \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a) - \nabla_{\theta} \bar{r}_{\pi} + \sum_{a \in \mathcal{A}} \pi(a|s, \theta) \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s'). \tag{9.30} \\ \end{array}

Let

u(s) \doteq \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a).

Since $\sum_{a \in \mathcal{A}} \pi(a|s, \theta) \sum_{s' \in \mathcal{S}} p(s'|s, a) \nabla_{\theta} v_{\pi}(s') = \sum_{s' \in \mathcal{S}} p(s'|s) \nabla_{\theta} v_{\pi}(s')$, equation (9.30) can be written in matrix-vector form as

\underbrace{\left[ \begin{array}{c} \vdots \\ \nabla_{\theta} v_{\pi}(s) \\ \vdots \end{array} \right]}_{\nabla_{\theta} v_{\pi} \in \mathbb{R}^{mn}} = \underbrace{\left[ \begin{array}{c} \vdots \\ u(s) \\ \vdots \end{array} \right]}_{u \in \mathbb{R}^{mn}} - \mathbf{1}_{n} \otimes \nabla_{\theta} \bar{r}_{\pi} + (P_{\pi} \otimes I_{m}) \underbrace{\left[ \begin{array}{c} \vdots \\ \nabla_{\theta} v_{\pi}(s') \\ \vdots \end{array} \right]}_{\nabla_{\theta} v_{\pi} \in \mathbb{R}^{mn}},

where $n = |\mathcal{S}|$, $m$ is the dimension of $\theta$, and $\otimes$ is the Kronecker product. The above equation can be written concisely as

\nabla_{\theta} v_{\pi} = u - \mathbf{1}_{n} \otimes \nabla_{\theta} \bar{r}_{\pi} + (P_{\pi} \otimes I_{m}) \nabla_{\theta} v_{\pi},

and hence

\mathbf{1}_{n} \otimes \nabla_{\theta} \bar{r}_{\pi} = u + (P_{\pi} \otimes I_{m}) \nabla_{\theta} v_{\pi} - \nabla_{\theta} v_{\pi}.

Multiplying both sides of the above equation on the left by $d_{\pi}^{T} \otimes I_{m}$ gives

\begin{array}{l} \left(d_{\pi}^{T} \mathbf{1}_{n}\right) \otimes \nabla_{\theta} \bar{r}_{\pi} = \left(d_{\pi}^{T} \otimes I_{m}\right) u + \left[\left(d_{\pi}^{T} P_{\pi}\right) \otimes I_{m}\right] \nabla_{\theta} v_{\pi} - \left(d_{\pi}^{T} \otimes I_{m}\right) \nabla_{\theta} v_{\pi} \\ = \left(d_{\pi}^{T} \otimes I_{m}\right) u, \\ \end{array}

where the second equality uses $d_{\pi}^{T} P_{\pi} = d_{\pi}^{T}$. Since $d_{\pi}^{T} \mathbf{1}_{n} = 1$, this implies

\begin{array}{l} \nabla_{\theta} \bar{r}_{\pi} = \left(d_{\pi}^{T} \otimes I_{m}\right) u \\ = \sum_{s \in \mathcal{S}} d_{\pi}(s) u(s) \\ = \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a). \\ \end{array}
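
To close this section, the sketch below (ours, not from the text; it assumes NumPy and a tabular softmax policy) verifies Theorem 9.5 on a random MDP: the analytic gradient (9.28), computed with $q_{\pi}$ obtained from the Poisson equation, matches a finite-difference estimate of $\nabla_{\theta} \bar{r}_{\pi}$ up to numerical accuracy, with no approximation, unlike the discounted case in Theorem 9.3.

```python
import numpy as np

rng = np.random.default_rng(6)
n_s, n_a = 4, 3
P_sa = rng.random((n_s, n_a, n_s)); P_sa /= P_sa.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_s, n_a))                 # r(s, a)

def r_bar(theta):
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)         # tabular softmax policy (9.12)
    P = np.einsum('sa,sat->st', pi, P_sa)       # P_pi
    r = np.einsum('sa,sa->s', pi, R)            # r_pi
    d = np.ones(n_s) / n_s
    for _ in range(2000):                       # power iteration for d_pi
        d = d @ P
    return d @ r, pi, P, r, d

theta = rng.normal(size=(n_s, n_a))
rb, pi, P, r, d = r_bar(theta)

# q_pi via the Poisson equation: v* from (9.24), then q from (9.22).
# Any solution v* + c 1_n gives the same gradient, since sum_a grad pi = 0.
v = np.linalg.solve(np.eye(n_s) - P + np.outer(np.ones(n_s), d), r)
q = R - rb + P_sa @ v

# Analytic gradient (9.28) for the softmax policy:
# d r_bar / d theta[s,b] = d(s) pi(b|s) (q(s,b) - sum_a pi(a|s) q(s,a)).
grad = d[:, None] * pi * (q - (pi * q).sum(axis=1, keepdims=True))

eps, fd = 1e-6, np.zeros_like(theta)            # finite-difference check
for i in range(n_s):
    for j in range(n_a):
        t = theta.copy(); t[i, j] += eps
        fd[i, j] = (r_bar(t)[0] - rb) / eps
print(np.abs(grad - fd).max())                  # small: (9.28) holds exactly
```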