Given the metrics introduced in the last section, we can use gradient-based methods to maximize them. To do that, we need to first calculate the gradients of these metrics. The most important theoretical result in this chapter is the following theorem.
Theorem 9.1 (Policy gradient theorem). The gradient of J(θ) is
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a), \tag{9.8}$$
where $\eta$ is a state distribution and $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$. Moreover, (9.8) has a compact form expressed in terms of expectation:
$$\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big], \tag{9.9}$$
where $S \sim \eta$, $A \sim \pi(\cdot|S,\theta)$, and $\ln$ denotes the natural logarithm.
Some important remarks about Theorem 9.1 are given below.
⋄ It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases. The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of $J(\theta)$ and $\eta$ are not given in Theorem 9.1; they can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, $J(\theta)$ could be $\bar{v}_\pi^0$, $\bar{v}_\pi$, or $\bar{r}_\pi$. The equality in (9.8) may be either a strict equality or an approximation, and the distribution $\eta$ also varies in different scenarios.
⋄ The derivation of the gradients is the most complicated part of the policy gradient method. For many readers, it is sufficient to be familiar with the result in Theorem 9.1 without knowing the proof. The derivation details presented in the rest of this section are mathematically intensive, so readers are encouraged to study them selectively based on their interests.
⋄ The expression in (9.9) is more favorable than (9.8) because it is expressed as an expectation. We will show in Section 9.4 that this true gradient can be approximated by a stochastic gradient.
Why can (9.8) be expressed as (9.9)? The proof is given below. By the definition of expectation, (9.8) can be rewritten as
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}_{S\sim\eta}\left[\sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|S,\theta)\, q_\pi(S,a)\right].$$
Furthermore, since $\nabla_\theta \ln \pi(a|s,\theta) = \nabla_\theta \pi(a|s,\theta)/\pi(a|s,\theta)$, we have $\nabla_\theta \pi(a|s,\theta) = \pi(a|s,\theta)\, \nabla_\theta \ln \pi(a|s,\theta)$, and hence
$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta}\left[\sum_{a\in\mathcal{A}} \pi(a|S,\theta)\, \nabla_\theta \ln \pi(a|S,\theta)\, q_\pi(S,a)\right] = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],$$
where $A\sim\pi(\cdot|S,\theta)$. This is exactly (9.9).
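As an aside, the equality between (9.8) and (9.9) can be checked numerically. The following minimal sketch is not from the text: the state distribution `eta`, the placeholder values `q`, and the tabular softmax parameterization `theta` are all illustrative assumptions. It evaluates both forms exactly on a toy state-action space.

```python
# Numerical sanity check: the summation form (9.8) and the
# expectation form (9.9) coincide (illustrative toy setup).
import numpy as np

rng = np.random.default_rng(0)
n, A = 4, 3                      # |S| = 4 states, |A| = 3 actions (assumed)
theta = rng.normal(size=(n, A))  # tabular preferences h(s, a, theta)
eta = rng.dirichlet(np.ones(n))  # some state distribution eta
q = rng.normal(size=(n, A))      # placeholder values standing in for q_pi(s, a)

def pi(theta):
    """Softmax policy: pi(a|s) proportional to exp(theta[s, a])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

p = pi(theta)

def grad_pi(p, s, a):
    """d pi(a|s) / d theta; nonzero only in row s for the tabular softmax:
    d pi(a|s) / d theta[s, b] = pi(a|s) * (1{a=b} - pi(b|s))."""
    g = np.zeros_like(p)
    g[s, :] = p[s, a] * ((np.arange(A) == a) - p[s, :])
    return g

# (9.8): sum_s eta(s) sum_a grad pi(a|s) q(s, a)
g_sum = sum(eta[s] * q[s, a] * grad_pi(p, s, a)
            for s in range(n) for a in range(A))

# (9.9): E[grad ln pi(A|S) q(S, A)], S ~ eta, A ~ pi(.|S),
# evaluated exactly by weighting each (s, a) by eta(s) * pi(a|s)
g_exp = sum(eta[s] * p[s, a] * (grad_pi(p, s, a) / p[s, a]) * q[s, a]
            for s in range(n) for a in range(A))

assert np.allclose(g_sum, g_exp)  # the two forms agree
```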
It is notable that $\pi(a|s,\theta)$ must be positive for all $(s,a)$ to ensure that $\ln \pi(a|s,\theta)$ is valid. This can be achieved by using softmax functions:
$$\pi(a|s,\theta) = \frac{e^{h(s,a,\theta)}}{\sum_{a'\in\mathcal{A}} e^{h(s,a',\theta)}}, \quad a\in\mathcal{A}, \tag{9.12}$$
where $h(s,a,\theta)$ is a function indicating the preference for selecting $a$ at $s$. The policy in (9.12) satisfies $\pi(a|s,\theta)\in(0,1)$ and $\sum_{a\in\mathcal{A}}\pi(a|s,\theta)=1$ for any $s\in\mathcal{S}$. This policy can be realized by a neural network. The input of the network is $s$. The output layer is a softmax layer so that the network outputs $\pi(a|s,\theta)$ for all $a$ and the sum of the outputs is equal to 1. See Figure 9.2(b) for an illustration.
Since π(a∣s,θ)>0 for all a , the policy is stochastic and hence exploratory. The policy does not directly tell which action to take. Instead, the action should be generated according to the probability distribution of the policy.
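For concreteness, here is a minimal sketch of the softmax policy (9.12) with a tabular preference function, i.e., assuming $h(s,a,\theta) = \theta[s,a]$ rather than a neural network; the dimensions and variable names are illustrative.

```python
# A minimal sketch of the softmax policy (9.12), tabular preferences assumed.
import numpy as np

rng = np.random.default_rng(1)
n, A = 5, 3                        # illustrative state/action counts
theta = rng.normal(size=(n, A))    # h(s, a, theta) = theta[s, a]

def policy(s, theta):
    """Return pi(.|s, theta) via a numerically stable softmax."""
    h = theta[s]                   # preferences h(s, a, theta) for all a
    e = np.exp(h - h.max())        # subtract the max for numerical stability
    return e / e.sum()             # entries in (0, 1), summing to 1

def sample_action(s, theta):
    """The policy is stochastic: draw A ~ pi(.|s, theta)."""
    return rng.choice(A, p=policy(s, theta))

probs = policy(0, theta)
print(probs, probs.sum())          # all positive, sums to 1
print(sample_action(0, theta))     # an action drawn from the distribution
```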
9.3.1 Derivation of the gradients in the discounted case
We next derive the gradients of the metrics in the discounted case where γ∈(0,1) . The state value and action value in the discounted case are defined as
$$v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\big],$$
$$q_\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\big].$$
It holds that $v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)$, and the state value satisfies the Bellman equation.
First, we show that vˉπ(θ) and rˉπ(θ) are equivalent metrics.
Lemma 9.1 (Equivalence between vˉπ(θ) and rˉπ(θ) ). In the discounted case where γ∈(0,1) , it holds that
$$\bar{r}_\pi = (1-\gamma)\bar{v}_\pi. \tag{9.13}$$
Proof. Note that $\bar{v}_\pi(\theta) = d_\pi^T v_\pi$ and $\bar{r}_\pi(\theta) = d_\pi^T r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi = r_\pi + \gamma P_\pi v_\pi$. Multiplying both sides of the Bellman equation by $d_\pi^T$ yields
$$\bar{v}_\pi = \bar{r}_\pi + \gamma d_\pi^T P_\pi v_\pi = \bar{r}_\pi + \gamma d_\pi^T v_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi,$$
where the second equality is due to $d_\pi^T P_\pi = d_\pi^T$, since $d_\pi$ is the stationary distribution under $\pi$,
which implies (9.13).
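Lemma 9.1 can be verified numerically on a small randomly generated MDP. In the sketch below, the MDP, the policy, and names such as `P_pi` and `d_pi` are illustrative assumptions; the stationary distribution is obtained as the left eigenvector of $P_\pi$ for eigenvalue 1.

```python
# Numerical check of Lemma 9.1: r_bar = (1 - gamma) * v_bar.
import numpy as np

rng = np.random.default_rng(2)
n, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n), size=(n, A))   # P[s, a, s'] = p(s'|s, a)
r = rng.normal(size=(n, A))                  # expected reward r(s, a)
pi = rng.dirichlet(np.ones(A), size=n)       # an arbitrary policy pi(a|s)

P_pi = np.einsum('sa,sat->st', pi, P)        # [P_pi]_{ss'} under pi
r_pi = np.einsum('sa,sa->s', pi, r)          # r_pi(s)

# state values from the Bellman equation v = r + gamma * P_pi v
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# stationary distribution: left eigenvector of P_pi for eigenvalue 1
w, V = np.linalg.eig(P_pi.T)
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))])
d_pi /= d_pi.sum()

r_bar = d_pi @ r_pi
v_bar = d_pi @ v_pi
assert np.isclose(r_bar, (1 - gamma) * v_bar)   # equation (9.13)
```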
Second, the following lemma gives the gradient of vπ(s) for any s .
Lemma 9.2 (Gradient of $v_\pi(s)$). In the discounted case, it holds for any $s\in\mathcal{S}$ that
$$\nabla_\theta v_\pi(s) = \sum_{s'\in\mathcal{S}} \mathrm{Pr}_\pi(s'|s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a), \tag{9.14}$$
where
$$\mathrm{Pr}_\pi(s'|s) \doteq \sum_{k=0}^{\infty} \gamma^k \big[P_\pi^k\big]_{ss'} = \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'}$$
is the discounted total probability of transitioning from $s$ to $s'$ under policy $\pi$. Here, $[\cdot]_{ss'}$ denotes the entry in the $s$th row and $s'$th column, and $[P_\pi^k]_{ss'}$ is the probability of transitioning from $s$ to $s'$ using exactly $k$ steps under $\pi$.
Proof. It follows from $v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)$ and $q_\pi(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\, v_\pi(s')$ that
$$\nabla_\theta v_\pi(s) = \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) + \gamma \sum_{a\in\mathcal{A}} \pi(a|s,\theta) \sum_{s'\in\mathcal{S}} p(s'|s,a)\, \nabla_\theta v_\pi(s').$$
It is notable that $\nabla_\theta v_\pi$ appears on both sides of the above equation. One way to calculate it is to use the unrolling technique [64]. Here, we use another way based on the matrix-vector form, which we believe is more straightforward to understand. In particular, let
$$u(s) \doteq \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \in \mathbb{R}^m, \qquad u \doteq \begin{bmatrix}\vdots\\ u(s)\\ \vdots\end{bmatrix} \in \mathbb{R}^{mn}, \qquad \nabla_\theta v_\pi \doteq \begin{bmatrix}\vdots\\ \nabla_\theta v_\pi(s)\\ \vdots\end{bmatrix} \in \mathbb{R}^{mn}.$$
Here, $n=|\mathcal{S}|$, and $m$ is the dimension of the parameter vector $\theta$. Then, the above equation can be written in the matrix-vector form
$$\nabla_\theta v_\pi = u + \gamma (P_\pi \otimes I_m)\, \nabla_\theta v_\pi.$$
The reason that the Kronecker product $\otimes$ emerges in the equation is that $\nabla_\theta v_\pi(s)$ is a vector. The above equation is a linear equation of $\nabla_\theta v_\pi$, which can be solved as
$$\nabla_\theta v_\pi = (I_{mn} - \gamma P_\pi \otimes I_m)^{-1} u.$$
Since $(I_{mn} - \gamma P_\pi \otimes I_m)^{-1} = \sum_{k=0}^{\infty} \gamma^k (P_\pi^k \otimes I_m)$, reading off the block corresponding to state $s$ gives
$$\nabla_\theta v_\pi(s) = \sum_{s'\in\mathcal{S}} \big[(I_n - \gamma P_\pi)^{-1}\big]_{ss'}\, u(s'). \tag{9.18}$$
Note that $[P_\pi^k]_{ss'}$ is the probability of transitioning from $s$ to $s'$ using exactly $k$ steps (see Box 8.1). Therefore, $[(I_n - \gamma P_\pi)^{-1}]_{ss'} = \sum_{k=0}^{\infty} \gamma^k [P_\pi^k]_{ss'}$ is the discounted total probability of transitioning from $s$ to $s'$ using any number of steps. By denoting $[(I_n - \gamma P_\pi)^{-1}]_{ss'} \doteq \mathrm{Pr}_\pi(s'|s)$, equation (9.18) becomes (9.14).
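The identity $\sum_{k=0}^{\infty}\gamma^k P_\pi^k = (I_n - \gamma P_\pi)^{-1}$ underlying $\mathrm{Pr}_\pi$ is a Neumann series and is easy to check numerically. A short sketch, with an arbitrary row-stochastic matrix standing in for $P_\pi$ (illustrative, not from the text):

```python
# Check the Neumann-series identity behind Pr_pi:
# sum_k gamma^k P_pi^k equals (I - gamma P_pi)^{-1}.
import numpy as np

rng = np.random.default_rng(3)
n, gamma = 5, 0.9
P_pi = rng.dirichlet(np.ones(n), size=n)     # a random row-stochastic matrix

# truncated series; the tail is bounded by gamma^500 / (1 - gamma)
series = sum(gamma**k * np.linalg.matrix_power(P_pi, k) for k in range(500))
assert np.allclose(series, np.linalg.inv(np.eye(n) - gamma * P_pi), atol=1e-8)
```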
With the results in Lemma 9.2, we are ready to derive the gradient of vˉπ0 .
Theorem 9.2 (Gradient of $\bar{v}_\pi^0$ in the discounted case). In the discounted case where $\gamma\in(0,1)$, the gradient of $\bar{v}_\pi^0 = d_0^T v_\pi$ is
$$\nabla_\theta \bar{v}_\pi^0 = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],$$
where $S\sim\rho_\pi$ and $A\sim\pi(\cdot|S,\theta)$. Here, the state distribution $\rho_\pi$ is
$$\rho_\pi(s) = \sum_{s'\in\mathcal{S}} d_0(s')\, \mathrm{Pr}_\pi(s|s'), \quad s\in\mathcal{S}, \tag{9.19}$$
where $\mathrm{Pr}_\pi(s|s') = \sum_{k=0}^{\infty} \gamma^k [P_\pi^k]_{s's} = [(I_n - \gamma P_\pi)^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.
Proof. Since $\bar{v}_\pi^0 = d_0^T v_\pi = \sum_{s\in\mathcal{S}} d_0(s)\, v_\pi(s)$ and $d_0$ is independent of $\theta$, substituting the expression of $\nabla_\theta v_\pi(s)$ given in Lemma 9.2 yields
$$\begin{aligned}
\nabla_\theta \bar{v}_\pi^0 &= \sum_{s\in\mathcal{S}} d_0(s)\, \nabla_\theta v_\pi(s) \\
&= \sum_{s\in\mathcal{S}} d_0(s) \sum_{s'\in\mathcal{S}} \mathrm{Pr}_\pi(s'|s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a) \\
&= \sum_{s'\in\mathcal{S}} \left( \sum_{s\in\mathcal{S}} d_0(s)\, \mathrm{Pr}_\pi(s'|s) \right) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a) \\
&\doteq \sum_{s'\in\mathcal{S}} \rho_\pi(s') \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s',\theta)\, q_\pi(s',a) \\
&= \sum_{s\in\mathcal{S}} \rho_\pi(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \qquad (\text{change } s' \text{ to } s) \\
&= \sum_{s\in\mathcal{S}} \rho_\pi(s) \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, \nabla_\theta \ln \pi(a|s,\theta)\, q_\pi(s,a) \\
&= \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],
\end{aligned}$$
where $S\sim\rho_\pi$ and $A\sim\pi(\cdot|S,\theta)$. The proof is complete.
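Since Theorem 9.2 is an exact result, it can be validated against finite differences. The sketch below assumes a tabular softmax policy (so $\nabla_\theta\pi$ has the closed form used in the earlier sketches) and randomly generated dynamics; all names are illustrative placeholders.

```python
# Check Theorem 9.2: the rho_pi-weighted gradient formula matches
# finite differences of v_bar^0 = d0^T v_pi (illustrative tabular setup).
import numpy as np

rng = np.random.default_rng(8)
n, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(n, A))   # p(s'|s, a)
r = rng.normal(size=(n, A))
theta = rng.normal(size=(n, A))
d0 = rng.dirichlet(np.ones(n))               # initial state distribution

def pi_of(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def v_bar0(theta):
    p = pi_of(theta)
    P_pi = np.einsum('sa,sat->st', p, P)
    r_pi = np.einsum('sa,sa->s', p, r)
    return d0 @ np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

p = pi_of(theta)
P_pi = np.einsum('sa,sat->st', p, P)
r_pi = np.einsum('sa,sa->s', p, r)
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
q_pi = r + gamma * np.einsum('sat,t->sa', P, v_pi)    # q_pi(s, a)
rho = d0 @ np.linalg.inv(np.eye(n) - gamma * P_pi)    # rho_pi, eq. (9.19)

# formula: sum_s rho(s) sum_a grad pi(a|s) q(s, a), tabular softmax gradient
g = np.zeros_like(theta)
for s in range(n):
    for a in range(A):
        g[s, :] += rho[s] * p[s, a] * ((np.arange(A) == a) - p[s, :]) * q_pi[s, a]

# central finite differences of v_bar^0 with respect to theta
eps, fd = 1e-5, np.zeros_like(theta)
for s in range(n):
    for b in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[s, b] += eps; tm[s, b] -= eps
        fd[s, b] = (v_bar0(tp) - v_bar0(tm)) / (2 * eps)
assert np.allclose(g, fd, atol=1e-6)
```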
With Lemma 9.1 and Lemma 9.2, we can derive the gradients of rˉπ and vˉπ .
Theorem 9.3 (Gradients of $\bar{r}_\pi$ and $\bar{v}_\pi$ in the discounted case). In the discounted case where $\gamma\in(0,1)$, the gradients of $\bar{r}_\pi$ and $\bar{v}_\pi$ are
$$\nabla_\theta \bar{r}_\pi = (1-\gamma)\nabla_\theta \bar{v}_\pi \approx \sum_{s\in\mathcal{S}} d_\pi(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big],$$
where $S\sim d_\pi$ and $A\sim\pi(\cdot|S,\theta)$. The approximation is more accurate when $\gamma$ is closer to 1.
To see where the approximation comes from, note that $\bar{v}_\pi = d_\pi^T v_\pi$, and hence the product rule gives
$$\nabla_\theta \bar{v}_\pi = \sum_{s\in\mathcal{S}} \big(\nabla_\theta d_\pi(s)\big)\, v_\pi(s) + \sum_{s\in\mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s). \tag{9.20}$$
The first term of (9.20) involves $\nabla_\theta d_\pi$, which is difficult to calculate. However, since the second term contains the factor $\frac{1}{1-\gamma}$, the second term becomes dominant, and the first term becomes negligible when $\gamma\to 1$. Therefore,
$$\nabla_\theta \bar{v}_\pi \approx \sum_{s\in\mathcal{S}} d_\pi(s)\, \nabla_\theta v_\pi(s) = \frac{1}{1-\gamma} \sum_{s\in\mathcal{S}} d_\pi(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a),$$
where the last equality follows from Lemma 9.2 together with $d_\pi^T (I_n - \gamma P_\pi)^{-1} = \frac{1}{1-\gamma} d_\pi^T$ (a consequence of $d_\pi^T P_\pi = d_\pi^T$).
The approximation in the above equation requires that the first term does not go to infinity when γ→1 . More information can be found in [66, Section 4].
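The role of $\gamma$ can also be seen numerically: the weighted sum $\sum_s d_\pi(s)\sum_a \nabla_\theta\pi(a|s,\theta)\, q_\pi(s,a)$ approaches the true gradient of $\bar{r}_\pi$ (computed by finite differences) as $\gamma\to 1$. A sketch under the same illustrative tabular setup as before; note that $\bar{r}_\pi$ itself does not depend on $\gamma$, while the approximation does.

```python
# Illustrate Theorem 9.3: the approximation error of grad r_bar
# shrinks as gamma -> 1 (illustrative random MDP and tabular policy).
import numpy as np

rng = np.random.default_rng(4)
n, A = 4, 2
P = rng.dirichlet(np.ones(n), size=(n, A))
r = rng.normal(size=(n, A))
theta = rng.normal(size=(n, A))

def pi_of(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(P_pi):
    w, V = np.linalg.eig(P_pi.T)
    d = np.real(V[:, np.argmin(np.abs(w - 1))])
    return d / d.sum()

def r_bar(theta):
    p = pi_of(theta)
    P_pi = np.einsum('sa,sat->st', p, P)
    return stationary(P_pi) @ np.einsum('sa,sa->s', p, r)

# exact gradient of r_bar by central finite differences
eps, g_true = 1e-5, np.zeros_like(theta)
for s in range(n):
    for b in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[s, b] += eps; tm[s, b] -= eps
        g_true[s, b] = (r_bar(tp) - r_bar(tm)) / (2 * eps)

# approximation: sum_s d_pi(s) sum_a grad pi(a|s) q_pi(s, a), discounted q_pi
p = pi_of(theta)
P_pi = np.einsum('sa,sat->st', p, P)
r_pi = np.einsum('sa,sa->s', p, r)
d_pi = stationary(P_pi)
for gamma in (0.5, 0.9, 0.99, 0.999):
    v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    q_pi = r + gamma * np.einsum('sat,t->sa', P, v_pi)
    g = np.zeros_like(theta)
    for s in range(n):
        for a in range(A):
            g[s, :] += d_pi[s] * p[s, a] * ((np.arange(A) == a) - p[s, :]) * q_pi[s, a]
    print(gamma, np.max(np.abs(g - g_true)))   # error shrinks as gamma -> 1
```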
9.3.2 Derivation of the gradients in the undiscounted case
We next show how to calculate the gradients of the metrics in the undiscounted case where γ=1. Readers may wonder why we suddenly consider the undiscounted case when only the discounted case has been discussed so far in this book. In fact, the definition of the average reward $\bar{r}_\pi$ is valid in both the discounted and undiscounted cases. While the gradient of $\bar{r}_\pi$ in the discounted case is an approximation, we will see that its gradient in the undiscounted case is more elegant.
State values and the Poisson equation
In the undiscounted case, it is necessary to redefine state and action values. Since the undiscounted sum of the rewards, $\mathbb{E}[R_{t+1}+R_{t+2}+R_{t+3}+\cdots \mid S_t=s]$, may diverge, the state and action values are defined in a special way [64]:
$$v_\pi(s) = \mathbb{E}\big[(R_{t+1}-\bar{r}_\pi) + (R_{t+2}-\bar{r}_\pi) + (R_{t+3}-\bar{r}_\pi) + \cdots \mid S_t=s\big],$$
$$q_\pi(s,a) = \mathbb{E}\big[(R_{t+1}-\bar{r}_\pi) + (R_{t+2}-\bar{r}_\pi) + (R_{t+3}-\bar{r}_\pi) + \cdots \mid S_t=s, A_t=a\big],$$
where $\bar{r}_\pi$ is the average reward, which is determined when $\pi$ is given. There are different names for $v_\pi(s)$ in the literature, such as the differential reward [65] or bias [2, Section 8.2.1]. It can be verified that the state value defined above satisfies the following Bellman-like equation:
$$v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\left[\sum_r p(r|s,a)(r - \bar{r}_\pi) + \sum_{s'\in\mathcal{S}} p(s'|s,a)\, v_\pi(s')\right], \quad s\in\mathcal{S}. \tag{9.22}$$
Since $v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)$, it holds that $q_\pi(s,a) = \sum_r p(r|s,a)(r - \bar{r}_\pi) + \sum_{s'\in\mathcal{S}} p(s'|s,a)\, v_\pi(s')$. The matrix-vector form of (9.22) is
$$v_\pi = r_\pi - \bar{r}_\pi 1_n + P_\pi v_\pi, \tag{9.23}$$
where $1_n = [1,\dots,1]^T \in \mathbb{R}^n$. Equation (9.23) is similar to the Bellman equation, and it has a specific name: the Poisson equation [65, 67].
How can we solve $v_\pi$ from the Poisson equation? The answer is given in the following theorem.
Theorem 9.4 (Solution of the Poisson equation). Let
$$v_\pi^* = (I_n - P_\pi + 1_n d_\pi^T)^{-1} r_\pi. \tag{9.24}$$
Then, $v_\pi^*$ is a solution of the Poisson equation in (9.23). Moreover, any solution of the Poisson equation has the following form:
$$v_\pi = v_\pi^* + c 1_n,$$
where $c \in \mathbb{R}$ is any real number.
This theorem indicates that the solution of the Poisson equation may not be unique.
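Theorem 9.4 is straightforward to check numerically: build $v_\pi^*$ from (9.24), substitute it into (9.23), and confirm that any constant shift $c 1_n$ also satisfies the equation. A sketch with an illustrative random MDP (all names are placeholders):

```python
# Verify Theorem 9.4: v* from (9.24) solves the Poisson equation (9.23),
# and so does v* + c * 1_n for any constant c.
import numpy as np

rng = np.random.default_rng(5)
n, A = 5, 3
P = rng.dirichlet(np.ones(n), size=(n, A))   # p(s'|s, a)
r = rng.normal(size=(n, A))
pi = rng.dirichlet(np.ones(A), size=n)       # an arbitrary policy

P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sa->s', pi, r)
w, V = np.linalg.eig(P_pi.T)                 # stationary distribution d_pi
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))])
d_pi /= d_pi.sum()
r_bar = d_pi @ r_pi
one = np.ones(n)

# (9.24): v* = (I - P_pi + 1_n d_pi^T)^{-1} r_pi
v_star = np.linalg.solve(np.eye(n) - P_pi + np.outer(one, d_pi), r_pi)
assert np.allclose(v_star, r_pi - r_bar * one + P_pi @ v_star)   # (9.23)

c = 3.7                                      # any constant shift also works
v_shift = v_star + c * one
assert np.allclose(v_shift, r_pi - r_bar * one + P_pi @ v_shift)
```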
Box 9.5: Proof of Theorem 9.4
The proof consists of three steps.
⋄ Step 1: Show that vπ∗ in (9.24) is a solution of the Poisson equation.
For the sake of simplicity, let
$$A \doteq I_n - P_\pi + 1_n d_\pi^T.$$
Then, $v_\pi^* = A^{-1} r_\pi$. The fact that $A$ is invertible will be proven in Step 3. Substituting $v_\pi^* = A^{-1} r_\pi$ and $\bar{r}_\pi = d_\pi^T r_\pi$ into (9.23) gives
$$A^{-1} r_\pi = r_\pi - 1_n d_\pi^T r_\pi + P_\pi A^{-1} r_\pi.$$
This equation is valid, as proven below. Rearranging it gives $(-A^{-1} + I_n - 1_n d_\pi^T + P_\pi A^{-1}) r_\pi = 0$, and consequently,
$$(-I_n + A - 1_n d_\pi^T A + P_\pi) A^{-1} r_\pi = 0.$$
The term in the brackets in the above equation is zero because
$$-I_n + A - 1_n d_\pi^T A + P_\pi = -I_n + (I_n - P_\pi + 1_n d_\pi^T) - 1_n d_\pi^T (I_n - P_\pi + 1_n d_\pi^T) + P_\pi = 1_n d_\pi^T P_\pi - 1_n (d_\pi^T 1_n) d_\pi^T = 0,$$
where the last equality is due to $d_\pi^T P_\pi = d_\pi^T$ and $d_\pi^T 1_n = 1$. Therefore, $v_\pi^*$ in (9.24) is a solution.
⋄ Step 2: General expression of the solutions.
Substituting $\bar{r}_\pi = d_\pi^T r_\pi$ into (9.23) gives
$$v_\pi = r_\pi - 1_n d_\pi^T r_\pi + P_\pi v_\pi \tag{9.25}$$
and consequently
$$(I_n - P_\pi) v_\pi = (I_n - 1_n d_\pi^T) r_\pi. \tag{9.26}$$
It is noted that $I_n - P_\pi$ is singular because $(I_n - P_\pi) 1_n = 0$ for any $\pi$. Therefore, the solution of (9.26) is not unique: if $v_\pi^*$ is a solution, then $v_\pi^* + x$ is also a solution for any $x \in \mathrm{Null}(I_n - P_\pi)$. When $P_\pi$ is irreducible, $\mathrm{Null}(I_n - P_\pi) = \mathrm{span}\{1_n\}$. Then, any solution of the Poisson equation has the expression $v_\pi^* + c 1_n$, where $c \in \mathbb{R}$.
⋄ Step 3: Show that $A = I_n - P_\pi + 1_n d_\pi^T$ is invertible.
Since $v_\pi^*$ involves $A^{-1}$, it is necessary to show that $A$ is invertible. The analysis is summarized in the following lemma.
Lemma 9.3. The matrix $I_n - P_\pi + 1_n d_\pi^T$ is invertible, and its inverse is
$$\big[I_n - (P_\pi - 1_n d_\pi^T)\big]^{-1} = \sum_{k=1}^{\infty} (P_\pi^k - 1_n d_\pi^T) + I_n.$$
Proof. First of all, we state some preliminary facts without proof. Let $\rho(M)$ be the spectral radius of a matrix $M$. Then, $I - M$ is invertible if $\rho(M) < 1$. Moreover, $\rho(M) < 1$ if and only if $\lim_{k\to\infty} M^k = 0$.
Based on the above facts, we next show that $\lim_{k\to\infty} (P_\pi - 1_n d_\pi^T)^k = 0$, and then the invertibility of $I_n - (P_\pi - 1_n d_\pi^T)$ immediately follows. To do that, we notice that
$$(P_\pi - 1_n d_\pi^T)^k = P_\pi^k - 1_n d_\pi^T, \quad k \ge 1, \tag{9.27}$$
which can be proven by induction. For instance, when $k=1$, the equation is valid. When $k=2$, we have
$$(P_\pi - 1_n d_\pi^T)^2 = P_\pi^2 - P_\pi 1_n d_\pi^T - 1_n d_\pi^T P_\pi + 1_n d_\pi^T 1_n d_\pi^T = P_\pi^2 - 1_n d_\pi^T,$$
where the last equality uses $P_\pi 1_n = 1_n$, $d_\pi^T P_\pi = d_\pi^T$, and $d_\pi^T 1_n = 1$. Since $P_\pi^k \to 1_n d_\pi^T$ as $k\to\infty$, it follows from (9.27) that $(P_\pi - 1_n d_\pi^T)^k \to 0$, and hence $A = I_n - (P_\pi - 1_n d_\pi^T)$ is invertible. Finally,
$$\big[I_n - (P_\pi - 1_n d_\pi^T)\big]^{-1} = \sum_{k=0}^{\infty} (P_\pi - 1_n d_\pi^T)^k = I_n + \sum_{k=1}^{\infty} (P_\pi^k - 1_n d_\pi^T),$$
which is the expression given in the lemma.
The proof of Lemma 9.3 is inspired by [66]. However, the result $(I_n - P_\pi + 1_n d_\pi^T)^{-1} = \sum_{k=0}^{\infty} (P_\pi^k - 1_n d_\pi^T)$ given in [66] (the statement above equation (16) in [66]) is inaccurate because $\sum_{k=0}^{\infty} (P_\pi^k - 1_n d_\pi^T)$ is singular: $\sum_{k=0}^{\infty} (P_\pi^k - 1_n d_\pi^T) 1_n = 0$. Lemma 9.3 corrects this inaccuracy.
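Both (9.27) and the series expression in Lemma 9.3 can be confirmed numerically; the truncated series below converges quickly because $P_\pi^k \to 1_n d_\pi^T$ geometrically. The random row-stochastic matrix is an illustrative stand-in for $P_\pi$:

```python
# Verify (9.27) and the inverse formula of Lemma 9.3 numerically.
import numpy as np

rng = np.random.default_rng(6)
n = 5
P_pi = rng.dirichlet(np.ones(n), size=n)       # a row-stochastic matrix
w, V = np.linalg.eig(P_pi.T)                   # stationary distribution d_pi
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))])
d_pi /= d_pi.sum()
M = P_pi - np.outer(np.ones(n), d_pi)          # P_pi - 1_n d_pi^T

# (9.27): (P_pi - 1_n d_pi^T)^k = P_pi^k - 1_n d_pi^T for k >= 1
for k in (1, 2, 5):
    assert np.allclose(np.linalg.matrix_power(M, k),
                       np.linalg.matrix_power(P_pi, k) - np.outer(np.ones(n), d_pi))

# Lemma 9.3: [I - (P_pi - 1_n d_pi^T)]^{-1} = I + sum_{k>=1} (P_pi^k - 1_n d_pi^T)
inv = np.linalg.inv(np.eye(n) - M)
series = np.eye(n) + sum(np.linalg.matrix_power(P_pi, k) - np.outer(np.ones(n), d_pi)
                         for k in range(1, 200))   # truncated tail is negligible
assert np.allclose(inv, series, atol=1e-6)
```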
Derivation of gradients
Although the value of $v_\pi$ is not unique in the undiscounted case, as shown in Theorem 9.4, the value of $\bar{r}_\pi$ is unique. In particular, it follows from the Poisson equation that
$$\bar{r}_\pi 1_n = r_\pi + (P_\pi - I_n) v_\pi = r_\pi + (P_\pi - I_n)(v_\pi^* + c 1_n) = r_\pi + (P_\pi - I_n) v_\pi^*,$$
where the last equality is due to $(P_\pi - I_n) 1_n = 0$.
Notably, the undetermined value c is canceled and hence rˉπ is unique. Therefore, we can calculate the gradient of rˉπ in the undiscounted case. In addition, since vπ is not unique, vˉπ is not unique either. We do not study the gradient of vˉπ in the undiscounted case. For interested readers, it is worth mentioning that we can add more constraints to uniquely solve vπ from the Poisson equation. For example, by assuming that a recurrent state exists, the state value of this recurrent state can be determined [65, Section II], and hence c can be determined. There are also other ways to uniquely determine vπ . See, for example, equations (8.6.5)-(8.6.7) in [2].
The gradient of rˉπ in the undiscounted case is given below.
Theorem 9.5 (Gradient of $\bar{r}_\pi$ in the undiscounted case). In the undiscounted case, the gradient of $\bar{r}_\pi$ is
$$\nabla_\theta \bar{r}_\pi = \sum_{s\in\mathcal{S}} d_\pi(s) \sum_{a\in\mathcal{A}} \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\big], \tag{9.28}$$
where $S \sim d_\pi$, $A \sim \pi(\cdot|S,\theta)$, and $d_\pi$ is the stationary distribution of the states under $\pi$.
Compared to the discounted case shown in Theorem 9.3, the gradient of rˉπ in the undiscounted case is more elegant in the sense that (9.28) is strictly valid and S obeys the stationary distribution.
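Since (9.28) is strictly valid, it can be verified against finite differences of $\bar{r}_\pi(\theta) = d_\pi^T r_\pi$. The sketch below computes the differential values via (9.24) and assumes the same illustrative tabular softmax policy as in the earlier sketches; note that the constant indeterminacy $c 1_n$ in $v_\pi$ does not affect the result because $\sum_a \nabla_\theta \pi(a|s,\theta) = 0$.

```python
# Verify Theorem 9.5: the average-reward gradient formula (9.28)
# matches finite differences of r_bar (illustrative tabular setup).
import numpy as np

rng = np.random.default_rng(7)
n, A = 4, 2
P = rng.dirichlet(np.ones(n), size=(n, A))
r = rng.normal(size=(n, A))
theta = rng.normal(size=(n, A))

def pi_of(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary(P_pi):
    w, V = np.linalg.eig(P_pi.T)
    d = np.real(V[:, np.argmin(np.abs(w - 1))])
    return d / d.sum()

def r_bar(theta):
    p = pi_of(theta)
    P_pi = np.einsum('sa,sat->st', p, P)
    return stationary(P_pi) @ np.einsum('sa,sa->s', p, r)

p = pi_of(theta)
P_pi = np.einsum('sa,sat->st', p, P)
r_pi = np.einsum('sa,sa->s', p, r)
d_pi = stationary(P_pi)
rb = d_pi @ r_pi

# differential state values from (9.24), then q_pi from the remark after (9.22)
v_pi = np.linalg.solve(np.eye(n) - P_pi + np.outer(np.ones(n), d_pi), r_pi)
q_pi = (r - rb) + np.einsum('sat,t->sa', P, v_pi)

g = np.zeros_like(theta)             # (9.28), exact over the finite MDP
for s in range(n):
    for a in range(A):
        g[s, :] += d_pi[s] * p[s, a] * ((np.arange(A) == a) - p[s, :]) * q_pi[s, a]

eps, fd = 1e-5, np.zeros_like(theta)
for s in range(n):
    for b in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[s, b] += eps; tm[s, b] -= eps
        fd[s, b] = (r_bar(tp) - r_bar(tm)) / (2 * eps)
assert np.allclose(g, fd, atol=1e-6)
```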
Box 9.6: Proof of Theorem 9.5
First of all, it follows from $v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s,\theta)\, q_\pi(s,a)$ that