Up to now, the policies used in policy gradient methods have all been stochastic, since π(a∣s,θ)>0 is required for every (s,a). This section shows that deterministic policies can also be used in policy gradient methods. Here, "deterministic" indicates that, for any state, a single action is given a probability of one and all the other actions are given probabilities of zero. It is important to study the deterministic case since it is naturally off-policy and can effectively handle continuous action spaces.
We have been using π(a∣s,θ) to denote a general policy, which can be either stochastic or deterministic. In this section, we use
a=μ(s,θ)
to specifically denote a deterministic policy. Different from π which gives the probability of an action, μ directly gives the action since it is a mapping from S to A . This deterministic policy can be represented by, for example, a neural network with s as its input, a as its output, and θ as its parameter. For the sake of simplicity, we often write μ(s,θ) as μ(s) for short.
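For concreteness, here is a minimal sketch (assuming PyTorch, a continuous action space normalized to [-1, 1], and arbitrary layer sizes) of how such a deterministic policy network might look; the class name and architecture are illustrative, not prescribed by the text.

```python
import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    """A deterministic policy a = mu(s, theta): maps a state vector to an action vector."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),  # squash the output into [-1, 1]; rescale outside if needed
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # The output is the action itself, not a probability distribution over actions.
        return self.net(s)

# Usage: mu = DeterministicPolicy(state_dim=3, action_dim=1); a = mu(torch.randn(1, 3))
```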
10.4.1 The deterministic policy gradient theorem
The policy gradient theorem introduced in the last chapter is only valid for stochastic policies. When we require the policy to be deterministic, a new policy gradient theorem must be derived.
Theorem 10.2 (Deterministic policy gradient theorem). The gradient of J(θ) is

$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S\sim\eta}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big], \tag{10.14}$$

where η is a distribution of the states.
Theorem 10.2 is a summary of the results presented in Theorem 10.3 and Theorem 10.4 since the gradients in the two theorems have similar expressions. The specific expressions of J(θ) and η can be found in Theorems 10.3 and 10.4.
Unlike the stochastic case, the gradient in the deterministic case shown in (10.14) does not involve the action random variable A . As a result, when we use samples to approximate the true gradient, it is not required to sample actions. Therefore, the deterministic policy gradient method is off-policy. In addition, some readers may wonder why (∇aqμ(S,a))∣a=μ(S) cannot be written as ∇aqμ(S,μ(S)) , which seems more concise. That is simply because, if we do that, it is unclear how qμ(S,μ(S)) is a function of a . A concise yet less confusing expression may be ∇aqμ(S,a=μ(S)) .
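To make the notation concrete, the following sketch (assuming PyTorch, a made-up quadratic critic standing in for qμ, and a one-dimensional linear policy a=θs chosen only for illustration) evaluates (∇aq(s,a))∣a=μ(s) with automatic differentiation and checks the chain rule ∇θq(s,μ(s,θ)) = ∇θμ(s,θ)(∇aq(s,a))∣a=μ(s).

```python
import torch

# A made-up differentiable critic standing in for q_mu(s, a), chosen only for illustration.
def q(s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    return -(a - 0.5 * s).pow(2).sum()

theta = torch.tensor([0.3], requires_grad=True)  # policy parameter
s = torch.tensor([1.0])                          # a fixed state

# Deterministic policy a = mu(s, theta); a simple linear form used only for this example.
a = theta * s

# (grad_a q(s, a)) evaluated at a = mu(s): differentiate q with respect to the action variable.
a_var = a.detach().requires_grad_(True)
grad_a_q = torch.autograd.grad(q(s, a_var), a_var)[0]

# grad_theta mu(s, theta): here mu is linear in theta, so this is just s.
grad_theta_mu = torch.autograd.grad(a, theta, grad_outputs=torch.ones_like(a))[0]

# Chain rule: grad_theta q(s, mu(s, theta)) = grad_theta mu(s, theta) * (grad_a q(s, a))|_{a=mu(s)}.
chain_rule = grad_theta_mu * grad_a_q

# The same gradient obtained by differentiating the composite q(s, mu(s, theta)) directly.
direct = torch.autograd.grad(q(s, theta * s), theta)[0]
print(chain_rule.item(), direct.item())  # the two values agree
```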
In the rest of this subsection, we present the derivation details of Theorem 10.2. In particular, we derive the gradients of two common metrics: the first is the average value and the second is the average reward. Since these two metrics have been discussed in detail in Section 9.2, we sometimes use their properties without proof. For most readers, it is sufficient to be familiar with Theorem 10.2 without knowing its derivation details. Interested readers can selectively examine the details in the remainder of this section.
Metric 1: Average value
We first derive the gradient of the average value:
$$J(\theta) = \mathbb{E}\big[v_\mu(S)\big] = \sum_{s\in\mathcal{S}} d_0(s)\,v_\mu(s), \tag{10.15}$$
where d0 is a probability distribution over the states. Here, d0 is selected to be independent of μ for simplicity. There are two special yet important ways to select d0. The first is d0(s0)=1 and d0(s)=0 for all s≠s0, where s0 is a specific state of interest. In this case, the policy aims to maximize the discounted return obtained when starting from s0. The second is to let d0 be the state distribution of a given behavior policy that is different from the target policy.
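As a small numerical illustration (with made-up state values and distributions for a two-state example), the metric in (10.15) is just a d0-weighted average of the state values:

```python
import numpy as np

v_mu = np.array([4.0, 2.0])       # hypothetical state values under mu for S = {s0, s1}
d0_case1 = np.array([1.0, 0.0])   # all probability mass on the state of interest s0
d0_case2 = np.array([0.6, 0.4])   # e.g., the state distribution of a behavior policy

print(d0_case1 @ v_mu)  # J(theta) = v_mu(s0) = 4.0
print(d0_case2 @ v_mu)  # J(theta) = 0.6*4 + 0.4*2 = 3.2
```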
To calculate the gradient of J(θ) , we need to first calculate the gradient of vμ(s) for any s∈S . Consider the discounted case where γ∈(0,1) .
Lemma 10.1 (Gradient of vμ(s)). In the discounted case, it holds for any s∈S that

$$\nabla_\theta v_\mu(s) = \sum_{s'\in\mathcal{S}} \text{Pr}_\mu(s'|s)\,\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}, \tag{10.16}$$

where

$$\text{Pr}_\mu(s'|s) \doteq \sum_{k=0}^{\infty}\gamma^k\big[P_\mu^k\big]_{ss'} = \big[(I-\gamma P_\mu)^{-1}\big]_{ss'}$$

is the discounted total probability of transitioning from s to s′ under policy μ. Here, [⋅]ss′ denotes the entry in the sth row and s′th column of a matrix.
Box 10.3: Proof of Lemma 10.1

Since vμ(s)=qμ(s,μ(s)) and qμ(s,a)=r(s,a)+γ∑s′∈S p(s′∣s,a)vμ(s′), taking the gradient of vμ(s) with respect to θ and applying the chain rule gives, for every s∈S,

$$\nabla_\theta v_\mu(s) = \nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} + \gamma\sum_{s'\in\mathcal{S}} p(s'|s,\mu(s))\,\nabla_\theta v_\mu(s').$$

Stacking this equation over all states yields a matrix-vector form, where ∇θvμ∈Rmn stacks ∇θvμ(s) over all s∈S, u∈Rmn stacks ∇θμ(s)(∇aqμ(s,a))∣a=μ(s) over all s∈S, n=∣S∣, m is the dimension of θ, Pμ is the state transition matrix with [Pμ]ss′=p(s′∣s,μ(s)), and ⊗ is the Kronecker product. This matrix-vector form can be written concisely as

$$\nabla_\theta v_\mu = u + \gamma(P_\mu\otimes I_m)\nabla_\theta v_\mu,$$

which is a linear equation of ∇θvμ. Then, ∇θvμ can be solved as

$$\nabla_\theta v_\mu = (I_{mn} - \gamma P_\mu\otimes I_m)^{-1}u = \big((I_n - \gamma P_\mu)^{-1}\otimes I_m\big)u. \tag{10.19}$$

Note that [Pμk]ss′ is the probability of transitioning from s to s′ using exactly k steps (see Box 8.1 for more information). Therefore, [(I−γPμ)−1]ss′ = ∑k=0∞ γk[Pμk]ss′ is the discounted total probability of transitioning from s to s′ using any number of steps. By denoting [(I−γPμ)−1]ss′ ≐ Prμ(s′∣s), equation (10.19) leads to (10.16).
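The identity (I−γPμ)−1 = ∑k=0∞ γkPμk used above is easy to check numerically; a short sketch with a made-up two-state transition matrix:

```python
import numpy as np

# A small made-up state transition matrix P_mu (rows sum to 1) and a discount factor.
P_mu = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
gamma = 0.9

# Discounted total transition probabilities: (I - gamma * P_mu)^{-1}.
closed_form = np.linalg.inv(np.eye(2) - gamma * P_mu)

# The same quantity as a truncated series sum_{k=0}^{K} gamma^k * P_mu^k.
series = sum(gamma**k * np.linalg.matrix_power(P_mu, k) for k in range(1000))

print(np.allclose(closed_form, series))  # True: the two computations agree
```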
With the preparation in Lemma 10.1, we are ready to derive the gradient of J(θ) .
Theorem 10.3 (Deterministic policy gradient theorem in the discounted case). In the discounted case where γ∈(0,1), the gradient of J(θ) in (10.15) is

$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \rho_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S\sim\rho_\mu}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big],$$

where ρμ(s) ≐ ∑s′∈S d0(s′)Prμ(s∣s′) is a state distribution, and Prμ(s∣s′) = ∑k=0∞ γk[Pμk]s′s = [(I−γPμ)−1]s′s is the discounted total probability of transitioning from s′ to s under policy μ.
Box 10.4: Proof of Theorem 10.3
Since d0 is independent of μ , we have
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} d_0(s)\,\nabla_\theta v_\mu(s).$$
Substituting the expression of ∇θvμ(s) given by Lemma 10.1 into the above equation yields
$$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{s\in\mathcal{S}} d_0(s)\,\nabla_\theta v_\mu(s)\\
&= \sum_{s\in\mathcal{S}} d_0(s)\sum_{s'\in\mathcal{S}} \text{Pr}_\mu(s'|s)\,\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}\\
&= \sum_{s'\in\mathcal{S}}\Big(\sum_{s\in\mathcal{S}} d_0(s)\,\text{Pr}_\mu(s'|s)\Big)\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}\\
&\doteq \sum_{s'\in\mathcal{S}} \rho_\mu(s')\,\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}\\
&= \sum_{s\in\mathcal{S}} \rho_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} \qquad (\text{change } s' \text{ to } s)\\
&= \mathbb{E}_{S\sim\rho_\mu}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].
\end{aligned}$$
The proof is complete. The above proof is consistent with the proof of Theorem 1 in [74]. Here, we consider the case in which the states and actions are finite. When they are continuous, the proof is similar, but the summations should be replaced by integrals [74].
Metric 2: Average reward
We next derive the gradient of the average reward:

$$J(\theta) = \bar{r}_\mu \doteq \sum_{s\in\mathcal{S}} d_\mu(s)\,r_\mu(s) = \mathbb{E}_{S\sim d_\mu}\big[r_\mu(S)\big],$$

where dμ is the stationary distribution of the states under μ and rμ(s) ≐ r(s,μ(s)) is the expected immediate reward obtained at s when the action μ(s) is taken. This metric was discussed in detail in Section 9.2.

Theorem 10.4 (Deterministic policy gradient theorem in the undiscounted case). In the undiscounted case, the gradient of J(θ)=rˉμ is

$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} d_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S\sim d_\mu}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].$$

To prove this result, recall that in the undiscounted case the action value satisfies qμ(s,a)=r(s,a)−rˉμ+∑s′∈S p(s′∣s,a)vμ(s′) and vμ(s)=qμ(s,μ(s)). Taking the gradient of vμ(s) with respect to θ gives, for every s∈S,

$$\nabla_\theta v_\mu(s) = \nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} - \nabla_\theta\bar{r}_\mu + \sum_{s'\in\mathcal{S}} p(s'|s,\mu(s))\,\nabla_\theta v_\mu(s').$$

Stacking this equation over all states yields a matrix-vector form, where n=∣S∣, m is the dimension of θ, Pμ is the state transition matrix with [Pμ]ss′=p(s′∣s,μ(s)), ∇θvμ∈Rmn and u∈Rmn stack ∇θvμ(s) and ∇θμ(s)(∇aqμ(s,a))∣a=μ(s) over all s∈S, respectively, 1n is the all-one vector, and ⊗ is the Kronecker product. This matrix-vector form can be written concisely as
$$\nabla_\theta v_\mu = u - 1_n\otimes\nabla_\theta\bar{r}_\mu + (P_\mu\otimes I_m)\nabla_\theta v_\mu,$$
and hence
$$1_n\otimes\nabla_\theta\bar{r}_\mu = u + (P_\mu\otimes I_m)\nabla_\theta v_\mu - \nabla_\theta v_\mu. \tag{10.22}$$
Since dμ is the stationary distribution, we have dμTPμ=dμT and dμT1n=1. Multiplying both sides of (10.22) by dμT⊗Im gives

$$\nabla_\theta\bar{r}_\mu = (d_\mu^T\otimes I_m)u + \big((d_\mu^T P_\mu - d_\mu^T)\otimes I_m\big)\nabla_\theta v_\mu = (d_\mu^T\otimes I_m)u = \sum_{s\in\mathcal{S}} d_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)},$$

which is exactly the gradient stated in Theorem 10.4.
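The cancellation above relies on the mixed-product property of the Kronecker product, (A⊗Im)(B⊗Im)=(AB)⊗Im, together with dμT1n=1. A short numerical check with made-up matrices:

```python
import numpy as np

n, m = 3, 2
rng = np.random.default_rng(0)

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # a random stochastic matrix
d = rng.random(n); d /= d.sum()                            # a made-up distribution (sums to 1)
g = rng.random(m)                                          # plays the role of grad_theta r_bar
I_m = np.eye(m)

# (d^T kron I_m)(P kron I_m) = (d^T P) kron I_m  (mixed-product property)
lhs = np.kron(d, I_m) @ np.kron(P, I_m)
rhs = np.kron(d @ P, I_m)
print(np.allclose(lhs, rhs))  # True

# (d^T kron I_m)(1_n kron g) = (d^T 1_n) g = g, since the entries of d sum to one
print(np.allclose(np.kron(d, I_m) @ np.kron(np.ones(n), g), g))  # True
```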
Based on the gradient given in Theorem 10.2, we can apply the gradient-ascent algorithm to maximize J(θ) :
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\mathbb{E}_{S\sim\eta}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\nabla_\theta\mu(s_t)\big(\nabla_a q_\mu(s_t,a)\big)\big|_{a=\mu(s_t)}.$$
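In practice, qμ is replaced by an approximator q(s,a,w) (as in Algorithm 10.4 below), and the stochastic update is usually implemented by ascending q(s,μ(s,θ),w) through θ, letting automatic differentiation apply the chain rule. A minimal sketch, assuming PyTorch and hypothetical `actor` and `critic` modules (the names, the optimizer, and the two-argument critic interface are illustrative):

```python
import torch

def actor_update(actor, critic, states, actor_optimizer):
    """One deterministic policy gradient step: ascend q(s, mu(s, theta), w) with respect to theta."""
    actions = actor(states)                       # a = mu(s, theta), differentiable in theta
    actor_loss = -critic(states, actions).mean()  # minimizing -q is ascending q
    actor_optimizer.zero_grad()
    actor_loss.backward()                         # autograd applies the chain rule:
                                                  # grad_theta mu(s) * (grad_a q(s, a))|_{a = mu(s)}
    actor_optimizer.step()
    return actor_loss.item()
```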
The implementation is summarized in Algorithm 10.4. It should be noted that this algorithm is off-policy since the behavior policy β may differ from μ. First, the actor is off-policy; the reason was already explained when presenting Theorem 10.2. Second, the critic is also off-policy. It is worth examining why the critic is off-policy and yet does not require the importance sampling technique. In particular, the experience sample required by the critic is (st,at,rt+1,st+1,a~t+1), where a~t+1=μ(st+1). The generation of this experience sample involves two policies: one generates at at st, and the other generates a~t+1 at st+1. The first policy is the behavior policy since at is used to interact with the environment. The second policy must be μ because it is the policy that the critic aims to evaluate; hence, μ is the target policy. Since a~t+1 is not used to interact with the environment at the next time step, μ is not the behavior policy. Therefore, the critic is off-policy.
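A corresponding sketch of the critic update implied by this discussion, again assuming hypothetical `actor` and `critic` modules; note that the bootstrap action is recomputed as μ(s_{t+1}) instead of being taken from the behavior policy, which is exactly why no importance-sampling ratio is needed:

```python
import torch

def critic_update(actor, critic, batch, critic_optimizer, gamma=0.99):
    """TD(0) update of q(s, a, w) toward r + gamma * q(s', mu(s'), w)."""
    s, a, r, s_next = batch                # (s_t, a_t, r_{t+1}, s_{t+1}) collected by the behavior policy
    with torch.no_grad():
        a_next = actor(s_next)             # \tilde{a}_{t+1} = mu(s_{t+1}): the target-policy action
        td_target = r + gamma * critic(s_next, a_next)
    td_error = td_target - critic(s, a)
    critic_loss = td_error.pow(2).mean()
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    return critic_loss.item()
```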
How to select the function q(s,a,w) ? The original research work [74] that proposed the deterministic policy gradient method adopted linear functions: q(s,a,w)=ϕT(s,a)w where ϕ(s,a) is the feature vector. It is currently popular to represent q(s,a,w) using neural networks, as suggested in the deep deterministic policy gradient (DDPG) method [75].
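For instance, with a hand-crafted feature vector such as ϕ(s,a)=[s, a, sa, a²]ᵀ (a made-up choice for scalar s and a, used only for illustration), both ∇wq=ϕ(s,a) and ∇aq have simple closed forms:

```python
import numpy as np

def phi(s: float, a: float) -> np.ndarray:
    """A made-up feature vector for scalar state and action."""
    return np.array([s, a, s * a, a * a])

def q(s: float, a: float, w: np.ndarray) -> float:
    return phi(s, a) @ w            # linear critic: q(s, a, w) = phi(s, a)^T w

def grad_a_q(s: float, a: float, w: np.ndarray) -> float:
    # d/da [s*w0 + a*w1 + s*a*w2 + a^2*w3] = w1 + s*w2 + 2*a*w3
    return w[1] + s * w[2] + 2 * a * w[3]

w = np.array([0.1, 0.5, -0.2, 0.3])
print(q(1.0, 2.0, w), grad_a_q(1.0, 2.0, w))  # 1.9 and 1.5
```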
Algorithm 10.4: Deterministic policy gradient or deterministic actor-critic
Initialization: A given behavior policy β(a∣s); a deterministic target policy μ(s,θ0), where θ0 is the initial parameter; a value function q(s,a,w0), where w0 is the initial parameter; learning rates αw,αθ>0.
Goal: Learn an optimal policy to maximize J(θ) .
At time step t in each episode, do
Generate a_t following β at s_t, and then observe r_{t+1}, s_{t+1}.
Value update (critic): w_{t+1} = w_t + α_w [r_{t+1} + γ q(s_{t+1}, μ(s_{t+1}, θ_t), w_t) − q(s_t, a_t, w_t)] ∇_w q(s_t, a_t, w_t).
Policy update (actor): θ_{t+1} = θ_t + α_θ ∇_θ μ(s_t, θ_t)(∇_a q(s_t, a, w_{t+1}))∣_{a=μ(s_t)}.
How should the behavior policy β be selected? It can be any exploratory policy. It can also be a stochastic policy obtained by adding noise to μ [75]. In the latter case, the noisy version of μ serves as the behavior policy, and the implementation is on-policy.
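Putting the pieces together, below is a compact, self-contained sketch of the deterministic actor-critic loop on a toy one-step problem (reward −(a−2s)², invented purely for illustration), with exploration provided by Gaussian noise added to μ as just described; the network sizes, learning rates, iteration count, and noise scale are arbitrary choices, and replay buffers or target networks as used in DDPG [75] are omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy environment (made up): state s ~ U(-1, 1), reward r = -(a - 2s)^2, next state drawn afresh.
def env_step(s, a):
    r = -(a - 2.0 * s).pow(2)
    s_next = torch.rand_like(s) * 2 - 1
    return r, s_next

actor = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # a = mu(s, theta)
critic = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # q(s, a, w), input [s, a]
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, noise_std = 0.9, 0.2

s = torch.rand(64, 1) * 2 - 1
for t in range(2000):
    # Behavior policy beta: the deterministic policy mu plus exploration noise.
    with torch.no_grad():
        a = actor(s) + noise_std * torch.randn_like(s)
    r, s_next = env_step(s, a)

    # Critic (value) update: the TD target bootstraps with the target-policy action mu(s_{t+1}).
    with torch.no_grad():
        td_target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1))
    critic_loss = (td_target - critic(torch.cat([s, a], dim=1))).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (policy) update: ascend q(s, mu(s, theta), w) with respect to theta.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    s = s_next

# After training, mu(s) should move toward the reward-maximizing action 2s (about 1.0 here).
print(actor(torch.tensor([[0.5]])))
```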