
10.4 Deterministic actor-critic

Up to now, the policies used in the policy gradient methods have all been stochastic, since it is required that $\pi(a|s,\theta) > 0$ for every $(s,a)$. This section shows that deterministic policies can also be used in policy gradient methods. Here, "deterministic" means that, for any state, a single action is given a probability of one and all the other actions are given probabilities of zero. The deterministic case is important to study because it is naturally off-policy and can effectively handle continuous action spaces.

We have been using $\pi(a|s, \theta)$ to denote a general policy, which can be either stochastic or deterministic. In this section, we use

a = \mu(s, \theta)

to specifically denote a deterministic policy. Different from $\pi$, which gives the probability of an action, $\mu$ directly gives the action since it is a mapping from $\mathcal{S}$ to $\mathcal{A}$. This deterministic policy can be represented by, for example, a neural network with $s$ as its input, $a$ as its output, and $\theta$ as its parameter. For the sake of simplicity, we often write $\mu(s,\theta)$ as $\mu(s)$ for short.
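To make this concrete, the following is a minimal sketch of such a deterministic policy network in PyTorch; the layer width, activation, and output scaling are illustrative assumptions rather than choices made in this chapter.

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """A deterministic policy a = mu(s, theta) represented by a small MLP.
    The hidden width and the tanh output bound are illustrative assumptions."""

    def __init__(self, state_dim: int, action_dim: int, action_bound: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # squash output to (-1, 1)
        )
        self.action_bound = action_bound

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # mu(s, theta): maps a state directly to an action, not to probabilities
        return self.action_bound * self.net(s)

# Usage: one action per state, with no sampling involved
mu = DeterministicActor(state_dim=3, action_dim=1)
states = torch.randn(5, 3)
actions = mu(states)   # shape (5, 1)
```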

10.4.1 The deterministic policy gradient theorem

The policy gradient theorem introduced in the last chapter is only valid for stochastic policies. When we require the policy to be deterministic, a new policy gradient theorem must be derived.

Theorem 10.2 (Deterministic policy gradient theorem). The gradient of $J(\theta)$ is

\begin{array}{l} \nabla_{\theta} J(\theta) = \sum_{s \in \mathcal{S}} \eta(s) \nabla_{\theta} \mu(s) \big(\nabla_{a} q_{\mu}(s, a)\big)|_{a = \mu(s)} \\ = \mathbb{E}_{S \sim \eta}\left[\nabla_{\theta} \mu(S) \left(\nabla_{a} q_{\mu}(S, a)\right)|_{a = \mu(S)}\right], \tag{10.14} \end{array}

where $\eta$ is a distribution of the states.

Theorem 10.2 is a summary of the results presented in Theorem 10.3 and Theorem 10.4 since the gradients in the two theorems have similar expressions. The specific expressions of $J(\theta)$ and $\eta$ can be found in Theorems 10.3 and 10.4.

Unlike the stochastic case, the gradient in the deterministic case shown in (10.14) does not involve the action random variable $A$. As a result, when we use samples to approximate the true gradient, it is not required to sample actions from the target policy; the samples can be generated by any behavior policy. Therefore, the deterministic policy gradient method is naturally off-policy. In addition, some readers may wonder why $\left(\nabla_{a}q_{\mu}(S,a)\right)|_{a = \mu(S)}$ cannot be written as $\nabla_{a}q_{\mu}(S,\mu(S))$, which seems more concise. That is simply because, if we wrote it that way, it would be unclear how $q_{\mu}(S,\mu(S))$ is a function of $a$. A concise yet less confusing expression may be $\nabla_{a}q_{\mu}(S,a = \mu(S))$.
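In implementations based on automatic differentiation, the product $\nabla_{\theta}\mu(S)\left(\nabla_{a}q_{\mu}(S,a)\right)|_{a=\mu(S)}$ is usually not formed explicitly: by the chain rule it equals the gradient of $q(S,\mu(S,\theta))$ with respect to $\theta$ when the critic's own parameters are held fixed. The following sketch assumes PyTorch modules `mu` and `q_critic` defined elsewhere; both names are placeholders.

```python
import torch

def actor_gradient_step(mu, q_critic, actor_optimizer, s_batch):
    """One gradient-ascent step on q(s, mu(s, theta)) with respect to theta.

    Autograd applies the chain rule, producing exactly
    grad_theta mu(s) * (grad_a q(s, a))|_{a = mu(s)}.
    """
    a = mu(s_batch)                            # a = mu(s, theta); keeps the graph to theta
    actor_loss = -q_critic(s_batch, a).mean()  # ascending q is descending -q
    actor_optimizer.zero_grad()
    actor_loss.backward()                      # gradients flow through a back into theta
    actor_optimizer.step()                     # only theta changes: this optimizer holds
                                               # the actor's parameters only
```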

In the rest of this subsection, we present the derivation details of Theorem 10.2. In particular, we derive the gradients of two common metrics: the first is the average value and the second is the average reward. Since these two metrics have been discussed in detail in Section 9.2, we sometimes use their properties without proof. For most readers, it is sufficient to be familiar with Theorem 10.2 without knowing its derivation details. Interested readers can selectively examine the details in the remainder of this section.

Metric 1: Average value

We first derive the gradient of the average value:

J(\theta) = \mathbb{E}_{S \sim d_{0}}\left[v_{\mu}(S)\right] = \sum_{s \in \mathcal{S}} d_{0}(s) v_{\mu}(s), \tag{10.15}

where $d_0$ is the probability distribution of the states. Here, $d_0$ is selected to be independent of $\mu$ for simplicity. There are two special yet important cases of selecting $d_0$. The first case is that $d_0(s_0) = 1$ and $d_0(s \neq s_0) = 0$, where $s_0$ is a specific state of interest. In this case, the policy aims to maximize the discounted return that can be obtained when starting from $s_0$. The second case is that $d_0$ is the distribution of a given behavior policy that is different from the target policy.

To calculate the gradient of $J(\theta)$, we need to first calculate the gradient of $v_{\mu}(s)$ for any $s \in \mathcal{S}$. Consider the discounted case where $\gamma \in (0,1)$.

Lemma 10.1 (Gradient of $v_{\mu}(s)$). In the discounted case, it holds for any $s \in \mathcal{S}$ that

\nabla_{\theta} v_{\mu}(s) = \sum_{s^{\prime} \in \mathcal{S}} \Pr_{\mu}(s^{\prime} \mid s) \nabla_{\theta} \mu(s^{\prime}) \left(\nabla_{a} q_{\mu}(s^{\prime}, a)\right)|_{a = \mu(s^{\prime})}, \tag{10.16}

where

\Pr_{\mu}(s^{\prime} | s) \doteq \sum_{k=0}^{\infty} \gamma^{k} \left[P_{\mu}^{k}\right]_{ss^{\prime}} = \left[(I - \gamma P_{\mu})^{-1}\right]_{ss^{\prime}}

is the discounted total probability of transitioning from $s$ to $s'$ under policy $\mu$. Here, $[\cdot]_{ss'}$ denotes the entry in the $s$th row and $s'$th column of a matrix.
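As a quick numerical illustration (not part of the original text), $\Pr_{\mu}(s'|s)$ can be computed for an arbitrary toy two-state chain and compared against a truncated version of the series $\sum_{k}\gamma^{k}[P_{\mu}^{k}]_{ss'}$:

```python
import numpy as np

gamma = 0.9
# A toy state transition matrix P_mu under some deterministic policy mu
# (rows sum to one); the numbers are arbitrary.
P_mu = np.array([[0.8, 0.2],
                 [0.3, 0.7]])

# Closed form: Pr_mu(s'|s) = [(I - gamma * P_mu)^{-1}]_{s s'}
Pr_closed = np.linalg.inv(np.eye(2) - gamma * P_mu)

# Truncated series: sum_{k=0}^{K} gamma^k * P_mu^k
Pr_series = sum(gamma**k * np.linalg.matrix_power(P_mu, k) for k in range(500))

print(np.allclose(Pr_closed, Pr_series))   # True (up to truncation error)
print(Pr_closed.sum(axis=1))               # each row sums to 1 / (1 - gamma)
```

Note that each row of the resulting matrix sums to $1/(1-\gamma)$ rather than to one, so $\Pr_{\mu}(\cdot|s)$ is a discounted total probability rather than a normalized distribution.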

Box 10.3: Proof of Lemma 10.1

Since the policy is deterministic, we have

v_{\mu}(s) = q_{\mu}(s, \mu(s)).

Since both $q_{\mu}$ and $\mu$ are functions of $\theta$, we have

\nabla_{\theta} v_{\mu}(s) = \nabla_{\theta} q_{\mu}(s, \mu(s)) = \big(\nabla_{\theta} q_{\mu}(s, a)\big)|_{a = \mu(s)} + \nabla_{\theta} \mu(s) \big(\nabla_{a} q_{\mu}(s, a)\big)|_{a = \mu(s)}. \tag{10.17}

By the definition of action values, for any given $(s,a)$, we have

q_{\mu}(s, a) = r(s, a) + \gamma \sum_{s^{\prime} \in \mathcal{S}} p(s^{\prime} | s, a) v_{\mu}(s^{\prime}),

where $r(s,a) = \sum_{r} r p(r|s,a)$. Since $r(s,a)$ is independent of $\mu$, we have

\nabla_{\theta} q_{\mu}(s, a) = 0 + \gamma \sum_{s^{\prime} \in \mathcal{S}} p(s^{\prime} | s, a) \nabla_{\theta} v_{\mu}(s^{\prime}).

Substituting the above equation into (10.17) yields

\nabla_{\theta} v_{\mu}(s) = \gamma \sum_{s^{\prime} \in \mathcal{S}} p(s^{\prime} | s, \mu(s)) \nabla_{\theta} v_{\mu}(s^{\prime}) + \underbrace{\nabla_{\theta} \mu(s) \left(\nabla_{a} q_{\mu}(s, a)\right)|_{a = \mu(s)}}_{u(s)}, \quad s \in \mathcal{S}.

Since the above equation is valid for all $s \in \mathcal{S}$, we can combine these equations to obtain a matrix-vector form:

\underbrace{\left[\begin{array}{c} \vdots \\ \nabla_{\theta} v_{\mu}(s) \\ \vdots \end{array}\right]}_{\nabla_{\theta} v_{\mu} \in \mathbb{R}^{mn}} = \underbrace{\left[\begin{array}{c} \vdots \\ u(s) \\ \vdots \end{array}\right]}_{u \in \mathbb{R}^{mn}} + \gamma (P_{\mu} \otimes I_{m}) \underbrace{\left[\begin{array}{c} \vdots \\ \nabla_{\theta} v_{\mu}(s^{\prime}) \\ \vdots \end{array}\right]}_{\nabla_{\theta} v_{\mu} \in \mathbb{R}^{mn}},

where $n = |\mathcal{S}|$, $m$ is the dimension of $\theta$, $P_{\mu}$ is the state transition matrix with $[P_{\mu}]_{ss'} = p(s'|s, \mu(s))$, and $\otimes$ is the Kronecker product. The above matrix-vector form can be written concisely as

\nabla_{\theta} v_{\mu} = u + \gamma (P_{\mu} \otimes I_{m}) \nabla_{\theta} v_{\mu},

which is a linear equation in $\nabla_{\theta}v_{\mu}$. Then, $\nabla_{\theta}v_{\mu}$ can be solved as

\begin{array}{l} \nabla_{\theta} v_{\mu} = \left(I_{mn} - \gamma P_{\mu} \otimes I_{m}\right)^{-1} u \\ = \left(I_{n} \otimes I_{m} - \gamma P_{\mu} \otimes I_{m}\right)^{-1} u \\ = \left[\left(I_{n} - \gamma P_{\mu}\right)^{-1} \otimes I_{m}\right] u. \tag{10.18} \end{array}

The elementwise form of (10.18) is

\begin{array}{l} \nabla_{\theta} v_{\mu}(s) = \sum_{s^{\prime} \in \mathcal{S}} \left[(I - \gamma P_{\mu})^{-1}\right]_{ss^{\prime}} u(s^{\prime}) \\ = \sum_{s^{\prime} \in \mathcal{S}} \left[(I - \gamma P_{\mu})^{-1}\right]_{ss^{\prime}} \left[\nabla_{\theta} \mu(s^{\prime}) \left(\nabla_{a} q_{\mu}(s^{\prime}, a)\right)|_{a = \mu(s^{\prime})}\right]. \tag{10.19} \end{array}

The quantity $\left[(I - \gamma P_{\mu})^{-1}\right]_{ss'}$ has a clear probabilistic interpretation. Since $(I - \gamma P_{\mu})^{-1} = I + \gamma P_{\mu} + \gamma^{2}P_{\mu}^{2} + \dots$, we have

\left[(I - \gamma P_{\mu})^{-1}\right]_{ss^{\prime}} = [I]_{ss^{\prime}} + \gamma [P_{\mu}]_{ss^{\prime}} + \gamma^{2} [P_{\mu}^{2}]_{ss^{\prime}} + \dots = \sum_{k=0}^{\infty} \gamma^{k} [P_{\mu}^{k}]_{ss^{\prime}}.

Note that $[P_{\mu}^{k}]_{ss^{\prime}}$ is the probability of transitioning from $s$ to $s^{\prime}$ using exactly $k$ steps (see Box 8.1 for more information). Therefore, $\left[(I - \gamma P_{\mu})^{-1}\right]_{ss^{\prime}}$ is the discounted total probability of transitioning from $s$ to $s^{\prime}$ using any number of steps. By denoting $\left[(I - \gamma P_{\mu})^{-1}\right]_{ss^{\prime}} \doteq \Pr_{\mu}(s^{\prime}|s)$, equation (10.19) leads to (10.16).
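The matrix step used in (10.18), namely $(I_{mn} - \gamma P_{\mu}\otimes I_{m})^{-1} = (I_{n}-\gamma P_{\mu})^{-1}\otimes I_{m}$, can also be checked numerically; the sketch below uses arbitrary small dimensions and a random row-stochastic matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 3, 0.9

# Random row-stochastic P_mu and a random vector u of length m*n
P_mu = rng.random((n, n))
P_mu /= P_mu.sum(axis=1, keepdims=True)
u = rng.random(n * m)

# Left-hand side: invert the full (mn x mn) matrix directly
lhs = np.linalg.inv(np.eye(n * m) - gamma * np.kron(P_mu, np.eye(m))) @ u
# Right-hand side: invert the small (n x n) matrix and expand with a Kronecker product
rhs = np.kron(np.linalg.inv(np.eye(n) - gamma * P_mu), np.eye(m)) @ u

print(np.allclose(lhs, rhs))   # True
```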

With the preparation in Lemma 10.1, we are ready to derive the gradient of $J(\theta)$.

Theorem 10.3 (Deterministic policy gradient theorem in the discounted case). In the discounted case where $\gamma \in (0,1)$, the gradient of $J(\theta)$ in (10.15) is

\begin{array}{l} \nabla_{\theta} J(\theta) = \sum_{s \in \mathcal{S}} \rho_{\mu}(s) \nabla_{\theta} \mu(s) \left(\nabla_{a} q_{\mu}(s, a)\right)|_{a = \mu(s)} \\ = \mathbb{E}_{S \sim \rho_{\mu}}\left[\nabla_{\theta} \mu(S) \left(\nabla_{a} q_{\mu}(S, a)\right)|_{a = \mu(S)}\right], \end{array}

where the state distribution $\rho_{\mu}$ is

\rho_{\mu}(s) = \sum_{s^{\prime} \in \mathcal{S}} d_{0}(s^{\prime}) \Pr_{\mu}(s | s^{\prime}), \qquad s \in \mathcal{S}.

Here, $\Pr_{\mu}(s|s') = \sum_{k=0}^{\infty} \gamma^{k}[P_{\mu}^{k}]_{s's} = [(I - \gamma P_{\mu})^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\mu$.
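Continuing the toy two-state example given earlier (arbitrary numbers, for illustration only), $\rho_{\mu}$ can be computed directly from $d_0$ and $(I-\gamma P_{\mu})^{-1}$, since $\rho_{\mu}(s) = \left[d_0^{T}(I-\gamma P_{\mu})^{-1}\right]_{s}$:

```python
import numpy as np

gamma = 0.9
P_mu = np.array([[0.8, 0.2],     # same toy transition matrix as before
                 [0.3, 0.7]])
d0 = np.array([1.0, 0.0])        # first special case: all mass on a single state s0

# rho_mu(s) = sum_{s'} d0(s') Pr_mu(s | s') = [d0^T (I - gamma P_mu)^{-1}]_s
rho_mu = d0 @ np.linalg.inv(np.eye(2) - gamma * P_mu)
print(rho_mu, rho_mu.sum())      # the entries sum to 1 / (1 - gamma)
```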

Box 10.4: Proof of Theorem 10.3

Since $d_0$ is independent of $\mu$, we have

\nabla_{\theta} J(\theta) = \sum_{s \in \mathcal{S}} d_{0}(s) \nabla_{\theta} v_{\mu}(s).

Substituting the expression of $\nabla_{\theta}v_{\mu}(s)$ given by Lemma 10.1 into the above equation yields

\begin{array}{l} \nabla_{\theta} J(\theta) = \sum_{s \in \mathcal{S}} d_{0}(s) \nabla_{\theta} v_{\mu}(s) \\ = \sum_{s \in \mathcal{S}} d_{0}(s) \sum_{s^{\prime} \in \mathcal{S}} \Pr_{\mu}(s^{\prime} | s) \nabla_{\theta} \mu(s^{\prime}) \left(\nabla_{a} q_{\mu}(s^{\prime}, a)\right)|_{a = \mu(s^{\prime})} \\ = \sum_{s^{\prime} \in \mathcal{S}} \left(\sum_{s \in \mathcal{S}} d_{0}(s) \Pr_{\mu}(s^{\prime} \mid s)\right) \nabla_{\theta} \mu(s^{\prime}) \left(\nabla_{a} q_{\mu}(s^{\prime}, a)\right)|_{a = \mu(s^{\prime})} \\ \doteq \sum_{s^{\prime} \in \mathcal{S}} \rho_{\mu}(s^{\prime}) \nabla_{\theta} \mu(s^{\prime}) \left(\nabla_{a} q_{\mu}(s^{\prime}, a)\right)|_{a = \mu(s^{\prime})} \\ = \sum_{s \in \mathcal{S}} \rho_{\mu}(s) \nabla_{\theta} \mu(s) \left(\nabla_{a} q_{\mu}(s, a)\right)|_{a = \mu(s)} \quad (\text{change } s^{\prime} \text{ to } s) \\ = \mathbb{E}_{S \sim \rho_{\mu}}\left[\nabla_{\theta} \mu(S) \left(\nabla_{a} q_{\mu}(S, a)\right)|_{a = \mu(S)}\right]. \end{array}

The proof is complete. The above proof is consistent with the proof of Theorem 1 in [74]. Here, we consider the case in which the states and actions are finite. When they are continuous, the proof is similar, but the summations should be replaced by integrals [74].

Metric 2: Average reward

We next derive the gradient of the average reward:

\begin{array}{l} J(\theta) = \bar{r}_{\mu} = \sum_{s \in \mathcal{S}} d_{\mu}(s) r_{\mu}(s) \\ = \mathbb{E}_{S \sim d_{\mu}}\left[r_{\mu}(S)\right], \tag{10.20} \end{array}

where

r_{\mu}(s) = \mathbb{E}[R \mid s, a = \mu(s)] = \sum_{r} r p(r | s, a = \mu(s))

is the expectation of the immediate rewards. More information about this metric can be found in Section 9.2.
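As a small illustration (with an arbitrary toy chain), $d_{\mu}$ can be obtained as the left eigenvector of $P_{\mu}$ associated with eigenvalue one, after which $\bar{r}_{\mu}$ follows from (10.20):

```python
import numpy as np

# Toy transition matrix P_mu and per-state expected rewards r_mu(s) under mu;
# both are chosen arbitrarily for illustration.
P_mu = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_mu = np.array([1.0, 0.0])

# Stationary distribution: d_mu^T P_mu = d_mu^T, i.e. a left eigenvector of
# P_mu for eigenvalue 1, normalized so that its entries sum to one.
eigvals, eigvecs = np.linalg.eig(P_mu.T)
d_mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_mu = d_mu / d_mu.sum()

r_bar = d_mu @ r_mu               # J(theta) = sum_s d_mu(s) r_mu(s)
print(d_mu, r_bar)                # d_mu = [0.6, 0.4], r_bar = 0.6
```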

The gradient of $J(\theta)$ is given in the following theorem.

Theorem 10.4 (Deterministic policy gradient theorem in the undiscounted case). In the undiscounted case, the gradient of $J(\theta)$ in (10.20) is

\begin{array}{l} \nabla_{\theta} J(\theta) = \sum_{s \in \mathcal{S}} d_{\mu}(s) \nabla_{\theta} \mu(s) \left(\nabla_{a} q_{\mu}(s, a)\right)|_{a = \mu(s)} \\ = \mathbb{E}_{S \sim d_{\mu}}\left[\nabla_{\theta} \mu(S) \left(\nabla_{a} q_{\mu}(S, a)\right)|_{a = \mu(S)}\right], \end{array}

where $d_{\mu}$ is the stationary distribution of the states under policy $\mu$.

Box 10.5: Proof of Theorem 10.4

Since the policy is deterministic, we have

v_{\mu}(s) = q_{\mu}(s, \mu(s)).

Since both $q_{\mu}$ and $\mu$ are functions of $\theta$, we have

\nabla_{\theta} v_{\mu}(s) = \nabla_{\theta} q_{\mu}(s, \mu(s)) = \big(\nabla_{\theta} q_{\mu}(s, a)\big)|_{a = \mu(s)} + \nabla_{\theta} \mu(s) \big(\nabla_{a} q_{\mu}(s, a)\big)|_{a = \mu(s)}. \tag{10.21}

In the undiscounted case, it follows from the definition of action value (Section 9.3.2) that

\begin{array}{l} q_{\mu}(s, a) = \mathbb{E}\left[R_{t+1} - \bar{r}_{\mu} + v_{\mu}(S_{t+1}) \mid s, a\right] \\ = \sum_{r} p(r|s, a)(r - \bar{r}_{\mu}) + \sum_{s^{\prime}} p(s^{\prime}|s, a) v_{\mu}(s^{\prime}) \\ = r(s, a) - \bar{r}_{\mu} + \sum_{s^{\prime}} p(s^{\prime}|s, a) v_{\mu}(s^{\prime}). \end{array}

Since $r(s,a) = \sum_{r} r p(r|s,a)$ is independent of $\theta$, we have

\nabla_{\theta} q_{\mu}(s, a) = 0 - \nabla_{\theta} \bar{r}_{\mu} + \sum_{s^{\prime}} p(s^{\prime}|s, a) \nabla_{\theta} v_{\mu}(s^{\prime}).

Substituting the above equation into (10.21) gives

\nabla_{\theta} v_{\mu}(s) = -\nabla_{\theta} \bar{r}_{\mu} + \sum_{s^{\prime}} p(s^{\prime}|s, \mu(s)) \nabla_{\theta} v_{\mu}(s^{\prime}) + \underbrace{\nabla_{\theta} \mu(s) \left(\nabla_{a} q_{\mu}(s, a)\right)|_{a = \mu(s)}}_{u(s)}, \quad s \in \mathcal{S}.

Since the above equation is valid for all $s \in \mathcal{S}$, we can combine these equations to obtain a matrix-vector form:

\underbrace{\left[\begin{array}{c} \vdots \\ \nabla_{\theta} v_{\mu}(s) \\ \vdots \end{array}\right]}_{\nabla_{\theta} v_{\mu} \in \mathbb{R}^{mn}} = -\mathbf{1}_{n} \otimes \nabla_{\theta} \bar{r}_{\mu} + (P_{\mu} \otimes I_{m}) \underbrace{\left[\begin{array}{c} \vdots \\ \nabla_{\theta} v_{\mu}(s^{\prime}) \\ \vdots \end{array}\right]}_{\nabla_{\theta} v_{\mu} \in \mathbb{R}^{mn}} + \underbrace{\left[\begin{array}{c} \vdots \\ u(s) \\ \vdots \end{array}\right]}_{u \in \mathbb{R}^{mn}},

where $n = |\mathcal{S}|$, $m$ is the dimension of $\theta$, $P_{\mu}$ is the state transition matrix with $[P_{\mu}]_{ss'} = p(s'|s, \mu(s))$, and $\otimes$ is the Kronecker product. The above matrix-vector form can be written concisely as

\nabla_{\theta} v_{\mu} = u - \mathbf{1}_{n} \otimes \nabla_{\theta} \bar{r}_{\mu} + (P_{\mu} \otimes I_{m}) \nabla_{\theta} v_{\mu},

and hence

\mathbf{1}_{n} \otimes \nabla_{\theta} \bar{r}_{\mu} = u + \left(P_{\mu} \otimes I_{m}\right) \nabla_{\theta} v_{\mu} - \nabla_{\theta} v_{\mu}. \tag{10.22}

Since $d_{\mu}$ is the stationary distribution, we have $d_{\mu}^{T}P_{\mu} = d_{\mu}^{T}$. Multiplying both sides of (10.22) by $d_{\mu}^{T}\otimes I_{m}$ gives

\begin{array}{l} \left(d_{\mu}^{T} \mathbf{1}_{n}\right) \otimes \nabla_{\theta} \bar{r}_{\mu} = \left(d_{\mu}^{T} \otimes I_{m}\right) u + \left(\left(d_{\mu}^{T} P_{\mu}\right) \otimes I_{m}\right) \nabla_{\theta} v_{\mu} - \left(d_{\mu}^{T} \otimes I_{m}\right) \nabla_{\theta} v_{\mu} \\ = \left(d_{\mu}^{T} \otimes I_{m}\right) u + \left(d_{\mu}^{T} \otimes I_{m}\right) \nabla_{\theta} v_{\mu} - \left(d_{\mu}^{T} \otimes I_{m}\right) \nabla_{\theta} v_{\mu} \\ = \left(d_{\mu}^{T} \otimes I_{m}\right) u. \end{array}

Since $d_{\mu}^{T}\mathbf{1}_{n} = 1$, the above equation becomes

\begin{array}{l} \nabla_{\theta} \bar{r}_{\mu} = \left(d_{\mu}^{T} \otimes I_{m}\right) u \\ = \sum_{s \in \mathcal{S}} d_{\mu}(s) u(s) \\ = \sum_{s \in \mathcal{S}} d_{\mu}(s) \nabla_{\theta} \mu(s) \left(\nabla_{a} q_{\mu}(s, a)\right)|_{a = \mu(s)} \\ = \mathbb{E}_{S \sim d_{\mu}}\left[\nabla_{\theta} \mu(S) \left(\nabla_{a} q_{\mu}(S, a)\right)|_{a = \mu(S)}\right]. \end{array}

The proof is complete.

10.4.2 Algorithm description

Based on the gradient given in Theorem 10.2, we can apply the gradient-ascent algorithm to maximize $J(\theta)$:

\theta_{t+1} = \theta_{t} + \alpha_{\theta} \mathbb{E}_{S \sim \eta}\left[\nabla_{\theta} \mu(S) \big(\nabla_{a} q_{\mu}(S, a)\big)|_{a = \mu(S)}\right].

The corresponding stochastic gradient-ascent algorithm is

\theta_{t+1} = \theta_{t} + \alpha_{\theta} \nabla_{\theta} \mu(s_{t}) \big(\nabla_{a} q_{\mu}(s_{t}, a)\big)|_{a = \mu(s_{t})}.

The implementation is summarized in Algorithm 10.4. It should be noted that this algorithm is off-policy since the behavior policy $\beta$ may be different from $\mu$. First, the actor is off-policy. We already explained the reason when presenting Theorem 10.2. Second, the critic is also off-policy. Special attention must be paid to why the critic is off-policy yet does not require the importance sampling technique. In particular, the experience sample required by the critic is $(s_t, a_t, r_{t+1}, s_{t+1}, \tilde{a}_{t+1})$, where $\tilde{a}_{t+1} = \mu(s_{t+1})$. The generation of this experience sample involves two policies. The first is the policy that generates $a_t$ at $s_t$, and the second is the policy that generates $\tilde{a}_{t+1}$ at $s_{t+1}$. The first policy is the behavior policy since $a_t$ is used to interact with the environment. The second policy must be $\mu$ because it is the policy that the critic aims to evaluate; hence, $\mu$ is the target policy. It should be noted that $\tilde{a}_{t+1}$ is not used to interact with the environment in the next time step, so $\mu$ is not the behavior policy. Therefore, the critic is off-policy.

How to select the function $q(s, a, w)$? The original research work [74] that proposed the deterministic policy gradient method adopted linear functions: $q(s, a, w) = \phi^T(s, a)w$, where $\phi(s, a)$ is the feature vector. It is currently popular to represent $q(s, a, w)$ using neural networks, as suggested in the deep deterministic policy gradient (DDPG) method [75].
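A minimal sketch of the linear parameterization is shown below; the feature map is a hypothetical choice for illustration, not the specific features used in [74].

```python
import numpy as np

def phi(s: np.ndarray, a: np.ndarray) -> np.ndarray:
    """A hypothetical feature vector phi(s, a); any choice that is
    differentiable in a would do. Here: [s, a, vec(s a^T), a*a]."""
    return np.concatenate([s, a, np.outer(s, a).ravel(), a * a])

def q_linear(s: np.ndarray, a: np.ndarray, w: np.ndarray) -> float:
    """Linear critic q(s, a, w) = phi(s, a)^T w."""
    return float(phi(s, a) @ w)

def grad_w_q(s: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Gradient of q with respect to w, needed by the critic update: phi(s, a)."""
    return phi(s, a)
```

The actor update additionally needs $\nabla_{a}q(s,a,w)$, which for such a linear critic reduces to differentiating the feature map with respect to $a$.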

Algorithm 10.4: Deterministic policy gradient or deterministic actor-critic

Initialization: A given behavior policy $\beta(a|s)$. A deterministic target policy $\mu(s,\theta_0)$, where $\theta_0$ is the initial parameter. A value function $q(s,a,w_0)$, where $w_0$ is the initial parameter. $\alpha_w, \alpha_\theta > 0$.

Goal: Learn an optimal policy to maximize $J(\theta)$.

At time step $t$ in each episode, do

Generate $a_t$ following $\beta$ and then observe $r_{t+1}, s_{t+1}$.

TD error:

\delta_{t} = r_{t+1} + \gamma q(s_{t+1}, \mu(s_{t+1}, \theta_{t}), w_{t}) - q(s_{t}, a_{t}, w_{t})

Actor (policy update):

\theta_{t+1} = \theta_{t} + \alpha_{\theta} \nabla_{\theta} \mu(s_{t}, \theta_{t}) \left(\nabla_{a} q(s_{t}, a, w_{t})\right)|_{a = \mu(s_{t})}

Critic (value update):

w_{t+1} = w_{t} + \alpha_{w} \delta_{t} \nabla_{w} q(s_{t}, a_{t}, w_{t})

How to select the behavior policy $\beta$? It can be any exploratory policy. It can also be a stochastic policy obtained by adding noise to $\mu$ [75]. In the latter case, $\mu$ (with noise added) also serves as the behavior policy, and the resulting implementation is on-policy.
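To tie the pieces together, the following is a compact sketch of Algorithm 10.4 in PyTorch for a continuous-action problem. The environment interface (`reset()`/`step()`), the network sizes, the step sizes, and the Gaussian exploration noise (i.e., the noise-perturbed $\mu$ serving as the behavior policy, as discussed above) are all illustrative assumptions, not prescriptions from the text.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s, theta)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action value q(s, a, w)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def deterministic_actor_critic(env, state_dim, action_dim, gamma=0.9,
                               alpha_theta=1e-3, alpha_w=1e-3,
                               episodes=100, noise_std=0.1):
    mu = Actor(state_dim, action_dim)
    q = Critic(state_dim, action_dim)
    opt_theta = torch.optim.SGD(mu.parameters(), lr=alpha_theta)
    opt_w = torch.optim.SGD(q.parameters(), lr=alpha_w)

    for _ in range(episodes):
        s = torch.as_tensor(env.reset(), dtype=torch.float32)
        done = False
        while not done:
            # Behavior policy beta: mu plus Gaussian exploration noise
            with torch.no_grad():
                a = mu(s) + noise_std * torch.randn(action_dim)
            s_next, r, done = env.step(a.numpy())   # assumed environment interface
            s_next = torch.as_tensor(s_next, dtype=torch.float32)

            # TD error: delta_t = r + gamma * q(s', mu(s')) - q(s, a)
            with torch.no_grad():
                target = r + gamma * q(s_next, mu(s_next)) * (1.0 - float(done))
            delta = target - q(s, a)

            # Critic (value update): w <- w + alpha_w * delta * grad_w q(s, a, w)
            critic_loss = 0.5 * delta.pow(2).mean()
            opt_w.zero_grad(); critic_loss.backward(); opt_w.step()

            # Actor (policy update): gradient ascent on q(s, mu(s, theta))
            actor_loss = -q(s, mu(s)).mean()
            opt_theta.zero_grad(); actor_loss.backward(); opt_theta.step()

            s = s_next
    return mu, q
```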