Up to now, the policies used in policy gradient methods have all been stochastic, since π(a∣s,θ)>0 is required for every (s,a). This section shows that deterministic policies can also be used in policy gradient methods. Here, "deterministic" indicates that, for any state, a single action is given a probability of one and all the other actions are given probabilities of zero. It is important to study the deterministic case since it is naturally off-policy and can effectively handle continuous action spaces.
We have been using π(a∣s,θ) to denote a general policy, which can be either stochastic or deterministic. In this section, we use
a=μ(s,θ)
to specifically denote a deterministic policy. Different from π which gives the probability of an action, μ directly gives the action since it is a mapping from S to A . This deterministic policy can be represented by, for example, a neural network with s as its input, a as its output, and θ as its parameter. For the sake of simplicity, we often write μ(s,θ) as μ(s) for short.
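For concreteness, here is a minimal sketch (assuming PyTorch, a continuous action space normalized to [-1, 1], and arbitrary layer sizes) of how such a deterministic policy network might look; the class name and architecture are illustrative, not prescribed by the text.

```python
import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    """A deterministic policy a = mu(s, theta): maps a state vector to an action vector."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),  # squash the output into [-1, 1]; rescale outside if needed
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # The output is the action itself, not a probability distribution over actions.
        return self.net(s)

# Usage: mu = DeterministicPolicy(state_dim=3, action_dim=1); a = mu(torch.randn(1, 3))
```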
10.4.1 The deterministic policy gradient theorem
The policy gradient theorem introduced in the last chapter is only valid for stochastic policies. When we require the policy to be deterministic, a new policy gradient theorem must be derived.
Theorem 10.2 (Deterministic policy gradient theorem). The gradient of J(θ) is

$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \eta(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S\sim\eta}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big], \tag{10.14}$$

where η is a distribution of the states.
Theorem 10.2 is a summary of the results presented in Theorem 10.3 and Theorem 10.4 since the gradients in the two theorems have similar expressions. The specific expressions of J(θ) and η can be found in Theorems 10.3 and 10.4.
Unlike the stochastic case, the gradient in the deterministic case shown in (10.14) does not involve the action random variable A . As a result, when we use samples to approximate the true gradient, it is not required to sample actions. Therefore, the deterministic policy gradient method is off-policy. In addition, some readers may wonder why (∇aqμ(S,a))∣a=μ(S) cannot be written as ∇aqμ(S,μ(S)) , which seems more concise. That is simply because, if we do that, it is unclear how qμ(S,μ(S)) is a function of a . A concise yet less confusing expression may be ∇aqμ(S,a=μ(S)) .
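To make the notation concrete, the following sketch (assuming PyTorch, a made-up quadratic critic standing in for qμ, and a one-dimensional linear policy a=θs chosen only for illustration) evaluates (∇aq(s,a))∣a=μ(s) with automatic differentiation and checks the chain rule ∇θq(s,μ(s,θ)) = ∇θμ(s,θ)(∇aq(s,a))∣a=μ(s).

```python
import torch

# A made-up differentiable critic standing in for q_mu(s, a), chosen only for illustration.
def q(s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    return -(a - 0.5 * s).pow(2).sum()

theta = torch.tensor([0.3], requires_grad=True)  # policy parameter
s = torch.tensor([1.0])                          # a fixed state

# Deterministic policy a = mu(s, theta); a simple linear form used only for this example.
a = theta * s

# (grad_a q(s, a)) evaluated at a = mu(s): differentiate q with respect to the action variable.
a_var = a.detach().requires_grad_(True)
grad_a_q = torch.autograd.grad(q(s, a_var), a_var)[0]

# grad_theta mu(s, theta): here mu is linear in theta, so this is just s.
grad_theta_mu = torch.autograd.grad(a, theta, grad_outputs=torch.ones_like(a))[0]

# Chain rule: grad_theta q(s, mu(s, theta)) = grad_theta mu(s, theta) * (grad_a q(s, a))|_{a=mu(s)}.
chain_rule = grad_theta_mu * grad_a_q

# The same gradient obtained by differentiating the composite q(s, mu(s, theta)) directly.
direct = torch.autograd.grad(q(s, theta * s), theta)[0]
print(chain_rule.item(), direct.item())  # the two values agree
```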
In the rest of this subsection, we present the derivation details of Theorem 10.2. In particular, we derive the gradients of two common metrics: the first is the average value and the second is the average reward. Since these two metrics have been discussed in detail in Section 9.2, we sometimes use their properties without proof. For most readers, it is sufficient to be familiar with Theorem 10.2 without knowing its derivation details. Interested readers can selectively examine the details in the remainder of this section.
Metric 1: Average value
We first derive the gradient of the average value:
$$J(\theta) = \mathbb{E}\big[v_\mu(S)\big] = \sum_{s\in\mathcal{S}} d_0(s)\,v_\mu(s), \tag{10.15}$$
where d0 is a probability distribution over the states. Here, d0 is selected to be independent of μ for simplicity. There are two special yet important ways to select d0. The first is d0(s0)=1 and d0(s)=0 for all s≠s0, where s0 is a specific state of interest. In this case, the policy aims to maximize the discounted return obtained when starting from s0. The second is to let d0 be the state distribution of a given behavior policy that is different from the target policy.
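As a small numerical illustration (with made-up state values and distributions for a two-state example), the metric in (10.15) is just a d0-weighted average of the state values:

```python
import numpy as np

v_mu = np.array([4.0, 2.0])       # hypothetical state values under mu for S = {s0, s1}
d0_case1 = np.array([1.0, 0.0])   # all probability mass on the state of interest s0
d0_case2 = np.array([0.6, 0.4])   # e.g., the state distribution of a behavior policy

print(d0_case1 @ v_mu)  # J(theta) = v_mu(s0) = 4.0
print(d0_case2 @ v_mu)  # J(theta) = 0.6*4 + 0.4*2 = 3.2
```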
To calculate the gradient of J(θ) , we need to first calculate the gradient of vμ(s) for any s∈S . Consider the discounted case where γ∈(0,1) .
Lemma 10.1 (Gradient of vμ(s)). In the discounted case, it holds for any s∈S that

$$\nabla_\theta v_\mu(s) = \sum_{s'\in\mathcal{S}} \text{Pr}_\mu(s'|s)\,\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}, \tag{10.16}$$

where

$$\text{Pr}_\mu(s'|s) \doteq \sum_{k=0}^{\infty}\gamma^k\big[P_\mu^k\big]_{ss'} = \big[(I-\gamma P_\mu)^{-1}\big]_{ss'}$$

is the discounted total probability of transitioning from s to s′ under policy μ. Here, [⋅]ss′ denotes the entry in the sth row and s′th column of a matrix.
Box 10.3: Proof of Lemma 10.1

Since vμ(s)=qμ(s,μ(s)) and qμ(s,a)=r(s,a)+γ∑s′∈S p(s′∣s,a)vμ(s′), taking the gradient of vμ(s) with respect to θ and applying the chain rule gives, for every s∈S,

$$\nabla_\theta v_\mu(s) = \nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} + \gamma\sum_{s'\in\mathcal{S}} p(s'|s,\mu(s))\,\nabla_\theta v_\mu(s').$$

Stacking this equation over all states yields a matrix-vector form, where ∇θvμ∈Rmn stacks ∇θvμ(s) over all s∈S, u∈Rmn stacks ∇θμ(s)(∇aqμ(s,a))∣a=μ(s) over all s∈S, n=∣S∣, m is the dimension of θ, Pμ is the state transition matrix with [Pμ]ss′=p(s′∣s,μ(s)), and ⊗ is the Kronecker product. This matrix-vector form can be written concisely as

$$\nabla_\theta v_\mu = u + \gamma(P_\mu\otimes I_m)\nabla_\theta v_\mu,$$

which is a linear equation of ∇θvμ. Then, ∇θvμ can be solved as

$$\nabla_\theta v_\mu = (I_{mn} - \gamma P_\mu\otimes I_m)^{-1}u = \big((I_n - \gamma P_\mu)^{-1}\otimes I_m\big)u. \tag{10.19}$$

Note that [Pμk]ss′ is the probability of transitioning from s to s′ using exactly k steps (see Box 8.1 for more information). Therefore, [(I−γPμ)−1]ss′ = ∑k=0∞ γk[Pμk]ss′ is the discounted total probability of transitioning from s to s′ using any number of steps. By denoting [(I−γPμ)−1]ss′ ≐ Prμ(s′∣s), equation (10.19) leads to (10.16).
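The identity (I−γPμ)−1 = ∑k=0∞ γkPμk used above is easy to check numerically; a short sketch with a made-up two-state transition matrix:

```python
import numpy as np

# A small made-up state transition matrix P_mu (rows sum to 1) and a discount factor.
P_mu = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
gamma = 0.9

# Discounted total transition probabilities: (I - gamma * P_mu)^{-1}.
closed_form = np.linalg.inv(np.eye(2) - gamma * P_mu)

# The same quantity as a truncated series sum_{k=0}^{K} gamma^k * P_mu^k.
series = sum(gamma**k * np.linalg.matrix_power(P_mu, k) for k in range(1000))

print(np.allclose(closed_form, series))  # True: the two computations agree
```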
With the preparation in Lemma 10.1, we are ready to derive the gradient of J(θ) .
Theorem 10.3 (Deterministic policy gradient theorem in the discounted case). In the discounted case where γ∈(0,1), the gradient of J(θ) in (10.15) is

$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} \rho_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S\sim\rho_\mu}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big],$$

where ρμ(s) ≐ ∑s′∈S d0(s′)Prμ(s∣s′) is a state distribution, and Prμ(s∣s′) = ∑k=0∞ γk[Pμk]s′s = [(I−γPμ)−1]s′s is the discounted total probability of transitioning from s′ to s under policy μ.
Box 10.4: Proof of Theorem 10.3
Since d0 is independent of μ , we have
$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} d_0(s)\,\nabla_\theta v_\mu(s).$$
Substituting the expression of ∇θvμ(s) given by Lemma 10.1 into the above equation yields
$$\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{s\in\mathcal{S}} d_0(s)\,\nabla_\theta v_\mu(s)\\
&= \sum_{s\in\mathcal{S}} d_0(s)\sum_{s'\in\mathcal{S}} \text{Pr}_\mu(s'|s)\,\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}\\
&= \sum_{s'\in\mathcal{S}}\Big(\sum_{s\in\mathcal{S}} d_0(s)\,\text{Pr}_\mu(s'|s)\Big)\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}\\
&\doteq \sum_{s'\in\mathcal{S}} \rho_\mu(s')\,\nabla_\theta\mu(s')\big(\nabla_a q_\mu(s',a)\big)\big|_{a=\mu(s')}\\
&= \sum_{s\in\mathcal{S}} \rho_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} \qquad (\text{change } s' \text{ to } s)\\
&= \mathbb{E}_{S\sim\rho_\mu}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].
\end{aligned}$$
The proof is complete. The above proof is consistent with the proof of Theorem 1 in [74]. Here, we consider the case in which the states and actions are finite. When they are continuous, the proof is similar, but the summations should be replaced by integrals [74].
Metric 2: Average reward
We next derive the gradient of the average reward:

$$J(\theta) = \bar{r}_\mu \doteq \sum_{s\in\mathcal{S}} d_\mu(s)\,r_\mu(s) = \mathbb{E}_{S\sim d_\mu}\big[r_\mu(S)\big],$$

where dμ is the stationary distribution of the states under μ and rμ(s) ≐ r(s,μ(s)) is the expected immediate reward obtained at s when the action μ(s) is taken. This metric was discussed in detail in Section 9.2.

Theorem 10.4 (Deterministic policy gradient theorem in the undiscounted case). In the undiscounted case, the gradient of J(θ)=rˉμ is

$$\nabla_\theta J(\theta) = \sum_{s\in\mathcal{S}} d_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S\sim d_\mu}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].$$

To prove this result, recall that in the undiscounted case the action value satisfies qμ(s,a)=r(s,a)−rˉμ+∑s′∈S p(s′∣s,a)vμ(s′) and vμ(s)=qμ(s,μ(s)). Taking the gradient of vμ(s) with respect to θ gives, for every s∈S,

$$\nabla_\theta v_\mu(s) = \nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} - \nabla_\theta\bar{r}_\mu + \sum_{s'\in\mathcal{S}} p(s'|s,\mu(s))\,\nabla_\theta v_\mu(s').$$

Stacking this equation over all states yields a matrix-vector form, where n=∣S∣, m is the dimension of θ, Pμ is the state transition matrix with [Pμ]ss′=p(s′∣s,μ(s)), ∇θvμ∈Rmn and u∈Rmn stack ∇θvμ(s) and ∇θμ(s)(∇aqμ(s,a))∣a=μ(s) over all s∈S, respectively, 1n is the all-one vector, and ⊗ is the Kronecker product. This matrix-vector form can be written concisely as
$$\nabla_\theta v_\mu = u - 1_n\otimes\nabla_\theta\bar{r}_\mu + (P_\mu\otimes I_m)\nabla_\theta v_\mu,$$
and hence
$$1_n\otimes\nabla_\theta\bar{r}_\mu = u + (P_\mu\otimes I_m)\nabla_\theta v_\mu - \nabla_\theta v_\mu. \tag{10.22}$$
Since dμ is the stationary distribution, we have dμTPμ=dμT and dμT1n=1. Multiplying both sides of (10.22) by dμT⊗Im gives

$$\nabla_\theta\bar{r}_\mu = (d_\mu^T\otimes I_m)u + \big((d_\mu^T P_\mu - d_\mu^T)\otimes I_m\big)\nabla_\theta v_\mu = (d_\mu^T\otimes I_m)u = \sum_{s\in\mathcal{S}} d_\mu(s)\,\nabla_\theta\mu(s)\big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)},$$

which is exactly the gradient stated in Theorem 10.4.
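The cancellation above relies on the mixed-product property of the Kronecker product, (A⊗Im)(B⊗Im)=(AB)⊗Im, together with dμT1n=1. A short numerical check with made-up matrices:

```python
import numpy as np

n, m = 3, 2
rng = np.random.default_rng(0)

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # a random stochastic matrix
d = rng.random(n); d /= d.sum()                            # a made-up distribution (sums to 1)
g = rng.random(m)                                          # plays the role of grad_theta r_bar
I_m = np.eye(m)

# (d^T kron I_m)(P kron I_m) = (d^T P) kron I_m  (mixed-product property)
lhs = np.kron(d, I_m) @ np.kron(P, I_m)
rhs = np.kron(d @ P, I_m)
print(np.allclose(lhs, rhs))  # True

# (d^T kron I_m)(1_n kron g) = (d^T 1_n) g = g, since the entries of d sum to one
print(np.allclose(np.kron(d, I_m) @ np.kron(np.ones(n), g), g))  # True
```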
Based on the gradient given in Theorem 10.2, we can apply the gradient-ascent algorithm to maximize J(θ) :
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\mathbb{E}_{S\sim\eta}\Big[\nabla_\theta\mu(S)\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\Big].$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\nabla_\theta\mu(s_t)\big(\nabla_a q_\mu(s_t,a)\big)\big|_{a=\mu(s_t)}.$$
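In practice, qμ is replaced by an approximator q(s,a,w) (as in Algorithm 10.4 below), and the stochastic update is usually implemented by ascending q(s,μ(s,θ),w) through θ, letting automatic differentiation apply the chain rule. A minimal sketch, assuming PyTorch and hypothetical `actor` and `critic` modules (the names, the optimizer, and the two-argument critic interface are illustrative):

```python
import torch

def actor_update(actor, critic, states, actor_optimizer):
    """One deterministic policy gradient step: ascend q(s, mu(s, theta), w) with respect to theta."""
    actions = actor(states)                       # a = mu(s, theta), differentiable in theta
    actor_loss = -critic(states, actions).mean()  # minimizing -q is ascending q
    actor_optimizer.zero_grad()
    actor_loss.backward()                         # autograd applies the chain rule:
                                                  # grad_theta mu(s) * (grad_a q(s, a))|_{a = mu(s)}
    actor_optimizer.step()
    return actor_loss.item()
```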
The implementation is summarized in Algorithm 10.4. It should be noted that this algorithm is off-policy since the behavior policy β may differ from μ. First, the actor is off-policy; the reason was already explained when presenting Theorem 10.2. Second, the critic is also off-policy. It is worth examining why the critic is off-policy and yet does not require the importance sampling technique. In particular, the experience sample required by the critic is (st,at,rt+1,st+1,a~t+1), where a~t+1=μ(st+1). The generation of this experience sample involves two policies: one generates at at st, and the other generates a~t+1 at st+1. The first policy is the behavior policy since at is used to interact with the environment. The second policy must be μ because it is the policy that the critic aims to evaluate; hence, μ is the target policy. Since a~t+1 is not used to interact with the environment at the next time step, μ is not the behavior policy. Therefore, the critic is off-policy.
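A corresponding sketch of the critic update implied by this discussion, again assuming hypothetical `actor` and `critic` modules; note that the bootstrap action is recomputed as μ(s_{t+1}) instead of being taken from the behavior policy, which is exactly why no importance-sampling ratio is needed:

```python
import torch

def critic_update(actor, critic, batch, critic_optimizer, gamma=0.99):
    """TD(0) update of q(s, a, w) toward r + gamma * q(s', mu(s'), w)."""
    s, a, r, s_next = batch                # (s_t, a_t, r_{t+1}, s_{t+1}) collected by the behavior policy
    with torch.no_grad():
        a_next = actor(s_next)             # \tilde{a}_{t+1} = mu(s_{t+1}): the target-policy action
        td_target = r + gamma * critic(s_next, a_next)
    td_error = td_target - critic(s, a)
    critic_loss = td_error.pow(2).mean()
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    return critic_loss.item()
```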
How to select the function q(s,a,w) ? The original research work [74] that proposed the deterministic policy gradient method adopted linear functions: q(s,a,w)=ϕT(s,a)w where ϕ(s,a) is the feature vector. It is currently popular to represent q(s,a,w) using neural networks, as suggested in the deep deterministic policy gradient (DDPG) method [75].
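For instance, with a hand-crafted feature vector such as ϕ(s,a)=[s, a, sa, a²]ᵀ (a made-up choice for scalar s and a, used only for illustration), both ∇wq=ϕ(s,a) and ∇aq have simple closed forms:

```python
import numpy as np

def phi(s: float, a: float) -> np.ndarray:
    """A made-up feature vector for scalar state and action."""
    return np.array([s, a, s * a, a * a])

def q(s: float, a: float, w: np.ndarray) -> float:
    return phi(s, a) @ w            # linear critic: q(s, a, w) = phi(s, a)^T w

def grad_a_q(s: float, a: float, w: np.ndarray) -> float:
    # d/da [s*w0 + a*w1 + s*a*w2 + a^2*w3] = w1 + s*w2 + 2*a*w3
    return w[1] + s * w[2] + 2 * a * w[3]

w = np.array([0.1, 0.5, -0.2, 0.3])
print(q(1.0, 2.0, w), grad_a_q(1.0, 2.0, w))  # 1.9 and 1.5
```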
Algorithm 10.4: Deterministic policy gradient or deterministic actor-critic
Initialization: A given behavior policy β(a∣s); a deterministic target policy μ(s,θ0), where θ0 is the initial parameter; a value function q(s,a,w0), where w0 is the initial parameter; learning rates αw,αθ>0.
Goal: Learn an optimal policy to maximize J(θ) .
At time step t in each episode, do
Generate a_t following β at s_t, and then observe r_{t+1}, s_{t+1}.
Value update (critic): w_{t+1} = w_t + α_w [r_{t+1} + γ q(s_{t+1}, μ(s_{t+1}, θ_t), w_t) − q(s_t, a_t, w_t)] ∇_w q(s_t, a_t, w_t).
Policy update (actor): θ_{t+1} = θ_t + α_θ ∇_θ μ(s_t, θ_t)(∇_a q(s_t, a, w_{t+1}))∣_{a=μ(s_t)}.
How should the behavior policy β be selected? It can be any exploratory policy. It can also be a stochastic policy obtained by adding noise to μ [75]. In the latter case, the noisy version of μ serves as the behavior policy, and the implementation is on-policy.
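Putting the pieces together, below is a compact, self-contained sketch of the deterministic actor-critic loop on a toy one-step problem (reward −(a−2s)², invented purely for illustration), with exploration provided by Gaussian noise added to μ as just described; the network sizes, learning rates, iteration count, and noise scale are arbitrary choices, and replay buffers or target networks as used in DDPG [75] are omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy environment (made up): state s ~ U(-1, 1), reward r = -(a - 2s)^2, next state drawn afresh.
def env_step(s, a):
    r = -(a - 2.0 * s).pow(2)
    s_next = torch.rand_like(s) * 2 - 1
    return r, s_next

actor = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # a = mu(s, theta)
critic = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # q(s, a, w), input [s, a]
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, noise_std = 0.9, 0.2

s = torch.rand(64, 1) * 2 - 1
for t in range(2000):
    # Behavior policy beta: the deterministic policy mu plus exploration noise.
    with torch.no_grad():
        a = actor(s) + noise_std * torch.randn_like(s)
    r, s_next = env_step(s, a)

    # Critic (value) update: the TD target bootstraps with the target-policy action mu(s_{t+1}).
    with torch.no_grad():
        td_target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1))
    critic_loss = (td_target - critic(torch.cat([s, a], dim=1))).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor (policy) update: ascend q(s, mu(s, theta), w) with respect to theta.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    s = s_next

# After training, mu(s) should move toward the reward-maximizing action 2s (about 1.0 here).
print(actor(torch.tensor([[0.5]])))
```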