8.4 Deep Q-learning

We can integrate deep neural networks into Q-learning to obtain an approach called deep Q-learning or deep Q-network (DQN) [22, 60, 61]. Deep Q-learning is one of the earliest and most successful deep reinforcement learning algorithms. Notably, the neural networks do not have to be deep. For simple tasks such as our grid world examples, shallow networks with one or two hidden layers may be sufficient.

Deep Q-learning can be viewed as an extension of the algorithm in (8.36). However, its mathematical formulation and implementation techniques are substantially different and deserve special attention.

8.4.1 Algorithm description

Mathematically, deep Q-learning aims to minimize the following objective function:

$$J = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\right)^2\right], \tag{8.37}$$

where $(S, A, R, S')$ are random variables that denote a state, an action, the immediate reward, and the next state, respectively. This objective function can be viewed as the squared Bellman optimality error. That is because

$$q(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a\right], \quad \text{for all } s, a$$

is the Bellman optimality equation (the proof is given in Box 7.5). Therefore, $R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)$ should equal zero in the expectation sense when $\hat{q}(S, A, w)$ can accurately approximate the optimal action values.

To minimize the objective function in (8.37), we can use the gradient descent algorithm. To that end, we need to calculate the gradient of $J$ with respect to $w$. It is noted that the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in $y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)$. As a result, it is nontrivial to calculate the gradient. For the sake of simplicity, it is assumed that the value of $w$ in $y$ is fixed (for a short period of time) so that the calculation of the gradient becomes much easier. In particular, we introduce two networks: one is a main network representing $\hat{q}(s, a, w)$ and the other is a target network $\hat{q}(s, a, w_T)$. The objective function in this case becomes

$$J = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right)^2\right],$$

where $w_T$ is the target network's parameter. When $w_T$ is fixed, the gradient of $J$ is

$$\nabla_w J = -\mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right) \nabla_w \hat{q}(S, A, w)\right], \tag{8.38}$$

where some constant coefficients are omitted without loss of generality.
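
To make (8.38) concrete, the following sketch applies the corresponding stochastic gradient update to a single sample, assuming a linear approximator $\hat{q}(s, a, w) = \phi(s, a)^\top w$, for which $\nabla_w \hat{q}(s, a, w) = \phi(s, a)$. The feature map `phi`, the action set `actions`, and the step size `alpha` are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# Sketch of one stochastic-gradient step based on (8.38) for a single sample
# (s, a, r, s_next), assuming q_hat(s, a, w) = phi(s, a)^T w so that
# grad_w q_hat(s, a, w) = phi(s, a).

def sgd_update(w, w_target, phi, s, a, r, s_next, actions, gamma=0.9, alpha=0.01):
    """Return the parameter after one step along the negative gradient of J."""
    td_target = r + gamma * max(phi(s_next, b) @ w_target for b in actions)
    td_error = td_target - phi(s, a) @ w        # R + gamma * max_a q_T - q
    return w + alpha * td_error * phi(s, a)     # w <- w - alpha * grad_w J

# Illustrative usage: tabular one-hot features over 3 states and 2 actions.
if __name__ == "__main__":
    n_s, n_a = 3, 2
    phi = lambda s, a: np.eye(n_s * n_a)[s * n_a + a]
    w = w_target = np.zeros(n_s * n_a)
    w = sgd_update(w, w_target, phi, s=0, a=1, r=1.0, s_next=2, actions=range(n_a))
```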

To use the gradient in (8.38) to minimize the objective function, we need to pay attention to the following techniques.

$\diamond$ The first technique is to use two networks, a main network and a target network, as mentioned when we calculate the gradient in (8.38). The implementation details are explained below. Let $w$ and $w_T$ denote the parameters of the main and target networks, respectively. They are initially set to the same value.

In every iteration, we draw a mini-batch of samples $\{(s, a, r, s')\}$ from the replay buffer (the replay buffer will be explained soon). The inputs of the main network are $s$ and $a$. The output $y = \hat{q}(s, a, w)$ is the estimated q-value. The target value of the output is $y_T \doteq r + \gamma \max_{a \in \mathcal{A}(s')} \hat{q}(s', a, w_T)$. The main network is updated to minimize the TD error (also called the loss function) $\sum (y - y_T)^2$ over the samples $\{(s, a, y_T)\}$.

Updating $w$ in the main network does not explicitly use the gradient in (8.38). Instead, it relies on existing software tools for training neural networks. As a result, we use a mini-batch of samples to train the network rather than updating the main network with a single sample based on (8.38). This is one notable difference between deep and nondeep reinforcement learning algorithms.
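
As a sketch of such a tool-based update, the code below performs one mini-batch update with PyTorch, assuming a network that maps a state to one q-value per action, an optimizer supplied by the caller, and a mini-batch given as tensors; these interface choices and hyperparameters are assumptions, not prescribed by the text.

```python
import torch
import torch.nn as nn

# One mini-batch update of the main network using autograd and an optimizer
# instead of hand-coding the gradient in (8.38).

def update_main_network(main_net, target_net, optimizer, batch, gamma=0.9):
    """batch: tensors s [N, ds], a [N] (long), r [N], s_next [N, ds]."""
    s, a, r, s_next = batch
    # y = q_hat(s, a, w): the network outputs one value per action.
    y = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # y_T = r + gamma * max_a q_hat(s', a, w_T), computed without gradients.
    with torch.no_grad():
        y_target = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(y, y_target)  # (y - y_T)^2 averaged over batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```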

The main network is updated in every iteration. By contrast, the target network is set to be the same as the main network every certain number of iterations to satisfy the assumption that $w_T$ is fixed when calculating the gradient in (8.38).
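
A minimal sketch of this periodic synchronization, assuming PyTorch networks; the period $C$ and its default value here are illustrative assumptions.

```python
import torch.nn as nn

# Copy the main network's parameters into the target network (w_T <- w)
# once every C iterations.

def maybe_sync_target(main_net: nn.Module, target_net: nn.Module,
                      iteration: int, C: int = 100) -> None:
    """Synchronize the target network with the main network every C iterations."""
    if iteration % C == 0:
        target_net.load_state_dict(main_net.state_dict())
```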

$\diamond$ The second technique is experience replay [22, 60, 62]. That is, after we have collected some experience samples, we do not use them in the order they were collected. Instead, we store them in a dataset called the replay buffer. In particular, let $(s, a, r, s')$ be an experience sample and $\mathcal{B} \doteq \{(s, a, r, s')\}$ be the replay buffer. Every time we update the main network, we draw a mini-batch of experience samples from the replay buffer. The draw of samples, called experience replay, should follow a uniform distribution.
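
A minimal replay buffer sketch in Python, assuming transitions are stored as tuples $(s, a, r, s')$; the capacity and batch size below are illustrative choices.

```python
import random
from collections import deque

# A replay buffer that stores transitions (s, a, r, s') and draws mini-batches
# uniformly at random.

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest samples are discarded when full

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=100):
        # Uniform sampling breaks the temporal correlation of the collected sequence.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```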

Why is experience replay necessary in deep Q-learning, and why must the replay follow a uniform distribution? The answer lies in the objective function in (8.37). In particular, to well define the objective function, we must specify the probability distributions for $S, A, R, S'$. The distributions of $R$ and $S'$ are determined by the system model once $(S, A)$ is given. The simplest way to describe the distribution of the state-action pair $(S, A)$ is to assume it to be uniformly distributed.

However, the state-action samples may not be uniformly distributed in practice since they are generated as a sample sequence according to the behavior policy. To satisfy the assumption of a uniform distribution, it is necessary to break the correlation between the samples in the sequence. To do this, we can use the experience replay technique, which uniformly draws samples from the replay buffer. This is the mathematical reason why experience replay is necessary and why it must follow a uniform distribution. A benefit of random sampling is that each experience sample may be used multiple times, which can increase the data efficiency. This is especially important when we have a limited amount of data.

Algorithm 8.3: Deep Q-learning (off-policy version)

Initialization: A main network and a target network with the same initial parameter.
Goal: Learn an optimal target network to approximate the optimal action values from the experience samples generated by a given behavior policy $\pi_b$.

Store the experience samples generated by $\pi_b$ in a replay buffer $\mathcal{B} = \{(s, a, r, s')\}$
For each iteration, do
    Uniformly draw a mini-batch of samples from $\mathcal{B}$
    For each sample $(s, a, r, s')$, calculate the target value as $y_T = r + \gamma \max_{a \in \mathcal{A}(s')} \hat{q}(s', a, w_T)$, where $w_T$ is the parameter of the target network
    Update the main network to minimize $(y_T - \hat{q}(s, a, w))^2$ using the mini-batch of samples
    Set $w_T = w$ every $C$ iterations

The implementation procedure of deep Q-learning is summarized in Algorithm 8.3. This implementation is off-policy. It can also be adapted to become on-policy if needed.
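
The following is a minimal Python sketch of Algorithm 8.3, assuming states are feature vectors, actions are discrete, the network maps a state to one q-value per action (a common implementation choice; the algorithm description above speaks of a network with inputs $s$ and $a$), and the replay buffer has already been filled by a behavior policy. The dimensions and hyperparameters are illustrative, not prescribed by the text.

```python
import random
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 4, 5          # illustrative sizes
GAMMA, BATCH_SIZE, C = 0.9, 100, 50    # illustrative hyperparameters

def make_net():
    # A shallow network can be sufficient for simple tasks such as the grid world.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, NUM_ACTIONS))

def deep_q_learning(replay_buffer, num_iterations=1000):
    """replay_buffer: a list of transitions (s, a, r, s_next) collected off-policy."""
    main_net, target_net = make_net(), make_net()
    target_net.load_state_dict(main_net.state_dict())  # same initial parameters
    optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)

    for it in range(num_iterations):
        # Uniformly draw a mini-batch of samples from the replay buffer.
        batch = random.sample(replay_buffer, BATCH_SIZE)
        s, a, r, s_next = (torch.tensor(x, dtype=torch.float32)
                           for x in zip(*batch))
        a = a.long()

        # y_T = r + gamma * max_a q_hat(s', a, w_T), with w_T held fixed.
        with torch.no_grad():
            y_target = r + GAMMA * target_net(s_next).max(dim=1).values

        # Update the main network to minimize (y_T - q_hat(s, a, w))^2.
        y = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(y, y_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Set w_T = w every C iterations.
        if (it + 1) % C == 0:
            target_net.load_state_dict(main_net.state_dict())

    return target_net
```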

8.4.2 Illustrative examples

An example is given in Figure 8.11 to demonstrate Algorithm 8.3. This example aims to learn the optimal action values for every state-action pair. Once the optimal action values are obtained, the optimal greedy policy can be obtained immediately.
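
For instance, assuming the network interface used in the sketch above (state in, one value per action), the greedy action in a state can be extracted as follows.

```python
import torch

# Once the learned network approximates the optimal action values, the greedy
# policy simply picks argmax_a q_hat(s, a) in each state.

def greedy_action(q_net, state):
    with torch.no_grad():
        q_values = q_net(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```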

A single episode is generated by the behavior policy shown in Figure 8.11(a). This behavior policy is exploratory in the sense that it has the same probability of taking any action at any state. The episode has only 1,000 steps, as shown in Figure 8.11(b). Although there are only 1,000 steps, almost all the state-action pairs are visited in this episode due to the strong exploration ability of the behavior policy. The replay buffer is a set of 1,000 experience samples. The mini-batch size is 100, meaning that 100 samples are uniformly drawn from the replay buffer every time the main network is updated.
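
As a sketch of how such a buffer could be filled, the following assumes a hypothetical grid-world environment with `reset()` and `step(a)` methods returning the next state and reward; this interface is not defined in this section and is introduced only for illustration.

```python
import random

# Generate a single 1,000-step episode under a uniform (exploratory) behavior
# policy and store the transitions as a replay buffer.

def collect_episode(env, num_actions, num_steps=1000):
    buffer = []
    s = env.reset()
    for _ in range(num_steps):
        a = random.randrange(num_actions)   # every action equally likely
        s_next, r = env.step(a)             # assumed to return (next state, reward)
        buffer.append((s, a, r, s_next))
        s = s_next
    return buffer
```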