
4.1 Value iteration

This section introduces the value iteration algorithm. It is exactly the algorithm suggested by the contraction mapping theorem for solving the Bellman optimality equation, as introduced in the last chapter (Theorem 3.3). In particular, the algorithm is

$$v_{k+1} = \max_{\pi \in \Pi} (r_{\pi} + \gamma P_{\pi} v_k), \quad k = 0, 1, 2, \ldots$$

It is guaranteed by Theorem 3.3 that $v_k$ and $\pi_k$ converge to the optimal state value and an optimal policy, respectively, as $k \to \infty$.
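In fact, the contraction property behind Theorem 3.3 also quantifies the convergence rate: the error with respect to the optimal state value $v^*$ shrinks by at least a factor of $\gamma$ at every iteration,

$$\|v_{k+1} - v^*\|_\infty \le \gamma \|v_k - v^*\|_\infty \quad \Longrightarrow \quad \|v_k - v^*\|_\infty \le \gamma^k \|v_0 - v^*\|_\infty,$$

so the iteration converges exponentially fast regardless of the initial guess $v_0$.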

This algorithm is iterative and has two steps in every iteration.

The first step in every iteration is a policy update step. Mathematically, it aims to find a policy that can solve the following optimization problem:

$$\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k),$$

where $v_k$ is obtained in the previous iteration.

The second step is called a value update step. Mathematically, it calculates a new value $v_{k+1}$ by

$$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k, \tag{4.1}$$

where $v_{k+1}$ will be used in the next iteration.
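As a concrete illustration, the two steps of one iteration can be sketched with NumPy for a finite MDP whose model is stored as action-indexed reward vectors and transition matrices. The names below (`one_iteration`, `r_sa`, `P_sa`) are illustrative, not from the text; this is a minimal sketch, not a definitive implementation.

```python
import numpy as np

def one_iteration(r_sa, P_sa, v, gamma=0.9):
    """One step of value iteration in matrix-vector form (illustrative sketch).

    r_sa : (S, A) array, r_sa[s, a] = expected immediate reward of taking a in s
    P_sa : (A, S, S) array, P_sa[a, s, s2] = p(s2 | s, a)
    v    : (S,) array, the current estimate v_k
    """
    # q_k(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s,a) v_k(s')
    q = r_sa + gamma * (P_sa @ v).T          # shape (S, A)
    # Policy update: a deterministic greedy policy attains the max over all policies
    greedy_actions = q.argmax(axis=1)        # a_k^*(s) for every state s
    # Value update: v_{k+1}(s) = q_k(s, a_k^*(s)) = max_a q_k(s, a)
    v_next = q.max(axis=1)
    return greedy_actions, v_next
```

Because a deterministic, state-wise greedy choice already attains the maximum of $r_\pi + \gamma P_\pi v_k$ over all policies, the search over $\Pi$ reduces to a maximization over actions at each state, which is exactly what the elementwise form below makes explicit.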

The value iteration algorithm introduced above is in a matrix-vector form. While this form is useful for understanding the core idea of the algorithm, the elementwise form is needed to explain the implementation details, which we examine next.

4.1.1 Elementwise form and implementation

Consider the time step $k$ and a state $s$.

$\diamond$ First, the elementwise form of the policy update step $\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k)$ is

$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left( \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s') \right)}_{q_k(s,a)}, \quad s \in \mathcal{S}.$$

We showed in Section 3.3.1 that the optimal policy that can solve the above optimization problem is

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s), \\ 0, & a \neq a_k^*(s), \end{cases} \tag{4.2}$$

where $a_k^*(s) = \arg\max_a q_k(s,a)$. If $\arg\max_a q_k(s,a)$ has multiple solutions, we can select any of them without affecting the convergence of the algorithm. Since the new policy $\pi_{k+1}$ selects the action with the greatest $q_k(s,a)$, such a policy is called greedy.

$\diamond$ Second, the elementwise form of the value update step $v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$ is

$$v_{k+1}(s) = \sum_{a} \pi_{k+1}(a|s) \underbrace{\left( \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s') \right)}_{q_k(s,a)}, \quad s \in \mathcal{S}.$$

Substituting (4.2) into the above equation gives

$$v_{k+1}(s) = \max_{a} q_k(s,a).$$

In summary, the above steps can be illustrated as

$$v_k(s) \;\rightarrow\; q_k(s,a) \;\rightarrow\; \text{new greedy policy } \pi_{k+1}(s) \;\rightarrow\; \text{new value } v_{k+1}(s) = \max_{a} q_k(s,a).$$

The implementation details are summarized in Algorithm 4.1.

One point that may be confusing is whether $v_k$ in (4.1) is a state value. The answer is no. Although $v_k$ eventually converges to the optimal state value, it is not ensured to satisfy the Bellman equation of any policy. For example, it does not satisfy $v_k = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$ or $v_k = r_{\pi_k} + \gamma P_{\pi_k} v_k$ in general. It is merely an intermediate value generated by the algorithm. In addition, since $v_k$ is not a state value, $q_k$ is not an action value.
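For a concrete check, consider the example in Section 4.1.2 below: there $v_1(s_4) = 1$, whereas the state value of $s_4$ under $\pi_1$, which stays at the target and keeps collecting the reward $1$, is

$$v^{\pi_1}(s_4) = 1 + \gamma + \gamma^2 + \cdots = \frac{1}{1-\gamma} = 10 \neq v_1(s_4).$$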

4.1.2 Illustrative examples

We next present an example to illustrate the step-by-step implementation of the value iteration algorithm. This example is a two-by-two grid with one forbidden area (Figure 4.2).

Algorithm 4.1: Value iteration algorithm

Initialization: The probability models $p(r|s,a)$ and $p(s'|s,a)$ for all $(s,a)$ are known. Initial guess $v_0$.

Goal: Search for the optimal state value and an optimal policy for solving the Bellman optimality equation.

While $v_k$ has not converged in the sense that $\|v_k - v_{k-1}\|$ is greater than a predefined small threshold, for the $k$th iteration, do

For every state $s \in \mathcal{S}$, do

For every action $a \in \mathcal{A}(s)$, do

q-value: $q_k(s,a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s')$

Maximum action value: $a_k^*(s) = \arg\max_a q_k(s,a)$

Policy update: $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$, and $\pi_{k+1}(a|s) = 0$ otherwise.

Value update: $v_{k+1}(s) = \max_a q_k(s,a)$
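The pseudocode above can be turned into a short, runnable sketch. The function below is an illustration only (the names `value_iteration`, `p_r`, and `p_s` are assumptions, not part of the text); it takes the models $p(r|s,a)$ and $p(s'|s,a)$ as nested dictionaries and returns a greedy policy together with the final value estimate.

```python
def value_iteration(states, actions, p_r, p_s, gamma=0.9, tol=1e-6):
    """Tabular value iteration (a minimal sketch of Algorithm 4.1).

    p_r[(s, a)] : dict mapping reward r -> probability p(r | s, a)
    p_s[(s, a)] : dict mapping next state s2 -> probability p(s2 | s, a)
    The action set is assumed state-independent here for brevity.
    """
    v = {s: 0.0 for s in states}              # initial guess v_0
    while True:
        q = {}
        for s in states:
            for a in actions:
                # q_k(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_k(s')
                q[(s, a)] = (sum(p * r for r, p in p_r[(s, a)].items())
                             + gamma * sum(p * v[s2] for s2, p in p_s[(s, a)].items()))
        # Policy update: pick a greedy action for every state, as in (4.2)
        policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
        # Value update: v_{k+1}(s) = max_a q_k(s, a)
        v_new = {s: q[(s, policy[s])] for s in states}
        # Stop when ||v_{k+1} - v_k|| falls below the threshold
        if max(abs(v_new[s] - v[s]) for s in states) < tol:
            return policy, v_new
        v = v_new
```

The returned policy is deterministic, matching the greedy form in (4.2).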

Table 4.1: The expression of $q(s,a)$ for the example as shown in Figure 4.2.

The target area is $s_4$. The reward settings are $r_{\mathrm{boundary}} = r_{\mathrm{forbidden}} = -1$ and $r_{\mathrm{target}} = 1$. The discount rate is $\gamma = 0.9$.


Figure 4.2: An example for demonstrating the implementation of the value iteration algorithm.

The expression of the q-value for each state-action pair is shown in Table 4.1.

$\diamond$ $k = 0$:

Without loss of generality, select the initial values as $v_0(s_1) = v_0(s_2) = v_0(s_3) = v_0(s_4) = 0$.

q-value calculation: Substituting $v_0(s_i)$ into Table 4.1 gives the q-values shown in Table 4.2.

Table 4.2: The value of $q(s,a)$ at $k = 0$.

Table 4.3: The value of $q(s,a)$ at $k = 1$.

Policy update: $\pi_1$ is obtained by selecting the actions with the greatest q-values for every state:

$$\pi_1(a_5|s_1) = 1, \quad \pi_1(a_3|s_2) = 1, \quad \pi_1(a_2|s_3) = 1, \quad \pi_1(a_5|s_4) = 1.$$

This policy is visualized in Figure 4.2 (the middle subfigure). It is clearly not optimal because it chooses to stay still at $s_1$. Notably, the q-values for $(s_1, a_5)$ and $(s_1, a_3)$ are actually the same, and we can randomly select either action.

Value update: $v_1$ is obtained by updating the v-value to the greatest q-value for each state:

$$v_1(s_1) = 0, \quad v_1(s_2) = 1, \quad v_1(s_3) = 1, \quad v_1(s_4) = 1.$$
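These values can be verified directly: since $v_0(s) = 0$ for every state, the q-values at $k = 0$ reduce to expected immediate rewards,

$$q_0(s,a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_0(s') = \sum_{r} p(r|s,a)\, r,$$

so $v_1(s) = \max_a q_0(s,a)$ is simply the largest immediate reward obtainable at $s$: it equals $1$ for $s_2$, $s_3$, and $s_4$ (which can reach or stay at the target) and $0$ for $s_1$.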

$\diamond$ $k = 1$:

q-value calculation: Substituting $v_1(s_i)$ into Table 4.1 yields the q-values shown in Table 4.3.

Policy update: $\pi_2$ is obtained by selecting the actions with the greatest q-values:

$$\pi_2(a_3|s_1) = 1, \quad \pi_2(a_3|s_2) = 1, \quad \pi_2(a_2|s_3) = 1, \quad \pi_2(a_5|s_4) = 1.$$

This policy is visualized in Figure 4.2 (the right subfigure).

Value update: $v_2$ is obtained by updating the v-value to the greatest q-value for each state:

$$v_2(s_1) = \gamma \cdot 1, \quad v_2(s_2) = 1 + \gamma \cdot 1, \quad v_2(s_3) = 1 + \gamma \cdot 1, \quad v_2(s_4) = 1 + \gamma \cdot 1.$$
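For instance, the value of $s_1$ can be traced as follows. Under $\pi_2$, the greedy action at $s_1$ is $a_3$; assuming (as Figure 4.2 suggests) that this action moves the agent from $s_1$ to $s_3$ with an immediate reward of $0$, the value update gives

$$v_2(s_1) = q_1(s_1, a_3) = 0 + \gamma\, v_1(s_3) = \gamma \cdot 1,$$

while the other three states reach or stay at the target, collecting the immediate reward $1$ plus $\gamma\, v_1(s_4) = \gamma \cdot 1$.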

$\diamond$ $k = 2, 3, 4, \ldots$:

It is notable that the policy $\pi_2$, as illustrated in Figure 4.2, is already optimal. Therefore, we only need to run two iterations to obtain an optimal policy in this simple example. For more complex examples, we need to run more iterations until the value of $v_k$ converges (e.g., until $\|v_{k+1} - v_k\|$ is smaller than a pre-specified threshold).
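The example can also be reproduced numerically. The encoding below is an assumption about Figure 4.2 (states numbered row by row with $s_2$ as the forbidden cell and $s_4$ as the target, and $a_1,\dots,a_5$ interpreted as up, right, down, left, stay); it is meant only to show how the pieces fit together, not to restate the figure.

```python
# Hypothetical encoding of the 2x2 grid in Figure 4.2 (layout assumed, see text).
states = ['s1', 's2', 's3', 's4']                  # s2: forbidden cell, s4: target
actions = ['up', 'right', 'down', 'left', 'stay']  # a1 .. a5
gamma = 0.9

# Deterministic grid geometry: the agent stays put when it would leave the grid.
move = {
    's1': {'up': 's1', 'right': 's2', 'down': 's3', 'left': 's1', 'stay': 's1'},
    's2': {'up': 's2', 'right': 's2', 'down': 's4', 'left': 's1', 'stay': 's2'},
    's3': {'up': 's1', 'right': 's4', 'down': 's3', 'left': 's3', 'stay': 's3'},
    's4': {'up': 's2', 'right': 's4', 'down': 's4', 'left': 's3', 'stay': 's4'},
}

def reward(s, a):
    """Reward settings of the example: -1 for the boundary and the forbidden cell,
    +1 for the target, 0 otherwise (all deterministic under the assumed layout)."""
    s_next = move[s][a]
    if s_next == s and a != 'stay':   # attempted to cross the boundary
        return -1.0
    if s_next == 's2':                # entered (or stayed in) the forbidden cell
        return -1.0
    if s_next == 's4':                # entered (or stayed at) the target
        return 1.0
    return 0.0

# Plain value-iteration sweeps: v_{k+1}(s) = max_a [ r(s, a) + gamma * v_k(s') ].
v = {s: 0.0 for s in states}
for _ in range(100):
    v = {s: max(reward(s, a) + gamma * v[move[s][a]] for a in actions) for s in states}

# Greedy policy extracted from the converged values.
policy = {s: max(actions, key=lambda a: reward(s, a) + gamma * v[move[s][a]])
          for s in states}
print(policy)  # expected: s1 -> down, s2 -> down, s3 -> right, s4 -> stay (i.e., pi_2)
```

Under this encoding, the first two sweeps reproduce the values $v_1$ and $v_2$ computed above, and the extracted policy matches $\pi_2$.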
