
4.2 Policy iteration

This section presents another important algorithm: policy iteration. Unlike value iteration, policy iteration does not directly solve the Bellman optimality equation. However, it has an intimate relationship with value iteration, as shown later. Moreover, the idea of policy iteration is important because it is widely used in reinforcement learning algorithms.

4.2.1 Algorithm analysis

Policy iteration is an iterative algorithm. Each iteration has two steps.

\diamond The first is a policy evaluation step. As its name suggests, this step evaluates a given policy by calculating the corresponding state values. That is, it solves the following Bellman equation:

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}, \tag{4.3}$$

where $\pi_k$ is the policy obtained in the last iteration and $v_{\pi_k}$ is the state value to be calculated. The values of $r_{\pi_k}$ and $P_{\pi_k}$ can be obtained from the system model.

\diamond The second is a policy improvement step. As its name suggests, this step is used to improve the policy. In particular, once $v_{\pi_k}$ has been calculated in the first step, a new policy $\pi_{k+1}$ can be obtained as

$$\pi_{k+1} = \arg\max_{\pi} \left(r_{\pi} + \gamma P_{\pi} v_{\pi_k}\right).$$
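
As a concrete illustration of the improvement step, the following NumPy sketch (not from the book; the array names `R` and `P` and the deterministic-policy representation are assumptions) performs the above maximization for a finite MDP whose model is given as an expected-reward array $R[s,a] = \sum_r p(r|s,a)\,r$ and a transition array $P[s,a,s'] = p(s'|s,a)$. The maximization over policies reduces to a per-state maximization over actions, a point made precise in Section 4.2.2.

```python
import numpy as np

def improve_policy(v_pi_k, R, P, gamma):
    """
    Greedy policy improvement: pi_{k+1} = argmax_pi (r_pi + gamma * P_pi * v_{pi_k}).
    R[s, a] is the expected immediate reward and P[s, a, s'] = p(s'|s, a).
    """
    # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v_{pi_k}(s')
    q = R + gamma * P @ v_pi_k
    greedy_actions = np.argmax(q, axis=1)
    # Encode pi_{k+1} as a deterministic policy matrix: pi(a|s) = 1 for the greedy action.
    pi_next = np.zeros_like(R)
    pi_next[np.arange(R.shape[0]), greedy_actions] = 1.0
    return pi_next
```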

Three questions naturally follow the above description of the algorithm.

\diamond In the policy evaluation step, how do we solve for the state value $v_{\pi_k}$?
\diamond In the policy improvement step, why is the new policy $\pi_{k+1}$ better than $\pi_k$?
\diamond Why can this algorithm eventually converge to an optimal policy?

We next answer these questions one by one.

In the policy evaluation step, how do we calculate $v_{\pi_k}$?

We introduced two methods in Chapter 2 for solving the Bellman equation in (4.3). We next revisit the two methods briefly. The first method is a closed-form solution:


$v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$. This closed-form solution is useful for theoretical analysis purposes, but it is inefficient to implement since it requires other numerical algorithms to compute the matrix inverse. The second method is an iterative algorithm that can be easily implemented:

$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \ldots \tag{4.4}$$

where $v_{\pi_k}^{(j)}$ denotes the $j$th estimate of $v_{\pi_k}$. Starting from any initial guess $v_{\pi_k}^{(0)}$, it is ensured that $v_{\pi_k}^{(j)} \to v_{\pi_k}$ as $j \to \infty$. Details can be found in Section 2.7.

Interestingly, policy iteration is an iterative algorithm with another iterative algorithm, (4.4), embedded in the policy evaluation step. In theory, this embedded iterative algorithm requires an infinite number of steps (that is, $j \to \infty$) to converge to the true state value $v_{\pi_k}$. This is, however, impossible to realize. In practice, the iterative process terminates when a certain criterion is satisfied. For example, the termination criterion can be that $\| v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)} \|$ is less than a prespecified threshold or that $j$ exceeds a prespecified value. If we do not run an infinite number of iterations, we can only obtain an imprecise value of $v_{\pi_k}$, which will be used in the subsequent policy improvement step. Would this cause problems? The answer is no. The reason will become clear when we introduce the truncated policy iteration algorithm later in Section 4.3.
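
The two evaluation methods can be summarized in a short NumPy sketch (illustrative only; the function names, the tolerance, and the toy numbers below are not from the book): the closed-form route solves a linear system once, while the iterative route runs (4.4) until successive iterates differ by less than a threshold.

```python
import numpy as np

def evaluate_closed_form(r_pi, P_pi, gamma):
    """Solve v = r_pi + gamma * P_pi v exactly, i.e., v = (I - gamma P_pi)^(-1) r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def evaluate_iteratively(r_pi, P_pi, gamma, tol=1e-8, max_iter=100_000):
    """Iterate v^(j+1) = r_pi + gamma * P_pi v^(j) until the change is below tol."""
    v = np.zeros(len(r_pi))                      # any initial guess works
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# A toy two-state policy model (illustrative numbers only).
r_pi = np.array([-1.0, 0.0])
P_pi = np.array([[1.0, 0.0],
                 [1.0, 0.0]])
print(evaluate_closed_form(r_pi, P_pi, 0.9))     # [-10.  -9.]
print(evaluate_iteratively(r_pi, P_pi, 0.9))     # approximately [-10.  -9.]
```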

In the policy improvement step, why is $\pi_{k+1}$ better than $\pi_k$?

The policy improvement step can improve the given policy, as shown below.

Lemma 4.1 (Policy improvement). If $\pi_{k+1} = \arg\max_{\pi}(r_{\pi} + \gamma P_{\pi} v_{\pi_k})$, then $v_{\pi_{k+1}} \geq v_{\pi_k}$.

Here, $v_{\pi_{k+1}} \geq v_{\pi_k}$ means that $v_{\pi_{k+1}}(s) \geq v_{\pi_k}(s)$ for all $s$. The proof of this lemma is given in Box 4.1.

Box 4.1: Proof of Lemma 4.1

Since $v_{\pi_{k+1}}$ and $v_{\pi_k}$ are state values, they satisfy the Bellman equations:

$$v_{\pi_{k+1}} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_{k+1}},$$
$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}.$$

Since $\pi_{k+1} = \arg\max_{\pi}(r_{\pi} + \gamma P_{\pi} v_{\pi_k})$, we know that

$$r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_k} \geq r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}.$$

It then follows that

$$\begin{aligned} v_{\pi_k} - v_{\pi_{k+1}} &= \left(r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}\right) - \left(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_{k+1}}\right) \\ &\leq \left(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_k}\right) - \left(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_{k+1}}\right) \\ &= \gamma P_{\pi_{k+1}} \left(v_{\pi_k} - v_{\pi_{k+1}}\right). \end{aligned}$$

Repeatedly applying the above inequality gives

$$v_{\pi_k} - v_{\pi_{k+1}} \leq \gamma^2 P_{\pi_{k+1}}^2 \left(v_{\pi_k} - v_{\pi_{k+1}}\right) \leq \cdots \leq \gamma^n P_{\pi_{k+1}}^n \left(v_{\pi_k} - v_{\pi_{k+1}}\right) \leq \lim_{n \to \infty} \gamma^n P_{\pi_{k+1}}^n \left(v_{\pi_k} - v_{\pi_{k+1}}\right) = 0.$$

The limit is due to the facts that $\gamma^n \to 0$ as $n \to \infty$ and that $P_{\pi_{k+1}}^n$ is a nonnegative stochastic matrix for any $n$. Here, a stochastic matrix refers to a nonnegative matrix whose row sums are equal to one for all rows. It follows that $v_{\pi_k} - v_{\pi_{k+1}} \leq 0$, which proves the lemma.

Why can the policy iteration algorithm eventually find an optimal policy?

The policy iteration algorithm generates two sequences. The first is a sequence of policies: $\{\pi_0, \pi_1, \ldots, \pi_k, \ldots\}$. The second is a sequence of state values: $\{v_{\pi_0}, v_{\pi_1}, \ldots, v_{\pi_k}, \ldots\}$. Suppose that $v^*$ is the optimal state value. Then, $v_{\pi_k} \leq v^*$ for all $k$. Since the policies are continuously improved according to Lemma 4.1, we know that

$$v_{\pi_0} \leq v_{\pi_1} \leq v_{\pi_2} \leq \cdots \leq v_{\pi_k} \leq \cdots \leq v^*.$$

Since $v_{\pi_k}$ is nondecreasing and always bounded from above by $v^*$, it follows from the monotone convergence theorem [12] (Appendix C) that $v_{\pi_k}$ converges to a constant value, denoted as $v_\infty$, when $k \to \infty$. The following analysis shows that $v_\infty = v^*$.

Theorem 4.1 (Convergence of policy iteration). The state value sequence $\{v_{\pi_k}\}_{k=0}^{\infty}$ generated by the policy iteration algorithm converges to the optimal state value $v^*$. As a result, the policy sequence $\{\pi_k\}_{k=0}^{\infty}$ converges to an optimal policy.

The proof of this theorem is given in Box 4.2. The proof not only shows the convergence of the policy iteration algorithm but also reveals the relationship between the policy iteration and value iteration algorithms. Loosely speaking, if both algorithms start from the same initial guess, policy iteration will converge faster than value iteration due to the additional iterations embedded in the policy evaluation step. This point will become clearer when we introduce the truncated policy iteration algorithm in Section 4.3.

Box 4.2: Proof of Theorem 4.1

The idea of the proof is to show that the policy iteration algorithm converges faster than the value iteration algorithm.

In particular, to prove the convergence of $\{v_{\pi_k}\}_{k=0}^{\infty}$, we introduce another sequence $\{v_k\}_{k=0}^{\infty}$ generated by

$$v_{k+1} = f(v_k) = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k).$$

This iterative algorithm is exactly the value iteration algorithm. We already know that $v_k$ converges to $v^*$ given any initial value $v_0$.

We next show by induction that $v_k \leq v_{\pi_k} \leq v^*$ for all $k \geq 0$.

For $k = 0$, we can always find a $v_0$ such that $v_{\pi_0} \geq v_0$ for any $\pi_0$.

Suppose that $v_{\pi_k} \geq v_k$ for some $k \geq 0$. For $k + 1$, we have

$$\begin{aligned} v_{\pi_{k+1}} - v_{k+1} &= \left(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_{k+1}}\right) - \max_{\pi}\left(r_{\pi} + \gamma P_{\pi} v_{k}\right) \\ &\geq \left(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_{k}}\right) - \max_{\pi}\left(r_{\pi} + \gamma P_{\pi} v_{k}\right) \quad (\text{because } v_{\pi_{k+1}} \geq v_{\pi_k} \text{ by Lemma 4.1 and } P_{\pi_{k+1}} \geq 0) \\ &= \left(r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_{\pi_{k}}\right) - \left(r_{\pi_k'} + \gamma P_{\pi_k'} v_{k}\right) \quad (\text{suppose } \pi_k' = \arg\max_{\pi}(r_{\pi} + \gamma P_{\pi} v_{k})) \\ &\geq \left(r_{\pi_k'} + \gamma P_{\pi_k'} v_{\pi_{k}}\right) - \left(r_{\pi_k'} + \gamma P_{\pi_k'} v_{k}\right) \quad (\text{because } \pi_{k+1} = \arg\max_{\pi}(r_{\pi} + \gamma P_{\pi} v_{\pi_k})) \\ &= \gamma P_{\pi_k'}\left(v_{\pi_{k}} - v_{k}\right). \end{aligned}$$

Since $v_{\pi_k} - v_k \geq 0$ and $P_{\pi_k'}$ is nonnegative, we have $P_{\pi_k'}(v_{\pi_k} - v_k) \geq 0$ and hence $v_{\pi_{k+1}} - v_{k+1} \geq 0$.

Therefore, we can show by induction that $v_k \leq v_{\pi_k} \leq v^*$ for any $k \geq 0$. Since $v_k$ converges to $v^*$, $v_{\pi_k}$ also converges to $v^*$.

4.2.2 Elementwise form and implementation

To implement the policy iteration algorithm, we need to study its elementwise form.

\diamond First, the policy evaluation step solves $v_{\pi_k}$ from $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$ by using the iterative algorithm in (4.4). The elementwise form of this algorithm is

$$v_{\pi_k}^{(j+1)}(s) = \sum_{a} \pi_k(a|s) \left( \sum_{r} p(r|s,a) r + \gamma \sum_{s'} p(s'|s,a) v_{\pi_k}^{(j)}(s') \right), \quad s \in \mathcal{S},$$

where $j = 0, 1, 2, \ldots$

\diamond Second, the policy improvement step solves $\pi_{k+1} = \arg\max_{\pi}(r_{\pi} + \gamma P_{\pi} v_{\pi_k})$. The elementwise form of this equation is

$$\pi_{k+1}(s) = \arg\max_{\pi} \sum_{a} \pi(a|s) \underbrace{\left( \sum_{r} p(r|s,a) r + \gamma \sum_{s'} p(s'|s,a) v_{\pi_k}(s') \right)}_{q_{\pi_k}(s,a)}, \quad s \in \mathcal{S},$$

where $q_{\pi_k}(s,a)$ is the action value under policy $\pi_k$. Let $a_k^*(s) = \arg\max_a q_{\pi_k}(s,a)$. Then, the greedy policy is

$$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*(s), \\ 0, & a \neq a_k^*(s). \end{cases}$$

The implementation details are summarized in Algorithm 4.2.

Algorithm 4.2: Policy iteration algorithm

Initialization: The system model, $p(r|s,a)$ and $p(s'|s,a)$ for all $(s,a)$, is known. Initial guess $\pi_0$.
Goal: Search for the optimal state value and an optimal policy.

While $v_{\pi_k}$ has not converged, for the $k$th iteration, do
    Policy evaluation:
        Initialization: an arbitrary initial guess $v_{\pi_k}^{(0)}$
        While $v_{\pi_k}^{(j)}$ has not converged, for the $j$th iteration, do
            For every state $s \in \mathcal{S}$, do
                $v_{\pi_k}^{(j+1)}(s) = \sum_{a} \pi_k(a|s) \left[ \sum_{r} p(r|s,a) r + \gamma \sum_{s'} p(s'|s,a) v_{\pi_k}^{(j)}(s') \right]$
    Policy improvement:
        For every state $s \in \mathcal{S}$, do
            For every action $a \in \mathcal{A}$, do
                $q_{\pi_k}(s,a) = \sum_{r} p(r|s,a) r + \gamma \sum_{s'} p(s'|s,a) v_{\pi_k}(s')$
            $a_k^*(s) = \arg\max_a q_{\pi_k}(s,a)$
            $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$, and $\pi_{k+1}(a|s) = 0$ otherwise
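
For reference, below is one possible Python rendering of Algorithm 4.2 (a sketch under assumptions, not the book's code): the model is assumed to be given as a transition array `P[s, a, s']` $= p(s'|s,a)$ and an expected-reward array `R[s, a]` $= \sum_r p(r|s,a)\,r$, and the initial policy is taken to be deterministic.

```python
import numpy as np

def policy_iteration(P, R, gamma, eval_tol=1e-8):
    """
    Policy iteration in elementwise form (a sketch of Algorithm 4.2).
    P[s, a, s'] = p(s'|s, a);  R[s, a] = sum_r p(r|s, a) * r (expected immediate reward).
    Returns a deterministic policy (one action index per state) and its state value.
    """
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)       # arbitrary deterministic initial policy pi_0
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate (4.4) under the current policy until convergence.
        while True:
            v_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v
                              for s in range(n_states)])
            diff = np.max(np.abs(v_new - v))
            v = v_new
            if diff < eval_tol:
                break
        # Policy improvement: act greedily with respect to q_{pi_k}(s, a).
        q = R + gamma * P @ v                    # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] v(s')
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):   # no further improvement: pi_{k+1} = pi_k
            return policy, v
        policy = new_policy
```

Here the inner evaluation loop is run to a tight tolerance; Section 4.3 discusses the truncated variant in which it is stopped after only a few sweeps.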

4.2.3 Illustrative examples

A simple example

Consider a simple example shown in Figure 4.3. There are two states with three possible actions: $\mathcal{A} = \{a_\ell, a_0, a_r\}$. The three actions represent moving leftward, staying unchanged, and moving rightward. The reward settings are $r_{\mathrm{boundary}} = -1$ and $r_{\mathrm{target}} = 1$. The discount rate is $\gamma = 0.9$.


Figure 4.3: An example for illustrating the implementation of the policy iteration algorithm: (a) the initial policy $\pi_0$; (b) the improved policy $\pi_1$.

We next present the implementation of the policy iteration algorithm in a step-by-step manner. When $k = 0$, we start with the initial policy shown in Figure 4.3(a). This policy is not good because it does not move toward the target area. We next show how to apply the policy iteration algorithm to obtain an optimal policy.

\diamond First, in the policy evaluation step, we need to solve the Bellman equation:

$$v_{\pi_0}(s_1) = -1 + \gamma v_{\pi_0}(s_1),$$
$$v_{\pi_0}(s_2) = 0 + \gamma v_{\pi_0}(s_1).$$

Since the equations are simple, they can be solved manually, which gives

$$v_{\pi_0}(s_1) = -10, \quad v_{\pi_0}(s_2) = -9.$$

In practice, the equations can be solved by the iterative algorithm in (4.4). For example, select the initial state values as $v_{\pi_0}^{(0)}(s_1) = v_{\pi_0}^{(0)}(s_2) = 0$. It follows from (4.4) that

$$\begin{cases} v_{\pi_0}^{(1)}(s_1) = -1 + \gamma v_{\pi_0}^{(0)}(s_1) = -1, \\ v_{\pi_0}^{(1)}(s_2) = 0 + \gamma v_{\pi_0}^{(0)}(s_1) = 0, \end{cases}$$
$$\begin{cases} v_{\pi_0}^{(2)}(s_1) = -1 + \gamma v_{\pi_0}^{(1)}(s_1) = -1.9, \\ v_{\pi_0}^{(2)}(s_2) = 0 + \gamma v_{\pi_0}^{(1)}(s_1) = -0.9, \end{cases}$$
$$\begin{cases} v_{\pi_0}^{(3)}(s_1) = -1 + \gamma v_{\pi_0}^{(2)}(s_1) = -2.71, \\ v_{\pi_0}^{(3)}(s_2) = 0 + \gamma v_{\pi_0}^{(2)}(s_1) = -1.71. \end{cases}$$

With more iterations, we can see the trend: $v_{\pi_0}^{(j)}(s_1) \to v_{\pi_0}(s_1) = -10$ and $v_{\pi_0}^{(j)}(s_2) \to v_{\pi_0}(s_2) = -9$ as $j$ increases.

\diamond Second, in the policy improvement step, the key is to calculate $q_{\pi_0}(s,a)$ for each state-action pair. The q-table in Table 4.4 can be used to demonstrate this process.

Substituting $v_{\pi_0}(s_1) = -10$ and $v_{\pi_0}(s_2) = -9$, obtained in the previous policy evaluation step, into Table 4.4 yields Table 4.5.

Table 4.4: The expression of $q_{\pi_k}(s,a)$ for the example in Figure 4.3.

Table 4.5: The value of $q_{\pi_k}(s,a)$ when $k = 0$.

By seeking the greatest value of $q_{\pi_0}$ for each state, the improved policy $\pi_1$ can be obtained as

$$\pi_1(a_r|s_1) = 1, \quad \pi_1(a_0|s_2) = 1.$$

This policy is illustrated in Figure 4.3(b). It is clear that this policy is optimal.

The above process shows that a single iteration is sufficient for finding the optimal policy in this simple example. More iterations are required for more complex examples.
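
As a numerical check of the whole example, the short script below (an illustrative sketch, not the book's code) reproduces the policy evaluation iterates, the q-values in Table 4.5, and the greedy policy $\pi_1$. The model encoding is inferred from the text: $s_1$ lies to the left of the target $s_2$, bumping into a boundary keeps the agent in place with reward -1, entering or staying in the target gives reward +1, and all other transitions give reward 0.

```python
import numpy as np

gamma = 0.9

# Policy evaluation for pi_0 via (4.4): v(s1) = -1 + gamma*v(s1), v(s2) = 0 + gamma*v(s1).
v1 = v2 = 0.0
for _ in range(500):                        # enough sweeps for convergence
    v1, v2 = -1 + gamma * v1, 0 + gamma * v1
print(round(v1, 4), round(v2, 4))           # -10.0 -9.0 (first iterates: -1, -1.9, -2.71, ...)

# Policy improvement: q_{pi_0}(s, a) = immediate reward + gamma * v(next state),
# with actions ordered (a_l, a_0, a_r).
v = np.array([v1, v2])
q = np.array([
    [-1 + gamma * v[0],  0 + gamma * v[0],  1 + gamma * v[1]],   # s1: left (boundary), stay, right (target)
    [ 0 + gamma * v[0],  1 + gamma * v[1], -1 + gamma * v[1]],   # s2: left (to s1), stay (target), right (boundary)
])
print(np.round(q, 2))                       # [[-10.   -9.   -7.1]
                                            #  [ -9.   -7.1  -9.1]]
print(np.argmax(q, axis=1))                 # [2 1]  ->  pi_1(a_r|s1) = 1, pi_1(a_0|s2) = 1
```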

A more complicated example

We next demonstrate the policy iteration algorithm using a more complicated example shown in Figure 4.4. The reward settings are $r_{\mathrm{boundary}} = -1$, $r_{\mathrm{forbidden}} = -10$, and $r_{\mathrm{target}} = 1$. The discount rate is $\gamma = 0.9$. The policy iteration algorithm can converge to the optimal policy (Figure 4.4(h)) when starting from a random initial policy (Figure 4.4(a)).

Two interesting phenomena are observed during the iteration process.

\diamond First, if we observe how the policy evolves, an interesting pattern emerges: the states close to the target area find their optimal policies earlier than those farther away. Only after the closer states have found trajectories that reach the target can the farther states find trajectories that pass through the closer states to reach the target.
\diamond Second, the spatial distribution of the state values exhibits an interesting pattern: states located closer to the target have greater state values. The reason is that an agent starting from a farther state must travel for many steps before obtaining a positive reward, and such a reward is severely discounted and hence contributes relatively little.


Figure 4.4: The evolution process of the policies and state values generated by the policy iteration algorithm: (a) $\pi_0$ and $v_{\pi_0}$; (b) $\pi_1$ and $v_{\pi_1}$; (c) $\pi_2$ and $v_{\pi_2}$; (d) $\pi_3$ and $v_{\pi_3}$; (e) $\pi_4$ and $v_{\pi_4}$; (f) $\pi_5$ and $v_{\pi_5}$; (g) $\pi_9$ and $v_{\pi_9}$; (h) $\pi_{10}$ and $v_{\pi_{10}}$.