
4.3 Truncated policy iteration

We next introduce a more general algorithm called truncated policy iteration. We will see that the value iteration and policy iteration algorithms are two special cases of the truncated policy iteration algorithm.

4.3.1 Comparing value iteration and policy iteration

First of all, we compare the value iteration and policy iteration algorithms by listing their steps as follows.

$\diamond$ Policy iteration: Select an arbitrary initial policy $\pi_0$. In the $k$th iteration, do the following two steps.

  • Step 1: Policy evaluation (PE). Given $\pi_k$, solve $v_{\pi_k}$ from

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}.$$

  • Step 2: Policy improvement (PI). Given $v_{\pi_k}$, solve $\pi_{k+1}$ from

$$\pi_{k+1} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_{\pi_k} \right).$$

$\diamond$ Value iteration: Select an arbitrary initial value $v_0$. In the $k$th iteration, do the following two steps.

  • Step 1: Policy update (PU). Given $v_k$, solve $\pi_{k+1}$ from

$$\pi_{k+1} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right).$$

  • Step 2: Value update (VU). Given $\pi_{k+1}$, solve $v_{k+1}$ from

$$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k.$$

The above steps of the two algorithms can be illustrated as

Policy iteration: $\pi_0 \xrightarrow{PE} v_{\pi_0} \xrightarrow{PI} \pi_1 \xrightarrow{PE} v_{\pi_1} \xrightarrow{PI} \pi_2 \xrightarrow{PE} v_{\pi_2} \xrightarrow{PI} \cdots$

Value iteration: $v_0 \xrightarrow{PU} \pi_1' \xrightarrow{VU} v_1 \xrightarrow{PU} \pi_2' \xrightarrow{VU} v_2 \xrightarrow{PU} \cdots$

It can be seen that the procedures of the two algorithms are very similar.
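For concreteness, the following is a minimal Python sketch of one iteration of each algorithm in matrix-vector form. It assumes a hypothetical tabular model stored as arrays `P[s, a, s']` for $p(s'|s,a)$ and `R[s, a]` for the expected immediate reward $\sum_r p(r|s,a)r$; the function names are illustrative, not part of the text's notation.

```python
import numpy as np

def policy_iteration_step(P, R, gamma, policy):
    """One iteration of policy iteration: exact policy evaluation (PE), then greedy improvement (PI).

    P[s, a, s2] stands for p(s'|s, a); R[s, a] for the expected immediate reward.
    The policy is deterministic: policy[s] is the action taken in state s.
    """
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), policy]                 # transition matrix under the policy
    r_pi = R[np.arange(n_states), policy]                 # reward vector under the policy
    v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)  # solve v = r_pi + gamma * P_pi v exactly
    new_policy = np.argmax(R + gamma * P @ v_pi, axis=1)  # greedy improvement with respect to v_pi
    return v_pi, new_policy

def value_iteration_step(P, R, gamma, v):
    """One iteration of value iteration: policy update (PU), then a one-step value update (VU)."""
    n_states = P.shape[0]
    q = R + gamma * P @ v                                 # q(s, a) under the current value estimate
    new_policy = np.argmax(q, axis=1)                     # PU: greedy policy
    new_v = q[np.arange(n_states), new_policy]            # VU: v <- r_pi + gamma * P_pi v for the new policy
    return new_v, new_policy
```

Here, `np.argmax` over the action dimension realizes $\arg\max_{\pi}(r_{\pi} + \gamma P_{\pi} v)$ because the maximization decouples across states when the policy is deterministic.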

Table 4.6: A comparison between the implementation steps of policy iteration and value iteration.

To see the difference between the two algorithms, we examine their steps more closely. In particular, let both algorithms start from the same initial condition: $v_0 = v_{\pi_0}$. The procedures of the two algorithms are listed in Table 4.6. In the first three steps, the two algorithms generate the same results since $v_0 = v_{\pi_0}$. They become different in the fourth step: the value iteration algorithm executes $v_1 = r_{\pi_1} + \gamma P_{\pi_1} v_0$, which is a one-step calculation, whereas the policy iteration algorithm solves $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$, which requires an infinite number of iterations. If we explicitly write out the iterative process for solving $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$ in the fourth step, everything becomes clear. By letting $v_{\pi_1}^{(0)} = v_0$, we have

$$
\begin{array}{rl}
 & v_{\pi_1}^{(0)} = v_0 \\
\text{value iteration} \leftarrow v_1 \longleftarrow & v_{\pi_1}^{(1)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(0)} \\
 & v_{\pi_1}^{(2)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(1)} \\
 & \quad\vdots \\
\text{truncated policy iteration} \leftarrow \bar{v}_1 \longleftarrow & v_{\pi_1}^{(j)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(j-1)} \\
 & \quad\vdots \\
\text{policy iteration} \leftarrow v_{\pi_1} \longleftarrow & v_{\pi_1}^{(\infty)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(\infty)}
\end{array}
$$

The following observations can be obtained from the above process.

$\diamond$ If the iteration is run only once, then $v_{\pi_1}^{(1)}$ is actually $v_1$, as calculated in the value iteration algorithm.
$\diamond$ If the iteration is run an infinite number of times, then $v_{\pi_1}^{(\infty)}$ is actually $v_{\pi_1}$, as calculated in the policy iteration algorithm.
$\diamond$ If the iteration is run a finite number of times (denoted as $j_{\text{truncated}}$), then such an algorithm is called truncated policy iteration. It is called truncated because the remaining iterations from $j_{\text{truncated}}$ to $\infty$ are truncated.

As a result, the value iteration and policy iteration algorithms can be viewed as two extreme cases of the truncated policy iteration algorithm: value iteration terminates at $j_{\text{truncated}} = 1$, and policy iteration terminates at $j_{\text{truncated}} = \infty$.
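To make this spectrum concrete, here is a minimal Python sketch of the policy evaluation iteration in matrix-vector form. The function name and the small two-state model are hypothetical; `r_pi`, `P_pi`, and `v0` stand for $r_{\pi_1}$, $P_{\pi_1}$, and $v_0$ above. Running a single sweep reproduces the value update of value iteration, while many sweeps approach the exact solution used by policy iteration.

```python
import numpy as np

def truncated_policy_evaluation(r_pi, P_pi, v0, gamma, j_truncated):
    """Run j_truncated sweeps of v <- r_pi + gamma * P_pi v, starting from v0.

    j_truncated = 1 reproduces the value update of value iteration;
    j_truncated -> infinity converges to the exact state value v_pi (policy iteration).
    """
    v = np.array(v0, dtype=float)
    for _ in range(j_truncated):
        v = r_pi + gamma * P_pi @ v
    return v

# A tiny hypothetical two-state example under a fixed policy pi_1.
gamma = 0.9
r_pi = np.array([1.0, 0.0])                    # r_{pi_1}: expected immediate reward per state
P_pi = np.array([[0.8, 0.2], [0.3, 0.7]])      # P_{pi_1}: state transition matrix under the policy
v0 = np.zeros(2)                               # v_0

v_one_step = truncated_policy_evaluation(r_pi, P_pi, v0, gamma, 1)    # value iteration's v_1
v_truncated = truncated_policy_evaluation(r_pi, P_pi, v0, gamma, 10)  # truncated policy iteration's v_bar_1
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)             # policy iteration's v_{pi_1}
print(v_one_step, v_truncated, v_exact)
```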

Algorithm 4.3: Truncated policy iteration algorithm
Initialization: The probability models $p(r|s,a)$ and $p(s'|s,a)$ for all $(s,a)$ are known. Initial guess $\pi_0$.
Goal: Search for the optimal state value and an optimal policy.
While $v_k$ has not converged, for the $k$th iteration, do
    Policy evaluation:
        Initialization: select the initial guess as $v_k^{(0)} = v_{k-1}$. The maximum number of iterations is set as $j_{\text{truncated}}$.
        While $j < j_{\text{truncated}}$, do
            For every state $s \in \mathcal{S}$, do
                $v_k^{(j+1)}(s) = \sum_a \pi_k(a|s) \left[ \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a) v_k^{(j)}(s') \right]$
        Set $v_k = v_k^{(j_{\text{truncated}})}$
    Policy improvement:
        For every state $s \in \mathcal{S}$, do
            For every action $a \in \mathcal{A}(s)$, do
                $q_k(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a) v_k(s')$
            $a_k^*(s) = \arg\max_a q_k(s,a)$
            $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$, and $\pi_{k+1}(a|s) = 0$ otherwise

It should be noted that, although the above comparison is illustrative, it is based on the condition that $v_{\pi_1}^{(0)} = v_0 = v_{\pi_0}$. The two algorithms cannot be directly compared without this condition.

4.3.2 Truncated policy iteration algorithm

In a nutshell, the truncated policy iteration algorithm is the same as the policy iteration algorithm except that it merely runs a finite number of iterations in the policy evaluation step. Its implementation details are summarized in Algorithm 4.3. It is notable that $v_k$ and $v_k^{(j)}$ in the algorithm are not state values. Instead, they are approximations of the true state values because only a finite number of iterations are executed in the policy evaluation step.
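For reference only, the following Python sketch mirrors Algorithm 4.3 under the assumption that the model is stored as arrays `P[s, a, s']` for $p(s'|s,a)$ and `R[s, a]` for the expected immediate reward $\sum_r p(r|s,a)r$; these names and the tabular layout are illustrative, not part of the algorithm statement.

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma, j_truncated, tol=1e-6, max_iters=1000):
    """A minimal sketch of Algorithm 4.3 for a tabular model with deterministic policies.

    P[s, a, s2] plays the role of p(s'|s, a), and R[s, a] plays the role of the
    expected immediate reward sum_r p(r|s, a) * r.
    """
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)                  # v_0: arbitrary initial value estimate
    policy = np.zeros(n_states, dtype=int)  # pi_0: arbitrary initial (deterministic) policy

    for _ in range(max_iters):
        v_prev = v.copy()

        # Policy evaluation: only j_truncated sweeps, initialized with the previous estimate.
        P_pi = P[np.arange(n_states), policy]   # transition matrix under pi_k
        r_pi = R[np.arange(n_states), policy]   # reward vector under pi_k
        for _ in range(j_truncated):
            v = r_pi + gamma * P_pi @ v

        # Policy improvement: greedy policy with respect to the (approximate) v_k.
        q = R + gamma * P @ v                   # q_k(s, a) = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        policy = np.argmax(q, axis=1)

        if np.max(np.abs(v - v_prev)) < tol:    # stop when v_k has (numerically) converged
            break

    return v, policy
```

Setting `j_truncated = 1` makes this sketch behave like value iteration, while a very large `j_truncated` makes it behave like policy iteration, matching the discussion in Section 4.3.1.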

If $v_k$ does not equal $v_{\pi_k}$, will the algorithm still be able to find optimal policies? The answer is yes. Intuitively, truncated policy iteration lies between value iteration and policy iteration. On the one hand, it converges faster than the value iteration algorithm because it computes more than one iteration during the policy evaluation step. On the other hand, it converges more slowly than the policy iteration algorithm because it only computes a finite number of iterations. This intuition is illustrated in Figure 4.5 and is also supported by the following analysis.

Figure 4.5: An illustration of the relationships between the value iteration, policy iteration, and truncated policy iteration algorithms.

Proposition 4.1 (Value improvement). Consider the iterative algorithm in the policy evaluation step:

$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \ldots$$

If the initial guess is selected as $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$, it holds that

$$v_{\pi_k}^{(j+1)} \geq v_{\pi_k}^{(j)}$$

for $j = 0, 1, 2, \ldots$

Box 4.3: Proof of Proposition 4.1

First, since $v_{\pi_k}^{(j)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j-1)}$ and $v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}$, we have

$$v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)} = \gamma P_{\pi_k} \left( v_{\pi_k}^{(j)} - v_{\pi_k}^{(j-1)} \right) = \cdots = \gamma^{j} P_{\pi_k}^{j} \left( v_{\pi_k}^{(1)} - v_{\pi_k}^{(0)} \right). \tag{4.5}$$

Second, since $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$, we have

$$v_{\pi_k}^{(1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(0)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_{k-1}} \geq r_{\pi_{k-1}} + \gamma P_{\pi_{k-1}} v_{\pi_{k-1}} = v_{\pi_{k-1}} = v_{\pi_k}^{(0)},$$

where the inequality is due to $\pi_k = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_{k-1}})$. Substituting $v_{\pi_k}^{(1)} \geq v_{\pi_k}^{(0)}$ into (4.5), and noting that $\gamma^{j} > 0$ and that $P_{\pi_k}^{j}$ is a nonnegative matrix so that multiplication by $\gamma^{j} P_{\pi_k}^{j}$ preserves the elementwise inequality, yields $v_{\pi_k}^{(j+1)} \geq v_{\pi_k}^{(j)}$ for all $j$.

Notably, Proposition 4.1 requires the assumption that $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$. However, $v_{\pi_{k-1}}$ is unavailable in practice, and only $v_{k-1}$ is available. Nevertheless, Proposition 4.1 still sheds light on the convergence of the truncated policy iteration algorithm. A more in-depth discussion of this topic can be found in [2, Section 6.5].
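As a purely numerical illustration (not a replacement for the proof), one can check the monotonicity claim of Proposition 4.1 on a small randomly generated model: evaluate an arbitrary policy exactly, improve it greedily, and then run the policy evaluation of the improved policy starting from the previous state value. All names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# A random hypothetical tabular model: P[s, a, s'] for p(s'|s, a), R[s, a] for the expected reward.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# An arbitrary previous policy pi_{k-1} and its exact state value v_{pi_{k-1}}.
pi_prev = rng.integers(n_actions, size=n_states)
P_prev = P[np.arange(n_states), pi_prev]
r_prev = R[np.arange(n_states), pi_prev]
v_prev = np.linalg.solve(np.eye(n_states) - gamma * P_prev, r_prev)

# Greedy improvement: pi_k = argmax_pi (r_pi + gamma * P_pi * v_{pi_{k-1}}).
pi_k = np.argmax(R + gamma * P @ v_prev, axis=1)
P_k = P[np.arange(n_states), pi_k]
r_k = R[np.arange(n_states), pi_k]

# Policy evaluation of pi_k initialized at v_{pi_{k-1}}: the iterates should never decrease.
v = v_prev.copy()
for j in range(20):
    v_next = r_k + gamma * P_k @ v
    assert np.all(v_next >= v - 1e-12), "monotonicity violated"
    v = v_next
print("all iterates are elementwise non-decreasing")
```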

Up to now, the advantages of truncated policy iteration are clear. Compared to the policy iteration algorithm, the truncated one merely requires a finite number of iterations in the policy evaluation step and hence is more computationally efficient. Compared to value iteration, the truncated policy iteration algorithm can speed up its convergence rate by running a few more iterations in the policy evaluation step.
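Assuming the same hypothetical tabular model layout as in the earlier sketches, this trade-off can also be probed empirically: the following illustrative experiment counts how many outer iterations are needed until the value estimate stops changing, for several values of $j_{\text{truncated}}$. Larger values typically reduce the number of outer iterations at the cost of more computation per policy evaluation step.

```python
import numpy as np

def outer_iterations_needed(P, R, gamma, j_truncated, tol=1e-6, max_iters=10000):
    """Count outer iterations until the value estimate stops changing (a rough proxy for
    convergence speed). P, R, and all names here are illustrative."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for k in range(1, max_iters + 1):
        v_prev = v.copy()
        P_pi = P[np.arange(n_states), policy]
        r_pi = R[np.arange(n_states), policy]
        for _ in range(j_truncated):                       # truncated policy evaluation
            v = r_pi + gamma * P_pi @ v
        policy = np.argmax(R + gamma * P @ v, axis=1)      # greedy policy improvement
        if np.max(np.abs(v - v_prev)) < tol:
            return k
    return max_iters

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 10, 4, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                          # normalize to obtain p(s'|s, a)
R = rng.random((n_states, n_actions))

for j in (1, 5, 20, 100):                                  # j = 1 ~ value iteration; large j ~ policy iteration
    print(j, outer_iterations_needed(P, R, gamma, j))
```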