
3.4 Solving an optimal policy from the BOE

With the preparation in the last section, we are ready to solve the BOE to obtain the optimal state value $v^*$ and an optimal policy $\pi^*$.

\diamond Solving $v^*$: If $v^*$ is a solution of the BOE, then it satisfies

$$v^* = \max_{\pi \in \Pi} (r_\pi + \gamma P_\pi v^*).$$

Clearly, $v^*$ is a fixed point of $f$ because $v^* = f(v^*)$. Then, the contraction mapping theorem suggests the following results.

Theorem 3.3 (Existence, uniqueness, and algorithm). For the BOE $v = f(v) = \max_{\pi \in \Pi} (r_\pi + \gamma P_\pi v)$, there always exists a unique solution $v^*$, which can be solved iteratively by

$$v_{k+1} = f(v_k) = \max_{\pi \in \Pi} (r_\pi + \gamma P_\pi v_k), \quad k = 0, 1, 2, \ldots$$

The value of $v_k$ converges to $v^*$ exponentially fast as $k \to \infty$ for any initial guess $v_0$.
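This exponential rate can be made explicit. Since $f$ is a contraction mapping with contraction factor $\gamma$ (as discussed in the last section), we have

$$\|v_k - v^*\| = \|f(v_{k-1}) - f(v^*)\| \leq \gamma \|v_{k-1} - v^*\| \leq \cdots \leq \gamma^k \|v_0 - v^*\|,$$

so the error decays to zero geometrically because $\gamma \in (0, 1)$.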

The proof of this theorem directly follows from the contraction mapping theorem since $f(v)$ is a contraction mapping. This theorem is important because it answers some fundamental questions.

  • Existence of $v^*$: The solution of the BOE always exists.

  • Uniqueness of $v^*$: The solution $v^*$ is always unique.

  • Algorithm for solving $v^*$: The value of $v^*$ can be computed by the iterative algorithm suggested by Theorem 3.3. This iterative algorithm is known as value iteration. Its implementation will be introduced in detail in Chapter 4 (a minimal sketch is given below); in the present chapter, we mainly focus on the fundamental properties of the BOE.
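To make the iteration concrete, the following is a minimal sketch of value iteration for a small tabular MDP, not the implementation developed in Chapter 4. The array names and shapes (P, r), the discount rate gamma, and the stopping tolerance tol are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Minimal sketch of v_{k+1} = max_pi (r_pi + gamma * P_pi v_k).

    P: array of shape (num_actions, num_states, num_states), P[a, s, s'] = p(s'|s, a).
    r: array of shape (num_actions, num_states), r[a, s] = expected immediate reward of (s, a).
    Returns an approximation of the optimal state value v*.
    """
    num_actions, num_states, _ = P.shape
    v = np.zeros(num_states)                 # any initial guess v_0 works
    while True:
        # q[a, s] = r(s, a) + gamma * sum_{s'} p(s'|s, a) v(s')
        q = r + gamma * P @ v                # shape (num_actions, num_states)
        v_new = q.max(axis=0)                # maximize over actions at every state
        if np.max(np.abs(v_new - v)) < tol:  # stop when the update is negligible
            return v_new
        v = v_new
```

Each sweep applies $f$ once to the current iterate, so the per-iteration cost is one maximization over actions at every state; the policy itself is recovered afterwards, as discussed next.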

\diamond Solving $\pi^*$: Once the value of $v^*$ has been obtained, we can easily obtain $\pi^*$ by solving

$$\pi^* = \arg\max_{\pi \in \Pi} (r_\pi + \gamma P_\pi v^*). \tag{3.6}$$

The value of $\pi^*$ will be given in Theorem 3.5. Substituting (3.6) into the BOE yields

$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*.$$

This is exactly the Bellman equation of $\pi^*$. Therefore, $v^* = v_{\pi^*}$ is the state value of $\pi^*$, and the BOE is a special Bellman equation whose corresponding policy is $\pi^*$.

At this point, although we can solve for $v^*$ and $\pi^*$, it is still unclear whether the solution is optimal. The following theorem reveals the optimality of the solution.

Theorem 3.4 (Optimality of $v^*$ and $\pi^*$). The solution $v^*$ is the optimal state value, and $\pi^*$ is an optimal policy. That is, for any policy $\pi$, it holds that

$$v^* = v_{\pi^*} \geq v_\pi,$$

where $v_\pi$ is the state value of $\pi$, and $\geq$ denotes elementwise comparison.

Now, it is clear why we must study the BOE: its solution corresponds to optimal state values and optimal policies. The proof of the above theorem is given in the following box.

Box 3.3: Proof of Theorem 3.4

For any policy $\pi$, it holds that

$$v_\pi = r_\pi + \gamma P_\pi v_\pi.$$

Since

$$v^* = \max_\pi (r_\pi + \gamma P_\pi v^*) = r_{\pi^*} + \gamma P_{\pi^*} v^* \geq r_\pi + \gamma P_\pi v^*,$$

we have

$$v^* - v_\pi \geq (r_\pi + \gamma P_\pi v^*) - (r_\pi + \gamma P_\pi v_\pi) = \gamma P_\pi (v^* - v_\pi).$$

Repeatedly applying the above inequality gives $v^* - v_\pi \geq \gamma P_\pi (v^* - v_\pi) \geq \gamma^2 P_\pi^2 (v^* - v_\pi) \geq \cdots \geq \gamma^n P_\pi^n (v^* - v_\pi)$, where each step uses the fact that multiplying both sides by the nonnegative matrix $\gamma P_\pi$ preserves the elementwise inequality. It follows that

$$v^* - v_\pi \geq \lim_{n \to \infty} \gamma^n P_\pi^n (v^* - v_\pi) = 0,$$

where the last equality holds because $\gamma < 1$ and $P_\pi^n$ is a nonnegative matrix with all its elements less than or equal to 1 (since $P_\pi^n \mathbf{1} = \mathbf{1}$). Therefore, $v^* \geq v_\pi$ for any $\pi$.
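As a quick numerical illustration of Theorem 3.4 (not part of the proof), one can compare $v^*$ obtained by value iteration with the state value of an arbitrary policy on a randomly generated MDP. The helper policy_value and the reuse of value_iteration from the earlier sketch are illustrative assumptions.

```python
import numpy as np

def policy_value(P, r, pi, gamma=0.9):
    """Solve the Bellman equation v_pi = r_pi + gamma * P_pi v_pi exactly for a given policy."""
    num_states = P.shape[1]
    r_pi = np.einsum('as,as->s', pi, r)        # r_pi[s] = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum('as,ast->st', pi, P)      # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)

# Random tabular MDP with 3 actions and 5 states.
rng = np.random.default_rng(0)
P = rng.random((3, 5, 5)); P /= P.sum(axis=2, keepdims=True)   # p(.|s, a) sums to 1
r = rng.random((3, 5))
pi = rng.random((3, 5)); pi /= pi.sum(axis=0, keepdims=True)   # pi(.|s) sums to 1

v_star = value_iteration(P, r)                                 # from the earlier sketch
print(np.all(v_star >= policy_value(P, r, pi) - 1e-6))         # elementwise v* >= v_pi: True
```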

We next examine $\pi^*$ in (3.6) more closely. In particular, the following theorem shows that there always exists a deterministic greedy policy that is optimal.

Theorem 3.5 (Greedy optimal policy). For any $s \in \mathcal{S}$, the deterministic greedy policy

$$\pi^*(a|s) = \begin{cases} 1, & a = a^*(s), \\ 0, & a \neq a^*(s), \end{cases} \tag{3.7}$$

is an optimal policy for solving the BOE. Here,

$$a^*(s) = \arg\max_a q^*(s, a),$$

where

$$q^*(s, a) \doteq \sum_{r \in \mathcal{R}} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) v^*(s').$$

Box 3.4: Proof of Theorem 3.5

While the matrix-vector form of the optimal policy is $\pi^* = \arg\max_\pi (r_\pi + \gamma P_\pi v^*)$, its elementwise form is

$$\pi^*(s) = \arg\max_{\pi \in \Pi} \sum_{a \in \mathcal{A}} \pi(a|s) \underbrace{\left( \sum_{r \in \mathcal{R}} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, a) v^*(s') \right)}_{q^*(s, a)}, \quad s \in \mathcal{S}.$$

Since $\pi(a|s) \geq 0$ and $\sum_{a \in \mathcal{A}} \pi(a|s) = 1$, we have $\sum_{a \in \mathcal{A}} \pi(a|s) q^*(s, a) \leq \max_{a \in \mathcal{A}} q^*(s, a)$, and this maximum is attained when $\pi(s)$ places all of its probability on the action with the greatest $q^*(s, a)$. Hence, the greedy policy in (3.7) solves (3.6).
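As a complement to Theorem 3.5, the following is a minimal sketch of extracting the deterministic greedy policy (3.7) from a given $v^*$ in the tabular setting of the earlier value iteration sketch; the array names (P, r, v_star) and shapes are illustrative assumptions.

```python
import numpy as np

def greedy_policy(P, r, v_star, gamma=0.9):
    """Extract the deterministic greedy policy (3.7) from v*.

    Computes q*(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) v*(s') and puts
    probability 1 on a*(s) = argmax_a q*(s, a) at every state.
    """
    q = r + gamma * P @ v_star               # q[a, s] = q*(s, a)
    num_actions, num_states = q.shape
    pi = np.zeros((num_actions, num_states))
    pi[q.argmax(axis=0), np.arange(num_states)] = 1.0   # pi*(a*(s)|s) = 1
    return pi
```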

The policy in (3.7) is called greedy because it seeks the actions with the greatest $q^*(s, a)$. Finally, we discuss two important properties of $\pi^*$.

\diamond Uniqueness of optimal policies: Although the value of $v^*$ is unique, the optimal policy that corresponds to $v^*$ may not be unique. This can be easily verified by counterexamples. For example, the two policies shown in Figure 3.3 are both optimal.

\diamond Stochasticity of optimal policies: An optimal policy can be either stochastic or deterministic, as demonstrated in Figure 3.3. However, it is certain that there always exists a deterministic optimal policy according to Theorem 3.5 (see also the brief illustration below).
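To see concretely why both situations can arise, suppose for illustration that at some state $s$ two actions $a_1 \neq a_2$ both attain $\max_a q^*(s, a)$. By the argument in Box 3.4, any policy that splits its probability between them, say $\pi(a_1|s) = \alpha$ and $\pi(a_2|s) = 1 - \alpha$ with $\alpha \in [0, 1]$, satisfies

$$\sum_{a \in \mathcal{A}} \pi(a|s) q^*(s, a) = \alpha q^*(s, a_1) + (1 - \alpha) q^*(s, a_2) = \max_{a \in \mathcal{A}} q^*(s, a),$$

and hence is also optimal at $s$. Taking $\alpha \in \{0, 1\}$ recovers deterministic optimal policies, while $\alpha \in (0, 1)$ gives stochastic ones.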


Figure 3.3: Examples for demonstrating that optimal policies may not be unique. The two policies are different but are both optimal.