
2.7 Solving state values from the Bellman equation

Calculating the state values of a given policy is a fundamental problem in reinforcement learning. This problem is often referred to as policy evaluation. In this section, we present two methods for calculating state values from the Bellman equation.

2.7.1 Closed-form solution

Since $v_\pi = r_\pi + \gamma P_\pi v_\pi$ is a simple linear equation, its closed-form solution can be easily obtained as

$$v_\pi = \left(I - \gamma P_\pi\right)^{-1} r_\pi.$$
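As a quick numerical illustration (not part of the original text), the closed-form solution can be computed with a few lines of numpy. The 3-state transition matrix and reward vector below are hypothetical placeholders; the sketch solves the linear system $(I - \gamma P_\pi)v = r_\pi$ rather than forming the inverse explicitly, which is generally preferable numerically.

```python
import numpy as np

# A minimal sketch with a hypothetical 3-state example (not from the book).
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],   # P_pi[i, j] = p_pi(s_j | s_i)
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])    # expected immediate reward in each state

# Solve (I - gamma * P_pi) v = r_pi instead of inverting the matrix explicitly.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)                          # approx. [6.69, 8.18, 10.0]
```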

Some properties of $(I - \gamma P_\pi)^{-1}$ are given below.

\diamond $I - \gamma P_\pi$ is invertible. The proof is as follows. According to the Gershgorin circle theorem [4], every eigenvalue of $I - \gamma P_\pi$ lies within at least one of the Gershgorin circles. The $i$th Gershgorin circle has its center at $[I - \gamma P_\pi]_{ii} = 1 - \gamma p_\pi(s_i|s_i)$ and a radius equal to $\sum_{j \neq i} \left|[I - \gamma P_\pi]_{ij}\right| = \sum_{j \neq i} \gamma p_\pi(s_j|s_i)$. Since $\gamma < 1$, the radius is less than the magnitude of the center: $\sum_{j \neq i} \gamma p_\pi(s_j|s_i) = \gamma\bigl(1 - p_\pi(s_i|s_i)\bigr) < 1 - \gamma p_\pi(s_i|s_i)$. Therefore, no Gershgorin circle encircles the origin, and hence no eigenvalue of $I - \gamma P_\pi$ is zero, which implies that $I - \gamma P_\pi$ is invertible.
\diamond $(I - \gamma P_\pi)^{-1} \geq I$, meaning that every element of $(I - \gamma P_\pi)^{-1}$ is nonnegative and, more specifically, no less than the corresponding element of the identity matrix. This is because $P_\pi$ has nonnegative entries, and hence $(I - \gamma P_\pi)^{-1} = I + \gamma P_\pi + \gamma^2 P_\pi^2 + \dots \geq I \geq 0$.
\diamond For any vector $r \geq 0$, it holds that $(I - \gamma P_\pi)^{-1} r \geq r \geq 0$. This property follows from the second property because $[(I - \gamma P_\pi)^{-1} - I] r \geq 0$. As a consequence, if $r_1 \geq r_2$, we have $(I - \gamma P_\pi)^{-1} r_1 \geq (I - \gamma P_\pi)^{-1} r_2$.
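These properties are easy to check numerically. The sketch below is an illustration rather than a proof: it draws a random row-stochastic matrix (a hypothetical $P_\pi$, assuming numpy) and verifies the elementwise inequality $(I - \gamma P_\pi)^{-1} \geq I$ together with the resulting monotonicity.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical random row-stochastic matrix P_pi (each row sums to one).
P_pi = rng.random((4, 4))
P_pi /= P_pi.sum(axis=1, keepdims=True)

M = np.linalg.inv(np.eye(4) - gamma * P_pi)

# Property: (I - gamma * P_pi)^{-1} >= I elementwise.
assert np.all(M - np.eye(4) >= -1e-12)

# Property: r1 >= r2 elementwise implies M @ r1 >= M @ r2 elementwise.
r2 = rng.random(4)
r1 = r2 + rng.random(4)              # r1 >= r2 by construction
assert np.all(M @ r1 >= M @ r2 - 1e-12)
print("properties hold for this example")
```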

2.7.2 Iterative solution

Although the closed-form solution is useful for theoretical analysis purposes, it is not applicable in practice because it involves a matrix inversion operation, which still needs to be calculated by other numerical algorithms. In fact, we can directly solve the Bellman equation using the following iterative algorithm:

$$v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad k = 0, 1, 2, \dots \tag{2.11}$$

This algorithm generates a sequence of values $\{v_0, v_1, v_2, \ldots\}$, where $v_0 \in \mathbb{R}^n$ is an initial guess of $v_\pi$. It holds that

$$v_k \rightarrow v_\pi = \left(I - \gamma P_\pi\right)^{-1} r_\pi, \quad \text{as } k \rightarrow \infty. \tag{2.12}$$

Interested readers may see the proof in Box 2.1.
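The iteration in (2.11) is straightforward to implement. The following sketch (assuming numpy; the 3-state example is the same hypothetical one used above) runs the iteration until the update falls below a tolerance and compares the result with the closed-form solution.

```python
import numpy as np

def policy_evaluation(P_pi, r_pi, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = r_pi + gamma * P_pi @ v_k until the update is tiny."""
    v = np.zeros_like(r_pi, dtype=float)        # v_0: arbitrary initial guess
    for _ in range(max_iter):
        v_next = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

# Hypothetical 3-state example; the iterates converge to the
# closed-form solution (I - gamma * P_pi)^{-1} r_pi.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])
v_iter = policy_evaluation(P_pi, r_pi)
v_closed = np.linalg.solve(np.eye(3) - 0.9 * P_pi, r_pi)
print(np.max(np.abs(v_iter - v_closed)))        # close to zero
```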

Box 2.1: Convergence proof of (2.12)

Define the error as $\delta_k = v_k - v_\pi$. We only need to show that $\delta_k \to 0$. Substituting $v_{k+1} = \delta_{k+1} + v_\pi$ and $v_k = \delta_k + v_\pi$ into $v_{k+1} = r_\pi + \gamma P_\pi v_k$ gives

$$\delta_{k+1} + v_\pi = r_\pi + \gamma P_\pi (\delta_k + v_\pi),$$

which can be rewritten as

$$\begin{aligned}
\delta_{k+1} &= -v_\pi + r_\pi + \gamma P_\pi \delta_k + \gamma P_\pi v_\pi \\
&= \gamma P_\pi \delta_k - v_\pi + \left(r_\pi + \gamma P_\pi v_\pi\right) \\
&= \gamma P_\pi \delta_k.
\end{aligned}$$

As a result,

$$\delta_{k+1} = \gamma P_\pi \delta_k = \gamma^2 P_\pi^2 \delta_{k-1} = \dots = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$

Since every entry of $P_\pi$ is nonnegative and no greater than one, we have $0 \leq P_\pi^k \leq 1$ elementwise for any $k$; in fact, $P_\pi^k$ is itself a stochastic matrix because products of stochastic matrices are stochastic, so every entry of $P_\pi^k$ is no greater than 1. On the other hand, since $\gamma < 1$, we know that $\gamma^k \to 0$, and hence $\delta_{k+1} = \gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$ as $k \to \infty$.
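The geometric decay of the error can also be observed numerically. The short check below (an added illustration, assuming numpy and the same hypothetical 3-state example as before) compares $\|\delta_{k+1}\|_\infty$ with the bound $\gamma^{k+1}\|\delta_0\|_\infty$, which holds because $P_\pi$ is row-stochastic.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([0.0, 0.0, 1.0])
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # ground truth

v = np.zeros(3)                                          # v_0 = 0, so delta_0 = -v_pi
for k in range(5):
    v = r_pi + gamma * P_pi @ v
    error = np.max(np.abs(v - v_pi))                     # ||delta_{k+1}||_inf
    bound = gamma ** (k + 1) * np.max(np.abs(v_pi))      # gamma^{k+1} * ||delta_0||_inf
    print(f"k+1 = {k + 1}: error = {error:.4f} <= bound = {bound:.4f}")
```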

2.7.3 Illustrative examples

We next apply the algorithm in (2.11) to solve the state values of some examples.

The examples are shown in Figure 2.7. The orange cells represent forbidden areas, and the blue cell represents the target area. The reward settings are $r_{\mathrm{boundary}} = r_{\mathrm{forbidden}} = -1$ and $r_{\mathrm{target}} = 1$. Here, the discount rate is $\gamma = 0.9$.

Figure 2.7: Examples of policies and their corresponding state values. (a) Two "good" policies and their state values. The state values of the two policies are the same, but the two policies are different at the top two states in the fourth column. (b) Two "bad" policies and their state values. The state values are smaller than those of the "good" policies.

Figure 2.7(a) shows two "good" policies and their corresponding state values obtained by (2.11). The two policies have the same state values but differ at the top two states in the fourth column. Therefore, we know that different policies may have the same state values.

Figure 2.7(b) shows two "bad" policies and their corresponding state values. These two policies are bad because the actions of many states are intuitively unreasonable. Such intuition is supported by the obtained state values. As can be seen, the state values of these two policies are negative and much smaller than those of the good policies in Figure 2.7(a).
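The grids in Figure 2.7 are not reproduced here, but the same workflow can be sketched on a toy example. The code below assumes a hypothetical 2x2 grid with the target in the bottom-right cell, a deterministic policy, and the reward settings above (forbidden cells and boundary penalties are omitted for brevity); it assembles $P_\pi$ and $r_\pi$ and evaluates the policy with the iteration in (2.11).

```python
import numpy as np

gamma = 0.9
n = 4                                 # hypothetical 2x2 grid: cells 0 1 / 2 3, target = 3

# Hypothetical deterministic policy: the cell each state's action leads to (3 stays put).
next_cell = {0: 1, 1: 3, 2: 3, 3: 3}
r_target, r_other = 1.0, 0.0

P_pi = np.zeros((n, n))
r_pi = np.zeros(n)
for s, s_next in next_cell.items():
    P_pi[s, s_next] = 1.0                               # deterministic transition
    r_pi[s] = r_target if s_next == 3 else r_other      # reward for entering the target

# Iterative solution (2.11).
v = np.zeros(n)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v
print(v.reshape(2, 2))                # approx. [[9, 10], [10, 10]]
```

States closer to the target (or on a path that reaches it sooner) receive larger values, which mirrors how the state values in Figure 2.7 distinguish "good" policies from "bad" ones.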