2.2_Motivating_example_2_How_to_calculate_returns

2.2 Motivating example 2: How to calculate returns?

While we have demonstrated the importance of returns, a question that immediately follows is how to calculate the returns when following a given policy.

There are two ways to calculate returns.

The first is simply by definition: a return equals the discounted sum of all the rewards collected along a trajectory. Consider the example in Figure 2.3. Let viv_{i} denote the return obtained by starting from sis_{i} for i=1,2,3,4i = 1,2,3,4 . Then, the returns obtained when


Figure 2.3: An example for demonstrating how to calculate returns. There are no target or forbidden cells in this example.

starting from the four states in Figure 2.3 can be calculated as

v1=r1+γr2+γ2r3+,v _ {1} = r _ {1} + \gamma r _ {2} + \gamma^ {2} r _ {3} + \dots ,
v2=r2+γr3+γ2r4+,(2.2)v _ {2} = r _ {2} + \gamma r _ {3} + \gamma^ {2} r _ {4} + \dots , \tag {2.2}
v3=r3+γr4+γ2r1+,v _ {3} = r _ {3} + \gamma r _ {4} + \gamma^ {2} r _ {1} + \dots ,
v4=r4+γr1+γ2r2+.v _ {4} = r _ {4} + \gamma r _ {1} + \gamma^ {2} r _ {2} + \dots .

The second way, which is more important, is based on the idea of bootstrapping. By observing the expressions of the returns in (2.2), we can rewrite them as

v1=r1+γ(r2+γr3+)=r1+γv2,v _ {1} = r _ {1} + \gamma (r _ {2} + \gamma r _ {3} + \dots) = r _ {1} + \gamma v _ {2},
v2=r2+γ(r3+γr4+)=r2+γv3,(2.3)v _ {2} = r _ {2} + \gamma \left(r _ {3} + \gamma r _ {4} + \dots\right) = r _ {2} + \gamma v _ {3}, \tag {2.3}
v3=r3+γ(r4+γr1+)=r3+γv4,v _ {3} = r _ {3} + \gamma (r _ {4} + \gamma r _ {1} + \dots) = r _ {3} + \gamma v _ {4},
v4=r4+γ(r1+γr2+)=r4+γv1.v _ {4} = r _ {4} + \gamma (r _ {1} + \gamma r _ {2} + \dots) = r _ {4} + \gamma v _ {1}.

The above equations indicate an interesting phenomenon that the values of the returns rely on each other. More specifically, v1v_{1} relies on v2v_{2} , v2v_{2} relies on v3v_{3} , v3v_{3} relies on v4v_{4} , and v4v_{4} relies on v1v_{1} . This reflects the idea of bootstrapping, which is to obtain the values of some quantities from themselves.

At first glance, bootstrapping is an endless loop because the calculation of an unknown value relies on another unknown value. In fact, bootstrapping is easier to understand if we view it from a mathematical perspective. In particular, the equations in (2.3) can be reformed into a linear matrix-vector equation:

[v1v2v3v4]v=[r1r2r3r4]+[γv2γv3γv4γv1]=[r1r2r3r4]r+γ[0100001000011000]P[v1v2v3v4]v,\underbrace {\left[ \begin{array}{c} v _ {1} \\ v _ {2} \\ v _ {3} \\ v _ {4} \end{array} \right]} _ {v} = \left[ \begin{array}{c} r _ {1} \\ r _ {2} \\ r _ {3} \\ r _ {4} \end{array} \right] + \left[ \begin{array}{c} \gamma v _ {2} \\ \gamma v _ {3} \\ \gamma v _ {4} \\ \gamma v _ {1} \end{array} \right] = \underbrace {\left[ \begin{array}{c} r _ {1} \\ r _ {2} \\ r _ {3} \\ r _ {4} \end{array} \right]} _ {r} + \gamma \underbrace {\left[ \begin{array}{c c c c} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{array} \right]} _ {P} \underbrace {\left[ \begin{array}{c} v _ {1} \\ v _ {2} \\ v _ {3} \\ v _ {4} \end{array} \right]} _ {v},

which can be written compactly as

v=r+γPv.v = r + \gamma P v.

Thus, the value of vv can be calculated easily as v=(IγP)1rv = (I - \gamma P)^{-1}r , where II is the identity matrix with appropriate dimensions. One may ask whether IγPI - \gamma P is always invertible. The answer is yes and explained in Section 2.7.1.

In fact, (2.3) is the Bellman equation for this simple example. Although it is simple, (2.3) demonstrates the core idea of the Bellman equation: the return obtained by starting from one state depends on those obtained when starting from other states. The idea of bootstrapping and the Bellman equation for general scenarios will be formalized in the following sections.