
2.4 Bellman equation

We now introduce the Bellman equation, a mathematical tool for analyzing state values. In a nutshell, the Bellman equation is a set of linear equations that describe the relationships between the values of all the states.

We next derive the Bellman equation. First, note that $G_t$ can be rewritten as

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \\
&= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \ldots \right) \\
&= R_{t+1} + \gamma G_{t+1},
\end{aligned}
$$

where $G_{t+1} = R_{t+2} + \gamma R_{t+3} + \ldots$. This equation establishes the relationship between $G_t$ and $G_{t+1}$. Then, the state value can be written as

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[G_t | S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
&= \mathbb{E}[R_{t+1} | S_t = s] + \gamma \mathbb{E}[G_{t+1} | S_t = s].
\end{aligned} \tag{2.4}
$$
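Before analyzing the two expectations in (2.4), the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ can be verified numerically. Below is a minimal Python sketch (not from the original text) that truncates the return after a few steps; the reward sequence and discount rate are made-up illustrative values.

```python
# Check G_t = R_{t+1} + gamma * G_{t+1} on a truncated reward sequence.
# All numbers below are illustrative, not taken from the book.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]   # rewards[t] plays the role of R_{t+1}

def discounted_return(t):
    """G_t = R_{t+1} + gamma*R_{t+2} + ..., truncated at the end of the list."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

for t in range(len(rewards) - 1):
    lhs = discounted_return(t)                             # G_t
    rhs = rewards[t] + gamma * discounted_return(t + 1)    # R_{t+1} + gamma*G_{t+1}
    assert abs(lhs - rhs) < 1e-12
```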

The two terms in (2.4) are analyzed below.

The first term, $\mathbb{E}[R_{t+1} | S_t = s]$, is the expectation of the immediate rewards. By using the law of total expectation (Appendix A), it can be calculated as

$$
\begin{aligned}
\mathbb{E}[R_{t+1} | S_t = s] &= \sum_{a \in \mathcal{A}} \pi(a|s) \mathbb{E}[R_{t+1} | S_t = s, A_t = a] \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s,a) r.
\end{aligned} \tag{2.5}
$$

Here, $\mathcal{A}$ and $\mathcal{R}$ are the sets of possible actions and rewards, respectively. It should be noted that $\mathcal{A}$ may be different for different states, in which case it should be written as $\mathcal{A}(s)$. Similarly, $\mathcal{R}$ may also depend on $(s,a)$. We drop the dependence on $s$ or $(s,a)$ for the sake of simplicity in this book. Nevertheless, the conclusions remain valid in the presence of such dependence.
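The computation in (2.5) is simply a weighted average over actions and rewards. The following sketch evaluates it for a single state $s$ using hypothetical tables `pi` and `p_r` (made-up numbers, two actions and three possible rewards).

```python
# Expected immediate reward at a fixed state s, following (2.5).
# pi[a] = pi(a|s) and p_r[a][i] = p(rewards[i] | s, a); all values are illustrative.
actions = [0, 1]
rewards = [-1.0, 0.0, 1.0]

pi = {0: 0.4, 1: 0.6}
p_r = {0: [0.1, 0.7, 0.2],
       1: [0.0, 0.5, 0.5]}

expected_immediate_reward = sum(
    pi[a] * sum(p_r[a][i] * r for i, r in enumerate(rewards))
    for a in actions
)
print(expected_immediate_reward)   # sum_a pi(a|s) sum_r p(r|s,a) r = 0.34
```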

The second term, $\mathbb{E}[G_{t+1} | S_t = s]$, is the expectation of the future rewards. It can be calculated as

$$
\begin{aligned}
\mathbb{E}[G_{t+1} | S_t = s] &= \sum_{s' \in \mathcal{S}} \mathbb{E}[G_{t+1} | S_t = s, S_{t+1} = s'] p(s'|s) \\
&= \sum_{s' \in \mathcal{S}} \mathbb{E}[G_{t+1} | S_{t+1} = s'] p(s'|s) \quad (\text{due to the Markov property}) \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s') p(s'|s) \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s') \sum_{a \in \mathcal{A}} p(s'|s,a) \pi(a|s).
\end{aligned} \tag{2.6}
$$

The above derivation uses the fact that $\mathbb{E}[G_{t+1} | S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} | S_{t+1} = s']$, which follows from the Markov property: the future rewards depend only on the present state rather than the previous ones.
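Similarly, (2.6) is a weighted average of the next-state values. The sketch below evaluates it for a single state $s$, again with hypothetical tables; the values $v_\pi(s')$ are treated as given purely for illustration.

```python
# Expected future return at a fixed state s, following (2.6).
# pi[a] = pi(a|s), p_s[a][sp] = p(sp | s, a), v[sp] = v_pi(sp); all illustrative.
actions = [0, 1]
states = [0, 1, 2]

pi = {0: 0.4, 1: 0.6}
p_s = {0: [0.8, 0.2, 0.0],
       1: [0.0, 0.3, 0.7]}
v = [1.0, 0.5, 2.0]   # assumed known here; computing it is the point of (2.7)

# p(s'|s) = sum_a p(s'|s,a) * pi(a|s)
p_next = [sum(p_s[a][sp] * pi[a] for a in actions) for sp in states]

expected_future_return = sum(v[sp] * p_next[sp] for sp in states)
print(expected_future_return)
```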

Substituting (2.5)-(2.6) into (2.4) yields

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}[R_{t+1} | S_t = s] + \gamma \mathbb{E}[G_{t+1} | S_t = s] \\
&= \underbrace{\sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s,a) r}_{\text{mean of immediate rewards}} + \underbrace{\gamma \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} p(s'|s,a) v_\pi(s')}_{\text{mean of future rewards}} \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \left[ \sum_{r \in \mathcal{R}} p(r|s,a) r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a) v_\pi(s') \right], \quad \text{for all } s \in \mathcal{S}.
\end{aligned} \tag{2.7}
$$

This equation is the Bellman equation, which characterizes the relationships among state values. It is a fundamental tool for designing and analyzing reinforcement learning algorithms.

The Bellman equation seems complex at first glance. In fact, it has a clear structure. Some remarks are given below.

$\diamond$ $v_\pi(s)$ and $v_\pi(s')$ are unknown state values to be calculated. It may be confusing to beginners how the unknown $v_\pi(s)$ can be calculated when it relies on another unknown $v_\pi(s')$. It must be noted that the Bellman equation refers to a set of linear equations for all states rather than a single equation. If we put these equations together, it becomes clear how to calculate all the state values. Details will be given in Section 2.7; a small numerical sketch is also given after these remarks.
$\diamond$ $\pi(a|s)$ is a given policy. Since state values can be used to evaluate a policy, solving for the state values from the Bellman equation is a policy evaluation process, which is an important step in many reinforcement learning algorithms, as we will see later in the book.
$\diamond$ $p(r|s,a)$ and $p(s'|s,a)$ represent the system model. We will first show how to calculate the state values with this model in Section 2.7, and then show how to do that without the model by using model-free algorithms later in this book.
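To make the "set of linear equations" view concrete, the sketch below (a forward glimpse of Section 2.7, with a made-up three-state model and policy) stacks (2.7) over all states and solves the resulting linear system with NumPy. Here `r[s, a]` stores the expected immediate reward $\sum_r p(r|s,a) r$; all array entries are hypothetical.

```python
import numpy as np

# Solve the Bellman equation (2.7) for all states at once.
# The policy pi, transition model P, and expected rewards r are illustrative.
gamma = 0.9

pi = np.array([[0.5, 0.5],        # pi[s, a] = pi(a|s)
               [0.2, 0.8],
               [1.0, 0.0]])

P = np.array([[[0.7, 0.3, 0.0],   # P[s, a, s'] = p(s'|s, a)
               [0.0, 0.4, 0.6]],
              [[0.1, 0.9, 0.0],
               [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 1.0],
               [0.5, 0.5, 0.0]]])

r = np.array([[1.0, 0.0],         # r[s, a] = sum_r p(r|s,a) * r
              [0.5, 2.0],
              [0.0, 1.0]])

# Average over the policy: r_pi(s) = sum_a pi(a|s) r(s,a),
# P_pi(s, s') = sum_a pi(a|s) p(s'|s,a).
r_pi = np.einsum('sa,sa->s', pi, r)
P_pi = np.einsum('sa,sap->sp', pi, P)

# Stacked over all states, (2.7) reads v = r_pi + gamma * P_pi v, a linear system in v.
v_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Each solved value satisfies the right-hand side of (2.7).
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
print(v_pi)
```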

In addition to the expression in (2.7), readers may also encounter other expressions of the Bellman equation in the literature. We next introduce two equivalent expressions.

First, it follows from the law of total probability that

$$
\begin{aligned}
p(s'|s,a) &= \sum_{r \in \mathcal{R}} p(s', r | s, a), \\
p(r|s,a) &= \sum_{s' \in \mathcal{S}} p(s', r | s, a).
\end{aligned}
$$

Then, equation (2.7) can be rewritten as

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r | s, a) \bigl[ r + \gamma v_\pi(s') \bigr].
$$

This is the expression used in [3].
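As a quick check of this equivalence, the sketch below starts from a made-up joint model $p(s',r|s,a)$ for one fixed state-action pair and verifies that the bracketed term computed from the joint distribution matches the one computed from its marginals, as used in (2.7).

```python
import numpy as np

# Equivalence of the joint-distribution form and (2.7) for one fixed (s, a).
# The joint table and the values below are illustrative.
reward_values = np.array([0.0, 1.0])
states = [0, 1]

joint = np.array([[0.1, 0.3],     # joint[sp, i] = p(sp, reward_values[i] | s, a)
                  [0.2, 0.4]])

p_s_prime = joint.sum(axis=1)     # p(s'|s,a) = sum_r p(s',r|s,a)
p_r = joint.sum(axis=0)           # p(r|s,a)  = sum_{s'} p(s',r|s,a)

gamma = 0.9
v = np.array([1.5, 2.0])          # v_pi(s'), assumed given for illustration

joint_form = sum(joint[sp, i] * (r + gamma * v[sp])
                 for sp in states for i, r in enumerate(reward_values))
marginal_form = p_r @ reward_values + gamma * p_s_prime @ v
assert np.isclose(joint_form, marginal_form)
```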

Second, the reward $r$ may depend solely on the next state $s'$ in some problems. As a result, we can write the reward as $r(s')$ and hence $p(r(s')|s,a) = p(s'|s,a)$. Substituting this into (2.7) gives

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} p(s'|s,a) \bigl[ r(s') + \gamma v_\pi(s') \bigr].
$$
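The sketch below (again with made-up numbers for a single state) checks that this special-case expression agrees with (2.7) when $p(r|s,a)$ is the distribution induced by $p(s'|s,a)$ and $r(s')$.

```python
# Consistency of the r(s') form with (2.7) at one fixed state s.
# pi, p_s, r_next, and v are illustrative placeholders.
gamma = 0.9
states = [0, 1]
actions = [0, 1]

pi = {0: 0.5, 1: 0.5}                     # pi(a|s)
p_s = {0: [0.9, 0.1], 1: [0.2, 0.8]}      # p(s'|s,a)
r_next = [0.0, 1.0]                       # r(s'): reward received upon reaching s'
v = [1.5, 2.0]                            # v_pi(s'), assumed given

# Special-case form: sum_a pi(a|s) sum_{s'} p(s'|s,a) [r(s') + gamma v(s')]
v_special = sum(pi[a] * sum(p_s[a][sp] * (r_next[sp] + gamma * v[sp])
                            for sp in states)
                for a in actions)

# General form (2.7), with sum_r p(r|s,a) r = sum_{s'} p(s'|s,a) r(s').
v_general = sum(pi[a] * (sum(p_s[a][sp] * r_next[sp] for sp in states)
                         + gamma * sum(p_s[a][sp] * v[sp] for sp in states))
                for a in actions)

assert abs(v_special - v_general) < 1e-12
```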