We now introduce the Bellman equation, a mathematical tool for analyzing state values. In a nutshell, the Bellman equation is a set of linear equations that describe the relationships between the values of all the states.
We next derive the Bellman equation. First, note that $G_t$ can be rewritten as
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma\left(R_{t+2} + \gamma R_{t+3} + \cdots\right) = R_{t+1} + \gamma G_{t+1}.
$$
Then, the state value can be decomposed as
$$
v_\pi(s) = E[G_t \mid S_t = s] = E[R_{t+1} \mid S_t = s] + \gamma E[G_{t+1} \mid S_t = s]. \tag{2.4}
$$
The first term, $E[R_{t+1} \mid S_t = s]$, is the expectation of the immediate rewards. By the law of total expectation (Appendix A), it can be calculated as
$$
E[R_{t+1} \mid S_t = s] = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, E[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r. \tag{2.5}
$$
Here, $\mathcal{A}$ and $\mathcal{R}$ are the sets of possible actions and rewards, respectively. It should be noted that $\mathcal{A}$ may differ across states, in which case it should be written as $\mathcal{A}(s)$. Similarly, $\mathcal{R}$ may depend on $(s, a)$. For simplicity, we drop the dependence on $s$ or $(s, a)$ throughout this book; nevertheless, the conclusions remain valid when such dependence is present.
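To make the law of total expectation in (2.5) concrete, the following is a minimal Python sketch for a single hypothetical state $s$ with two actions; the policy probabilities and reward distributions below are invented for illustration and are not part of the text. It compares the right-hand side of (2.5) with a Monte Carlo estimate of $E[R_{t+1} \mid S_t = s]$.

```python
# A minimal sketch of (2.5) for one hypothetical state s. The policy pi(a|s)
# and the reward distributions p(r|s,a) are made up for illustration.
import random

pi = {"a1": 0.3, "a2": 0.7}                  # pi(a|s) for a fixed state s
p_r = {                                       # p(r|s,a) for the same state s
    "a1": {1.0: 0.8, 0.0: 0.2},
    "a2": {1.0: 0.1, 0.0: 0.9},
}

# Right-hand side of (2.5): sum_a pi(a|s) sum_r p(r|s,a) r
analytic = sum(pi[a] * sum(p * r for r, p in p_r[a].items()) for a in pi)

# Left-hand side of (2.5): E[R_{t+1} | S_t = s], estimated by sampling an
# action from pi(.|s) and then a reward from p(.|s,a).
random.seed(0)
n = 100_000
total = 0.0
for _ in range(n):
    a = random.choices(list(pi), weights=list(pi.values()))[0]
    total += random.choices(list(p_r[a]), weights=list(p_r[a].values()))[0]
estimate = total / n

print(f"sum_a pi(a|s) sum_r p(r|s,a) r   = {analytic:.3f}")   # 0.310
print(f"Monte Carlo estimate of E[R|S=s] ~= {estimate:.3f}")  # approximately equal
```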
The second term, $E[G_{t+1} \mid S_t = s]$, is the expectation of the future rewards. It can be calculated as
$$
\begin{aligned}
E[G_{t+1} \mid S_t = s] &= \sum_{s' \in \mathcal{S}} E[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) \\
&= \sum_{s' \in \mathcal{S}} E[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) \qquad \text{(due to the Markov property)} \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s')\, p(s' \mid s) \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s') \sum_{a \in \mathcal{A}} p(s' \mid s, a)\, \pi(a \mid s).
\end{aligned}
\tag{2.6}
$$
The above derivation uses the fact that $E[G_{t+1} \mid S_t = s, S_{t+1} = s'] = E[G_{t+1} \mid S_{t+1} = s']$, which follows from the Markov property: the future rewards depend only on the present state, not on the earlier ones.
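The meaning of (2.6) can also be illustrated numerically. The sketch below uses a made-up two-state, two-action MDP (all model quantities, the policy, the sample size, and the truncation horizon are invented for illustration): it estimates the left-hand side $E[G_{t+1} \mid S_t = s]$ by sampling, and compares it with $\sum_{s'} v_\pi(s')\, p(s' \mid s)$, where $v_\pi(s')$ is itself estimated from sampled returns. The two numbers agree only up to Monte Carlo error.

```python
# A Monte Carlo illustration of the decomposition (2.6),
#     E[G_{t+1} | S_t = s] = sum_{s'} v_pi(s') p(s'|s),
# on a made-up 2-state, 2-action MDP. All numbers below are invented.
import random

random.seed(0)
gamma = 0.9
HORIZON = 60            # returns are truncated after this many steps
N = 10_000              # number of sampled trajectories per estimate

pi = {0: {"a": 0.5, "b": 0.5}, 1: {"a": 0.2, "b": 0.8}}          # pi(a|s)
P = {                                                            # p(s'|s, a)
    (0, "a"): {0: 0.7, 1: 0.3}, (0, "b"): {0: 0.0, 1: 1.0},
    (1, "a"): {0: 1.0, 1: 0.0}, (1, "b"): {0: 0.4, 1: 0.6},
}
R = {(0, "a"): 1.0, (0, "b"): 0.5, (1, "a"): 0.0, (1, "b"): -0.5}  # E[r|s,a]

def sample(dist):
    """Draw a key of `dist` with probability given by its value."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sampled_return(s):
    """One truncated sample of the discounted return G starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(HORIZON):
        a = sample(pi[s])
        g += discount * R[(s, a)]
        discount *= gamma
        s = sample(P[(s, a)])
    return g

s = 0  # the state conditioned on, S_t = s

# Left-hand side: sample S_{t+1} ~ p(.|s) under pi, then the return from S_{t+1}.
lhs = sum(sampled_return(sample(P[(s, sample(pi[s]))])) for _ in range(N)) / N

# Right-hand side: sum_{s'} v_pi(s') p(s'|s), where v_pi(s') is estimated from
# sampled returns and p(s'|s) = sum_a p(s'|s,a) pi(a|s), as in (2.6).
v_hat = {s2: sum(sampled_return(s2) for _ in range(N)) / N for s2 in (0, 1)}
p_next = {s2: sum(pi[s][a] * P[(s, a)][s2] for a in pi[s]) for s2 in (0, 1)}
rhs = sum(v_hat[s2] * p_next[s2] for s2 in (0, 1))

# The two estimates agree up to Monte Carlo error.
print(f"E[G_t+1 | S_t={s}]     ~= {lhs:.2f}")
print(f"sum_s' v(s') p(s'|s)   ~= {rhs:.2f}")
```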
Substituting (2.5)-(2.6) into (2.4) yields
$$
\begin{aligned}
v_\pi(s) &= E[R_{t+1} \mid S_t = s] + \gamma E[G_{t+1} \mid S_t = s] \\
&= \underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r}_{\text{mean of immediate rewards}} + \underbrace{\gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v_\pi(s')}_{\text{mean of future rewards}} \\
&= \sum_{a \in \mathcal{A}} \pi(a \mid s) \left[ \sum_{r \in \mathcal{R}} p(r \mid s, a)\, r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v_\pi(s') \right], \quad \text{for all } s \in \mathcal{S}.
\end{aligned}
\tag{2.7}
$$
This equation is the Bellman equation, which characterizes the relationships among state values. It is a fundamental tool for designing and analyzing reinforcement learning algorithms.
The Bellman equation seems complex at first glance. In fact, it has a clear structure. Some remarks are given below.
⋄ $v_\pi(s)$ and $v_\pi(s')$ are unknown state values to be calculated. It may be confusing to beginners how to calculate the unknown $v_\pi(s)$ given that it relies on another unknown $v_\pi(s')$. It must be noted that the Bellman equation refers to a set of linear equations for all states rather than a single equation. If we put these equations together, it becomes clear how to calculate all the state values (a minimal numerical sketch is given after these remarks). Details will be given in Section 2.7.

⋄ $\pi(a \mid s)$ is a given policy. Since state values can be used to evaluate a policy, solving the state values from the Bellman equation is a policy evaluation process, which is an important process in many reinforcement learning algorithms, as we will see later in the book.

⋄ $p(r \mid s, a)$ and $p(s' \mid s, a)$ represent the system model. We will first show how to calculate the state values with this model in Section 2.7, and then show how to do that without the model by using model-free algorithms later in this book.
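As a preview of Section 2.7, here is a minimal sketch, assuming the same made-up two-state, two-action model as in the Monte Carlo sketch above, of how the equations in (2.7) can be stacked over all states and solved simultaneously for the unknown state values; all model quantities and the policy are invented for illustration.

```python
# A minimal policy evaluation sketch: the Bellman equation (2.7) written for
# all states at once and solved as a linear system. The 2-state, 2-action
# model and the policy below are made up for illustration.
import numpy as np

gamma = 0.9
n_states = 2

# pi[s, a] = pi(a|s): a fixed stochastic policy.
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# P[s, a, s'] = p(s'|s, a): transition probabilities of the model.
P = np.array([[[0.7, 0.3], [0.0, 1.0]],
              [[1.0, 0.0], [0.4, 0.6]]])

# R[s, a] = sum_r p(r|s,a) r: expected immediate reward for each (s, a).
R = np.array([[1.0, 0.5],
              [0.0, -0.5]])

# r_pi[s]     = sum_a pi(a|s) sum_r p(r|s,a) r     (mean immediate reward)
# P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)           (state transitions under pi)
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)

# Stacking (2.7) over all states gives v = r_pi + gamma * P_pi @ v, i.e. the
# linear system (I - gamma * P_pi) v = r_pi in the unknown state values v.
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print("state values v_pi(s):", v)   # approximately [1.67, 0.67]

# Each of the per-state equations in (2.7) indeed holds for the solution.
assert np.allclose(v, r_pi + gamma * P_pi @ v)
```

Solving the system in closed form like this is only practical because the state space here is tiny; the sketch is simply meant to make the "set of linear equations" interpretation concrete, and the resulting values can be compared with the Monte Carlo estimates from the earlier sketch.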
In addition to the expression in (2.7), readers may also encounter other expressions of the Bellman equation in the literature. We next introduce two equivalent expressions.
First, it follows from the law of total probability that
$$
p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a), \qquad p(r \mid s, a) = \sum_{s' \in \mathcal{S}} p(s', r \mid s, a).
$$
Then, (2.7) can be equivalently written as
$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right].
$$
Second, the reward $r$ may depend solely on the next state $s'$ in some problems. In that case, we can write the reward as $r(s')$ and hence $p(r(s') \mid s, a) = p(s' \mid s, a)$. Substituting this into (2.7) gives
$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \left[ r(s') + \gamma v_\pi(s') \right].
$$
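As a quick numerical sanity check of these equivalent expressions, the sketch below evaluates, for a single state of a made-up MDP whose reward depends only on the next state, the right-hand side of (2.7), the joint-distribution form, and the $r(s')$ form, and confirms that they coincide; every number, including the placeholder values $v_\pi(s')$, is invented for illustration.

```python
# A numerical sanity check, for one state s of a made-up MDP whose reward
# depends only on the next state (r = r(s')), that the expression in (2.7),
# the joint-distribution form, and the r(s') form all give the same value.
import math
from collections import defaultdict

gamma = 0.9

pi = {"a1": 0.4, "a2": 0.6}                      # pi(a|s) for the fixed state s
p_next = {"a1": {"s1": 0.9, "s2": 0.1},          # p(s'|s, a)
          "a2": {"s1": 0.2, "s2": 0.8}}
r_of = {"s1": 1.0, "s2": -1.0}                   # r(s'): reward attached to s'
v = {"s1": 3.0, "s2": 5.0}                       # placeholder values v_pi(s')

# Joint distribution p(s', r | s, a) induced by r(s'): all probability mass
# sits on the pairs (s', r(s')).
p_joint = {a: {(s2, r_of[s2]): p for s2, p in p_next[a].items()} for a in pi}

# Marginal p(r | s, a), obtained by the law of total probability.
p_r = {a: defaultdict(float) for a in pi}
for a in pi:
    for (s2, r), p in p_joint[a].items():
        p_r[a][r] += p

# Form (2.7): sum_a pi(a|s) [ sum_r p(r|s,a) r + gamma sum_s' p(s'|s,a) v(s') ]
form_27 = sum(
    pi[a] * (sum(p * r for r, p in p_r[a].items())
             + gamma * sum(p * v[s2] for s2, p in p_next[a].items()))
    for a in pi
)

# Joint form: sum_a pi(a|s) sum_{s', r} p(s', r|s,a) [ r + gamma v(s') ]
form_joint = sum(
    pi[a] * sum(p * (r + gamma * v[s2]) for (s2, r), p in p_joint[a].items())
    for a in pi
)

# r(s') form: sum_a pi(a|s) sum_s' p(s'|s,a) [ r(s') + gamma v(s') ]
form_rs = sum(
    pi[a] * sum(p * (r_of[s2] + gamma * v[s2]) for s2, p in p_next[a].items())
    for a in pi
)

print(form_27, form_joint, form_rs)
assert math.isclose(form_27, form_joint) and math.isclose(form_joint, form_rs)
```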