The tool for analyzing optimal policies and optimal state values is the Bellman optimality equation (BOE). By solving this equation, we can obtain optimal policies and optimal state values. We next present the expression of the BOE and then analyze it in detail.
For every s∈S, the elementwise expression of the BOE is

$$v(s) = \max_{\pi(s) \in \Pi(s)} \sum_{a \in A} \pi(a|s) q(s,a), \qquad s \in S, \tag{3.1}$$

where v(s), v(s′) are unknown variables to be solved and

$$q(s,a) \doteq \sum_{r \in R} p(r|s,a)\, r + \gamma \sum_{s' \in S} p(s'|s,a)\, v(s').$$
Here, π(s) denotes a policy for state s , and Π(s) is the set of all possible policies for s .
The BOE is an elegant and powerful tool for analyzing optimal policies. However, it may be nontrivial to understand. For example, the equation involves two unknown variables, v(s) and π(a∣s), and it may be confusing to beginners how two unknowns can be solved from a single equation. Moreover, the BOE is actually a special Bellman equation, although this is not obvious since its expression is quite different from that of the Bellman equation. We also need to answer the following fundamental questions about the BOE.
⋄ Existence: Does this equation have a solution?
⋄ Uniqueness: Is the solution unique?
⋄ Algorithm: How do we solve this equation?
⋄ Optimality: How is the solution related to optimal policies?
Once we can answer these questions, we will clearly understand optimal state values and optimal policies.
3.3.1 Maximization of the right-hand side of the BOE
We next clarify how to solve the maximization problem on the right-hand side of the BOE in (3.1). At first glance, it may be confusing to beginners how to solve two unknown variables v(s) and π(a∣s) from one equation. In fact, these two unknown variables can be solved one by one. This idea is illustrated by the following example.
Example 3.1. Consider two unknown variables x,y∈R that satisfy
$$x = \max_{y \in \mathbb{R}} (2x - 1 - y^2).$$
The first step is to solve y on the right-hand side of the equation. Regardless of the value of x , we always have maxy(2x−1−y2)=2x−1 , where the maximum is achieved when y=0 . The second step is to solve x . When y=0 , the equation becomes x=2x−1 , which leads to x=1 . Therefore, y=0 and x=1 are the solutions of the equation.
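The two-step procedure in Example 3.1 can be checked numerically. Below is a minimal sketch; the grid of y-values and the fixed value x = 3.7 used in the first step are our own arbitrary choices:

```python
import numpy as np

# Step 1: for any fixed x, maximize 2x - 1 - y^2 over y.
# Since -y^2 <= 0, the inner maximum equals 2x - 1, attained at y = 0.
ys = np.linspace(-5, 5, 10001)
x = 3.7                                  # an arbitrary fixed value of x
vals = 2 * x - 1 - ys**2
inner_max = np.max(vals)                 # equals 2x - 1
best_y = ys[np.argmax(vals)]             # maximum attained at y ~ 0

# Step 2: with y = 0, the equation reduces to x = 2x - 1, so x = 1.
x_solution = 1.0
assert x_solution == 2 * x_solution - 1
```

Note that step 1 holds regardless of the value of x, which is why the two unknowns can be solved one by one.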
We now turn to the maximization problem on the right-hand side of the BOE. The BOE in (3.1) can be written concisely as
$$v(s) = \max_{\pi(s) \in \Pi(s)} \sum_{a \in A} \pi(a|s)\, q(s,a), \qquad s \in S.$$
Inspired by Example 3.1, we can first solve the optimal π on the right-hand side. How to do that? The following example demonstrates its basic idea.
Example 3.2. Given q1,q2,q3∈R , we would like to find the optimal values of c1,c2,c3 to maximize
$$\sum_{i=1}^{3} c_i q_i = c_1 q_1 + c_2 q_2 + c_3 q_3,$$

where c_1 + c_2 + c_3 = 1 and c_1, c_2, c_3 ≥ 0.
Without loss of generality, suppose that q_3 ≥ q_1, q_2. Then, the optimal solution is c_3^∗ = 1 and c_1^∗ = c_2^∗ = 0. This is because

$$q_3 = (c_1 + c_2 + c_3) q_3 \geq c_1 q_1 + c_2 q_2 + c_3 q_3$$

for any feasible c_1, c_2, c_3.

Returning to the BOE, since Σ_a π(a∣s) q(s,a) is a weighted average of the values q(s,a), we have

$$v(s) = \max_{\pi(s) \in \Pi(s)} \sum_{a \in A} \pi(a|s)\, q(s,a) = \max_{a \in A} q(s,a),$$

where the maximum is achieved by the policy

$$\pi(a|s) = \begin{cases} 1, & a = a^*, \\ 0, & a \neq a^*. \end{cases}$$

Here, a^∗ = argmax_a q(s,a). In summary, the optimal policy π(s) is the one that selects the action that has the greatest value of q(s,a).
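The conclusion above — a weighted average Σ_a π(a∣s) q(s,a) is maximized by placing all probability on the greedy action — can be verified numerically. A small sketch with arbitrary q-values of our own:

```python
import numpy as np

q = np.array([1.0, 3.0, 2.0])     # arbitrary q(s,a) for three actions; argmax is action 1

# A policy at s is a probability vector c >= 0 with sum(c) = 1.
rng = np.random.default_rng(0)
random_policies = rng.dirichlet(np.ones(3), size=1000)  # random points in the simplex
values = random_policies @ q      # sum_a pi(a|s) q(s,a) for each sampled policy

greedy = np.zeros(3)
greedy[np.argmax(q)] = 1.0        # deterministic greedy policy: all weight on a*
print(greedy @ q)                 # 3.0 = max_a q(s,a); no sampled policy does better
assert np.all(values <= greedy @ q + 1e-12)
```

No random mixture of actions beats the deterministic greedy policy, which is exactly why the inner maximization over π(s) reduces to a maximization over actions.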
3.3.2 Matrix-vector form of the BOE
The BOE refers to a set of equations defined for all states. If we combine these equations, we can obtain a concise matrix-vector form, which will be extensively used in this chapter.
The matrix-vector form of the BOE is
$$v = \max_{\pi \in \Pi} (r_\pi + \gamma P_\pi v), \tag{3.2}$$
where v∈R^|S| and max_π is performed in an elementwise manner. The structures of r_π and P_π are the same as those in the matrix-vector form of the normal Bellman equation:

$$[r_\pi]_s \doteq \sum_{a \in A} \pi(a|s) \sum_{r \in R} p(r|s,a)\, r, \qquad [P_\pi]_{ss'} \doteq \sum_{a \in A} \pi(a|s)\, p(s'|s,a).$$
Since the optimal value of π is determined by v , the right-hand side of (3.2) is a function of v , denoted as
$$f(v) \doteq \max_{\pi \in \Pi} (r_\pi + \gamma P_\pi v).$$
Then, the BOE can be expressed in a concise form as
$$v = f(v). \tag{3.3}$$
In the remainder of this section, we show how to solve this nonlinear equation.
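As a preview, f(v) can be evaluated elementwise as f(v)(s) = max_a q(s,a) and iterated to a fixed point. A sketch on a toy two-state, two-action MDP of our own construction (the convergence of this iteration is justified by the contraction analysis that follows):

```python
import numpy as np

# A toy 2-state, 2-action MDP (our own example): r[s, a] is the expected
# immediate reward and P[s, a, s'] the transition probability.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

def f(v):
    """Right-hand side of the BOE: f(v)[s] = max_a q(s, a)."""
    q = r + gamma * P @ v        # q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) v(s')
    return q.max(axis=1)

v = np.zeros(2)
for _ in range(1000):            # fixed-point iteration v_{k+1} = f(v_k)
    v = f(v)

policy = (r + gamma * P @ v).argmax(axis=1)   # greedy actions w.r.t. the solved v
print(v, policy)
```

The loop is exactly the iterative process x_{k+1} = f(x_k) suggested by the contraction mapping theorem introduced below, applied to the mapping f in (3.3).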
3.3.3 Contraction mapping theorem
Since the BOE can be expressed as a nonlinear equation v=f(v), we next introduce the contraction mapping theorem [6] to analyze it. The contraction mapping theorem, also known as the fixed-point theorem, is a powerful tool for analyzing general nonlinear equations. Readers who already know this theorem can skip this part; otherwise, the reader is advised to become familiar with it, since it is the key to analyzing the BOE.
Consider a function f(x) , where x∈Rd and f:Rd→Rd . A point x∗ is called a fixed point if
f(x∗)=x∗.
The interpretation of the above equation is that the map of x∗ is itself. This is the reason why x∗ is called "fixed". The function f is a contraction mapping (or contractive function) if there exists γ∈(0,1) such that
∥f(x1)−f(x2)∥≤γ∥x1−x2∥
for any x1,x2∈Rd . In this book, ∥⋅∥ denotes a vector or matrix norm.
Example 3.3. We present three examples to demonstrate fixed points and contraction mappings.
⋄ x = f(x) = 0.5x, where x ∈ R.
It is easy to verify that x=0 is a fixed point since 0=0.5⋅0 . Moreover, f(x)=0.5x is a contraction mapping because ∥0.5x1−0.5x2∥=0.5∥x1−x2∥≤γ∥x1−x2∥ for any γ∈[0.5,1) .
⋄ x = f(x) = Ax, where x ∈ R^n, A ∈ R^{n×n}, and ∥A∥ ≤ γ < 1.
It is easy to verify that x=0 is a fixed point since 0=A0 . To see the contraction property, ∥Ax1−Ax2∥=∥A(x1−x2)∥≤∥A∥∥x1−x2∥≤γ∥x1−x2∥ . Therefore, f(x)=Ax is a contraction mapping.
⋄ x = f(x) = 0.5 sin x, where x ∈ R.
It is easy to see that x=0 is a fixed point since 0 = 0.5 sin 0. Moreover, it follows from the mean value theorem [7, 8] that, for any x_1, x_2 ∈ R, there exists x_3 between x_1 and x_2 such that

$$|0.5\sin x_1 - 0.5\sin x_2| = |0.5\cos(x_3)|\,|x_1 - x_2| \leq 0.5|x_1 - x_2|.$$

Hence, f(x) = 0.5 sin x is a contraction mapping.
The relationship between a fixed point and the contraction property is characterized by the following classic theorem.
Theorem 3.1 (Contraction mapping theorem). For any equation that has the form x=f(x) where x and f(x) are real vectors, if f is a contraction mapping, then the following properties hold.
⋄ Existence: There exists a fixed point x∗ satisfying f(x∗)=x∗.
⋄ Uniqueness: The fixed point x∗ is unique.
⋄ Algorithm: Consider the iterative process
xk+1=f(xk),
where k=0,1,2,… . Then, xk→x∗ as k→∞ for any initial guess x0 . Moreover, the convergence rate is exponentially fast.
The contraction mapping theorem not only can tell whether the solution of a nonlinear equation exists but also suggests a numerical algorithm for solving the equation. The proof of the theorem is given in Box 3.1.
The following example demonstrates how to calculate the fixed points of some equations using the iterative algorithm suggested by the contraction mapping theorem.
Example 3.4. Let us revisit the examples above: x=0.5x, x=Ax, and x=0.5 sin x. Since the right-hand sides of these three equations have been shown to be contraction mappings, it follows from the contraction mapping theorem that each equation has a unique fixed point, which can be easily verified to be x∗=0. Moreover, the fixed points of the three equations can be iteratively computed by the following algorithms:
xk+1=0.5xk,
xk+1=Axk,
xk+1=0.5sinxk,
given any initial guess x0 .
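The three iterative algorithms above can be run in a few lines; a minimal sketch, where the matrix A and the initial guess are arbitrary choices of our own satisfying the stated conditions:

```python
import numpy as np

A = np.array([[0.2, 0.1],
              [0.0, 0.3]])          # ||A||_inf = 0.3 < 1, so x -> Ax is contractive

maps = {"0.5x":    lambda x: 0.5 * x,
        "Ax":      lambda x: A @ x,
        "0.5sinx": lambda x: 0.5 * np.sin(x)}

fixed_points = {}
for name, g in maps.items():
    x = np.array([1.0, -2.0])       # arbitrary initial guess x_0
    for _ in range(100):            # iterate x_{k+1} = g(x_k)
        x = g(x)
    fixed_points[name] = x
    print(name, x)                  # each sequence converges to the fixed point x* = 0
```

Changing the initial guess does not change the limit, in line with the uniqueness property of Theorem 3.1.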
Box 3.1: Proof of the contraction mapping theorem
Part 1: We prove that the sequence $\{x_k\}_{k=1}^{\infty}$ with $x_k = f(x_{k-1})$ is convergent.
The proof relies on Cauchy sequences. A sequence x_1, x_2, … is called a Cauchy sequence if, for any small ε>0, there exists N such that ∥x_m − x_n∥ < ε for all m, n > N. The intuitive interpretation is that there exists a finite integer N after which all the elements are sufficiently close to each other. Cauchy sequences are important because a Cauchy sequence is guaranteed to converge to a limit, and this convergence property will be used to prove the contraction mapping theorem. Note that we must have ∥x_m − x_n∥ < ε for all m, n > N; merely having x_{n+1} − x_n → 0 is insufficient for the sequence to be Cauchy. For example, x_{n+1} − x_n → 0 holds for x_n = √n, since √(n+1) − √n = 1/(√(n+1) + √n) → 0, but apparently x_n = √n diverges.
We next show that $\{x_k = f(x_{k-1})\}_{k=1}^{\infty}$ is a Cauchy sequence and hence converges.
First, since f is a contraction mapping, we have
$$\|x_{k+1} - x_k\| = \|f(x_k) - f(x_{k-1})\| \leq \gamma \|x_k - x_{k-1}\|.$$
Similarly, we have ∥x_k − x_{k−1}∥ ≤ γ∥x_{k−1} − x_{k−2}∥, …, ∥x_2 − x_1∥ ≤ γ∥x_1 − x_0∥. Thus, we have

$$\|x_{k+1} - x_k\| \leq \gamma^k \|x_1 - x_0\|.$$

Since γ<1, we know that ∥x_{k+1} − x_k∥ converges to zero exponentially fast as k→∞ given any x_1, x_0. Notably, the convergence of {∥x_{k+1} − x_k∥} is not sufficient for implying the convergence of {x_k}. Therefore, we need to further consider ∥x_m − x_n∥ for any m>n. In particular,

$$\|x_m - x_n\| \leq \|x_m - x_{m-1}\| + \cdots + \|x_{n+1} - x_n\| \leq (\gamma^{m-1} + \cdots + \gamma^n)\|x_1 - x_0\| \leq \frac{\gamma^n}{1-\gamma}\|x_1 - x_0\|. \tag{3.4}$$

As a result, for any ε>0, we can always find N such that ∥x_m − x_n∥ < ε for all m, n > N. Therefore, this sequence is Cauchy and hence converges to a limit point denoted as x∗ = lim_{k→∞} x_k.
Part 2: We show that the limit x∗ = lim_{k→∞} x_k is a fixed point. Since

$$\|f(x_k) - x_k\| = \|x_{k+1} - x_k\| \leq \gamma^k \|x_1 - x_0\|,$$

we know that ∥f(x_k) − x_k∥ converges to zero exponentially fast. Hence, we have f(x∗) = x∗ in the limit.
Part 3: We show that the fixed point is unique. Suppose that there is another fixed point x′ satisfying f(x′)=x′ . Then,
$$\|x' - x^*\| = \|f(x') - f(x^*)\| \leq \gamma \|x' - x^*\|.$$
Since γ<1 , this inequality holds if and only if ∥x′−x∗∥=0 . Therefore, x′=x∗ .
Part 4: We show that x_k converges to x∗ exponentially fast. Recall that ∥x_m − x_n∥ ≤ γ^n/(1−γ) ∥x_1 − x_0∥, as proven in (3.4). Since m can be arbitrarily large, we have

$$\|x^* - x_n\| = \lim_{m \to \infty} \|x_m - x_n\| \leq \frac{\gamma^n}{1-\gamma} \|x_1 - x_0\|.$$
Since γ<1 , the error converges to zero exponentially fast as n→∞ .
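The error bound in Part 4 can be verified numerically for the contraction f(x) = 0.5x from Example 3.3, for which γ = 0.5 and x∗ = 0 (the initial guess x_0 = 10 is an arbitrary choice of our own):

```python
gamma = 0.5
f = lambda x: gamma * x          # contraction mapping with fixed point x* = 0

x0 = 10.0                        # arbitrary initial guess
xs = [x0]
for _ in range(30):              # iterate x_{k+1} = f(x_k)
    xs.append(f(xs[-1]))

x_star = 0.0
c = abs(xs[1] - xs[0])           # ||x_1 - x_0||
for n in range(31):
    # Part 4 bound: ||x* - x_n|| <= gamma^n / (1 - gamma) * ||x_1 - x_0||
    assert abs(x_star - xs[n]) <= gamma**n / (1 - gamma) * c + 1e-15
```

For this particular map the bound is tight: the error shrinks by exactly the factor γ at every step.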
3.3.4 Contraction property of the right-hand side of the BOE
We next show that f(v) in the BOE in (3.3) is a contraction mapping. Thus, the contraction mapping theorem introduced in the previous subsection can be applied.
Theorem 3.2 (Contraction property of f(v) ). The function f(v) on the right-hand side of the BOE in (3.3) is a contraction mapping. In particular, for any v1,v2∈R∣S∣ , it holds that
$$\|f(v_1) - f(v_2)\|_\infty \leq \gamma \|v_1 - v_2\|_\infty,$$
where γ∈(0,1) is the discount rate, and ∥⋅∥∞ is the maximum norm, which is the maximum absolute value of the elements of a vector.
The proof of the theorem is given in Box 3.2. This theorem is important because we can use the powerful contraction mapping theorem to analyze the BOE.
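As a sanity check (not a proof), the contraction inequality of Theorem 3.2 can be tested numerically on a randomly generated MDP; the sizes and random seed below are arbitrary choices of our own:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9
r = rng.standard_normal((nS, nA))          # arbitrary expected rewards r(s, a)
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)          # normalize so each P[s, a, :] is a distribution

def f(v):
    """f(v)[s] = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    return (r + gamma * P @ v).max(axis=1)

for _ in range(1000):
    v1 = rng.standard_normal(nS)
    v2 = rng.standard_normal(nS)
    lhs = np.max(np.abs(f(v1) - f(v2)))    # ||f(v1) - f(v2)||_inf
    rhs = gamma * np.max(np.abs(v1 - v2))  # gamma * ||v1 - v2||_inf
    assert lhs <= rhs + 1e-12
print("contraction inequality held on all samples")
```

The inequality holds for every sampled pair, consistent with the theorem; Box 3.2 below gives the actual proof.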
Box 3.2: Proof of Theorem 3.2
Consider any two vectors $v_1, v_2 \in \mathbb{R}^{|S|}$, and suppose that $\pi_1^* \doteq \arg\max_\pi (r_\pi + \gamma P_\pi v_1)$ and $\pi_2^* \doteq \arg\max_\pi (r_\pi + \gamma P_\pi v_2)$. Then,