
3.1 Motivating example: How to improve policies?


Figure 3.2: An example for demonstrating policy improvement.

Consider the policy shown in Figure 3.2. Here, the orange and blue cells represent the forbidden and target areas, respectively. The policy here is not good because it selects $a_2$ (rightward) in state $s_1$. How can we improve the given policy to obtain a better policy? The answer lies in state values and action values.

\diamond Intuition: It is intuitively clear that the policy can be improved if it selects $a_3$ (downward) instead of $a_2$ (rightward) at $s_1$. This is because moving downward enables the agent to avoid entering the forbidden area.
\diamond Mathematics: The above intuition can be made precise by calculating the state values and action values of the given policy.

First, we calculate the state values of the given policy. In particular, the Bellman equation of this policy is

$$
\begin{aligned}
v_\pi(s_1) &= -1 + \gamma v_\pi(s_2), \\
v_\pi(s_2) &= +1 + \gamma v_\pi(s_4), \\
v_\pi(s_3) &= +1 + \gamma v_\pi(s_4), \\
v_\pi(s_4) &= +1 + \gamma v_\pi(s_4).
\end{aligned}
$$

Let $\gamma = 0.9$. Solving these equations gives

$$
v_\pi(s_4) = v_\pi(s_3) = v_\pi(s_2) = 10, \qquad v_\pi(s_1) = 8.
$$
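
As a quick check, the four Bellman equations above form a small linear system $v = r + \gamma P v$ that can be solved directly. Below is a minimal Python sketch (assuming NumPy is available; the state ordering $s_1, s_2, s_3, s_4$ and the deterministic transitions are read off the policy in Figure 3.2):

```python
import numpy as np

gamma = 0.9

# Deterministic transitions under the given policy (rows/columns ordered s1..s4):
# s1 -> s2, s2 -> s4, s3 -> s4, s4 -> s4.
P = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)

# Immediate rewards under the policy: entering the forbidden cell gives -1,
# entering (or staying at) the target cell gives +1.
r = np.array([-1.0, 1.0, 1.0, 1.0])

# Solve (I - gamma * P) v = r, the matrix form of the Bellman equations above.
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)  # [ 8. 10. 10. 10.]
```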

Second, we calculate the action values for state $s_1$:

$$
\begin{aligned}
q_\pi(s_1, a_1) &= -1 + \gamma v_\pi(s_1) = 6.2, \\
q_\pi(s_1, a_2) &= -1 + \gamma v_\pi(s_2) = 8, \\
q_\pi(s_1, a_3) &= 0 + \gamma v_\pi(s_3) = 9, \\
q_\pi(s_1, a_4) &= -1 + \gamma v_\pi(s_1) = 6.2, \\
q_\pi(s_1, a_5) &= 0 + \gamma v_\pi(s_1) = 7.2.
\end{aligned}
$$
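
Continuing the sketch above (reusing `gamma` and `v`), the five action values at $s_1$ follow from the same one-step lookahead. The reward and next-state pairs below simply encode the grid dynamics described in the text; the action labels are for illustration only:

```python
# One-step lookahead at s1: q(s1, a) = reward(a) + gamma * v(next_state(a)).
# Next states are given as 0-based indices: s1 -> 0, s2 -> 1, s3 -> 2, s4 -> 3.
actions_at_s1 = {
    "a1 (up)":    (-1.0, 0),  # bounces off the boundary, stays at s1
    "a2 (right)": (-1.0, 1),  # enters the forbidden cell s2
    "a3 (down)":  ( 0.0, 2),  # moves to s3
    "a4 (left)":  (-1.0, 0),  # bounces off the boundary, stays at s1
    "a5 (stay)":  ( 0.0, 0),  # remains at s1
}

q = {name: reward + gamma * v[next_state]
     for name, (reward, next_state) in actions_at_s1.items()}
print(q)  # approximately: a1 -> 6.2, a2 -> 8.0, a3 -> 9.0, a4 -> 6.2, a5 -> 7.2
```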

It is notable that action $a_3$ has the greatest action value:

$$
q_\pi(s_1, a_3) \geq q_\pi(s_1, a_i), \quad \text{for all } i \neq 3.
$$

Therefore, we can update the policy to select $a_3$ at $s_1$.
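
In code, this update step is just an argmax over the action values computed above (continuing the previous sketch):

```python
# Greedy policy improvement at s1: pick the action with the greatest action value.
best_action = max(q, key=q.get)
print(best_action)  # 'a3 (down)'
```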

This example illustrates that we can obtain a better policy if we update the policy to select the action with the greatest action value. This is the basic idea of many reinforcement learning algorithms.

This example is very simple in the sense that the given policy is not good only at state $s_1$. If the policy is also not good at other states, will selecting the actions with the greatest action values still generate a better policy? Moreover, do optimal policies always exist? What does an optimal policy look like? We will answer all of these questions in this chapter.