
2.8 From state value to action value

While we have been discussing state values thus far in this chapter, we now turn to the action value, which indicates the "value" of taking an action at a state. Although the concept of action value is important, it is introduced only in the last section of this chapter because it relies heavily on the concept of state values. It is important to understand state values well before studying action values.

The action value of a state-action pair (s,a)(s,a) is defined as

$$q_{\pi}(s,a) \doteq \mathbb{E}[G_t \mid S_t = s, A_t = a].$$

As can be seen, the action value is defined as the expected return that can be obtained after taking an action at a state. It must be noted that $q_{\pi}(s,a)$ depends on a state-action pair $(s,a)$ rather than an action alone. It may be more rigorous to call this value a state-action value, but it is conventionally called an action value for simplicity.
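For intuition, the definition can be read as an estimation recipe: force the first action to be $a$ at state $s$, follow $\pi$ afterwards, and average the resulting discounted returns. The sketch below assumes hypothetical `step(state, action) -> (reward, next_state)` and `policy(state) -> action` functions and truncates each episode at a fixed horizon; it is only meant to make the conditional expectation concrete, not to describe an algorithm introduced in this chapter.

```python
# A minimal Monte Carlo sketch of the definition of q_pi(s, a): take action `a` at
# state `s` once, then follow the policy, and average the discounted returns.
# `step` and `policy` are hypothetical placeholders supplied by the caller.

def estimate_action_value(step, policy, s, a, gamma=0.9, episodes=1000, horizon=100):
    total = 0.0
    for _ in range(episodes):
        g, discount = 0.0, 1.0
        state, action = s, a            # the first action is forced to be `a`
        for _ in range(horizon):        # truncate each episode at a fixed horizon
            reward, state = step(state, action)
            g += discount * reward
            discount *= gamma
            action = policy(state)      # subsequent actions follow the policy
        total += g
    return total / episodes             # sample mean approximates q_pi(s, a)
```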

What is the relationship between action values and state values?

First, it follows from the properties of conditional expectation that

$$\underbrace{\mathbb{E}[G_t \mid S_t = s]}_{v_{\pi}(s)} = \sum_{a \in \mathcal{A}} \underbrace{\mathbb{E}[G_t \mid S_t = s, A_t = a]}_{q_{\pi}(s,a)} \pi(a \mid s).$$

It then follows that

$$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) q_{\pi}(s,a). \tag{2.13}$$

As a result, a state value is the expectation of the action values associated with that state, taken with respect to the policy's action distribution.
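For example, the following sketch applies (2.13) at a single state; the action values and policy probabilities are made-up numbers used only for illustration.

```python
# A small sketch of (2.13): the state value is the policy-weighted sum of the
# action values at that state. The numbers below are assumed for illustration.

q = {"a1": -0.5, "a2": 1.0, "a3": 2.0}    # assumed action values q_pi(s, a)
pi = {"a1": 0.0, "a2": 0.5, "a3": 0.5}    # assumed policy pi(a | s)

v_s = sum(pi[a] * q[a] for a in q)        # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
print(v_s)                                # 1.5
```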

Second, since the state value is given by

$$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big[ \sum_{r \in \mathcal{R}} p(r \mid s,a) r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a) v_{\pi}(s') \Big],$$

comparing it with (2.13) leads to

$$q_{\pi}(s,a) = \sum_{r \in \mathcal{R}} p(r \mid s,a) r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a) v_{\pi}(s'). \tag{2.14}$$

It can be seen that the action value consists of two terms. The first term is the mean of the immediate rewards, and the second term is the mean of the discounted future rewards.

Both (2.13) and (2.14) describe the relationship between state values and action values. They are two sides of the same coin: (2.13) shows how to obtain state values from action values, whereas (2.14) shows how to obtain action values from state values.
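The following sketch applies both formulas to a small hypothetical tabular model (the reward, transition, and policy arrays are made up): the state values are first obtained by solving the state-value Bellman equation, the action values then follow from (2.14), and (2.13) recovers the state values again.

```python
import numpy as np

# Hypothetical model: r_mean[s, a] = sum_r p(r|s,a) r, P[s, a, s'] = p(s'|s,a),
# Pi[s, a] = pi(a|s). All numbers are made up for illustration.
gamma = 0.9
r_mean = np.array([[0.0, -1.0],
                   [1.0,  0.0],
                   [0.0,  1.0]])
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 1, 2] = 1.0            # assumed deterministic transitions
P[1, 0, 2] = P[1, 1, 0] = 1.0
P[2, 0, 0] = P[2, 1, 1] = 1.0
Pi = np.full((3, 2), 0.5)                # assumed 50/50 policy at every state

# State values from the state-value Bellman equation, solved as a linear system.
r_pi = (Pi * r_mean).sum(axis=1)         # sum_a pi(a|s) E[r | s, a]
P_pi = np.einsum("sa,sat->st", Pi, P)    # sum_a pi(a|s) p(s'|s, a)
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# (2.14): action values from state values.
q = r_mean + gamma * P @ v               # shape (num_states, num_actions)

# (2.13): state values recovered from action values.
print(np.allclose((Pi * q).sum(axis=1), v))   # True
```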

2.8.1 Illustrative examples


Figure 2.8: An example demonstrating the process of calculating action values.

We next present an example to illustrate the process of calculating action values and discuss a common mistake that beginners may make.

Consider the stochastic policy shown in Figure 2.8. We only examine the actions of $s_1$ here; the other states can be examined similarly. The action value of $(s_1, a_2)$ is

$$q_{\pi}(s_1, a_2) = -1 + \gamma v_{\pi}(s_2),$$

where $s_2$ is the next state. Similarly, it can be obtained that

$$q_{\pi}(s_1, a_3) = 0 + \gamma v_{\pi}(s_3).$$

A common mistake that beginners may make concerns the values of the actions that the given policy does not select. For example, the policy in Figure 2.8 can only select $a_2$ or $a_3$ at $s_1$ and cannot select $a_1, a_4, a_5$. One may argue that, since the policy does not select $a_1, a_4, a_5$, we do not need to calculate their action values, or that we can simply set $q_{\pi}(s_1, a_1) = q_{\pi}(s_1, a_4) = q_{\pi}(s_1, a_5) = 0$. This is wrong.

First, even if an action would not be selected by a policy, it still has an action value. In this example, although policy $\pi$ does not take $a_1$ at $s_1$, we can still calculate its action value by observing what we would obtain after taking this action. Specifically, after taking $a_1$, the agent is bounced back to $s_1$ (hence, the immediate reward is $-1$) and then continues moving in the state space starting from $s_1$ by following $\pi$ (hence, the future reward is $\gamma v_{\pi}(s_1)$). As a result, the action value of $(s_1, a_1)$ is

$$q_{\pi}(s_1, a_1) = -1 + \gamma v_{\pi}(s_1).$$

Similarly, for $a_4$ and $a_5$, which cannot possibly be selected by the given policy either, we have

$$\begin{aligned} q_{\pi}(s_1, a_4) &= -1 + \gamma v_{\pi}(s_1), \\ q_{\pi}(s_1, a_5) &= 0 + \gamma v_{\pi}(s_1). \end{aligned}$$

Second, why do we care about the actions that the given policy would not select? Although some actions cannot possibly be selected by a given policy, this does not mean that these actions are not good. It is possible that the given policy is not good and hence fails to select the best action. The purpose of reinforcement learning is to find optimal policies. To that end, we must keep exploring all actions so that better actions can be identified for each state.

Finally, after computing the action values, we can also calculate the state value according to (2.13):

$$\begin{aligned} v_{\pi}(s_1) &= 0.5\, q_{\pi}(s_1, a_2) + 0.5\, q_{\pi}(s_1, a_3) \\ &= 0.5\big[ -1 + \gamma v_{\pi}(s_2) \big] + 0.5\big[ 0 + \gamma v_{\pi}(s_3) \big]. \end{aligned}$$
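To make these calculations concrete, the sketch below plugs in assumed values for $v_{\pi}(s_2)$ and $v_{\pi}(s_3)$ (in Figure 2.8 the true values would come from solving the Bellman equation for the whole grid) and evaluates $v_{\pi}(s_1)$ together with all five action values at $s_1$, including the unselected ones.

```python
# Numeric sketch of the s1 calculations above. The values of v_pi(s2) and v_pi(s3)
# are assumed for illustration only.

gamma = 0.9
v_s2, v_s3 = 8.0, 10.0                   # assumed state values of s2 and s3

# (2.13) at s1 with the two selected actions a2 and a3.
v_s1 = 0.5 * (-1 + gamma * v_s2) + 0.5 * (0 + gamma * v_s3)   # = 7.6

# Action values of all five actions at s1, including the unselected ones.
q_s1 = {
    "a1": -1 + gamma * v_s1,   # bounced back to s1, immediate reward -1
    "a2": -1 + gamma * v_s2,   # moves to s2, immediate reward -1
    "a3":  0 + gamma * v_s3,   # moves to s3, immediate reward 0
    "a4": -1 + gamma * v_s1,   # bounced back to s1, immediate reward -1
    "a5":  0 + gamma * v_s1,   # stays at s1, immediate reward 0
}
print(v_s1, q_s1)
```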

2.8.2 The Bellman equation in terms of action values

The Bellman equation that we previously introduced was defined based on state values. In fact, it can also be expressed in terms of action values.

In particular, substituting (2.13) into (2.14) yields

$$q_{\pi}(s,a) = \sum_{r \in \mathcal{R}} p(r \mid s,a) r + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s,a) \sum_{a' \in \mathcal{A}(s')} \pi(a' \mid s') q_{\pi}(s', a'),$$

which is an equation of action values. The above equation is valid for every state-action pair. If we put all these equations together, their matrix-vector form is

$$q_{\pi} = \tilde{r} + \gamma P \Pi q_{\pi}. \tag{2.15}$$
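As a numeric illustration, the sketch below solves (2.15) directly as a linear system on a small made-up model. It assumes that $q_{\pi}$ and $\tilde{r}$ are stacked over the state-action pairs, that $P$ holds the next-state probabilities $p(s' \mid s,a)$, and that $\Pi$ holds the policy probabilities $\pi(a' \mid s')$; these stacking conventions are assumptions of the sketch rather than a restatement of the book's definitions.

```python
import numpy as np

# Solve (I - gamma * P @ Pi) q = r_tilde on a small hypothetical model.
# Rows of P (and entries of r_tilde, q) are ordered as (s0,a0), (s0,a1), (s1,a0), ...
n_s, n_a, gamma = 3, 2, 0.9
r_tilde = np.array([0.0, -1.0, 1.0, 0.0, 0.0, 1.0])   # assumed E[r | s, a]
P = np.zeros((n_s * n_a, n_s))                        # P[(s, a), s'] = p(s' | s, a)
P[0, 1] = P[1, 2] = P[2, 2] = P[3, 0] = P[4, 0] = P[5, 1] = 1.0
Pi = np.zeros((n_s, n_s * n_a))                       # Pi[s', (s', a')] = pi(a' | s')
for s in range(n_s):
    Pi[s, s * n_a:(s + 1) * n_a] = 0.5                # assumed 50/50 policy

q = np.linalg.solve(np.eye(n_s * n_a) - gamma * P @ Pi, r_tilde)
print(q.reshape(n_s, n_a))                            # q_pi(s, a) as an n_s x n_a table
```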