
3.2 Optimal state values and optimal policies

While the ultimate goal of reinforcement learning is to obtain optimal policies, it is necessary to first define what an optimal policy is. The definition is based on state values. In particular, consider two given policies $\pi_1$ and $\pi_2$. If the state value of $\pi_1$ is greater than or equal to that of $\pi_2$ for every state:

v_{\pi_{1}}(s) \geq v_{\pi_{2}}(s), \quad \text{for all } s \in \mathcal{S},

then $\pi_1$ is said to be better than $\pi_2$. Furthermore, if a policy is better than all the other possible policies, then this policy is optimal. This is formally stated below.
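As a minimal numerical sketch of this comparison, one can solve the Bellman equation in matrix-vector form, $v_\pi = r_\pi + \gamma P_\pi v_\pi$, in closed form for each of two policies and then check the inequality elementwise. The two-state MDP below, including its rewards and transition matrices, is hypothetical and chosen only for illustration.

```python
# A minimal sketch: compare two policies by their state values.
# The 2-state MDP, rewards, and transitions below are hypothetical.
import numpy as np

gamma = 0.9  # discount rate

def state_values(P, r, gamma):
    """Solve the Bellman equation v = r + gamma * P v in closed form."""
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# Policy pi_1: always move to (or stay in) state 1, earning reward 1 each step.
P1 = np.array([[0.0, 1.0],
               [0.0, 1.0]])
r1 = np.array([1.0, 1.0])

# Policy pi_2: stay put in each state; only state 1 yields reward.
P2 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
r2 = np.array([0.0, 1.0])

v1 = state_values(P1, r1, gamma)
v2 = state_values(P2, r2, gamma)

# pi_1 is better than pi_2 iff v1(s) >= v2(s) for every state s.
print(v1, v2, np.all(v1 >= v2))
```

Here $v_{\pi_1} = (10, 10)$ and $v_{\pi_2} = (0, 10)$, so $\pi_1$ dominates $\pi_2$ at every state and is better than $\pi_2$ in the sense defined above.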

Definition 3.1 (Optimal policy and optimal state value). A policy $\pi^{*}$ is optimal if $v_{\pi^{*}}(s) \geq v_{\pi}(s)$ for all $s \in \mathcal{S}$ and for any other policy $\pi$. The state values of $\pi^{*}$ are the optimal state values.

The above definition indicates that an optimal policy has a state value no smaller than that of any other policy at every state. This definition also raises several questions:

\diamond Existence: Does the optimal policy exist? (A numerical preview follows this list.)
\diamond Uniqueness: Is the optimal policy unique?
\diamond Stochasticity: Is the optimal policy stochastic or deterministic?
\diamond Algorithm: How can we obtain the optimal policy and the optimal state values?
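As a small-scale preview of the existence question, the sketch below extends the hypothetical two-state MDP from the previous sketch with its full action set, enumerates every deterministic policy, and checks whether one of them satisfies Definition 3.1 at every state. This brute-force check is only an illustration; the general answers come later in the chapter.

```python
# A hedged sketch: in a small finite MDP, enumerate every deterministic
# policy, compute its state values, and check whether some policy
# dominates all others at every state. The MDP is hypothetical.
import itertools
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2

# P[s, a] is the next-state distribution; r[s, a] is the immediate reward.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
              [[1.0, 0.0], [0.0, 1.0]]])  # state 1: action 0 moves to state 0, action 1 stays
r = np.array([[0.0, 1.0],
              [0.0, 1.0]])

def state_values(policy):
    """State values of a deterministic policy (a tuple of actions, one per state)."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])
    r_pi = np.array([r[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

values = {pi: state_values(pi)
          for pi in itertools.product(range(n_actions), repeat=n_states)}

# A policy is optimal if its value is >= every other policy's value at every state.
for pi, v in values.items():
    if all(np.all(v >= v_other) for v_other in values.values()):
        print("optimal policy:", pi, "optimal state values:", v)
```

In this toy example, the deterministic policy $(1, 1)$ attains the greatest state value at every state simultaneously, which is exactly the elementwise requirement of Definition 3.1.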

These fundamental questions must be answered clearly if we are to thoroughly understand optimal policies. For example, regarding existence: if optimal policies did not exist, there would be no need to design algorithms for finding them.

We will answer all these questions in the remainder of this chapter.