
3.2 Optimal state values and optimal policies

While the ultimate goal of reinforcement learning is to obtain optimal policies, it is necessary to first define what an optimal policy is. The definition is based on state values. In particular, consider two given policies $\pi_1$ and $\pi_2$. If the state value of $\pi_1$ is greater than or equal to that of $\pi_2$ for every state:

v_{\pi_{1}}(s) \geq v_{\pi_{2}}(s), \quad \text{for all } s \in \mathcal{S},

then $\pi_1$ is said to be better than $\pi_2$. Furthermore, if a policy is better than all the other possible policies, then this policy is optimal. This is formally stated below.
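As a minimal numerical sketch of this comparison, one can solve the Bellman equation in matrix-vector form, $v_\pi = r_\pi + \gamma P_\pi v_\pi$, in closed form for each of two policies and then check the inequality elementwise. The two-state MDP below, including its rewards and transition matrices, is hypothetical and chosen only for illustration.

```python
# A minimal sketch: compare two policies by their state values.
# The 2-state MDP, rewards, and transitions below are hypothetical.
import numpy as np

gamma = 0.9  # discount rate

def state_values(P, r, gamma):
    """Solve the Bellman equation v = r + gamma * P v in closed form."""
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# Policy pi_1: always move to (or stay in) state 1, earning reward 1 each step.
P1 = np.array([[0.0, 1.0],
               [0.0, 1.0]])
r1 = np.array([1.0, 1.0])

# Policy pi_2: stay put in each state; only state 1 yields reward.
P2 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
r2 = np.array([0.0, 1.0])

v1 = state_values(P1, r1, gamma)
v2 = state_values(P2, r2, gamma)

# pi_1 is better than pi_2 iff v1(s) >= v2(s) for every state s.
print(v1, v2, np.all(v1 >= v2))
```

Here $v_{\pi_1} = (10, 10)$ and $v_{\pi_2} = (0, 10)$, so $\pi_1$ dominates $\pi_2$ at every state and is better than $\pi_2$ in the sense defined above.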

Definition 3.1 (Optimal policy and optimal state value). A policy $\pi^{*}$ is optimal if $v_{\pi^{*}}(s) \geq v_{\pi}(s)$ for all $s \in \mathcal{S}$ and for any other policy $\pi$. The state values of $\pi^{*}$ are the optimal state values.

The above definition indicates that an optimal policy has a state value no smaller than that of any other policy at every state. This definition also raises several questions:

\diamond Existence: Does the optimal policy exist? (A numerical preview follows this list.)
\diamond Uniqueness: Is the optimal policy unique?
\diamond Stochasticity: Is the optimal policy stochastic or deterministic?
\diamond Algorithm: How can we obtain the optimal policy and the optimal state values?
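As a small-scale preview of the existence question, the sketch below extends the hypothetical two-state MDP from the previous sketch with its full action set, enumerates every deterministic policy, and checks whether one of them satisfies Definition 3.1 at every state. This brute-force check is only an illustration; the general answers come later in the chapter.

```python
# A hedged sketch: in a small finite MDP, enumerate every deterministic
# policy, compute its state values, and check whether some policy
# dominates all others at every state. The MDP is hypothetical.
import itertools
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2

# P[s, a] is the next-state distribution; r[s, a] is the immediate reward.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
              [[1.0, 0.0], [0.0, 1.0]]])  # state 1: action 0 moves to state 0, action 1 stays
r = np.array([[0.0, 1.0],
              [0.0, 1.0]])

def state_values(policy):
    """State values of a deterministic policy (a tuple of actions, one per state)."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])
    r_pi = np.array([r[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

values = {pi: state_values(pi)
          for pi in itertools.product(range(n_actions), repeat=n_states)}

# A policy is optimal if its value is >= every other policy's value at every state.
for pi, v in values.items():
    if all(np.all(v >= v_other) for v_other in values.values()):
        print("optimal policy:", pi, "optimal state values:", v)
```

In this toy example, the deterministic policy $(1, 1)$ attains the greatest state value at every state simultaneously, which is exactly the elementwise requirement of Definition 3.1.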

These fundamental questions must be answered clearly if we are to thoroughly understand optimal policies. For example, regarding existence: if optimal policies did not exist, there would be no need to design algorithms for finding them.

We will answer all these questions in the remainder of this chapter.