2.3 State values

We mentioned that returns can be used to evaluate policies. However, returns are inadequate for evaluating policies in stochastic systems, because starting from the same state may lead to different returns. Motivated by this problem, we introduce the concept of state value in this section.

First, we need to introduce some necessary notations. Consider a sequence of time steps $t = 0, 1, 2, \ldots$. At time $t$, the agent is in state $S_t$, and the action taken following a policy $\pi$ is $A_t$. The next state is $S_{t+1}$, and the immediate reward obtained is $R_{t+1}$. This process can be expressed concisely as

$$S_t \xrightarrow{A_t} S_{t+1}, R_{t+1}.$$

Note that $S_t$, $S_{t+1}$, $A_t$, and $R_{t+1}$ are all random variables. Moreover, $S_t, S_{t+1} \in \mathcal{S}$, $A_t \in \mathcal{A}(S_t)$, and $R_{t+1} \in \mathcal{R}(S_t, A_t)$.

Starting from $t$, we can obtain a state-action-reward trajectory:

$$S_t \xrightarrow{A_t} S_{t+1}, R_{t+1} \xrightarrow{A_{t+1}} S_{t+2}, R_{t+2} \xrightarrow{A_{t+2}} S_{t+3}, R_{t+3} \cdots.$$
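To make the notation concrete, the following sketch samples such a trajectory in a tiny, made-up MDP. The states, actions, rewards, and transition probabilities below are illustrative assumptions, not part of the text.

```python
import random

# A hypothetical two-state MDP used only for illustration.
STATES = ["s1", "s2"]
ACTIONS = ["left", "right"]

def policy(state):
    """A stochastic policy pi(a|s): here, actions are chosen uniformly."""
    return random.choice(ACTIONS)

def step(state, action):
    """A stochastic model: sample the next state and the immediate reward."""
    next_state = random.choice(STATES)
    reward = 1.0 if next_state == "s1" else 0.0
    return next_state, reward

def sample_trajectory(start_state, length=5):
    """Generate (S_t, A_t, R_{t+1}, S_{t+1}) tuples by following the policy."""
    trajectory = []
    state = start_state
    for _ in range(length):
        action = policy(state)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory

print(sample_trajectory("s1"))
```

Because both the policy and the model in this sketch are stochastic, running it twice generally produces two different trajectories starting from the same state.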

By definition, the discounted return along the trajectory is

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots,$$

where $\gamma \in (0,1)$ is the discount rate. Note that $G_t$ is a random variable since $R_{t+1}, R_{t+2}, \ldots$ are all random variables.
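As a small numerical illustration, the sketch below computes a truncated version of this sum from a finite list of sampled rewards; the reward values and the choice $\gamma = 0.9$ are assumptions made for the example only.

```python
def discounted_return(rewards, gamma=0.9):
    """Truncated discounted return: G_t ~= R_{t+1} + gamma*R_{t+2} + ...,
    computed over a finite list of observed rewards."""
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

# Rewards observed along one hypothetical trajectory.
print(discounted_return([0.0, 0.0, 1.0, 1.0]))  # 0.9**2 + 0.9**3 = 1.539
```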

Since $G_t$ is a random variable, we can calculate its expected value (also called the expectation or mean):

$$v_\pi(s) \doteq \mathbb{E}[G_t \mid S_t = s].$$

Here, $v_\pi(s)$ is called the state-value function or simply the state value of $s$. Some important remarks are given below.

\diamond $v_\pi(s)$ depends on $s$. This is because its definition is a conditional expectation with the condition that the agent starts from $S_t = s$.
\diamond $v_\pi(s)$ depends on $\pi$. This is because the trajectories are generated by following the policy $\pi$. For a different policy, the state value may be different.
\diamond $v_\pi(s)$ does not depend on $t$. If the agent moves in the state space, $t$ represents the current time step. The value of $v_\pi(s)$ is determined once the policy is given.

The relationship between state values and returns is further clarified as follows. When both the policy and the system model are deterministic, starting from a state always leads to the same trajectory. In this case, the return obtained starting from a state is equal to the value of that state. By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories. In this case, the returns of different trajectories are different, and the state value is the mean of these returns.
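A minimal sketch of this interpretation, assuming the same style of made-up two-state MDP as above: the state value of $s$ is estimated by averaging the (truncated) returns of many sampled trajectories that all start from $s$. All names and numbers here are illustrative assumptions.

```python
import random

# A hypothetical two-state MDP with a stochastic policy and model.
STATES = ["s1", "s2"]
ACTIONS = ["left", "right"]
GAMMA = 0.9

def policy(state):
    return random.choice(ACTIONS)        # stochastic policy pi(a|s)

def step(state, action):
    next_state = random.choice(STATES)   # stochastic model
    reward = 1.0 if next_state == "s1" else 0.0
    return next_state, reward

def sample_return(start_state, horizon=200):
    """Follow the policy from start_state and compute the truncated
    discounted return of one sampled trajectory."""
    g, discount, state = 0.0, 1.0, start_state
    for _ in range(horizon):
        action = policy(state)
        state, reward = step(state, action)
        g += discount * reward
        discount *= GAMMA
    return g

def estimate_state_value(start_state, episodes=10_000):
    """Estimate v_pi(s) as the mean of the returns of many trajectories
    that all start from s."""
    returns = [sample_return(start_state) for _ in range(episodes)]
    return sum(returns) / len(returns)

print(estimate_state_value("s1"))
```

Because the policy and the model are stochastic, individual returns differ from run to run; only their average settles down, which is exactly what $v_\pi(s)$ captures.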

Although returns can be used to evaluate policies as shown in Section 2.1, it is more formal to use state values to evaluate policies: policies that generate greater state values are better. Therefore, state values constitute a core concept in reinforcement learning. While state values are important, a question that immediately follows is how to calculate them. This question is answered in the next section.