2.3 State values

We mentioned that returns can be used to evaluate policies. However, returns are inadequate for evaluating policies in stochastic systems, because starting from the same state may lead to different returns. Motivated by this problem, we introduce the concept of state value in this section.

First, we need to introduce some necessary notations. Consider a sequence of time steps $t = 0, 1, 2, \ldots$. At time $t$, the agent is in state $S_t$, and the action taken following a policy $\pi$ is $A_t$. The next state is $S_{t+1}$, and the immediate reward obtained is $R_{t+1}$. This process can be expressed concisely as

$$S_t \xrightarrow{A_t} S_{t+1}, R_{t+1}.$$

Note that $S_t$, $S_{t+1}$, $A_t$, and $R_{t+1}$ are all random variables. Moreover, $S_t, S_{t+1} \in \mathcal{S}$, $A_t \in \mathcal{A}(S_t)$, and $R_{t+1} \in \mathcal{R}(S_t, A_t)$.

Starting from $t$, we can obtain a state-action-reward trajectory:

$$S_t \xrightarrow{A_t} S_{t+1}, R_{t+1} \xrightarrow{A_{t+1}} S_{t+2}, R_{t+2} \xrightarrow{A_{t+2}} S_{t+3}, R_{t+3} \cdots.$$
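To make the notation concrete, the following sketch samples such a trajectory in a tiny, made-up MDP. The states, actions, rewards, and transition probabilities below are illustrative assumptions, not part of the text.

```python
import random

# A hypothetical two-state MDP used only for illustration.
STATES = ["s1", "s2"]
ACTIONS = ["left", "right"]

def policy(state):
    """A stochastic policy pi(a|s): here, actions are chosen uniformly."""
    return random.choice(ACTIONS)

def step(state, action):
    """A stochastic model: sample the next state and the immediate reward."""
    next_state = random.choice(STATES)
    reward = 1.0 if next_state == "s1" else 0.0
    return next_state, reward

def sample_trajectory(start_state, length=5):
    """Generate (S_t, A_t, R_{t+1}, S_{t+1}) tuples by following the policy."""
    trajectory = []
    state = start_state
    for _ in range(length):
        action = policy(state)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory

print(sample_trajectory("s1"))
```

Because both the policy and the model in this sketch are stochastic, running it twice generally produces two different trajectories starting from the same state.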

By definition, the discounted return along the trajectory is

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots,$$

where $\gamma \in (0,1)$ is the discount rate. Note that $G_t$ is a random variable since $R_{t+1}, R_{t+2}, \ldots$ are all random variables.
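As a small numerical illustration, the sketch below computes a truncated version of this sum from a finite list of sampled rewards; the reward values and the choice $\gamma = 0.9$ are assumptions made for the example only.

```python
def discounted_return(rewards, gamma=0.9):
    """Truncated discounted return: G_t ~= R_{t+1} + gamma*R_{t+2} + ...,
    computed over a finite list of observed rewards."""
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

# Rewards observed along one hypothetical trajectory.
print(discounted_return([0.0, 0.0, 1.0, 1.0]))  # 0.9**2 + 0.9**3 = 1.539
```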

Since $G_t$ is a random variable, we can calculate its expected value (also called the expectation or mean):

$$v_\pi(s) \doteq \mathbb{E}[G_t \mid S_t = s].$$

Here, $v_\pi(s)$ is called the state-value function or simply the state value of $s$. Some important remarks are given below.

\diamond $v_\pi(s)$ depends on $s$. This is because its definition is a conditional expectation with the condition that the agent starts from $S_t = s$.
\diamond $v_\pi(s)$ depends on $\pi$. This is because the trajectories are generated by following the policy $\pi$. For a different policy, the state value may be different.
\diamond $v_\pi(s)$ does not depend on $t$. If the agent moves in the state space, $t$ represents the current time step. The value of $v_\pi(s)$ is determined once the policy is given.

The relationship between state values and returns is further clarified as follows. When both the policy and the system model are deterministic, starting from a state always leads to the same trajectory. In this case, the return obtained starting from a state is equal to the value of that state. By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories. In this case, the returns of different trajectories are different, and the state value is the mean of these returns.
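A minimal sketch of this interpretation, assuming the same style of made-up two-state MDP as above: the state value of $s$ is estimated by averaging the (truncated) returns of many sampled trajectories that all start from $s$. All names and numbers here are illustrative assumptions.

```python
import random

# A hypothetical two-state MDP with a stochastic policy and model.
STATES = ["s1", "s2"]
ACTIONS = ["left", "right"]
GAMMA = 0.9

def policy(state):
    return random.choice(ACTIONS)        # stochastic policy pi(a|s)

def step(state, action):
    next_state = random.choice(STATES)   # stochastic model
    reward = 1.0 if next_state == "s1" else 0.0
    return next_state, reward

def sample_return(start_state, horizon=200):
    """Follow the policy from start_state and compute the truncated
    discounted return of one sampled trajectory."""
    g, discount, state = 0.0, 1.0, start_state
    for _ in range(horizon):
        action = policy(state)
        state, reward = step(state, action)
        g += discount * reward
        discount *= GAMMA
    return g

def estimate_state_value(start_state, episodes=10_000):
    """Estimate v_pi(s) as the mean of the returns of many trajectories
    that all start from s."""
    returns = [sample_return(start_state) for _ in range(episodes)]
    return sum(returns) / len(returns)

print(estimate_state_value("s1"))
```

Because the policy and the model are stochastic, individual returns differ from run to run; only their average settles down, which is exactly what $v_\pi(s)$ captures.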

Although returns can be used to evaluate policies as shown in Section 2.1, it is more formal to use state values to evaluate policies: policies that generate greater state values are better. Therefore, state values constitute a core concept in reinforcement learning. While state values are important, a question that immediately follows is how to calculate them. This question is answered in the next section.