1.5 Reward

Reward is one of the most distinctive concepts in reinforcement learning.

After executing an action at a state, the agent obtains a reward, denoted as $r$, as feedback from the environment. The reward is a function of the state $s$ and the action $a$. Hence, it is also denoted as $r(s, a)$. Its value can be a positive or negative real number or zero. Different rewards have different impacts on the policy that the agent would eventually learn. Generally speaking, with a positive reward, we encourage the agent to take the corresponding action. With a negative reward, we discourage the agent from taking that action.

In the grid world example, the rewards are designed as follows:

If the agent attempts to exit the boundary, let $r_{\mathrm{boundary}} = -1$.
If the agent attempts to enter a forbidden cell, let $r_{\mathrm{forbidden}} = -1$.
If the agent reaches the target state, let $r_{\mathrm{target}} = +1$.
Otherwise, the agent obtains a reward of $r_{\mathrm{other}} = 0$.
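These rules can be captured by a small function. The sketch below is only illustrative: it assumes a 3x3 grid whose states $s_1, \dots, s_9$ are numbered row by row with $s_9$ as the target, a hypothetical set of forbidden cells, and the convention that actions $a_1, \dots, a_5$ mean moving up, right, down, left, and staying in place; none of these layout details are fixed by the reward design itself.

```python
# A minimal sketch of the reward function r(s, a) for a 3x3 grid world.
# Assumptions (not specified by the text): states s1..s9 are laid out row by
# row with s1 at the top-left and s9 (bottom-right) as the target; the
# forbidden cells below are chosen purely for illustration; actions a1..a5
# are taken to mean up, right, down, left, and stay.

R_BOUNDARY, R_FORBIDDEN, R_TARGET, R_OTHER = -1, -1, +1, 0

N = 3                                   # grid is N x N
TARGET = 9                              # target state s9
FORBIDDEN = {6, 7}                      # hypothetical forbidden cells

MOVES = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1), 5: (0, 0)}  # a1..a5

def reward(s, a):
    """Immediate reward for taking action a (1..5) at state s (1..9)."""
    row, col = divmod(s - 1, N)         # state index -> grid coordinates
    dr, dc = MOVES[a]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < N and 0 <= nc < N):
        return R_BOUNDARY               # attempt to exit the boundary
    s_next = nr * N + nc + 1
    if s_next in FORBIDDEN:
        return R_FORBIDDEN              # attempt to enter a forbidden cell
    if s_next == TARGET:
        return R_TARGET                 # reach (or stay at) the target
    return R_OTHER

print(reward(9, 5))  # stay at the target s9: +1
print(reward(9, 2))  # move right at s9, exits the boundary: -1
print(reward(1, 1))  # move up at s1, exits the boundary: -1
```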

Special attention should be given to the target state $s_9$. The reward process does not have to terminate after the agent reaches $s_9$. If the agent takes action $a_5$ at $s_9$, the next state is again $s_9$, and the reward is $r_{\mathrm{target}} = +1$. If the agent takes action $a_2$ at $s_9$, which attempts to move outside the boundary, the next state is also $s_9$, but the reward is $r_{\mathrm{boundary}} = -1$.

A reward can be interpreted as a human-machine interface, with which we can guide the agent to behave as we expect. For example, with the rewards designed above, we can expect that the agent tends to avoid exiting the boundary or stepping into the forbidden cells. Designing appropriate rewards is an important step in reinforcement learning. This step is, however, nontrivial for complex tasks since it may require the user to understand the given problem well. Nevertheless, it may still be much easier than solving the problem with other approaches that require a professional background or a deep understanding of the given problem.

The process of getting a reward after executing an action can be intuitively represented as a table, as shown in Table 1.3. Each row of the table corresponds to a state, and each column corresponds to an action. The value in each cell of the table indicates the reward that can be obtained by taking an action at a state.
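In code, such a table corresponds to a mapping from state-action pairs (rows and columns) to reward values. The fragment below is not the full table: only the three entries discussed explicitly in this section are filled in, and unlisted pairs default to $r_{\mathrm{other}} = 0$ as a sketch convenience.

```python
# A fragment of the deterministic reward table in the style of Table 1.3:
# keys are (state, action) pairs, values are immediate rewards.
reward_table = {
    (1, 1): -1,   # taking a1 at s1 yields r = -1 with certainty (see below)
    (9, 5): +1,   # taking a5 at s9 stays at the target
    (9, 2): -1,   # taking a2 at s9 attempts to exit the boundary
}

def table_reward(s, a):
    # Unlisted pairs default to r_other = 0 in this sketch.
    return reward_table.get((s, a), 0)
```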

One question that beginners may ask is as follows: if given the table of rewards, can we find good policies by simply selecting the actions with the greatest rewards? The answer is no. That is because these rewards are immediate rewards that can be obtained after taking an action. To determine a good policy, we must consider the total reward obtained in the long run (see Section 1.6 for more information). An action with the greatest immediate reward may not lead to the greatest total reward.
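A toy numerical comparison (with made-up numbers, unrelated to any specific grid layout) makes the point: the trajectory whose first action has the larger immediate reward can still end up with the smaller total reward.

```python
# Two made-up reward sequences for two candidate trajectories from the same
# start state. Trajectory A is greedy: its first action has the larger
# immediate reward. Trajectory B sacrifices the first step but collects more
# reward overall.
trajectory_a = [+1, 0, 0, 0]   # greedy first step, nothing afterwards
trajectory_b = [0, 0, +1, +1]  # no immediate payoff, better in the long run

print(sum(trajectory_a))  # total reward: 1
print(sum(trajectory_b))  # total reward: 2
```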

Although intuitive, the tabular representation is only able to describe deterministic reward processes. A more general approach is to use conditional probabilities $p(r|s,a)$ to describe reward processes. For example, for state $s_1$, we have

$$p(r = -1 \mid s_1, a_1) = 1, \quad p(r \neq -1 \mid s_1, a_1) = 0.$$

Table 1.3: A tabular representation of the process of obtaining rewards. Here, the process is deterministic. Each cell indicates how much reward can be obtained after the agent takes an action at a given state.

This indicates that, when taking a1a_1 at s1s_1 , the agent obtains r=1r = -1 with certainty. In this example, the reward process is deterministic. In general, it can be stochastic. For example, if a student studies hard, he or she would receive a positive reward (e.g., higher grades on exams), but the specific value of the reward may be uncertain.
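The conditional-probability description covers the stochastic case directly: each state-action pair maps to a distribution over reward values, and the reward is sampled from it. The sketch below mirrors the deterministic equation above for $(s_1, a_1)$; the second entry is purely hypothetical and only illustrates a stochastic reward.

```python
# A minimal sketch of a stochastic reward model p(r | s, a): each (state,
# action) pair maps to a distribution over reward values rather than to a
# single number.
import random

reward_dist = {
    (1, 1): {-1: 1.0},            # p(r = -1 | s1, a1) = 1, as in the text
    (2, 3): {0: 0.8, +1: 0.2},    # hypothetical stochastic reward
}

def sample_reward(s, a):
    """Draw one reward from p(r | s, a)."""
    dist = reward_dist[(s, a)]
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

print(sample_reward(1, 1))   # always -1
print(sample_reward(2, 3))   # 0 most of the time, +1 occasionally
```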