1.5 Reward

Reward is one of the most distinctive concepts in reinforcement learning.

After executing an action at a state, the agent obtains a reward, denoted as $r$, as feedback from the environment. The reward is a function of the state $s$ and the action $a$. Hence, it is also denoted as $r(s, a)$. Its value can be a positive or negative real number or zero. Different rewards have different impacts on the policy that the agent would eventually learn. Generally speaking, with a positive reward, we encourage the agent to take the corresponding action. With a negative reward, we discourage the agent from taking that action.

In the grid world example, the rewards are designed as follows:

If the agent attempts to exit the boundary, let $r_{\mathrm{boundary}} = -1$.
If the agent attempts to enter a forbidden cell, let $r_{\mathrm{forbidden}} = -1$.
If the agent reaches the target state, let $r_{\mathrm{target}} = +1$.
Otherwise, the agent obtains a reward of $r_{\mathrm{other}} = 0$.
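These rules can be captured by a small function. The sketch below is only illustrative: it assumes a 3x3 grid whose states $s_1, \dots, s_9$ are numbered row by row with $s_9$ as the target, a hypothetical set of forbidden cells, and the convention that actions $a_1, \dots, a_5$ mean moving up, right, down, left, and staying in place; none of these layout details are fixed by the reward design itself.

```python
# A minimal sketch of the reward function r(s, a) for a 3x3 grid world.
# Assumptions (not specified by the text): states s1..s9 are laid out row by
# row with s1 at the top-left and s9 (bottom-right) as the target; the
# forbidden cells below are chosen purely for illustration; actions a1..a5
# are taken to mean up, right, down, left, and stay.

R_BOUNDARY, R_FORBIDDEN, R_TARGET, R_OTHER = -1, -1, +1, 0

N = 3                                   # grid is N x N
TARGET = 9                              # target state s9
FORBIDDEN = {6, 7}                      # hypothetical forbidden cells

MOVES = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1), 5: (0, 0)}  # a1..a5

def reward(s, a):
    """Immediate reward for taking action a (1..5) at state s (1..9)."""
    row, col = divmod(s - 1, N)         # state index -> grid coordinates
    dr, dc = MOVES[a]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < N and 0 <= nc < N):
        return R_BOUNDARY               # attempt to exit the boundary
    s_next = nr * N + nc + 1
    if s_next in FORBIDDEN:
        return R_FORBIDDEN              # attempt to enter a forbidden cell
    if s_next == TARGET:
        return R_TARGET                 # reach (or stay at) the target
    return R_OTHER

print(reward(9, 5))  # stay at the target s9: +1
print(reward(9, 2))  # move right at s9, exits the boundary: -1
print(reward(1, 1))  # move up at s1, exits the boundary: -1
```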

Special attention should be given to the target state $s_9$. The reward process does not have to terminate after the agent reaches $s_9$. If the agent takes action $a_5$ at $s_9$, the next state is again $s_9$, and the reward is $r_{\mathrm{target}} = +1$. If the agent takes action $a_2$ at $s_9$, which attempts to move outside the boundary, the next state is also $s_9$, but the reward is $r_{\mathrm{boundary}} = -1$.

A reward can be interpreted as a human-machine interface, with which we can guide the agent to behave as we expect. For example, with the rewards designed above, we can expect that the agent tends to avoid exiting the boundary or stepping into the forbidden cells. Designing appropriate rewards is an important step in reinforcement learning. This step is, however, nontrivial for complex tasks since it may require the user to understand the given problem well. Nevertheless, it may still be much easier than solving the problem with other approaches that require a professional background or a deep understanding of the given problem.

The process of getting a reward after executing an action can be intuitively represented as a table, as shown in Table 1.3. Each row of the table corresponds to a state, and each column corresponds to an action. The value in each cell of the table indicates the reward that can be obtained by taking an action at a state.
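In code, such a table corresponds to a mapping from state-action pairs (rows and columns) to reward values. The fragment below is not the full table: only the three entries discussed explicitly in this section are filled in, and unlisted pairs default to $r_{\mathrm{other}} = 0$ as a sketch convenience.

```python
# A fragment of the deterministic reward table in the style of Table 1.3:
# keys are (state, action) pairs, values are immediate rewards.
reward_table = {
    (1, 1): -1,   # taking a1 at s1 yields r = -1 with certainty (see below)
    (9, 5): +1,   # taking a5 at s9 stays at the target
    (9, 2): -1,   # taking a2 at s9 attempts to exit the boundary
}

def table_reward(s, a):
    # Unlisted pairs default to r_other = 0 in this sketch.
    return reward_table.get((s, a), 0)
```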

One question that beginners may ask is as follows: if given the table of rewards, can we find good policies by simply selecting the actions with the greatest rewards? The answer is no. That is because these rewards are immediate rewards that can be obtained after taking an action. To determine a good policy, we must consider the total reward obtained in the long run (see Section 1.6 for more information). An action with the greatest immediate reward may not lead to the greatest total reward.
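A toy numerical comparison (with made-up numbers, unrelated to any specific grid layout) makes the point: the trajectory whose first action has the larger immediate reward can still end up with the smaller total reward.

```python
# Two made-up reward sequences for two candidate trajectories from the same
# start state. Trajectory A is greedy: its first action has the larger
# immediate reward. Trajectory B sacrifices the first step but collects more
# reward overall.
trajectory_a = [+1, 0, 0, 0]   # greedy first step, nothing afterwards
trajectory_b = [0, 0, +1, +1]  # no immediate payoff, better in the long run

print(sum(trajectory_a))  # total reward: 1
print(sum(trajectory_b))  # total reward: 2
```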

Although intuitive, the tabular representation is only able to describe deterministic reward processes. A more general approach is to use conditional probabilities $p(r|s,a)$ to describe reward processes. For example, for state $s_1$, we have

$$p(r = -1 \mid s_1, a_1) = 1, \quad p(r \neq -1 \mid s_1, a_1) = 0.$$

Table 1.3: A tabular representation of the process of obtaining rewards. Here, the process is deterministic. Each cell indicates how much reward can be obtained after the agent takes an action at a given state.

This indicates that, when taking a1a_1 at s1s_1 , the agent obtains r=1r = -1 with certainty. In this example, the reward process is deterministic. In general, it can be stochastic. For example, if a student studies hard, he or she would receive a positive reward (e.g., higher grades on exams), but the specific value of the reward may be uncertain.
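The conditional-probability description covers the stochastic case directly: each state-action pair maps to a distribution over reward values, and the reward is sampled from it. The sketch below mirrors the deterministic equation above for $(s_1, a_1)$; the second entry is purely hypothetical and only illustrates a stochastic reward.

```python
# A minimal sketch of a stochastic reward model p(r | s, a): each (state,
# action) pair maps to a distribution over reward values rather than to a
# single number.
import random

reward_dist = {
    (1, 1): {-1: 1.0},            # p(r = -1 | s1, a1) = 1, as in the text
    (2, 3): {0: 0.8, +1: 0.2},    # hypothetical stochastic reward
}

def sample_reward(s, a):
    """Draw one reward from p(r | s, a)."""
    dist = reward_dist[(s, a)]
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

print(sample_reward(1, 1))   # always -1
print(sample_reward(2, 3))   # 0 most of the time, +1 occasionally
```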