
3.5 Factors that influence optimal policies

The BOE is a powerful tool for analyzing optimal policies. We next apply it to study which factors influence optimal policies. This question can be answered by examining the elementwise expression of the BOE:

$$v(s) = \max_{\pi(s) \in \Pi(s)} \sum_{a \in \mathcal{A}} \pi(a|s) \left( \sum_{r \in \mathcal{R}} p(r|s,a)\, r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v(s') \right), \quad s \in \mathcal{S}.$$

The optimal state value and optimal policy are determined by the following parameters: 1) the immediate reward $r$, 2) the discount rate $\gamma$, and 3) the system model $p(s'|s,a)$ and $p(r|s,a)$. While the system model is fixed, we next discuss how the optimal policy varies when we change the values of $r$ and $\gamma$. All the optimal policies presented in this section can be obtained via the algorithm in Theorem 3.3. The implementation details of the algorithm will be given in Chapter 4; the present chapter mainly focuses on the fundamental properties of optimal policies.
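
For readers who wish to reproduce the policies discussed below before reaching Chapter 4, the following is a minimal sketch of the kind of fixed-point iteration suggested by Theorem 3.3, assuming the model is given as NumPy arrays `P[s, a, s']` and `R[s, a]`; the function name `solve_boe` and the array layout are illustrative choices, not notation from the book.

```python
import numpy as np

def solve_boe(P, R, gamma, tol=1e-8):
    """Solve v = max_pi (r_pi + gamma * P_pi v) for a tabular MDP by fixed-point iteration.

    P: array of shape (S, A, S), P[s, a, s'] = p(s' | s, a)
    R: array of shape (S, A),    R[s, a]     = expected immediate reward sum_r p(r | s, a) * r
    Returns the optimal state values and a greedy deterministic policy.
    """
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * P @ v     # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        v_new = q.max(axis=1)     # greedy improvement over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)
        v = v_new
```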

A baseline example

Consider the example in Figure 3.4. The reward settings are $r_{\mathrm{boundary}} = r_{\mathrm{forbidden}} = -1$ and $r_{\mathrm{target}} = 1$. In addition, the agent receives a reward of $r_{\mathrm{other}} = 0$ for every movement step. The discount rate is selected as $\gamma = 0.9$.

With the above parameters, the optimal policy and optimal state values are given in Figure 3.4(a). It is interesting that the agent is not afraid of passing through forbidden areas to reach the target area. More specifically, starting from the state at (row $= 4$, column $= 1$), the agent has two options for reaching the target area. The first option is to avoid all the forbidden areas and travel a long distance to the target area. The second option is to pass through forbidden areas. Although the agent obtains negative rewards when entering forbidden areas, the cumulative reward of the second trajectory is greater than that of the first. Therefore, the optimal policy is far-sighted due to the relatively large value of $\gamma$.
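
To make the comparison concrete, suppose the risky route enters a forbidden cell once, receiving $-1$ on its $j$-th move, and enters the target on its $k_1$-th move, while the safe route collects only zero rewards and enters the target on its $k_2$-th move with $k_2 > k_1$; the exact step counts depend on the grid in Figure 3.4(a), so $j$, $k_1$, and $k_2$ are introduced here only for illustration. Once the target is reached, the agent stays there and keeps receiving $+1$. The two discounted returns are

$$\text{return}_{\text{risky}} = -\gamma^{\,j-1} + \frac{\gamma^{\,k_1-1}}{1-\gamma}, \qquad \text{return}_{\text{safe}} = \frac{\gamma^{\,k_2-1}}{1-\gamma}.$$

With $\gamma = 0.9$, the one-time penalty costs at most $1$, whereas each extra move before reaching the target multiplies the tail term by $\gamma$ and thus removes roughly $1$ from the return (for the first few moves). A detour of several extra moves therefore loses more than a single forbidden-cell entry, which is why the far-sighted policy cuts through the forbidden area.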

Impact of the discount rate

If we change the discount rate from $\gamma = 0.9$ to $\gamma = 0.5$ and keep other parameters unchanged, the optimal policy becomes the one shown in Figure 3.4(b). It is interesting that the agent does not dare to take risks anymore. Instead, it would travel a long distance to reach the target while avoiding all the forbidden areas. This is because the optimal policy becomes short-sighted due to the relatively small value of $\gamma$.

In the extreme case where $\gamma = 0$, the corresponding optimal policy is shown in Figure 3.4(c). In this case, the agent is not able to reach the target area. This is


(a) Baseline example: $r_{\mathrm{boundary}} = r_{\mathrm{forbidden}} = -1$, $r_{\mathrm{target}} = 1$, $\gamma = 0.9$.

(b) The discount rate is changed to $\gamma = 0.5$. The other parameters are the same as those in (a).

(c) The discount rate is changed to $\gamma = 0$. The other parameters are the same as those in (a).

(d) $r_{\mathrm{forbidden}}$ is changed from $-1$ to $-10$. The other parameters are the same as those in (a).

Figure 3.4: The optimal policies and optimal state values given different parameter values.

because the optimal policy for each state is extremely short-sighted and merely selects the action with the greatest immediate reward instead of the greatest total reward.
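
This behavior can be read off directly from the elementwise BOE: setting $\gamma = 0$ removes the term involving $v(s')$, so the equation decouples across states and the optimal action at each state simply maximizes the expected immediate reward,

$$v(s) = \max_{\pi(s) \in \Pi(s)} \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s,a)\, r = \max_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} p(r|s,a)\, r, \quad s \in \mathcal{S}.$$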

In addition, the spatial distribution of the state values exhibits an interesting pattern: the states close to the target have greater state values, whereas those far away have lower values. This pattern can be observed in all the examples shown in Figure 3.4. It can be explained by the discount rate: if the agent must travel along a longer trajectory from a state to reach the target, that state's value is smaller due to discounting.
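
In fact, this pattern can be quantified under the baseline reward setting. Suppose that, starting from a state $s$, the optimal trajectory collects only zero rewards and enters the target on its $d(s)$-th move, after which the agent stays at the target and keeps receiving $+1$; the notation $d(s)$ is introduced here only for illustration. Then

$$v^*(s) = \underbrace{0 + \gamma \cdot 0 + \cdots}_{d(s)-1 \text{ zero rewards}} + \gamma^{\,d(s)-1} + \gamma^{\,d(s)} + \cdots = \frac{\gamma^{\,d(s)-1}}{1-\gamma},$$

which, with $\gamma = 0.9$, equals $10, 9, 8.1, \ldots$ for $d(s) = 1, 2, 3, \ldots$, so the optimal state value decays geometrically with the distance to the target.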

Impact of the reward values

If we want to strictly prohibit the agent from entering any forbidden area, we can increase the punishment received for doing so. For instance, if $r_{\text{forbidden}}$ is changed from $-1$ to $-10$, the resulting optimal policy avoids all the forbidden areas (see Figure 3.4(d)).

However, changing the rewards does not always lead to different optimal policies. One important fact is that optimal policies are invariant to positive affine transformations of the rewards. In other words, if we scale all the rewards by a positive factor or add the same constant to all of them, the optimal policy remains the same.

Theorem 3.6 (Optimal policy invariance). Consider a Markov decision process with $v^* \in \mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying $v^* = \max_{\pi \in \Pi}(r_\pi + \gamma P_\pi v^*)$. If every reward $r \in \mathcal{R}$ is changed by an affine transformation to $\alpha r + \beta$, where $\alpha, \beta \in \mathbb{R}$ and $\alpha > 0$, then the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:

$$v' = \alpha v^* + \frac{\beta}{1-\gamma}\,\mathbf{1}, \tag{3.8}$$

where $\gamma \in (0,1)$ is the discount rate and $\mathbf{1} = [1, \ldots, 1]^T$. Consequently, the optimal policy derived from $v'$ is invariant to the affine transformation of the reward values.

Box 3.5: Proof of Theorem 3.6

For any policy $\pi$, define $r_\pi = [\ldots, r_\pi(s), \ldots]^T$, where

$$r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s,a)\, r, \quad s \in \mathcal{S}.$$

If $r \to \alpha r + \beta$, then $r_\pi(s) \to \alpha r_\pi(s) + \beta$ and hence $r_\pi \to \alpha r_\pi + \beta \mathbf{1}$, where $\mathbf{1} = [1, \ldots, 1]^T$. In this case, the BOE becomes

$$v' = \max_{\pi \in \Pi} \left( \alpha r_\pi + \beta \mathbf{1} + \gamma P_\pi v' \right). \tag{3.9}$$

We next solve the new BOE in (3.9) by showing that $v' = \alpha v^* + c\mathbf{1}$ with $c = \beta/(1-\gamma)$ is a solution of (3.9). In particular, substituting $v' = \alpha v^* + c\mathbf{1}$ into (3.9) gives

$$\alpha v^* + c\mathbf{1} = \max_{\pi \in \Pi} \left( \alpha r_\pi + \beta \mathbf{1} + \gamma P_\pi (\alpha v^* + c\mathbf{1}) \right) = \max_{\pi \in \Pi} \left( \alpha r_\pi + \beta \mathbf{1} + \alpha \gamma P_\pi v^* + c\gamma \mathbf{1} \right),$$

where the last equality is due to the fact that $P_\pi \mathbf{1} = \mathbf{1}$. The above equation can be reorganized as

$$\alpha v^* = \max_{\pi \in \Pi} \left( \alpha r_\pi + \alpha \gamma P_\pi v^* \right) + \beta \mathbf{1} + c\gamma \mathbf{1} - c\mathbf{1},$$

which, since $\alpha > 0$ implies $\alpha v^* = \max_{\pi \in \Pi}(\alpha r_\pi + \alpha \gamma P_\pi v^*)$ by the original BOE, is equivalent to

$$\beta \mathbf{1} + c\gamma \mathbf{1} - c\mathbf{1} = 0.$$

Since $c = \beta/(1-\gamma)$, the above equation is valid, and hence $v' = \alpha v^* + c\mathbf{1}$ is a solution of (3.9). Since (3.9) is a BOE, its solution is unique, so $v'$ is the optimal state value of the transformed problem. Finally, since $v'$ is an affine transformation of $v^*$, the relative relationships between the action values remain the same. Hence, the greedy optimal policy derived from $v'$ is the same as that derived from $v^*$: $\arg\max_{\pi \in \Pi}(\alpha r_\pi + \beta\mathbf{1} + \gamma P_\pi v')$ is the same as $\arg\max_{\pi \in \Pi}(r_\pi + \gamma P_\pi v^*)$.

Readers may refer to [9] for a further discussion on the conditions under which modifications to the reward values preserve the optimal policy.
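
As a sanity check, Theorem 3.6 can also be verified numerically on a small randomly generated MDP. The following sketch solves the BOE by fixed-point iteration before and after an affine transformation of the rewards; the array layout and names are illustrative, not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

# Random tabular MDP: P[s, a, :] is a distribution over next states,
# R[s, a] is the expected immediate reward of taking action a in state s.
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a))

def optimal_values(R, P, gamma, iters=2000):
    """Fixed-point iteration on v <- max_a (R[s, a] + gamma * sum_s' P[s, a, s'] v[s'])."""
    v = np.zeros(R.shape[0])
    for _ in range(iters):
        v = (R + gamma * P @ v).max(axis=1)
    return v

alpha, beta = 2.0, -3.0
v_star = optimal_values(R, P, gamma)                   # original rewards
v_prime = optimal_values(alpha * R + beta, P, gamma)   # affinely transformed rewards

# Check Eq. (3.8): v' = alpha * v* + beta / (1 - gamma) * 1.
assert np.allclose(v_prime, alpha * v_star + beta / (1 - gamma))

# The greedy policies derived from v* and v' coincide.
pi_star = (R + gamma * P @ v_star).argmax(axis=1)
pi_prime = (alpha * R + beta + gamma * P @ v_prime).argmax(axis=1)
assert np.array_equal(pi_star, pi_prime)
```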

Avoiding meaningless detours

In the reward setting, the agent receives a reward of $r_{\text{other}} = 0$ for every movement step (unless it enters a forbidden area or the target area or attempts to go beyond the boundary). Since a zero reward is not a punishment, would the optimal policy take meaningless detours before reaching the target? Should we set $r_{\text{other}}$ to be negative to encourage the agent to reach the target as quickly as possible?


(a) Optimal policy


(b) Non-optimal policy
Figure 3.5: Examples illustrating that optimal policies do not take meaningless detours due to the discount rate.

Consider the examples in Figure 3.5, where the bottom-right cell is the target area to reach. The two policies here are the same except for state $s_2$. By the policy in Figure 3.5(a), the agent moves downward at $s_2$ and the resulting trajectory is $s_2 \rightarrow s_4$. By the policy in Figure 3.5(b), the agent moves leftward at $s_2$ and the resulting trajectory is $s_2 \rightarrow s_1 \rightarrow s_3 \rightarrow s_4$.

It is notable that the second policy takes a detour before reaching the target area. If we merely consider the immediate rewards, taking this detour does not matter because no negative immediate rewards will be obtained. However, if we consider the discounted return, then this detour matters. In particular, for the first policy, the discounted return is

$$\text{return} = 1 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = \frac{1}{1-\gamma} = 10.$$

As a comparison, the discounted return for the second policy is

$$\text{return} = 0 + \gamma \cdot 0 + \gamma^2 \cdot 1 + \gamma^3 \cdot 1 + \cdots = \frac{\gamma^2}{1-\gamma} = 8.1.$$

It is clear that the shorter the trajectory is, the greater the return is. Therefore, although the immediate reward of every step does not encourage the agent to approach the target as quickly as possible, the discount rate does encourage it to do so.
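
The same comparison can be reproduced with a few lines of code. The helper below evaluates the discounted return of a trajectory whose per-step rewards are all zero until the target is entered, after which $+1$ is received forever; the function name and arguments are illustrative.

```python
def detour_return(steps_to_target: int, gamma: float = 0.9) -> float:
    """Discounted return of a trajectory that collects 0 reward for
    (steps_to_target - 1) moves, then enters the target and stays there
    receiving +1 per step forever."""
    return gamma ** (steps_to_target - 1) / (1 - gamma)

print(detour_return(1))  # direct route s2 -> s4:              approximately 10.0
print(detour_return(3))  # detour s2 -> s1 -> s3 -> s4:        approximately 8.1
```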

A misunderstanding that beginners may have is that adding a negative reward (e.g., $-1$) on top of the reward obtained for every movement is necessary to encourage the agent to reach the target as quickly as possible. This is a misunderstanding because adding the same value to all rewards is an affine transformation, which preserves the optimal policy. Moreover, optimal policies do not take meaningless detours, thanks to the discount rate, even though a detour may not incur any negative immediate rewards.
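
For example, adding $\beta = -1$ to every reward (with $\alpha = 1$ and $\gamma = 0.9$) corresponds, by Theorem 3.6, to

$$v' = v^* + \frac{-1}{1-0.9}\,\mathbf{1} = v^* - 10 \cdot \mathbf{1},$$

a uniform shift of all state values that leaves their relative order, and hence the greedy optimal policy, unchanged.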