
1.4 Policy

A policy tells the agent which action to take at each state. Intuitively, a policy can be depicted by arrows (see Figure 1.4(a)). Following a policy, the agent can generate a trajectory starting from an initial state (see Figure 1.4(b)).


Figure 1.4: A policy represented by arrows and some trajectories obtained by starting from different initial states. (a) A deterministic policy. (b) Trajectories obtained from the policy.

Mathematically, policies can be described by conditional probabilities. Denote the policy in Figure 1.4 as $\pi(a \mid s)$, which is a conditional probability distribution defined for every state. For example, the policy for $s_1$ is

$$
\begin{aligned}
\pi(a_1 \mid s_1) &= 0, \\
\pi(a_2 \mid s_1) &= 1, \\
\pi(a_3 \mid s_1) &= 0, \\
\pi(a_4 \mid s_1) &= 0, \\
\pi(a_5 \mid s_1) &= 0,
\end{aligned}
$$

which indicates that the probability of taking action $a_2$ in state $s_1$ is one and the probabilities of taking the other actions are zero.
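As a minimal illustration (not from the book), such a conditional distribution $\pi(\cdot \mid s_1)$ can be stored as a probability vector over the five actions. The sketch below assumes NumPy and that index $i$ corresponds to action $a_{i+1}$.

```python
import numpy as np

# A minimal sketch: the deterministic policy pi(.|s1) above, stored as a
# probability vector over the five actions a1..a5 (index i <-> action a_{i+1}).
pi_s1 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

# A valid conditional distribution must sum to one.
assert np.isclose(pi_s1.sum(), 1.0)

# Because the policy is deterministic, sampling always returns a2.
action_index = np.random.choice(5, p=pi_s1)
print(f"action taken in s1: a{action_index + 1}")  # always a2
```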

The above policy is deterministic. In general, policies may be stochastic. For example, the policy shown in Figure 1.5 is stochastic: in state $s_1$, the agent may move either rightward or downward, each with probability 0.5. In this case, the policy for $s_1$ is

$$
\begin{aligned}
\pi(a_1 \mid s_1) &= 0, \\
\pi(a_2 \mid s_1) &= 0.5, \\
\pi(a_3 \mid s_1) &= 0.5, \\
\pi(a_4 \mid s_1) &= 0, \\
\pi(a_5 \mid s_1) &= 0.
\end{aligned}
$$


Figure 1.5: A stochastic policy. In state $s_1$, the agent may move rightward or downward, each with probability 0.5.
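Sampling actions from such a stochastic policy is straightforward. The sketch below, again assuming NumPy and the same action indexing, draws many actions from $\pi(\cdot \mid s_1)$ and checks that $a_2$ and $a_3$ each appear about half of the time.

```python
import numpy as np

# The stochastic policy pi(.|s1) of Figure 1.5:
# a2 (rightward) and a3 (downward) each have probability 0.5.
pi_s1 = np.array([0.0, 0.5, 0.5, 0.0, 0.0])

rng = np.random.default_rng(0)
samples = rng.choice(5, size=1000, p=pi_s1)

# Empirically, roughly half of the samples should be a2 and half a3.
for i in range(5):
    print(f"a{i + 1}: {np.mean(samples == i):.3f}")
```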

Policies represented by conditional probabilities can be stored as tables. For example, Table 1.2 represents the stochastic policy depicted in Figure 1.5. The entry in the $i$th row and $j$th column is the probability of taking the $j$th action at the $i$th state. Such a representation is called a tabular representation. We will introduce another way to represent policies, as parameterized functions, in Chapter 8.

Table 1.2: A tabular representation of a policy. Each entry indicates the probability of taking an action at a state.
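As a rough sketch of how such a table can be used in code, the snippet below stores a small hypothetical policy table and samples an action at a given state. Only the row for $s_1$ matches Figure 1.5; the other rows are made up for illustration, and the state and action counts are assumptions rather than values from the book.

```python
import numpy as np

# A hypothetical tabular policy: entry [i, j] is the probability of taking the
# (j+1)-th action in the (i+1)-th state, in the spirit of Table 1.2.
# Only the first row (pi(.|s1)) matches Figure 1.5; the rest are illustrative.
policy_table = np.array([
    [0.0, 0.5, 0.5, 0.0, 0.0],  # pi(.|s1)
    [0.0, 1.0, 0.0, 0.0, 0.0],  # pi(.|s2)  (made-up row)
    [0.0, 0.0, 1.0, 0.0, 0.0],  # pi(.|s3)  (made-up row)
])

# Every row must be a valid probability distribution over the actions.
assert np.allclose(policy_table.sum(axis=1), 1.0)

rng = np.random.default_rng(1)

def sample_action(state_index: int) -> int:
    """Draw an action index according to the tabular policy at the given state."""
    return int(rng.choice(policy_table.shape[1], p=policy_table[state_index]))

print(sample_action(0))  # 1 or 2, i.e., a2 or a3, each with probability 0.5
```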