
5.4 MC $\epsilon$-Greedy: Learning without exploring starts

We next extend the MC Exploring Starts algorithm by removing the exploring starts condition. This condition is required so that every state-action pair can be visited sufficiently many times; the same goal can also be achieved by using soft policies.

5.4.1 $\epsilon$-greedy policies

A policy is soft if it has a positive probability of taking any action at any state. Consider an extreme case in which we only have a single episode. With a soft policy, a single episode that is sufficiently long can visit every state-action pair many times (see the examples in Figure 5.8). Thus, we do not need to generate a large number of episodes starting from different state-action pairs, and hence the exploring starts requirement can be removed.

One common type of soft policy is the $\epsilon$-greedy policy. An $\epsilon$-greedy policy is a stochastic policy that has a higher chance of choosing the greedy action and the same nonzero probability of taking any other action. Here, the greedy action refers to the action with the greatest action value. In particular, suppose that $\epsilon \in [0,1]$. The corresponding $\epsilon$-greedy policy has the following form:

\pi(a \mid s) = \begin{cases} 1 - \frac{\epsilon}{|\mathcal{A}(s)|}\left(|\mathcal{A}(s)| - 1\right), & \text{for the greedy action}, \\[6pt] \frac{\epsilon}{|\mathcal{A}(s)|}, & \text{for the other } |\mathcal{A}(s)| - 1 \text{ actions}, \end{cases}

where $|\mathcal{A}(s)|$ denotes the number of actions associated with state $s$.

When $\epsilon = 0$, the $\epsilon$-greedy policy becomes greedy. When $\epsilon = 1$, the probability of taking any action equals $\frac{1}{|\mathcal{A}(s)|}$.
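As a quick numerical check of the form above, take $|\mathcal{A}(s)| = 5$ and $\epsilon = 0.5$ (the values used in the example of Section 5.4.3). Then

1 - \frac{0.5}{5}(5 - 1) = 0.6, \qquad \frac{0.5}{5} = 0.1,

so the greedy action is selected with probability 0.6 and each of the other four actions with probability 0.1.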

The probability of taking the greedy action is always no less than that of taking any other action, since

1 - \frac{\epsilon}{|\mathcal{A}(s)|}\left(|\mathcal{A}(s)| - 1\right) = 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|} \geq \frac{\epsilon}{|\mathcal{A}(s)|}

for any $\epsilon \in [0,1]$, with equality only when $\epsilon = 1$.

Since an $\epsilon$-greedy policy is stochastic, how can we select an action by following such a policy? We can first generate a random number $x$ in $[0,1]$ from a uniform distribution. If $x \geq \epsilon$, then we select the greedy action. If $x < \epsilon$, then we select an action in $\mathcal{A}(s)$ uniformly at random, each with probability $\frac{1}{|\mathcal{A}(s)|}$ (the greedy action may be selected again). In this way, the total probability of selecting the greedy action is $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}$, and the probability of selecting any other action is $\frac{\epsilon}{|\mathcal{A}(s)|}$.
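The sampling procedure described above can be written in a few lines of Python. This is a minimal sketch: the function name and the NumPy-based interface are illustrative choices, not part of the text.

```python
import numpy as np

def sample_epsilon_greedy(q_values, epsilon, rng):
    """Sample one action from the epsilon-greedy policy induced by q_values."""
    if rng.uniform() >= epsilon:
        # With probability 1 - epsilon, take the greedy action directly.
        return int(np.argmax(q_values))
    # With probability epsilon, choose uniformly among all |A(s)| actions;
    # the greedy action may be chosen again, so its total probability
    # is 1 - epsilon + epsilon / |A(s)|.
    return int(rng.integers(len(q_values)))

rng = np.random.default_rng(0)
action = sample_epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.5, rng=rng)
```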

5.4.2 Algorithm description

To integrate $\epsilon$-greedy policies into MC learning, we only need to change the policy improvement step from greedy to $\epsilon$-greedy.

In particular, the policy improvement step in MC Basic or MC Exploring Starts aims to solve

\pi_{k+1}(s) = \arg\max_{\pi \in \Pi} \sum_{a} \pi(a \mid s)\, q_{\pi_k}(s, a), \tag{5.4}

where $\Pi$ denotes the set of all possible policies. We know that the solution of (5.4) is a greedy policy:

\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a_k^*, \\ 0, & a \neq a_k^*, \end{cases}

where $a_k^* = \arg\max_a q_{\pi_k}(s, a)$.

Now, the policy improvement step is changed to solve

\pi_{k+1}(s) = \arg\max_{\pi \in \Pi_\epsilon} \sum_{a} \pi(a \mid s)\, q_{\pi_k}(s, a), \tag{5.5}

where $\Pi_\epsilon$ denotes the set of all $\epsilon$-greedy policies with a given value of $\epsilon$. In this way, we force the policy to be $\epsilon$-greedy. The solution of (5.5) is

\pi_{k+1}(a \mid s) = \begin{cases} 1 - \frac{|\mathcal{A}(s)| - 1}{|\mathcal{A}(s)|}\epsilon, & a = a_k^*, \\[6pt] \frac{1}{|\mathcal{A}(s)|}\epsilon, & a \neq a_k^*, \end{cases}

where $a_k^* = \arg\max_a q_{\pi_k}(s, a)$. That is, within $\Pi_\epsilon$, the objective in (5.5) is maximized by assigning the greatest allowed probability to the action with the greatest action value. With the above change, we obtain another algorithm called MC $\epsilon$-Greedy. The details of this algorithm are given in Algorithm 5.3. Here, the every-visit strategy is employed to better utilize the samples.

Algorithm 5.3: MC $\epsilon$-Greedy (a variant of MC Exploring Starts)

Initialization: Initial policy $\pi_0(a|s)$ and initial value $q(s,a)$ for all $(s,a)$. $\mathrm{Returns}(s,a) = 0$ and $\mathrm{Num}(s,a) = 0$ for all $(s,a)$. $\epsilon \in (0,1]$.
Goal: Search for an optimal policy.

For each episode, do
    Episode generation: Select a starting state-action pair $(s_0, a_0)$ (the exploring starts condition is not required). Following the current policy, generate an episode of length $T$: $s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T$.
    Initialization for each episode: $g \leftarrow 0$
    For each step of the episode, $t = T-1, T-2, \ldots, 0$, do
        $g \leftarrow \gamma g + r_{t+1}$
        $\mathrm{Returns}(s_t, a_t) \leftarrow \mathrm{Returns}(s_t, a_t) + g$
        $\mathrm{Num}(s_t, a_t) \leftarrow \mathrm{Num}(s_t, a_t) + 1$
        Policy evaluation:
            $q(s_t, a_t) \leftarrow \mathrm{Returns}(s_t, a_t) / \mathrm{Num}(s_t, a_t)$
        Policy improvement:
            Let $a^* = \arg\max_a q(s_t, a)$ and
            $\pi(a|s_t) = \begin{cases} 1 - \frac{|\mathcal{A}(s_t)| - 1}{|\mathcal{A}(s_t)|}\epsilon, & a = a^*, \\ \frac{1}{|\mathcal{A}(s_t)|}\epsilon, & a \neq a^*. \end{cases}$
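For concreteness, the following is a minimal Python sketch of Algorithm 5.3. It assumes a generic environment interface `step(s, a) -> (reward, next_state)` with integer-indexed states and actions; this interface, the function names, and the episode parameters are illustrative assumptions rather than part of the algorithm statement.

```python
import numpy as np

def mc_epsilon_greedy(step, num_states, num_actions, *, num_episodes=100,
                      episode_len=100_000, gamma=0.9, epsilon=0.5, seed=0):
    """A sketch of MC epsilon-Greedy (Algorithm 5.3) with every-visit updates.

    `step(s, a)` is assumed to return (reward, next_state) for integer s, a.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_num = np.zeros((num_states, num_actions))
    # Start from a uniform (hence soft) policy so that a long episode
    # can visit every state-action pair.
    policy = np.full((num_states, num_actions), 1.0 / num_actions)

    for _ in range(num_episodes):
        # Episode generation: any starting pair; exploring starts not required.
        s = int(rng.integers(num_states))
        a = int(rng.integers(num_actions))
        episode = []                          # stores (s_t, a_t, r_{t+1})
        for _ in range(episode_len):
            r, s_next = step(s, a)
            episode.append((s, a, r))
            s = s_next
            a = int(rng.choice(num_actions, p=policy[s]))

        g = 0.0
        for s_t, a_t, r_next in reversed(episode):
            g = gamma * g + r_next
            # Policy evaluation: every-visit average of observed returns.
            returns_sum[s_t, a_t] += g
            returns_num[s_t, a_t] += 1
            q[s_t, a_t] = returns_sum[s_t, a_t] / returns_num[s_t, a_t]
            # Policy improvement: epsilon-greedy with respect to the current q.
            a_star = int(np.argmax(q[s_t]))
            policy[s_t, :] = epsilon / num_actions
            policy[s_t, a_star] = 1 - (num_actions - 1) / num_actions * epsilon

    return policy, q
```

Note that, as in the text, only one episode is generated per iteration; coverage of the state-action pairs comes from the episode length rather than from many different starting points.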

If greedy policies are replaced by $\epsilon$-greedy policies in the policy improvement step, can we still guarantee that optimal policies are obtained? The answer is both yes and no. By yes, we mean that, given sufficient samples, the algorithm converges to an $\epsilon$-greedy policy that is optimal in the set $\Pi_\epsilon$. By no, we mean that this policy is merely optimal in $\Pi_\epsilon$ and may not be optimal in $\Pi$. However, if $\epsilon$ is sufficiently small, the optimal policies in $\Pi_\epsilon$ are close to those in $\Pi$.

5.4.3 Illustrative examples

Consider the grid world example shown in Figure 5.5. The aim is to find the optimal policy for every state. In every iteration of the MC $\epsilon$-Greedy algorithm, a single episode with one million steps is generated; here, we deliberately consider the extreme case in which each iteration uses merely a single episode. We set $r_{\text{boundary}} = r_{\text{forbidden}} = -1$, $r_{\text{target}} = 1$, and $\gamma = 0.9$.

The initial policy is a uniform policy that has the same probability 0.2 of taking any action, as shown in Figure 5.5(a). The optimal $\epsilon$-greedy policy with $\epsilon = 0.5$ can be obtained after two iterations. Although each iteration merely uses a single episode, the policy gradually improves because all the state-action pairs can be visited and hence their values can be accurately estimated.


Figure 5.5: The evolution process of the MC $\epsilon$-Greedy algorithm based on single episodes. (a) Initial policy. (b) After the first iteration. (c) After the second iteration.