
6.2 Robbins-Monro algorithm

Stochastic approximation refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems [24]. Compared with many other root-finding algorithms, such as gradient-based ones, stochastic approximation is powerful in the sense that it does not require an explicit expression of the objective function or its derivative.

The Robbins-Monro (RM) algorithm is a pioneering work in the field of stochastic approximation [24-27]. The famous stochastic gradient descent algorithm is a special form of the RM algorithm, as shown in Section 6.4. We next introduce the details of the RM algorithm.

Suppose that we would like to find the root of the equation

$$g(w) = 0,$$

where $w \in \mathbb{R}$ is the unknown variable and $g: \mathbb{R} \to \mathbb{R}$ is a function. Many problems can be formulated as root-finding problems. For example, if $J(w)$ is an objective function to be optimized, the optimization problem can be converted to solving $g(w) \doteq \nabla_w J(w) = 0$. In addition, an equation such as $g(w) = c$, where $c$ is a constant, can also be converted to the above form by treating $g(w) - c$ as a new function.

If the expression of $g$ or its derivative is known, there are many numerical algorithms that can be used. However, the problem we are facing is that the expression of the function $g$ is unknown. For example, the function may be represented by an artificial neural network whose structure and parameters are unknown. Moreover, we can only obtain a noisy observation of $g(w)$:

$$\tilde{g}(w, \eta) = g(w) + \eta,$$

where $\eta \in \mathbb{R}$ is the observation error, which may or may not be Gaussian. In summary, it is a black-box system where only the input $w$ and the noisy output $\tilde{g}(w, \eta)$ are known (see Figure 6.2). Our aim is to solve $g(w) = 0$ using $w$ and $\tilde{g}$.


Figure 6.2: An illustration of the problem of solving $g(w) = 0$ from $w$ and $\tilde{g}$.

The RM algorithm that can solve $g(w) = 0$ is

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, 3, \ldots \tag{6.5}$$

where $w_k$ is the $k$th estimate of the root, $\tilde{g}(w_k, \eta_k)$ is the $k$th noisy observation, and $a_k$ is a positive coefficient. As can be seen, the RM algorithm does not require any information about the function. It only requires the input and output.
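To make the update in (6.5) concrete, below is a minimal sketch of the RM iteration in Python. The black-box observation function `g_tilde`, the linear example $g(w) = w - 3$, and all parameter values are illustrative assumptions introduced only for this sketch; they are not part of the examples in this section.

```python
import numpy as np

def robbins_monro(g_tilde, w1, num_iters, a=lambda k: 1.0 / k):
    """Run the RM iteration w_{k+1} = w_k - a_k * g_tilde(w_k), as in (6.5)."""
    w = w1
    trajectory = [w]
    for k in range(1, num_iters + 1):
        # Only the noisy output of the black box is used; g itself is never needed.
        w = w - a(k) * g_tilde(w)
        trajectory.append(w)
    return trajectory

# Illustrative black box (an assumption for this sketch): g(w) = w - 3 with
# additive zero-mean noise, so the true root is w* = 3.
rng = np.random.default_rng(0)
g_tilde = lambda w: (w - 3.0) + rng.normal(0.0, 1.0)

traj = robbins_monro(g_tilde, w1=0.0, num_iters=1000)
print(traj[-1])  # close to the true root 3
```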


Figure 6.3: An illustrative example of the RM algorithm.

To illustrate the RM algorithm, consider an example in which $g(w) = w^3 - 5$. The true root is $5^{1/3} \approx 1.71$. Now, suppose that we can only observe the input $w$ and the output $\tilde{g}(w) = g(w) + \eta$, where $\eta$ is i.i.d. and obeys a standard normal distribution (zero mean and unit standard deviation). The initial guess is $w_1 = 0$, and the coefficient is $a_k = 1/k$. The evolution process of $w_k$ is shown in Figure 6.3. Even though the observation is corrupted by the noise $\eta_k$, the estimate $w_k$ can still converge to the true root. Note that the initial guess $w_1$ must be properly selected to ensure convergence for this specific function $g(w) = w^3 - 5$. In the following subsection, we present the conditions under which the RM algorithm converges for any initial guess.

6.2.1 Convergence properties

Why can the RM algorithm in (6.5) find the root of $g(w) = 0$? We next illustrate the idea with an example and then provide a rigorous convergence analysis.

Consider the example shown in Figure 6.4. In this example, $g(w) = \tanh(w - 1)$. The true root of $g(w) = 0$ is $w^* = 1$. We apply the RM algorithm with $w_1 = 3$ and $a_k = 1/k$. To better illustrate the reason for convergence, we simply set $\eta_k \equiv 0$, and consequently, $\tilde{g}(w_k, \eta_k) = g(w_k)$. The RM algorithm in this case is $w_{k+1} = w_k - a_k g(w_k)$. The resulting sequence $\{w_k\}$ generated by the RM algorithm is shown in Figure 6.4. It can be seen that $w_k$ converges to the true root $w^* = 1$.

This simple example illustrates why the RM algorithm converges.

\diamond When $w_k > w^*$, we have $g(w_k) > 0$. Then, $w_{k+1} = w_k - a_k g(w_k) < w_k$. If $a_k g(w_k)$ is sufficiently small, we have $w^* < w_{k+1} < w_k$. As a result, $w_{k+1}$ is closer to $w^*$ than $w_k$.
\diamond When $w_k < w^*$, we have $g(w_k) < 0$. Then, $w_{k+1} = w_k - a_k g(w_k) > w_k$. If $|a_k g(w_k)|$ is sufficiently small, we have $w^* > w_{k+1} > w_k$. As a result, $w_{k+1}$ is closer to $w^*$ than $w_k$.

In either case, $w_{k+1}$ is closer to $w^*$ than $w_k$. Therefore, it is intuitive that $w_k$ converges to $w^*$.
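The following is a minimal sketch of this noise-free setting ($g(w) = \tanh(w - 1)$, $w_1 = 3$, $a_k = 1/k$, $\eta_k \equiv 0$); the number of iterations and the printing interval are illustrative choices. Running it shows the iterates moving monotonically toward $w^* = 1$.

```python
import math

g = lambda w: math.tanh(w - 1.0)  # true root is w* = 1

w = 3.0                            # initial guess w_1 = 3
for k in range(1, 51):
    w = w - (1.0 / k) * g(w)       # noise-free RM step: w_{k+1} = w_k - a_k g(w_k)
    if k % 10 == 0:
        print(f"k = {k:3d}, w_k = {w:.4f}")
# The printed values decrease monotonically toward w* = 1.
```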


Figure 6.4: An example for illustrating the convergence of the RM algorithm.

The above example is simple since the observation error is assumed to be zero. It would be nontrivial to analyze the convergence in the presence of stochastic observation errors. A rigorous convergence result is given below.

Theorem 6.1 (Robbins-Monro theorem). In the Robbins-Monro algorithm in (6.5), if

(a) $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$;
(b) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
(c) $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$;

where $\mathcal{H}_k = \{w_k, w_{k-1}, \ldots\}$, then $w_k$ almost surely converges to the root $w^*$ satisfying $g(w^*) = 0$.

We postpone the proof of this theorem to Section 6.3.3. This theorem relies on the notion of almost sure convergence, which is introduced in Appendix B.

The three conditions in Theorem 6.1 are explained as follows.

In the first condition, $0 < c_1 \leq \nabla_w g(w)$ indicates that $g(w)$ is a monotonically increasing function. This condition ensures that the root of $g(w) = 0$ exists and is unique. If $g(w)$ is monotonically decreasing, we can simply treat $-g(w)$ as a new function that is monotonically increasing.

As an application, we can formulate an optimization problem whose objective function is $J(w)$ as a root-finding problem: $g(w) \doteq \nabla_w J(w) = 0$. In this case, the condition that $g(w)$ is monotonically increasing indicates that $J(w)$ is convex, which is a commonly adopted assumption in optimization problems.

The inequality $\nabla_w g(w) \leq c_2$ indicates that the gradient of $g(w)$ is bounded from above. For example, $g(w) = \tanh(w - 1)$ satisfies this condition, but $g(w) = w^3 - 5$ does not.

The second condition, regarding $\{a_k\}$, is important and frequently encountered in reinforcement learning algorithms. In particular, the condition $\sum_{k=1}^{\infty} a_k^2 < \infty$ means that $\lim_{n \to \infty} \sum_{k=1}^{n} a_k^2$ is bounded from above; it requires that $a_k$ converge to zero as $k \to \infty$. The condition $\sum_{k=1}^{\infty} a_k = \infty$ means that $\lim_{n \to \infty} \sum_{k=1}^{n} a_k$ is infinitely large; it requires that $a_k$ not converge to zero too fast. These conditions have interesting properties, which will be analyzed in detail shortly.

The third condition is mild. It does not require the observation error $\eta_k$ to be Gaussian. An important special case is that $\{\eta_k\}$ is an i.i.d. stochastic sequence satisfying $\mathbb{E}[\eta_k] = 0$ and $\mathbb{E}[\eta_k^2] < \infty$. In this case, the third condition is valid because $\eta_k$ is independent of $\mathcal{H}_k$ and hence we have $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = \mathbb{E}[\eta_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] = \mathbb{E}[\eta_k^2]$.

We next examine the second condition about the coefficients $\{a_k\}$ more closely.

\diamond Why is the second condition important for the convergence of the RM algorithm?

This question will be answered naturally when we present a rigorous proof of the above theorem later. Here, we provide some intuition.

First, $\sum_{k=1}^{\infty} a_k^2 < \infty$ indicates that $a_k \to 0$ as $k \to \infty$. Why is this condition important? Suppose that the observation $\tilde{g}(w_k, \eta_k)$ is always bounded. Since

$$w_{k+1} - w_k = -a_k \tilde{g}(w_k, \eta_k),$$

if $a_k \to 0$, then $a_k \tilde{g}(w_k, \eta_k) \to 0$ and hence $w_{k+1} - w_k \to 0$, indicating that $w_{k+1}$ and $w_k$ approach each other as $k \to \infty$. Otherwise, if $a_k$ does not converge to zero, then $w_k$ may keep fluctuating as $k \to \infty$.

Second, $\sum_{k=1}^{\infty} a_k = \infty$ indicates that $a_k$ should not converge to zero too fast. Why is this condition important? Summing up both sides of the equations $w_2 - w_1 = -a_1 \tilde{g}(w_1, \eta_1)$, $w_3 - w_2 = -a_2 \tilde{g}(w_2, \eta_2)$, $w_4 - w_3 = -a_3 \tilde{g}(w_3, \eta_3)$, $\ldots$ gives

$$w_1 - w_\infty = \sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k).$$

If $\sum_{k=1}^{\infty} a_k < \infty$, then, since the observations are bounded, $\left| \sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k) \right|$ is also bounded. Let $b$ denote a finite upper bound such that

$$\left| w_1 - w_\infty \right| = \left| \sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k) \right| \leq b. \tag{6.6}$$

If the initial guess $w_1$ is selected far away from $w^*$ so that $|w_1 - w^*| > b$, then it is impossible to have $w_\infty = w^*$ according to (6.6). This suggests that the RM algorithm cannot find the true solution $w^*$ in this case. Therefore, the condition $\sum_{k=1}^{\infty} a_k = \infty$ is necessary to ensure convergence given an arbitrary initial guess.
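A minimal sketch of this effect is given below. It reuses the noise-free example $g(w) = \tanh(w - 1)$ from Figure 6.4 with $w_1 = 3$, but compares $a_k = 1/k$ with the summable step sizes $a_k = 1/k^2$; the specific choice $1/k^2$ is an illustrative assumption. With summable step sizes the iterates stall well before reaching $w^* = 1$, whereas $a_k = 1/k$ keeps making progress.

```python
import math

g = lambda w: math.tanh(w - 1.0)   # true root is w* = 1

def run(step_size, w1=3.0, num_iters=10000):
    """Noise-free RM iteration with the step-size rule a_k = step_size(k)."""
    w = w1
    for k in range(1, num_iters + 1):
        w = w - step_size(k) * g(w)
    return w

print(run(lambda k: 1.0 / k))       # approaches w* = 1
print(run(lambda k: 1.0 / k ** 2))  # stalls far from w* because sum(a_k) < infinity
```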

\diamond What kinds of sequences satisfy $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$?

One typical sequence is

$$a_k = \frac{1}{k}.$$

On the one hand, it holds that

$$\lim_{n \to \infty} \left( \sum_{k=1}^{n} \frac{1}{k} - \ln n \right) = \kappa,$$

where $\kappa \approx 0.577$ is called the Euler-Mascheroni constant (or Euler's constant) [28]. Since $\ln n \to \infty$ as $n \to \infty$, we have

$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty.$$

In fact, $H_n = \sum_{k=1}^{n} \frac{1}{k}$ is called the harmonic number in number theory [29]. On the other hand, it holds that

$$\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty.$$

Finding the value of $\sum_{k=1}^{\infty} \frac{1}{k^2}$ is known as the Basel problem [30].

In summary, the sequence $\{a_k = 1/k\}$ satisfies the second condition in Theorem 6.1. Notably, slight modifications such as $a_k = 1/(k+1)$ or $a_k = c_k/k$, where $c_k$ is bounded from above and bounded away from zero, also preserve this condition.
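As a quick numerical illustration (the cutoffs chosen below are arbitrary), the partial sums of $1/k$ keep growing like $\ln n + \kappa$, while the partial sums of $1/k^2$ approach $\pi^2/6 \approx 1.645$:

```python
import math

for n in (10, 1000, 100000):
    harmonic = sum(1.0 / k for k in range(1, n + 1))        # partial sum of 1/k
    squares = sum(1.0 / k ** 2 for k in range(1, n + 1))    # partial sum of 1/k^2
    print(f"n = {n:6d}: sum 1/k = {harmonic:.4f} "
          f"(ln n + 0.577 = {math.log(n) + 0.577:.4f}), "
          f"sum 1/k^2 = {squares:.4f} (pi^2/6 = {math.pi ** 2 / 6:.4f})")
```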

In the RM algorithm, $a_k$ is often selected as a sufficiently small constant in many applications. Although the second condition is no longer satisfied in this case because $\sum_{k=1}^{\infty} a_k^2 = \infty$ rather than $\sum_{k=1}^{\infty} a_k^2 < \infty$, the algorithm can still converge in a certain sense [24, Section 1.5]. In addition, $g(w) = w^3 - 5$ in the example shown in Figure 6.3 does not satisfy the first condition, but the RM algorithm can still find the root if the initial guess is adequately (not arbitrarily) selected.
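The following sketch, again using the illustrative $g(w) = \tanh(w - 1)$ with additive standard normal noise and an assumed constant step size $a_k = 0.1$, hints at this behavior: the iterates do not settle on a single value, but they hover in a small neighborhood of $w^* = 1$, and their long-run average is close to the root.

```python
import numpy as np

rng = np.random.default_rng(1)
g_tilde = lambda w: np.tanh(w - 1.0) + rng.normal(0.0, 1.0)  # noisy observation

a = 0.1          # constant step size (assumed for illustration)
w = 3.0          # initial guess
tail = []
for k in range(1, 5001):
    w = w - a * g_tilde(w)
    if k > 1000:             # discard an initial transient
        tail.append(w)

print(np.mean(tail), np.std(tail))  # mean near 1, with nonvanishing fluctuations
```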

6.2.2 Application to mean estimation

We next apply the Robbins-Monro theorem to analyze the mean estimation problem, which has been discussed in Section 6.1. Recall that

$$w_{k+1} = w_k + \alpha_k (x_k - w_k)$$

is the mean estimation algorithm in (6.4). When $\alpha_k = 1/k$, we can obtain the analytical expression $w_{k+1} = \frac{1}{k} \sum_{i=1}^{k} x_i$. However, we would not be able to obtain an analytical expression for general values of $\alpha_k$, and in this case the convergence analysis is nontrivial. We next show that the algorithm is a special RM algorithm, and hence its convergence naturally follows.

In particular, define a function as

$$g(w) \doteq w - \mathbb{E}[X].$$

The original problem is to obtain the value of $\mathbb{E}[X]$. This problem is formulated as a root-finding problem to solve $g(w) = 0$. Given a value of $w$, the noisy observation that we can obtain is $\tilde{g} \doteq w - x$, where $x$ is a sample of $X$. Note that $\tilde{g}$ can be written as

$$\begin{aligned} \tilde{g}(w, \eta) &= w - x \\ &= w - x + \mathbb{E}[X] - \mathbb{E}[X] \\ &= (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta, \end{aligned}$$

where $\eta \doteq \mathbb{E}[X] - x$.

The RM algorithm for solving this problem is

$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k (w_k - x_k),$$

which is exactly the algorithm in (6.4). As a result, it is guaranteed by Theorem 6.1 that $w_k$ converges to $\mathbb{E}[X]$ almost surely if $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\{x_k\}$ is i.i.d. It is worth mentioning that the convergence property does not rely on any assumption regarding the distribution of $X$.
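A minimal sketch of this special case is given below; the exponential sampling distribution and the two step-size choices are illustrative assumptions, chosen to emphasize that no Gaussian assumption on $X$ is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.exponential(scale=2.0, size=100000)  # E[X] = 2, clearly non-Gaussian

def mean_estimation(xs, alpha):
    """Incremental mean estimation w_{k+1} = w_k + alpha_k (x_k - w_k), as in (6.4)."""
    w = 0.0
    for k, x in enumerate(xs, start=1):
        w = w + alpha(k) * (x - w)
    return w

print(mean_estimation(samples, alpha=lambda k: 1.0 / k))        # equals the sample average
print(mean_estimation(samples, alpha=lambda k: 1.0 / (k + 1)))  # also converges to E[X] = 2
print(np.mean(samples))                                         # reference sample average
```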