
6.4 Stochastic gradient descent

This section introduces stochastic gradient descent (SGD) algorithms, which are widely used in the field of machine learning. We will see that SGD is a special RM algorithm, and the mean estimation algorithm is a special SGD algorithm.

Consider the following optimization problem:

$$\min_{w} J(w) = \mathbb{E}[f(w, X)], \tag{6.10}$$

where $w$ is the parameter to be optimized and $X$ is a random variable. The expectation is calculated with respect to $X$. Here, $w$ and $X$ can be either scalars or vectors. The function $f(\cdot)$ is scalar-valued.

A straightforward method for solving (6.10) is gradient descent. In particular, the gradient of $\mathbb{E}[f(w,X)]$ is $\nabla_w \mathbb{E}[f(w,X)] = \mathbb{E}[\nabla_w f(w,X)]$. Then, the gradient descent algorithm is

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]. \tag{6.11}$$

This gradient descent algorithm can find the optimal solution $w^*$ under some mild conditions such as the convexity of $f$. Preliminaries about gradient descent algorithms can be found in Appendix D.

The gradient descent algorithm requires the expected value $\mathbb{E}[\nabla_w f(w_k, X)]$. One way to obtain the expected value is based on the probability distribution of $X$. The distribution is, however, often unknown in practice. Another way is to collect a large number of i.i.d. samples $\{x_i\}_{i=1}^n$ of $X$ so that the expected value can be approximated as

$$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$

Then, (6.11) becomes

$$w_{k+1} = w_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i). \tag{6.12}$$

One problem of the algorithm in (6.12) is that it requires all the samples in each iteration. In practice, if the samples are collected one by one, then it is favorable to update $w$ every time a sample is collected. To that end, we can use the following algorithm:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k), \tag{6.13}$$

where $x_k$ is the sample collected at time step $k$. This is the well-known stochastic gradient descent algorithm. This algorithm is called "stochastic" because it relies on stochastic samples $\{x_k\}$.
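The update (6.13) can be sketched in a few lines of code. This is a minimal illustration, not a definitive implementation: the function names `sgd`, `grad_f`, and `sample` and the Gaussian example are assumptions for the sketch, not from the text.

```python
import random

def sgd(grad_f, sample, w0, alpha, num_steps):
    """Minimal SGD sketch for min_w E[f(w, X)], following (6.13).

    grad_f(w, x): stochastic gradient grad_w f(w, x)
    sample():     draws one i.i.d. sample x_k of X
    alpha(k):     step size alpha_k (e.g. 1/k)
    """
    w = w0
    for k in range(1, num_steps + 1):
        x = sample()                        # x_k
        w = w - alpha(k) * grad_f(w, x)     # w_{k+1} = w_k - alpha_k grad_w f(w_k, x_k)
    return w

# Illustrative case: f(w, x) = (w - x)^2 / 2, so grad_w f = w - x and w* = E[X]
random.seed(0)
w_hat = sgd(grad_f=lambda w, x: w - x,
            sample=lambda: random.gauss(5.0, 1.0),
            w0=0.0, alpha=lambda k: 1.0 / k, num_steps=10_000)
print(round(w_hat, 1))  # close to E[X] = 5
```

With $\alpha_k = 1/k$ and this particular $f$, the iterate is exactly the running sample mean, so the estimate settles near $\mathbb{E}[X]$.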

Compared to the gradient descent algorithm in (6.11), SGD replaces the true gradient $\mathbb{E}[\nabla_w f(w, X)]$ with the stochastic gradient $\nabla_w f(w_k, x_k)$. Since $\nabla_w f(w_k, x_k) \neq \mathbb{E}[\nabla_w f(w, X)]$, can such a replacement still ensure $w_k \to w^*$ as $k \to \infty$? The answer is yes. We next present an intuitive explanation and postpone the rigorous proof of the convergence to Section 6.4.5.

In particular, since

$$\begin{aligned} \nabla_w f(w_k, x_k) &= \mathbb{E}[\nabla_w f(w_k, X)] + \big(\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]\big) \\ &\doteq \mathbb{E}[\nabla_w f(w_k, X)] + \eta_k, \end{aligned}$$

the SGD algorithm in (6.13) can be rewritten as

$$w_{k+1} = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] - \alpha_k \eta_k.$$

Therefore, the SGD algorithm is the same as the regular gradient descent algorithm except that it has a perturbation term $\alpha_k \eta_k$. Since $\{x_k\}$ is i.i.d., we have $\mathbb{E}_{x_k}[\nabla_w f(w_k, x_k)] = \mathbb{E}_X[\nabla_w f(w_k, X)]$. As a result,

$$\mathbb{E}[\eta_k] = \mathbb{E}\big[\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]\big] = \mathbb{E}_{x_k}[\nabla_w f(w_k, x_k)] - \mathbb{E}_X[\nabla_w f(w_k, X)] = 0.$$

Therefore, the perturbation term $\eta_k$ has a zero mean, which intuitively suggests that it may not jeopardize the convergence property. A rigorous proof of the convergence of SGD is given in Section 6.4.5.

6.4.1 Application to mean estimation

We next apply SGD to analyze the mean estimation problem and show that the mean estimation algorithm in (6.4) is a special SGD algorithm. To that end, we formulate the mean estimation problem as an optimization problem:

$$\min_{w} J(w) = \mathbb{E}\left[\frac{1}{2}\|w - X\|^2\right] \doteq \mathbb{E}[f(w, X)], \tag{6.14}$$

where $f(w, X) = \|w - X\|^2/2$ and the gradient is $\nabla_w f(w, X) = w - X$. It can be verified that the optimal solution is $w^* = \mathbb{E}[X]$ by solving $\nabla_w J(w) = 0$. Therefore, this optimization problem is equivalent to the mean estimation problem.
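Setting the gradient of (6.14) to zero makes the verification explicit:

```latex
\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = \mathbb{E}[w - X] = w - \mathbb{E}[X] = 0
\quad \Longrightarrow \quad w^* = \mathbb{E}[X].
```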

The gradient descent algorithm for solving (6.14) is

$$\begin{aligned} w_{k+1} &= w_k - \alpha_k \nabla_w J(w_k) \\ &= w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] \\ &= w_k - \alpha_k \mathbb{E}[w_k - X]. \end{aligned}$$

This gradient descent algorithm is not applicable since $\mathbb{E}[w_k - X]$ or $\mathbb{E}[X]$ on the right-hand side is unknown (in fact, it is exactly what we need to solve).

The SGD algorithm for solving (6.14) is

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) = w_k - \alpha_k (w_k - x_k),$$

where $x_k$ is a sample obtained at time step $k$. Notably, this SGD algorithm is the same as the iterative mean estimation algorithm in (6.4). Therefore, (6.4) is an SGD algorithm designed specifically for solving the mean estimation problem.
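The equivalence can be checked numerically. Assuming, as the discussion suggests, that (6.4) is the incremental mean update obtained with $\alpha_k = 1/k$, the SGD iterate reproduces the sample mean exactly (the sample values below are illustrative):

```python
# With alpha_k = 1/k, the SGD update w_{k+1} = w_k - alpha_k (w_k - x_k)
# reproduces the incremental sample-mean estimate exactly.
samples = [4.0, 6.0, 5.0, 7.0, 3.0]     # illustrative data

w = 0.0                                  # initial guess (alpha_1 = 1 overwrites it)
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)        # SGD step for f(w, x) = (w - x)^2 / 2

print(w)                                 # 5.0
print(sum(samples) / len(samples))       # 5.0, the sample mean
```

Because $\alpha_1 = 1$, the first step discards the initial guess, so the iterate equals the average of the samples seen so far at every step.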

6.4.2 Convergence pattern of SGD

The idea of the SGD algorithm is to replace the true gradient with a stochastic gradient. However, since the stochastic gradient is random, one may ask whether the convergence of SGD is slow or erratic. Fortunately, SGD converges efficiently in general. An interesting convergence pattern is that it behaves similarly to the regular gradient descent algorithm when the estimate $w_k$ is far from the optimal solution $w^*$. Only when $w_k$ is close to $w^*$ does the convergence of SGD exhibit more randomness.

An analysis of this pattern and an illustrative example are given below.

Analysis: The relative error between the stochastic and true gradients is

$$\delta_k \doteq \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)]|}.$$

For the sake of simplicity, we consider the case where $w$ and $\nabla_w f(w, x)$ are both scalars. Since $w^*$ is the optimal solution, it holds that $\mathbb{E}[\nabla_w f(w^*, X)] = 0$. Then, the relative error can be rewritten as

$$\delta_k = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)] - \mathbb{E}[\nabla_w f(w^*, X)]|} = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]|}, \tag{6.15}$$

where the last equality is due to the mean value theorem [7, 8] and $\tilde{w}_k \in [w_k, w^*]$. Suppose that $f$ is strictly convex such that $\nabla_w^2 f \geq c > 0$ for all $w, X$. Then, the denominator in (6.15) becomes

$$\left|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]\right| = \left|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)]\right| |w_k - w^*| \geq c|w_k - w^*|.$$

Substituting the above inequality into (6.15) yields

$$\delta_k \leq \frac{\left|\overbrace{\nabla_w f(w_k, x_k)}^{\text{stochastic gradient}} - \overbrace{\mathbb{E}[\nabla_w f(w_k, X)]}^{\text{true gradient}}\right|}{\underbrace{c|w_k - w^*|}_{\text{distance to the optimal solution}}}.$$

The above inequality suggests an interesting convergence pattern of SGD: the relative error $\delta_k$ is inversely proportional to $|w_k - w^*|$. As a result, when $|w_k - w^*|$ is large, $\delta_k$ is small. In this case, the SGD algorithm behaves like the gradient descent algorithm and hence $w_k$ quickly converges to $w^*$. When $w_k$ is close to $w^*$, the relative error $\delta_k$ may be large, and the convergence exhibits more randomness.

Example: A good example for demonstrating the above analysis is the mean estimation problem. Consider the mean estimation problem in (6.14). When $w$ and $X$ are both scalars, we have $f(w, X) = |w - X|^2/2$ and hence

$$\nabla_w f(w, x_k) = w - x_k,$$
$$\mathbb{E}[\nabla_w f(w, X)] = w - \mathbb{E}[X] = w - w^*.$$

Thus, the relative error is

$$\delta_k = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)]|} = \frac{|(w_k - x_k) - (w_k - \mathbb{E}[X])|}{|w_k - w^*|} = \frac{|\mathbb{E}[X] - x_k|}{|w_k - w^*|}.$$

The expression of the relative error clearly shows that $\delta_k$ is inversely proportional to $|w_k - w^*|$. As a result, when $w_k$ is far from $w^*$, the relative error is small, and SGD behaves like gradient descent. In addition, since $\delta_k$ is proportional to $|\mathbb{E}[X] - x_k|$, the mean of $\delta_k$ is proportional to the variance of $X$.

Figure 6.5: An example for demonstrating the stochastic and mini-batch gradient descent algorithms. The distribution of $X \in \mathbb{R}^2$ is uniform in the square area centered at the origin with a side length of 20. The mean is $\mathbb{E}[X] = 0$. The mean estimation is based on 100 i.i.d. samples.

The simulation results are shown in Figure 6.5. Here, $X \in \mathbb{R}^2$ represents a random position in the plane. Its distribution is uniform in the square area centered at the origin and $\mathbb{E}[X] = 0$. The mean estimation is based on 100 i.i.d. samples. Although the initial guess of the mean is far away from the true value, it can be seen that the SGD estimate quickly approaches the neighborhood of the origin. When the estimate is close to the origin, the convergence process exhibits certain randomness.
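The same pattern can be observed in a scalar sketch: starting far from $w^* = \mathbb{E}[X] = 0$, the relative error $\delta_k = |\mathbb{E}[X] - x_k| / |w_k - w^*|$ is small at first and grows as $w_k$ approaches $w^*$. The initial guess and sample range below are illustrative choices, not from the text.

```python
import random

# Scalar SGD for mean estimation, tracking the relative error from (6.15).
random.seed(1)
w, w_star = 50.0, 0.0                              # initial guess far from w* = E[X] = 0
for k in range(1, 11):
    x_k = random.uniform(-10.0, 10.0)              # samples with E[X] = 0
    delta_k = abs(w_star - x_k) / abs(w - w_star)  # relative error delta_k
    print(f"k={k:2d}  |w_k - w*| = {abs(w - w_star):7.3f}  delta_k = {delta_k:6.3f}")
    w = w - (1.0 / k) * (w - x_k)                  # SGD step
```

Since $|x_k| \leq 10$ and $|w_1 - w^*| = 50$, the first relative error is at most $0.2$; as $|w_k - w^*|$ shrinks, $\delta_k$ can become much larger.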

6.4.3 A deterministic formulation of SGD

The formulation of SGD in (6.13) involves random variables. One may often encounter a deterministic formulation of SGD without involving any random variables.

In particular, consider a set of real numbers $\{x_i\}_{i=1}^n$, where $x_i$ does not have to be a sample of any random variable. The optimization problem to be solved is to minimize the average:

$$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^{n} f(w, x_i),$$

where $f(w, x_i)$ is a parameterized function and $w$ is the parameter to be optimized. The gradient descent algorithm for solving this problem is

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \frac{\alpha_k}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$

Suppose that the set $\{x_i\}_{i=1}^n$ is large and we can only fetch a single number each time. In this case, it is favorable to update $w_k$ in an incremental manner:

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \tag{6.16}$$

It must be noted that $x_k$ here is the number fetched at time step $k$ instead of the $k$th element in the set $\{x_i\}_{i=1}^n$.

The algorithm in (6.16) is very similar to SGD, but its problem formulation is subtly different because it does not involve any random variables or expected values. Then, many questions arise. For example, is this algorithm SGD? How should we use the finite set of numbers $\{x_i\}_{i=1}^n$? Should we sort these numbers in a certain order and then use them one by one, or should we randomly sample a number from the set?

A quick answer to the above questions is that, although no random variables are involved in the above formulation, we can convert the deterministic formulation to a stochastic one by introducing a random variable. In particular, let $X$ be a random variable defined on the set $\{x_i\}_{i=1}^n$. Suppose that its probability distribution is uniform such that $p(X = x_i) = 1/n$. Then, the deterministic optimization problem becomes a stochastic one:

$$\min_{w} J(w) = \frac{1}{n} \sum_{i=1}^{n} f(w, x_i) = \mathbb{E}[f(w, X)].$$

The last equality in the above equation is exact rather than approximate. Therefore, the algorithm in (6.16) is SGD, and the estimate converges if $x_k$ is uniformly and independently sampled from $\{x_i\}_{i=1}^n$. Note that $x_k$ may repeatedly take the same value in $\{x_i\}_{i=1}^n$ since it is sampled with replacement.
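This uniform-sampling recipe can be sketched directly: draw $x_k$ with replacement from the finite set and apply (6.16). The data values here are hypothetical, chosen only so that the target mean is easy to check.

```python
import random

# SGD on the deterministic average (1/n) sum_i f(w, x_i) by sampling x_k
# uniformly with replacement from the finite set {x_i}.
random.seed(2)
data = [2.0, 4.0, 6.0, 8.0]            # the fixed set {x_i}; its mean is 5.0
w = 0.0
for k in range(1, 20_001):
    x_k = random.choice(data)          # uniform sampling -> E[f(w, X)] formulation
    w = w - (1.0 / k) * (w - x_k)      # SGD step for f(w, x) = (w - x)^2 / 2

print(round(w, 1))  # close to 5.0 (elements may repeat; that is expected)
```

Sorting the set and sweeping it in a fixed order would break the i.i.d. assumption; independent uniform draws are what make the convergence argument apply.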

6.4.4 BGD, SGD, and mini-batch GD

While SGD uses a single sample in every iteration, we next introduce mini-batch gradient descent (MBGD), which uses a few more samples in every iteration. When all samples are used in every iteration, the algorithm is called batch gradient descent (BGD).

In particular, suppose that we would like to find the optimal solution that minimizes $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The BGD, SGD, and MBGD algorithms for solving this problem are, respectively,

$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} \nabla_w f(w_k, x_i), \quad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j), \quad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \quad \text{(SGD)}$$

In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $(1/n)\sum_{i=1}^{n} \nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$. In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1, \ldots, n\}$ obtained at time $k$, with $|\mathcal{I}_k| = m$. The samples in $\mathcal{I}_k$ are also assumed to be i.i.d. In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time $k$.

MBGD can be viewed as an intermediate version between SGD and BGD. Compared to SGD, MBGD has less randomness because it uses more samples instead of just one as in SGD. Compared to BGD, MBGD does not require using all the samples in every iteration, making it more flexible. If $m = 1$, then MBGD becomes SGD. However, if $m = n$, MBGD does not necessarily become BGD. This is because MBGD uses $n$ randomly fetched samples, whereas BGD uses all $n$ numbers. The $n$ randomly fetched samples may contain the same number multiple times and hence may not cover all $n$ numbers in $\{x_i\}_{i=1}^n$.

The convergence speed of MBGD is faster than that of SGD in general. This is because SGD uses $\nabla_w f(w_k, x_k)$ to approximate the true gradient, whereas MBGD uses $(1/m)\sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j)$, which is closer to the true gradient because the randomness is averaged out. The convergence of the MBGD algorithm can be proven similarly to the SGD case.

A good example for demonstrating the above analysis is the mean estimation problem. In particular, given some numbers $\{x_i\}_{i=1}^n$, our goal is to calculate the mean $\bar{x} = \sum_{i=1}^{n} x_i / n$. This problem can be equivalently stated as the following optimization problem:

$$\min_{w} J(w) = \frac{1}{2n} \sum_{i=1}^{n} \|w - x_i\|^2,$$

whose optimal solution is $w^* = \bar{x}$. The three algorithms for solving this problem are, respectively,

$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^{n} (w_k - x_i) = w_k - \alpha_k (w_k - \bar{x}), \quad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} (w_k - x_j) = w_k - \alpha_k \big(w_k - \bar{x}_k^{(m)}\big), \quad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k (w_k - x_k), \quad \text{(SGD)}$$

where $\bar{x}_k^{(m)} = \sum_{j \in \mathcal{I}_k} x_j / m$. Furthermore, if $\alpha_k = 1/k$, the above equations can be solved as follows:

$$w_{k+1} = \frac{1}{k} \sum_{j=1}^{k} \bar{x} = \bar{x}, \quad \text{(BGD)}$$
$$w_{k+1} = \frac{1}{k} \sum_{j=1}^{k} \bar{x}_j^{(m)}, \quad \text{(MBGD)}$$
$$w_{k+1} = \frac{1}{k} \sum_{j=1}^{k} x_j. \quad \text{(SGD)}$$

The derivation of the above equations is similar to that of (6.3) and is omitted here. It can be seen that the estimate given by BGD at each step is exactly the optimal solution $w^* = \bar{x}$. MBGD converges to the mean faster than SGD because $\bar{x}_k^{(m)}$ is already an average.

A simulation example is given in Figure 6.5 to demonstrate the convergence of MBGD. Let $\alpha_k = 1/k$. It is shown that all the MBGD algorithms with different mini-batch sizes converge to the mean. The case with $m = 50$ converges the fastest, while SGD with $m = 1$ is the slowest. This is consistent with the above analysis. Nevertheless, the convergence rate of SGD is still fast, especially when $w_k$ is far from $w^*$.
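The three update rules can be compared in a small simulation. This is a sketch under stated assumptions: the sample size, seed, and the helper `run` are illustrative, and the mini-batches are drawn with replacement as in the discussion above.

```python
import random

# Compare BGD, MBGD, and SGD for mean estimation with alpha_k = 1/k.
random.seed(3)
data = [random.uniform(-10.0, 10.0) for _ in range(100)]   # i.i.d. samples of X
x_bar = sum(data) / len(data)                              # true minimizer w* = x_bar

def run(m):
    """One estimator: m = n gives BGD, m = 1 gives SGD, otherwise MBGD."""
    w = 0.0
    for k in range(1, 101):
        # BGD uses all samples; MBGD/SGD draw m samples with replacement.
        batch = data if m == len(data) else [random.choice(data) for _ in range(m)]
        grad = w - sum(batch) / len(batch)                 # (1/m) sum_j (w - x_j)
        w = w - (1.0 / k) * grad
    return w

for m in (len(data), 50, 1):                               # BGD, MBGD, SGD
    w_final = run(m)
    print(f"m = {m:3d}: estimate = {w_final:8.4f}, error = {abs(w_final - x_bar):.4f}")
```

Consistent with the closed-form solutions above, the BGD estimate lands on $\bar{x}$ immediately (its step-2 iterate is already $\bar{x}$), while the MBGD and SGD estimates scatter around $\bar{x}$ with error that shrinks as $m$ grows.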

6.4.5 Convergence of SGD

The rigorous proof of the convergence of SGD is given as follows.

Theorem 6.4 (Convergence of SGD). For the SGD algorithm in (6.13), if the following conditions are satisfied, then $w_k$ converges to the root of $\nabla_w \mathbb{E}[f(w, X)] = 0$ almost surely.

(a) $0 < c_1 \leq \nabla_w^2 f(w, X) \leq c_2$;
(b) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
(c) $\{x_k\}_{k=1}^{\infty}$ are i.i.d.

The three conditions in Theorem 6.4 are discussed below.

$\diamond$ Condition (a) is about the convexity of $f$. It requires the curvature of $f$ to be bounded from above and below. Here, $w$ is a scalar, and so is $\nabla_w^2 f(w, X)$. This condition can be generalized to the vector case. When $w$ is a vector, $\nabla_w^2 f(w, X)$ is the well-known Hessian matrix.
$\diamond$ Condition (b) is similar to that of the RM algorithm. In fact, the SGD algorithm is a special RM algorithm (as shown in the proof in Box 6.1). In practice, $a_k$ is often selected as a sufficiently small constant. Although condition (b) is not satisfied in this case, the algorithm can still converge in a certain sense [24, Section 1.5].
$\diamond$ Condition (c) is a common requirement.

Box 6.1: Proof of Theorem 6.4

We next show that the SGD algorithm is a special RM algorithm. Then, the convergence of SGD naturally follows from the RM theorem.

The problem to be solved by SGD is to minimize $J(w) = \mathbb{E}[f(w, X)]$. This problem can be converted to a root-finding problem: that is, finding the root of $\nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)] = 0$. Let

$$g(w) = \nabla_w J(w) = \mathbb{E}[\nabla_w f(w, X)].$$

Then, SGD aims to find the root of $g(w) = 0$. This is exactly the problem solved by the RM algorithm. The quantity that we can measure is $\tilde{g} = \nabla_w f(w, x)$, where $x$ is a sample of $X$. Note that $\tilde{g}$ can be rewritten as

$$\begin{aligned} \tilde{g}(w, \eta) &= \nabla_w f(w, x) \\ &= \mathbb{E}[\nabla_w f(w, X)] + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta}. \end{aligned}$$

Then, the RM algorithm for solving $g(w) = 0$ is

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k \nabla_w f(w_k, x_k),$$

which is the same as the SGD algorithm in (6.13). As a result, the SGD algorithm is a special RM algorithm. We next show that the three conditions in Theorem 6.1 are satisfied. Then, the convergence of SGD naturally follows from Theorem 6.1.

$\diamond$ Since $\nabla_w g(w) = \nabla_w \mathbb{E}[\nabla_w f(w, X)] = \mathbb{E}[\nabla_w^2 f(w, X)]$, it follows from $c_1 \leq \nabla_w^2 f(w, X) \leq c_2$ that $c_1 \leq \nabla_w g(w) \leq c_2$. Thus, the first condition in Theorem 6.1 is satisfied.
$\diamond$ The second condition in Theorem 6.1 is the same as the second condition in this theorem.
$\diamond$ The third condition in Theorem 6.1 requires $\mathbb{E}[\eta_k | \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 | \mathcal{H}_k] < \infty$. Since $\{x_k\}$ is i.i.d., we have $\mathbb{E}_{x_k}[\nabla_w f(w, x_k)] = \mathbb{E}[\nabla_w f(w, X)]$ for all $k$. Therefore,

$$\mathbb{E}[\eta_k | \mathcal{H}_k] = \mathbb{E}\big[\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)] \,\big|\, \mathcal{H}_k\big].$$

Since $\mathcal{H}_k = \{w_k, w_{k-1}, \ldots\}$ and $x_k$ is independent of $\mathcal{H}_k$, the first term on the right-hand side becomes $\mathbb{E}[\nabla_w f(w_k, x_k) | \mathcal{H}_k] = \mathbb{E}_{x_k}[\nabla_w f(w_k, x_k)]$. The second term becomes $\mathbb{E}\big[\mathbb{E}[\nabla_w f(w_k, X)] \big| \mathcal{H}_k\big] = \mathbb{E}[\nabla_w f(w_k, X)]$ because $\mathbb{E}[\nabla_w f(w_k, X)]$ is a function of $w_k$. Therefore,

$$\mathbb{E}[\eta_k | \mathcal{H}_k] = \mathbb{E}_{x_k}[\nabla_w f(w_k, x_k)] - \mathbb{E}[\nabla_w f(w_k, X)] = 0.$$

Similarly, it can be proven that $\mathbb{E}[\eta_k^2 | \mathcal{H}_k] < \infty$ if $|\nabla_w f(w, x)| < \infty$ for all $w$ given any $x$.

Since the three conditions in Theorem 6.1 are satisfied, the convergence of the SGD algorithm follows.