
6.3 Dvoretzky's convergence theorem

So far, the convergence of the RM algorithm has not been proven. To do that, we next introduce Dvoretzky's theorem [31, 32], a classic result in the field of stochastic approximation. This theorem can be used to analyze the convergence of the RM algorithm as well as many reinforcement learning algorithms.

This section is slightly mathematically intensive. Readers interested in the convergence analysis of stochastic algorithms are encouraged to study this section; otherwise, it can be skipped.

Theorem 6.2 (Dvoretzky's theorem). Consider a stochastic process

\Delta_{k+1} = (1 - \alpha_k)\Delta_k + \beta_k \eta_k,

where \{\alpha_k\}_{k=1}^{\infty}, \{\beta_k\}_{k=1}^{\infty}, \{\eta_k\}_{k=1}^{\infty} are stochastic sequences. Here, \alpha_k \geq 0 and \beta_k \geq 0 for all k. Then, \Delta_k converges to zero almost surely if the following conditions are satisfied:

(a) \sum_{k=1}^{\infty} \alpha_k = \infty, \sum_{k=1}^{\infty} \alpha_k^2 < \infty, and \sum_{k=1}^{\infty} \beta_k^2 < \infty uniformly almost surely;
(b) \mathbb{E}[\eta_k|\mathcal{H}_k] = 0 and \mathbb{E}[\eta_k^2|\mathcal{H}_k] \leq C almost surely;

where \mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \ldots, \eta_{k-1}, \ldots, \alpha_{k-1}, \ldots, \beta_{k-1}, \ldots\}.

Before presenting the proof of this theorem, we first clarify some issues.

In the RM algorithm, the coefficient sequence \{\alpha_k\} is deterministic. However, Dvoretzky's theorem allows \{\alpha_k\} and \{\beta_k\} to be random variables that depend on \mathcal{H}_k. Thus, it is more useful in cases where \alpha_k or \beta_k is a function of \Delta_k.
The first condition is stated as "uniformly almost surely". This is because \alpha_k and \beta_k may be random variables, and hence the convergence of their sums must be defined in the stochastic sense. The second condition is stated as "almost surely". This is because \mathcal{H}_k is a sequence of random variables rather than specific values. As a result, \mathbb{E}[\eta_k|\mathcal{H}_k] and \mathbb{E}[\eta_k^2|\mathcal{H}_k] are random variables, and the conditional expectation in this case is defined in the "almost sure" sense (Appendix B).
The statement of Theorem 6.2 is slightly different from that in [32] in the sense that Theorem 6.2 does not require \sum_{k=1}^{\infty} \beta_k = \infty in the first condition. When \sum_{k=1}^{\infty} \beta_k < \infty, especially in the extreme case where \beta_k = 0 for all k, the sequence can still converge.
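To make the theorem concrete, the following minimal sketch simulates the process in Theorem 6.2 under illustrative choices that are not part of the theorem: \alpha_k = \beta_k = 1/k and i.i.d. standard normal \eta_k. These choices satisfy conditions (a) and (b), and the simulated \Delta_k is expected to shrink toward zero.

```python
import numpy as np

# Minimal simulation of Delta_{k+1} = (1 - alpha_k) Delta_k + beta_k eta_k.
# Assumed illustrative choices: alpha_k = beta_k = 1/k and eta_k ~ N(0, 1) i.i.d.,
# which satisfy conditions (a) and (b) of Theorem 6.2.
rng = np.random.default_rng(0)

delta = 10.0  # initial error Delta_1
for k in range(1, 100_001):
    a_k = b_k = 1.0 / k
    eta_k = rng.standard_normal()  # E[eta_k | H_k] = 0, E[eta_k^2 | H_k] = 1
    delta = (1.0 - a_k) * delta + b_k * eta_k
    if k in (10, 100, 1_000, 10_000, 100_000):
        print(f"k = {k:6d}, |Delta_k| = {abs(delta):.6f}")
# |Delta_k| should decay toward zero as k grows, consistent with the theorem.
```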

6.3.1 Proof of Dvoretzky's theorem

The original proof of Dvoretzky's theorem was given in 1956 [31]. There are also other proofs. We next present a proof based on quasimartingales. With the convergence results of quasimartingales, the proof of Dvoretzky's theorem is straightforward. More information about quasimartingales can be found in Appendix C.

Proof of Dvoretzky's theorem. Let h_k \doteq \Delta_k^2. Then,

\begin{aligned}
h_{k+1} - h_k &= \Delta_{k+1}^2 - \Delta_k^2 \\
&= (\Delta_{k+1} - \Delta_k)(\Delta_{k+1} + \Delta_k) \\
&= (-\alpha_k \Delta_k + \beta_k \eta_k)\left[(2 - \alpha_k)\Delta_k + \beta_k \eta_k\right] \\
&= -\alpha_k(2 - \alpha_k)\Delta_k^2 + \beta_k^2 \eta_k^2 + 2(1 - \alpha_k)\beta_k \eta_k \Delta_k.
\end{aligned}

Taking expectations on both sides of the above equation yields

\mathbb{E}\left[h_{k+1} - h_k \mid \mathcal{H}_k\right] = \mathbb{E}\left[-\alpha_k(2 - \alpha_k)\Delta_k^2 \mid \mathcal{H}_k\right] + \mathbb{E}\left[\beta_k^2 \eta_k^2 \mid \mathcal{H}_k\right] + \mathbb{E}\left[2(1 - \alpha_k)\beta_k \eta_k \Delta_k \mid \mathcal{H}_k\right]. \tag{6.7}

First, since \Delta_k is included in and hence determined by \mathcal{H}_k, it can be taken out of the expectation (see property (e) in Lemma B.1). Second, consider the simple case where \alpha_k and \beta_k are determined by \mathcal{H}_k. This case is valid when, for example, \{\alpha_k\} and \{\beta_k\} are functions of \Delta_k or deterministic sequences. Then, they can also be taken out of the expectation. Therefore, (6.7) becomes

\mathbb{E}\left[h_{k+1} - h_k \mid \mathcal{H}_k\right] = -\alpha_k(2 - \alpha_k)\Delta_k^2 + \beta_k^2 \mathbb{E}\left[\eta_k^2 \mid \mathcal{H}_k\right] + 2(1 - \alpha_k)\beta_k \Delta_k \mathbb{E}\left[\eta_k \mid \mathcal{H}_k\right]. \tag{6.8}

For the first term, since \sum_{k=1}^{\infty} \alpha_k^2 < \infty implies \alpha_k \to 0 almost surely, there exists a finite n such that \alpha_k \leq 1 almost surely for all k \geq n. Without loss of generality, we next merely consider the case of \alpha_k \leq 1. Then, -\alpha_k(2 - \alpha_k)\Delta_k^2 \leq 0. For the second term, we have \beta_k^2 \mathbb{E}[\eta_k^2|\mathcal{H}_k] \leq \beta_k^2 C as assumed. The third term equals zero because \mathbb{E}[\eta_k|\mathcal{H}_k] = 0 as assumed. Therefore, (6.8) becomes

\mathbb{E}\left[h_{k+1} - h_k \mid \mathcal{H}_k\right] = -\alpha_k(2 - \alpha_k)\Delta_k^2 + \beta_k^2 \mathbb{E}\left[\eta_k^2 \mid \mathcal{H}_k\right] \leq \beta_k^2 C, \tag{6.9}

and hence

\sum_{k=1}^{\infty} \mathbb{E}\left[h_{k+1} - h_k \mid \mathcal{H}_k\right] \leq \sum_{k=1}^{\infty} \beta_k^2 C < \infty.

The last inequality is due to the condition \sum_{k=1}^{\infty} \beta_k^2 < \infty. Then, based on the quasimartingale convergence theorem in Appendix C, we conclude that h_k converges almost surely.

We next determine what value \Delta_k converges to. It follows from (6.9) that

\sum_{k=1}^{\infty} \alpha_k(2 - \alpha_k)\Delta_k^2 = \sum_{k=1}^{\infty} \beta_k^2 \mathbb{E}\left[\eta_k^2 \mid \mathcal{H}_k\right] - \sum_{k=1}^{\infty} \mathbb{E}\left[h_{k+1} - h_k \mid \mathcal{H}_k\right].

The first term on the right-hand side is bounded as assumed. The second term is also bounded because h_k converges and hence h_{k+1} - h_k is summable. Thus, \sum_{k=1}^{\infty} \alpha_k(2 - \alpha_k)\Delta_k^2 on the left-hand side is also bounded. Since we consider the case of \alpha_k \leq 1, we have

\infty > \sum_{k=1}^{\infty} \alpha_k(2 - \alpha_k)\Delta_k^2 \geq \sum_{k=1}^{\infty} \alpha_k \Delta_k^2 \geq 0.

Therefore, \sum_{k=1}^{\infty} \alpha_k \Delta_k^2 is bounded. Since \sum_{k=1}^{\infty} \alpha_k = \infty, the boundedness of \sum_{k=1}^{\infty} \alpha_k \Delta_k^2 implies \liminf_{k\to\infty} \Delta_k^2 = 0. Since h_k = \Delta_k^2 has already been shown to converge, its limit must be zero, and hence \Delta_k \to 0 almost surely. \square
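The mechanism behind the last step can also be observed numerically. The sketch below uses the same illustrative (assumed) choices as before, \alpha_k = \beta_k = 1/k with standard normal \eta_k, and tracks the two partial sums: \sum_k \alpha_k grows without bound while \sum_k \alpha_k \Delta_k^2 stays bounded, which is only possible if \Delta_k^2 becomes arbitrarily small.

```python
import numpy as np

# Numerical illustration of the final step of the proof: the partial sums of alpha_k
# diverge while the partial sums of alpha_k * Delta_k^2 remain bounded.
# Assumed illustrative choices: alpha_k = beta_k = 1/k, eta_k ~ N(0, 1) i.i.d.
rng = np.random.default_rng(0)

delta, sum_alpha, sum_alpha_delta2 = 10.0, 0.0, 0.0
for k in range(1, 100_001):
    a_k = 1.0 / k
    sum_alpha += a_k
    sum_alpha_delta2 += a_k * delta**2
    delta = (1.0 - a_k) * delta + a_k * rng.standard_normal()
print(f"partial sum of alpha_k             = {sum_alpha:.2f}  (grows like log k)")
print(f"partial sum of alpha_k * Delta_k^2 = {sum_alpha_delta2:.2f}  (stays bounded)")
```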

6.3.2 Application to mean estimation

While the mean estimation algorithm w_{k+1} = w_k + \alpha_k(x_k - w_k) has been analyzed using the RM theorem, we next show that its convergence can also be proven directly by Dvoretzky's theorem.

Proof. Let w^* = \mathbb{E}[X]. The mean estimation algorithm w_{k+1} = w_k + \alpha_k(x_k - w_k) can be rewritten as

w_{k+1} - w^* = w_k - w^* + \alpha_k\left(x_k - w^* + w^* - w_k\right).

Let \Delta_k \doteq w_k - w^*. Then, we have

\begin{aligned}
\Delta_{k+1} &= \Delta_k + \alpha_k\left(x_k - w^* - \Delta_k\right) \\
&= (1 - \alpha_k)\Delta_k + \alpha_k \underbrace{\left(x_k - w^*\right)}_{\eta_k}.
\end{aligned}

Since \{x_k\} is i.i.d., we have \mathbb{E}[x_k|\mathcal{H}_k] = \mathbb{E}[x_k] = w^*. As a result, \mathbb{E}[\eta_k|\mathcal{H}_k] = \mathbb{E}[x_k - w^*|\mathcal{H}_k] = 0, and \mathbb{E}[\eta_k^2|\mathcal{H}_k] = \mathbb{E}[x_k^2|\mathcal{H}_k] - (w^*)^2 = \mathbb{E}[x_k^2] - (w^*)^2 is bounded if the variance of x_k is finite. Here, \beta_k = \alpha_k, so the first condition of Dvoretzky's theorem is satisfied when \sum_{k=1}^{\infty} \alpha_k = \infty and \sum_{k=1}^{\infty} \alpha_k^2 < \infty. Following Dvoretzky's theorem, we conclude that \Delta_k converges to zero and hence w_k converges to w^* = \mathbb{E}[X] almost surely. \square
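A minimal numerical sketch of this result is given below. The sample distribution (Uniform(0, 10), so \mathbb{E}[X] = 5) and the step size \alpha_k = 1/k are illustrative assumptions; with this step size, the recursion reproduces the running sample average.

```python
import numpy as np

# Incremental mean estimation w_{k+1} = w_k + alpha_k (x_k - w_k) with alpha_k = 1/k.
# Assumed illustrative setup: x_k are i.i.d. samples of X ~ Uniform(0, 10), E[X] = 5.
rng = np.random.default_rng(1)

w = 0.0  # initial estimate w_1 (its value does not affect the limit)
for k in range(1, 50_001):
    x_k = rng.uniform(0.0, 10.0)
    w = w + (1.0 / k) * (x_k - w)
print(f"estimate after 50000 samples: {w:.4f} (true mean = 5.0)")
# Dvoretzky's theorem guarantees w_k -> E[X] almost surely.
```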

6.3.3 Application to the Robbins-Monro theorem

We are now ready to prove the Robbins-Monro theorem using Dvoretzky's theorem.

Proof of the Robbins-Monro theorem. The RM algorithm aims to find the root of g(w) = 0. Suppose that the root is w^* such that g(w^*) = 0. The RM algorithm is

\begin{aligned}
w_{k+1} &= w_k - a_k \tilde{g}\left(w_k, \eta_k\right) \\
&= w_k - a_k\left[g\left(w_k\right) + \eta_k\right].
\end{aligned}

Then, we have

w_{k+1} - w^* = w_k - w^* - a_k\left[g\left(w_k\right) - g\left(w^*\right) + \eta_k\right].

Due to the mean value theorem [7, 8], we have g(w_k) - g(w^*) = \nabla_w g(w_k')(w_k - w^*), where w_k' \in [w_k, w^*]. Let \Delta_k \doteq w_k - w^*. The above equation becomes

\begin{aligned}
\Delta_{k+1} &= \Delta_k - a_k\left[\nabla_w g\left(w_k'\right)\left(w_k - w^*\right) + \eta_k\right] \\
&= \Delta_k - a_k \nabla_w g\left(w_k'\right)\Delta_k + a_k\left(-\eta_k\right) \\
&= \big[1 - \underbrace{a_k \nabla_w g\left(w_k'\right)}_{\alpha_k}\big]\Delta_k + a_k\left(-\eta_k\right).
\end{aligned}

Note that \nabla_w g(w) is bounded as 0 < c_1 \leq \nabla_w g(w) \leq c_2 as assumed. Since \sum_{k=1}^{\infty} a_k = \infty and \sum_{k=1}^{\infty} a_k^2 < \infty as assumed, we know that \alpha_k = a_k \nabla_w g(w_k') satisfies \sum_{k=1}^{\infty} \alpha_k = \infty and \sum_{k=1}^{\infty} \alpha_k^2 < \infty. In addition, with \beta_k = a_k, we have \sum_{k=1}^{\infty} \beta_k^2 < \infty, and the conditions on the observation noise \eta_k assumed in the RM theorem ensure the second condition of Dvoretzky's theorem. Thus, all the conditions in Dvoretzky's theorem are satisfied, and hence \Delta_k converges to zero almost surely. \square
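The sketch below illustrates the RM iteration on a hypothetical root-finding problem: g(w) = w + 0.5\sin(w), whose unique root is w^* = 0 and whose gradient satisfies 0.5 \leq \nabla_w g(w) \leq 1.5, matching the bounds 0 < c_1 \leq \nabla_w g(w) \leq c_2. The noise distribution and the step size a_k = 1/k are also assumptions for illustration.

```python
import numpy as np

# Robbins-Monro iteration w_{k+1} = w_k - a_k * g_tilde(w_k, eta_k), where only noisy
# measurements g_tilde = g(w_k) + eta_k are available.
# Assumed illustrative setup: g(w) = w + 0.5*sin(w) (root w* = 0, gradient in [0.5, 1.5]),
# eta_k ~ N(0, 1) i.i.d., and a_k = 1/k.
rng = np.random.default_rng(2)

def g(w):
    return w + 0.5 * np.sin(w)

w = 5.0  # initial guess w_1
for k in range(1, 100_001):
    g_tilde = g(w) + rng.standard_normal()  # noisy measurement of g(w_k)
    w = w - (1.0 / k) * g_tilde
print(f"w after 100000 iterations: {w:.4f} (true root = 0.0)")
```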

The proof of the RM theorem demonstrates the power of Dvoretzky's theorem. In particular, \alpha_k in the proof is a stochastic sequence depending on w_k rather than a deterministic sequence. In this case, Dvoretzky's theorem is still applicable.

6.3.4 An extension of Dvoretzky's theorem

We next extend Dvoretzky's theorem to a more general theorem that can handle multiple variables. This general theorem, proposed in [32], can be used to analyze the convergence of stochastic iterative algorithms such as Q-learning.

Theorem 6.3. Consider a finite set S of real numbers. For the stochastic process

\Delta_{k+1}(s) = (1 - \alpha_k(s))\Delta_k(s) + \beta_k(s)\eta_k(s),

it holds that \Delta_k(s) converges to zero almost surely for every s \in S if the following conditions are satisfied for all s \in S:

(a) \sum_{k} \alpha_k(s) = \infty, \sum_{k} \alpha_k^2(s) < \infty, \sum_{k} \beta_k^2(s) < \infty, and \mathbb{E}[\beta_k(s)|\mathcal{H}_k] \leq \mathbb{E}[\alpha_k(s)|\mathcal{H}_k] uniformly almost surely;
(b) \|\mathbb{E}[\eta_k(s)|\mathcal{H}_k]\|_{\infty} \leq \gamma \|\Delta_k\|_{\infty}, where \gamma \in (0,1);
(c) \operatorname{var}[\eta_k(s)|\mathcal{H}_k] \leq C(1 + \|\Delta_k\|_{\infty})^2, where C is a constant.

Here, \mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \ldots, \eta_{k-1}, \ldots, \alpha_{k-1}, \ldots, \beta_{k-1}, \ldots\} represents the historical information, and \|\cdot\|_{\infty} denotes the maximum norm.

Proof. Since this theorem is an extension of Dvoretzky's theorem, it can be proven based on Dvoretzky's theorem. The details can be found in [32] and are omitted here. \square
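Although the full proof is omitted, the behavior described by Theorem 6.3 can be illustrated with a small simulation. The sketch below uses assumed choices: a set S with four elements, \alpha_k(s) = \beta_k(s) = 1/k, and noise whose conditional mean is \gamma \|\Delta_k\|_{\infty} with \gamma = 0.5, so that conditions (a), (b), and (c) hold.

```python
import numpy as np

# Minimal simulation of the multi-variable process in Theorem 6.3:
#   Delta_{k+1}(s) = (1 - alpha_k(s)) Delta_k(s) + beta_k(s) eta_k(s),  s in S.
# Assumed illustrative choices: |S| = 4, alpha_k(s) = beta_k(s) = 1/k, and
#   eta_k(s) = gamma * ||Delta_k||_inf + zero-mean Gaussian noise with gamma = 0.5,
# so ||E[eta_k(s) | H_k]||_inf <= gamma * ||Delta_k||_inf (condition (b)) and
# var[eta_k(s) | H_k] = 1 <= C (1 + ||Delta_k||_inf)^2 with C = 1 (condition (c)).
rng = np.random.default_rng(3)
gamma = 0.5

delta = np.array([10.0, -5.0, 3.0, 8.0])  # Delta_1(s) for the four elements of S
for k in range(1, 200_001):
    a_k = 1.0 / k
    eta = gamma * np.max(np.abs(delta)) + rng.standard_normal(delta.shape)
    delta = (1.0 - a_k) * delta + a_k * eta
    if k in (100, 10_000, 200_000):
        print(f"k = {k:6d}, ||Delta_k||_inf = {np.max(np.abs(delta)):.6f}")
# The maximum norm should decay toward zero, as the theorem guarantees.
```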

Some remarks about this theorem are given below.