So far, the convergence of the RM algorithm has not been proven. To fill this gap, we next introduce Dvoretzky's theorem [31, 32], a classic result in the field of stochastic approximation. This theorem can be used to analyze the convergence of the RM algorithm as well as many reinforcement learning algorithms.
This section is slightly mathematically intensive. Readers interested in the convergence analysis of stochastic algorithms are encouraged to study it; others may skip it.
Theorem 6.2 (Dvoretzky's theorem). Consider a stochastic process
$$\Delta_{k+1} = (1 - \alpha_k)\Delta_k + \beta_k \eta_k,$$
where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, and $\{\eta_k\}_{k=1}^{\infty}$ are stochastic sequences, with $\alpha_k \ge 0$ and $\beta_k \ge 0$ for all $k$. Then, $\Delta_k$ converges to zero almost surely if the following conditions are satisfied:
(a) $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\sum_{k=1}^{\infty} \beta_k^2 < \infty$ uniformly almost surely;
(b) $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] \le C$ almost surely;
where $\mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \dots, \eta_{k-1}, \dots, \alpha_{k-1}, \dots, \beta_{k-1}, \dots\}$.
Before presenting the proof of this theorem, we first clarify some issues.
- In the RM algorithm, the coefficient sequence $\{a_k\}$ is deterministic. Dvoretzky's theorem, however, allows $\{\alpha_k\}$ and $\{\beta_k\}$ to be random variables that depend on $\mathcal{H}_k$. It is therefore more useful in cases where $\alpha_k$ or $\beta_k$ is a function of $\Delta_k$.
- The first condition is required to hold "uniformly almost surely" because $\alpha_k$ and $\beta_k$ may be random variables, and hence the limits of their sums must be defined in the stochastic sense.
- The second condition is required to hold "almost surely" because $\mathcal{H}_k$ is a sequence of random variables rather than specific values. As a result, $\mathbb{E}[\eta_k \mid \mathcal{H}_k]$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k]$ are themselves random variables, and the conditional expectation in this case is defined in the "almost sure" sense (Appendix B).
- The statement of Theorem 6.2 differs slightly from that in [32] in that Theorem 6.2 does not require $\sum_{k=1}^{\infty} \beta_k = \infty$ in the first condition. When $\sum_{k=1}^{\infty} \beta_k < \infty$, especially in the extreme case where $\beta_k = 0$ for all $k$, the sequence can still converge.
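To see the theorem at work before proving it, consider the following minimal simulation sketch (ours, not from [31, 32]). The choices $\alpha_k = \beta_k = 1/k$ and standard normal noise are illustrative assumptions that satisfy conditions (a) and (b), so $\Delta_k$ should approach zero.

```python
import numpy as np

# Simulate the process Delta_{k+1} = (1 - alpha_k) * Delta_k + beta_k * eta_k.
# Illustrative assumptions: alpha_k = beta_k = 1/k (so sum alpha_k = inf,
# sum alpha_k^2 < inf, sum beta_k^2 < inf) and eta_k ~ N(0, 1)
# (so E[eta_k | H_k] = 0 and E[eta_k^2 | H_k] = 1 <= C).
rng = np.random.default_rng(0)
delta = 10.0  # an arbitrary initial error Delta_1
for k in range(1, 100_001):
    alpha_k = beta_k = 1.0 / k
    eta_k = rng.standard_normal()
    delta = (1.0 - alpha_k) * delta + beta_k * eta_k
    if k in (1, 10, 100, 1_000, 10_000, 100_000):
        print(f"k = {k:>6}, Delta_k = {delta: .6f}")
```

Running the sketch shows $\Delta_k$ shrinking toward zero despite the persistent noise, which is exactly the behavior the theorem guarantees.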
6.3.1 Proof of Dvoretzky's theorem
The original proof of Dvoretzky's theorem was given in 1956 [31]. There are also other proofs. We next present a proof based on quasimartingales. With the convergence results of quasimartingales, the proof of Dvoretzky's theorem is straightforward. More information about quasimartingales can be found in Appendix C.
Proof of Dvoretzky's theorem. Let $h_k \doteq \Delta_k^2$. Then,
$$\mathbb{E}[h_{k+1} \mid \mathcal{H}_k] = \mathbb{E}[\Delta_{k+1}^2 \mid \mathcal{H}_k] = \mathbb{E}\left[(1-\alpha_k)^2 \Delta_k^2 + \beta_k^2 \eta_k^2 + 2(1-\alpha_k)\beta_k \Delta_k \eta_k \mid \mathcal{H}_k\right]. \tag{6.7}$$
First, since $\Delta_k$ is included in and hence determined by $\mathcal{H}_k$, it can be taken out of the expectation (see property (e) in Lemma B.1). Second, consider the simple case where $\alpha_k$ and $\beta_k$ are determined by $\mathcal{H}_k$. This case is valid when, for example, $\{\alpha_k\}$ and $\{\beta_k\}$ are functions of $\Delta_k$ or deterministic sequences. Then, they can also be taken out of the expectation. Therefore, (6.7) becomes
$$\begin{aligned} \mathbb{E}[h_{k+1} \mid \mathcal{H}_k] &= (1-\alpha_k)^2 \Delta_k^2 + \beta_k^2\, \mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] + 2(1-\alpha_k)\beta_k \Delta_k\, \mathbb{E}[\eta_k \mid \mathcal{H}_k] \\ &= h_k - \alpha_k(2-\alpha_k)\Delta_k^2 + \beta_k^2\, \mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] + 2(1-\alpha_k)\beta_k \Delta_k\, \mathbb{E}[\eta_k \mid \mathcal{H}_k]. \end{aligned} \tag{6.8}$$
For the first term, since $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$ implies $\alpha_k \to 0$ almost surely, there exists a finite $n$ such that $\alpha_k \le 1$ almost surely for all $k \ge n$. Without loss of generality, we consider only the case of $\alpha_k \le 1$. Then, $-\alpha_k(2-\alpha_k)\Delta_k^2 \le 0$. For the second term, we have $\beta_k^2\, \mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] \le \beta_k^2 C$ as assumed. The third term equals zero because $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ as assumed. Therefore, (6.8) becomes
$$\mathbb{E}[h_{k+1} \mid \mathcal{H}_k] \le h_k - \alpha_k(2-\alpha_k)\Delta_k^2 + \beta_k^2 C, \tag{6.9}$$
and hence
$$\sum_{k=1}^{\infty} \mathbb{E}\big[\mathbb{E}[h_{k+1} \mid \mathcal{H}_k] - h_k\big]^{+} \le \sum_{k=1}^{\infty} \beta_k^2 C < \infty.$$
The last inequality is due to the condition $\sum_{k=1}^{\infty} \beta_k^2 < \infty$. Then, based on the quasimartingale convergence theorem in Appendix C, we conclude that $h_k$ converges almost surely.
We next determine what value $\Delta_k$ converges to. It follows from (6.9) that
$$\sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2 \le \sum_{k=1}^{\infty} \beta_k^2 C + \sum_{k=1}^{\infty} (h_k - h_{k+1}).$$
The first term on the right-hand side is bounded as assumed. The second term is also bounded because $h_k$ converges and hence $h_k - h_{k+1}$ is summable (the sum telescopes to $h_1 - \lim_{k\to\infty} h_k$). Thus, $\sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2$ on the left-hand side is also bounded. Since we consider the case of $\alpha_k \le 1$, so that $2 - \alpha_k \ge 1$, we have
$$\infty > \sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2 \ge \sum_{k=1}^{\infty} \alpha_k \Delta_k^2 \ge 0.$$
Therefore, $\sum_{k=1}^{\infty} \alpha_k \Delta_k^2$ is bounded. Since $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $h_k = \Delta_k^2$ converges, the limit of $\Delta_k^2$ must be zero: otherwise, $\sum_{k=1}^{\infty} \alpha_k \Delta_k^2 = \infty$, which is a contradiction. Hence, $\Delta_k \to 0$ almost surely. $\square$
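As a numerical sanity check on the proof (our illustration, under the same assumptions $\alpha_k = \beta_k = 1/k$ and Gaussian noise as in the earlier sketch), the two key quantities, $h_k = \Delta_k^2$ and the partial sum $\sum_{j \le k} \alpha_j \Delta_j^2$, can be tracked along a sample path; both should remain bounded, with $h_k$ converging to zero.

```python
import numpy as np

# Track h_k = Delta_k^2 and the partial sum of alpha_k * Delta_k^2, the two
# quantities central to the proof, along one sample path of the process.
rng = np.random.default_rng(1)
delta = 5.0          # initial error Delta_1
weighted_sum = 0.0   # partial sum of alpha_k * Delta_k^2
for k in range(1, 100_001):
    alpha_k = 1.0 / k
    weighted_sum += alpha_k * delta**2
    delta = (1.0 - alpha_k) * delta + alpha_k * rng.standard_normal()
    if k in (10, 1_000, 100_000):
        print(f"k = {k:>6}, h_k = {delta**2:.6f}, partial sum = {weighted_sum:.4f}")
```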
6.3.2 Application to mean estimation
While the mean estimation algorithm, $w_{k+1} = w_k + \alpha_k(x_k - w_k)$, has already been analyzed using the RM theorem, we next show that its convergence can also be proven directly by Dvoretzky's theorem.
Proof. Let $w^* \doteq \mathbb{E}[X]$. The mean estimation algorithm $w_{k+1} = w_k + \alpha_k(x_k - w_k)$ can be rewritten as
$$w_{k+1} - w^* = (1 - \alpha_k)(w_k - w^*) + \alpha_k(x_k - w^*),$$
which is the process in Theorem 6.2 with $\Delta_k \doteq w_k - w^*$, $\beta_k \doteq \alpha_k$, and $\eta_k \doteq x_k - w^*$.
Since $\{x_k\}$ is i.i.d., we have $\mathbb{E}[x_k \mid \mathcal{H}_k] = \mathbb{E}[x_k] = w^*$. As a result, $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = \mathbb{E}[x_k - w^* \mid \mathcal{H}_k] = 0$, and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] = \mathbb{E}[x_k^2 \mid \mathcal{H}_k] - (w^*)^2 = \mathbb{E}[x_k^2] - (w^*)^2$ is bounded if the variance of $x_k$ is finite. Hence, condition (b) holds, and condition (a) holds provided that $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$. Following Dvoretzky's theorem, we conclude that $\Delta_k$ converges to zero and hence $w_k$ converges to $w^* = \mathbb{E}[X]$ almost surely. $\square$
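As a quick illustration (ours), the following sketch runs the mean estimation algorithm on i.i.d. Gaussian samples with a hypothetical true mean of 3.0; with $\alpha_k = 1/k$, the conditions above hold and $w_k$ should approach $\mathbb{E}[X]$.

```python
import numpy as np

# Mean estimation w_{k+1} = w_k + alpha_k * (x_k - w_k) with alpha_k = 1/k.
# The samples x_k ~ N(3, 1) are i.i.d., so E[X] = 3 and var(x_k) is finite.
rng = np.random.default_rng(2)
true_mean = 3.0
w = 0.0  # an arbitrary initial estimate w_1
for k in range(1, 100_001):
    x_k = true_mean + rng.standard_normal()
    w += (1.0 / k) * (x_k - w)
print(f"w after 1e5 samples: {w:.4f} (E[X] = {true_mean})")
```

With $\alpha_k = 1/k$, this iteration reproduces the running sample average, so its convergence to $\mathbb{E}[X]$ is also consistent with the law of large numbers.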
6.3.3 Application to the Robbins-Monro theorem
We are now ready to prove the Robbins-Monro theorem using Dvoretzky's theorem.
Proof of the Robbins-Monro theorem. The RM algorithm aims to find the root of $g(w) = 0$. Suppose that the root is $w^*$, i.e., $g(w^*) = 0$. The RM algorithm is
$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k[g(w_k) + \eta_k].$$
Then, we have
$$w_{k+1} - w^* = w_k - w^* - a_k[g(w_k) - g(w^*) + \eta_k].$$
Due to the mean value theorem [7, 8], we have
$$g(w_k) - g(w^*) = \nabla_w g(w_k')(w_k - w^*),$$
where $w_k' \in [w_k, w^*]$. Let $\Delta_k \doteq w_k - w^*$. The above equation then becomes
$$\Delta_{k+1} = \Delta_k - a_k \nabla_w g(w_k')\Delta_k - a_k \eta_k = \big(1 - \underbrace{a_k \nabla_w g(w_k')}_{\alpha_k}\big)\Delta_k + \underbrace{a_k}_{\beta_k}(-\eta_k),$$
which is the process in Theorem 6.2.
Note that $\nabla_w g(w)$ is bounded as $0 < c_1 \le \nabla_w g(w) \le c_2$ as assumed. Since $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$ as assumed, we know that $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\sum_{k=1}^{\infty} \beta_k^2 < \infty$. Moreover, $-\eta_k$ inherits the zero conditional mean and bounded conditional second moment assumed for $\eta_k$ in the RM theorem. Thus, all the conditions in Dvoretzky's theorem are satisfied, and hence $\Delta_k$ converges to zero almost surely. $\square$
The proof of the RM theorem demonstrates the power of Dvoretzky's theorem. In particular, $\alpha_k$ in the proof is a stochastic sequence that depends on $w_k$ rather than a deterministic sequence. Dvoretzky's theorem is still applicable in this case.
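To make the RM iteration concrete, here is a minimal sketch (ours) applied to the hypothetical function $g(w) = 2(w - 1)$, whose root is $w^* = 1$ and whose gradient equals $2$ (so $c_1 = c_2 = 2$), observed through additive zero-mean noise.

```python
import numpy as np

# Robbins-Monro iteration w_{k+1} = w_k - a_k * g_tilde(w_k, eta_k) for the
# illustrative function g(w) = 2(w - 1) with root w* = 1. The observation is
# g_tilde = g(w_k) + eta_k with eta_k ~ N(0, 1), and a_k = 1/k satisfies
# sum a_k = inf and sum a_k^2 < inf.
rng = np.random.default_rng(3)
w = 10.0  # an arbitrary initial guess w_1
for k in range(1, 100_001):
    a_k = 1.0 / k
    g_tilde = 2.0 * (w - 1.0) + rng.standard_normal()  # noisy observation of g
    w -= a_k * g_tilde
print(f"w after 1e5 iterations: {w:.4f} (true root: 1.0)")
```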
6.3.4 An extension of Dvoretzky's theorem
We next extend Dvoretzky's theorem to a more general theorem that can handle multiple variables. This general theorem, proposed in [32], can be used to analyze the convergence of stochastic iterative algorithms such as Q-learning.
Theorem 6.3. Consider a finite set $\mathcal{S}$ of real numbers. For the stochastic process
$$\Delta_{k+1}(s) = (1 - \alpha_k(s))\Delta_k(s) + \beta_k(s)\eta_k(s),$$
it holds that $\Delta_k(s)$ converges to zero almost surely for every $s \in \mathcal{S}$ if the following conditions are satisfied for all $s \in \mathcal{S}$:
(a) $\sum_k \alpha_k(s) = \infty$, $\sum_k \alpha_k^2(s) < \infty$, $\sum_k \beta_k^2(s) < \infty$, and $\mathbb{E}[\beta_k(s) \mid \mathcal{H}_k] \le \mathbb{E}[\alpha_k(s) \mid \mathcal{H}_k]$ uniformly almost surely;
(b) $\|\mathbb{E}[\eta_k(s) \mid \mathcal{H}_k]\|_{\infty} \le \gamma \|\Delta_k\|_{\infty}$, where $\gamma \in (0, 1)$;
(c) $\mathrm{var}[\eta_k(s) \mid \mathcal{H}_k] \le C(1 + \|\Delta_k\|_{\infty})^2$, where $C$ is a constant.
Here, $\mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \dots, \eta_{k-1}, \dots, \alpha_{k-1}, \dots, \beta_{k-1}, \dots\}$ represents the historical information, and $\|\cdot\|_{\infty}$ denotes the maximum norm.
Proof. Since this theorem is an extension of Dvoretzky's theorem, it can be proven based on Dvoretzky's theorem. The details can be found in [32] and are omitted here. $\square$
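Although the proof is omitted, the behavior described by Theorem 6.3 can be illustrated numerically. The sketch below (ours, not from [32]) simulates the process over a hypothetical three-element set $\mathcal{S}$, with noise whose conditional mean is the contraction $\gamma \Delta_k(s)$ with $\gamma = 0.5$ and whose conditional variance is $1$, so that conditions (a)-(c) hold; this contraction-plus-noise structure is the same pattern that appears in the Q-learning analysis.

```python
import numpy as np

# Multi-variable process Delta_{k+1}(s) = (1 - alpha_k(s)) * Delta_k(s)
#                                         + beta_k(s) * eta_k(s), s in S = {0, 1, 2}.
# Illustrative assumptions: alpha_k(s) = beta_k(s) = 1/k (condition (a)),
# E[eta_k(s) | H_k] = gamma * Delta_k(s) with gamma = 0.5, so that
# ||E[eta_k | H_k]||_inf <= gamma * ||Delta_k||_inf (condition (b)),
# and var[eta_k(s) | H_k] = 1 <= C * (1 + ||Delta_k||_inf)^2 (condition (c)).
rng = np.random.default_rng(4)
gamma = 0.5
delta = np.array([5.0, -3.0, 8.0])  # Delta_1(s) for the three elements of S
for k in range(1, 100_001):
    step = 1.0 / k
    eta = gamma * delta + rng.standard_normal(3)  # contraction plus unit noise
    delta = (1.0 - step) * delta + step * eta
print("Delta_k after 1e5 iterations:", np.round(delta, 4))
```

All three components shrink toward zero together, as the theorem predicts for every $s \in \mathcal{S}$.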