So far, the convergence of the RM algorithm has not been proven. To fill this gap, we next introduce Dvoretzky's theorem [31, 32], a classic result in the field of stochastic approximation. This theorem can be used to analyze the convergence of the RM algorithm as well as many reinforcement learning algorithms.
This section is slightly mathematically intensive. Readers interested in the convergence analysis of stochastic algorithms are encouraged to study it; others may skip it.
Theorem 6.2 (Dvoretzky's theorem). Consider a stochastic process
$$\Delta_{k+1} = (1 - \alpha_k)\Delta_k + \beta_k \eta_k,$$
where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, and $\{\eta_k\}_{k=1}^{\infty}$ are stochastic sequences, with $\alpha_k \ge 0$ and $\beta_k \ge 0$ for all $k$. Then, $\Delta_k$ converges to zero almost surely if the following conditions are satisfied:
(a) $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\sum_{k=1}^{\infty} \beta_k^2 < \infty$ uniformly almost surely;
(b) $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] \le C$ almost surely;
where $\mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \dots, \eta_{k-1}, \dots, \alpha_{k-1}, \dots, \beta_{k-1}, \dots\}$.
Before presenting the proof of this theorem, we first clarify some issues.
- In the RM algorithm, the coefficient sequence $\{a_k\}$ is deterministic. Dvoretzky's theorem, however, allows $\{\alpha_k\}$ and $\{\beta_k\}$ to be random variables that depend on $\mathcal{H}_k$. It is therefore more useful in cases where $\alpha_k$ or $\beta_k$ is a function of $\Delta_k$.
- The first condition is required to hold "uniformly almost surely" because $\alpha_k$ and $\beta_k$ may be random variables, and hence the limits of their sums must be defined in the stochastic sense.
- The second condition is required to hold "almost surely" because $\mathcal{H}_k$ is a sequence of random variables rather than specific values. As a result, $\mathbb{E}[\eta_k \mid \mathcal{H}_k]$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k]$ are themselves random variables, and the conditional expectation in this case is defined in the "almost sure" sense (Appendix B).
- The statement of Theorem 6.2 differs slightly from that in [32] in that Theorem 6.2 does not require $\sum_{k=1}^{\infty} \beta_k = \infty$ in the first condition. When $\sum_{k=1}^{\infty} \beta_k < \infty$, especially in the extreme case where $\beta_k = 0$ for all $k$, the sequence can still converge.
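To see the theorem at work before proving it, consider the following minimal simulation sketch (ours, not from [31, 32]). The choices $\alpha_k = \beta_k = 1/k$ and standard normal noise are illustrative assumptions that satisfy conditions (a) and (b), so $\Delta_k$ should approach zero.

```python
import numpy as np

# Simulate the process Delta_{k+1} = (1 - alpha_k) * Delta_k + beta_k * eta_k.
# Illustrative assumptions: alpha_k = beta_k = 1/k (so sum alpha_k = inf,
# sum alpha_k^2 < inf, sum beta_k^2 < inf) and eta_k ~ N(0, 1)
# (so E[eta_k | H_k] = 0 and E[eta_k^2 | H_k] = 1 <= C).
rng = np.random.default_rng(0)
delta = 10.0  # an arbitrary initial error Delta_1
for k in range(1, 100_001):
    alpha_k = beta_k = 1.0 / k
    eta_k = rng.standard_normal()
    delta = (1.0 - alpha_k) * delta + beta_k * eta_k
    if k in (1, 10, 100, 1_000, 10_000, 100_000):
        print(f"k = {k:>6}, Delta_k = {delta: .6f}")
```

Running the sketch shows $\Delta_k$ shrinking toward zero despite the persistent noise, which is exactly the behavior the theorem guarantees.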
6.3.1 Proof of Dvoretzky's theorem
The original proof of Dvoretzky's theorem was given in 1956 [31]. There are also other proofs. We next present a proof based on quasimartingales. With the convergence results of quasimartingales, the proof of Dvoretzky's theorem is straightforward. More information about quasimartingales can be found in Appendix C.
Proof of Dvoretzky's theorem. Let $h_k \doteq \Delta_k^2$. Then,
$$\mathbb{E}[h_{k+1} \mid \mathcal{H}_k] = \mathbb{E}[\Delta_{k+1}^2 \mid \mathcal{H}_k] = \mathbb{E}\left[(1-\alpha_k)^2 \Delta_k^2 + \beta_k^2 \eta_k^2 + 2(1-\alpha_k)\beta_k \Delta_k \eta_k \mid \mathcal{H}_k\right]. \tag{6.7}$$
First, since $\Delta_k$ is included in and hence determined by $\mathcal{H}_k$, it can be taken out of the expectation (see property (e) in Lemma B.1). Second, consider the simple case where $\alpha_k$ and $\beta_k$ are determined by $\mathcal{H}_k$. This case is valid when, for example, $\{\alpha_k\}$ and $\{\beta_k\}$ are functions of $\Delta_k$ or deterministic sequences. Then, they can also be taken out of the expectation. Therefore, (6.7) becomes
$$\begin{aligned} \mathbb{E}[h_{k+1} \mid \mathcal{H}_k] &= (1-\alpha_k)^2 \Delta_k^2 + \beta_k^2\, \mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] + 2(1-\alpha_k)\beta_k \Delta_k\, \mathbb{E}[\eta_k \mid \mathcal{H}_k] \\ &= h_k - \alpha_k(2-\alpha_k)\Delta_k^2 + \beta_k^2\, \mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] + 2(1-\alpha_k)\beta_k \Delta_k\, \mathbb{E}[\eta_k \mid \mathcal{H}_k]. \end{aligned} \tag{6.8}$$
For the first term, since $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$ implies $\alpha_k \to 0$ almost surely, there exists a finite $n$ such that $\alpha_k \le 1$ almost surely for all $k \ge n$. Without loss of generality, we consider only the case of $\alpha_k \le 1$. Then, $-\alpha_k(2-\alpha_k)\Delta_k^2 \le 0$. For the second term, we have $\beta_k^2\, \mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] \le \beta_k^2 C$ as assumed. The third term equals zero because $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ as assumed. Therefore, (6.8) becomes
$$\mathbb{E}[h_{k+1} \mid \mathcal{H}_k] \le h_k - \alpha_k(2-\alpha_k)\Delta_k^2 + \beta_k^2 C, \tag{6.9}$$
and hence
$$\sum_{k=1}^{\infty} \mathbb{E}\big[\mathbb{E}[h_{k+1} \mid \mathcal{H}_k] - h_k\big]^{+} \le \sum_{k=1}^{\infty} \beta_k^2 C < \infty.$$
The last inequality is due to the condition $\sum_{k=1}^{\infty} \beta_k^2 < \infty$. Then, based on the quasimartingale convergence theorem in Appendix C, we conclude that $h_k$ converges almost surely.
We next determine what value $\Delta_k$ converges to. It follows from (6.9) that
$$\sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2 \le \sum_{k=1}^{\infty} \beta_k^2 C + \sum_{k=1}^{\infty} (h_k - h_{k+1}).$$
The first term on the right-hand side is bounded as assumed. The second term is also bounded because $h_k$ converges and hence $h_k - h_{k+1}$ is summable (the sum telescopes to $h_1 - \lim_{k\to\infty} h_k$). Thus, $\sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2$ on the left-hand side is also bounded. Since we consider the case of $\alpha_k \le 1$, so that $2 - \alpha_k \ge 1$, we have
$$\infty > \sum_{k=1}^{\infty} \alpha_k(2-\alpha_k)\Delta_k^2 \ge \sum_{k=1}^{\infty} \alpha_k \Delta_k^2 \ge 0.$$
Therefore, $\sum_{k=1}^{\infty} \alpha_k \Delta_k^2$ is bounded. Since $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $h_k = \Delta_k^2$ converges, the limit of $\Delta_k^2$ must be zero: otherwise, $\sum_{k=1}^{\infty} \alpha_k \Delta_k^2 = \infty$, which is a contradiction. Hence, $\Delta_k \to 0$ almost surely. $\square$
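As a numerical sanity check on the proof (our illustration, under the same assumptions $\alpha_k = \beta_k = 1/k$ and Gaussian noise as in the earlier sketch), the two key quantities, $h_k = \Delta_k^2$ and the partial sum $\sum_{j \le k} \alpha_j \Delta_j^2$, can be tracked along a sample path; both should remain bounded, with $h_k$ converging to zero.

```python
import numpy as np

# Track h_k = Delta_k^2 and the partial sum of alpha_k * Delta_k^2, the two
# quantities central to the proof, along one sample path of the process.
rng = np.random.default_rng(1)
delta = 5.0          # initial error Delta_1
weighted_sum = 0.0   # partial sum of alpha_k * Delta_k^2
for k in range(1, 100_001):
    alpha_k = 1.0 / k
    weighted_sum += alpha_k * delta**2
    delta = (1.0 - alpha_k) * delta + alpha_k * rng.standard_normal()
    if k in (10, 1_000, 100_000):
        print(f"k = {k:>6}, h_k = {delta**2:.6f}, partial sum = {weighted_sum:.4f}")
```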
6.3.2 Application to mean estimation
While the mean estimation algorithm, $w_{k+1} = w_k + \alpha_k(x_k - w_k)$, has already been analyzed using the RM theorem, we next show that its convergence can also be proven directly by Dvoretzky's theorem.
Proof. Let $w^* \doteq \mathbb{E}[X]$. The mean estimation algorithm $w_{k+1} = w_k + \alpha_k(x_k - w_k)$ can be rewritten as
$$w_{k+1} - w^* = (1 - \alpha_k)(w_k - w^*) + \alpha_k(x_k - w^*),$$
which is the process in Theorem 6.2 with $\Delta_k \doteq w_k - w^*$, $\beta_k \doteq \alpha_k$, and $\eta_k \doteq x_k - w^*$.
Since $\{x_k\}$ is i.i.d., we have $\mathbb{E}[x_k \mid \mathcal{H}_k] = \mathbb{E}[x_k] = w^*$. As a result, $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = \mathbb{E}[x_k - w^* \mid \mathcal{H}_k] = 0$, and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] = \mathbb{E}[x_k^2 \mid \mathcal{H}_k] - (w^*)^2 = \mathbb{E}[x_k^2] - (w^*)^2$ is bounded if the variance of $x_k$ is finite. Hence, condition (b) holds, and condition (a) holds provided that $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$. Following Dvoretzky's theorem, we conclude that $\Delta_k$ converges to zero and hence $w_k$ converges to $w^* = \mathbb{E}[X]$ almost surely. $\square$
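As a quick illustration (ours), the following sketch runs the mean estimation algorithm on i.i.d. Gaussian samples with a hypothetical true mean of 3.0; with $\alpha_k = 1/k$, the conditions above hold and $w_k$ should approach $\mathbb{E}[X]$.

```python
import numpy as np

# Mean estimation w_{k+1} = w_k + alpha_k * (x_k - w_k) with alpha_k = 1/k.
# The samples x_k ~ N(3, 1) are i.i.d., so E[X] = 3 and var(x_k) is finite.
rng = np.random.default_rng(2)
true_mean = 3.0
w = 0.0  # an arbitrary initial estimate w_1
for k in range(1, 100_001):
    x_k = true_mean + rng.standard_normal()
    w += (1.0 / k) * (x_k - w)
print(f"w after 1e5 samples: {w:.4f} (E[X] = {true_mean})")
```

With $\alpha_k = 1/k$, this iteration reproduces the running sample average, so its convergence to $\mathbb{E}[X]$ is also consistent with the law of large numbers.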
6.3.3 Application to the Robbins-Monro theorem
We are now ready to prove the Robbins-Monro theorem using Dvoretzky's theorem.
Proof of the Robbins-Monro theorem. The RM algorithm aims to find the root of $g(w) = 0$. Suppose that the root is $w^*$, i.e., $g(w^*) = 0$. The RM algorithm is
$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k) = w_k - a_k[g(w_k) + \eta_k].$$
Then, we have
$$w_{k+1} - w^* = w_k - w^* - a_k[g(w_k) - g(w^*) + \eta_k].$$
Due to the mean value theorem [7, 8], we have
$$g(w_k) - g(w^*) = \nabla_w g(w_k')(w_k - w^*),$$
where $w_k' \in [w_k, w^*]$. Let $\Delta_k \doteq w_k - w^*$. The above equation then becomes
$$\Delta_{k+1} = \Delta_k - a_k \nabla_w g(w_k')\Delta_k - a_k \eta_k = \big(1 - \underbrace{a_k \nabla_w g(w_k')}_{\alpha_k}\big)\Delta_k + \underbrace{a_k}_{\beta_k}(-\eta_k),$$
which is the process in Theorem 6.2.
Note that $\nabla_w g(w)$ is bounded as $0 < c_1 \le \nabla_w g(w) \le c_2$ as assumed. Since $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$ as assumed, we know that $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\sum_{k=1}^{\infty} \beta_k^2 < \infty$. Moreover, $-\eta_k$ inherits the zero conditional mean and bounded conditional second moment assumed for $\eta_k$ in the RM theorem. Thus, all the conditions in Dvoretzky's theorem are satisfied, and hence $\Delta_k$ converges to zero almost surely. $\square$
The proof of the RM theorem demonstrates the power of Dvoretzky's theorem. In particular, $\alpha_k$ in the proof is a stochastic sequence that depends on $w_k$ rather than a deterministic sequence. Dvoretzky's theorem is still applicable in this case.
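To make the RM iteration concrete, here is a minimal sketch (ours) applied to the hypothetical function $g(w) = 2(w - 1)$, whose root is $w^* = 1$ and whose gradient equals $2$ (so $c_1 = c_2 = 2$), observed through additive zero-mean noise.

```python
import numpy as np

# Robbins-Monro iteration w_{k+1} = w_k - a_k * g_tilde(w_k, eta_k) for the
# illustrative function g(w) = 2(w - 1) with root w* = 1. The observation is
# g_tilde = g(w_k) + eta_k with eta_k ~ N(0, 1), and a_k = 1/k satisfies
# sum a_k = inf and sum a_k^2 < inf.
rng = np.random.default_rng(3)
w = 10.0  # an arbitrary initial guess w_1
for k in range(1, 100_001):
    a_k = 1.0 / k
    g_tilde = 2.0 * (w - 1.0) + rng.standard_normal()  # noisy observation of g
    w -= a_k * g_tilde
print(f"w after 1e5 iterations: {w:.4f} (true root: 1.0)")
```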
6.3.4 An extension of Dvoretzky's theorem
We next extend Dvoretzky's theorem to a more general theorem that can handle multiple variables. This general theorem, proposed in [32], can be used to analyze the convergence of stochastic iterative algorithms such as Q-learning.
Theorem 6.3. Consider a finite set $\mathcal{S}$ of real numbers. For the stochastic process
$$\Delta_{k+1}(s) = (1 - \alpha_k(s))\Delta_k(s) + \beta_k(s)\eta_k(s),$$
it holds that $\Delta_k(s)$ converges to zero almost surely for every $s \in \mathcal{S}$ if the following conditions are satisfied for all $s \in \mathcal{S}$:
(a) $\sum_k \alpha_k(s) = \infty$, $\sum_k \alpha_k^2(s) < \infty$, $\sum_k \beta_k^2(s) < \infty$, and $\mathbb{E}[\beta_k(s) \mid \mathcal{H}_k] \le \mathbb{E}[\alpha_k(s) \mid \mathcal{H}_k]$ uniformly almost surely;
(b) $\|\mathbb{E}[\eta_k(s) \mid \mathcal{H}_k]\|_{\infty} \le \gamma \|\Delta_k\|_{\infty}$, where $\gamma \in (0, 1)$;
(c) $\mathrm{var}[\eta_k(s) \mid \mathcal{H}_k] \le C(1 + \|\Delta_k\|_{\infty})^2$, where $C$ is a constant.
Here, $\mathcal{H}_k = \{\Delta_k, \Delta_{k-1}, \dots, \eta_{k-1}, \dots, \alpha_{k-1}, \dots, \beta_{k-1}, \dots\}$ represents the historical information, and $\|\cdot\|_{\infty}$ denotes the maximum norm.
Proof. Since this theorem is an extension of Dvoretzky's theorem, it can be proven based on Dvoretzky's theorem. The details can be found in [32] and are omitted here. $\square$
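Although the proof is omitted, the behavior described by Theorem 6.3 can be illustrated numerically. The sketch below (ours, not from [32]) simulates the process over a hypothetical three-element set $\mathcal{S}$, with noise whose conditional mean is the contraction $\gamma \Delta_k(s)$ with $\gamma = 0.5$ and whose conditional variance is $1$, so that conditions (a)-(c) hold; this contraction-plus-noise structure is the same pattern that appears in the Q-learning analysis.

```python
import numpy as np

# Multi-variable process Delta_{k+1}(s) = (1 - alpha_k(s)) * Delta_k(s)
#                                         + beta_k(s) * eta_k(s), s in S = {0, 1, 2}.
# Illustrative assumptions: alpha_k(s) = beta_k(s) = 1/k (condition (a)),
# E[eta_k(s) | H_k] = gamma * Delta_k(s) with gamma = 0.5, so that
# ||E[eta_k | H_k]||_inf <= gamma * ||Delta_k||_inf (condition (b)),
# and var[eta_k(s) | H_k] = 1 <= C * (1 + ||Delta_k||_inf)^2 (condition (c)).
rng = np.random.default_rng(4)
gamma = 0.5
delta = np.array([5.0, -3.0, 8.0])  # Delta_1(s) for the three elements of S
for k in range(1, 100_001):
    step = 1.0 / k
    eta = gamma * delta + rng.standard_normal(3)  # contraction plus unit noise
    delta = (1.0 - step) * delta + step * eta
print("Delta_k after 1e5 iterations:", np.round(delta, 4))
```

All three components shrink toward zero together, as the theorem predicts for every $s \in \mathcal{S}$.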