
Preliminaries for Gradient Descent

We next present some preliminaries for the gradient descent method, which is one of the most frequently used optimization methods. The gradient descent method is also the foundation for the stochastic gradient descent method introduced in Chapter 6.

Convexity

Definitions:

  • Convex set: Suppose that $\mathcal{D}$ is a subset of $\mathbb{R}^n$. This set is convex if $z \doteq cx + (1-c)y \in \mathcal{D}$ for any $x, y \in \mathcal{D}$ and any $c \in [0, 1]$.

  • Convex function: Suppose $f: \mathcal{D} \to \mathbb{R}$, where $\mathcal{D}$ is convex. Then, the function $f(x)$ is convex if

$$f(cx + (1-c)y) \leq c f(x) + (1-c) f(y)$$

for any $x, y \in \mathcal{D}$ and $c \in [0, 1]$. A small numerical check of this inequality is sketched below.
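
As a quick illustration (a minimal sketch, not part of the original text), the following Python snippet checks the defining inequality for the sample convex function $f(x) = \|x\|^2$ at randomly drawn points; the choice of function and sampling scheme are assumptions made here purely for illustration.

```python
import numpy as np

# Sample convex function (illustrative choice): f(x) = ||x||^2.
def f(x):
    return np.dot(x, x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    c = rng.uniform(0.0, 1.0)
    lhs = f(c * x + (1 - c) * y)        # f(cx + (1-c)y)
    rhs = c * f(x) + (1 - c) * f(y)     # c f(x) + (1-c) f(y)
    assert lhs <= rhs + 1e-12           # the defining inequality of convexity
print("convexity inequality holds at all sampled points")
```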

Convex conditions:

  • First-order condition: Consider a function $f: \mathcal{D} \to \mathbb{R}$, where $\mathcal{D}$ is convex. Then, $f$ is convex if [106, 3.1.3]

$$f(y) - f(x) \geq \nabla f(x)^T (y - x), \quad \text{for all } x, y \in \mathcal{D}. \tag{D.1}$$

When $x$ is a scalar, $\nabla f(x)$ represents the slope of the tangent line of $f(x)$ at $x$. The geometric interpretation of (D.1) is that the point $(y, f(y))$ always lies on or above the tangent line.

  • Second-order condition: Consider a function $f: \mathcal{D} \to \mathbb{R}$, where $\mathcal{D}$ is convex. Then, $f$ is convex if

$$\nabla^2 f(x) \succeq 0, \quad \text{for all } x \in \mathcal{D},$$

where $\nabla^2 f(x)$ is the Hessian matrix. Both conditions are checked numerically for a simple quadratic in the sketch below.
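
The following sketch checks both conditions for an illustrative quadratic $f(x) = \frac{1}{2}x^T A x + b^T x$ with a positive semidefinite $A$, so that $\nabla f(x) = Ax + b$ and $\nabla^2 f(x) = A$; the specific construction below is an assumption made here for illustration.

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 x^T A x + b^T x with A = M^T M symmetric
# positive semidefinite, so grad f(x) = A x + b and the Hessian is the constant A.
rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
A = M.T @ M
b = rng.normal(size=3)

f = lambda x: 0.5 * x @ A @ x + b @ x
grad_f = lambda x: A @ x + b

# First-order condition (D.1): f(y) - f(x) >= grad f(x)^T (y - x) for all x, y.
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert f(y) - f(x) >= grad_f(x) @ (y - x) - 1e-9

# Second-order condition: the Hessian A is positive semidefinite.
assert np.min(np.linalg.eigvalsh(A)) >= -1e-9
print("first- and second-order conditions hold for this quadratic")
```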

Degree of convexity:

Given a convex function, it is often of interest how strong its convexity is. The Hessian matrix is a useful tool for describing the degree of convexity. If $\nabla^2 f(x)$ is nearly rank-deficient at a point, then the function is flat around that point and hence only weakly convex there. Otherwise, if the smallest eigenvalue of $\nabla^2 f(x)$ is positive and large, the function is sharply curved around that point and hence strongly convex there. The degree of convexity influences the step size selection in gradient descent algorithms.

The lower and upper bounds of $\nabla^2 f(x)$ play an important role in characterizing the convexity of the function.

  • Lower bound of $\nabla^2 f(x)$: A function is called strongly convex (which implies strict convexity) if $\nabla^2 f(x) \succeq \ell I_n$ with $\ell > 0$ for all $x$.

  • Upper bound of $\nabla^2 f(x)$: If $\nabla^2 f(x)$ is bounded from above so that $\nabla^2 f(x) \preceq L I_n$, then the first-order derivative $\nabla f(x)$ cannot change arbitrarily fast; equivalently, the curvature of the function cannot be arbitrarily large at any point.

The upper bound is implied by a Lipschitz condition on $\nabla f(x)$, as shown in the following lemma and illustrated numerically afterward.

Lemma D.1. Suppose that $f$ is a convex function. If $\nabla f(x)$ is Lipschitz continuous with a constant $L$ so that

$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \text{for all } x, y,$$

then $\nabla^2 f(x) \preceq L I_n$ for all $x$. Here, $\|\cdot\|$ denotes the Euclidean norm.
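
For a quadratic function, the two bounds and the Lipschitz condition in Lemma D.1 can be made concrete: the Hessian is a constant matrix $A$, so $\ell$ and $L$ are simply its smallest and largest eigenvalues. The following sketch (an illustrative assumption, not from the original text) verifies that this $L$ is indeed a Lipschitz constant of $\nabla f$.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x with a strictly positive definite Hessian A,
# so that ell * I <= Hessian <= L * I with ell = lambda_min(A) and L = lambda_max(A).
rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M.T @ M + 0.1 * np.eye(4)
ell, L = np.linalg.eigvalsh(A)[[0, -1]]   # eigenvalues in ascending order
print(f"ell = {ell:.3f} (lower bound), L = {L:.3f} (upper bound)")

grad_f = lambda x: A @ x
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # Lipschitz condition of Lemma D.1: ||grad f(x) - grad f(y)|| <= L ||x - y||.
    assert np.linalg.norm(grad_f(x) - grad_f(y)) <= L * np.linalg.norm(x - y) + 1e-9
print("the Lipschitz condition holds with constant L at all sampled pairs")
```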

Gradient descent algorithms

Consider the following optimization problem:

$$\min_x f(x)$$

where $x \in \mathcal{D} \subseteq \mathbb{R}^n$ and $f: \mathcal{D} \to \mathbb{R}$. The gradient descent algorithm is

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad k = 0, 1, 2, \dots \tag{D.2}$$

where $\alpha_k$ is a positive coefficient that may be fixed or time-varying. Here, $\alpha_k$ is called the step size or learning rate. Some remarks about (D.2) are given below.

$\diamond$ Direction of change: $\nabla f(x_k)$ is a vector that points in the direction along which $f(x_k)$ increases the fastest. Hence, the term $-\alpha_k \nabla f(x_k)$ changes $x_k$ in the direction along which $f(x_k)$ decreases the fastest.
$\diamond$ Magnitude of change: The magnitude of the change $-\alpha_k \nabla f(x_k)$ is jointly determined by the step size $\alpha_k$ and the magnitude of $\nabla f(x_k)$.

  • Magnitude of $\nabla f(x_k)$:

When $x_k$ is close to the optimum $x^*$ where $\nabla f(x^*) = 0$, the magnitude $\|\nabla f(x_k)\|$ is small. In this case, the update of $x_k$ is slow, which is reasonable because we do not want to update $x_k$ too aggressively and overshoot the optimum.

When $x_k$ is far from the optimum, the magnitude of $\nabla f(x_k)$ may be large, and hence the update of $x_k$ is fast. This is also reasonable because we hope that the estimate can approach the optimum as quickly as possible.

  • Step size $\alpha_k$:

If $\alpha_k$ is small, the magnitude of $-\alpha_k \nabla f(x_k)$ is small, and hence the convergence process is slow. If $\alpha_k$ is too large, the update of $x_k$ is aggressive, which may lead to either fast convergence or divergence.

How to select $\alpha_k$? The selection of $\alpha_k$ should depend on the degree of convexity of $f(x)$. If the function is sharply curved around the optimum (the degree of convexity is strong), then the step size $\alpha_k$ should be small to guarantee convergence. If the function is flat around the optimum (the degree of convexity is weak), then the step size can be larger so that $x_k$ approaches the optimum quickly. This intuition is illustrated numerically in the sketch below and verified in the following convergence analysis.
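
To make the update (D.2) and the role of the step size concrete, here is a minimal Python sketch of gradient descent with a fixed step size, run on an illustrative quadratic whose Hessian is $\mathrm{diag}(1, 10)$; the function name gradient_descent, the test problem, and the step sizes are assumptions chosen here for illustration.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters=100):
    """Run the update x_{k+1} = x_k - alpha * grad f(x_k) with a fixed step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Illustrative quadratic f(x) = 0.5 x^T A x with minimizer x* = 0.
# Its curvature bounds are ell = 1 (flat direction) and L = 10 (sharply curved direction).
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x
L = np.max(np.linalg.eigvalsh(A))

x_small = gradient_descent(grad_f, [5.0, 5.0], alpha=0.01)     # very small step: safe but slow
x_good = gradient_descent(grad_f, [5.0, 5.0], alpha=1.0 / L)   # step adapted to the curvature
print(x_small, x_good)   # both move toward x* = [0, 0]; the second is much closer after 100 steps
```

With the tiny step size, the iterate barely moves along the flat direction after 100 iterations, while the step size $1/L$ drives both coordinates close to the optimum, matching the intuition above.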

Convergence analysis

We next present a proof of the convergence of the gradient descent algorithm in (D.2). That is, we show that $x_k$ converges to the optimum $x^*$ where $\nabla f(x^*) = 0$. First of all, we make some assumptions.

$\diamond$ Assumption 1: $f(x)$ is strongly convex such that

$$\nabla^2 f(x) \succeq \ell I_n,$$

where $\ell > 0$ for all $x$.

$\diamond$ Assumption 2: $\nabla f(x)$ is Lipschitz continuous with a constant $L$. This assumption implies the following inequality according to Lemma D.1:

$$\nabla^2 f(x) \preceq L I_n.$$

The convergence proof is given below.

Proof. For any $x_{k+1}$ and $x_k$, it follows from [106, Section 9.1.2] that

$$f(x_{k+1}) = f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2} (x_{k+1} - x_k)^T \nabla^2 f(z_k) (x_{k+1} - x_k), \tag{D.3}$$

where $z_k$ is a convex combination of $x_k$ and $x_{k+1}$. Since $0 \preceq \nabla^2 f(z_k) \preceq L I_n$ by the two assumptions, the spectral norm satisfies $\|\nabla^2 f(z_k)\| \leq L$. Then, (D.3) implies

$$\begin{aligned} f(x_{k+1}) &\leq f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2} \|\nabla^2 f(z_k)\| \, \|x_{k+1} - x_k\|^2 \\ &\leq f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2} \|x_{k+1} - x_k\|^2. \end{aligned}$$

Substituting $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$ into the above inequality yields

$$\begin{aligned} f(x_{k+1}) &\leq f(x_k) + \nabla f(x_k)^T (-\alpha_k \nabla f(x_k)) + \frac{L}{2} \|\alpha_k \nabla f(x_k)\|^2 \\ &= f(x_k) - \alpha_k \|\nabla f(x_k)\|^2 + \frac{\alpha_k^2 L}{2} \|\nabla f(x_k)\|^2 \\ &= f(x_k) - \underbrace{\alpha_k \left(1 - \frac{\alpha_k L}{2}\right)}_{\eta_k} \|\nabla f(x_k)\|^2. \end{aligned} \tag{D.4}$$

We next show that if we select

$$0 < \alpha_k < \frac{2}{L}, \tag{D.5}$$

then the sequence $\{f(x_k)\}_{k=1}^{\infty}$ converges to $f(x^*)$, where $\nabla f(x^*) = 0$. First, (D.5) implies that $\eta_k > 0$. Then, (D.4) implies that $f(x_{k+1}) \leq f(x_k)$. Therefore, $\{f(x_k)\}$ is a nonincreasing sequence. Second, since $f(x_k)$ is always bounded from below by $f(x^*)$, we know that $\{f(x_k)\}$ converges as $k \to \infty$ according to the monotone convergence theorem in Theorem C.1. Suppose that the limit of the sequence is $f^*$. Then, taking the limit on both sides of (D.4) gives

$$\begin{aligned} \lim_{k \to \infty} f(x_{k+1}) &\leq \lim_{k \to \infty} f(x_k) - \lim_{k \to \infty} \eta_k \|\nabla f(x_k)\|^2 \\ \Leftrightarrow \quad f^* &\leq f^* - \lim_{k \to \infty} \eta_k \|\nabla f(x_k)\|^2 \\ \Leftrightarrow \quad 0 &\leq -\lim_{k \to \infty} \eta_k \|\nabla f(x_k)\|^2. \end{aligned}$$

Since $\eta_k \|\nabla f(x_k)\|^2 \geq 0$, the above inequality implies that $\lim_{k \to \infty} \eta_k \|\nabla f(x_k)\|^2 = 0$. In particular, when $\alpha_k$ is a fixed step size satisfying (D.5), $\eta_k$ is a positive constant, and hence $\|\nabla f(x_k)\| \to 0$. By the strong convexity in Assumption 1, $\ell \|x_k - x^*\| \leq \|\nabla f(x_k) - \nabla f(x^*)\| = \|\nabla f(x_k)\|$, and hence $x_k$ converges to $x^*$, where $\nabla f(x^*) = 0$. The proof is complete. The above proof is inspired by [107].

The inequality in (D.5) provides valuable insight into how $\alpha_k$ should be selected. If the function is flat ($L$ is small), the step size can be large; otherwise, if the function is sharply curved ($L$ is large), then the step size must be sufficiently small to ensure convergence, as demonstrated numerically below. There are also many other ways to prove the convergence, such as the contraction mapping theorem [108, Lemma 3]. A comprehensive introduction to convex optimization can be found in [106].
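
As a quick numerical check of (D.5) (a minimal sketch under the assumption of a one-dimensional quadratic $f(x) = \frac{1}{2}Lx^2$ with $L = 4$, chosen here for illustration), a fixed step size just below $2/L$ converges while one just above it diverges:

```python
# Scalar quadratic f(x) = 0.5 * L * x^2, so grad f(x) = L * x and the curvature bound is L.
L = 4.0
grad_f = lambda x: L * x

def run(alpha, x0=1.0, num_iters=50):
    x = x0
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)   # the gradient descent update (D.2)
    return x

print(run(alpha=0.9 * 2 / L))   # below 2/L: |x_k| shrinks toward the optimum x* = 0
print(run(alpha=1.1 * 2 / L))   # above 2/L: |x_k| grows without bound (divergence)
```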