We next present some preliminaries for the gradient descent method, which is one of the most frequently used optimization methods. The gradient descent method is also the foundation for the stochastic gradient descent method introduced in Chapter 6.
Convexity
Definitions:
Convex set: Suppose that D is a subset of Rn. This set is convex if z≐cx+(1−c)y∈D for any x,y∈D and any c∈[0,1].
Convex function: Suppose f:D→R where D is convex. Then, the function f(x) is convex if
f(cx+(1−c)y)≤cf(x)+(1−c)f(y)
for any x,y∈D and c∈[0,1].
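As a quick numerical illustration of this definition, the short sketch below checks the inequality for the sample convex function f(x)=∥x∥²; the function choice, the random sampling, and the tolerance are our own illustrative assumptions, not part of the text.

import numpy as np

# Numerical check of the convexity inequality
#   f(c*x + (1-c)*y) <= c*f(x) + (1-c)*f(y)
# for the sample convex function f(x) = ||x||^2 (illustrative choice).
f = lambda x: float(np.dot(x, x))

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    c = rng.uniform()
    assert f(c * x + (1 - c) * y) <= c * f(x) + (1 - c) * f(y) + 1e-12
print("convexity inequality holds at all sampled points")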
Convex conditions:
First-order condition: Consider a differentiable function f:D→R where D is convex. Then, f is convex if and only if [106, 3.1.3]
f(y)−f(x)≥∇f(x)ᵀ(y−x), for all x,y∈D. (D.1)
When x is a scalar, ∇f(x) is the slope of the tangent line of f(x) at x. The geometric interpretation of (D.1) is that the point (y,f(y)) always lies on or above the tangent line at (x,f(x)).
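For a concrete scalar instance of (D.1), take f(x)=x² with ∇f(x)=2x; this example is ours, chosen only to illustrate the inequality. The condition reads
\[
f(y) - f(x) = y^2 - x^2 \;\ge\; 2x\,(y - x)
\;\Longleftrightarrow\;
(y - x)^2 \ge 0,
\]
which holds for all x,y, so the graph of f lies on or above every tangent line, as described above.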
Second-order condition: Consider a twice differentiable function f:D→R where D is convex. Then, f is convex if
∇2f(x)⪰0, for all x∈D,
where ∇2f(x) is the Hessian matrix.
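As an illustrative instance of the second-order condition (our example, not from the text), consider f(x1,x2)=x1²+x1x2+x2². Its Hessian is
\[
\nabla^2 f(x) = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \succeq 0,
\]
with eigenvalues 1 and 3, so the condition holds at every x and f is convex.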
Degree of convexity:
Given a convex function, it is often of interest how strong its convexity is. The Hessian matrix is a useful tool for describing the degree of convexity. If ∇2f(x) is nearly rank deficient at a point, that is, its smallest eigenvalue is close to zero, then the function is flat around that point and hence only weakly convex. Conversely, if the smallest eigenvalue of ∇2f(x) is positive and large, the function is strongly curved around that point and hence strongly convex. The degree of convexity influences the step size selection in gradient descent algorithms.
The lower and upper bounds of ∇2f(x) play an important role in characterizing the convexity of the function.
Lower bound of ∇2f(x): A function is called strongly convex (and hence also strictly convex) if ∇2f(x)⪰ℓIn for all x, where ℓ>0.
Upper bound of ∇2f(x): If ∇2f(x) is bounded from above so that ∇2f(x)⪯LIn, then the gradient ∇f(x) cannot change arbitrarily fast; equivalently, the curvature of the function at any point is bounded by L.
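To make the eigenvalue picture concrete, the sketch below (our own illustration; the function and all numbers are arbitrary choices) evaluates the Hessian of the logistic loss f(x)=log(1+exp(wᵀx)), which equals σ(wᵀx)(1−σ(wᵀx))wwᵀ and is rank one: its smallest eigenvalue is zero, so the function is flat (only weakly convex) in most directions, while its largest eigenvalue gives the curvature bound at that point.

import numpy as np

# Hessian of the logistic loss f(x) = log(1 + exp(w^T x)):
#   grad f(x) = sigma(w^T x) * w,   Hess f(x) = sigma*(1 - sigma) * w w^T,
# which is rank one, so its smallest eigenvalue is 0 (flat directions)
# while its largest eigenvalue bounds the curvature at x.
w = np.array([1.0, 2.0, -0.5])         # illustrative weight vector
x = np.array([0.3, -0.1, 0.7])         # arbitrary evaluation point

s = 1.0 / (1.0 + np.exp(-(w @ x)))     # sigma(w^T x)
hessian = s * (1.0 - s) * np.outer(w, w)

eigvals = np.linalg.eigvalsh(hessian)  # ascending eigenvalues of the symmetric Hessian
print("smallest eigenvalue:", eigvals[0])   # ~0: weakly convex (flat) directions
print("largest eigenvalue: ", eigvals[-1])  # curvature (upper) bound at this point

Adding a quadratic regularizer (ℓ/2)∥x∥² would lift the smallest eigenvalue to ℓ and make the function strongly convex.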
This upper bound is implied by a Lipschitz condition on ∇f(x), as shown below.
Lemma D.1. Suppose that f is a convex function. If ∇f(x) is Lipschitz continuous with a constant L so that
∥∇f(x)−∇f(y)∥≤L∥x−y∥, for all x,y,
then ∇2f(x)⪯LIn for all x. Here, ∥⋅∥ denotes the Euclidean norm.
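As a sanity check of Lemma D.1 (our example), take the quadratic f(x)=½xᵀQx with symmetric Q⪰0, so that ∇f(x)=Qx. Then
\[
\|\nabla f(x) - \nabla f(y)\| = \|Q(x - y)\| \le \lambda_{\max}(Q)\,\|x - y\|,
\]
so ∇f is Lipschitz continuous with constant L=λmax(Q), and indeed ∇2f(x)=Q⪯λmax(Q)In for all x.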
Gradient descent algorithms
Consider the following optimization problem:
minₓ f(x)
where x∈D⊆Rn and f:D→R. The gradient descent algorithm is
xk+1=xk−αk∇f(xk), k=0,1,2,… (D.2)
where αk is a positive coefficient that may be fixed or time-varying. Here, αk is called the step size or learning rate. Some remarks about (D.2) are given below.
⋄ Direction of change: ∇f(xk) is a vector that points in the direction along which f(xk) increases the fastest. Hence, the term −αk∇f(xk) changes xk in the direction along which f(xk) decreases the fastest.
⋄ Magnitude of change: The magnitude of the change −αk∇f(xk) is jointly determined by the step size αk and the magnitude of ∇f(xk).
Magnitude of ∇f(xk):
When xk is close to the optimum x∗ where ∇f(x∗)=0, the magnitude ∥∇f(xk)∥ is small. In this case, the update process of xk is slow, which is reasonable because we do not want to update xk too aggressively and overshoot the optimum.
When xk is far from the optimum, the magnitude of ∇f(xk) may be large, and hence the update process of xk is fast. This is also reasonable because we hope that the estimate can approach the optimum as quickly as possible.
Step size αk:
If αk is small, the magnitude of −αk∇f(xk) is small, and hence convergence is slow. If αk is too large, the update of xk is aggressive, which may lead to fast convergence but can also cause divergence.
How to select αk? The selection of αk should depend on the degree of convexity of f. If the function is strongly curved around the optimum (the degree of convexity is strong), then the step size αk should be small to guarantee convergence. If the function is flat around the optimum (the degree of convexity is weak), then the step size can be large so that xk quickly approaches the optimum. This intuition will be verified in the convergence analysis that follows; a minimal implementation sketch of (D.2) is given first.
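The following sketch implements the update (D.2) with a fixed step size; the stopping rule, the quadratic test function, and all parameter values (α, the tolerance, the iteration cap) are our own illustrative choices rather than part of the text.

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10000):
    """Sketch of the update (D.2): x_{k+1} = x_k - alpha * grad f(x_k), with fixed alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (nearly) zero
            break
        x = x - alpha * g             # move against the gradient direction
    return x

# Illustrative use on f(x) = ||x - b||^2, whose gradient is 2*(x - b).
b = np.array([1.0, -2.0, 3.0])
x_opt = gradient_descent(lambda x: 2.0 * (x - b), x0=np.zeros(3))
print(x_opt)  # approaches b, the unique minimizer

In this example the gradient of f(x)=∥x−b∥² has Lipschitz constant L=2, so any fixed step size below 1 keeps the iteration convergent, consistent with the analysis below.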
Convergence analysis
We next present a proof of the convergence of the gradient descent algorithm in (D.2); that is, we show that xk converges to the optimum x∗ where ∇f(x∗)=0. First of all, we make some assumptions.
⋄ Assumption 1: f(x) is strongly convex such that
∇2f(x)⪰ℓIn, for all x,
where ℓ>0.
⋄ Assumption 2: ∇f(x) is Lipschitz continuous with a constant L. This assumption implies the following inequality according to Lemma D.1:
∇2f(x)⪯LIn.
The convergence proof is given below.
Proof. For any xk+1 and xk, it follows from [106, Section 9.1.2] that
f(xk+1)≤f(xk)+∇f(xk)ᵀ(xk+1−xk)+(L/2)∥xk+1−xk∥². (D.3)
Substituting the update rule (D.2), that is, xk+1−xk=−αk∇f(xk), into (D.3) gives
f(xk+1)≤f(xk)−ηk∥∇f(xk)∥², (D.4)
where ηk≐αk(1−Lαk/2). If the step size is selected such that
0<αk<2/L, (D.5)
then the sequence {f(xk)}k=1∞ converges to f(x∗) where ∇f(x∗)=0. First, (D.5) implies that ηk>0. Then, (D.4) implies that f(xk+1)≤f(xk). Therefore, {f(xk)} is a nonincreasing sequence. Second, since f(xk) is always bounded from below by f(x∗), we know that {f(xk)} converges as k→∞ according to the monotone convergence theorem in Theorem C.1. Suppose that the limit of the sequence is f∗. Then, taking the limit on both sides of (D.4) gives
f∗≤f∗−limk→∞ηk∥∇f(xk)∥².
Since ηk∥∇f(xk)∥²≥0, the above inequality implies that limk→∞ηk∥∇f(xk)∥²=0. When αk is bounded away from 0 and 2/L (for example, a constant step size satisfying (D.5)), ηk is bounded below by a positive constant, and hence limk→∞∥∇f(xk)∥=0. By the strong convexity in Assumption 1, ∥∇f(xk)∥≥ℓ∥xk−x∗∥, so xk converges to x∗ where ∇f(x∗)=0. The proof is complete. The above proof is inspired by [107].
The inequality in (D.5) provides valuable insight into how αk should be selected. If the function is flat (L is small), the step size can be large; otherwise, if the function is strongly curved (L is large), then the step size must be sufficiently small to ensure convergence. There are also many other ways to prove convergence, such as via the contraction mapping theorem [108, Lemma 3]. A comprehensive introduction to convex optimization can be found in [106].
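To see this step-size insight numerically, the short sketch below (our illustration) applies (D.2) to the scalar quadratic f(x)=(L/2)x² with L=10, for which ∇f(x)=Lx and the bound in (D.5) is 2/L=0.2; a step size just below the bound converges, while one just above it diverges.

# Step-size threshold for (D.2) on the scalar quadratic f(x) = (L/2) * x^2,
# where grad f(x) = L * x, so the update is x_{k+1} = (1 - alpha * L) * x_k.
# The iteration converges if and only if 0 < alpha < 2/L.
L = 10.0
for alpha in (0.15, 0.25):            # just below and just above the 2/L = 0.2 threshold
    x = 1.0
    for _ in range(50):
        x = x - alpha * L * x         # one gradient descent step (D.2)
    print(f"alpha = {alpha}: |x_50| = {abs(x):.3e}")
# alpha = 0.15 drives |x_50| toward zero; alpha = 0.25 makes it blow up.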