We next present some preliminaries for the gradient descent method, which is one of the most frequently used optimization methods. The gradient descent method is also the foundation for the stochastic gradient descent method introduced in Chapter 6.
Convexity
Definitions:
Convex set: Suppose that D is a subset of Rn. This set is convex if z≐cx+(1−c)y∈D for any x,y∈D and any c∈[0,1].
Convex function: Suppose f:D→R where D is convex. Then, the function f(x) is convex if
f(cx+(1−c)y)≤cf(x)+(1−c)f(y)
for any x,y∈D and c∈[0,1].
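As a quick numerical illustration of this definition, the short sketch below checks the inequality for the sample convex function f(x)=∥x∥²; the function choice, the random sampling, and the tolerance are our own illustrative assumptions, not part of the text.

import numpy as np

# Numerical check of the convexity inequality
#   f(c*x + (1-c)*y) <= c*f(x) + (1-c)*f(y)
# for the sample convex function f(x) = ||x||^2 (illustrative choice).
f = lambda x: float(np.dot(x, x))

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    c = rng.uniform()
    assert f(c * x + (1 - c) * y) <= c * f(x) + (1 - c) * f(y) + 1e-12
print("convexity inequality holds at all sampled points")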
Convex conditions:
First-order condition: Consider a differentiable function f:D→R where D is convex. Then, f is convex if and only if [106, 3.1.3]
f(y)−f(x)≥∇f(x)ᵀ(y−x), for all x,y∈D. (D.1)
When x is a scalar, ∇f(x) is the slope of the tangent line of f(x) at x. The geometric interpretation of (D.1) is that the point (y,f(y)) always lies on or above the tangent line at (x,f(x)).
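For a concrete scalar instance of (D.1), take f(x)=x² with ∇f(x)=2x; this example is ours, chosen only to illustrate the inequality. The condition reads
\[
f(y) - f(x) = y^2 - x^2 \;\ge\; 2x\,(y - x)
\;\Longleftrightarrow\;
(y - x)^2 \ge 0,
\]
which holds for all x,y, so the graph of f lies on or above every tangent line, as described above.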
Second-order condition: Consider a twice differentiable function f:D→R where D is convex. Then, f is convex if
∇2f(x)⪰0, for all x∈D,
where ∇2f(x) is the Hessian matrix.
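As an illustrative instance of the second-order condition (our example, not from the text), consider f(x1,x2)=x1²+x1x2+x2². Its Hessian is
\[
\nabla^2 f(x) = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \succeq 0,
\]
with eigenvalues 1 and 3, so the condition holds at every x and f is convex.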
Degree of convexity:
Given a convex function, it is often of interest how strong its convexity is. The Hessian matrix is a useful tool for describing the degree of convexity. If ∇2f(x) is nearly rank deficient at a point, that is, its smallest eigenvalue is close to zero, then the function is flat around that point and hence only weakly convex. Conversely, if the smallest eigenvalue of ∇2f(x) is positive and large, the function is strongly curved around that point and hence strongly convex. The degree of convexity influences the step size selection in gradient descent algorithms.
The lower and upper bounds of ∇2f(x) play an important role in characterizing the convexity of the function.
Lower bound of ∇2f(x): A function is called strongly convex (and hence also strictly convex) if ∇2f(x)⪰ℓIn for all x, where ℓ>0.
Upper bound of ∇2f(x): If ∇2f(x) is bounded from above so that ∇2f(x)⪯LIn, then the gradient ∇f(x) cannot change arbitrarily fast; equivalently, the curvature of the function at any point is bounded by L.
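To make the eigenvalue picture concrete, the sketch below (our own illustration; the function and all numbers are arbitrary choices) evaluates the Hessian of the logistic loss f(x)=log(1+exp(wᵀx)), which equals σ(wᵀx)(1−σ(wᵀx))wwᵀ and is rank one: its smallest eigenvalue is zero, so the function is flat (only weakly convex) in most directions, while its largest eigenvalue gives the curvature bound at that point.

import numpy as np

# Hessian of the logistic loss f(x) = log(1 + exp(w^T x)):
#   grad f(x) = sigma(w^T x) * w,   Hess f(x) = sigma*(1 - sigma) * w w^T,
# which is rank one, so its smallest eigenvalue is 0 (flat directions)
# while its largest eigenvalue bounds the curvature at x.
w = np.array([1.0, 2.0, -0.5])         # illustrative weight vector
x = np.array([0.3, -0.1, 0.7])         # arbitrary evaluation point

s = 1.0 / (1.0 + np.exp(-(w @ x)))     # sigma(w^T x)
hessian = s * (1.0 - s) * np.outer(w, w)

eigvals = np.linalg.eigvalsh(hessian)  # ascending eigenvalues of the symmetric Hessian
print("smallest eigenvalue:", eigvals[0])   # ~0: weakly convex (flat) directions
print("largest eigenvalue: ", eigvals[-1])  # curvature (upper) bound at this point

Adding a quadratic regularizer (ℓ/2)∥x∥² would lift the smallest eigenvalue to ℓ and make the function strongly convex.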
This upper bound is implied by a Lipschitz condition on ∇f(x), as shown below.
Lemma D.1. Suppose that f is a convex function. If ∇f(x) is Lipschitz continuous with a constant L so that
∥∇f(x)−∇f(y)∥≤L∥x−y∥, for all x,y,
then ∇2f(x)⪯LIn for all x. Here, ∥⋅∥ denotes the Euclidean norm.
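As a sanity check of Lemma D.1 (our example), take the quadratic f(x)=½xᵀQx with symmetric Q⪰0, so that ∇f(x)=Qx. Then
\[
\|\nabla f(x) - \nabla f(y)\| = \|Q(x - y)\| \le \lambda_{\max}(Q)\,\|x - y\|,
\]
so ∇f is Lipschitz continuous with constant L=λmax(Q), and indeed ∇2f(x)=Q⪯λmax(Q)In for all x.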
Gradient descent algorithms
Consider the following optimization problem:
minₓ f(x)
where x∈D⊆Rn and f:D→R. The gradient descent algorithm is
xk+1=xk−αk∇f(xk), k=0,1,2,… (D.2)
where αk is a positive coefficient that may be fixed or time-varying. Here, αk is called the step size or learning rate. Some remarks about (D.2) are given below.
⋄ Direction of change: ∇f(xk) is a vector that points in the direction along which f(xk) increases the fastest. Hence, the term −αk∇f(xk) changes xk in the direction along which f(xk) decreases the fastest.
⋄ Magnitude of change: The magnitude of the change −αk∇f(xk) is jointly determined by the step size αk and the magnitude of ∇f(xk).
Magnitude of ∇f(xk):
When xk is close to the optimum x∗ where ∇f(x∗)=0, the magnitude ∥∇f(xk)∥ is small. In this case, the update process of xk is slow, which is reasonable because we do not want to update xk too aggressively and overshoot the optimum.
When xk is far from the optimum, the magnitude of ∇f(xk) may be large, and hence the update process of xk is fast. This is also reasonable because we hope that the estimate can approach the optimum as quickly as possible.
Step size αk:
If αk is small, the magnitude of −αk∇f(xk) is small, and hence convergence is slow. If αk is too large, the update of xk is aggressive, which may lead to fast convergence but can also cause divergence.
How to select αk? The selection of αk should depend on the degree of convexity of f. If the function is strongly curved around the optimum (the degree of convexity is strong), then the step size αk should be small to guarantee convergence. If the function is flat around the optimum (the degree of convexity is weak), then the step size can be large so that xk quickly approaches the optimum. This intuition will be verified in the convergence analysis that follows; a minimal implementation sketch of (D.2) is given first.
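The following sketch implements the update (D.2) with a fixed step size; the stopping rule, the quadratic test function, and all parameter values (α, the tolerance, the iteration cap) are our own illustrative choices rather than part of the text.

import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10000):
    """Sketch of the update (D.2): x_{k+1} = x_k - alpha * grad f(x_k), with fixed alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is (nearly) zero
            break
        x = x - alpha * g             # move against the gradient direction
    return x

# Illustrative use on f(x) = ||x - b||^2, whose gradient is 2*(x - b).
b = np.array([1.0, -2.0, 3.0])
x_opt = gradient_descent(lambda x: 2.0 * (x - b), x0=np.zeros(3))
print(x_opt)  # approaches b, the unique minimizer

In this example the gradient of f(x)=∥x−b∥² has Lipschitz constant L=2, so any fixed step size below 1 keeps the iteration convergent, consistent with the analysis below.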
Convergence analysis
We next present a proof of the convergence of the gradient descent algorithm in (D.2); that is, we show that xk converges to the optimum x∗ where ∇f(x∗)=0. First of all, we make some assumptions.
⋄ Assumption 1: f(x) is strongly convex such that
∇2f(x)⪰ℓIn, for all x,
where ℓ>0.
⋄ Assumption 2: ∇f(x) is Lipschitz continuous with a constant L. This assumption implies the following inequality according to Lemma D.1:
∇2f(x)⪯LIn.
The convergence proof is given below.
Proof. For any xk+1 and xk, it follows from [106, Section 9.1.2] that
f(xk+1)≤f(xk)+∇f(xk)ᵀ(xk+1−xk)+(L/2)∥xk+1−xk∥². (D.3)
Substituting the update rule (D.2), that is, xk+1−xk=−αk∇f(xk), into (D.3) gives
f(xk+1)≤f(xk)−ηk∥∇f(xk)∥², (D.4)
where ηk≐αk(1−Lαk/2). If the step size is selected such that
0<αk<2/L, (D.5)
then the sequence {f(xk)}k=1∞ converges to f(x∗) where ∇f(x∗)=0. First, (D.5) implies that ηk>0. Then, (D.4) implies that f(xk+1)≤f(xk). Therefore, {f(xk)} is a nonincreasing sequence. Second, since f(xk) is always bounded from below by f(x∗), we know that {f(xk)} converges as k→∞ according to the monotone convergence theorem in Theorem C.1. Suppose that the limit of the sequence is f∗. Then, taking the limit on both sides of (D.4) gives
f∗≤f∗−limk→∞ηk∥∇f(xk)∥².
Since ηk∥∇f(xk)∥²≥0, the above inequality implies that limk→∞ηk∥∇f(xk)∥²=0. When αk is bounded away from 0 and 2/L (for example, a constant step size satisfying (D.5)), ηk is bounded below by a positive constant, and hence limk→∞∥∇f(xk)∥=0. By the strong convexity in Assumption 1, ∥∇f(xk)∥≥ℓ∥xk−x∗∥, so xk converges to x∗ where ∇f(x∗)=0. The proof is complete. The above proof is inspired by [107].
The inequality in (D.5) provides valuable insight into how αk should be selected. If the function is flat (L is small), the step size can be large; otherwise, if the function is strongly curved (L is large), then the step size must be sufficiently small to ensure convergence. There are also many other ways to prove convergence, such as via the contraction mapping theorem [108, Lemma 3]. A comprehensive introduction to convex optimization can be found in [106].
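To see this step-size insight numerically, the short sketch below (our illustration) applies (D.2) to the scalar quadratic f(x)=(L/2)x² with L=10, for which ∇f(x)=Lx and the bound in (D.5) is 2/L=0.2; a step size just below the bound converges, while one just above it diverges.

# Step-size threshold for (D.2) on the scalar quadratic f(x) = (L/2) * x^2,
# where grad f(x) = L * x, so the update is x_{k+1} = (1 - alpha * L) * x_k.
# The iteration converges if and only if 0 < alpha < 2/L.
L = 10.0
for alpha in (0.15, 0.25):            # just below and just above the 2/L = 0.2 threshold
    x = 1.0
    for _ in range(50):
        x = x - alpha * L * x         # one gradient descent step (D.2)
    print(f"alpha = {alpha}: |x_50| = {abs(x):.3e}")
# alpha = 0.15 drives |x_50| toward zero; alpha = 0.25 makes it blow up.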