Descent methods for unconstrained problems

Agenda today

Reading:

The problem we want to solve

\[ \text{minimize}_x \quad f(x) \]

Descent Methods

General algorithm:

Start with a point \(x\)

Repeat

1. Determine a descent direction \(\Delta x\).
2. Line search: choose a step size \(t > 0\).
3. Update: \(x := x + t \Delta x\).

Until the stopping criterion is satisfied, usually \(\|\nabla f(x)\|_2 \le \epsilon\).
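A minimal Python sketch of this generic loop, assuming an objective `f`, its gradient `grad_f`, and a backtracking line search (the function names and the parameters `alpha`, `beta` below are illustrative choices, not fixed by anything above):

```python
import numpy as np

def backtracking(f, grad_f, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until the sufficient-decrease condition
    f(x + t*dx) <= f(x) + alpha * t * grad_f(x)^T dx holds."""
    t, g = 1.0, grad_f(x)
    while f(x + t * dx) > f(x) + alpha * t * g @ dx:
        t *= beta
    return t

def descent_method(f, grad_f, direction, x0, eps=1e-6, max_iter=1000):
    """Generic descent loop: direction, line search, update,
    stopped when ||grad f(x)||_2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:        # stopping criterion
            break
        dx = direction(x, g)                # descent direction
        t = backtracking(f, grad_f, x, dx)  # step size
        x = x + t * dx                      # update
    return x
```

Here `direction` is any rule that returns a descent direction; for example, `direction = lambda x, g: -g` recovers the gradient descent method below.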

Gradient descent

In gradient descent, we take \(\Delta x = - \nabla f(x)\).

Overall algorithm:

Start with a point \(x\)

Repeat

1. \(\Delta x := -\nabla f(x)\).
2. Line search: choose a step size \(t > 0\).
3. Update: \(x := x + t \Delta x\).

Until the stopping criterion is satisfied, usually \(\|\nabla f(x)\|_2 \le \epsilon\).

Convergence time for gradient descent

If \(f\) is strongly convex, we have the result \[ f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star) \] for some constant \(c \in (0, 1)\), where \(p^\star\) is the optimal value.
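Since \(c \in (0, 1)\), the suboptimality shrinks by at least a constant factor per iteration (linear convergence). In particular, to guarantee \(f(x^{(k)}) - p^\star \le \epsilon\) it suffices to take

\[
k \;\ge\; \frac{\log\big((f(x^{(0)}) - p^\star)/\epsilon\big)}{\log(1/c)},
\]

and the closer \(c\) is to 1, the more iterations this requires.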

Gradient descent example

Iterates of gradient descent with backtracking line search, for minimizing \(f(x_1, x_2) = \exp(x_1 + 3 x_2 - .1) + \exp(x_1 - 3 x_2 - .1) + \exp(-x_1 - .1)\)

Contours represent the boundaries of the sublevel sets of the function: \(\{x : f(x) \le a\}\).
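For this \(f\) the gradient is \(\nabla f(x) = \big(e^{x_1+3x_2-0.1} + e^{x_1-3x_2-0.1} - e^{-x_1-0.1},\; 3e^{x_1+3x_2-0.1} - 3e^{x_1-3x_2-0.1}\big)\), so a short, self-contained sketch of the experiment (the starting point and backtracking parameters below are illustrative choices, not necessarily those used for the figure) could look like:

```python
import numpy as np

def f(x):
    x1, x2 = x
    return (np.exp(x1 + 3 * x2 - 0.1) + np.exp(x1 - 3 * x2 - 0.1)
            + np.exp(-x1 - 0.1))

def grad_f(x):
    x1, x2 = x
    a = np.exp(x1 + 3 * x2 - 0.1)
    b = np.exp(x1 - 3 * x2 - 0.1)
    c = np.exp(-x1 - 0.1)
    return np.array([a + b - c, 3 * a - 3 * b])

x = np.array([-1.0, 1.0])            # illustrative starting point
alpha, beta, eps = 0.3, 0.8, 1e-8    # backtracking parameters, tolerance
for k in range(1000):
    g = grad_f(x)
    if np.linalg.norm(g) <= eps:     # stopping criterion
        break
    dx = -g                          # gradient descent direction
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * g @ dx:   # backtracking
        t *= beta
    x = x + t * dx
print(k, x, f(x))
```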

Steepest descent

Steepest descent: modification of the descent direction.

The normalized steepest descent direction is defined as \[ \Delta x_{nsd} = \text{argmin}_v \{\nabla f(x)^T v : \|v\| = 1\} \] for some norm \(\|\cdot\|\).

Note: Steepest descent with the standard norm (\(\|\cdot\|_2\)) is the same as gradient descent.
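To see why: by the Cauchy–Schwarz inequality, \(\nabla f(x)^T v \ge -\|\nabla f(x)\|_2\) for every \(v\) with \(\|v\|_2 = 1\), with equality at

\[
v = -\frac{\nabla f(x)}{\|\nabla f(x)\|_2},
\]

so the normalized steepest descent direction for the Euclidean norm points along \(-\nabla f(x)\), and the method reduces to gradient descent (up to a scaling absorbed by the line search).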

Steepest descent algorithm

The same as gradient descent, but with a different descent direction:

Start with a point \(x\)

Repeat

1. Compute the steepest descent direction \(\Delta x_{sd}\) for the chosen norm.
2. Line search: choose a step size \(t > 0\).
3. Update: \(x := x + t \Delta x_{sd}\).

Until the stopping criterion is satisfied, usually \(\|\nabla f(x)\|_2 \le \epsilon\).

Normalized steepest descent direction for a quadratic norm
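For the quadratic norm \(\|v\|_P = (v^T P v)^{1/2}\), with \(P\) symmetric positive definite, minimizing \(\nabla f(x)^T v\) over \(\|v\|_P = 1\) gives

\[
\Delta x_{nsd} = -\big(\nabla f(x)^T P^{-1} \nabla f(x)\big)^{-1/2}\, P^{-1} \nabla f(x),
\]

i.e., a normalized multiple of \(-P^{-1} \nabla f(x)\).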

Normalized steepest descent direction for the \(\ell_1\) norm
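For the \(\ell_1\) norm, the minimum of \(\nabla f(x)^T v\) over \(\|v\|_1 = 1\) is attained at a signed coordinate vector:

\[
\Delta x_{nsd} = -\,\operatorname{sign}\!\left(\frac{\partial f(x)}{\partial x_i}\right) e_i,
\qquad i = \operatorname{argmax}_j \left|\frac{\partial f(x)}{\partial x_j}\right|,
\]

so each iteration updates only the coordinate with the largest partial derivative, much like a coordinate descent step.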

Examples of the effect of the norm
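A small numerical sketch of this effect, using an illustrative ill-conditioned quadratic (not one of the lecture's figures): steepest descent in the \(\|\cdot\|_P\) norm uses the direction \(-P^{-1}\nabla f(x)\), and choosing \(P\) to match the curvature makes the (transformed) sublevel sets round.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x, minimized at the origin.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def steepest_descent_P(P, x0, eps=1e-6, max_iter=10_000):
    """Steepest descent in the ||.||_P norm with backtracking;
    the (unnormalized) direction is -P^{-1} grad f(x)."""
    x = np.array(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            return x, k                      # solution, iterations used
        dx = -np.linalg.solve(P, g)          # steepest descent direction
        t = 1.0
        while f(x + t * dx) > f(x) + 0.3 * t * g @ dx:   # backtracking
            t *= 0.8
        x = x + t * dx
    return x, max_iter

x0 = [1.0, 1.0]
print(steepest_descent_P(np.eye(2), x0))  # P = I: plain gradient descent, many iterations
print(steepest_descent_P(A, x0))          # P = A: sublevel sets become round, ~1 iteration
```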

When can we expect these methods to do well?

From the pictures, we saw that performance depends strongly on the shape of the sublevel sets: the iterates make rapid progress when the sublevel sets are nearly round in the norm being used, and zig-zag slowly when they are long and thin.

Condition number of convex sets

Let \(C \subseteq \mathbb R^n\), and let \(q\) be a vector in \(\mathbb R^n\) specifying a direction.
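For a unit direction \(\|q\|_2 = 1\), the width of \(C\) in the direction \(q\) is

\[
W(C, q) = \sup_{z \in C} q^T z \;-\; \inf_{z \in C} q^T z,
\]

and the condition number of \(C\) is the ratio of its largest to its smallest squared width,

\[
\operatorname{cond}(C) = \frac{\max_{\|q\|_2 = 1} W(C, q)^2}{\min_{\|q\|_2 = 1} W(C, q)^2}.
\]

Sets with \(\operatorname{cond}(C)\) near 1 are nearly round; long, thin sets have large condition number.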

Convergence bounds and condition number

Recall for gradient descent we had the following result: if \(f\) is strongly convex, \[ f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star) \] where \(x^{(k)}\) is the \(k\)th gradient descent iterate.
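One way to make the dependence on conditioning explicit: if \(m I \preceq \nabla^2 f(x) \preceq M I\) for all \(x\), the bound holds with \(c = 1 - m/M\) under exact line search (and a similar constant, also in \((0,1)\), for backtracking). Since \(M/m\) also bounds the condition number of the sublevel sets, poorly conditioned sublevel sets mean \(c\) close to 1 and hence slow convergence.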

Quadratic norms as gradient descent in a transformed coordinate system

The quadratic norm \(\|\cdot\|_P\) is defined as \(\|x\|_P = (x^T P x)^{1/2}\), where \(P\) is a symmetric positive definite matrix.

Steepest descent with the \(\|\cdot\|_P\) norm is the same as gradient descent after a change of coordinates:
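Concretely, take \(\bar x = P^{1/2} x\) and \(\bar f(\bar x) = f(P^{-1/2} \bar x)\). Then \(\nabla \bar f(\bar x) = P^{-1/2} \nabla f(x)\), so a gradient descent step on \(\bar f\),

\[
\bar x^{+} = \bar x - t\, P^{-1/2} \nabla f(x),
\]

maps back (multiplying by \(P^{-1/2}\)) to

\[
x^{+} = x - t\, P^{-1} \nabla f(x),
\]

which is exactly a steepest descent step for \(f\) in the \(\|\cdot\|_P\) norm.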

For example

Re-interpretation of Newton's method
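In this language, Newton's method is steepest descent in the quadratic norm defined by the current Hessian: taking \(P = \nabla^2 f(x)\) at each iteration gives the direction \(\Delta x = -\nabla^2 f(x)^{-1} \nabla f(x)\), so the change of coordinates adapts to the local curvature of \(f\) at every step.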