More optimization…

So far: Newton’s method for maximizing log likelihoods.

Today: modifications of Newton’s method

Reading:

Fisher Scoring

Idea: Use the expected information, \(I(\theta)= E[-d^2 \ell(\theta)]\) instead of the observed information, \(d^2 \ell(\theta)\).

Algorithm: iterate \[ \theta^{(n+1)} = \theta^{(n)} + I(\theta^{(n)})^{-1} d\ell(\theta^{(n)}) \] until convergence.

\(I(\theta)\) often coincides with \(-d^2 \ell(\theta)\) (for example, in exponential families in canonical form, where the second derivative of the log likelihood is non-random), in which case Fisher scoring is exactly the same as Newton’s method.

Sometimes \(I(\theta)\) is easier to compute than \(-d^2 \ell(\theta)\).

Example: Non-linear least squares

Inputs: observations \((x_i, y_i)\), \(i = 1, \ldots, n\).

Model: \[ y_i \sim N(\mu_i(\theta), \sigma^2), \qquad \mu_i(\theta) = \frac{\theta_3}{1 + e^{-\theta_1 - \theta_2 x_i}} \] (a three-parameter logistic growth curve).

Log likelihood (taking \(\sigma^2 = 1\); a known \(\sigma^2\) only rescales the objective): \[ \ell(\theta) = - \frac{1}{2} \sum_{i=1}^n (y_i - \mu_i(\theta))^2 + C \]

Gradient/score: \[ \begin{align*} d\ell(\theta) &= \sum_{i=1}^n (y_i - \mu_i(\theta)) d\mu_i(\theta)\\ d\mu_i(\theta) &= \begin{pmatrix} \frac{\theta_3 e^{-\theta_1 - \theta_2 x_i}}{(1 + e^{-\theta_1 - \theta_2 x_i})^2} \\ \frac{x_i \theta_3 e^{-\theta_1 - \theta_2 x_i}}{(1 + e^{-\theta_1 - \theta_2 x_i})^2} \\ \frac{1}{1 + e^{-\theta_1 - \theta_2 x_i}} \end{pmatrix} \end{align*} \]

Hessian: \[ d^2 \ell(\theta) = -\sum_{i=1}^n d \mu_i(\theta) d \mu_i(\theta)^T + \sum_{i=1}^n (y_i - \mu_i(\theta))d^2 \mu_i(\theta) \]

Information: since \(E[y_i - \mu_i(\theta)] = 0\), the second term of the Hessian vanishes in expectation, leaving \[ I(\theta) = E[-d^2 \ell(\theta)] = \sum_{i=1}^n d\mu_i(\theta) d\mu_i(\theta)^T \]

Example

fisher_scoring_iterate = function(x, y, theta_current) {
    ## one Fisher scoring step: theta_new = theta + I(theta)^{-1} score(theta)
    score = compute_score(x, y, theta_current)
    information = compute_information(x, theta_current)
    theta_new = theta_current + solve(information) %*% score
    theta_new
}
compute_score = function(x, y, theta) {
    ## score: sum_i (y_i - mu_i(theta)) dmu_i(theta)
    fitted = nonlin_function(x, theta)
    grad_mu = compute_grad_mu(x, theta)
    rowSums(sweep(grad_mu, 2, STATS = y - fitted, FUN = "*"))
}
compute_information = function(x, theta) {
    ## information: sum_i dmu_i(theta) dmu_i(theta)^T
    grad_mu = compute_grad_mu(x, theta)  ## a 3 x n matrix
    grad_mu %*% t(grad_mu)
}
compute_grad_mu = function(x, theta) {
    ## columns are dmu_i(theta) for the logistic growth mean function
    denom = 1 + exp(-theta[1] - theta[2] * x)
    g1 = theta[3] * exp(-theta[1] - theta[2] * x) / denom^2
    g2 = x * theta[3] * exp(-theta[1] - theta[2] * x) / denom^2
    g3 = 1 / denom
    return(rbind(g1, g2, g3))
}
nonlin_function = function(x, theta) {
    ## mean function: mu(theta) = theta3 / (1 + exp(-theta1 - theta2 * x))
    theta[3] / (1 + exp(-theta[1] - theta[2] * x))
}

At the starting values:

library(NISTnls)
data(Ratkowsky3)
x = Ratkowsky3$x
y = Ratkowsky3$y
theta = c(-5, 1, 700)
xgrid = seq(0, 15, length.out = 1000)
fitted = nonlin_function(xgrid, theta)
plot(fitted ~ xgrid, type = 'l')
points(y ~ x)

After one iteration:

(theta = fisher_scoring_iterate(x, y, theta))
##           [,1]
## g1  -3.3298463
## g2   0.4649027
## g3 677.8340519
fitted = nonlin_function(xgrid, theta)
plot(fitted ~ xgrid, type = 'l')
points(y ~ x)

After two iterations:

(theta = fisher_scoring_iterate(x, y, theta))
##           [,1]
## g1  -4.2780124
## g2   0.6775608
## g3 664.2494602
fitted = nonlin_function(xgrid, theta)
plot(fitted ~ xgrid, type = 'l')
points(y ~ x)

After several more iterations:

for(i in 1:5) {
    theta = fisher_scoring_iterate(x, y, theta)
    print(theta)
}
##          [,1]
## g1  -4.438590
## g2   0.687286
## g3 702.939738
##           [,1]
## g1  -4.4435690
## g2   0.6887401
## g3 702.8457366
##           [,1]
## g1  -4.4424684
## g2   0.6885486
## g3 702.8741477
##           [,1]
## g1  -4.4425736
## g2   0.6885677
## g3 702.8711538
##           [,1]
## g1  -4.4425628
## g2   0.6885657
## g3 702.8714589
fitted = nonlin_function(xgrid, theta)
plot(fitted ~ xgrid, type = 'l')
points(y ~ x)
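
The same update can be packaged into a loop that runs until the parameter change is negligible. This wrapper is our own convenience addition, not part of the original notes; the tolerance and iteration cap are arbitrary choices.

fisher_scoring = function(x, y, theta, tol = 1e-8, maxit = 100) {
    for (i in 1:maxit) {
        theta_new = fisher_scoring_iterate(x, y, theta)
        ## stop when the largest coordinate-wise change is below tol
        if (max(abs(theta_new - theta)) < tol) break
        theta = theta_new
    }
    theta_new
}
fisher_scoring(x, y, c(-5, 1, 700))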

Compare with R’s built-in nls:

nls(y ~ b3 / ((1+exp(-b1-b2*x))), data = Ratkowsky3,
    start = c(b1 = -5, b2 = 0.75, b3 = 700),
    trace = TRUE)
## 12935.59 :   -5.00   0.75 700.00
## 8971.367 :   -4.3638666   0.6765795 703.9695386
## 8930.131 :   -4.4470558   0.6894499 702.6779778
## 8929.885 :   -4.4420175   0.6884684 702.8858827
## 8929.883 :   -4.4426184   0.6885758 702.8698817
## 8929.883 :   -4.4425582   0.6885649 702.8715881
## Nonlinear regression model
##   model: y ~ b3/((1 + exp(-b1 - b2 * x)))
##    data: Ratkowsky3
##       b1       b2       b3 
##  -4.4426   0.6886 702.8716 
##  residual sum-of-squares: 8930
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 5.737e-06

Quasi-Newton Methods

Idea: If you don’t move very far in one step, the Hessian shouldn’t change that much either.

Instead of recomputing the Hessian at each step, compute an approximate update.

\(A^{(n)}\) are approximations to the negative Hessian \(-d^2 \ell(\theta^{(n)})\), used in its place in the Newton update: \(\theta^{(n+1)} = \theta^{(n)} + (A^{(n)})^{-1} d\ell(\theta^{(n)})\).

Idea behind Hessian update: Taylor series again:

\[ d\ell(\theta^{(n)}) \approx d\ell(\theta^{(n+1)}) + d^2 \ell(\theta^{(n+1)})(\theta^{(n)} - \theta^{(n+1)}) \]

Rearranging: \[ d\ell(\theta^{(n)}) - d\ell(\theta^{(n+1)})\approx d^2 \ell(\theta^{(n+1)})(\theta^{(n)} - \theta^{(n+1)}) \]

The requirement that an approximation \(A^{(n+1)}\) of \(-d^2\ell(\theta^{(n+1)})\) satisfy this relation exactly, with \(-A^{(n+1)}\) in place of \(d^2 \ell(\theta^{(n+1)})\), is called the secant condition.

Several different ways to make the approximation:

For notation, let \[ \begin{align*} g^{(n)} &= d\ell(\theta^{(n)}) - d \ell(\theta^{(n+1)}) \\ s^{(n)} &= \theta^{(n)} - \theta^{(n+1)} \end{align*} \]

We can rewrite the secant condition \[ d\ell(\theta^{(n)}) - d\ell(\theta^{(n+1)})\approx d^2 \ell(\theta^{(n+1)})(\theta^{(n)} - \theta^{(n+1)}) \] as \[ -A^{(n+1)} s^{(n)} = g^{(n)} \]

Davidon’s method is a symmetric rank-1 (SR1) update.

\[ A^{(n+1)} = A^{(n)} - c^{(n)} v^{(n)} (v^{(n)})^T \]

where \[ c^{(n)} = \frac{1}{(g^{(n)} + A^{(n)} s^{(n)})^T s^{(n)}} \\ v^{(n)} = g^{(n)} + A^{(n)} s^{(n)} \]

(verify on your own that this satisfies the secant condition)

BFGS is a symmetric rank-2 update.

\[ A^{(n+1)} = A^{(n)} + \alpha u u^T + \beta v v^T \]

\(u = g^{(n)}\), \(v = A^{(n)} s^{(n)}\), \(\alpha = -1 / \left((g^{(n)})^T s^{(n)}\right)\), \(\beta = - 1 / \left((s^{(n)})^T A^{(n)} s^{(n)}\right)\)

Why are these low-rank updates useful? Because the Sherman–Morrison–Woodbury identity

\[ (A + UCV)^{-1}= A^{-1} - A^{-1} U(C^{-1} + VA^{-1} U)^{-1} V A^{-1} \]

lets us update \((A^{(n)})^{-1}\) directly using only matrix–vector products, rather than solving a fresh linear system (an \(O(p^3)\) operation) at every iteration.
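
To make the updates concrete, here is a small numerical check, our own sketch with made-up \(A\), \(s\), and \(g\) (not from the notes), that both updates satisfy the secant condition \(-A^{(n+1)} s^{(n)} = g^{(n)}\), and that the rank-1 (Sherman–Morrison) case of the identity above recovers the updated inverse without a new solve:

set.seed(1)
p = 3
A = crossprod(matrix(rnorm(p^2), p, p)) + diag(p)  ## current approximation, positive definite
s = rnorm(p)  ## plays the role of theta^(n) - theta^(n+1)
g = rnorm(p)  ## plays the role of the score difference

## Davidon / symmetric rank-1 update
v = g + A %*% s
A_sr1 = A - (v %*% t(v)) / as.numeric(t(v) %*% s)
all.equal(as.vector(A_sr1 %*% s), -g)  ## TRUE: secant condition holds

## BFGS rank-2 update
u = g
w = A %*% s
A_bfgs = A - (u %*% t(u)) / sum(u * s) - (w %*% t(w)) / as.numeric(t(s) %*% w)
all.equal(as.vector(A_bfgs %*% s), -g)  ## TRUE

## Sherman-Morrison: update the inverse directly from the old inverse
A_inv = solve(A)
denom = as.numeric(t(v) %*% s) - as.numeric(t(v) %*% A_inv %*% v)
A_sr1_inv = A_inv + (A_inv %*% v %*% t(v) %*% A_inv) / denom
all.equal(A_sr1_inv, solve(A_sr1))  ## TRUE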

Gradient descent

Our problem:

\[ \text{minimize}_x \quad f(x) \]

Note that we’re doing minimization instead of maximization now so that the notation matches the reading, but any minimization problem can be recast as a maximization (minimizing \(f\) is the same as maximizing \(-f\)) and vice versa.

Descent Methods

General algorithm:

Start with a point \(x\).

Repeat:

1. Determine a descent direction \(\Delta x\).
2. Line search: choose a step size \(t > 0\).
3. Update: \(x := x + t \Delta x\).

Until the stopping criterion is satisfied, usually \(\|\nabla f(x)\|_2 \le \epsilon\).

Gradient descent

In gradient descent, we take \(\Delta x = - \nabla f(x)\).

Overall algorithm:

Start with a point \(x\).

Repeat:

1. Set \(\Delta x := -\nabla f(x)\).
2. Line search: choose a step size \(t > 0\) (for example, by backtracking).
3. Update: \(x := x + t \Delta x\).

Until the stopping criterion is satisfied, usually \(\|\nabla f(x)\|_2 \le \epsilon\).

Gradient descent example

Iterates of gradient descent with backtracking line search, for minimizing \(f(x_1, x_2) = \exp(x_1 + 3 x_2 - .1) + \exp(x_1 - 3 x_2 - .1) + \exp(-x_1 - .1)\)

Contours represent the boundaries of the sublevel sets of the function: \(\{x : f(x) \le a\}\).
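
A minimal implementation of this example, as a sketch of our own (the function names are ours, and \(\alpha = 0.3\), \(\beta = 0.8\) are just typical backtracking parameters):

f = function(x) exp(x[1] + 3 * x[2] - .1) + exp(x[1] - 3 * x[2] - .1) + exp(-x[1] - .1)
grad_f = function(x) {
    e1 = exp(x[1] + 3 * x[2] - .1)
    e2 = exp(x[1] - 3 * x[2] - .1)
    e3 = exp(-x[1] - .1)
    c(e1 + e2 - e3, 3 * e1 - 3 * e2)
}
gradient_descent = function(x, eps = 1e-6, alpha = 0.3, beta = 0.8) {
    repeat {
        g = grad_f(x)
        if (sqrt(sum(g^2)) <= eps) break  ## stopping criterion
        dx = -g                           ## gradient descent direction
        t = 1                             ## backtracking line search
        while (f(x + t * dx) > f(x) + alpha * t * sum(g * dx)) t = beta * t
        x = x + t * dx
    }
    x
}
gradient_descent(c(-1, 1))  ## converges to about (-log(2)/2, 0)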

Steepest descent

Steepest descent: modification of the descent direction.

The normalized steepest descent direction is defined as \[ \Delta x_{nsd} = \text{argmin}_v \{\nabla f(x)^T v : \|v\| = 1\} \] for some norm \(\|\cdot \|\).

Note: Steepest descent with the standard norm (\(\|\cdot\|_2\)) is the same as gradient descent.

Steepest descent algorithm

The same as gradient descent, but with a different descent direction:

Start with a point \(x\).

Repeat:

1. Set \(\Delta x := \Delta x_{sd}\), the steepest descent direction for the chosen norm.
2. Line search: choose a step size \(t > 0\).
3. Update: \(x := x + t \Delta x\).

Until the stopping criterion is satisfied, usually \(\|\nabla f(x)\|_2 \le \epsilon\).

Normalized steepest descent direction for a quadratic norm

For the quadratic norm defined by a positive definite matrix \(P\), the normalized steepest descent direction has the closed form \[ \Delta x_{nsd} = -\left( \nabla f(x)^T P^{-1} \nabla f(x) \right)^{-1/2} P^{-1} \nabla f(x) \]
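
A quick numerical check, with a made-up gradient \(g\) and matrix \(P\) of our own, that this direction has unit length in the \(\|\cdot\|_P\) norm:

g = c(1, 2)                        ## a made-up gradient vector
P = matrix(c(2, .5, .5, 1), 2, 2)  ## a made-up positive definite matrix
dx_nsd = -as.vector(solve(P, g)) / sqrt(sum(g * solve(P, g)))
sum(dx_nsd * (P %*% dx_nsd))       ## 1, i.e. unit length in the P-norm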

Examples of the effect of the norm

When can we expect these methods to do well?

From the pictures, we saw that gradient descent makes rapid progress when the sublevel sets are nearly spherical, but zig-zags and converges slowly when they are strongly elongated (ill conditioned); steepest descent in a well-chosen norm effectively reshapes the sublevel sets to be closer to spherical.

Quadratic norms as a gradient descent in a transformed coordinate system

The quadratic norm \(\|\cdot\|_P\) is defined as \(\|x\|_P = (x^T P x)^{1/2}\), where \(P\) is a positive definite matrix.

Steepest descent with the \(\|\cdot\|_P\) norm is the same as gradient descent after the change of coordinates \(u = P^{1/2} x\): running gradient descent on \(\bar{f}(u) = f(P^{-1/2} u)\) and mapping the iterates back through \(x = P^{-1/2} u\) reproduces the steepest descent iterates for \(f\).

For example:
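
Here is a small numerical check, our own sketch reusing f and grad_f from the gradient descent sketch above (the matrix \(P\), starting point, and step size are made up), that one unnormalized steepest descent step \(x - t P^{-1} \nabla f(x)\) equals one gradient step on \(\bar{f}\) mapped back to the original coordinates:

P = matrix(c(2, .5, .5, 1), 2, 2)
eig = eigen(P)
P_half = eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)  ## P^{1/2}
P_half_inv = solve(P_half)
x = c(-1, 1)
t = 0.1
## one steepest descent step in the P-norm (unnormalized direction)
step_sd = x - t * solve(P, grad_f(x))
## one gradient step on fbar(u) = f(P^{-1/2} u), in the coordinates u = P^{1/2} x
u = P_half %*% x
u_new = u - t * P_half_inv %*% grad_f(as.vector(P_half_inv %*% u))
all.equal(step_sd, as.vector(P_half_inv %*% u_new))  ## TRUE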

Convergence theorem

Suppose \(f\) is strongly convex, so that there are constants \(0 < m \le M\) such that \(mI \preceq d^2 f \preceq MI\).

Let \(x^{(k)}\) be the solution after \(k\) steps of gradient descent with exact line search, and let \(p^\star\) be the minimizing value of \(f\).

Then \[ f(x^{(k)}) - p^\star \le (1 - m/M)^k (f(x^{(0)}) - p^\star) \]

(Equation 9.18 in Boyd and Vandenberghe)
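
As a quick sanity check of this bound, a sketch of our own on a simple quadratic, where \(d^2 f = \mathrm{diag}(m, M)\) so the constants are exact and exact line search has a closed form:

m = 1; M = 10
f_quad = function(x) 0.5 * (m * x[1]^2 + M * x[2]^2)
grad_quad = function(x) c(m * x[1], M * x[2])
x = c(1, 1)
p_star = 0  ## the minimum value of f_quad
gap0 = f_quad(x) - p_star
for (k in 1:8) {
    g = grad_quad(x)
    t = sum(g^2) / (m * g[1]^2 + M * g[2]^2)  ## exact line search for a quadratic
    x = x - t * g
    cat("k =", k, "  gap =", f_quad(x) - p_star,
        "  bound =", (1 - m / M)^k * gap0, "\n")
}

The printed gap stays below the bound at every step, as the theorem guarantees.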

Re-interpretation of Newton’s method

In this framework, the Newton step \(-(d^2 f(x))^{-1} \nabla f(x)\) is the steepest descent direction in the quadratic norm defined by the current Hessian, \(P = d^2 f(x)\), recomputed at each iterate; near the optimum this choice of norm makes the transformed sublevel sets nearly spherical, which helps explain Newton’s fast local convergence.

Summing up