Agenda today
Variations on Newton's method
Iteratively reweighted least squares
BFGS, other Hessian approximations
Reading:
Newton's method:
Relationship with statistical quantities
The derivative needed in Newton's method is the score, \(d\ell(\theta)\)
The expected value of the negative Hessian used in Newton's method is the expected information
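Putting these together, the Newton update for maximizing the log likelihood \(\ell\) is \[ \theta_{n+1} = \theta_n - d^2 \ell(\theta_n)^{-1} d\ell(\theta_n) = \theta_n + [-d^2 \ell(\theta_n)]^{-1} d\ell(\theta_n), \] i.e., the score premultiplied by the inverse of the observed information.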
Potential problems: the observed information \(-d^2 \ell(\theta)\) can be expensive to compute, and far from the maximum it may fail to be positive definite
Idea: Use the expected information, \(J(\theta) = E[-d^2 \ell(\theta)]\), instead of the observed information, \(-d^2 \ell(\theta)\).
Algorithm:
Pick a starting parameter estimate \(\theta_0\)
Set \(\theta_{n+1} = \theta_n + J(\theta_n)^{-1} d\ell(\theta_n)\)
\(J(\theta)\) often coincides with \(-d^2 \ell(\theta)\), in which case Fisher Scoring is exactly the same as Newton's method.
Sometimes \(J(\theta)\) is easier to compute than \(-d^2 \ell(\theta)\).
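As a concrete reference, a minimal R sketch of this iteration, assuming user-supplied functions score(theta) returning \(d\ell(\theta)\) and expected_info(theta) returning \(J(\theta)\) (the names are illustrative, not from any package):

```r
# Fisher scoring: theta_{n+1} = theta_n + J(theta_n)^{-1} d l(theta_n)
fisher_scoring <- function(theta0, score, expected_info,
                           max_iter = 100, tol = 1e-8) {
  theta <- theta0
  for (i in seq_len(max_iter)) {
    step  <- drop(solve(expected_info(theta), score(theta)))  # J(theta)^{-1} * score
    theta <- theta + step
    if (sum(abs(step)) < tol) break                           # stop when the step is tiny
  }
  theta
}
```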
Exponential families are families of probability distributions whose densities take the form \[ f(y_i) = \exp(y_i \theta_i - b(\theta_i) + c(y_i, \phi)) \]
\(b\) and \(c\) are known functions that describe the family.
\(\theta_i\) is the natural parameter.
Properties that we'll need later:
\(E(y_i) = \mu_i = b'(\theta_i)\)
\(\text{var}(y_i) = \sigma_i^2 = b''(\theta_i)\)
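As a quick check of these properties, the Poisson(\(\lambda\)) density fits this form: \[ f(y_i) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = \exp(y_i \log \lambda - \lambda - \log y_i!), \] so \(\theta_i = \log \lambda\), \(b(\theta_i) = e^{\theta_i}\), \(c(y_i, \phi) = -\log y_i!\), and \(b'(\theta_i) = b''(\theta_i) = \lambda\), matching the Poisson mean and variance.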
Why do we like them?
Many commonly-used distributions: normal, exponential, Poisson, binomial, multinomial, etc.
Easy to analyze
\(J(\theta)\) available analytically
They describe the stochasticity in generalized linear models
Models for independent observations, \(y_i, i = 1,\ldots, n\)
Three components:
A random component: the \(y_i\) follow an exponential family distribution with natural parameter \(\theta_i\) and mean \(\mu_i\)
A systematic component: a linear predictor \(\eta_i = x_i^T \beta\)
A link function \(g\) relating the two: \(g(\mu_i) = \eta_i\)
The canonical link is the one that maps the mean to the natural parameter.
Normal: Canonical link is the identity: \(g(x) = x\)
Bernoulli: Canonical link is logit: \(g(x) = \log(x / (1 - x))\)
Poisson: Canonical link is the log: \(g(x) = \log(x)\)
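For instance, writing the Bernoulli(\(\pi\)) density in exponential-family form shows where the logit comes from: \[ f(y_i) = \pi^{y_i}(1 - \pi)^{1 - y_i} = \exp\left( y_i \log \frac{\pi}{1 - \pi} + \log(1 - \pi) \right), \] so the natural parameter is \(\theta_i = \log(\pi / (1 - \pi)) = \text{logit}(\mu_i)\): the canonical link is the logit.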
For generalized linear models with the canonical link function, Fisher scoring and Newton-Raphson are equivalent, because in that case the observed and expected information matrices coincide.
Equivalent to Fisher scoring; gives maximum likelihood estimates for generalized linear models
Find \(\hat \eta_i = x_i^T \hat \beta\), \(i = 1,\ldots, n\)
Find \(\hat \mu_i = g^{-1} (\hat \eta_i)\)
Form the working dependent variable \(z_i\) and the iterative weights \(w_i\) (given below), regress \(z\) on \(X\) with weights \(w\) to get an updated \(\hat \beta\), and repeat until convergence
Normal data, identity link, \(\eta_i = \mu_i\).
Binomial data, \(y_i \sim \text{Bin}(n_i, \pi_i)\) with \(\mu_i = n_i \pi_i\); model \(x_i^T \beta = \eta_i = \text{logit}(\pi_i)\)
Note that we have \(\eta_i = \log(\pi_i / (1 - \pi_i)) = \log(\mu_i / (n_i - \mu_i))\)
For the update formulas we need \(d \eta_i / d \mu_i = 1 / \mu_i + 1 / (n_i - \mu_i)\)
Working dependent variables: \[ z_i = \eta_i + (y_i - \mu_i) \frac{d\eta_i}{d \mu_i} = \eta_i + \frac{y_i - n_i \pi_i}{n_i \pi_i(1 - \pi_i)} \]
Iterative weights: \[ w_i = \left( b''(\theta_i) \left( \frac{d\eta_i}{d\mu_i} \right)^2 \right)^{-1} = n_i \pi_i(1 - \pi_i) \]
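Putting the pieces together, a minimal R sketch of IRLS for this binomial/logit example (the function and argument names are illustrative; X is the design matrix, y the success counts, n the trial counts):

```r
irls_logistic <- function(X, y, n, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))                          # starting value
  for (it in seq_len(max_iter)) {
    eta  <- drop(X %*% beta)                       # eta_i = x_i' beta
    prob <- 1 / (1 + exp(-eta))                    # pi_i = inverse logit of eta_i
    mu   <- n * prob                               # mu_i = n_i pi_i
    w    <- n * prob * (1 - prob)                  # iterative weights
    z    <- eta + (y - mu) / w                     # working dependent variable
    beta_new <- drop(solve(t(X) %*% (w * X),       # weighted least squares:
                           t(X) %*% (w * z)))      #   regress z on X with weights w
    if (sum(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
```

The result should essentially match the coefficients from glm(cbind(y, n - y) ~ X - 1, family = binomial).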
Idea: If you don't move very far in one step, the Hessian shouldn't change that much either.
Instead of recomputing the Hessian at each step, compute an approximate update.
Start with an initial guess at a parameter \(\theta_0\).
Let \(A_0 = d^2 \ell(\theta_0)\).
Set \(\theta_{n+1} = \theta_n - A_n^{-1} d \ell(\theta_n)\)
Set \(A_{n+1} = A_n - c_n v_n v_n^T\)
\(A_n\) are approximations to the Hessian.
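A generic sketch of this scheme in R, with the Hessian update left abstract (hessian_update() stands for whichever rank-1 or rank-2 formula below is used; all names here are illustrative):

```r
quasi_newton <- function(theta0, score, hessian, hessian_update,
                         max_iter = 100, tol = 1e-8) {
  theta <- theta0
  A     <- hessian(theta)                      # A_0 = d^2 l(theta_0), computed once
  for (i in seq_len(max_iter)) {
    step      <- drop(solve(A, score(theta)))  # A_n^{-1} d l(theta_n)
    theta_new <- theta - step
    A <- hessian_update(A,
                        score(theta) - score(theta_new),  # change in the score
                        theta_new - theta)                # step just taken
    if (sum(abs(step)) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}
```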
Idea behind Hessian update: Taylor series again:
\[ d\ell(\theta_n) \approx d\ell(\theta_{n+1}) + d^2 \ell(\theta_{n+1})(\theta_n - \theta_{n+1}) \]
Rearranging: \[ d\ell(\theta_n) - d\ell(\theta_{n+1})\approx d^2 \ell(\theta_{n+1})(\theta_n - \theta_{n+1}) \]
The requirement that the approximation \(A_{n+1}\) to \(d^2 \ell(\theta_{n+1})\) satisfy the equation above is called the secant condition.
Several different ways to make the approximation:
Symmetric rank-1 update is Davidon's method, described in Lange.
Symmetric rank-2 update is BFGS (Broyden–Fletcher–Goldfarb–Shanno).
Davidon's method is a symmetric rank-1 update.
\[ A_{n+1} = A_n - c_n v_n v_n^T \]
where \[ c_n = \frac{1}{(g_n + A_n s_n)^T s_n}, \qquad v_n = g_n + A_n s_n, \] with \(g_n = d\ell(\theta_n) - d\ell(\theta_{n+1})\) and \(s_n = \theta_{n+1} - \theta_n\)
(verify on your own that this satisfies the secant condition)
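One way to check it numerically in R, with arbitrary values standing in for \(A_n\), \(g_n\), and \(s_n\):

```r
set.seed(2)
p <- 3
A <- crossprod(matrix(rnorm(p * p), p))   # a symmetric stand-in for A_n
g <- rnorm(p)                             # g_n = d l(theta_n) - d l(theta_{n+1})
s <- rnorm(p)                             # s_n = theta_{n+1} - theta_n
v  <- g + A %*% s
cn <- 1 / drop(t(v) %*% s)
A_new <- A - cn * v %*% t(v)              # the rank-1 update
max(abs(A_new %*% s + g))                 # secant condition A_{n+1} s_n = -g_n: essentially 0
```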
BFGS is a symmetric rank-2 update.
\[ A_{n+1} = A_n + \alpha u u^T + \beta v v^T \]
with \(u = g_n\), \(v = A_n s_n\), \(\alpha = -1 / (g_n^T s_n)\), \(\beta = -1 / (s_n^T A_n s_n)\)
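Substituting these in, the update can be written as \[ A_{n+1} = A_n - \frac{g_n g_n^T}{g_n^T s_n} - \frac{A_n s_n s_n^T A_n}{s_n^T A_n s_n}, \] which, like the rank-1 update, satisfies the secant condition \(A_{n+1} s_n = -g_n\).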
BFGS is in R's optim.
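A small illustration of using it for maximum likelihood, here for a Poisson rate with simulated data (optim minimizes, so we hand it the negative log likelihood):

```r
set.seed(1)
y <- rpois(100, lambda = 3)
negloglik <- function(log_lambda) {   # parameterize by log(lambda) so the parameter is unconstrained
  -sum(dpois(y, exp(log_lambda), log = TRUE))
}
fit <- optim(par = 0, fn = negloglik, method = "BFGS")
exp(fit$par)                          # close to mean(y), the MLE
```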
Considerations: