Newton's method variations

Agenda today

Reading:

From last time

Newton's method:

Relationship with statistical quantities

Potential problems:

Fisher Scoring

Idea: Use the expected information, \(J(\theta)= E[-d^2 \ell(\theta)]\), instead of the observed information, \(-d^2 \ell(\theta)\).

Algorithm:
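Presumably the step is Newton's update with \(J\) in place of the observed information: \[ \theta_{n+1} = \theta_n + J(\theta_n)^{-1} \, d\ell(\theta_n) \]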

\(J(\theta)\) often coincides with \(-d^2 \ell(\theta)\) (for example, for exponential families in their natural parameterization), in which case Fisher scoring is exactly the same as Newton's method.

Sometimes \(J(\theta)\) is easier to compute than \(-d^2 \ell(\theta)\).
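As a concrete illustration (not from the notes), here is a minimal R sketch of Fisher scoring for the location of a Cauchy sample with unit scale, a case where the expected information \(J(\theta) = n/2\) is much simpler than the observed information. The data, starting value, and tolerance are made up.

```r
## Fisher scoring for the location of a Cauchy(location = theta, scale = 1) sample.
## Score: sum of 2(y - theta) / (1 + (y - theta)^2); expected information: n / 2.
set.seed(1)
y <- rcauchy(100, location = 3)

score         <- function(theta) sum(2 * (y - theta) / (1 + (y - theta)^2))
expected_info <- function(theta) length(y) / 2

theta <- median(y)                 # reasonable starting value
for (iter in 1:50) {
  step  <- score(theta) / expected_info(theta)
  theta <- theta + step
  if (abs(step) < 1e-8) break      # stop when the update is negligible
}
theta                              # approximate MLE of the location
```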

Exponential families and generalized linear models

Exponential families are families of probability distributions whose densities take the form \[ f(y_i) = \exp(y_i \theta_i - b(\theta_i) + c(y_i, \phi)) \]

\(b\) and \(c\) are known functions that describe the family.

\(\theta_i\) is the natural parameter.

Properties that we'll need later:
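For reference, the two standard identities (presumably the properties meant here) relate the mean and variance of \(y_i\) to derivatives of \(b\): \[ E[y_i] = b'(\theta_i) = \mu_i, \qquad \text{Var}(y_i) = b''(\theta_i) \] These are what give the working response and weights below their simple form.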

Why do we like them?

Generalized linear models

Models for independent observations, \(y_i, i = 1,\ldots, n\)

Three components:
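(In the standard formulation, these are: a random component, the \(y_i\) drawn independently from an exponential family; a linear predictor \(\eta_i = x_i^T \beta\); and a link function \(g\) relating the two through \(\eta_i = g(\mu_i)\), where \(\mu_i = E[y_i]\).)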

Iteratively Reweighted Least Squares

Equivalent to Fisher scoring, gives maximum likelihood estimates for generalized linear models
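Written out (in the standard form; the ingredients \(z_i\) and \(w_i\) are given below), each iteration is a weighted least squares fit: \[ \beta^{(t+1)} = (X^T W X)^{-1} X^T W z \] where \(W = \text{diag}(w_i)\) and \(z_i = \eta_i + (y_i - \mu_i) \frac{d\eta_i}{d\mu_i}\), both evaluated at \(\beta^{(t)}\).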

Example: Linear regression

Normal data, identity link, \(\eta_i = \mu_i\).
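Here \(d\eta_i / d\mu_i = 1\), so the working response is \(z_i = y_i\) and the weights \(w_i\) are constant; the weighted least squares step reduces to ordinary least squares and IRLS converges in one iteration.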

Example: logistic regression

Model \(x_i^T \beta = \eta_i = \text{logit}(\pi_i)\)

Note that we have \(\eta_i = \log(\pi_i / (1 - \pi_i)) = \log(\mu_i / (n_i - \mu_i))\)

For the update formulas we need \(d \eta_i / d \mu_i = 1 / \mu_i + 1 / (n_i - \mu_i) = 1 / (n_i \pi_i (1 - \pi_i))\)

Working dependent variables: \[ z_i = \eta_i + (y_i - \mu_i) \frac{d\eta_i}{d \mu_i}\\ = \eta_i + \frac{y_i - n_i \pi_i}{n_i \pi_i(1 - \pi_i)} \]

Iterative weights: \[ w_i = \left(b''(\theta_i) \left(\frac{d\eta_i}{d\mu_i}\right)^2\right)^{-1} \\ = n_i \pi_i(1 - \pi_i) \]
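A minimal R sketch of the IRLS loop for the \(n_i = 1\) (binary) case, using the working response and weights above. The simulated data, names, and tolerance are illustrative; the result can be checked against glm(..., family = binomial).

```r
## IRLS for logistic regression with n_i = 1, using z_i and w_i from above.
set.seed(2)
n <- 200
X <- cbind(1, rnorm(n))                       # design matrix with intercept
beta_true <- c(-0.5, 1.2)
y <- rbinom(n, size = 1, prob = plogis(X %*% beta_true))

beta <- rep(0, ncol(X))                       # starting value
for (iter in 1:25) {
  eta <- drop(X %*% beta)                     # linear predictor
  mu  <- plogis(eta)                          # fitted probabilities pi_i
  w   <- mu * (1 - mu)                        # iterative weights n_i pi_i (1 - pi_i)
  z   <- eta + (y - mu) / w                   # working dependent variable
  beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))   # weighted least squares
  converged <- max(abs(beta_new - beta)) < 1e-8
  beta <- beta_new
  if (converged) break
}
drop(beta)                                    # IRLS estimates
coef(glm(y ~ X[, 2], family = binomial))      # should agree closely
```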

Quasi-Newton Methods

Idea: If you don't move very far in one step, the Hessian shouldn't change that much either.

Instead of recomputing the Hessian at each step, compute an approximate update.

\(A_n\) are approximations to the Hessian.

Idea behind Hessian update: Taylor series again:

\[ d\ell(\theta_n) \approx d\ell(\theta_{n+1}) + d^2 \ell(\theta_{n+1})(\theta_n - \theta_{n+1}) \]

Rearranging: \[ d\ell(\theta_n) - d\ell(\theta_{n+1})\approx d^2 \ell(\theta_{n+1})(\theta_n - \theta_{n+1}) \]

Requiring the approximation \(A_{n+1}\) to \(d^2 \ell(\theta_{n+1})\) to satisfy the equation above, \(A_{n+1}(\theta_n - \theta_{n+1}) = d\ell(\theta_n) - d\ell(\theta_{n+1})\), is called the secant condition.

Several different ways to make the approximation:

Davidon's method is a symmetric rank-1 update.

\[ A_{n+1} = A_n - c_n v_n v_n^T \]

where \[ c_n = \frac{1}{(g_n + A_n s_n)^T s_n} \\ v_n = g_n + A_n s_n \] with \(s_n = \theta_{n+1} - \theta_n\) the step and \(g_n = d\ell(\theta_n) - d\ell(\theta_{n+1})\) the corresponding change in the gradient.

(verify on your own that this satisfies the secant condition)
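For reference, a direct R transcription of this rank-1 update (the function and argument names are mine; g and s are \(g_n\) and \(s_n\) as above):

```r
## Davidon / symmetric rank-1 update of a Hessian approximation A.
## g: change in gradient, dl(theta_n) - dl(theta_{n+1}); s: step, theta_{n+1} - theta_n.
rank1_update <- function(A, g, s) {
  v   <- g + A %*% s
  c_n <- 1 / drop(t(v) %*% s)
  A - c_n * (v %*% t(v))
}
```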

BFGS is a symmetric rank-2 update.

\[ A_{n+1} = A_n + \alpha u u^T + \beta v v^T \]

where \(u = g_n\), \(v = A_n s_n\), \(\alpha = -1 / (g_n^T s_n)\), and \(\beta = - 1 / (s_n^T A_n s_n)\), with \(g_n\) and \(s_n\) as in the rank-1 update above.

BFGS is available in R's optim (method = "BFGS").
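For example, a small sketch using optim to maximize the Cauchy location log-likelihood from the earlier Fisher scoring illustration (optim minimizes by default, so we pass the negative log-likelihood and its gradient):

```r
## Quasi-Newton (BFGS) via optim on the illustrative Cauchy location problem.
set.seed(1)
y <- rcauchy(100, location = 3)

negloglik <- function(theta) sum(log(1 + (y - theta)^2))              # up to a constant
neggrad   <- function(theta) -sum(2 * (y - theta) / (1 + (y - theta)^2))

fit <- optim(par = median(y), fn = negloglik, gr = neggrad, method = "BFGS")
fit$par                                                               # estimated location
```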

Considerations: