Agenda today
Variations on Newton's method
Iteratively reweighted least squares
BFGS, other Hessian approximations
Reading:
Newton's method:
Relationship with statistical quantities
The derivative needed in Newton's method is the score, \(d\ell(\theta)\)
The expected value of the negative Hessian used in Newton's method is the expected information
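Putting these together, the Newton update for maximizing the log likelihood \(\ell\) is \[ \theta_{n+1} = \theta_n - d^2 \ell(\theta_n)^{-1} d\ell(\theta_n) = \theta_n + [-d^2 \ell(\theta_n)]^{-1} d\ell(\theta_n), \] i.e., the score premultiplied by the inverse of the observed information.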
Potential problems: the observed information \(-d^2 \ell(\theta)\) can be expensive to compute, and far from the maximum it may fail to be positive definite
Idea: Use the expected information, \(J(\theta) = E[-d^2 \ell(\theta)]\), instead of the observed information, \(-d^2 \ell(\theta)\).
Algorithm:
Pick a starting parameter estimate \(\theta_0\)
Set \(\theta_{n+1} = \theta_n + J(\theta_n)^{-1} d\ell(\theta_n)\)
\(J(\theta)\) often coincides with \(-d^2 \ell(\theta)\), in which case Fisher Scoring is exactly the same as Newton's method.
Sometimes \(J(\theta)\) is easier to compute than \(-d^2 \ell(\theta)\).
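As a concrete reference, a minimal R sketch of this iteration, assuming user-supplied functions score(theta) returning \(d\ell(\theta)\) and expected_info(theta) returning \(J(\theta)\) (the names are illustrative, not from any package):

```r
# Fisher scoring: theta_{n+1} = theta_n + J(theta_n)^{-1} d l(theta_n)
fisher_scoring <- function(theta0, score, expected_info,
                           max_iter = 100, tol = 1e-8) {
  theta <- theta0
  for (i in seq_len(max_iter)) {
    step  <- drop(solve(expected_info(theta), score(theta)))  # J(theta)^{-1} * score
    theta <- theta + step
    if (sum(abs(step)) < tol) break                           # stop when the step is tiny
  }
  theta
}
```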
Exponential families are families of probability distributions whose densities take the form \[ f(y_i) = \exp(y_i \theta_i - b(\theta_i) + c(y_i, \phi)) \]
\(b\) and \(c\) are known functions that describe the family.
\(\theta_i\) is the natural parameter.
Properties that we'll need later:
\(E(y_i) = \mu_i = b'(\theta_i)\)
\(\text{var}(y_i) = \sigma_i^2 = b''(\theta_i)\)
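As a quick check of these properties, the Poisson(\(\lambda\)) density fits this form: \[ f(y_i) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = \exp(y_i \log \lambda - \lambda - \log y_i!), \] so \(\theta_i = \log \lambda\), \(b(\theta_i) = e^{\theta_i}\), \(c(y_i, \phi) = -\log y_i!\), and \(b'(\theta_i) = b''(\theta_i) = \lambda\), matching the Poisson mean and variance.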
Why do we like them?
Many commonly-used distributions: normal, exponential, Poisson, binomial, multinomial, etc.
Easy to analyze
\(J(\theta)\) available analytically
They describe the stochasticity in generalized linear models
Models for independent observations, \(y_i, i = 1,\ldots, n\)
Three components:
A random component: the \(y_i\) follow an exponential family distribution with natural parameter \(\theta_i\) and mean \(\mu_i\)
A systematic component: a linear predictor \(\eta_i = x_i^T \beta\)
A link function \(g\) relating the two: \(g(\mu_i) = \eta_i\)
The canonical link is the one that maps the mean to the natural parameter.
Normal: Canonical link is the identity: \(g(x) = x\)
Bernoulli: Canonical link is logit: \(g(x) = \log(x / (1 - x))\)
Poisson: Canonical link is the log: \(g(x) = \log(x)\)
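For instance, writing the Bernoulli(\(\pi\)) density in exponential-family form shows where the logit comes from: \[ f(y_i) = \pi^{y_i}(1 - \pi)^{1 - y_i} = \exp\left( y_i \log \frac{\pi}{1 - \pi} + \log(1 - \pi) \right), \] so the natural parameter is \(\theta_i = \log(\pi / (1 - \pi)) = \text{logit}(\mu_i)\): the canonical link is the logit.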
For generalized linear models with the canonical link function, Fisher scoring and Newton-Raphson are equivalent, because in that case the observed and expected information matrices coincide.
Equivalent to Fisher scoring; gives maximum likelihood estimates for generalized linear models
Find \(\hat \eta_i = x_i^T \hat \beta\), \(i = 1,\ldots, n\)
Find \(\hat \mu_i = g^{-1} (\hat \eta_i)\)
Form the working dependent variable \(z_i\) and the iterative weights \(w_i\) (given below), regress \(z\) on \(X\) with weights \(w\) to get an updated \(\hat \beta\), and repeat until convergence
Normal data, identity link, \(\eta_i = \mu_i\).
Binomial data, \(y_i \sim \text{Bin}(n_i, \pi_i)\) with \(\mu_i = n_i \pi_i\); model \(x_i^T \beta = \eta_i = \text{logit}(\pi_i)\)
Note that we have \(\eta_i = \log(\pi_i / (1 - \pi_i)) = \log(\mu_i / (n_i - \mu_i))\)
For the update formulas we need \(d \eta_i / d \mu_i = 1 / \mu_i + 1 / (n_i - \mu_i)\)
Working dependent variables: \[ z_i = \eta_i + (y_i - \mu_i) \frac{d\eta_i}{d \mu_i} = \eta_i + \frac{y_i - n_i \pi_i}{n_i \pi_i(1 - \pi_i)} \]
Iterative weights: \[ w_i = \left( b''(\theta_i) \left( \frac{d\eta_i}{d\mu_i} \right)^2 \right)^{-1} = n_i \pi_i(1 - \pi_i) \]
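Putting the pieces together, a minimal R sketch of IRLS for this binomial/logit example (the function and argument names are illustrative; X is the design matrix, y the success counts, n the trial counts):

```r
irls_logistic <- function(X, y, n, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))                          # starting value
  for (it in seq_len(max_iter)) {
    eta  <- drop(X %*% beta)                       # eta_i = x_i' beta
    prob <- 1 / (1 + exp(-eta))                    # pi_i = inverse logit of eta_i
    mu   <- n * prob                               # mu_i = n_i pi_i
    w    <- n * prob * (1 - prob)                  # iterative weights
    z    <- eta + (y - mu) / w                     # working dependent variable
    beta_new <- drop(solve(t(X) %*% (w * X),       # weighted least squares:
                           t(X) %*% (w * z)))      #   regress z on X with weights w
    if (sum(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
```

The result should essentially match the coefficients from glm(cbind(y, n - y) ~ X - 1, family = binomial).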
Idea: If you don't move very far in one step, the Hessian shouldn't change that much either.
Instead of recomputing the Hessian at each step, compute an approximate update.
Start with an initial guess at a parameter \(\theta_0\).
Let \(A_0 = d^2 \ell(\theta_0)\).
Set \(\theta_{n+1} = \theta_n - A_n^{-1} d \ell(\theta_n)\)
Set \(A_{n+1} = A_n - c_n v_n v_n^T\)
\(A_n\) are approximations to the Hessian.
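A generic sketch of this scheme in R, with the Hessian update left abstract (hessian_update() stands for whichever rank-1 or rank-2 formula below is used; all names here are illustrative):

```r
quasi_newton <- function(theta0, score, hessian, hessian_update,
                         max_iter = 100, tol = 1e-8) {
  theta <- theta0
  A     <- hessian(theta)                      # A_0 = d^2 l(theta_0), computed once
  for (i in seq_len(max_iter)) {
    step      <- drop(solve(A, score(theta)))  # A_n^{-1} d l(theta_n)
    theta_new <- theta - step
    A <- hessian_update(A,
                        score(theta) - score(theta_new),  # change in the score
                        theta_new - theta)                # step just taken
    if (sum(abs(step)) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}
```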
Idea behind Hessian update: Taylor series again:
\[ d\ell(\theta_n) \approx d\ell(\theta_{n+1}) + d^2 \ell(\theta_{n+1})(\theta_n - \theta_{n+1}) \]
Rearranging: \[ d\ell(\theta_n) - d\ell(\theta_{n+1})\approx d^2 \ell(\theta_{n+1})(\theta_n - \theta_{n+1}) \]
The requirement that the approximation \(A_{n+1}\) to \(d^2 \ell(\theta_{n+1})\) satisfy the equation above is called the secant condition.
Several different ways to make the approximation:
Symmetric rank-1 update is Davidon's method, described in Lange.
Symmetric rank-2 update is BFGS (Broyden–Fletcher–Goldfarb–Shanno).
Davidon's method is a symmetric rank-1 update.
\[ A_{n+1} = A_n - c_n v_n v_n^T \]
where \[ c_n = \frac{1}{(g_n + A_n s_n)^T s_n}, \qquad v_n = g_n + A_n s_n, \] with \(g_n = d\ell(\theta_n) - d\ell(\theta_{n+1})\) and \(s_n = \theta_{n+1} - \theta_n\)
(verify on your own that this satisfies the secant condition)
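One way to check it numerically in R, with arbitrary values standing in for \(A_n\), \(g_n\), and \(s_n\):

```r
set.seed(2)
p <- 3
A <- crossprod(matrix(rnorm(p * p), p))   # a symmetric stand-in for A_n
g <- rnorm(p)                             # g_n = d l(theta_n) - d l(theta_{n+1})
s <- rnorm(p)                             # s_n = theta_{n+1} - theta_n
v  <- g + A %*% s
cn <- 1 / drop(t(v) %*% s)
A_new <- A - cn * v %*% t(v)              # the rank-1 update
max(abs(A_new %*% s + g))                 # secant condition A_{n+1} s_n = -g_n: essentially 0
```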
BFGS is a symmetric rank-2 update.
\[ A_{n+1} = A_n + \alpha u u^T + \beta v v^T \]
with \(u = g_n\), \(v = A_n s_n\), \(\alpha = -1 / (g_n^T s_n)\), \(\beta = -1 / (s_n^T A_n s_n)\)
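Substituting these in, the update can be written as \[ A_{n+1} = A_n - \frac{g_n g_n^T}{g_n^T s_n} - \frac{A_n s_n s_n^T A_n}{s_n^T A_n s_n}, \] which, like the rank-1 update, satisfies the secant condition \(A_{n+1} s_n = -g_n\).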
BFGS is in R's optim.
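A small illustration of using it for maximum likelihood, here for a Poisson rate with simulated data (optim minimizes, so we hand it the negative log likelihood):

```r
set.seed(1)
y <- rpois(100, lambda = 3)
negloglik <- function(log_lambda) {   # parameterize by log(lambda) so the parameter is unconstrained
  -sum(dpois(y, exp(log_lambda), log = TRUE))
}
fit <- optim(par = 0, fn = negloglik, method = "BFGS")
exp(fit$par)                          # close to mean(y), the MLE
```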
Considerations: