Stat 470/670 Lecture 8: Flexible modeling for bivariate data

Julia Fukuyama

Today

Regression review

Task 1: How do we use multiple regression to fit non-linear functions of predictors?

Task 2: How can we use weights in regression to do local fits?

Transformations of the predictor variables: Polynomial regression

Here is one example of using multiple regression to fit a non-linear mean: we use carat plus its second and third powers as predictors.

This gives us the best-fitting third-degree polynomial.

library(tidyverse)
library(broom)
diamonds = diamonds %>% mutate(log_price = log10(price))
lm.poly = lm(
    log_price ~ carat + I(carat^2) + I(carat^3),
    data = diamonds, subset = clarity == "VS1")
ggplot(augment(lm.poly)) +
    geom_point(aes(x = carat, y = log_price), size = .5, alpha = .2) +
    geom_line(aes(x = carat, y = .fitted), color = "blue")
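
As a side note (not from the original slides), poly() builds the same polynomial columns in one step:

# poly(carat, 3, raw = TRUE) expands carat into carat, carat^2, carat^3,
# giving the same fit as writing out the I() terms by hand. Without
# raw = TRUE, poly() uses orthogonal polynomials: same fitted values,
# different coefficients.
lm.poly2 = lm(
    log_price ~ poly(carat, 3, raw = TRUE),
    data = diamonds, subset = clarity == "VS1")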

Transformations of the predictor variables: Non-smooth functions of the predictors

Here we augment our initial predictors with several step functions of carat:

\[ f_a(x) = \begin{cases} 1 & x \ge a\\ 0 & x < a \end{cases} \]

step = function(x, step_position) {
    # indicator: 1 when x >= step_position, 0 otherwise
    return(ifelse(x >= step_position, 1, 0))
}
lm.steps = lm(
    log_price ~
        carat + I(carat^2) + I(carat^3) +
        step(carat, .3) + step(carat, .5) + step(carat, 1) +
        step(carat, 1.5) + step(carat, 2),
    data = diamonds, subset = clarity == "VS1")
ggplot(augment(lm.steps)) +
    geom_point(aes(x = carat, y = log_price), size = .5, alpha = .2) +
    geom_line(aes(x = carat, y = .fitted), color = "red")

Transformations of the predictor variables: Dummy variables for factors

The dummy, or indicator, function \(\mathbf I_a\) can be used to transform factor variables into numeric predictors:

\[ \mathbf I_a(x) = \begin{cases} 1 & x = a \\ 0 & x \ne a \end{cases} \]

This is what R is doing behind the scenes when you use a factor variable in a linear model.

We can inspect the transformation it uses with model.matrix:

head(model.matrix(~ 0 + color, data = diamonds))
##   colorD colorE colorF colorG colorH colorI colorJ
## 1      0      1      0      0      0      0      0
## 2      0      1      0      0      0      0      0
## 3      0      1      0      0      0      0      0
## 4      0      0      0      0      0      1      0
## 5      0      0      0      0      0      0      1
## 6      0      0      0      0      0      0      1
head(diamonds$color)
## [1] E E E I J J
## Levels: D < E < F < G < H < I < J
lm(price ~ 0 + color, data = diamonds)
## 
## Call:
## lm(formula = price ~ 0 + color, data = diamonds)
## 
## Coefficients:
## colorD  colorE  colorF  colorG  colorH  colorI  colorJ  
##   3170    3077    3725    3999    4487    5092    5324
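
Because the model contains only the indicators and no intercept, each coefficient is simply the mean price within its color group. As a quick sanity check (not on the original slides):

# the coefficients above should match the per-group means of price
diamonds %>% group_by(color) %>% summarise(mean_price = mean(price))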

Generalizations of the regression problem: Weighted least squares

If we have more faith in some points than others, or if we simply want to exclude some points, we can perform weighted linear regression.

Let \(w_i\), \(i = 1,\ldots, n\) be non-negative values.

In weighted regression, we find \(\beta_0, \ldots, \beta_p\) to minimize

\[ \sum_{i=1}^n w_i (y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}))^2 \]

Or, in matrix notation: \[ \|\mathbf W^{1/2} (\mathbf y - \mathbf X \boldsymbol \beta)\|_2^2 \] where \(\mathbf W^{1/2}\) is an \(n \times n\) diagonal matrix with \(w_i^{1/2}\) as the \(i\)th diagonal element.

Properties:

If all the weights are equal and positive, weighted least squares reduces to ordinary least squares.

Setting \(w_i = 0\) excludes point \(i\) from the fit entirely.

The larger \(w_i\) is relative to the other weights, the more the fitted function is pulled toward point \(i\).

A small example with lm() is sketched below.
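
In R, lm() takes the weights through its weights argument. A minimal sketch (the cutoff at 2 carats is just for illustration, not from the original slides):

# fit log_price ~ carat on the VS1 diamonds, excluding diamonds above
# 2 carats by giving them weight 0
lm.wls = lm(
    log_price ~ carat,
    data = diamonds, subset = clarity == "VS1",
    weights = ifelse(carat <= 2, 1, 0))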

References

If you feel like you need to brush up on this, a good reference is Weisberg, Applied Linear Regression.

Smoothing

Reading: Cleveland pp. 91-110

Why do we want to smooth?

A smoother estimates how \(y\) varies with \(x\) without committing to a parametric form, which lets us see structure in a scatterplot that the raw points can obscure.

LOESS

LOESS, or local regression, builds on standard regression. The setup is: we observe pairs \((x_i, y_i)\), \(i = 1, \ldots, n\), and we model \(y_i = f(x_i) + \varepsilon_i\), where \(f\) is a smooth mean function and the \(\varepsilon_i\) are errors.

The idea is that since the mean function is smooth, it can be approximated with a linear or low-order polynomial function in small regions.

LOESS: details

The way we transform this intuition into a concrete procedure is to use weighted least squares.

LOESS has two parameters: \(\alpha\), the span, and \(\lambda\), the degree of the local polynomial.

To find the value of the LOESS smoother at a point \(x_0\), we first define weights for all of the samples: \[ w_i(x_0) = T(\Delta_i(x_0) / \Delta_{(q)}(x_0)) \] where \(\Delta_i(x_0) = |x_i - x_0|\), \(\Delta_{(q)}(x_0)\) is the \(q\)th smallest of these distances, and \(q = \alpha n\), rounded to the nearest integer.

\(T\) is the tricube weight function (invented by Tukey!): \[ T(u) = \begin{cases} (1 - |u|^3)^3 & |u| \le 1 \\ 0 & |u| > 1 \end{cases} \]
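
A quick picture of the weight shape (not on the original slides):

# tricube weights fall smoothly from 1 at u = 0 to 0 at |u| = 1
tricube = function(u) ifelse(abs(u) <= 1, (1 - abs(u)^3)^3, 0)
ggplot(data.frame(u = seq(-1.5, 1.5, length.out = 200)), aes(x = u)) +
    geom_line(aes(y = tricube(u)))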

These weights are then used in a local regression.

If \(\lambda = 1\), we find \(\hat \beta_0\), \(\hat \beta_1\) to minimize the weighted least squares criterion, \[ \sum_{i=1}^n w_i(x_0) (y_i - (\beta_0 + \beta_1 x_i))^2, \]

and the fitted value for the LOESS smoother at \(x_0\) is \(\hat \beta_0 + \hat \beta_1 x_0\).

If \(\lambda = 2\), we use quadratic regression, i.e., we find \(\hat \beta_0\), \(\hat \beta_1\), \(\hat \beta_2\) to minimize the weighted least squares criterion, \[ \sum_{i=1}^n w_i(x_0) (y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2))^2, \]

and the fitted value for the LOESS smoother at \(x_0\) is \(\hat \beta_0 + \hat \beta_1 x_0 + \hat \beta_2 x_0^2\).

The analogous procedure works for any integer value of \(\lambda\).

The procedure described above gives the fitted value of the smoother at a single point; the weight and coefficient computations must be repeated for every point at which we want to evaluate the smoother.
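
To make the procedure concrete, here is a sketch of the computation at a single point for \(\lambda = 1\); the function name and defaults here are ours, not part of any package:

# evaluate a locally-linear (lambda = 1) LOESS smoother at x0
loess_at_point = function(x, y, x0, alpha = .75) {
    q = round(alpha * length(x))
    delta = abs(x - x0)                   # distances from x0
    u = delta / sort(delta)[q]            # scale by the qth smallest distance
    w = ifelse(u <= 1, (1 - u^3)^3, 0)    # tricube weights
    fit = lm(y ~ x, weights = w)          # local weighted linear fit
    predict(fit, newdata = data.frame(x = x0))
}

Calling this on a grid of \(x_0\) values traces out the whole smoother; the loess() function used below does this (plus several refinements) for us.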

LOESS in R

We’ll look at the economics dataset (in ggplot2).

It looks like this:

economics
## # A tibble: 574 × 7
##    date         pce    pop psavert uempmed unemploy date_numeric
##    <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>        <dbl>
##  1 1967-07-01  507. 198712    12.6     4.5     2944         -915
##  2 1967-08-01  510. 198911    12.6     4.7     2945         -884
##  3 1967-09-01  516. 199113    11.9     4.6     2958         -853
##  4 1967-10-01  512. 199311    12.9     4.9     3143         -823
##  5 1967-11-01  517. 199498    12.8     4.7     3066         -792
##  6 1967-12-01  525. 199657    11.8     4.8     3018         -762
##  7 1968-01-01  531. 199808    11.7     5.1     2878         -731
##  8 1968-02-01  534. 199920    12.3     4.5     3001         -700
##  9 1968-03-01  544. 200056    11.7     4.1     2877         -671
## 10 1968-04-01  544  200208    12.3     4.6     2709         -640
## # … with 564 more rows

Let’s look at how psavert changes over time using a scatterplot:

ggplot(economics) + geom_point(aes(x = date, y = psavert))

We’ll try smoothing here, but first note two tricky things about this particular example: loess() needs a numeric predictor, so we convert date (a Date) to a number of days with as.numeric(); and augment() on a loess fit doesn’t carry the original data, so we pass data = economics to get date back for plotting.

economics = economics %>% mutate(date_numeric = as.numeric(date))
l.out = loess(psavert ~ date_numeric, data = economics)
ggplot(augment(l.out, data = economics), aes(x = date, y = .fitted)) +
    geom_point(aes(y = psavert)) + geom_line(color = "red")
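
The span and degree arguments of loess() correspond to \(\alpha\) and \(\lambda\) above, with defaults span = 0.75 and degree = 2. A smaller span follows the data more closely; for example:

# a wigglier fit: smaller span, locally linear
l.out.wiggly = loess(psavert ~ date_numeric, data = economics,
    span = .2, degree = 1)
ggplot(augment(l.out.wiggly, data = economics), aes(x = date, y = .fitted)) +
    geom_point(aes(y = psavert)) + geom_line(color = "blue")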