Stat 470/670 Lecture 26: Linear Discriminant Analysis

Julia Fukuyama

Today: Linear discriminant analysis

Problem Setup

Potential questions: Is there a relationship between the response and the predictors? If so, what is that relationship? How well can the predictor variables explain the response? Which predictors are most important?

LDA: Probabilistic Interpretation

Reading: Section 4.3 in ESL

LDA is based on a simple probabilistic model, where we assume that within each group, the predictor variables come from a multivariate normal distribution with a group-specific mean and a covariance matrix that is the same for each group.

In math: Let \(\mathbf x_i \in \mathbb R^p\) be the values of the predictor variables for case \(i\), and let \(y_i \in \{1,2,\ldots, K\}\) describe the group membership for case \(i\). Then our model is \[ \mathbf x_i \sim \mathcal N(\boldsymbol \mu_{y_i}, \boldsymbol \Sigma) \]

In pictures: [figure illustrating the model omitted]

If we knew the group means \(\boldsymbol \mu_1, \ldots, \boldsymbol \mu_K\) and \(\boldsymbol \Sigma\), we could use Bayes’ rule to obtain \(P(y_i \mid \mathbf x_i)\).
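Concretely, if \(\pi_k\) denotes the prior probability of group \(k\) (in practice, estimated by the observed group proportions) and \(\phi(\cdot \,; \boldsymbol \mu, \boldsymbol \Sigma)\) denotes the multivariate normal density, Bayes’ rule gives \[ P(y_i = k \mid \mathbf x_i) = \frac{\pi_k \, \phi(\mathbf x_i ; \boldsymbol \mu_k, \boldsymbol \Sigma)}{\sum_{l=1}^{K} \pi_l \, \phi(\mathbf x_i ; \boldsymbol \mu_l, \boldsymbol \Sigma)}. \]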

However, we don’t know either, and so we estimate them from the data and use the estimates to predict which group each case belongs to.
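This estimate-then-classify procedure is what off-the-shelf LDA implementations carry out. A minimal sketch, assuming scikit-learn is available and using the iris data as a stand-in for real predictors and group labels:

```python
# A minimal sketch: fit LDA and classify with plug-in parameter estimates.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# fit() estimates the group means and the pooled covariance matrix;
# predict_proba() then applies Bayes' rule with the estimates plugged in.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

print(lda.means_)                # estimated group means, one row per group
print(lda.predict(X[:5]))        # most probable group for the first five cases
print(lda.predict_proba(X[:5]))  # estimated P(y_i | x_i)
```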

This procedure turns out to give linear decision boundaries, hence the name linear discriminant analysis.
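To see why, compare the log posterior odds of two groups \(k\) and \(l\): because the groups share the covariance \(\boldsymbol \Sigma\), the quadratic terms in \(\mathbf x_i\) cancel, leaving \[ \log \frac{P(y_i = k \mid \mathbf x_i)}{P(y_i = l \mid \mathbf x_i)} = \log \frac{\pi_k}{\pi_l} - \frac{1}{2} (\boldsymbol \mu_k + \boldsymbol \mu_l)^T \boldsymbol \Sigma^{-1} (\boldsymbol \mu_k - \boldsymbol \mu_l) + \mathbf x_i^T \boldsymbol \Sigma^{-1} (\boldsymbol \mu_k - \boldsymbol \mu_l), \] which is linear in \(\mathbf x_i\), so the set of points where two groups are equally probable is a hyperplane.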

LDA: Projection Interpretation

Reading: ESL 4.3.3

The probabilistic interpretation of LDA is nice, but what makes it particularly useful is that it also gives us an informative low-dimensional projection of the data.

In PCA, we projected the data so as to maximize the variance of the projection, but we noticed that this projection won’t necessarily be informative about the groups we are interested in.

In LDA, if we have \(K\) groups, we can project the data onto a \((K-1)\)-dimensional space (the one spanned by the estimated group means) and retain all the information necessary for classification under the model defined above.
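As a quick illustration, here is a minimal sketch of the projection, again assuming scikit-learn and using the iris data, which has \(K = 3\) groups and \(p = 4\) predictors:

```python
# A minimal sketch: project the data onto the K - 1 discriminant dimensions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA allows at most K - 1 = 2 discriminant components here.
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)

print(X_proj.shape)  # (150, 2): four predictors reduced to K - 1 dimensions
```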

Fisher’s linear discriminant

Fisher posed a different problem and ended up with the same solution. He was interested in projecting the predictors in such a way as to maximize the ratio of the between-class variance to the within-class variance of the projection.
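In symbols: Fisher looked for the projection direction \(\mathbf a \in \mathbb R^p\) maximizing \[ \frac{\mathbf a^T \mathbf B \mathbf a}{\mathbf a^T \mathbf W \mathbf a}, \] where \(\mathbf B\) is the between-class covariance matrix (the covariance of the group means) and \(\mathbf W\) is the pooled within-class covariance matrix.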

If you want more details, you can read about them in Section 4.3.3 of ESL, but the main point to take away is that this derivation of LDA makes no reference to a probabilistic model or to Gaussian distributions. Therefore, we don’t need to assume anything about the distribution of our predictors for LDA to be a reasonable tool for predicting group membership.

LDA biplots

Reading: Greenacre Chapter 11

Since LDA gives a projection of the data onto a lower-dimensional space, LDA biplots work the same way as PCA biplots: the cases are plotted at their coordinates in the discriminant space, and the predictor variables are drawn as arrows indicating how each variable contributes to the discriminant directions.
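A minimal sketch of such a biplot, assuming matplotlib and scikit-learn and using the iris data; the factor of 2 on the arrows is an arbitrary rescaling chosen only for visibility:

```python
# A minimal sketch: an LDA biplot with cases as points and variables as arrows.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)  # cases in discriminant coordinates

fig, ax = plt.subplots()
ax.scatter(X_proj[:, 0], X_proj[:, 1], c=y, alpha=0.5)

# lda.scalings_ holds the discriminant directions: one row per predictor,
# one column per discriminant axis, so each row gives one variable's arrow.
for name, (a, b) in zip(iris.feature_names, lda.scalings_[:, :2]):
    ax.arrow(0, 0, 2 * a, 2 * b, color="red", head_width=0.1)
    ax.annotate(name, (2.2 * a, 2.2 * b), color="red")

ax.set_xlabel("LD1")
ax.set_ylabel("LD2")
plt.show()
```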

LDA: Practical considerations

LDA will fail or not work well if the number of predictors is large relative to the number of cases or the predictors are highly collinear (the pooled covariance estimate is then singular or unstable), or if the groups have very different covariance matrices, so that a single shared \(\boldsymbol \Sigma\) describes the data poorly.