Monte Carlo methods: integration and cross validation

Agenda today

Reading: Lange 21.1, 21.2

Monte Carlo Integration

We have:

We would like to approximate \[ E[f(X)] = \int f(x) \, d\mu(x) \]

Why not numerical integration?

Monte Carlo Integration

To estimate/approximate \[ E[f(X)] = \int f(x) \, d\mu(x) \]

Why is this reasonable?

Monte Carlo Integration: Accuracy

If \(f\) is square integrable, we can apply the central limit theorem, which tells us that the estimator is approximately distributed \[ \mathcal N \left( E[f(X)], \sqrt{\frac{\text{Var}[f(X)]}{n}} \right) \]

Importance Sampling

Importance sampling is based on the following equality: \[ \int f(x) g(x)\, d \nu (x) = \int \frac{f(x) g(x)}{h(x)} h(x) \,d \nu(x) \] where

Interpretation:

When is this useful?

The importance sampling estimator has a smaller variance than the naive estimator iff: \[ \int \left[ \frac{f(x) g(x)}{h(x)} \right]^2 h(x) d\nu(x) \le \int f(x)^2 g(x) d\nu(x) \]

A contrived example

We are playing a terrible game:

What is your expected return to playing this game?

Naive Monte Carlo estimate:

set.seed(1)
f = function(x) {
    if(x < .01)
        return(-100)
    return(0)
}
mc_samples = runif(1000)
mean(sapply(mc_samples, f))
## [1] -0.7

Recall Beta distributions: Supported on the interval \([0,1]\), can tune so that they put more weight in the middle or at the edges.

Beta(1,10) has density:

x = seq(0, 1, length.out = 200)
beta_density = dbeta(x, shape1 = 1, shape2 = 10)
plot(beta_density ~ x, type = 'l')

Importance sampling from Beta(1,10):

mc_importance_samples = rbeta(1000, shape1 = 1, shape2 = 10)
importance_function = function(x) {
    return(f(x) / dbeta(x, shape1 = 1, shape2 = 10))
}
mean(sapply(mc_importance_samples, importance_function))
## [1] -1.044691

More realistic examples

Part 2: Cross validation

We have:

We would like to choose one model from the set of candidate models indexed by \(\theta\).

Example: Regression

We want to choose a subset of the predictors that do the best job of explaining the response.

Naive solution: Find the model that has the lowest value for the squared-error loss.

Why doesn't this work?

Example: Mixture models

We want to choose the number of mixture components that best explains the data.

Naive solution: Choose the number of mixture components that minimizes the negative log likelihood of the data.

Better Solution: Cross validation

Idea: Instead of measuring model quality on the same data we used to fit the model, we estimate model quality on new data.

If we knew the true distribution of the data, we could simulate new data and use a Monte Carlo estimate based on the simulations.

We can't actually get new data, and so we hold some back when we fit the model and then pretend that the held back data is new data.

Procedure:

Example

n = 100
p = 20
X = matrix(rnorm(n * p), nrow = n)
y = rnorm(n)
get_rss_submodels = function(n_predictors, y, X) {
    if(n_predictors == 0) {
        lm_submodel = lm(y ~ 0)
    } else {
        lm_submodel = lm(y ~ 0 + X[,1:n_predictors, drop = FALSE])
    }
    return(sum(residuals(lm_submodel)^2))
}
p_vec = 0:p
rss = sapply(p_vec, get_rss_submodels, y, X)
plot(rss ~ p_vec)
get_cv_error = function(n_predictors, y, X, folds) {
    cv_vec = numeric(length(unique(folds)))
    for(f in unique(folds)) {
        cv_vec[f] = rss_on_held_out(
                  n_predictors,
                  y_train = y[folds != f],
                  X_train = X[folds != f,],
                  y_test = y[folds == f],
                  X_test = X[folds == f,])
    }
    return(mean(cv_vec))
}

rss_on_held_out = function(n_predictors, y_train, X_train, y_test, X_test) {
    if(n_predictors == 0) {
        lm_submodel = lm(y_train ~ 0)
        preds_on_test = rep(0, length(y_test))
    } else {
        lm_submodel = lm(y_train ~ 0 + X_train[,1:n_predictors, drop = FALSE])
        preds_on_test = X_test[,1:n_predictors, drop= FALSE] %*% coef(lm_submodel)
    }

    return(sum((y_test - preds_on_test)^2))
}
K = 5
## normally you would do this at random
folds = rep(1:K, each = n / K)
p_vec = 0:p
cv_errors = sapply(p_vec, get_cv_error, y, X, folds)
plot(cv_errors ~ p_vec)

Choice of \(K\)

Considerations:

Summing up