General approach to problem solving.
Identify the problem you want to solve.
Break the problem up into a couple of smaller pieces.
Assume you can work out each of the pieces, and put them together so that they solve the problem.
Solve each of the smaller pieces in the same way.
For example, if you’re designing a house, you might think:
Problem to solve: Create a layout for the house.
Assume that we have layouts for each of the rooms (sub-problems = layout for each room).
Decide how to fit the rooms together (assemble the smaller pieces).
Then think about the detailed layout of each of the rooms.
In programming:
big_function <- function(all_arguments) {
    piece_1 <- get_piece_1(some_arguments)
    piece_2 <- get_piece_2(other_arguments)
    output <- put_pieces_together(piece_1, piece_2)
    return(output)
}
Then fill out the functions get_piece_1, get_piece_2, and put_pieces_together.
Advantages:
Easier to read
Tends to create small functions
Functions have a definite goal, so easier to test and debug
Goes well with test-driven design: you know what the sub-functions should do, and so you can write tests for them before you fill them out.
Let \(x_1, \dots, x_n\) be a sample of size \(n\) from a population, and let \(f\) be an estimator of some population parameter \(\theta\) of interest, so that \(\hat \theta = f(x_1, \ldots, x_n)\) is our point estimate of the population parameter.
We want to know the sampling distribution of \(\hat \theta\).
If we could draw more samples from the population, we could draw from the sampling distribution of \(\hat \theta\):
Draw a new sample of size \(n\), \(x_1^*, \ldots, x_n^*\) from the population.
Compute our estimate \(\hat \theta^* = f(x_1^*, \ldots, x_n^*)\) on that sample.
We can’t do this because we can’t draw new samples from the population, but…
The idea behind the bootstrap is to pretend that the observed data \(x_1, \ldots, x_n\) is the population.
Draw \(x_1^*, \ldots, x_n^*\) by sampling from \(x_1, \ldots, x_n\) with replacement.
Compute \(\hat \theta^* = f(x_1^*, \ldots, x_n^*)\).
Repeat many times.
The distribution of the \(\hat \theta^*\)’s can be used to approximate the sampling distribution of \(\hat \theta\).
In particular, if we have \(\hat \theta^*_1, \ldots, \hat \theta^*_{B}\), where \(B\) is the number of “bootstrap replicates”, then the empirical \(\alpha/2\) and \(1 - \alpha/2\) quantiles of \(\hat \theta^*_1, \ldots, \hat \theta^*_{B}\) give an approximate \(1 - \alpha\) confidence interval for \(\theta\) (the percentile bootstrap interval).
We’ll try to compute bootstrap confidence intervals today.
First try at a function to compute bootstrap confidence intervals.
bootstrap_mean_ci <- function(data, alpha, B) {
    n <- length(data)
    boot_means <- replicate(B, {
        boot_indices <- sample(x = 1:n, size = n, replace = TRUE)
        boot_data <- data[boot_indices]
        boot_mean <- mean(boot_data)
    })
    q_lo <- alpha / 2
    q_hi <- 1 - (alpha / 2)
    boot_ci <- quantile(boot_means, probs = c(q_lo, q_hi))
    return(boot_ci)
}
The top-down design way of building the function:
The confidence interval we want comes from the quantiles of the resampled estimates, so we:
Get the resampled estimates of the mean.
Compute the quantiles of the resampled estimates of the mean.
Return the quantiles.
Notice that now would be a good time to write tests for the get_ci and get_boot_means functions.
In the bootstrap_mean_ci function, we assume that:
get_boot_means returns a vector of length B.
get_ci returns a vector of length 2.
We might also want to check that get_ci and get_boot_means work on a simple test case, and that the first element of the output of get_ci is smaller than the second (the output is in the format we expect).
test_that("get_boot_means gives correct results on small datasets", {
    ## if we only have one sample, all the bootstrap samples are identical
    expect_equal(get_boot_means(data = 1, B = 100), rep(1, 100))
    ## if we only have two samples, there are three possible bootstrap means
    expect_true(length(unique(get_boot_means(data = c(1, 2), B = 100))) <= 3)
    expect_true(all(unique(get_boot_means(data = c(1, 2), B = 100)) %in% c(1, 1.5, 2)))
})
test_that("get_ci gives correct results on simple cases", {
## the .025 and .975 quantiles of a dataset that is half 0's and half 1's should be 0 and 1
expect_equivalent(get_ci(c(rep(0, 50), rep(1,50)), .05), c(0, 1))
## CI should be of length 2
expect_equal(length(get_ci(rnorm(100), .05)), 2)
test_ci = get_ci(rnorm(50), .05)
## first element of CI should be less than the second element
expect_true(test_ci[1] < test_ci[2])
})
Then we write the two sub-functions:
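The original definitions aren't preserved here; the following is a minimal sketch, consistent with bootstrap_mean_ci above and with the tests we just wrote:
get_boot_means <- function(data, B) {
    ## assumed reconstruction: resample the data B times and compute the mean each time
    n <- length(data)
    boot_means <- replicate(B, {
        boot_indices <- sample(x = 1:n, size = n, replace = TRUE)
        mean(data[boot_indices])
    })
    return(boot_means)
}

get_ci <- function(boot_means, alpha) {
    ## assumed reconstruction: quantiles of the bootstrap means give the percentile interval
    q_lo <- alpha / 2
    q_hi <- 1 - (alpha / 2)
    boot_ci <- quantile(boot_means, probs = c(q_lo, q_hi))
    return(boot_ci)
}

bootstrap_mean_ci can then be rewritten in terms of the two sub-functions:
bootstrap_mean_ci <- function(data, alpha, B) {
    boot_means <- get_boot_means(data, B)
    boot_ci <- get_ci(boot_means, alpha)
    return(boot_ci)
}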
Go through the informal tests:
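The code chunks that produced the output below weren't preserved; the calls were plausibly along these lines (B = 20 is inferred from the output, and the data arguments are guesses):
length(get_boot_means(data = rnorm(5), B = 20))    ## should be 20
get_boot_means(data = 1, B = 20)                   ## should be all 1's
get_boot_means(data = c(1, 2), B = 20)             ## values should be in {1, 1.5, 2}
get_ci(c(rep(0, 50), rep(1, 50)), alpha = .05)     ## should be (0, 1)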
## [1] 20
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1] 2.0 1.5 1.5 2.0 1.5 1.0 1.0 2.0 1.5 1.5 1.0 1.0 1.5 1.5 1.5 1.5 1.0 1.0 2.0
## [20] 1.5
## 2.5% 97.5%
## 0 1
Check on the tests we wrote earlier:
library(testthat)
test_that("get_boot_means gives correct results on small datasets", {
    ## if we only have one sample, all the bootstrap samples are identical
    expect_equal(get_boot_means(data = 1, B = 100), rep(1, 100))
    ## if we only have two samples, there are three possible bootstrap means
    expect_true(length(unique(get_boot_means(data = c(1, 2), B = 100))) <= 3)
    expect_true(all(unique(get_boot_means(data = c(1, 2), B = 100)) %in% c(1, 1.5, 2)))
})
## Test passed 🥇
test_that("get_ci gives correct results on simple cases", {
    ## the .025 and .975 quantiles of a dataset that is half 0's and half 1's should be 0 and 1
    expect_equivalent(get_ci(c(rep(0, 50), rep(1, 50)), .05), c(0, 1))
    ## CI should be of length 2
    expect_equal(length(get_ci(rnorm(100), .05)), 2)
    test_ci <- get_ci(rnorm(50), .05)
    ## first element of CI should be less than the second element
    expect_true(test_ci[1] < test_ci[2])
})
## Test passed 😸
Informal tests for the whole function:
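Again, the original calls aren't shown; they were presumably something like the following (the data arguments here are guesses):
bootstrap_mean_ci(data = rnorm(20), alpha = .05, B = 1000)
bootstrap_mean_ci(data = c(1, 2), alpha = .05, B = 1000)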
## 2.5% 97.5%
## -0.5714356 0.8196715
## 2.5% 97.5%
## 1 2
bootstrap_mean_ci <- function(data, alpha, B) {
    n <- length(data)
    boot_means <- replicate(B, {
        boot_indices <- sample(x = 1:n, size = n, replace = TRUE)
        boot_data <- data[boot_indices]
        boot_mean <- mean(boot_data)
    })
    q_lo <- alpha / 2
    q_hi <- 1 - (alpha / 2)
    boot_ci <- quantile(boot_means, probs = c(q_lo, q_hi))
    return(boot_ci)
}
bootstrap_lm_ci <- function(data, formula, alpha, B) {
    n <- nrow(data)
    ## replicate returns a matrix with one row per coefficient, one column per bootstrap replicate
    boot_sampled_coefs <- replicate(B, {
        boot_indices <- sample(x = 1:n, size = n, replace = TRUE)
        boot_data <- data[boot_indices, ]
        boot_lm <- lm(formula = formula, data = boot_data)
        boot_coefs <- coef(boot_lm)
    })
    q_lo <- alpha / 2
    q_hi <- 1 - (alpha / 2)
    ## take quantiles along each row, i.e. for each coefficient
    boot_ci <- plyr::aaply(boot_sampled_coefs, 1, function(x) quantile(x, probs = c(q_lo, q_hi)))
    return(boot_ci)
}
Notice:
A lot of the code is very similar between the two functions.
Later on, we might want to make bootstrap confidence intervals for other estimates.
This is a good candidate for refactoring, pulling out the common tasks into single functions.
Within each function, we have two steps:
Compute the set of bootstrap statistics.
Compute confidence intervals from the bootstrap set.
bootstrap_ci <- function(data, estimator, alpha, B) {
    boot_estimates <- get_boot_estimates(data, estimator, B)
    boot_ci <- get_ci(boot_estimates, alpha)
    return(boot_ci)
}
We’re assuming:
data is either a vector or a data frame.
estimator is a function that takes data and returns an estimate of a parameter (so something like mean, sd, or a function that computes coefficients in a linear model).
alpha is a number in (0, .5).
B is the number of bootstrap samples we want.
This would be a good time to write tests for the functions we’ve defined.
Tests for return types:
get_boot_estimates should return a vector of length B (for a one-dimensional estimator), or a matrix with number of columns equal to B, one column per bootstrap replicate (for a multi-dimensional estimator).
get_ci should return a vector of length 2, or a matrix with number of columns equal to 2 (upper and lower confidence bounds) and number of rows equal to the number of things we’re estimating.
Tests for actual values:
get_ci works on a simple case where you know the answer.
get_boot_estimates gives the correct results when you have a small number of samples and can easily enumerate the possible values.
A sketch of such tests is below.
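In the test-driven spirit, these tests can be written before the functions themselves. A sketch, assuming the return shapes described above (two_stats is a hypothetical two-dimensional estimator used only for the test):
library(testthat)
test_that("get_boot_estimates returns the right shape", {
    ## one-dimensional estimator: a vector of length B
    expect_equal(length(get_boot_estimates(rnorm(20), mean, B = 50)), 50)
    ## multi-dimensional estimator: one column per bootstrap replicate
    two_stats <- function(x) c(mean = mean(x), sd = sd(x))   ## hypothetical estimator
    expect_equal(ncol(get_boot_estimates(rnorm(20), two_stats, B = 50)), 50)
})
test_that("get_ci returns the right shape", {
    ## one-dimensional estimates: a vector of length 2
    expect_equal(length(get_ci(rnorm(100), .05)), 2)
    ## two estimated quantities: a 2 x 2 matrix of lower and upper bounds
    estimates <- matrix(rnorm(200), nrow = 2)
    expect_equal(dim(get_ci(estimates, .05)), c(2L, 2L))
})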
Next step: fill out the functions.
get_ci <- function(estimates, alpha) {
    q_lo <- alpha / 2
    q_hi <- 1 - (alpha / 2)
    if(!is.null(dim(estimates))) {
        ## if we have multi-dimensional estimates
        cis <- plyr::aaply(estimates, 1, function(x) quantile(x, probs = c(q_lo, q_hi)))
    } else {
        ## if we have one-dimensional estimates
        cis <- quantile(estimates, probs = c(q_lo, q_hi))
    }
    return(cis)
}
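We also need get_boot_estimates. Its original definition isn't preserved; a minimal sketch consistent with bootstrap_ci above:
get_boot_estimates <- function(data, estimator, B) {
    ## assumed reconstruction: apply the estimator to B bootstrap samples of the data
    boot_estimates <- replicate(B, {
        boot_sample <- get_bootstrap_sample(data)
        estimator(boot_sample)
    })
    return(boot_estimates)
}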
We made another problem for ourselves! We still need get_bootstrap_sample.
Fill out get_bootstrap_sample
get_bootstrap_sample <- function(data) {
    if(!is.null(dim(data))) {
        ## in this case, data is rectangular, and we want to sample rows
        n <- nrow(data)
        boot_idx <- sample(1:n, size = n, replace = TRUE)
        bootstrap_sample <- data[boot_idx, ]
    } else {
        ## in this case, data is a vector and we want to sample elements of the vector
        n <- length(data)
        boot_idx <- sample(1:n, size = n, replace = TRUE)
        bootstrap_sample <- data[boot_idx]
    }
    return(bootstrap_sample)
}
Could make an argument for breaking into two functions:
get_bootstrap_sample <- function(data) {
    if(!is.null(dim(data))) {
        boot_sample <- bootstrap_sample_rows(data)
    } else {
        boot_sample <- bootstrap_sample_elements(data)
    }
    return(boot_sample)
}

bootstrap_sample_rows <- function(data) {
    n <- nrow(data)
    boot_idx <- sample(1:n, size = n, replace = TRUE)
    bootstrap_sample <- data[boot_idx, ]
    return(bootstrap_sample)
}

bootstrap_sample_elements <- function(data) {
    n <- length(data)
    boot_idx <- sample(1:n, size = n, replace = TRUE)
    bootstrap_sample <- data[boot_idx]
    return(bootstrap_sample)
}
Does it work?
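The calls producing the output below weren't preserved; they were plausibly something like the following (the data arguments are guesses):
bootstrap_ci(data = rnorm(20), estimator = mean, alpha = .05, B = 1000)
bootstrap_ci(data = rnorm(50, mean = 2), estimator = mean, alpha = .05, B = 1000)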
## 2.5% 97.5%
## -0.8760293 0.1134295
## 2.5% 97.5%
## 1.672592 2.206522
The more complicated example: bootstrap confidence intervals for coefficients from a linear model:
data(iris)
iris_coef_estimator <- function(d) {
    iris_lm <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = d)
    iris_coef <- coef(iris_lm)
    return(iris_coef)
}
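The call producing the output below isn't shown; presumably something like:
bootstrap_ci(data = iris, estimator = iris_coef_estimator, alpha = .05, B = 1000)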
##
## X1 2.5% 97.5%
## (Intercept) 1.8054505 2.714468
## Sepal.Width 0.4638994 0.716341
## Petal.Length 0.4377103 0.503800
## compare with
iris_lm <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
confint(iris_lm)
## 2.5 % 97.5 %
## (Intercept) 1.7590943 2.7391860
## Sepal.Width 0.4585161 0.7325334
## Petal.Length 0.4380915 0.5057486