Today: Approximate Bayesian Computation
Reading: Sisson, Fan, and Beaumont, "Overview of Approximate Bayesian Computation"
Next two weeks:
Methods for sampling from arbitrary probability distributions.
Main application: sampling from posterior distributions
Suppose we have data \(y_1,\ldots, y_n\) that we believe can be described by a probability model with parameters \(\theta\).
We also have a prior distribution on the parameters \(\theta\), describing our belief about the values of those parameters before seeing any of the data.
\[ \begin{align*} y_i \mid \theta &\sim P(y_i \mid \theta), \quad i = 1,\ldots, n\\ \theta & \sim \pi(\theta) \end{align*} \]
For example:
Data \(y_i\) are heights of men entering the military
We think that the \(y_i\) are distributed \(\mathcal N(\theta, \sigma^2)\)
We have some prior belief about \(\theta\), maybe that it is around 70".
We quantify our prior belief about \(\theta\) as \(\theta \sim \mathcal N(70, 5)\)
Once we have seen the actual heights, we can "update" our belief about \(\theta\) by computing the posterior distribution \(P(\theta \mid y_1,\ldots, y_n)\)
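For this conjugate normal model the update is available in closed form: assuming \(\sigma^2\) is known and writing the prior as \(\theta \sim \mathcal N(\mu_0, \tau^2)\) (so \(\mu_0 = 70\) above), the posterior is \[ \theta \mid y_1, \ldots, y_n \sim \mathcal N\left( \frac{\mu_0 / \tau^2 + n \bar y / \sigma^2}{1/\tau^2 + n/\sigma^2},\ \frac{1}{1/\tau^2 + n/\sigma^2} \right), \qquad \bar y = \frac{1}{n}\sum_{i=1}^n y_i, \] whose mean is a precision-weighted average of the prior mean and the sample mean.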
Or, as in population genetics, something more complicated:
Data are gene frequencies (e.g., what fraction of the population has a gene for blue eyes, what fraction of the population has a certain variant of the alcohol dehydrogenase enzyme, and so on)
Gene frequencies come from a complicated probabilistic model with parameters having to do with ancestral population sizes, recombination rates, migration, and so on
We have prior beliefs about parameters (the ancestral population sizes, recombination rates, etc)
In principle, if we write these explicitly as probability models, we can compute posterior distributions of the parameters given the data that we actually see.
By applying Bayes' rule, we can compute the posterior distribution of the parameters given the data: \[ \begin{align*} P(\theta \mid y_1,\ldots, y_n) &= \frac{P(y_1,\ldots, y_n \mid \theta)\pi(\theta)}{P(y_1,\ldots, y_n)} \end{align*} \]
We want to know as much about this distribution as possible.
For simple cases it is available in closed form
For more complicated cases our best hope is to draw samples from it
From those samples we can compute posterior means, variances, etc. using the Monte Carlo methods from last class.
Inputs:
A target posterior: \(P(\theta \mid y_{\text{obs}}) \propto P(y_{\text{obs}} \mid \theta) \pi(\theta)\)
A way of simulating data sets from the model \(P(y \mid \theta)\)
A prior on the parameters \(\pi(\theta)\)
Sampling: for \(i = 1,\ldots, N\):
Generate \(\theta^{(i)} \sim \pi(\theta)\)
Generate \(y^{(i)} \sim p(y \mid \theta^{(i)})\)
If \(y^{(i)} = y_{\text{obs}}\), accept \(\theta^{(i)}\)
Why does this work?
Our draws \((\theta^{(i)}, y^{(i)})\) are samples from the joint distribution \(P(\theta, y)\)
Keeping only the values for which \(y^{(i)} = y_{\text{obs}}\) is the definition of conditioning on \(y_{\text{obs}}\).
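Written out: a proposed value \(\theta\) passes the check with probability \(P(y_{\text{obs}} \mid \theta)\), so the accepted draws have density \[ p(\theta \mid \text{accepted}) \propto \pi(\theta)\, P(y_{\text{obs}} \mid \theta) \propto P(\theta \mid y_{\text{obs}}). \]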
Bayesian analysis for a Poisson random variable.
Model is: \[ \begin{align*} Y_i &\sim \text{Poisson}(\theta), \quad i = 1,\ldots, n \\ \theta &\sim \text{Gamma}(\alpha, \beta) \end{align*} \]
By Bayes' rule, we can find in closed form that the posterior \(P(\theta \mid Y_1, \ldots, Y_n)\) has a \(\text{Gamma}(\sum_{i=1}^n Y_i + \alpha, n + \beta)\) distribution.
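To see this, multiply the Poisson likelihood by the Gamma prior and keep only the factors involving \(\theta\): \[ P(\theta \mid Y_1, \ldots, Y_n) \propto \left( \prod_{i=1}^n e^{-\theta} \frac{\theta^{Y_i}}{Y_i!} \right) \theta^{\alpha - 1} e^{-\beta \theta} \propto \theta^{\sum_{i=1}^n Y_i + \alpha - 1} e^{-(n + \beta)\theta}, \] which is the kernel of a \(\text{Gamma}(\sum_{i=1}^n Y_i + \alpha, n + \beta)\) density.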
Let's pretend we can't do that though, and try using ABC.
Set up the function:
generate_abc_sample = function(observed_data,
                               prior_distribution,
                               data_generating_function) {
    ## Propose until a simulated data set matches the observed data exactly
    while(TRUE) {
        ## Draw a parameter value from the prior
        theta = prior_distribution()
        ## Simulate a data set from the model at that parameter value
        y = data_generating_function(theta)
        ## Accept theta only on an exact match
        if(all(y == observed_data)) {
            return(theta)
        }
    }
}
Analysis for:
Prior distribution: \(\theta \sim \text{Gamma}(1, 1)\)
Likelihood: \(y \mid \theta \sim \text{Poisson}(\theta)\)
Observed data: \(y = 3\)
prior_distribution = function() rgamma(n = 1, shape = 1, rate = 1)
data_generating_function = function(theta) rpois(n = 1, lambda = theta)
observed_data = 3
generate_abc_sample(observed_data, prior_distribution, data_generating_function)
## [1] 3.463678
posterior_samples = replicate(n = 1000, generate_abc_sample(observed_data, prior_distribution, data_generating_function))
## our posterior should be gamma(y + alpha, 1 + beta) or gamma(4, 2)
## The mean of a gamma distribution is alpha / beta, so should be 2
mean(posterior_samples)
## [1] 2.037032
## The variance of a gamma distribution is alpha / beta^2, so should be 1
var(posterior_samples)
## [1] 1.10589
qplot(posterior_samples)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
x_vec = seq(0, 6, length.out = 1000)
plot(dgamma(x_vec, shape = 4, rate = 2) ~ x_vec, type = 'l', ylab="true posterior density")
What if you have more than one sample?
We still have
Prior distribution: \(\theta \sim \text{Gamma}(1, 1)\)
Likelihood: \(y_i \mid \theta \sim \text{Poisson}(\theta)\)
Observed data: \(y_1 = 3, y_2 = 3\)
n_samples = 2
data_generating_function = function(theta) rpois(n = n_samples, lambda = theta)
observed_data = rep(3, n_samples)
generate_abc_sample(observed_data, prior_distribution, data_generating_function)
## [1] 1.445692
system.time(replicate(n = 1000, generate_abc_sample(observed_data, prior_distribution, data_generating_function)))
##    user  system elapsed 
##   0.767   0.119   0.889
(Try changing n_samples to something bigger on your own computer...)
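As a rough sketch (estimate_acceptance_rate here is just an illustrative helper), we can estimate the probability that a single proposal from the Gamma(1, 1) prior generates a data set exactly equal to the observed one:
estimate_acceptance_rate = function(n_samples, n_trials = 10000) {
    ## Same prior, likelihood, and observed data as above
    observed = rep(3, n_samples)
    matches = replicate(n_trials, {
        theta = rgamma(n = 1, shape = 1, rate = 1)
        y = rpois(n = n_samples, lambda = theta)
        all(y == observed)
    })
    mean(matches)
}
estimate_acceptance_rate(1)
estimate_acceptance_rate(2)
estimate_acceptance_rate(5)
The estimated match probability should drop off sharply as n_samples grows, which is what makes the exact-match version impractical.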
This exact-match version only works for discrete data.
The acceptance probability is very low, so it can take a very long time.
Therefore:
Relax the acceptance rule: accept \(\theta^{(i)}\) when the simulated data are close to the observed data instead of exactly equal to them.
This makes the method give approximate samples from the posterior instead of exact samples.
Inputs:
A target posterior: \(\pi(\theta \mid y_{\text{obs}}) \propto p(y_{\text{obs}} \mid \theta) \pi(\theta)\)
A way of simulating data sets from the model \(p(y \mid \theta)\)
A prior on the parameters \(\pi(\theta)\)
A summary statistic function \(s\)
A tolerance \(\epsilon\)
Sampling: for \(i = 1,\ldots, N\):
Generate \(\theta^{(i)} \sim \pi(\theta)\)
Generate \(y^{(i)} \sim p(y \mid \theta^{(i)})\)
If \(\|s(y^{(i)}) - s(y_{\text{obs}})\| < \epsilon\), accept \(\theta^{(i)}\)
This method generates approximate samples from the posterior distribution
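More precisely (this is the characterization given in the Sisson, Fan, and Beaumont overview), the accepted draws target \[ \pi_{ABC}(\theta) \propto \pi(\theta) \int p(y \mid \theta)\, \mathbb{1}\{\|s(y) - s(y_{\text{obs}})\| < \epsilon\}\, dy, \] which approaches \(\pi(\theta \mid s(y_{\text{obs}}))\) as \(\epsilon \to 0\); that in turn equals the true posterior only when \(s\) is sufficient.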
Set up a function for the approximate version of ABC:
generate_abc_sample_2 = function(observed_data,
                                 summary_statistic,
                                 prior_distribution,
                                 data_generating_function,
                                 epsilon) {
    while(TRUE) {
        ## Draw a parameter from the prior and simulate data from the model
        theta = prior_distribution()
        y = data_generating_function(theta)
        ## Accept theta if the simulated summary statistic is within
        ## epsilon of the observed summary statistic
        if(abs(summary_statistic(y) - summary_statistic(observed_data)) < epsilon) {
            return(theta)
        }
    }
}
Let's see what happens with the approximate version:
We still have
Prior distribution: \(\theta \sim \text{Gamma}(1, 1)\)
Likelihood: \(y_i \mid \theta \sim \text{Poisson}(\theta)\)
Observed data: \(y_i = 3\), \(i = 1,\ldots, 10\)
Summary statistic \(s\) is the mean function, so \(s(y_1,\ldots, y_n) = \frac{1}{n} \sum_{i=1}^n y_i\)
Our tolerance is \(\epsilon = .1\)
n_samples = 10
prior_distribution = function() rgamma(n = 1, shape = 1, rate = 1)
data_generating_function = function(theta) rpois(n = n_samples, lambda = theta)
observed_data = rep(3, n_samples)
summary_statistic = mean
epsilon = .1
generate_abc_sample_2(observed_data, summary_statistic, prior_distribution, data_generating_function, epsilon)
## [1] 1.7603
posterior_samples = replicate(n = 1000,
                              generate_abc_sample_2(observed_data,
                                                    summary_statistic,
                                                    prior_distribution,
                                                    data_generating_function,
                                                    epsilon))
Checking on the posterior means and variances:
(true_posterior_mean = (1 + sum(observed_data)) / (n_samples + 1))
## [1] 2.818182
mean(posterior_samples)
## [1] 2.81916
(true_posterior_variance = (1 + sum(observed_data)) / (n_samples + 1)^2)
## [1] 0.2561983
var(posterior_samples)
## [1] 0.2605666
Checking on the posterior distributions:
qplot(posterior_samples) + xlim(c(0, max(x_vec)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot(dgamma(x_vec, shape = 1 + sum(observed_data), rate = n_samples + 1) ~ x_vec, type = 'l')
Difficulties: choice of summary statistics:
Theory says they should be sufficient statistics for the model (in the Poisson example above, the sample mean is sufficient, since the likelihood depends on the data only through \(\sum_{i=1}^n y_i\)).
In practice, they are chosen by expert opinion to be features of the data that are thought to be informative about the underlying parameters.
Advantage: you need to know hardly anything about the likelihood; you just need to be able to simulate data from it.
Interpretation of Bayesian inference: parameters with higher posterior probability are simply those under which the simulated data tend to match the data we actually observed.
Next week we'll talk about more exact methods for sampling from posteriors, but they will require us to know more about the functions involved: in particular, to be able to evaluate the likelihood, at least up to a constant.