Lecture 16: Fitting probability models

Today: Transitioning away from R/software engineering, towards algorithms for statistics

Agenda

Very short review of probability

We need to know about random variables and their distributions.

What is a random variable?

In R, you can draw a random variable from a distribution with functions of the form rdist.

For example, rnorm draws from the normal distribution, rbinom from the binomial, rpois from the Poisson, and so on.

Syntax is rdist(n, param1, param2,..., paramn)

Examples

rnorm(n = 10, mean = 5, sd = .1)
##  [1] 4.841065 4.970267 5.123135 5.070469 4.904124 5.097460 4.928799 5.009924
##  [9] 4.954670 5.130668
rbinom(n = 5, size = 5, prob = .8)
## [1] 5 4 5 3 4
rpois(n = 20, lambda = 10)
##  [1] 10 10 11  7  3 13 13  8 12  9  4  7 12  9 10 11 10 15 13  9

A random variable is characterized by its cumulative distribution function (CDF): \(F_X(q) = P(X \le q)\).

In R, the cumulative distribution functions for common distributions are available as functions of the form pdist.

Syntax is pdist(q, param1, ..., paramn)

For example:

pnorm(q = 0, mean = 0, sd = 1)
## [1] 0.5
pbinom(q = .1, size = 1, prob = .2)
## [1] 0.8
pbinom(q = 1, size = 1, prob = .2)
## [1] 1

Normal CDF:

qvec = seq(-3, 3, length.out = 1000)
Fx = sapply(qvec, pnorm, mean = 0, sd = 1)
plot(Fx ~ qvec, type = 'l')

Binomial CDF:

qvec = seq(0, 5, length.out = 1001)
Fx = sapply(qvec, pbinom, size = 5, prob = .5)
plot(Fx ~ qvec, type = 'l')

Recall that rdist draws random variables from dist, and that pdist(q) computes the probability that a random variable with distribution dist takes a value less than or equal to q.

Let’s check:

## draw 1000 random variables from a normal with mean 0 and sd 1
x = rnorm(n = 1000, mean = 0, sd = 1)
## compute what fraction of the random variables are at most -.5
q = -.5
mean(x <= q)
## [1] 0.32
## compute what fraction of the time the random variables should be less than or equal to -.5
pnorm(q = q, mean = 0, sd = 1)
## [1] 0.3085375

Not exactly the same, but pretty close!

## try again with a binomial distribution
x = rbinom(n = 1000, size = 5, prob = .2)
## compute what fraction of the random variables are 1 or less
q = 1
mean(x <= q)
## [1] 0.749
## compute what fraction of the time the random variables should be 1 or less
pbinom(q = q, size = 5, prob = .2)
## [1] 0.73728

Again, pretty close! You can check for other values of q and other distributions.
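For instance, here is the same kind of check for a Poisson distribution (my own example, following the same rdist/pdist pattern); the two numbers should again be close:

## try once more with a Poisson distribution
x = rpois(n = 1000, lambda = 10)
## compute what fraction of the random variables are at most 8
q = 8
mean(x <= q)
## compute what fraction of the time the random variables should be at most 8
ppois(q = q, lambda = 10)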

Final concept: probability mass functions and probability density functions

Discrete random variables:

Definition of probability mass function: If \(f_X\) is the probability mass function for a random variable \(X\), \(f_X(x) = P(X = x)\).

In R: probability mass functions for common distributions are given by functions of the form ddist.

Syntax: ddist(x, param1, ..., paramn) computes \(f_X(x)\) for a random variable \(X\) following distribution dist with parameters param1, …, paramn

For example:

dbinom(x = 1, size = 1, prob = .5)
## [1] 0.5
dbinom(x = 2, size = 2, prob = .5)
## [1] 0.25
dbinom(x = .5, size = 2, prob = .5)
## Warning in dbinom(x = 0.5, size = 2, prob = 0.5): non-integer x = 0.500000
## [1] 0
x_vec = 0:2
fx = sapply(x_vec, dbinom, size = 2, prob = .5)
plot(fx ~ x_vec, type = 'h', ylim = c(0, .5))

x_vec = 0:5
fx = sapply(x_vec, dbinom, size = 5, prob = .8)
plot(fx ~ x_vec, type = 'h', ylim = c(0, .5))

As before, we can check that our definitions are consistent:

## generate random variables from a binomial distribution with size = 2 and prob = .5
X = rbinom(n = 1000, size = 2, prob = .5)
head(X)
## [1] 2 1 1 1 0 1
## compute the fraction of the random variables that took value exactly equal to 2
x = 2
mean(X == x)
## [1] 0.272
## compute the pmf for the distribution at x = 2
dbinom(x = x, size = 2, prob = .5)
## [1] 0.25

Apologies for the notation, but the norm is to denote random variables by capital letters and to denote the actual values they take by lower-case letters.

You’ll often see things like \(P(X = x)\), which means the probability that a random variable \(X\) takes value \(x\).

Instead of looking at just one value of \(x\), we can look at the whole distribution:

X = rbinom(n = 1000, size = 5, prob = .5)
x_vec = 0:5
expected_probability = sapply(x_vec, dbinom, size = 5, prob = .5)
empirical_probability = sapply(x_vec, function(x) mean(X == x))
plot(expected_probability ~ x_vec, type = 'h',
     ## we need this because otherwise some of the points might be outside of the plotting region
     ylim = c(0, max(c(expected_probability, empirical_probability))))
points(empirical_probability ~ x_vec, col = 'red')

Continuous random variables:

Probability density function formally: If \(X\) is a continuous random variable with cumulative distribution function \(F_X\), the probability density function of \(X\), \(f_X(x)\), is defined as \(f_X(x) = F_X'(x)\).

Think of probability density functions as the continuous analogue of probability mass functions

In R: probability density functions for common distributions are given by functions of the form ddist (the same as for probability mass functions)

Syntax: ddist(x, param1, ..., paramn) computes \(f_X(x)\) for a random variable \(X\) following distribution dist with parameters param1, …, paramn

For example:

dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423
dnorm(x = 1, mean = 0, sd = 1)
## [1] 0.2419707
dnorm(x = 50, mean = 0, sd = 1)
## [1] 0
x_vec = seq(-3, 3, length.out = 1000)
fx = sapply(x_vec, dnorm, mean = 0, sd = 1)
plot(fx ~ x_vec, type = 'l')
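Since the density is by definition the derivative of the CDF, we can check this numerically with a finite-difference approximation (a quick sketch of my own, not from the lecture):

## approximate the derivative of the normal CDF at q = 1
h = 1e-5
(pnorm(1 + h) - pnorm(1 - h)) / (2 * h)
## should be very close to the density at x = 1 computed above
dnorm(x = 1, mean = 0, sd = 1)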

Summing up: probability

Fitting probability models to data

Setup: We have a set of data points \(x_1, \ldots, x_n\), and we want to find a probability distribution that describes the data well.

Why do we want to do this?

How do we fit probability models?

Two main strategies:

Many variations on these themes

Maximum likelihood

Problem: We have a family of probability distributions, indexed by a parameter \(\theta\), and we need to choose one to describe the data.

Solution, heuristically: choose the value of \(\theta\) under which the observed data are most probable.

Formally: define the likelihood of the data as a function of \(\theta\):

\[ L(\theta)=\prod_{i=1}^n f(x_i;\theta) \]

Recall that the probability density/mass function describes how likely a random variable is to take a given value.

Therefore: find the value of \(\theta\) that maximizes the likelihood.

In practice, we work with the log likelihood instead of the likelihood, since the log turns the product into a sum and is much easier to handle numerically:

\[ \ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta) \]
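One practical reason for the log (an illustration of my own, not from the lecture): multiplying many densities underflows to zero in floating point, while summing their logs stays finite.

## the raw likelihood of 2000 points underflows to 0 ...
x = rnorm(n = 2000, mean = 0, sd = 1)
prod(dnorm(x, mean = 0, sd = 1))
## ... but the log likelihood is an ordinary (large, negative) number
sum(dnorm(x, mean = 0, sd = 1, log = TRUE))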

For example: we have data points \(x_1, \ldots, x_n\), and we want to find the \(N(\theta, 1)\) distribution that fits the data the best.

The likelihood is \[ L(x; \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}(x_i - \theta)^2) \]

and the log likelihood is \[ \ell(x; \theta) = \sum_{i=1}^n\log \left( \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}(x_i - \theta)^2) \right) \]

We can use dnorm in R to compute the log likelihood for any \(x\) and \(\theta\) we want:

## create a function that computes the log likelihood
likelihood = function(theta, x) {
    sum(log(dnorm(x, mean = theta, sd = 1)))
}
x = c(5.5, 4, 3.2, 4.7, 4.3, 3.5)
theta_vec = seq(0, 10, length.out = 100)
l_of_theta = sapply(theta_vec, likelihood, x)
plot(l_of_theta ~ theta_vec, type = 'l')

What is the maximum? For a normal distribution with known variance, the log likelihood is maximized at the sample mean, so we mark mean(x) on the plot:

plot(l_of_theta ~ theta_vec, type = 'l')
abline(v = mean(x))

Alternatively, just search over the grid:

max_idx = which.max(l_of_theta)
theta_vec[max_idx]
## [1] 4.242424
## compare with
mean(x)
## [1] 4.2

Another example: Binomial, five trials, unknown success probability.

likelihood = function(theta, x) {
    sum(log(dbinom(x = x, size = 5, prob = theta)))
}

Compute the log likelihoods for many possible values of prob:

x = rbinom(n = 100, size = 5, prob = .2)
theta_vec = seq(0, 1, length.out = 100)
log_likelihoods = sapply(theta_vec, likelihood, x)
plot(log_likelihoods ~ theta_vec, type = 'l')
abline(v = .2)

We see that the maximizer is pretty close to the true value, \(.2\):

max_idx = which.max(log_likelihoods)
theta_vec[max_idx]
## [1] 0.1919192

In the binomial and normal examples, we used grid search.

Two reasons why those examples are silly: in both cases the maximizer has a closed form (the sample mean for the normal, the overall success fraction for the binomial), so no search is needed; and grid search is imprecise and scales poorly once there is more than one parameter.

One reason why getting the numbers is non-trivial: for most models there is no closed-form solution, so we have to rely on numerical optimization, as in the sketch below.
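To make that concrete, here is a sketch (my own, not from the lecture) that hands the binomial log likelihood from above to R's one-dimensional optimizer, optimize, instead of evaluating it on a grid:

## binomial log likelihood, as before
log_lik = function(theta, x) {
    sum(dbinom(x = x, size = 5, prob = theta, log = TRUE))
}
x = rbinom(n = 100, size = 5, prob = .2)
## numerically maximize over theta in (0, 1)
optimize(log_lik, interval = c(0, 1), x = x, maximum = TRUE)$maximum
## compare with the closed-form maximizer, the overall success fraction
sum(x) / (5 * length(x))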

Distribution fitting in R

For the simple case where all the data come from the same distribution, we can use fitdistr from the MASS package

Syntax: fitdistr(x, densfun, start), where x is the data vector, densfun is either a density/mass function or a character string naming the distribution, and start is a named list of starting values for the parameters
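As a quick warm-up before the lecture's example (this one is my own), fitdistr can also take the distribution's name as a character string, in which case no starting values are needed:

library(MASS)
x = rnorm(n = 500, mean = 2, sd = 3)
## fit a normal distribution; the estimates should be close to mean = 2, sd = 3
fitdistr(x, densfun = "normal")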

Example: negative binomial

Negative binomial distribution:

Probability mass function of \(X\): \[ \begin{align*} P(X = x) &= \begin{pmatrix}x+r-1 \\x \end{pmatrix}(1-p)^r p^x \\ &= \frac{(x+r-1)!}{(r-1)!x!}(1-p)^r p^x \end{align*} \]

Suppose we have \(x_1,\ldots, x_n\), and we want to fit a negative binomial distribution, i.e., find the values of \(r\) and \(p\) that make \(x_1,\ldots, x_n\) the most likely.

The log likelihood of \(x_1,\ldots, x_n\) is \[ \sum_{i=1}^n \left[ \log((x_i + r - 1)!) - \log((r-1)!) - \log(x_i!) + r \log(1-p) + x_i \log p \right] \]
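We can check this formula directly (a sketch of my own, using lgamma(k + 1) = log(k!); note that R's prob argument corresponds to \(1 - p\) in our notation):

r = 4; p = .3; x = 0:5
log_pmf = lgamma(x + r) - lgamma(r) - lgamma(x + 1) + r * log(1 - p) + x * log(p)
exp(log_pmf)
## the same values from R's built-in pmf
dnbinom(x, size = r, prob = 1 - p)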

If you go through the calculus, you can find a closed-form solution for the maximizing value of \(p\): \(\hat p = \frac{\sum_{i=1}^n x_i}{n r + \sum_{i=1}^n x_i}\).
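For completeness, the calculus step is short: only the last two terms of the log likelihood involve \(p\), so \[ \frac{\partial \ell}{\partial p} = -\frac{nr}{1-p} + \frac{\sum_{i=1}^n x_i}{p} = 0 \quad\Longrightarrow\quad nrp = (1-p)\sum_{i=1}^n x_i \quad\Longrightarrow\quad \hat p = \frac{\sum_{i=1}^n x_i}{nr + \sum_{i=1}^n x_i}. \]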

There is no closed-form solution for \(r\).

Note: there are multiple parameterizations of the negative binomial, and R uses a slightly different one than what we defined above.
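In particular, dnbinom can be parameterized by size and the mean mu, with prob = size / (size + mu) (see ?dnbinom). A quick check that the two forms agree (my own sketch, using the same size and mu as the fit below):

size = 4; mu = 5
dnbinom(0:3, size = size, mu = mu)
## identical values via the prob parameterization
dnbinom(0:3, size = size, prob = size / (size + mu))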

library(MASS)
x = rnbinom(n = 1000, size = 4, mu = 5)
nb_fit = fitdistr(x, densfun = dnbinom, start = list(size = 1, mu = 1))
nb_fit
##      size         mu    
##   4.2539593   4.9914592 
##  (0.3662529) (0.1041624)

We can check informally that these values of mu and size are maximizers:

fitted_log_lik = dnbinom(x, size = nb_fit$estimate["size"], mu = nb_fit$estimate["mu"], log = TRUE)
sum(fitted_log_lik)
## [1] -2511.054
## the log likelihood at a couple other values of mu and size
sum(dnbinom(x, size = 2, mu = 3, log = TRUE))
## [1] -2754.772
sum(dnbinom(x, size = 10, mu = 4, log = TRUE))
## [1] -2626.488

More generally

The examples so far have all been for the simple case where we assume that all the data come from the same distribution, but in many problems (regression, for example) each observation has its own distribution.

In general, we can let different data points come from different distributions, but the setup is the same: write down the likelihood of the observed data as a function of the parameters and find the parameter values that maximize it.

Concretely: Linear models

We have data \(y_1, \ldots, y_n\) and \(x_1,\ldots, x_n\).

We assume that the \(y_i\)’s are realizations of random variables: \[ \begin{align*} y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2) \end{align*} \]

In this case: the parameters are \(\beta_0\), \(\beta_1\), and \(\sigma^2\), and maximizing the likelihood over \(\beta_0\) and \(\beta_1\) is the same as minimizing the sum of squared residuals (least squares).
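A small sketch (my own, with simulated data) showing that maximizing this likelihood numerically recovers the same coefficients as lm():

## simulate data from the model y_i ~ N(beta0 + beta1 * x_i, sigma^2)
set.seed(1)
x = runif(100)
y = 1 + 2 * x + rnorm(100, sd = .5)
## log likelihood as a function of (beta0, beta1, log sigma)
log_lik = function(par) {
    sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
## maximize numerically (fnscale = -1 makes optim maximize)
optim(par = c(0, 0, 0), fn = log_lik, control = list(fnscale = -1))$par[1:2]
## compare with least squares
coef(lm(y ~ x))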

Summing up