Stat 470/670 Lecture 2: Univariate Data Visualization

Julia Fukuyama

The basics/Our goals

Goals:

Even if you have multivariate data, you should start out looking at the variables one by one, as if they were univariate.

Describing a single univariate dataset

Empirical CDF definition:

Let \(x_1, \ldots, x_n\) be our dataset.

The empirical cumulative distribution function (ecdf) is defined as \[ \text{ecdf}(x) = \frac{\text{\# of elements in the dataset with value } \le x}{\text{\# of elements in the dataset}} \]

Properties:

Let’s try this out in R

## lattice has the singer data that we're going to use
library(lattice)
library(ggplot2)
library(dplyr)
library(magrittr)
library(stringr)

Let’s try out ggplot

## nothing gets plotted! why not?
ggplot(singer, aes(x = height))

## we need to tell it not just that we want to plot height, but how to plot it. here we're saying to plot height as an ecdf
ggplot(singer, aes(x = height)) + stat_ecdf()

## another way of doing the same thing, ggplots can come in pieces
singer.gg = ggplot(singer, aes(x = height))
singer.gg + stat_ecdf()

## and if we want to label the axes
singer.gg + stat_ecdf() +
    xlab("This is the x-axis") +
    ylab("This is the y-axis") + 
    ggtitle("Here's a title")

Quantile function: Definition

The \(f\) quantile, \(q(f)\), of a set of data is a value with the property that approximately a fraction \(f\) of the data are less than or equal to \(q(f)\).

Note: This definition doesn’t completely specify the quantile function.

For concreteness, we use Cleveland’s definition of the quantile function:

Now we have a partial specification of a quantile function.

Create the remainder by linear interpolation of the points we do have.

There isn’t a nice R function for making quantile plots using Cleveland’s definition, but we can still work it out by hand.

## quantile plots by hand
Tenor1 = singer %>%
    subset(voice.part == "Tenor 1") %>%
    arrange(height)
## exactly the same as
Tenor1 = arrange(
    subset(singer, voice.part == "Tenor 1"), height)
## close to the same as
sort(singer$height[singer$voice.part == "Tenor 1"])
##  [1] 64 64 65 66 66 66 67 67 68 68 68 69 70 70 71 71 72 72 73 74 76
nTenor1 = nrow(Tenor1)
f.value = (0.5:(nTenor1 - 0.5))/nTenor1
Tenor1$f.value = f.value
ggplot(Tenor1, aes(x = f.value, y = height)) +
    geom_line() +
    geom_point()

Exercise: Do this for all the voice parts (either by hand, one at a time, or preferably programmatically), and plot all of the quantile plots faceted out by voice part or plotted all on the same color scale

Other interpretations of the ECDF

Let \(X\) be a random variable obtained by drawing uniformly at random from the dataset \(x_1, \ldots, x_n\).

\[ \text{ecdf}(x) = P(X \le x) \]

Why is the ECDF a good representation of the data?

Remember from your other statistics courses the definition of a cumulative distribution function:

Let \(X\) be a random variable taking values in \(\mathbb R\), the cumulative distribution function \(F_X\) is defined as \[ F_X(x) = P(X \le x) \]

The empirical CDF is the analogous quantity for our data, and is the nonparametric maximum likelihood estimate of the population CDF.

Histograms and Density Estimates

Histogram: Definition

Let \(x_{(1)}, \ldots, x_{(n)}\) be the ordered data.

Histogram/Density estimate demonstration

ggplot(singer, aes(x = height)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(singer, aes(x = height)) + geom_density()

ggplot(singer, aes(x = height)) +
    geom_density(adjust = .5) +
    geom_rug(aes(y = 0), sides = "b",
             position = position_jitter(height = 0))

Exercise: play around with the jittering on the rug and the adjust parameter in the density. What do you like best? Again, try faceting out the histograms or plotting them over one another. Do different versions bring out different features of the data? What do you notice in the different plots?

set.seed(0)
df = data.frame(x = c(rnorm(25, 0, 1), rnorm(25, 5, 1)))
ggplot(df) + geom_histogram(aes(x = x))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(df) + geom_histogram(aes(x = x), bins = 50)

ggplot(df) + stat_ecdf(aes(x = x))

Drawbacks of histograms:

Q-Q Plots

Goal: Compare two univariate samples to each other

Quantile-Quantile (q-q) plot definition

Suppose we have two sets of univariate measurements, \(x_{(1)}, \ldots, x_{(n)}\) and \(y_{(1)}, \ldots, y_{(m)}\), with \(m \le n\).

For each \(i = 1,\ldots, m\), plot the \((i - .5) / m\) quantile of the \(y\) dataset against the \((i - .5) / m\) quantile of the \(x\) dataset.

Note:

If \(m = n\), then

Q-Q plot demonstration

There isn’t a nice way to do q-q plots in ggplot, but there is a function in base R that will return a data frame with the values we need to plot.

Tenor1 = singer %>% subset(voice.part == "Tenor 1")
Bass2 = singer %>% subset(voice.part == "Bass 2")
qq.df = as.data.frame(qqplot(Tenor1$height, Bass2$height,
    plot.it = FALSE))
ggplot(qq.df, aes(x = x, y = y)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0) +
    xlab("Tenor 1") + ylab("Bass 2")

Alternative: compare two histograms or density estimates

Tenor1_or_Bass2 = singer %>% subset(voice.part %in% c("Tenor 1", "Bass 2"))
ggplot(Tenor1_or_Bass2) +
    geom_histogram(aes(x = height, fill = voice.part),
                   alpha = .5, position = "dodge")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Tenor1_or_Bass2) +
    geom_density(aes(x = height, fill = voice.part), alpha = .5)

Q-Normal plots

Goal: Check how well a normal distribution approximates the data

Definition: Theoretical quantiles

Let \[ q_{\mu, \sigma}(f) = \{ x : P(\mathcal N(\mu, \sigma^2) \le x = f) \} \]

In words: the value \(x\) such that the probability that a \(\mathcal N(\mu, \sigma^2)\) random variable takes value at most \(x\) is equal to \(f\).

Note the analogy to data quantiles, \(q_x(f)\) defined before.

Definition: Q-Normal plot

Let \(x_{(1)}, \ldots, x_{(n)}\) be our ordered data.

Recall that we defined the sample quantile function at values \(f_i = (i - .5) / n\) as \(q_x(f_i) = x_{(i)}\).

For each value \(f_i\), \(i = 1,\ldots, n\), compute

A Q-normal plot shows sample quantiles on the \(y\)-axis and theoretical quantiles on the \(x\)-axis.

Properties:

Q-Normal plot demonstration

Let’s first see what a Q-normal plot looks like when the data really come from a normal distribution.

ggplot(data.frame(x = rnorm(100))) +
    stat_qq(aes(sample = x)) +
    geom_abline(aes(slope = 1, intercept = 0))

Then we have a reference for how closely to the line the points should lie when we’re looking at real data.

ggplot(singer) + stat_qq(aes(sample = height), distribution = qnorm)

Summing up

To visualize one distribution:

To compare two distributions:

To compare a distribution to a theoretical distribution: