Stat 470/670 Lecture 20: EDA and the Problem of Multiple Comparisons

Julia Fukuyama

EDA and Multiple Comparisons

Optional reading: Gelman and Loken

For this lecture, we will talk a little about \(p\)-values.

The standard multiple comparison problem

Suppose a researcher is interested in whether married and unmarried men have different opinions on gun laws.

He asks married and unmarried men whether they are in favor of stricter gun laws.

Overall he doesn’t see a difference between the two groups, but when he looks at the data state by state he gets the following results (only the 15 states with the largest absolute differences are shown):

##    State Unmarried Married Difference
## 36    NJ      0.48    0.42      -0.06
## 37    CO      0.50    0.57       0.07
## 38    CT      0.57    0.50      -0.07
## 39    OR      0.53    0.60       0.07
## 40    AK      0.49    0.42      -0.07
## 41    MD      0.51    0.59       0.08
## 42    VA      0.48    0.57       0.09
## 43    SC      0.44    0.53       0.09
## 44    TX      0.42    0.51       0.09
## 45    IL      0.56    0.47      -0.09
## 46    OH      0.51    0.61       0.10
## 47    WV      0.42    0.52       0.10
## 48    ND      0.41    0.59       0.18
## 49    IN      0.39    0.58       0.19
## 50    MI      0.38    0.59       0.21

When he tests whether the state-by-state differences are significant, he finds that three are: North Dakota, Indiana, and Michigan. In each case, married men are more likely to favor stricter gun laws.
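
One way such a state-level test might be done is with a two-sample test of proportions. The underlying counts aren’t shown in the output above, so the sample sizes below are hypothetical, but for Michigan the test could look something like this:

# hypothetical counts: 38 of 100 unmarried men and 59 of 100 married men
# in MI favor stricter gun laws (sample sizes made up for illustration)
prop.test(x = c(38, 59), n = c(100, 100))
# with proportions this far apart and 100 respondents per group,
# the difference comes out significant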

What are some possible explanations for these results?

In this example, we might worry that there is actually no difference in any of the states, and that we saw the three small \(p\)-values simply by chance.

We can do a little simulation to see what would happen if the true effect were zero in all of the states. In that case, each state has a \(.05\) chance of producing a \(p\)-value less than \(.05\) (this follows from the definition of a \(p\)-value).

set.seed(2)
# simulate 50 hypothesis tests at .05
rejections = rbinom(50, 1, prob = .05)
# a value of 1 means we rejected the null, a value of 0 means we failed to reject
# here all of the nulls were true, but we rejected 4 of them
rejections
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [39] 0 0 1 0 0 0 0 0 1 0 0 0

We can repeat the simulation above many times to see how often we would reject three or more of the 50 hypotheses. Even when all of the nulls are true, we expect to reject at least three of them almost half of the time:

n_rejections = rep(NA, 5000)
for(i in 1:5000) {
    rejections = rbinom(50, 1, prob = .05)
    n_rejections[i] = sum(rejections)
}
table(n_rejections)
## n_rejections
##    0    1    2    3    4    5    6    7    8    9   10   11 
##  377  979 1340 1102  695  326  119   45   13    1    2    1
mean(n_rejections >= 3)
## [1] 0.4608
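
Since the number of rejections in this simulation is Binomial(50, .05), we can also compute this probability exactly:

# exact probability of rejecting at least 3 of 50 independent
# level-.05 tests when all of the nulls are true
pbinom(2, size = 50, prob = .05, lower.tail = FALSE)
# about .459, in line with the simulation estimate above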

Testing the global null hypothesis

The hypothesis that all of the individual null hypotheses are true is referred to as the global null: if \(H_{0i}\) refers to the \(i\)th null hypothesis, the global null is

\[ H_0 = \cap_{i=1}^n H_{0i}, \] the intersection of all the individual null hypotheses.

Suppose we are interested in testing hypotheses at level \(\alpha\) (we reject when the \(p\)-value is less than \(\alpha\)).

One of the simplest methods for testing the global null hypothesis is called Bonferroni’s Method.

The procedure is as follows: compute \(p\)-values \(p_1, \ldots, p_n\) for the \(n\) individual null hypotheses, and reject the global null at level \(\alpha\) if any \(p_i \le \alpha / n\), that is, if \(\min_i p_i \le \alpha / n\). By the union bound, the probability of rejecting when the global null is true is at most \(n \times \alpha / n = \alpha\).
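
As a quick check, here is a small simulation, under the assumption that the individual \(p\)-values are independent and uniform when their nulls are true:

# check Bonferroni's method by simulation, assuming the individual
# p-values are independent and Uniform(0, 1) under their nulls
alpha = .05
n_tests = 50
bonferroni_rejects = replicate(5000, {
    pvals = runif(n_tests)          # all 50 nulls are true
    min(pvals) <= alpha / n_tests   # reject the global null?
})
# the global null is rejected a bit less than 5% of the time, as guaranteed
mean(bonferroni_rejects)

In the gun-law example, this means we would only reject the global null if some state had a \(p\)-value below \(.05 / 50 = .001\).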

Notes

Multiple comparisons and EDA

Let’s set up a toy example.

Suppose we are interested in whether two variables, \(x\) and \(y\), are related to each other.

In reality they are not, but we are equipped with the full power of our EDA course and we are able to try out different transformations of the variables and different methods to check for the relationship.

## full disclosure: I started the seed at 0 and went up
## until I got a result with a p-value < .05
set.seed(4)
library(broom)
library(MASS)
library(ggplot2)  # used for the plots below
x = rgamma(20, 1, 1)
y = rgamma(20, 10, 10)

We start off by just plotting \(x\) vs. \(y\):

ggplot(data.frame(x, y), aes(x = x, y = y))+ geom_point() + stat_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

It doesn’t look like there’s much there. However, \(x\) looks quite skewed, with a long right tail, so maybe we should transform it to reduce the skewness.

Let’s plot \(y\) vs \(\text{log}(x)\):

ggplot(data.frame(x, y), aes(x = log(x), y = y))+ geom_point() + stat_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

That’s better, but maybe we should also use robust regression because we think there might be outliers in \(y\).

We plot the robust smooth as well:

ggplot(data.frame(x, y), aes(x = log(x), y = y))+ geom_point() + stat_smooth(method = "rlm")
## `geom_smooth()` using formula = 'y ~ x'

In the end, we fit three models. Maybe we change our minds and decide that robust regression isn’t necessary after all, and that the linear model with \(y\) predicted by \(\text{log}(x)\) is the best.

We then want a \(p\)-value for our collaborators or to publish, and we are happy to see that the coefficient is significant:

tidy(lm(y ~ log(x)))
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    0.885    0.0595     14.9  1.51e-11
## 2 log(x)        -0.141    0.0583     -2.43 2.60e- 2

What is the problem here?

Let’s look at the \(p\)-values for the other two models, the ones we considered but decided not to use:

tidy(lm(y ~ x))
## # A tibble: 2 × 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)   1.03      0.0901     11.4  0.00000000116
## 2 x            -0.0859    0.0577     -1.49 0.154
tidy(rlm(y ~ log(x)))
## # A tibble: 2 × 4
##   term        estimate std.error statistic
##   <chr>          <dbl>     <dbl>     <dbl>
## 1 (Intercept)    0.880    0.0645     13.6 
## 2 log(x)        -0.111    0.0631     -1.76
## the corresponding p-value for rlm:
pnorm(abs(tidy(rlm(y ~ log(x)))[2, "statistic", drop = TRUE]), lower.tail = FALSE) * 2
## [1] 0.07833073

Neither of the other two models would have given us a significant result at the \(.05\) level: the \(p\)-values were \(.15\) and \(.08\).

The problem is that if the other models had looked better to us, we might have used the \(p\)-values from them instead of from the model we chose, and so we are implicitly or unconsciously performing multiple tests.

To see why this invalidates our \(p\)-values, suppose we ungenerously describe the procedure we used to come up with our model as follows: fit \(y \sim x\) and check the \(p\)-value for the slope; if it isn’t significant, fit \(y \sim \log(x)\) and check again; if that still isn’t significant, fit a robust regression of \(y\) on \(\log(x)\); report the \(p\)-value from the last model we looked at.

What fraction of the time will you reject the null hypothesis?

We can see what happens by simulation:

pvals = rep(NA, 5000)
for(i in 1:5000) {
    x = rgamma(20, 1, 1)
    y = rgamma(20, 10, 10)
    out.lm = lm(y ~ x)
    pval = tidy(out.lm)[2,"p.value", drop=TRUE]
    if(pval > .05) {
        out.lm = lm(y ~ log(x))
        pval = tidy(out.lm)[2,"p.value", drop=TRUE]
    }
    if(pval > .05) {
        out.lm = rlm(y ~ log(x))
        pval = pnorm(-abs(tidy(out.lm)[2, "statistic", drop=TRUE])) * 2
    }
    pvals[i] = pval
}

We reject at the \(.05\) level about twice as often as we should, so reporting the nominal \(p\)-value would be misleading here.

mean(pvals <= .05)
## [1] 0.103
ggplot(data.frame(pval = pvals)) +
    stat_ecdf(aes(x = pval)) +
    scale_x_continuous(breaks = seq(0, 1, by = .05)) +
    scale_y_continuous(breaks = seq(0, 1, by = .05)) + 
    theme(axis.text.x = element_text(angle = 90, vjust = .5))

In our simulation, we made modeling decisions explicitly based on \(p\)-values. You can invalidate your \(p\)-values even if you make those decisions without ever looking at the \(p\)-values, as long as the decisions depend on the relationship you are testing; basing them on \(p\)-values just makes the procedure easier to simulate.
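
For example, suppose we instead choose between \(x\) and \(\log(x)\) by comparing \(R^2\), never looking at a \(p\)-value. Since the choice still depends on the observed relationship between \(x\) and \(y\), the rejection rate should still end up above the nominal level. A sketch of that variant:

# variant: pick the transformation of x that gives the higher R^2,
# never looking at a p-value, then report the chosen model's p-value
pvals_r2 = rep(NA, 5000)
for(i in 1:5000) {
    x = rgamma(20, 1, 1)
    y = rgamma(20, 10, 10)
    fit_raw = lm(y ~ x)
    fit_log = lm(y ~ log(x))
    # keep whichever model fits better
    if(summary(fit_raw)$r.squared >= summary(fit_log)$r.squared) {
        chosen = fit_raw
    } else {
        chosen = fit_log
    }
    pvals_r2[i] = tidy(chosen)[2, "p.value", drop = TRUE]
}
# the fraction rejected at .05 should again be above the nominal level
mean(pvals_r2 <= .05)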

Note 1

Notice that this isn’t quite as bad as running three independent tests: in that case we would expect about \(14\)% of the \(p\)-values to be less than \(.05\), but here we only get about \(10\)%.

## .95^3 is the probability that you fail to reject
## three null hypotheses at the .05 level, and so the
## following is the probability that you reject at
## least one of three null hypotheses at the .05 level
1 - .95^3
## [1] 0.142625

Note 2

We only looked at three potential models/tests in the simulation here, but in practice the problem can be much, much worse: we also choose how to transform the variables, which observations to treat as outliers, how to code the variables, and which model to fit.

This can lead to an enormous number of potential tests. Suppose you have two predictors and a response, each of which can be transformed in four ways, three potential outliers (each of which can be kept or dropped), two coding choices, and five potential models. This gives \[ 4 \times 4 \times 4 \times 2^3 \times 2 \times 5 = 5120 \] distinct combinations of analysis choices.

In principle, we could count up all the possible tests and do a Bonferroni correction, but in practice the Bonferroni cutoff will be too strict for you to ever reject.
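
For instance, with the \(5120\) combinations above, the corrected cutoff would be:

# Bonferroni cutoff if we corrected for all 5120 potential analyses
.05 / (4 * 4 * 4 * 2^3 * 2 * 5)
# roughly 1e-05, far too strict a threshold to ever reject in practice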

Note 3

Exploration doesn’t always invalidate your \(p\)-values: it is only a problem if your analysis choices depend on the relationship you want to test.

Let’s try another simulation to illustrate the point above.

Suppose we have the same setup as in the first simulation, but now we decide whether or not to transform \(x\) and \(y\) based on how skewed they are, instead of based on the significance of the relationship.

Our procedure: if \(x\) has skewness greater than 1.5, replace it with \(\log(x)\); if \(y\) has skewness greater than 1, replace it with \(\log(y)\); then fit a linear model of \(y\) on \(x\) and record the \(p\)-value for the slope.

# skewness() is not in base R; we assume here that it comes from the
# e1071 package (the moments package provides an equivalent function)
library(e1071)
pvals2 = rep(NA, 5000)
for(i in 1:5000){
    x = rgamma(20, 1, 1)
    y = rgamma(20, 10, 10)
    # only transform a variable when it looks noticeably skewed
    if(skewness(x) > 1.5)
        x = log(x)
    if(skewness(y) > 1)
        y = log(y)
    out.lm = lm(y ~ x)
    pvals2[i] = tidy(out.lm)[2,"p.value", drop = TRUE]
}

Now we see that the \(p\)-values are distributed the way they should be. In this case, looking at the variables and transforming them didn’t invalidate our \(p\)-values.

mean(pvals2 <= .05)
## [1] 0.0536
ggplot(data.frame(pval = pvals2)) +
    stat_ecdf(aes(x = pval)) +
    scale_x_continuous(breaks = seq(0, 1, by = .05)) +
    scale_y_continuous(breaks = seq(0, 1, by = .05)) + 
    theme(axis.text.x = element_text(angle = 90, vjust = .5))
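
We could also compare the simulated \(p\)-values directly to the uniform distribution they should follow, for example with a Kolmogorov-Smirnov test:

# compare the p-values to Uniform(0, 1); a non-small p-value from
# this test is consistent with the uniform distribution we expect
ks.test(pvals2, "punif")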

Remedies