Julia Fukuyama
We have a data set containing clinical data on patients who were about to receive a radical prostatectomy. The relevant variables are:
Variables about sizes:

- lcavol: log cancer volume
- lweight: log prostate weight
- lbph: log of the amount of benign prostatic hyperplasia

Variables about the individual:

- age: Age in years

Variables measured by the pathologist:

- svi: Seminal vesicle invasion, a measure of how advanced the cancer is.
- lcp: log of capsular penetration
- gleason: A numeric vector giving the Gleason score. In theory it can range from 2-10, in practice it ranges from 6-10.
- pgg45: Percent of cells with Gleason score 4 or 5.

The “response” variable:

- lpsa: log of the concentration of prostate-specific antigen.

We are primarily interested in lpsa, which is used as a marker for prostate cancer. We would like to know whether and how it is related to the other variables we have available to us.
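The summary below was presumably produced by calling summary on the full data frame. A minimal sketch, assuming the data is the prostate data set distributed with the ElemStatLearn package (the original does not show how prostate was loaded):
library(ElemStatLearn)   ## assumed data source, not shown in the original
data(prostate)
summary(prostate)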
## lcavol lweight age lbph
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.629 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.876 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :4.780 Max. :79.00 Max. : 2.3263
## svi lcp gleason pgg45
## Min. :0.0000 Min. :-1.3863 Min. :6.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00
## Median :0.0000 Median :-0.7985 Median :7.000 Median : 15.00
## Mean :0.2165 Mean :-0.1794 Mean :6.753 Mean : 24.38
## 3rd Qu.:0.0000 3rd Qu.: 1.1787 3rd Qu.:7.000 3rd Qu.: 40.00
## Max. :1.0000 Max. : 2.9042 Max. :9.000 Max. :100.00
## lpsa train
## Min. :-0.4308 Mode :logical
## 1st Qu.: 1.7317 FALSE:30
## Median : 2.5915 TRUE :67
## Mean : 2.4784
## 3rd Qu.: 3.0564
## Max. : 5.5829
From the ggpairs plot, we see that the distributions of the variables are reasonably symmetric and not that skewed, and that there is at least some relationship between most of the variables and lpsa.
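The pairs plot itself is not reproduced here; it could be made with GGally along these lines (a sketch, and dropping the train/test indicator is an assumption):
library(GGally)
## scatterplot matrix of all variables except the train/test indicator
ggpairs(prostate %>% select(-train))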
And a side note: this data set has already been transformed for you. Many of the variables are logged versions of what were presumably the raw measurements. This is actually a really important step: models with the logged variables perform substantially better than models with the raw variables, as we can see if we un-transform:
prostate_unlogged = prostate %>%
mutate(cavol = exp(lcavol), weight = exp(lweight), bph = exp(lbph), cp = exp(lcp)) %>%
select(-"lcavol", -"lweight", -"lbph", -"lcp")
summary(lm(lpsa ~ ., data = prostate))
summary(lm(lpsa ~ ., data = prostate_unlogged))
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = lpsa ~ ., data = prostate_unlogged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78876 -0.40705 -0.00634 0.46725 1.82792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.759219 1.210444 0.627 0.53214
## age -0.008545 0.012087 -0.707 0.48144
## svi 0.769166 0.270646 2.842 0.00557 **
## gleason 0.130183 0.166701 0.781 0.43694
## pgg45 0.005412 0.004567 1.185 0.23927
## cavol 0.073661 0.014479 5.087 2.03e-06 ***
## weight 0.012511 0.005322 2.351 0.02095 *
## bph 0.056512 0.034547 1.636 0.10546
## cp -0.040047 0.034216 -1.170 0.24499
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.764 on 88 degrees of freedom
## Multiple R-squared: 0.5985, Adjusted R-squared: 0.562
## F-statistic: 16.4 on 8 and 88 DF, p-value: 1.329e-14
If we had started off with the raw data, we would have seen that we should log-transform some of the variables by looking at their marginal distributions: the variables that were transformed started off quite skewed, and the transformation got rid of the skewness.
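A quick way to see this in the data we do have is to compare the marginal distribution of a raw variable to its logged version; a sketch (using cancer volume just for illustration):
## raw cancer volume: strongly right-skewed
ggplot(prostate, aes(x = exp(lcavol))) + geom_histogram(bins = 20)
## log cancer volume: much more symmetric
ggplot(prostate, aes(x = lcavol)) + geom_histogram(bins = 20)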
As a first step, we can fit a linear model with all the predictors and look at the results. We see that a lot of the coefficients are within the margin of error of zero, which suggests to us that a model with fewer predictors would do better.
There are a lot of ways of doing variable selection for linear models. With eight predictors and a laptop, we can actually just look through all 2^8 = 256 possible subsets and see which model performs best. This is what the regsubsets function in the leaps package does for you.
library(leaps)
prostate_best_subset = regsubsets(lpsa ~ ., data = prostate)
summary(prostate_best_subset)$which
## (Intercept) lcavol lweight age lbph svi lcp gleason pgg45
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
This shows us that the best one-predictor model uses lcavol, the best two-predictor model uses lcavol and lweight, the best three-predictor model uses lcavol, lweight, and svi, and so on.
We can use this set of models as guidance for what variables to include, and build up an interpretable model using some of the tools we’ve seen earlier in the course.
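If we want the fitted coefficients for the best model of a given size, leaps provides a coef method for regsubsets objects; for example, for the best three-predictor model:
## coefficients of the best three-predictor model (lcavol, lweight, svi)
coef(prostate_best_subset, id = 3)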
regsubsets also does forward and backward stepwise regression. In this case, all three approaches give the same results.
prostate_forward = regsubsets(lpsa ~ ., data = prostate, method = "forward")
summary(prostate_forward)$which
## (Intercept) lcavol lweight age lbph svi lcp gleason pgg45
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
prostate_backward = regsubsets(lpsa ~ ., data = prostate, method = "backward")
summary(prostate_backward)$which
## (Intercept) lcavol lweight age lbph svi lcp gleason pgg45
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We start off looking at the best one-predictor model, with lcavol predicting lpsa.
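The code that produced the plot is not shown above; it was presumably something like this (a sketch):
ggplot(prostate, aes(x = lcavol, y = lpsa)) +
    geom_point() +
    geom_smooth()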
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We see that the relationship is pretty close to linear (a line would go through the entire confidence band of the smoother), and so we're ok with using a linear function of lcavol to predict lpsa.
If there had been a major non-linearity here, we would have wanted to ditch the linear modeling approach and do something non-parametric, maybe loess.
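For reference, a non-parametric fit would look something like this (a sketch; the span value is arbitrary):
## loess fit of lpsa on lcavol as a non-parametric alternative
prostate_loess = loess(lpsa ~ lcavol, data = prostate, span = 1)
## fitted values at the observed points
head(predict(prostate_loess))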
Then we can move to the two-predictor model, and see what the relationship between lcavol, lweight, and lpsa looks like.
We can make a coplot to examine the relationship between lpsa and lcavol, with lweight as the given variable:
ggplot(prostate, aes(x = lcavol, y = lpsa)) + geom_point() + geom_smooth(span = 1) +
facet_grid(~cut_number(lweight, n = 3))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Here it seems like there is some non-linearity, but only for the observations with a high value of lweight.
If we had a lot more observations, this might prompt us to move to loess, but since the non-linearity is based on just a few points, we want to see first whether the non-linearity shows up in other graphs as well.
The best three-predictor model identified by leaps included svi in addition to lcavol and lweight, so we next look at those four variables together. Since svi is binary and there are only 21 cases where svi is equal to 1, we can't make many facets, and we don't trust fitted curves that much.
##
## 0 1
## 76 21
prostate$svi = recode(prostate$svi, `1` = "Yes", `0` = "No")
ggplot(prostate, aes(x = lcavol, y = lpsa, group = svi, color = svi)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + facet_wrap(~ cut_number(lweight, n = 2))
## `geom_smooth()` using formula = 'y ~ x'
The different slopes suggest an interaction, but again, we don’t have that much data.
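One way to follow up would be to compare the additive model to one with an lcavol-by-svi interaction; a sketch (this comparison is not part of the original analysis):
## additive model versus a model that lets the lcavol slope depend on svi
lm_additive = lm(lpsa ~ lcavol + lweight + svi, data = prostate)
lm_interaction = lm(lpsa ~ lcavol * svi + lweight, data = prostate)
anova(lm_additive, lm_interaction)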
Next we look at the predictors from the four-predictor model. This is pushing the limits of the number of variables we can examine at once, but we will try.
We can look at the relationship between lpsa, lcavol, and svi, with lweight and lbph as the given variables.
ggplot(prostate, aes(x = lcavol, y = lpsa, group = svi, color = svi)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_grid(cut_number(lweight, n = 2) ~ cut(lbph, breaks = c(-2, -1, 3)))
## `geom_smooth()` using formula = 'y ~ x'
Based on this visualization, we might not be confident about an interaction: the blue lines vary in slope, but they are based on very small samples. The red lines have different heights but are similar in slope.
There are also automatic ways of choosing model sizes, such as the Cp and BIC criteria computed by regsubsets. With each of these, we choose the model with the lowest value of the selection criterion. In EDA, though, we're not always required to find a “best” model, and even if we were, we can decide what best means subjectively.
If you wanted to fit a linear model with lcavol, lweight, and svi as predictors plus interactions, you're free to do so and then call that “best” because of the complexity you can get out of a relatively small number of variables.
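For instance, such a model could be fit along these lines (a sketch, including all pairwise interactions among the three predictors):
## lcavol, lweight, and svi plus all their pairwise interactions
lm_interactions = lm(lpsa ~ (lcavol + lweight + svi)^2, data = prostate)
summary(lm_interactions)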
We can also check what the model selection criteria tell us about how many variables to use:
cp_df = data.frame(value = summary(prostate_best_subset)$cp,
n_params = seq_along(summary(prostate_best_subset)$cp),
type = "Cp")
bic_df = data.frame(value = summary(prostate_best_subset)$bic,
n_params = seq_along(summary(prostate_best_subset)$bic),
type = "BIC")
model_selection_criterion_df = rbind(cp_df, bic_df)
ggplot(model_selection_criterion_df, aes(x = n_params, y = value)) +
geom_point() + facet_wrap(~ type, scales = "free_y")
Even though the lowest value of Cp is for five predictors, the Cp value for three predictors is very close to it, and BIC has a strong minimum at three predictors. The two plots suggest that if we don't believe we need any interactions, we should use a model with an intercept and three predictors.
We would like to do a coplot of lpsa as a function of lcavol, conditioned on lweight and svi, that has fits and predicted values. There are a couple of issues, each of which is handled in the code below:

- We need to bin lweight into intervals (with cut_number).
- The fit is continuous in lweight, so we need to choose one value to compute a fit at in each of the bins (the natural thing to do is use the midpoint of the bin).
- Getting the bin midpoints requires pulling the endpoints out of the interval labels that cut_number creates (with separate and mutate, along with regular expressions).
- The predictions on the grid need to be matched back up with the bins so that they are drawn in the correct facets (with merge).

lm_final = lm(lpsa ~ lcavol + lweight + svi, data = prostate)
final_fits = augment(lm_final)
## get the bins and the bin means
prostate_with_lweight_bins = prostate %>%
mutate(lweight_bins = cut_number(lweight, n = 3)) %>%
separate(lweight_bins, into = c(NA, "lo", "hi", NA), remove = FALSE, sep = "\\[|\\(|\\)|\\]|,") %>%
mutate(bin_mean = (as.numeric(lo) + as.numeric(hi)) / 2)
## get the grid for prediction
prostate_grid_final = expand.grid(lweight = unique(prostate_with_lweight_bins$bin_mean),
lcavol = quantile(prostate$lcavol),
svi = c("Yes", "No")) %>% data.frame()
## predictions on the grid
prostate_fits_on_grid = augment(lm_final, newdata = prostate_grid_final)
## add information about the bins to the fits
prostate_fits_on_grid = merge(prostate_fits_on_grid,
unique(prostate_with_lweight_bins[,c("lweight_bins", "bin_mean")]),
by.x = "lweight", by.y = "bin_mean")
## subset so we don't plot the fits outside the range of the data
prostate_fits_on_grid = prostate_fits_on_grid %>% subset(!(svi == "Yes" & lcavol < 1))
ggplot(prostate_with_lweight_bins, aes(x = lcavol, y = lpsa, color = svi)) +
geom_point() +
geom_line(aes(y = .fitted), data = prostate_fits_on_grid) +
facet_wrap(~ lweight_bins) +
scale_x_continuous("Log Cancer Volume") + scale_y_continuous("Log PSA") +
scale_color_discrete("SVI") +
ggtitle("Log PSA as a function of log cancer volume,\nfaceted by log weight")
ggplot(prostate_with_lweight_bins, aes(x = exp(lcavol), y = exp(lpsa), color = factor(svi))) +
geom_point() +
facet_wrap(~ lweight_bins) +
geom_line(aes(y = exp(.fitted)), data = prostate_fits_on_grid) +
scale_x_continuous("Cancer Volume") + scale_y_continuous("PSA") +
scale_color_discrete("SVI") +
ggtitle("PSA as a function of cancer volume,\nfaceted by log weight")
The final plots highlight some of the deficiencies of the additive model.
The next model to consider might be additive for the low log weights and non-linear with no SVI effect for the high values of log weight.
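A sketch of what that next model might look like, splitting on the lweight bins from the coplot (the split point and the exact form are assumptions, not something worked out above):
## lower two log-weight bins: additive linear model
prostate_low = subset(prostate_with_lweight_bins, as.numeric(lweight_bins) < 3)
lm_low = lm(lpsa ~ lcavol + lweight + svi, data = prostate_low)
## top log-weight bin: non-parametric in lcavol, with no svi effect
prostate_high = subset(prostate_with_lweight_bins, as.numeric(lweight_bins) == 3)
loess_high = loess(lpsa ~ lcavol, data = prostate_high, span = 1)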