Julia Fukuyama
Reading: Wickham (2010), A Layered Grammar of Graphics, JCGS.
ggplot2 and the grammar of graphics
What does “grammar of graphics” mean?
The analogy with English grammar, or any language’s grammar, is that it allows you to put together component parts.
Better than “grammar of graphics” might be the “orthogonal components of graphics,” but that doesn’t have the same alliterative appeal.
The power of the grammar of graphics is that it is modular: different aspects of the plot can be specified independently of each other.
As an example, the coordinate system is specified separately from the geometric object used to represent the points.
Here we have three representations of the same data, the only difference between them being the coordinate system used to represent them.
Again, the same dataset, three different coordinate systems, very different representations:
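The figures for these slides are not reproduced here, so the following is only a sketch of the idea, not the original example. It assumes the built-in diamonds data and a stacked bar of counts by cut; the layer is specified once and only the coordinate system changes.

```r
library(ggplot2)

# One layer specification: a single stacked bar of diamond counts by cut.
base <- ggplot(diamonds, aes(x = factor(1), fill = cut)) +
  geom_bar(width = 1)

# Only the coordinate system differs between the three plots.
p_cartesian <- base + coord_cartesian()          # vertical stacked bar
p_flipped   <- base + coord_flip()               # horizontal stacked bar
p_polar     <- base + coord_polar(theta = "y")   # pie chart

# The coordinate system is stored separately from the layers.
class(p_cartesian$coordinates)[1]
class(p_polar$coordinates)[1]
```

Printing each of the three objects draws the corresponding plot.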
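The grammar builds a plot from:
One or more layers, each with one geometric object (geom_*), one statistical transformation (stat_*), one position adjustment (position_*), and one dataset and set of aesthetic mappings.
A coordinate system (coord_*).
A faceting specification (facet_*).
Layers are the most important and involved part of the plot.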
ggplots are composed of one or more layers.
The geom_* or stat_* commands create a layer for you.
You can also create a layer with the layer command, and we’ll see that today for teaching purposes, but in practice you make layers with geom_* or stat_*.
The ggplot function on its own does not create a layer.
Data: self-evident. For ggplot the data needs to be formatted as a tibble or a data.frame.
Aesthetic mapping:
Data and aesthetic mapping go together because they are not at all independent of each other: the aesthetic mapping takes variables in your data and maps them to aesthetics/perceivable parts of the plot and is therefore specific to a dataset.
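A small sketch of what an aesthetic mapping looks like in code, assuming the built-in diamonds data (the variable choices are illustrative). It also contrasts mapping a variable to an aesthetic with setting an aesthetic to a constant.

```r
library(ggplot2)

# Mapping: columns of `diamonds` are tied to aesthetics of the plot.
m <- aes(x = carat, y = price, colour = cut)
names(m)

# The mapping is only meaningful for data that has these columns.
p_mapped <- ggplot(diamonds, m) + geom_point()

# Setting, not mapping: a constant colour is given outside aes().
p_set <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(colour = "steelblue")
```

In p_mapped the colour carries information about cut; in p_set it carries none.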
A statistical transformation is some transformation of the data.
There’s some overlap between these and position adjustments: often there is more than one way to create the same plot.
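To make "some transformation of the data" concrete, here is a sketch using binning as the transformation. The choice of stat_bin, the diamonds data, and ten bins are assumptions for illustration; layer_data() is used to look at the data after the stat has been applied.

```r
library(ggplot2)

p_hist <- ggplot(diamonds, aes(x = carat)) +
  stat_bin(bins = 10)

# layer_data() returns the layer's data after the statistical
# transformation: one row per bin, with computed variables such as `count`.
binned <- layer_data(p_hist)
head(binned[, c("x", "count")])

# Every diamond falls in exactly one bin.
sum(binned$count) == nrow(diamonds)
```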
Geometric objects (geom_*) control the type of plot you create. One example is the errorbar geom:
library(ggplot2)

## data for errorbar geom
df <- data.frame(
  trt = factor(c(1, 1, 2, 2)),
  resp = c(1, 5, 3, 4),
  group = factor(c(1, 2, 1, 2)),
  upper = c(1.1, 5.3, 3.3, 4.2),
  lower = c(0.8, 4.6, 2.4, 3.6)
)
## errorbar geom: ymin and ymax come from the lower and upper columns
ggplot(df, aes(x = trt, y = resp, colour = group)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
Every statistic has a default geometric object, and every geometric object has a default statistic.
Stats and geoms are not completely orthogonal: not every combination is valid (although many are).
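The default pairings can be read off a layer object, and a non-default pairing can be requested through the geom argument of a stat. This is a sketch assuming the diamonds data and 30 bins, both chosen only for illustration.

```r
library(ggplot2)

# Default pairings, read from the layer objects themselves.
class(geom_histogram()$stat)[1]   # the stat that geom_histogram uses
class(stat_bin()$geom)[1]         # the geom that stat_bin uses

# A non-default but valid pairing: binned counts drawn as points.
p_points <- ggplot(diamonds, aes(x = carat)) +
  stat_bin(geom = "point", bins = 30)
```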
For example:
stat_bin and geom_bar is valid and standard for a histogram.
stat_bin and geom_point, or stat_bin and geom_line, are valid but non-standard combinations. They give a plot that is similar to a histogram and that is interpretable in the same way.
stat_identity and geom_boxplot is invalid, because the boxplot needs certain computed quantities from the data.
Position adjustments (position_*) are used to avoid “collisions” in the plot objects:
## [[1]]
## mapping: x = ~color, y = ~price
## geom_boxplot: outlier.colour = NULL, outlier.fill = NULL, outlier.shape = 19, outlier.size = 1.5, outlier.stroke = 0.5, outlier.alpha = NULL, notch = FALSE, notchwidth = 0.5, varwidth = FALSE, na.rm = FALSE, orientation = NA
## stat_boxplot: na.rm = FALSE, orientation = NA
## position_dodge2
ggplot(diamonds) + geom_boxplot(aes(x = color, y = price, color = clarity), position = position_dodge(width = 1))
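position = "dodge2" is the default for boxplots (see the position_dodge2 entry in the layer printout above), so you don’t need to specify a position adjustment to get side-by-side boxes.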
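Position adjustments are easiest to compare on a geom where several are in common use. The sketch below swaps in bar charts rather than boxplots (an assumption made for illustration) and changes nothing but the position argument.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = color, fill = clarity))

p_stack <- base + geom_bar()                     # default for bars: stacked
p_dodge <- base + geom_bar(position = "dodge")   # side by side
p_fill  <- base + geom_bar(position = "fill")    # stacked, rescaled to proportions

# The position adjustment is stored on the layer.
class(p_stack$layers[[1]]$position)[1]
class(p_dodge$layers[[1]]$position)[1]
```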
So far, we’ve defined aesthetic mappings that specify which perceivable aspects of the plot correspond to which variables.
However, there are many ways to map variables to perceivable aspects of the plot.
For example, if we map a categorical variable that takes values “A” and “B” to the color aesthetic, color is going to represent whether the variable took value “A” or “B”. But we could do that in a practically infinite number of ways, e.g.
A maps to “red”, B maps to “black”
A maps to “green”, B maps to “blue”
A maps to “purple”, B maps to “gold”
… You probably get the picture
ggplot has good default mappings from values into aesthetic space*, but you will sometimes want to set them by hand.
To do so, you use the scale_* functions.
*This is true now: the old version of ggplot had poor mappings from continuous values to colors, and the viridis color scheme was much better. Recent versions of ggplot2 use viridis by default for ordered factors; for continuous values the default is still a blue gradient, and viridis is available through scale_color_viridis_c.
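As a sketch of setting a scale by hand, the first mapping in the list above (A to red, B to black) can be written with scale_colour_manual. The data frame toy is made up for illustration.

```r
library(ggplot2)

# Hypothetical two-row data set with a categorical variable g.
toy <- data.frame(x = c(1, 2), y = c(1, 2), g = c("A", "B"))

p_scale <- ggplot(toy, aes(x = x, y = y, colour = g)) +
  geom_point() +
  scale_colour_manual(values = c(A = "red", B = "black"))

# The colours the scale assigns to each row of the data.
layer_data(p_scale)$colour
```

Swapping in a different values vector changes the colours without touching the aesthetic mapping.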
Another aspect of the plot that can be specified independently of everything else is the coordinate system.
coord_cartesian is the default, and is almost always what you want.
coord_flip is sometimes useful: for example, boxplots traditionally required the explanatory variable to be mapped to x, so a horizontal boxplot needed coord_flip. (Since ggplot2 3.3.0 you can also map the explanatory variable to y directly.)
coord_polar will often make your plots look cooler and more difficult to read. Not usually recommended.
Faceting (facet_*) allows you to make small multiples of plots.
Other grammars/plotting systems think of this as just a fancy coordinate system, but it turns out that it’s easier to use if you think about it separately.
Each facet plots a subset of the data, and it takes as input what variable(s) to use to make the subsets and how to lay out the subsets.
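A sketch of both inputs, the subsetting variable(s) and the layout, using ggplot2's facet_wrap and facet_grid. The diamonds data and the choice of cut and color as faceting variables are assumptions for illustration.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

# One subsetting variable; panels are laid out one after another and wrapped.
p_wrap <- base + facet_wrap(vars(cut))

# Two subsetting variables: rows by cut, columns by color.
p_grid <- base + facet_grid(rows = vars(cut), cols = vars(color))
```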
The two options are:
facet_wrap: Lays out the plots for each subset sequentially.
facet_grid: Subsets the data according to two separate variables. The facet position along the \(x\)-axis represents levels of one variable, and the facet position along the \(y\)-axis represents levels of the other variable.
One way to specify a ggplot is to specify all of the components we’ve seen.
If you understand all the parts, this is probably the least confusing way to specify a ggplot.
The problem is that it’s very verbose. Suppose we want to make a plot with points and a smoother from the diamonds dataset. We can specify data, mapping, geom, stat, and positions for each layer, along with scales and the coordinate system as follows:
ggplot() +
  layer(
    data = diamonds, mapping = aes(x = carat, y = price),
    geom = "point", stat = "identity", position = "identity") +
  layer(
    data = diamonds, mapping = aes(x = carat, y = price),
    geom = "smooth", position = "identity", stat = "smooth",
    params = list(method = "lm")) +
  scale_x_log10() + scale_y_log10() + coord_cartesian()
## `geom_smooth()` using formula = 'y ~ x'
The more standard way of writing the same plot would be:
p = ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() +
  stat_smooth(method = "lm") +
  scale_x_log10() +
  scale_y_log10()
This is still fairly long, but we don’t have to specify the defaults:
Position: the default for both geom_point and stat_smooth is position = "identity".
Stat for geom_point: the default stat for geom_point is stat = "identity".
Geom for stat_smooth: the default geom for stat_smooth is geom_smooth.
Coordinate system: coord_cartesian is always the default.
You can check what stat, geom, and position are used for each of the layers:
## [1] "data" "layers" "scales" "mapping" "theme"
## [6] "coordinates" "facet" "plot_env" "labels"
## [[1]]
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
##
## [[2]]
## geom_smooth: se = TRUE, na.rm = FALSE, orientation = NA
## stat_smooth: method = lm, formula = NULL, se = TRUE, n = 80, fullrange = FALSE, level = 0.95, na.rm = FALSE, orientation = NA, method.args = list(), span = 0.75
## position_identity
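The code that produced the printout above is not shown. One plausible way to get this kind of information, assuming the plot object p defined earlier, is to look at the layers element of the plot; the same element also confirms that ggplot() on its own creates no layer.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() +
  stat_smooth(method = "lm") +
  scale_x_log10() +
  scale_y_log10()

# ggplot() alone creates no layer; each geom_*/stat_* call adds one.
length(ggplot(diamonds)$layers)
length(p$layers)

# Printing the layers shows the geom, stat, and position of each one.
p$layers
```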
One of the most famous statistical graphics, created by Charles Minard, depicts Napoleon’s 1812 invasion of and retreat from Russia.
## Rows: 48 Columns: 5
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): direction
## dbl (4): long, lat, surviv, division
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 20 Columns: 3
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): city
## dbl (2): long, lat
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s add another layer for the cities:
plot_both = plot_troops +
geom_text(aes(x = long, y = lat, label = city), data = minard_cities)
plot_both
Notice: we have a new dataset for this layer.
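The mechanism is the data argument of the layer function: a layer given its own data uses it, and a layer without one inherits the plot's data. The sketch below uses two small made-up data frames (paths and places are hypothetical and are not the Minard data) to show this.

```r
library(ggplot2)

# Hypothetical toy data, not the Minard data.
paths  <- data.frame(x = c(0, 1, 2), y = c(0, 1, 0))
places <- data.frame(x = c(0, 2), y = c(0, 0), name = c("start", "end"))

p_two <- ggplot(paths, aes(x = x, y = y)) +
  geom_path() +                                 # inherits `paths` from ggplot()
  geom_text(aes(label = name), data = places)   # uses its own data, `places`

# The text layer carries its own data frame; the path layer does not.
is.data.frame(p_two$layers[[2]]$data)
is.data.frame(p_two$layers[[1]]$data)
```

Because x and y are mapped in ggplot(), the text layer only needs to add the label mapping, provided its data has columns with the same names.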
Some things that we still don’t like about this plot:
A “final” version of the plot, with better scales:
ggplot(minard) +
  geom_path(aes(x = long, y = lat, color = direction, size = surviv^2, group = division),
            lineend = "round") +
  geom_text(aes(x = long, y = lat, label = city), data = minard_cities, size = 3) +
  scale_size(range = c(.18, 15), breaks = (1:3 * 10^5)^2,
             labels = paste(" ", scales::comma(1:3 * 10^5)), "Survivors") +
  scale_color_manual(values = c("grey50", "red"), breaks = c("A", "R"),
                     labels = c("Advance", "Retreat"), "") +
  xlab(NULL) + ylab(NULL) +
  theme(axis.text = element_blank(), axis.ticks = element_blank(),
        panel.grid = element_blank())
Remember our passage from Tukey:
Exploratory data analysis is detective work… As all detective stories remind us, many of the circumstances surrounding a crime are accidental or misleading. Equally, many of the indications to be discerned in bodies of data are accidental or misleading. To accept all appearances as conclusive would be destructively foolish, either in crime detection or in data analysis. To fail to collect all appearances because some – or even most – are only accidents would, however, be gross misfeasance deserving (and often receiving) appropriate punishment.
The flexibility in the grammar of graphics allows us to collect many more “appearances” in the data than we could if we only had access to a handful of named plots.
Many of the plots that we can make with ggplot are not useful, but the point is to try visualizing the data in many different ways. ggplot opens up a very large space of statistical graphics to us for not very much effort.