Reading: Hadley Wickham, “The Split/Apply/Combine Strategy for Data Analysis”.
Split/apply/combine with base R, lots of different functions for different tasks
lapply
/s
apply for applying to
lists/vectors, giving different kinds of output
apply
for applying row-by-row to matrices or data
frames
mapply
for applying to multiple vectors
tapply
for splitting a vector and then applying a
function to the groups
by
for splitting a data frame and then applying a
function to the groups
split
for just splitting a vector or data frame
up
plyr/dplyr is going to clean this up for us. The plyr functions have the goals as base R functions, but with
more consistent interface
guaranteed output type
All plyr functions named **ply
First position stands for the input and split type
Second position stands for the output type
The possible vaues for the positions are
l
for list input/chopping the list into its
individual elements
a
for array input/slicing by dimension into
lower-dimensional pieces
d
for data frame/subsetting by combinations of
variables
What you will have you specify:
Data the function should be applied to
How the data should be split
The function to be applied to each split (the processing function)
l*ply
syntax: l*ply(data, fun)
data
is a list containing the data you want the
function to be applied to.
fun
is a function that will be applied to each
element of the list.
Notice that there is no specification for the split:
l*ply
assumes that you split the list into its individual
elements.
Example:
l*ply
will also work on vectors, e.g.:
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
a*ply
syntax: a*ply(data, margin, fun)
data
is the data to apply the function to. Should be
an array/matrix (well, almost, we’ll complicate in a couple
slides).
margin
describes how the data should be split: for
matrices this is either by row or by column. 1 indicates split by row, 2
indicates split by column.
fun
is the function to apply to each split of the
data (generally a row or column vector).
Example:
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## 1 2
## 9 12
## X1 V1
## 1 1 9
## 2 2 12
## $`1`
## [1] 9
##
## $`2`
## [1] 12
##
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
## X1
## 1 1
## 2 2
A couple of extras:
Works on higher-dimensional arrays, in which case margin can be 1,…,p, where p is the dimension of the array.
Margin can also be a vector, in which case the array is split on the combination of the dimensions.
The function technically works on anything with dimensions and multi-dimensional indexing, so you can pass data frames as well as arrays/matrices.
Example on a 3-dimensional array:
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
## [1] 4 4 2
## $Hair
## [1] "Black" "Brown" "Red" "Blond"
##
## $Eye
## [1] "Brown" "Blue" "Hazel" "Green"
##
## $Sex
## [1] "Male" "Female"
## Hair Eye V1
## 1 Black Brown 68
## 2 Brown Brown 119
## 3 Red Brown 26
## 4 Blond Brown 7
## 5 Black Blue 20
## 6 Brown Blue 84
## 7 Red Blue 17
## 8 Blond Blue 94
## 9 Black Hazel 15
## 10 Brown Hazel 54
## 11 Red Hazel 14
## 12 Blond Hazel 10
## 13 Black Green 5
## 14 Brown Green 29
## 15 Red Green 14
## 16 Blond Green 16
Play around with replacing the margin vector with others, and convince yourself of why you get the output you do.
d*ply
syntax: d*ply(data, variables, fun)
data
is the data to apply the function to. Should be
a data frame, but it will be ok if you pass a matrix.
variables
describes the variables used to split the
data, and you specify them as .(var1, var2, ... , varN)
You can give it a single factor variable, in which case the
data is split by the levels of that factor, or several factor variables,
in which case the data is split on all combinations of those factors.
The syntax is special to indicate that the variables are taken
first from the data frame in data and then, if they aren’t found there,
from the global environment.
fun
is the function to apply to each split of the
data.
For example: Star Wars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Suppose we want to know the average height of individuals on each world.
First ask:
What variable(s) are we splitting on?
What function do we apply to each subset of the data?
First define a function that will take a subset of the data and find return to us the average height:
get_avg_height <- function(data_subset) {
avg_height <- mean(data_subset$height)
return(avg_height)
}
Then split starwars
on the homeworld
variable and apply get_avg_height
to each data subset:
## homeworld V1
## 1 Alderaan 176.3333
## 2 Aleen Minor 79.0000
## 3 Bespin 175.0000
## 4 Bestine IV 180.0000
## 5 Cato Neimoidia 191.0000
## 6 Cerea 198.0000
## 7 Champala 196.0000
## 8 Chandrila 150.0000
## 9 Concord Dawn 183.0000
## 10 Corellia 175.0000
## 11 Coruscant 173.6667
## 12 Dathomir 175.0000
## 13 Dorin 188.0000
## 14 Endor 88.0000
## 15 Eriadu 180.0000
## 16 Geonosis 183.0000
## 17 Glee Anselm 196.0000
## 18 Haruun Kal 188.0000
## 19 Iktotch 188.0000
## 20 Iridonia 171.0000
## 21 Kalee 216.0000
## 22 Kamino 208.3333
## 23 Kashyyyk 231.0000
## 24 Malastare 112.0000
## 25 Mirial 168.0000
## 26 Mon Cala 180.0000
## 27 Muunilinst 191.0000
## 28 Naboo 177.2727
## 29 Nal Hutta 175.0000
## 30 Ojom 198.0000
## 31 Quermia 264.0000
## 32 Rodia 173.0000
## 33 Ryloth 179.0000
## 34 Serenno 193.0000
## 35 Shili 178.0000
## 36 Skako 193.0000
## 37 Socorro 177.0000
## 38 Stewjon 182.0000
## 39 Sullust 160.0000
## 40 Tatooine 169.8000
## 41 Toydaria 137.0000
## 42 Trandosha 190.0000
## 43 Troiken 122.0000
## 44 Tund 163.0000
## 45 Umbara 178.0000
## 46 Utapau 206.0000
## 47 Vulpter 94.0000
## 48 Zolan 168.0000
## 49 <NA> NA
Slightly more complicated example: we want to know the fraction of individuals with yellow eyes on each world.
What variable are we splitting on?
What function do we apply to each subset of the data?
get_frac_yellow_eyes <- function(data_subset) {
num_yellow_eyes <- sum(data_subset$eye_color == "yellow")
num_individuals <- nrow(data_subset)
return(num_yellow_eyes / num_individuals)
}
## homeworld V1
## 1 Alderaan 0.00000000
## 2 Aleen Minor 0.00000000
## 3 Bespin 0.00000000
## 4 Bestine IV 0.00000000
## 5 Cato Neimoidia 0.00000000
## 6 Cerea 1.00000000
## 7 Champala 0.00000000
## 8 Chandrila 0.00000000
## 9 Concord Dawn 0.00000000
## 10 Corellia 0.00000000
## 11 Coruscant 0.00000000
## 12 Dathomir 1.00000000
## 13 Dorin 0.00000000
## 14 Endor 0.00000000
## 15 Eriadu 0.00000000
## 16 Geonosis 1.00000000
## 17 Glee Anselm 0.00000000
## 18 Haruun Kal 0.00000000
## 19 Iktotch 0.00000000
## 20 Iridonia 0.00000000
## 21 Kalee 0.00000000
## 22 Kamino 0.00000000
## 23 Kashyyyk 0.00000000
## 24 Malastare 0.00000000
## 25 Mirial 0.00000000
## 26 Mon Cala 0.00000000
## 27 Muunilinst 0.00000000
## 28 Naboo 0.09090909
## 29 Nal Hutta 0.00000000
## 30 Ojom 1.00000000
## 31 Quermia 1.00000000
## 32 Rodia 0.00000000
## 33 Ryloth 0.00000000
## 34 Serenno 0.00000000
## 35 Shili 0.00000000
## 36 Skako 0.00000000
## 37 Socorro 0.00000000
## 38 Stewjon 0.00000000
## 39 Sullust 0.00000000
## 40 Tatooine 0.20000000
## 41 Toydaria 1.00000000
## 42 Trandosha 0.00000000
## 43 Troiken 0.00000000
## 44 Tund 0.00000000
## 45 Umbara 0.00000000
## 46 Utapau 0.00000000
## 47 Vulpter 1.00000000
## 48 Zolan 1.00000000
## 49 <NA> 0.00000000
We said the options for output are arrays, data frames, and lists.
We’ve seen examples above, but let’s look more systematically.
*dply
Returns a data frame with columns for the values of the processing function and columns describing the data splits.
The processing function should return either a row of a data frame or a vector of a consistent length.
data(diamonds)
diamond_coefs <- ddply(diamonds, .(color), function(data_subset) {
diamond_lm <- lm(log(price) ~ carat, data = data_subset)
diamond_coefs <- coef(diamond_lm)
return(diamond_coefs)
})
diamond_coefs
## color (Intercept) carat
## 1 D 6.048811 2.383864
## 2 E 6.034513 2.348335
## 3 F 6.088442 2.272790
## 4 G 6.109554 2.178489
## 5 H 6.180284 1.906300
## 6 I 6.175315 1.799199
## 7 J 6.254074 1.627947
diamond_coefs_2 <- ddply(diamonds, .(color), function(data_subset) {
diamond_lm <- lm(log(price) ~ carat, data = data_subset)
diamond_coefs <- coef(diamond_lm)
df_row <- data.frame(slope = diamond_coefs['carat'],
my_color_description = paste0("color", unique(data_subset$color)))
return(df_row)
})
diamond_coefs_2
## color slope my_color_description
## 1 D 2.383864 colorD
## 2 E 2.348335 colorE
## 3 F 2.272790 colorF
## 4 G 2.178489 colorG
## 5 H 1.906300 colorH
## 6 I 1.799199 colorI
## 7 J 1.627947 colorJ
Check on your own what happens when you replace .(color)
with .(color, clarity)
. How is the output different than
what we got using by
last time? Is it better?
*aply
Returns an array with dimension equal to the dimension of the
split concatenated with the dimension of the output.
The first
dimensions in the array correspond to the split dimensions, and the last
dimension(s) has/have the same size as the output from the processing
function.
Has a .drop
argument, which is TRUE by default. Any
dimensions of size 1 will be dropped (e.g., if the output would have
been a 3 x 1 matrix, with .drop = TRUE
you will get a
length-3 vector).
The processing function should return a vector or array of the same type and dimensionality each time it is called.
Here we split along one dimension and have one-dimensional output. Similar to what we’ve seen before with apply in base R.
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 3 1
##
## X1 1
## 1 5
## 2 7
## 3 9
## NULL
## 1 2 3
## 5 7 9
Here we split along two dimensions and the processing function gives one-dimensional output:
## [1] 3 2 2
## , , 1
##
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
##
## , , 2
##
## [,1] [,2]
## [1,] 7 10
## [2,] 8 11
## [3,] 9 12
## [1] 3 2 1
## , , = 1
##
## X2
## X1 1 2
## 1 8 14
## 2 10 16
## 3 12 18
## , , = 1
##
## X2
## X1 1 2
## 1 8 14
## 2 10 16
## 3 12 18
Here we split along one dimension and have four-dimensional output from the processing function:
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 3 4
##
## X1 1 2 3 4
## 1 2 -4 0 0
## 2 4 -5 0 0
## 3 6 -6 0 0
*lply
Returns a list, names of the list describe the split.
Since the output type is list, no restrictions on the type of output the processing function returns.
diamond_lms <- dlply(diamonds, .(color), function(data_subset) {
lm(log(price) ~ carat, data = data_subset)
})
class(diamond_lms)
## [1] "list"
## [1] "lm"
## [1] "D" "E" "F" "G" "H" "I" "J"
No output: *_ply
(for functions that just have side
effects, like making plots or writing to the disk).
r*ply
: like l*ply
, but r
for repeat. Pass it some number of times to repeat an expression instead
of passing it data, and pass it an expression to evaluate instead of a
function.
A second package, different interface, specifically for data frames.
To do split/apply/combine with dplyr
, we still
specify:
Data
How to split the data
What function to apply to each split
but the syntax is different
The magrittr
library offers the “pipe” operation,
%>%
.
In words: “Pipe the output of the last function as input to the next function”.
Suppose f1
and f2
are functions. These are
equivalent:
f2(f1(x))
f1(x) %>% f2()
The output of f1(x)
is used as the first argument to
f2
.
dplyr syntax:
df %>%
group_by(group_var) %>%
summarise(output_name = expression)
df
is a data frame with the input data.
group_var
is a factor variable describing how the
data should be split.
expression
is an expression that should be evaluated
on each subset of the data. It will be evaluated in an environment where
the columns of the data frame are attached, so your expression can be in
terms of the columns of the data frame.
Note that expression
is not a function
here.
Star wars example again:
What is the average height in each world?
## # A tibble: 49 × 2
## homeworld avg_height
## <chr> <dbl>
## 1 Alderaan 176.
## 2 Aleen Minor 79
## 3 Bespin 175
## 4 Bestine IV 180
## 5 Cato Neimoidia 191
## 6 Cerea 198
## 7 Champala 196
## 8 Chandrila 150
## 9 Concord Dawn 183
## 10 Corellia 175
## # ℹ 39 more rows
What fraction of individuals on each world have yellow eyes?
starwars %>%
group_by(homeworld) %>%
summarise(frac_yellow_eyes = sum(eye_color == "yellow") / length(eye_color))
## # A tibble: 49 × 2
## homeworld frac_yellow_eyes
## <chr> <dbl>
## 1 Alderaan 0
## 2 Aleen Minor 0
## 3 Bespin 0
## 4 Bestine IV 0
## 5 Cato Neimoidia 0
## 6 Cerea 1
## 7 Champala 0
## 8 Chandrila 0
## 9 Concord Dawn 0
## 10 Corellia 0
## # ℹ 39 more rows
plyr
:
a*ply
functions take array-like structures and split
them up row-by-row or column-by-column.
d*ply
functions take data frames and split them on a
factor or a combination of factors.
l*ply
functions take lists and split them up one
element at a time.
dplyr
:
group_by
and summarise
allow you to
specify a split and an expression to compute in each split.
Only for data frames
In every case, a function is applied to each element of the split, the output computed, and the results reported as either an array, a data frame, or a list.