Stat 610 lecture 8, Split/apply/combine part 2

Split/apply/combine part 2

Reading: Hadley Wickham, “The Split/Apply/Combine Strategy for Data Analysis”.

Last time

Split/apply/combine with base R, lots of different functions for different tasks

plyr/dplyr is going to clean this up for us. The plyr functions have the goals as base R functions, but with

plyr naming convention

All plyr functions named **ply

The possible vaues for the positions are

What you will have you specify:

List input: l*ply

syntax: l*ply(data, fun)

Example:

library(plyr)
a_list <- list(a = 1, b = "state", c = TRUE)
a_list
## $a
## [1] 1
## 
## $b
## [1] "state"
## 
## $c
## [1] TRUE
laply(a_list, typeof)
## [1] "double"    "character" "logical"

l*ply will also work on vectors, e.g.:

vec <- 1:10
laply(vec, log)
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

Array input: a*ply

syntax: a*ply(data, margin, fun)

Example:

X <- matrix(1:6, nrow = 2, ncol = 3)
X
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
aaply(X, 1, sum)
##  1  2 
##  9 12
adply(X, 1, sum)
##   X1 V1
## 1  1  9
## 2  2 12
alply(X, 1, sum)
## $`1`
## [1] 9
## 
## $`2`
## [1] 12
## 
## attr(,"split_type")
## [1] "array"
## attr(,"split_labels")
##   X1
## 1  1
## 2  2

A couple of extras:

Example on a 3-dimensional array:

data(HairEyeColor)
HairEyeColor
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
dim(HairEyeColor)
## [1] 4 4 2
dimnames(HairEyeColor)
## $Hair
## [1] "Black" "Brown" "Red"   "Blond"
## 
## $Eye
## [1] "Brown" "Blue"  "Hazel" "Green"
## 
## $Sex
## [1] "Male"   "Female"
adply(HairEyeColor, c(1,2), sum)
##     Hair   Eye  V1
## 1  Black Brown  68
## 2  Brown Brown 119
## 3    Red Brown  26
## 4  Blond Brown   7
## 5  Black  Blue  20
## 6  Brown  Blue  84
## 7    Red  Blue  17
## 8  Blond  Blue  94
## 9  Black Hazel  15
## 10 Brown Hazel  54
## 11   Red Hazel  14
## 12 Blond Hazel  10
## 13 Black Green   5
## 14 Brown Green  29
## 15   Red Green  14
## 16 Blond Green  16

Play around with replacing the margin vector with others, and convince yourself of why you get the output you do.

Data frame input: d*ply

syntax: d*ply(data, variables, fun)

For example: Star Wars

data(starwars, package = "dplyr")
starwars
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Suppose we want to know the average height of individuals on each world.

First ask:

First define a function that will take a subset of the data and find return to us the average height:

get_avg_height <- function(data_subset) {
    avg_height <- mean(data_subset$height)
    return(avg_height)
}

Then split starwars on the homeworld variable and apply get_avg_height to each data subset:

ddply(starwars, .(homeworld), get_avg_height)
##         homeworld       V1
## 1        Alderaan 176.3333
## 2     Aleen Minor  79.0000
## 3          Bespin 175.0000
## 4      Bestine IV 180.0000
## 5  Cato Neimoidia 191.0000
## 6           Cerea 198.0000
## 7        Champala 196.0000
## 8       Chandrila 150.0000
## 9    Concord Dawn 183.0000
## 10       Corellia 175.0000
## 11      Coruscant 173.6667
## 12       Dathomir 175.0000
## 13          Dorin 188.0000
## 14          Endor  88.0000
## 15         Eriadu 180.0000
## 16       Geonosis 183.0000
## 17    Glee Anselm 196.0000
## 18     Haruun Kal 188.0000
## 19        Iktotch 188.0000
## 20       Iridonia 171.0000
## 21          Kalee 216.0000
## 22         Kamino 208.3333
## 23       Kashyyyk 231.0000
## 24      Malastare 112.0000
## 25         Mirial 168.0000
## 26       Mon Cala 180.0000
## 27     Muunilinst 191.0000
## 28          Naboo 177.2727
## 29      Nal Hutta 175.0000
## 30           Ojom 198.0000
## 31        Quermia 264.0000
## 32          Rodia 173.0000
## 33         Ryloth 179.0000
## 34        Serenno 193.0000
## 35          Shili 178.0000
## 36          Skako 193.0000
## 37        Socorro 177.0000
## 38        Stewjon 182.0000
## 39        Sullust 160.0000
## 40       Tatooine 169.8000
## 41       Toydaria 137.0000
## 42      Trandosha 190.0000
## 43        Troiken 122.0000
## 44           Tund 163.0000
## 45         Umbara 178.0000
## 46         Utapau 206.0000
## 47        Vulpter  94.0000
## 48          Zolan 168.0000
## 49           <NA>       NA

Slightly more complicated example: we want to know the fraction of individuals with yellow eyes on each world.

get_frac_yellow_eyes <- function(data_subset) {
    num_yellow_eyes <- sum(data_subset$eye_color == "yellow")
    num_individuals <- nrow(data_subset)
    return(num_yellow_eyes / num_individuals)
}
ddply(starwars, .(homeworld), get_frac_yellow_eyes)
##         homeworld         V1
## 1        Alderaan 0.00000000
## 2     Aleen Minor 0.00000000
## 3          Bespin 0.00000000
## 4      Bestine IV 0.00000000
## 5  Cato Neimoidia 0.00000000
## 6           Cerea 1.00000000
## 7        Champala 0.00000000
## 8       Chandrila 0.00000000
## 9    Concord Dawn 0.00000000
## 10       Corellia 0.00000000
## 11      Coruscant 0.00000000
## 12       Dathomir 1.00000000
## 13          Dorin 0.00000000
## 14          Endor 0.00000000
## 15         Eriadu 0.00000000
## 16       Geonosis 1.00000000
## 17    Glee Anselm 0.00000000
## 18     Haruun Kal 0.00000000
## 19        Iktotch 0.00000000
## 20       Iridonia 0.00000000
## 21          Kalee 0.00000000
## 22         Kamino 0.00000000
## 23       Kashyyyk 0.00000000
## 24      Malastare 0.00000000
## 25         Mirial 0.00000000
## 26       Mon Cala 0.00000000
## 27     Muunilinst 0.00000000
## 28          Naboo 0.09090909
## 29      Nal Hutta 0.00000000
## 30           Ojom 1.00000000
## 31        Quermia 1.00000000
## 32          Rodia 0.00000000
## 33         Ryloth 0.00000000
## 34        Serenno 0.00000000
## 35          Shili 0.00000000
## 36          Skako 0.00000000
## 37        Socorro 0.00000000
## 38        Stewjon 0.00000000
## 39        Sullust 0.00000000
## 40       Tatooine 0.20000000
## 41       Toydaria 1.00000000
## 42      Trandosha 0.00000000
## 43        Troiken 0.00000000
## 44           Tund 0.00000000
## 45         Umbara 0.00000000
## 46         Utapau 0.00000000
## 47        Vulpter 1.00000000
## 48          Zolan 1.00000000
## 49           <NA> 0.00000000

Output types

We said the options for output are arrays, data frames, and lists.

We’ve seen examples above, but let’s look more systematically.

Data frame output: *dply

data(diamonds)
diamond_coefs <- ddply(diamonds, .(color), function(data_subset) {
    diamond_lm <- lm(log(price) ~ carat, data = data_subset)
    diamond_coefs <- coef(diamond_lm)
    return(diamond_coefs)
})
diamond_coefs
##   color (Intercept)    carat
## 1     D    6.048811 2.383864
## 2     E    6.034513 2.348335
## 3     F    6.088442 2.272790
## 4     G    6.109554 2.178489
## 5     H    6.180284 1.906300
## 6     I    6.175315 1.799199
## 7     J    6.254074 1.627947
diamond_coefs_2 <- ddply(diamonds, .(color), function(data_subset) {
    diamond_lm <- lm(log(price) ~ carat, data = data_subset)
    diamond_coefs <- coef(diamond_lm)
    df_row <- data.frame(slope = diamond_coefs['carat'],
                         my_color_description = paste0("color", unique(data_subset$color)))
    return(df_row)
})
diamond_coefs_2
##   color    slope my_color_description
## 1     D 2.383864               colorD
## 2     E 2.348335               colorE
## 3     F 2.272790               colorF
## 4     G 2.178489               colorG
## 5     H 1.906300               colorH
## 6     I 1.799199               colorI
## 7     J 1.627947               colorJ

Check on your own what happens when you replace .(color) with .(color, clarity). How is the output different than what we got using by last time? Is it better?

Array output: *aply

Here we split along one dimension and have one-dimensional output. Similar to what we’ve seen before with apply in base R.

X <- matrix(1:6, nrow = 3, ncol = 2)
X
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
row_sums_x <- aaply(X, 1, sum, .drop = FALSE)
dim(row_sums_x)
## [1] 3 1
row_sums_x
##    
## X1  1
##   1 5
##   2 7
##   3 9
row_sums_x <- aaply(X, 1, sum)
dim(row_sums_x)
## NULL
row_sums_x
## 1 2 3 
## 5 7 9

Here we split along two dimensions and the processing function gives one-dimensional output:

X_array <- array(data = 1:12, dim = c(3, 2, 2))
dim(X_array)
## [1] 3 2 2
X_array
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   11
## [3,]    9   12
third_dim_sums_x <- aaply(X_array, 1:2, sum, .drop = FALSE)
dim(third_dim_sums_x)
## [1] 3 2 1
third_dim_sums_x
## , ,  = 1
## 
##    X2
## X1   1  2
##   1  8 14
##   2 10 16
##   3 12 18
third_dim_sums_x <- aaply(X_array, 1:2, sum, .drop = FALSE)
third_dim_sums_x
## , ,  = 1
## 
##    X2
## X1   1  2
##   1  8 14
##   2 10 16
##   3 12 18

Here we split along one dimension and have four-dimensional output from the processing function:

nonsense_function <- function(x) {
    out <- c(x[1] * 2, x[2] * -1, 0, 0)
    return(out)
}
X
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
nonsense_applied_to_rows <- aaply(X, 1, nonsense_function)
dim(nonsense_applied_to_rows)
## [1] 3 4
nonsense_applied_to_rows
##    
## X1  1  2 3 4
##   1 2 -4 0 0
##   2 4 -5 0 0
##   3 6 -6 0 0

List output: *lply

diamond_lms <- dlply(diamonds, .(color), function(data_subset) {
    lm(log(price) ~ carat, data = data_subset)
})

class(diamond_lms)
## [1] "list"
class(diamond_lms[[1]])
## [1] "lm"
names(diamond_lms)
## [1] "D" "E" "F" "G" "H" "I" "J"

Some special cases

dplyr

A second package, different interface, specifically for data frames.

To do split/apply/combine with dplyr, we still specify:

but the syntax is different

dplyr syntax:

df %>% 
   group_by(group_var) %>% 
   summarise(output_name = expression)

Note that expression is not a function here.

Star wars example again:

What is the average height in each world?

library(dplyr)
starwars %>% group_by(homeworld) %>% summarise(avg_height = mean(height))
## # A tibble: 49 × 2
##    homeworld      avg_height
##    <chr>               <dbl>
##  1 Alderaan             176.
##  2 Aleen Minor           79 
##  3 Bespin               175 
##  4 Bestine IV           180 
##  5 Cato Neimoidia       191 
##  6 Cerea                198 
##  7 Champala             196 
##  8 Chandrila            150 
##  9 Concord Dawn         183 
## 10 Corellia             175 
## # ℹ 39 more rows

What fraction of individuals on each world have yellow eyes?

starwars %>%
    group_by(homeworld) %>%
    summarise(frac_yellow_eyes = sum(eye_color == "yellow") / length(eye_color))
## # A tibble: 49 × 2
##    homeworld      frac_yellow_eyes
##    <chr>                     <dbl>
##  1 Alderaan                      0
##  2 Aleen Minor                   0
##  3 Bespin                        0
##  4 Bestine IV                    0
##  5 Cato Neimoidia                0
##  6 Cerea                         1
##  7 Champala                      0
##  8 Chandrila                     0
##  9 Concord Dawn                  0
## 10 Corellia                      0
## # ℹ 39 more rows

Summing up

plyr:

dplyr:

In every case, a function is applied to each element of the split, the output computed, and the results reported as either an array, a data frame, or a list.