Stat 610 Lecture 7: Split/apply/combine

Avoiding iteration, split/apply/combine

Reading: R Cookbook, Chapter 6

Agenda for today:

Avoiding iteration

We’ve already seen a lot of examples of functions in R that are written so as to avoid iteration.

Built-in functions for making sequences or repeating elements, for example, we can say:

x <- seq(1, 11, by = 2)

instead of

x <- numeric(6)
for(i in 1:length(x)) {
    x[i] <- 1 + 2 * (i - 1)
}

Most of the built-in functions are vectorized, so we can use

log_x <- log(x)

instead of

log_x <- numeric(length(x))
for(i in 1:length(x)) {
    log_x[i] <- log(x[i])
}

Plus a whole bunch we haven’t seen yet, like rowSums, colMeans, sweep, scale

However, not everything is vectorized, and you will often want to go beyond the built-in functions, and for that you use the *apply family of functions.

Applying to one-dimensional structures

lapply, and sapply will apply a function to a one-dimensional structure.

The difference between them is how the output is formatted.

They have the same syntax: lapply(X = x, FUN = fun, ...) and sapply(X = x, FUN = fun, ...)

Return values:

Example of lapply

data_dir <- "~/GitHub/stat-comp-fall-22/lectures/lecture07/bglobin_annotations"
file_names <- list.files(data_dir, full.names = TRUE)
annotations <- lapply(file_names, read.csv)
annotations <- lapply(file_names, read.csv, stringsAsFactors = FALSE)
lapply(annotations[1:3], summary)
## [[1]]
##   unique_ids           v_gene             d_gene             j_gene         
##  Length:136         Length:136         Length:136         Length:136        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   cdr3_length    mut_freqs         n_mutations       input_seqs       
##  Min.   :366   Min.   :0.000000   Min.   :  0.000   Length:136        
##  1st Qu.:367   1st Qu.:0.004951   1st Qu.:  2.000   Class :character  
##  Median :367   Median :0.009901   Median :  4.000   Mode  :character  
##  Mean   :367   Mean   :0.021003   Mean   :  8.485                     
##  3rd Qu.:367   3rd Qu.:0.022277   3rd Qu.:  9.000                     
##  Max.   :367   Max.   :0.544555   Max.   :220.000                     
##  indel_reversed_seqs  naive_seq           indelfos         duplicates    
##  Length:136          Length:136         Length:136         Mode:logical  
##  Class :character    Class :character   Class :character   NA's:136      
##  Mode  :character    Mode  :character   Mode  :character                 
##                                                                          
##                                                                          
##                                                                          
##  v_per_gene_support d_per_gene_support j_per_gene_support    v_3p_del      
##  Mode:logical       Mode:logical       Mode:logical       Min.   :0.00000  
##  NA's:136           NA's:136           NA's:136           1st Qu.:0.00000  
##                                                           Median :0.00000  
##                                                           Mean   :0.08823  
##                                                           3rd Qu.:0.00000  
##                                                           Max.   :9.00000  
##     d_5p_del    d_3p_del    j_5p_del    v_5p_del    j_3p_del vd_insertion      
##  Min.   :0   Min.   :0   Min.   :0   Min.   :0   Min.   :0   Length:136        
##  1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0   Class :character  
##  Median :0   Median :0   Median :0   Median :0   Median :0   Mode  :character  
##  Mean   :0   Mean   :0   Mean   :0   Mean   :0   Mean   :0                     
##  3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0                     
##  Max.   :0   Max.   :0   Max.   :0   Max.   :0   Max.   :0                     
##  dj_insertion   fv_insertion   jf_insertion   mutated_invariants
##  Mode:logical   Mode:logical   Mode:logical   Length:136        
##  NA's:136       NA's:136       NA's:136       Class :character  
##                                               Mode  :character  
##                                                                 
##                                                                 
##                                                                 
##   in_frames            stops               k_v                k_d           
##  Length:136         Length:136         Length:136         Length:136        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  padlefts       padrights      all_matches         mut_freqs.1      
##  Mode:logical   Mode:logical   Length:136         Min.   :0.000000  
##  NA's:136       NA's:136       Class :character   1st Qu.:0.004951  
##                                Mode  :character   Median :0.009901  
##                                                   Mean   :0.021003  
##                                                   3rd Qu.:0.022277  
##                                                   Max.   :0.544555  
## 
## [[2]]
##   unique_ids           v_gene             d_gene             j_gene         
##  Length:214         Length:214         Length:214         Length:214        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   cdr3_length    mut_freqs         n_mutations       input_seqs       
##  Min.   :365   Min.   :0.000000   Min.   :  0.000   Length:214        
##  1st Qu.:367   1st Qu.:0.002475   1st Qu.:  1.000   Class :character  
##  Median :367   Median :0.002475   Median :  1.000   Mode  :character  
##  Mean   :367   Mean   :0.009892   Mean   :  3.995                     
##  3rd Qu.:367   3rd Qu.:0.004951   3rd Qu.:  2.000                     
##  Max.   :368   Max.   :0.559406   Max.   :226.000                     
##  indel_reversed_seqs  naive_seq           indelfos         duplicates    
##  Length:214          Length:214         Length:214         Mode:logical  
##  Class :character    Class :character   Class :character   NA's:214      
##  Mode  :character    Mode  :character   Mode  :character                 
##                                                                          
##                                                                          
##                                                                          
##  v_per_gene_support d_per_gene_support j_per_gene_support    v_3p_del     
##  Mode:logical       Mode:logical       Mode:logical       Min.   :0.0000  
##  NA's:214           NA's:214           NA's:214           1st Qu.:0.0000  
##                                                           Median :0.0000  
##                                                           Mean   :0.1776  
##                                                           3rd Qu.:0.0000  
##                                                           Max.   :9.0000  
##     d_5p_del    d_3p_del    j_5p_del    v_5p_del    j_3p_del vd_insertion      
##  Min.   :0   Min.   :0   Min.   :0   Min.   :0   Min.   :0   Length:214        
##  1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0   Class :character  
##  Median :0   Median :0   Median :0   Median :0   Median :0   Mode  :character  
##  Mean   :0   Mean   :0   Mean   :0   Mean   :0   Mean   :0                     
##  3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0                     
##  Max.   :0   Max.   :0   Max.   :0   Max.   :0   Max.   :0                     
##  dj_insertion   fv_insertion   jf_insertion   mutated_invariants
##  Mode:logical   Mode:logical   Mode:logical   Length:214        
##  NA's:214       NA's:214       NA's:214       Class :character  
##                                               Mode  :character  
##                                                                 
##                                                                 
##                                                                 
##   in_frames            stops               k_v                k_d           
##  Length:214         Length:214         Length:214         Length:214        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  padlefts       padrights      all_matches         mut_freqs.1      
##  Mode:logical   Mode:logical   Length:214         Min.   :0.000000  
##  NA's:214       NA's:214       Class :character   1st Qu.:0.002475  
##                                Mode  :character   Median :0.002475  
##                                                   Mean   :0.009892  
##                                                   3rd Qu.:0.004951  
##                                                   Max.   :0.559406  
## 
## [[3]]
##   unique_ids           v_gene             d_gene             j_gene         
##  Length:201         Length:201         Length:201         Length:201        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   cdr3_length      mut_freqs         n_mutations      input_seqs       
##  Min.   :294.0   Min.   :0.000000   Min.   : 0.000   Length:201        
##  1st Qu.:367.0   1st Qu.:0.002475   1st Qu.: 1.000   Class :character  
##  Median :367.0   Median :0.002475   Median : 1.000   Mode  :character  
##  Mean   :366.6   Mean   :0.005742   Mean   : 2.299                     
##  3rd Qu.:367.0   3rd Qu.:0.004951   3rd Qu.: 2.000                     
##  Max.   :368.0   Max.   :0.057402   Max.   :20.000                     
##  indel_reversed_seqs  naive_seq           indelfos         duplicates    
##  Length:201          Length:201         Length:201         Mode:logical  
##  Class :character    Class :character   Class :character   NA's:201      
##  Mode  :character    Mode  :character   Mode  :character                 
##                                                                          
##                                                                          
##                                                                          
##  v_per_gene_support d_per_gene_support j_per_gene_support    v_3p_del      
##  Mode:logical       Mode:logical       Mode:logical       Min.   : 0.0000  
##  NA's:201           NA's:201           NA's:201           1st Qu.: 0.0000  
##                                                           Median : 0.0000  
##                                                           Mean   : 0.5373  
##                                                           3rd Qu.: 0.0000  
##                                                           Max.   :74.0000  
##     d_5p_del    d_3p_del    j_5p_del    v_5p_del    j_3p_del vd_insertion      
##  Min.   :0   Min.   :0   Min.   :0   Min.   :0   Min.   :0   Length:201        
##  1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0   Class :character  
##  Median :0   Median :0   Median :0   Median :0   Median :0   Mode  :character  
##  Mean   :0   Mean   :0   Mean   :0   Mean   :0   Mean   :0                     
##  3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0                     
##  Max.   :0   Max.   :0   Max.   :0   Max.   :0   Max.   :0                     
##  dj_insertion   fv_insertion   jf_insertion   mutated_invariants
##  Mode:logical   Mode:logical   Mode:logical   Length:201        
##  NA's:201       NA's:201       NA's:201       Class :character  
##                                               Mode  :character  
##                                                                 
##                                                                 
##                                                                 
##   in_frames            stops               k_v                k_d           
##  Length:201         Length:201         Length:201         Length:201        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  padlefts       padrights      all_matches         mut_freqs.1      
##  Mode:logical   Mode:logical   Length:201         Min.   :0.000000  
##  NA's:201       NA's:201       Class :character   1st Qu.:0.002475  
##                                Mode  :character   Median :0.002475  
##                                                   Mean   :0.005742  
##                                                   3rd Qu.:0.004951  
##                                                   Max.   :0.057402

We need to do this because read.csv doesn’t vectorize. Check what happens if you do read_file(file_names) (you’ll have to change data_dir to wherever you saved the data to).

Example of sapply

sapply(annotations, nrow)
##  [1] 136 214 201 186 231 112  86 123  52 132

Applying to two-dimensional structures

Syntax: apply(X = mat, MARGIN = dim, FUN = fun, ...)

As with lapply and sapply, fun will be called with each row or each column of mat as its first argument, along with any extra arguments specified after the function.

Some simple examples:

X <- matrix(sample(-5:5, size = 20, replace = TRUE), nrow = 4, ncol = 5)
X
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    3   -4    5   -1    1
## [2,]   -2    1   -3    4    3
## [3,]    1    5   -5    0   -1
## [4,]   -5   -4   -1    4   -1
apply(X, 2, min)
## [1] -5 -4 -5 -1 -1
log_X <- log(X)
## Warning in log(X): NaNs produced
log_X
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 1.098612      NaN 1.609438      NaN 0.000000
## [2,]      NaN 0.000000      NaN 1.386294 1.098612
## [3,] 0.000000 1.609438      NaN     -Inf      NaN
## [4,]      NaN      NaN      NaN 1.386294      NaN
apply(log_X, 1, min, na.rm = TRUE)
## [1] 0.000000 0.000000     -Inf 1.386294

Be careful when using apply on rows of a data frame: apply assumes the argument you pass to the function is a vector, and will coerce the rows of the data frame to vectors to make it so.

steak_combinations <- data.frame(
    temp = c(117, 120, 135, 105),
    type = c("rare", "med_rare", "med_rare", "rare"))

steak_directions_for_apply <- function(temp_and_type) {
    temp <- temp_and_type[1]
    steak_type <- temp_and_type[2]
    if(steak_type == "rare" & temp > 115) {
        return("take your steak off!")
    } else if(steak_type == "med_rare" & temp > 125) {
        return("take your steak off!")        
    } 
    "you can keep cooking"
}
## this works, but it really shouldn't
apply(steak_combinations, 1, steak_directions_for_apply)
## [1] "take your steak off!" "you can keep cooking" "take your steak off!"
## [4] "you can keep cooking"

One that actually doesn’t work:

steak_admonishments_for_apply <- function(temp_and_type) {
    temp <- temp_and_type[1]
    steak_type <- temp_and_type[2]
    if(steak_type == "rare" & temp > 115) {
        admonishment <- sprintf("Your steak is overcooked by %f degrees!", temp - 115)
        return(admonhishment)
    } else if(steak_type == "med_rare" & temp > 125) {
        admonishment <- sprintf("Your steak is overcooked by %f degrees!", temp - 125)
        return(admonishment)
    }
    "Your steak is not overcooked yet"
}
apply(steak_combinations, 1, steak_admonishments_for_apply)
## Error in temp - 115: non-numeric argument to binary operator

If you want to apply a function that uses columns of a data frame that are of different types, better/safer to use mapply:

mapply(FUN = f, vec1, vec2, ..., vecN)

steak_directions <- function(steak_temp, steak_type) {
    if(steak_type == "rare" & steak_temp > 115) {
        return("take your steak off!")
    } else if(steak_type == "med_rare" & steak_temp > 125) {
        return("take your steak off!")        
    } 
    "you can keep cooking"
}
mapply(FUN = steak_directions, steak_combinations$temp, steak_combinations$type)
## [1] "take your steak off!" "you can keep cooking" "take your steak off!"
## [4] "you can keep cooking"
steak_admonishments <- function(steak_temp, steak_type) {
    if(steak_type == "rare" & steak_temp > 115) {
        admonishment <- sprintf("Your steak is overcooked by %i degrees!", steak_temp - 115)
        return(admonishment)
    } else if(steak_type == "med_rare" & steak_temp > 125) {
        admonishment <- sprintf("Your steak is overcooked by %i degrees!", steak_temp - 125)
        return(admonishment)
    }
    "Your steak is not overcooked yet."
}

mapply(FUN = steak_admonishments, steak_combinations$temp, steak_combinations$type)
## [1] "Your steak is overcooked by 2 degrees!" 
## [2] "Your steak is not overcooked yet."      
## [3] "Your steak is overcooked by 10 degrees!"
## [4] "Your steak is not overcooked yet."

What if things aren’t quite so tidy?

Split/apply/combine paradigm

Turns out this is common enough that it has a name and a bunch of built-in functions.

The programming pattern is pretty clear from the name, what we want to do is:

Why is this a useful abstraction?

Imagine how you would do this if you only had for-loops.

You would have to first

Then, for each level of the grouping factor:

Compare with split/apply/combine abstraction:

The abstraction is useful both for writing and reading the resulting code: you think about what you want to do instead of how to do it.

Split/apply/combine for vectors

Simple case: we have a vector, and a grouping factor, we want to apply a function to groups of elements of the vector defined by the grouping factor

Solution is tapply

tapply(X = x, INDEX = f, FUN = fun)

Syntax:

Example: data on police stops in New York City in 1998–1999, information on the number of stops as categorized by certain variables.

In particular, we have counts of police stops for all combinations of

frisk <- read.table("http://www.stat.columbia.edu/~gelman/arm/examples/police/frisk_with_noise.dat",
                    skip = 6, header = TRUE)
head(frisk)
##   stops  pop past.arrests precinct eth crime
## 1    75 1720          191        1   1     1
## 2    36 1720           57        1   1     2
## 3    74 1720          599        1   1     3
## 4    17 1720          133        1   1     4
## 5    37 1368           62        1   2     1
## 6    39 1368           27        1   2     2

Suppose we want to get the total number of stops in each precinct.

We can use tapply with the sum function to do so:

tapply(X = frisk$stops, INDEX = frisk$precinct, FUN = sum)
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  385  347 1603 1411 1600 1297  649 1572  285 1156  838  884 2215 1165 2373  965 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005  902 2888 2157 1805 1041 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
## 2966 2810  724 1189  997 1808 2464  859 1979 2679 1645 1669 2953 3443 1001 1574 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
## 1041 3054 1208 2116 1216  995 1140  572  856 2671 1147 2941 1625 1635 1248 1351 
##   65   66   67   68   69   70   71   72   73   74   75 
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327  322

Just as a check, make sure that the numbers match up:

precinct_1 <- subset(frisk, precinct == 1)
sum(precinct_1$stops)
## [1] 385

And remember how much better this is than iterating using a for loop would have been:

## find the number of precincts and make a vector of the right length to store the output
precincts <- unique(frisk$precinct)
stops_per_precinct <- numeric(length(precincts))
## iterate over precinct, get the subset for each one, compute the number of stops, and store
for(i in 1:length(precincts)) {
    precinct_subset <- subset(frisk, precinct == precincts[i])
    precinct_stops <- sum(precinct_subset$stops)
    stops_per_precinct[i] <- precinct_stops
}
stops_per_precinct
##  [1]  385  347 1603 1411 1600 1297  649 1572  285 1156  838  884 2215 1165 2373
## [16]  965  946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005  902 2888 2157
## [31] 1805 1041 2966 2810  724 1189  997 1808 2464  859 1979 2679 1645 1669 2953
## [46] 3443 1001 1574 1041 3054 1208 2116 1216  995 1140  572  856 2671 1147 2941
## [61] 1625 1635 1248 1351 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327  322

Split/apply/combine for data frames

Splitting a data frame into groups of rows and applying a function to the groups:

by(data = df, INDICES = fact, FUN = fun)

stops_and_past_arrests_correlation <- function(frisk_subset) {
    cor(frisk_subset$stops, frisk_subset$past.arrests)
}
within_precinct_cors <- by(data = frisk,
                           INDICES = frisk$precinct,
                           FUN = stops_and_past_arrests_correlation)
head(within_precinct_cors)
## frisk$precinct
##           1           2           3           4           5           6 
##  0.60479730  0.01659348 -0.21949242  0.12046523 -0.06892661  0.54151306

Anonymous functions

Here we only need the stops_and_past_arrests_correlation function in the context of by, so there’s no need to define it in the global environment.

Instead we can specify it as an anonymous function:

within_precinct_cors <- by(data = frisk,
                           INDICES = frisk$precinct,
                           FUN = function(frisk_subset) {
                               cor(frisk_subset$stops, frisk_subset$past.arrests)
                           })
head(within_precinct_cors)
## frisk$precinct
##           1           2           3           4           5           6 
##  0.60479730  0.01659348 -0.21949242  0.12046523 -0.06892661  0.54151306

A more complex example:

data(diamonds)
get_diamond_coefficients <- function(data_subset) {
    diamond_lm <- lm(log(price) ~ carat, data = data_subset)
    diamond_coefficients <- coef(diamond_lm)
    return(diamond_coefficients)
}
## by does the split and apply steps, and an ugly version of a combine step
out <- by(data = diamonds, INDICES = diamonds$color, FUN = get_diamond_coefficients)

The output is as a list with some extra attributes (the class is by, as you can check).

The form is not ideal on its own, usually you’ll want to simplify it. You have a couple of options, including the examples below:

simplify2array(out)
##                    D        E        F        G        H        I        J
## (Intercept) 6.048811 6.034513 6.088442 6.109554 6.180284 6.175315 6.254074
## carat       2.383864 2.348335 2.272790 2.178489 1.906300 1.799199 1.627947
## changing the output from a list to a matrix
do.call(rbind, out)
##   (Intercept)    carat
## D    6.048811 2.383864
## E    6.034513 2.348335
## F    6.088442 2.272790
## G    6.109554 2.178489
## H    6.180284 1.906300
## I    6.175315 1.799199
## J    6.254074 1.627947
## it's kind of bad that the splitting variable is only available as row names
## we can fix that but it's a bit of a pain
diamond_coef_matrix <- do.call(rbind, out)
diamond_coef_df <- data.frame(diamond_coef_matrix,
                              color = attributes(out)$dimnames$`diamonds$color`)
diamond_coef_df
##   X.Intercept.    carat color
## D     6.048811 2.383864     D
## E     6.034513 2.348335     E
## F     6.088442 2.272790     F
## G     6.109554 2.178489     G
## H     6.180284 1.906300     H
## I     6.175315 1.799199     I
## J     6.254074 1.627947     J

by can also be used with multiple grouping factors:

by(data = diamonds, INDICES = diamonds[, c("color", "cut")], FUN = get_diamond_coefficients)
## color: D
## cut: Fair
## (Intercept)       carat 
##    6.723804    1.514938 
## ------------------------------------------------------------ 
## color: E
## cut: Fair
## (Intercept)       carat 
##    6.246003    1.955816 
## ------------------------------------------------------------ 
## color: F
## cut: Fair
## (Intercept)       carat 
##    6.466073    1.637304 
## ------------------------------------------------------------ 
## color: G
## cut: Fair
## (Intercept)       carat 
##    6.659407    1.358722 
## ------------------------------------------------------------ 
## color: H
## cut: Fair
## (Intercept)       carat 
##    6.921231    1.122101 
## ------------------------------------------------------------ 
## color: I
## cut: Fair
## (Intercept)       carat 
##    6.759325    1.202279 
## ------------------------------------------------------------ 
## color: J
## cut: Fair
## (Intercept)       carat 
##   7.0504140   0.8761416 
## ------------------------------------------------------------ 
## color: D
## cut: Good
## (Intercept)       carat 
##    5.925962    2.415309 
## ------------------------------------------------------------ 
## color: E
## cut: Good
## (Intercept)       carat 
##    5.976449    2.322896 
## ------------------------------------------------------------ 
## color: F
## cut: Good
## (Intercept)       carat 
##    6.016284    2.273226 
## ------------------------------------------------------------ 
## color: G
## cut: Good
## (Intercept)       carat 
##    6.180670    2.037179 
## ------------------------------------------------------------ 
## color: H
## cut: Good
## (Intercept)       carat 
##    6.106822    1.952841 
## ------------------------------------------------------------ 
## color: I
## cut: Good
## (Intercept)       carat 
##    6.182043    1.762664 
## ------------------------------------------------------------ 
## color: J
## cut: Good
## (Intercept)       carat 
##    6.133176    1.728532 
## ------------------------------------------------------------ 
## color: D
## cut: Very Good
## (Intercept)       carat 
##    5.953898    2.487391 
## ------------------------------------------------------------ 
## color: E
## cut: Very Good
## (Intercept)       carat 
##    5.909810    2.480529 
## ------------------------------------------------------------ 
## color: F
## cut: Very Good
## (Intercept)       carat 
##    5.947350    2.446917 
## ------------------------------------------------------------ 
## color: G
## cut: Very Good
## (Intercept)       carat 
##    5.965330    2.343517 
## ------------------------------------------------------------ 
## color: H
## cut: Very Good
## (Intercept)       carat 
##    6.110954    1.993643 
## ------------------------------------------------------------ 
## color: I
## cut: Very Good
## (Intercept)       carat 
##    6.207955    1.807934 
## ------------------------------------------------------------ 
## color: J
## cut: Very Good
## (Intercept)       carat 
##    6.211086    1.696800 
## ------------------------------------------------------------ 
## color: D
## cut: Premium
## (Intercept)       carat 
##    6.109571    2.258907 
## ------------------------------------------------------------ 
## color: E
## cut: Premium
## (Intercept)       carat 
##    6.133275    2.185096 
## ------------------------------------------------------------ 
## color: F
## cut: Premium
## (Intercept)       carat 
##    6.189666    2.124037 
## ------------------------------------------------------------ 
## color: G
## cut: Premium
## (Intercept)       carat 
##    6.181514    2.052932 
## ------------------------------------------------------------ 
## color: H
## cut: Premium
## (Intercept)       carat 
##    6.270586    1.809367 
## ------------------------------------------------------------ 
## color: I
## cut: Premium
## (Intercept)       carat 
##    6.229239    1.723257 
## ------------------------------------------------------------ 
## color: J
## cut: Premium
## (Intercept)       carat 
##    6.305114    1.578440 
## ------------------------------------------------------------ 
## color: D
## cut: Ideal
## (Intercept)       carat 
##    5.970971    2.626088 
## ------------------------------------------------------------ 
## color: E
## cut: Ideal
## (Intercept)       carat 
##    5.989833    2.526368 
## ------------------------------------------------------------ 
## color: F
## cut: Ideal
## (Intercept)       carat 
##    6.052536    2.408753 
## ------------------------------------------------------------ 
## color: G
## cut: Ideal
## (Intercept)       carat 
##    6.032829    2.367710 
## ------------------------------------------------------------ 
## color: H
## cut: Ideal
## (Intercept)       carat 
##    6.075676    2.071994 
## ------------------------------------------------------------ 
## color: I
## cut: Ideal
## (Intercept)       carat 
##    6.071246    1.932255 
## ------------------------------------------------------------ 
## color: J
## cut: Ideal
## (Intercept)       carat 
##    6.104631    1.784315

But how do you get at which coefficients correspond to which factor groups?

If you want to do split and apply/combine separately

split does just the splitting part of split/apply/combine

Syntax: split(x = data, f = fact)

Output:

We can re-do the stop-and-frisk example that we used tapply for with split followed by lapply or sapply

frisk_split_by_precinct <- split(x = frisk$stops, f = frisk$precinct)
## check that this has length 75, the number of precincts
length(frisk_split_by_precinct)
## [1] 75
sapply(frisk_split_by_precinct, sum)
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##  385  347 1603 1411 1600 1297  649 1572  285 1156  838  884 2215 1165 2373  965 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005  902 2888 2157 1805 1041 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
## 2966 2810  724 1189  997 1808 2464  859 1979 2679 1645 1669 2953 3443 1001 1574 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
## 1041 3054 1208 2116 1216  995 1140  572  856 2671 1147 2941 1625 1635 1248 1351 
##   65   66   67   68   69   70   71   72   73   74   75 
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327  322

In general, base R apply-family functions don’t produce nice output from the combine step.

We’ll see implementations of split/apply/combine next time that do a nicer job.