Reading: R Cookbook, Chapter 6
Agenda for today:
Avoiding iteration in base R
Split/apply/combine
We’ve already seen a lot of examples of functions in R that are written so as to avoid iteration.
Built-in functions for making sequences or repeating elements, for example, we can say:
instead of
Most of the built-in functions are vectorized, so we can use
instead of
Plus a whole bunch we haven’t seen yet, like rowSums
,
colMeans
, sweep
, scale
However, not everything is vectorized, and you will often want to go
beyond the built-in functions, and for that you use the
*apply
family of functions.
lapply
, and sapply
will apply a function to
a one-dimensional structure.
The difference between them is how the output is formatted.
They have the same syntax: lapply(X = x, FUN = fun, ...)
and sapply(X = x, FUN = fun, ...)
x
is a vector or list that the function should be
applied to
fun
is the function that should be applied to each
element of the vector or the list
You can specify other arguments in the ...
portion.
fun
will be called with first argument equal to each
element of x
, and subsequent arguments anything you specify
in the ...
portion.
Return values:
lapply
returns a list with length equal to the
length of x
containing the result of the function
evaluation on the corresponding element of x
.
sapply
tries to simplify the output to a vector or
matrix. If each element of the output is of length 1 it returns a
vector. If each element has the same length of two or more, it returns a
matrix. Otherwise it returns a list like lapply
.
Example of lapply
data_dir <- "~/GitHub/stat-comp-fall-22/lectures/lecture07/bglobin_annotations"
file_names <- list.files(data_dir, full.names = TRUE)
annotations <- lapply(file_names, read.csv)
annotations <- lapply(file_names, read.csv, stringsAsFactors = FALSE)
lapply(annotations[1:3], summary)
## [[1]]
## unique_ids v_gene d_gene j_gene
## Length:136 Length:136 Length:136 Length:136
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## cdr3_length mut_freqs n_mutations input_seqs
## Min. :366 Min. :0.000000 Min. : 0.000 Length:136
## 1st Qu.:367 1st Qu.:0.004951 1st Qu.: 2.000 Class :character
## Median :367 Median :0.009901 Median : 4.000 Mode :character
## Mean :367 Mean :0.021003 Mean : 8.485
## 3rd Qu.:367 3rd Qu.:0.022277 3rd Qu.: 9.000
## Max. :367 Max. :0.544555 Max. :220.000
## indel_reversed_seqs naive_seq indelfos duplicates
## Length:136 Length:136 Length:136 Mode:logical
## Class :character Class :character Class :character NA's:136
## Mode :character Mode :character Mode :character
##
##
##
## v_per_gene_support d_per_gene_support j_per_gene_support v_3p_del
## Mode:logical Mode:logical Mode:logical Min. :0.00000
## NA's:136 NA's:136 NA's:136 1st Qu.:0.00000
## Median :0.00000
## Mean :0.08823
## 3rd Qu.:0.00000
## Max. :9.00000
## d_5p_del d_3p_del j_5p_del v_5p_del j_3p_del vd_insertion
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Length:136
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 Class :character
## Median :0 Median :0 Median :0 Median :0 Median :0 Mode :character
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
## dj_insertion fv_insertion jf_insertion mutated_invariants
## Mode:logical Mode:logical Mode:logical Length:136
## NA's:136 NA's:136 NA's:136 Class :character
## Mode :character
##
##
##
## in_frames stops k_v k_d
## Length:136 Length:136 Length:136 Length:136
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## padlefts padrights all_matches mut_freqs.1
## Mode:logical Mode:logical Length:136 Min. :0.000000
## NA's:136 NA's:136 Class :character 1st Qu.:0.004951
## Mode :character Median :0.009901
## Mean :0.021003
## 3rd Qu.:0.022277
## Max. :0.544555
##
## [[2]]
## unique_ids v_gene d_gene j_gene
## Length:214 Length:214 Length:214 Length:214
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## cdr3_length mut_freqs n_mutations input_seqs
## Min. :365 Min. :0.000000 Min. : 0.000 Length:214
## 1st Qu.:367 1st Qu.:0.002475 1st Qu.: 1.000 Class :character
## Median :367 Median :0.002475 Median : 1.000 Mode :character
## Mean :367 Mean :0.009892 Mean : 3.995
## 3rd Qu.:367 3rd Qu.:0.004951 3rd Qu.: 2.000
## Max. :368 Max. :0.559406 Max. :226.000
## indel_reversed_seqs naive_seq indelfos duplicates
## Length:214 Length:214 Length:214 Mode:logical
## Class :character Class :character Class :character NA's:214
## Mode :character Mode :character Mode :character
##
##
##
## v_per_gene_support d_per_gene_support j_per_gene_support v_3p_del
## Mode:logical Mode:logical Mode:logical Min. :0.0000
## NA's:214 NA's:214 NA's:214 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1776
## 3rd Qu.:0.0000
## Max. :9.0000
## d_5p_del d_3p_del j_5p_del v_5p_del j_3p_del vd_insertion
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Length:214
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 Class :character
## Median :0 Median :0 Median :0 Median :0 Median :0 Mode :character
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
## dj_insertion fv_insertion jf_insertion mutated_invariants
## Mode:logical Mode:logical Mode:logical Length:214
## NA's:214 NA's:214 NA's:214 Class :character
## Mode :character
##
##
##
## in_frames stops k_v k_d
## Length:214 Length:214 Length:214 Length:214
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## padlefts padrights all_matches mut_freqs.1
## Mode:logical Mode:logical Length:214 Min. :0.000000
## NA's:214 NA's:214 Class :character 1st Qu.:0.002475
## Mode :character Median :0.002475
## Mean :0.009892
## 3rd Qu.:0.004951
## Max. :0.559406
##
## [[3]]
## unique_ids v_gene d_gene j_gene
## Length:201 Length:201 Length:201 Length:201
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## cdr3_length mut_freqs n_mutations input_seqs
## Min. :294.0 Min. :0.000000 Min. : 0.000 Length:201
## 1st Qu.:367.0 1st Qu.:0.002475 1st Qu.: 1.000 Class :character
## Median :367.0 Median :0.002475 Median : 1.000 Mode :character
## Mean :366.6 Mean :0.005742 Mean : 2.299
## 3rd Qu.:367.0 3rd Qu.:0.004951 3rd Qu.: 2.000
## Max. :368.0 Max. :0.057402 Max. :20.000
## indel_reversed_seqs naive_seq indelfos duplicates
## Length:201 Length:201 Length:201 Mode:logical
## Class :character Class :character Class :character NA's:201
## Mode :character Mode :character Mode :character
##
##
##
## v_per_gene_support d_per_gene_support j_per_gene_support v_3p_del
## Mode:logical Mode:logical Mode:logical Min. : 0.0000
## NA's:201 NA's:201 NA's:201 1st Qu.: 0.0000
## Median : 0.0000
## Mean : 0.5373
## 3rd Qu.: 0.0000
## Max. :74.0000
## d_5p_del d_3p_del j_5p_del v_5p_del j_3p_del vd_insertion
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0 Length:201
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 Class :character
## Median :0 Median :0 Median :0 Median :0 Median :0 Mode :character
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
## dj_insertion fv_insertion jf_insertion mutated_invariants
## Mode:logical Mode:logical Mode:logical Length:201
## NA's:201 NA's:201 NA's:201 Class :character
## Mode :character
##
##
##
## in_frames stops k_v k_d
## Length:201 Length:201 Length:201 Length:201
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## padlefts padrights all_matches mut_freqs.1
## Mode:logical Mode:logical Length:201 Min. :0.000000
## NA's:201 NA's:201 Class :character 1st Qu.:0.002475
## Mode :character Median :0.002475
## Mean :0.005742
## 3rd Qu.:0.004951
## Max. :0.057402
We need to do this because read.csv
doesn’t vectorize.
Check what happens if you do read_file(file_names)
(you’ll
have to change data_dir
to wherever you saved the data
to).
Syntax: apply(X = mat, MARGIN = dim, FUN = fun, ...)
mat
: The matrix or data frame you want to apply the
function to.
dim
: Either 1 or 2: 1 means apply the function to
each row in mat
, 2 means apply the function to each column
in mat
. (1 and 2 are related to the way the dimensions are
laid out: the first dimension in the matrix is the number of rows, the
second is the number of columns.)
fun
: The function you want to apply to each row or
to each column. It should take a vector as its first argument.
...
: Extra arguments to be passed to
fun
.
As with lapply
and sapply
, fun
will be called with each row or each column of mat
as its
first argument, along with any extra arguments specified after the
function.
Some simple examples:
## Warning in log(X): NaNs produced
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.098612 NaN 1.609438 NaN 0.000000
## [2,] NaN 0.000000 NaN 1.386294 1.098612
## [3,] 0.000000 1.609438 NaN -Inf NaN
## [4,] NaN NaN NaN 1.386294 NaN
## [1] 0.000000 0.000000 -Inf 1.386294
Be careful when using apply on rows of a data frame: apply assumes the argument you pass to the function is a vector, and will coerce the rows of the data frame to vectors to make it so.
steak_combinations <- data.frame(
temp = c(117, 120, 135, 105),
type = c("rare", "med_rare", "med_rare", "rare"))
steak_directions_for_apply <- function(temp_and_type) {
temp <- temp_and_type[1]
steak_type <- temp_and_type[2]
if(steak_type == "rare" & temp > 115) {
return("take your steak off!")
} else if(steak_type == "med_rare" & temp > 125) {
return("take your steak off!")
}
"you can keep cooking"
}
## this works, but it really shouldn't
apply(steak_combinations, 1, steak_directions_for_apply)
## [1] "take your steak off!" "you can keep cooking" "take your steak off!"
## [4] "you can keep cooking"
One that actually doesn’t work:
steak_admonishments_for_apply <- function(temp_and_type) {
temp <- temp_and_type[1]
steak_type <- temp_and_type[2]
if(steak_type == "rare" & temp > 115) {
admonishment <- sprintf("Your steak is overcooked by %f degrees!", temp - 115)
return(admonhishment)
} else if(steak_type == "med_rare" & temp > 125) {
admonishment <- sprintf("Your steak is overcooked by %f degrees!", temp - 125)
return(admonishment)
}
"Your steak is not overcooked yet"
}
apply(steak_combinations, 1, steak_admonishments_for_apply)
## Error in temp - 115: non-numeric argument to binary operator
If you want to apply a function that uses columns of a data frame
that are of different types, better/safer to use
mapply
:
mapply(FUN = f, vec1, vec2, ..., vecN)
f
is the function you want to apply
Output is a list, ith element will be f(vec1[i], vec2[i],
..., vecN[i])
steak_directions <- function(steak_temp, steak_type) {
if(steak_type == "rare" & steak_temp > 115) {
return("take your steak off!")
} else if(steak_type == "med_rare" & steak_temp > 125) {
return("take your steak off!")
}
"you can keep cooking"
}
mapply(FUN = steak_directions, steak_combinations$temp, steak_combinations$type)
## [1] "take your steak off!" "you can keep cooking" "take your steak off!"
## [4] "you can keep cooking"
steak_admonishments <- function(steak_temp, steak_type) {
if(steak_type == "rare" & steak_temp > 115) {
admonishment <- sprintf("Your steak is overcooked by %i degrees!", steak_temp - 115)
return(admonishment)
} else if(steak_type == "med_rare" & steak_temp > 125) {
admonishment <- sprintf("Your steak is overcooked by %i degrees!", steak_temp - 125)
return(admonishment)
}
"Your steak is not overcooked yet."
}
mapply(FUN = steak_admonishments, steak_combinations$temp, steak_combinations$type)
## [1] "Your steak is overcooked by 2 degrees!"
## [2] "Your steak is not overcooked yet."
## [3] "Your steak is overcooked by 10 degrees!"
## [4] "Your steak is not overcooked yet."
The *apply
functions we’ve seen so far have applied
a function element-by-element to a vector or matrix.
What if we need to apply a function not to individual elements/rows/columns, but to subsets of different sizes?
Turns out this is common enough that it has a name and a bunch of built-in functions.
The programming pattern is pretty clear from the name, what we want to do is:
split the data into groups
apply a function to each of the groups
combine the results of applying the function to each group into an output dataset
Imagine how you would do this if you only had for-loops.
You would have to first
Compute what size the output dataset should be
Create a data structure to hold the output.
Then, for each level of the grouping factor:
Compute how many data elements correspond to that grouping level
Create a data structure to hold the data subset
Loop through the data to extract the correct subset
Apply your function to that subset and put the output in the data structure your created at the beginning.
Compare with split/apply/combine abstraction:
Specify data
Specify grouping variable for split
Specify function to apply to each split
The abstraction is useful both for writing and reading the resulting code: you think about what you want to do instead of how to do it.
Simple case: we have a vector, and a grouping factor, we want to apply a function to groups of elements of the vector defined by the grouping factor
Solution is tapply
tapply(X = x, INDEX = f, FUN = fun)
Syntax:
x
: A vector with the data.
f
: A factor variable describing how the data should
be split up.
fun
: The function you want to apply to each subset
of the data.
Example: data on police stops in New York City in 1998–1999, information on the number of stops as categorized by certain variables.
In particular, we have counts of police stops for all combinations of
precinct: 75 total
eth: Ethnicity of the person stopped, three possibilities, and
crime: The type of crime, four possibilities
frisk <- read.table("http://www.stat.columbia.edu/~gelman/arm/examples/police/frisk_with_noise.dat",
skip = 6, header = TRUE)
head(frisk)
## stops pop past.arrests precinct eth crime
## 1 75 1720 191 1 1 1
## 2 36 1720 57 1 1 2
## 3 74 1720 599 1 1 3
## 4 17 1720 133 1 1 4
## 5 37 1368 62 1 2 1
## 6 39 1368 27 1 2 2
Suppose we want to get the total number of stops in each precinct.
We can use tapply
with the sum
function to
do so:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 385 347 1603 1411 1600 1297 649 1572 285 1156 838 884 2215 1165 2373 965
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005 902 2888 2157 1805 1041
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2966 2810 724 1189 997 1808 2464 859 1979 2679 1645 1669 2953 3443 1001 1574
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 1041 3054 1208 2116 1216 995 1140 572 856 2671 1147 2941 1625 1635 1248 1351
## 65 66 67 68 69 70 71 72 73 74 75
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327 322
Just as a check, make sure that the numbers match up:
## [1] 385
And remember how much better this is than iterating using a for loop would have been:
## find the number of precincts and make a vector of the right length to store the output
precincts <- unique(frisk$precinct)
stops_per_precinct <- numeric(length(precincts))
## iterate over precinct, get the subset for each one, compute the number of stops, and store
for(i in 1:length(precincts)) {
precinct_subset <- subset(frisk, precinct == precincts[i])
precinct_stops <- sum(precinct_subset$stops)
stops_per_precinct[i] <- precinct_stops
}
stops_per_precinct
## [1] 385 347 1603 1411 1600 1297 649 1572 285 1156 838 884 2215 1165 2373
## [16] 965 946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005 902 2888 2157
## [31] 1805 1041 2966 2810 724 1189 997 1808 2464 859 1979 2679 1645 1669 2953
## [46] 3443 1001 1574 1041 3054 1208 2116 1216 995 1140 572 856 2671 1147 2941
## [61] 1625 1635 1248 1351 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327 322
Splitting a data frame into groups of rows and applying a function to the groups:
by(data = df, INDICES = fact, FUN = fun)
df
: A data frame containing the data you want to
split up and apply a function to.
fact
: A factor variable describing the groups you
want to split df
into.
fun
: The function you want to apply to each group.
It should take as its first argument a data frame.
stops_and_past_arrests_correlation <- function(frisk_subset) {
cor(frisk_subset$stops, frisk_subset$past.arrests)
}
within_precinct_cors <- by(data = frisk,
INDICES = frisk$precinct,
FUN = stops_and_past_arrests_correlation)
head(within_precinct_cors)
## frisk$precinct
## 1 2 3 4 5 6
## 0.60479730 0.01659348 -0.21949242 0.12046523 -0.06892661 0.54151306
Here we only need the stops_and_past_arrests_correlation
function in the context of by
, so there’s no need to define
it in the global environment.
Instead we can specify it as an anonymous function:
within_precinct_cors <- by(data = frisk,
INDICES = frisk$precinct,
FUN = function(frisk_subset) {
cor(frisk_subset$stops, frisk_subset$past.arrests)
})
head(within_precinct_cors)
## frisk$precinct
## 1 2 3 4 5 6
## 0.60479730 0.01659348 -0.21949242 0.12046523 -0.06892661 0.54151306
A more complex example:
data(diamonds)
get_diamond_coefficients <- function(data_subset) {
diamond_lm <- lm(log(price) ~ carat, data = data_subset)
diamond_coefficients <- coef(diamond_lm)
return(diamond_coefficients)
}
## by does the split and apply steps, and an ugly version of a combine step
out <- by(data = diamonds, INDICES = diamonds$color, FUN = get_diamond_coefficients)
The output is as a list with some extra attributes (the class is
by
, as you can check).
The form is not ideal on its own, usually you’ll want to simplify it. You have a couple of options, including the examples below:
## D E F G H I J
## (Intercept) 6.048811 6.034513 6.088442 6.109554 6.180284 6.175315 6.254074
## carat 2.383864 2.348335 2.272790 2.178489 1.906300 1.799199 1.627947
## (Intercept) carat
## D 6.048811 2.383864
## E 6.034513 2.348335
## F 6.088442 2.272790
## G 6.109554 2.178489
## H 6.180284 1.906300
## I 6.175315 1.799199
## J 6.254074 1.627947
## it's kind of bad that the splitting variable is only available as row names
## we can fix that but it's a bit of a pain
diamond_coef_matrix <- do.call(rbind, out)
diamond_coef_df <- data.frame(diamond_coef_matrix,
color = attributes(out)$dimnames$`diamonds$color`)
diamond_coef_df
## X.Intercept. carat color
## D 6.048811 2.383864 D
## E 6.034513 2.348335 E
## F 6.088442 2.272790 F
## G 6.109554 2.178489 G
## H 6.180284 1.906300 H
## I 6.175315 1.799199 I
## J 6.254074 1.627947 J
by
can also be used with multiple grouping factors:
## color: D
## cut: Fair
## (Intercept) carat
## 6.723804 1.514938
## ------------------------------------------------------------
## color: E
## cut: Fair
## (Intercept) carat
## 6.246003 1.955816
## ------------------------------------------------------------
## color: F
## cut: Fair
## (Intercept) carat
## 6.466073 1.637304
## ------------------------------------------------------------
## color: G
## cut: Fair
## (Intercept) carat
## 6.659407 1.358722
## ------------------------------------------------------------
## color: H
## cut: Fair
## (Intercept) carat
## 6.921231 1.122101
## ------------------------------------------------------------
## color: I
## cut: Fair
## (Intercept) carat
## 6.759325 1.202279
## ------------------------------------------------------------
## color: J
## cut: Fair
## (Intercept) carat
## 7.0504140 0.8761416
## ------------------------------------------------------------
## color: D
## cut: Good
## (Intercept) carat
## 5.925962 2.415309
## ------------------------------------------------------------
## color: E
## cut: Good
## (Intercept) carat
## 5.976449 2.322896
## ------------------------------------------------------------
## color: F
## cut: Good
## (Intercept) carat
## 6.016284 2.273226
## ------------------------------------------------------------
## color: G
## cut: Good
## (Intercept) carat
## 6.180670 2.037179
## ------------------------------------------------------------
## color: H
## cut: Good
## (Intercept) carat
## 6.106822 1.952841
## ------------------------------------------------------------
## color: I
## cut: Good
## (Intercept) carat
## 6.182043 1.762664
## ------------------------------------------------------------
## color: J
## cut: Good
## (Intercept) carat
## 6.133176 1.728532
## ------------------------------------------------------------
## color: D
## cut: Very Good
## (Intercept) carat
## 5.953898 2.487391
## ------------------------------------------------------------
## color: E
## cut: Very Good
## (Intercept) carat
## 5.909810 2.480529
## ------------------------------------------------------------
## color: F
## cut: Very Good
## (Intercept) carat
## 5.947350 2.446917
## ------------------------------------------------------------
## color: G
## cut: Very Good
## (Intercept) carat
## 5.965330 2.343517
## ------------------------------------------------------------
## color: H
## cut: Very Good
## (Intercept) carat
## 6.110954 1.993643
## ------------------------------------------------------------
## color: I
## cut: Very Good
## (Intercept) carat
## 6.207955 1.807934
## ------------------------------------------------------------
## color: J
## cut: Very Good
## (Intercept) carat
## 6.211086 1.696800
## ------------------------------------------------------------
## color: D
## cut: Premium
## (Intercept) carat
## 6.109571 2.258907
## ------------------------------------------------------------
## color: E
## cut: Premium
## (Intercept) carat
## 6.133275 2.185096
## ------------------------------------------------------------
## color: F
## cut: Premium
## (Intercept) carat
## 6.189666 2.124037
## ------------------------------------------------------------
## color: G
## cut: Premium
## (Intercept) carat
## 6.181514 2.052932
## ------------------------------------------------------------
## color: H
## cut: Premium
## (Intercept) carat
## 6.270586 1.809367
## ------------------------------------------------------------
## color: I
## cut: Premium
## (Intercept) carat
## 6.229239 1.723257
## ------------------------------------------------------------
## color: J
## cut: Premium
## (Intercept) carat
## 6.305114 1.578440
## ------------------------------------------------------------
## color: D
## cut: Ideal
## (Intercept) carat
## 5.970971 2.626088
## ------------------------------------------------------------
## color: E
## cut: Ideal
## (Intercept) carat
## 5.989833 2.526368
## ------------------------------------------------------------
## color: F
## cut: Ideal
## (Intercept) carat
## 6.052536 2.408753
## ------------------------------------------------------------
## color: G
## cut: Ideal
## (Intercept) carat
## 6.032829 2.367710
## ------------------------------------------------------------
## color: H
## cut: Ideal
## (Intercept) carat
## 6.075676 2.071994
## ------------------------------------------------------------
## color: I
## cut: Ideal
## (Intercept) carat
## 6.071246 1.932255
## ------------------------------------------------------------
## color: J
## cut: Ideal
## (Intercept) carat
## 6.104631 1.784315
But how do you get at which coefficients correspond to which factor groups?
split
does just the splitting part of
split/apply/combine
Syntax: split(x = data, f = fact)
data
: Your data, either a vector or a data frame,
that you want to divide into groups.
fact
: A factor variable used to split
data
. Length is equal to the length of dat
if
it is a vector or equal to the number of rows of data
if it
is a data frame.
Output:
A list of length equal to the number of levels in
fact
Each element of the list is either a vector or a data frame,
depending on what data
was, containing the elements of
data
corresponding to one of the levels of
fact
.
We can re-do the stop-and-frisk example that we used
tapply
for with split
followed by
lapply
or sapply
frisk_split_by_precinct <- split(x = frisk$stops, f = frisk$precinct)
## check that this has length 75, the number of precincts
length(frisk_split_by_precinct)
## [1] 75
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 385 347 1603 1411 1600 1297 649 1572 285 1156 838 884 2215 1165 2373 965
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 946 2359 1311 1566 1613 4030 2334 2622 3685 1541 2005 902 2888 2157 1805 1041
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2966 2810 724 1189 997 1808 2464 859 1979 2679 1645 1669 2953 3443 1001 1574
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 1041 3054 1208 2116 1216 995 1140 572 856 2671 1147 2941 1625 1635 1248 1351
## 65 66 67 68 69 70 71 72 73 74 75
## 2290 2941 2430 1109 1267 2476 2922 3560 3650 1327 322
In general, base R apply-family functions don’t produce nice output from the combine step.
We’ll see implementations of split/apply/combine next time that do a nicer job.