Reading: Matloff, Chapter 14
Logistics:
Midterm next week,
One more lecture on programming,
Then on to algorithms
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Sir Tony Hoare, popularized by Donald Knuth
You have:
Code that is correct
Code that is slow for the application you want
You want:
Code that is correct
Code that runs in a reasonable amount of time
Two steps:
Find what the bottleneck is
Rewrite the code for the bottleneck to make it run faster
What to do?
Make a smaller example that runs quickly
Write tests (formal or informal) for the output of the function
Make a new function to implement the same algorithm, rewritten to be (hopefully) faster
Check on the test cases that it gives the same output
Check whether the new version is faster
Base R function for this task is Rprof.
Many other packages, e.g. profvis (I believe what RStudio uses now), are based on Rprof, usually with more nicely formatted output.
How does Rprof work?
Once every set amount of time (default .02 seconds), it stops the interpreter and checks what function is being called.
Output is a list of function calls.
Output is formatted to indicate what takes the most time.
The result is non-deterministic; it depends a bit on what else your computer is doing.
If we don’t sample frequently enough, we don’t get very useful output.
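A minimal sketch of calling Rprof directly (the output file name and the workload below are arbitrary placeholders); summaryRprof tabulates where the sampled time went:
Rprof("profile.out", interval = 0.02)  # start sampling every 0.02 seconds
for (i in 1:200) {
  x = matrix(rnorm(1e4), nrow = 100)
  xtx_inv = solve(crossprod(x))        # some work to profile
}
Rprof(NULL)                            # stop profiling
summaryRprof("profile.out")            # which functions took the most time?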
Several packages for testing alternative implementations:
The microbenchmark function in the microbenchmark package runs the same code a large number of times.
The mark function in the bench package does the same thing but with a little more information and some extra plotting facilities.
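A small sketch of what calls to each look like, assuming both packages are installed; the expressions being compared here are just placeholders:
library(microbenchmark)
library(bench)
x = runif(1e4)
microbenchmark(sqrt(x), x^0.5, times = 100)  # repeated timings, summarised
bench::mark(sqrt(x), x^0.5)                  # also checks that the results match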
Remove extra work
Vectorise
Avoid copies
Rewrite your R code in C/C++/Fortran
If you know something about the input to a function, you can speed up your code by skipping some of the steps R takes in checking what kind of data is used.
Here we’re bypassing the method dispatch for mean.
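A sketch of a call that could produce output like the one below (the vector x is a placeholder):
library(microbenchmark)
x = runif(100)
microbenchmark(mean(x), mean.default(x))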
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## mean(x) 2123 2222 2719.09 2273.0 2361.0 15633 100 b
## mean.default(x) 977 1023 1259.76 1059.5 1101.5 15034 100 a
The as.data.frame function performs many checks and is not optimized for performance.
Suppose that we have a list and want to change it to a data frame.
If we’re sure it’s formatted correctly, we can use the quickdf function.
quickdf = function(l) {
  class(l) = "data.frame"
  attr(l, "row.names") = 1:length(l[[1]])
  l
}
l = lapply(1:26, function(i) runif(1e3))
names(l) = letters
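A sketch of the comparison that could produce output like the one below:
microbenchmark(
  quick_df = quickdf(l),
  as.data.frame = as.data.frame(l)
)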
## Unit: microseconds
##          expr     min       lq       mean   median        uq      max neval cld
##      quick_df   9.795  11.2755   37.25248  16.4895   19.2925 2100.902   100   a
## as.data.frame 834.235 898.4920 1026.75537 958.2835 1039.0330 2727.082   100   b
But we need to be sure the input is formatted correctly, because it does no checks:
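For example (a sketch using one possible malformed input, with columns of unequal length):
quickdf(list(x = 1, y = 1:2))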
## Warning in format.data.frame(x, digits = digits, na.encode = FALSE):
## corrupt data frame: columns will be truncated or padded with NAs
## x y
## 1 1 1
Growing a string with paste0 inside a loop copies the accumulated result on every iteration; a single vectorised paste call avoids this.
random_string = function() {
  paste(sample(letters, 50, replace = TRUE), collapse = "")
}
strings10 = replicate(10, random_string())
strings100 = replicate(100, random_string())
collapse = function(xs) {
  out = ""
  for (x in xs) {
    out = paste0(out, x)
  }
  out
}
microbenchmark(
  loop10 = collapse(strings10),
  loop100 = collapse(strings100),
  vec10 = paste(strings10, collapse = ""),
  vec100 = paste(strings100, collapse = "")
)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## loop10 14.630 15.3790 17.57578 16.324 17.5530 48.325 100 a
## loop100 517.291 523.8855 605.96058 568.129 597.0755 2793.477 100 b
## vec10 3.596 3.9195 4.70854 4.357 5.1320 17.976 100 a
## vec100 28.755 30.3115 33.98348 31.691 34.5485 102.933 100 a
Vectorisation will sometimes give speed improvements and sometimes will not.
Speed improvements happen primarily when it leads to calling compiled C code.
x = matrix(rnorm(15), nrow = 3, ncol = 5)
row_means_loop = function(x) {
  row_means = numeric(nrow(x))
  for (i in 1:nrow(x)) {
    row_sum = 0
    for (j in 1:ncol(x)) {
      row_sum = row_sum + x[i, j]
    }
    row_means[i] = row_sum / ncol(x)
  }
  return(row_means)
}
microbenchmark(
  no_vectorization = row_means_loop(x),
  apply = apply(x, 1, mean),
  c_version = rowMeans(x),
  times = 1000
)
## Unit: microseconds
##              expr    min      lq      mean median      uq      max neval cld
##  no_vectorization  4.524  5.7855 13.008270  6.874  7.5575 6113.275  1000   a
##             apply 18.465 21.7525 27.769083 28.535 30.0870   97.173  1000   b
##         c_version  2.348  3.2730  3.939602  3.823  4.4555   25.342  1000   a
Another example: a problem in probability.
Urn 1 contains 10 blue marbles and 8 yellow marbles
Urn 2 contains 6 blue marbles and 6 yellow marbles
We draw one marble uniformly at random from urn 1, place it in urn 2, and then draw a marble uniformly at random from urn 2.
What is the probability that the second marble is blue?
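As a check on the simulations below, the exact answer follows by conditioning on the first draw:
## P(second blue) = P(first blue) * 7/13 + P(first yellow) * 6/13
(10/18) * (7/13) + (8/18) * (6/13)  # = 118/234, approximately 0.504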
First attempt:
# perform nreps repetitions of the marble experiment, to estimate
# P(pick blue from Urn 2)
sim1 = function(nreps) {
  nb1 = 10   # 10 blue marbles in Urn 1
  n1 = 18    # number of marbles in Urn 1 at 1st pick
  n2 = 13    # number of marbles in Urn 2 at 2nd pick
  count = 0  # number of repetitions in which we get blue from Urn 2
  for (i in 1:nreps) {
    nb2 = 6  # 6 blue marbles originally in Urn 2
    ## pick from Urn 1 and put in Urn 2; is it blue?
    if (runif(1) < nb1/n1) nb2 = nb2 + 1
    ## pick from Urn 2; is it blue?
    if (runif(1) < nb2/n2) count = count + 1
  }
  return(count/nreps)  # est. P(pick blue from Urn 2)
}
Let’s try pre-generating all of our random numbers and removing the for loop:
sim2 = function(nreps) {
  nb1 = 10
  nb2 = 6
  n1 = 18
  n2 = 13
  ## pre-generate all our random numbers, one row per repetition
  u = matrix(c(runif(2 * nreps)), nrow = nreps, ncol = 2)
  ## define simfun for use in apply(); simulates one repetition
  simfun = function(rw) {
    ## rw ("row") is a pair of random numbers
    ## choose from Urn 1
    if (rw[1] < nb1 / n1) nb2 = nb2 + 1
    ## choose from Urn 2, and return boolean on choosing blue
    return(rw[2] < nb2 / n2)
  }
  z = apply(u, 1, simfun)
  return(mean(z))
}
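A sketch of the comparison that could produce output like the one below:
microbenchmark(sim1(100), sim2(100))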
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sim1(100) 259.107 271.3620 478.8515 285.0615 341.2680 11297.506 100 a
## sim2(100) 128.224 135.2935 235.9363 141.4335 158.3995 8446.768 100 a
The timings here are different from those in Matloff. I believe this is because more recent versions of R byte-compile functions that you write, and this was not true for the version of R used in the book.
Take 3:
sim3 = function(nreps) {
  nb1 = 10
  nb2 = 6
  n1 = 18
  n2 = 13
  u = matrix(c(runif(2 * nreps)), nrow = nreps, ncol = 2)
  ## set up the condition vector: either the first draw is blue (so Urn 2 has
  ## nb2 + 1 blue marbles) and the second draw is blue, or the first draw is
  ## yellow (Urn 2 still has nb2 blue marbles) and the second draw is blue
  cndtn =
    u[, 1] <= nb1 / n1 & u[, 2] <= (nb2 + 1) / n2 |
    u[, 1] >  nb1 / n1 & u[, 2] <= nb2 / n2
  return(mean(cndtn))
}
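A sketch of the comparison that could produce output like the one below:
microbenchmark(sim1(100), sim2(100), sim3(100))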
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sim1(100) 261.157 279.9745 484.1770 411.6000 459.8785 10179.709 100 b
## sim2(100) 127.339 137.5345 212.5064 185.0070 249.0700 1054.157 100 a
## sim3(100) 19.062 23.3220 122.1441 28.4635 36.6295 8939.858 100 a
If you know something about your input that R has to check for, you can save time by bypassing the checks
Vectorised functions are sometimes faster, sometimes not
Matrix operations generally call C and are therefore fast
Compiled C code is generally faster than equivalent R code