Speeding up loops in R

This is from a session I did with the UBC R Study Group. Loops can be convenient for applying the same steps to big/distributed datasets, running simulations, and writing your own resampling/bootstrapping analyses. Here are some ways to make them faster.

1. Don’t grow things in your loops.
2. Vectorize where possible, i.e., pull things out of the loop.
3. Do less work in the loop if you can.

1. Don’t grow things in your loops.
# These two examples do the exact same thing:
hit <- NA
system.time(for(i in 1:50000){ hit[i] <- 0.3 })
head(hit)

hit2 <- rep(NA, 50000)
system.time(for(i in 1:50000){ hit2[i] <- 0.3 })
head(hit2)

The difference is that in the first example the hit vector gets bigger with each iteration of the loop, whereas the second example stores results in a vector (hit2) that was initialized outside the loop at its final length. Growing a vector can force R to allocate new memory and copy the existing values over, so the second option can be a big time saver.

# aside: three ways to time your code
# i) enclose it in system.time()
system.time(mean(1:1000000))

# ii) or use Sys.time(). I like using this with big chunks. Run these lines together:
(start <- Sys.time())
mean(1:1000000)
Sys.time() - start # note that it adds a bit of time, relative to system.time()

# iii) microbenchmark repeats the process to get a distribution of times, for greater accuracy.
# You will have to install the package. Beware: because it repeats, it will take more time!
library(microbenchmark)
microbenchmark(mean(1:1000000))

# you can pass microbenchmark lots of things, and change the number of samples/repeats:
(rnormtime <- microbenchmark(rnorm(1000), rnorm(10000), rnorm(100000), rnorm(1000000), times=50))
head(rnormtime$time) # retrieve the raw timings (in nanoseconds)

Back to how to speed things up: if you’re storing results somewhere, initialize the vectors that will hold them outside of the loop, at their full length. In other words, don’t use rbind(), c() or append() to grow things within a loop.
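When each iteration produces a whole row rather than a single number, the same idea still applies. Here is a minimal sketch, assuming a made-up simulation that returns a mean and an sd per replicate (the names n_rep, grown, rows and filled are just for illustration): fill a preallocated list by index and call rbind() once at the end, instead of calling it inside the loop.

```r
# Hypothetical simulation: 1000 replicates, each yielding a mean and an sd.
n_rep <- 1000

# Slow pattern (shown for contrast): the data frame is rebuilt on every rbind().
set.seed(1)
grown <- data.frame()
for (i in seq_len(n_rep)) {
  x <- rnorm(20)
  grown <- rbind(grown, data.frame(mean = mean(x), sd = sd(x)))
}

# Faster pattern: preallocate a list, fill it by index, bind once at the end.
set.seed(1)
rows <- vector("list", n_rep)
for (i in seq_len(n_rep)) {
  x <- rnorm(20)
  rows[[i]] <- data.frame(mean = mean(x), sd = sd(x))
}
filled <- do.call(rbind, rows)
```

Both versions should give the same values; wrapping each loop in system.time() is an easy way to see how much the difference matters on your machine.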

2. Vectorize
Sometimes you don’t need a loop at all. Here’s an example using the gapminder global demographic dataset (you’ll need the gapminder package):

library(gapminder)
head(gapminder)

# Suppose we want to flag every year in which a country's life expectancy went down as a bad year.
# We can do this with a loop:
start <- Sys.time()
n <- nrow(gapminder)
gapminder$badyear <- NA
for(i in 2:n){
  if((gapminder$lifeExp[i] < gapminder$lifeExp[i-1]) && (gapminder$country[i] == gapminder$country[i-1])){
    gapminder$badyear[i] <- 'bad'
  } else {
    gapminder$badyear[i] <- 'nbad'
  }
}
Sys.time() - start
summary(factor(gapminder$badyear))

# Alternatively, we could get rid of the loop altogether by lining up each row with the previous row:
start <- Sys.time()
gapminder$lifeExpprev <- c(NA, gapminder$lifeExp[-n])
gapminder$countryprev <- c(NA, as.character(gapminder$country[-n]))
gapminder$badyear2 <- ifelse(gapminder$countryprev == gapminder$country & gapminder$lifeExp < gapminder$lifeExpprev, 'bad', 'nbad')
Sys.time() - start
summary(factor(gapminder$badyear2))

The second example is more typing, and it’s probably trickier to set up and read, but it would save a lot of time with a really big dataset, because the comparison runs once over whole vectors instead of once per row. So wherever possible, do work outside of loops. Often conditionals can be set up beforehand.
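When you swap a loop for a vectorized version, it’s worth checking on a small example that the two agree. Here is a rough sketch using toy data (the data frame d and the flag names are made up for illustration, not part of gapminder):

```r
# Toy data (hypothetical): two groups, four measurements each.
d <- data.frame(
  group = rep(c("a", "b"), each = 4),
  value = c(1, 3, 2, 5, 4, 4, 6, 1),
  stringsAsFactors = FALSE
)
n <- nrow(d)

# Loop version: the condition is evaluated once per row.
flag_loop <- rep(NA_character_, n)
for (i in 2:n) {
  if (d$group[i] == d$group[i - 1] && d$value[i] < d$value[i - 1]) {
    flag_loop[i] <- "down"
  } else {
    flag_loop[i] <- "not down"
  }
}

# Vectorized version: build the "previous row" vectors once,
# then evaluate the whole condition in a single call.
prev_value <- c(NA, d$value[-n])
prev_group <- c(NA, d$group[-n])
flag_vec <- ifelse(d$group == prev_group & d$value < prev_value,
                   "down", "not down")

# Check that the two approaches give the same answer.
identical(flag_loop, flag_vec)
```

The first row has no previous row, so both versions leave it as NA.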

3. Do less work in the loop
Here are a few examples of this one.

# Suppose we want to simulate a bunch of t-tests:
ys <- matrix(rnorm(n=50*5000), nrow=50, ncol=5000)
(group <- rep(1:2, each=25))
t.test(ys[,1] ~ group) # example of what we want to do. Now repeat.

# Surely it won't matter how we specify the t.test...
system.time( for(i in 1:1000) t.test(ys[,i] ~ group) )
system.time( for(i in 1:1000) t.test(ys[group==1,i], ys[group==2,i]) )
# Turns out it makes a big difference. Supplying a formula makes t.test() do more work
# that is hidden from us (building a model frame and splitting the data) before it runs the same test.

# Recall this example:
start <- Sys.time()
hit2 <- rep(NA, 500000)
for(i in 1:500000){ hit2[i] <- rnorm(1) > 0.3 }
Sys.time() - start
summary(hit2)

# It's slightly faster to do this instead:
start <- Sys.time()
hit3 <- rep(FALSE, 500000)
for(i in 1:500000){ if(rnorm(1) > 0.3) hit3[i] <- TRUE }
Sys.time() - start
summary(hit3)
# The likely reason is that the second example only assigns when the condition is met, meaning less work.

# Let's try two methods of storing simulated t-test results: in a data frame vs. in two vectors
results.df <- data.frame(tstat=rep(NA,5000), pval=NA)
tstat <- rep(NA, 5000)
pval <- rep(NA, 5000)

system.time(
  for(i in 1:5000){
    test1 <- t.test(ys[group==1,i], ys[group==2,i])
    results.df$tstat[i] <- test1$statistic
    results.df$pval[i] <- test1$p.value
  }
) # takes my computer about 1.5 s

system.time(
  for(i in 1:5000){
    test1 <- t.test(ys[group==1,i], ys[group==2,i])
    tstat[i] <- test1$statistic
    pval[i] <- test1$p.value
  }
) # just over 1 s

So there is an advantage to indexing vectors over data frames. Why? Even ‘$’ and [] are functions in R, and the data frame versions do more work than the vector versions. This can add up if you have a big loop that does a lot of indexing. In that case, it is better to work with vectors and then add them back into a data frame afterwards, outside the loop.
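Put together, the pattern looks something like this. It is only a sketch with a small, made-up number of simulations (n_sim is an assumption for illustration): store results in plain vectors inside the loop, then build the data frame once at the end.

```r
set.seed(42)
n_sim <- 200                                 # hypothetical number of simulations
ys <- matrix(rnorm(50 * n_sim), nrow = 50)   # one simulated sample per column
group <- rep(1:2, each = 25)

# Preallocate plain numeric vectors for the results.
tstat <- numeric(n_sim)
pval  <- numeric(n_sim)

for (i in seq_len(n_sim)) {
  test1 <- t.test(ys[group == 1, i], ys[group == 2, i])
  tstat[i] <- test1$statistic
  pval[i]  <- test1$p.value
}

# Assemble the data frame once, after the loop.
results <- data.frame(tstat = tstat, pval = pval)
head(results)
```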

Resources
Patrick Burns’ R Inferno
Hadley Wickham’s Advanced R