My learning curve with the statistical software R has been a long one, but one of the steepest and most exciting times was learning how to write functions and loops. Suddenly I could do all kinds of things that used to seem impossible. Since then, I’ve learned to avoid for loops whenever possible. Why? Because doing things serially is slow. With R, you can almost always reduce a big loop to just a few lines of vectorized code.

But there’s one situation where I can’t avoid the dreaded for loop. Recently, I learned how to make for loops run hundreds of times faster in that situation.

The situation I have in mind comes up fairly often when working with large behavioural datasets. More specifically, behaviour datasets that involve time series. Imagine, for instance, a series of males that a female visited over a number of days. What I find I often want to do is to label rows conditional on a value in the previous row. Here’s another example:

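# build a fake dataset: 10 birds x 10 flights (reps) x 1000 video frames
mydata <- data.frame(matrix(NA, ncol=3, nrow=100000))
names(mydata) <- c('rep', 'bird', 'direction')
mydata[,1:2] <- expand.grid(rep(1:10, each=1000), 1:10) # rep cycles 1-10 within each bird
mydata$time <- rep(1:1000, 100) # frame number within each flight
mydata$direction <- sample(c(0, 1), 100000, replace=TRUE) # random direction for each frame
head(mydata)
summary(mydata)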

Imagine this fake dataset is a series of 10 flights (rep 1-10) by 10 different birds (bird 1-10). Each row represents one frame in the video recording of a flight. The frame # is indicated by the column “time”.

table(mydata$bird, mydata$rep) # we have 1000 entries per flight

Direction (either 1 or 0) gives the direction of the bird’s travel (to or from the feeder). Suppose we want to create an additional column to divide each flight rep into separate events, with the event number changing whenever the bird’s direction changes. An obvious way to do this is with a for loop, where we examine each row of the dataframe in sequence, filling in the new event column conditional on the value of the current and previous rows (note: it’s best to execute this chunk all at once to calculate the elapsed time at the end… takes a little over a minute):


# we'll time our loop with proc.time(). Here we use it like a stopwatch: initialize it to our starting time, run the desired code, and then find the time difference
mydata$event <- 1
ptm <- proc.time() # start clock
for(i in 2:nrow(mydata)){
  if(mydata$bird[i] == mydata$bird[i-1] && mydata$rep[i] == mydata$rep[i-1] && mydata$direction[i] == mydata$direction[i-1]){
    mydata$event[i] <- mydata$event[i-1] # same bird, rep and direction: same event
  } else {
    mydata$event[i] <- mydata$event[i-1]+1 # something changed: start a new event
  }
  print(i) # progress indicator
}
proc.time() - ptm # stop the clock and find the difference
# this takes my computer just over 92s elapsed time

We can speed it up a little by removing the print() call that serves as a progress indicator:

# remove the print command
mydata$event <- 1
ptm <- proc.time()
for(i in 2:nrow(mydata)){
  if(mydata$bird[i] == mydata$bird[i-1] && mydata$rep[i] == mydata$rep[i-1] && mydata$direction[i] == mydata$direction[i-1]){
    mydata$event[i] <- mydata$event[i-1]
  } else {
    mydata$event[i] <- mydata$event[i-1]+1
  }
}
proc.time() - ptm # without the print(), we're down to 78s
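
If some feedback during a long run is still useful, one option is to report only every so often rather than on every iteration. This is a minimal sketch and was not part of the timings above; the names `report_every` and `n_reports` are hypothetical, and the loop body is left as a placeholder.

```r
# Hypothetical sketch: report progress every `report_every` iterations
# instead of on every pass through the loop.
n <- 100000
report_every <- 10000   # assumed interval; tune to taste
n_reports <- 0          # counts how many progress messages were emitted

for (i in 2:n) {
  # ... per-row work would go here ...
  if (i %% report_every == 0) {
    message("processed row ", i, " of ", n)
    n_reports <- n_reports + 1
  }
}

n_reports
```

With these settings the loop emits ten messages instead of 99,999 printed lines, so the cost of reporting becomes negligible next to the work in the loop body.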

This is not too bad with 100,000 rows, but I have experiments with tens of millions of rows, and I found myself waiting hours to run these types of things. Annoying? Yes, but R is slow, right? Perhaps the fact that my laptop can render high-definition videos without much trouble should have clued me in that it didn’t have to be this way.

Then I came across an example online that proved this doesn’t have to be the case. The reason is that indexing dataframes/matrices in R is slow, whereas indexing vectors is MUCH faster.

# here's how to speed things up a lot:
event <- rep(1, 100000) # initialize a vector to store the event outside of the dataframe
ptm <- proc.time()
for(i in 2:nrow(mydata)){
  if (mydata$bird[i] == mydata$bird[i-1] && mydata$rep[i] == mydata$rep[i-1] && mydata$direction[i] == mydata$direction[i-1]){
    event[i] <- event[i-1]
  } else {
    event[i] <- event[i-1]+1
  }
}
mydata$event <- event # put the newly-generated event vector back into the dataframe
proc.time() - ptm # less than 8 seconds
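
To see the difference in isolation, here is a small sketch that writes the same values one element at a time into a dataframe column and into a plain vector. The names `n`, `df`, `v`, `t_df` and `t_vec` are made up for this illustration, and the timings will depend on your machine and R version, so treat the numbers as indicative only.

```r
# Hypothetical micro-benchmark: element-by-element assignment into a
# dataframe column versus into a plain vector.
n <- 20000
df <- data.frame(x = numeric(n))
v <- numeric(n)

# assign into the dataframe column inside the loop
t_df <- system.time(
  for (i in seq_len(n)) df$x[i] <- i
)

# assign into a standalone vector inside the loop
t_vec <- system.time(
  for (i in seq_len(n)) v[i] <- i
)

# both approaches store identical values; only the elapsed time differs
identical(df$x, v)
c(dataframe = t_df[["elapsed"]], vector = t_vec[["elapsed"]])
```

system.time() is used here as a shorthand for the proc.time() stopwatch pattern: it runs the expression and returns the time difference directly.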

One last change, with a major effect:

# rather than perform a logical test on the dataframe with each iteration, we can generate a vector that does this outside of the dataframe instead:
event <- rep(1, 100000)
id <- paste(mydata$bird, mydata$rep, mydata$direction, sep='_')
idprev <- c(NA, id[1:99999]) # id shifted down by one row, so idprev[i] is id[i-1]
cbind(id,idprev)[1:20,]
logi <- id == idprev # now we have a logical test of whether id matches the previous id value, in vector format
ptm <- proc.time()
for(i in 2:nrow(mydata)){
  if (logi[i]){
    event[i] <- event[i-1]
  } else {
    event[i] <- event[i-1]+1
  }
}
mydata$event <- event
proc.time() - ptm # down to a quarter of a second! that is 100s of times faster than the original version.
# for a really big dataset, this would be like going from an hour to a few seconds
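
For this particular labelling rule, where the event number only ever goes up by one when the id changes, the loop can arguably be dropped altogether by flagging the changes and taking a running total with cumsum(). The sketch below checks that idea against the loop on a small made-up dataset; `toy`, `event_loop`, `changed` and `event_vec` are hypothetical names, and rules that depend on more than a simple running count may still need a loop.

```r
# Toy data (hypothetical): one bird, two reps, four frames each
toy <- data.frame(
  bird      = c(1, 1, 1, 1, 1, 1, 1, 1),
  rep       = c(1, 1, 1, 1, 2, 2, 2, 2),
  direction = c(0, 0, 1, 1, 1, 0, 0, 0)
)
id <- paste(toy$bird, toy$rep, toy$direction, sep = "_")

# loop version: same rule as the loops above, applied to the id vector
event_loop <- rep(1, nrow(toy))
for (i in 2:nrow(toy)) {
  if (id[i] == id[i - 1]) {
    event_loop[i] <- event_loop[i - 1]
  } else {
    event_loop[i] <- event_loop[i - 1] + 1
  }
}

# vectorized version: flag each row whose id differs from the previous row,
# then take a running total of the flags
changed <- c(FALSE, id[-1] != id[-length(id)])
event_vec <- 1 + cumsum(changed)

event_vec                      # 1 1 2 2 3 4 4 4
all(event_loop == event_vec)   # TRUE
```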

So that’s how to loop more efficiently when you have to. Bottom line: indexing is faster on vectors than on dataframes.

Credit to this post and Benny Goller for figuring out that print() can be slow.

From January 26, 2015