In R, how do you loop over the rows of a data frame really fast? - r

Suppose that you have a data frame with many rows and many columns.
The columns have names. You want to access rows by number, and columns by name.
For example, one (possibly slow) way to loop over the rows is
for (i in 1:nrow(df)) {
print(df[i, "column1"])
# do more things with the data frame...
}
Another way is to create "lists" for separate columns (like column1_list = df[["column1"]), and access the lists in one loop. This approach might be fast, but also inconvenient if you want to access many columns.
Is there a fast way of looping over the rows of a data frame? Is some other data structure better for looping fast?

I think I need to make this a full answer because I find comments harder to track and I already lost one comment on this... There is an example by nullglob that demonstrates the differences among for, and apply family functions much better than other examples. When one makes the function such that it is very slow then that's where all the speed is consumed and you won't find differences among the variations on looping. But when you make the function trivial then you can see how much the looping influences things.
I'd also like to add that some members of the apply family unexplored in other examples have interesting performance properties. First I'll show replications of nullglob's relative results on my machine.
n <- 1e6
system.time(for(i in 1:n) sinI[i] <- sin(i))
user system elapsed
5.721 0.028 5.712
lapply runs much faster for the same result
system.time(sinI <- lapply(1:n,sin))
user system elapsed
1.353 0.012 1.361
He also found sapply much slower. Here are some others that weren't tested.
Plain old apply to a matrix version of the data...
mat <- matrix(1:n,ncol =1),1,sin)
system.time(sinI <- apply(mat,1,sin))
user system elapsed
8.478 0.116 8.531
So, the apply() command itself is substantially slower than the for loop. (for loop is not slowed down appreciably if I use sin(mat[i,1]).
Another one that doesn't seem to be tested in other posts is tapply.
system.time(sinI <- tapply(1:n, 1:n, sin))
user system elapsed
12.908 0.266 13.589
Of course, one would never use tapply this way and it's utility is far beyond any such speed problem in most cases.

The fastest way is to not loop (i.e. vectorized operations). One of the only instances in which you need to loop is when there are dependencies (i.e. one iteration depends on another). Otherwise, try to do as much vectorized computation outside the loop as possible.
If you do need to loop, then using a for loop is essentially as fast as anything else (lapply can be a little faster, but other apply functions tend to be around the same speed as for).

Exploiting the fact that data.frames are essentially lists of column vectors, one can use do.call to apply a function with the arity of the number of columns over each column of the data.frame (similar to a "zipping" over a list in other languages).
do.call(paste, data.frame(x=c(1,2), z=c("a","b"), z=c(5,6)))

Related

Am I using the most efficient (or right) R instructions?

first question, I'll try to go straight to the point.
I'm currently working with tables and I've chosen R because it has no limit with dataframe sizes and can perform several operations over the data within the tables. I am happy with that, as I can manipulate it at my will, merges, concats and row and column manipulation works fine; but I recently had to run a loop with 0.00001 sec/instruction over a 6 Mill table row and it took over an hour.
Maybe the approach of R was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignments instead of c(list,new_element)) but, since as far as I can tell, this is not something that you can optimize with some sort of algorithm like graphs or heaps (is just tables, you have to iterate through it all) I was wondering if there might be some other instructions or other basic ways to work with tables that I don't know (assign, extract...) that take less time, or configuration over RStudio to improve performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list",nrow(table[,"Date_of_count"]))
for(i in 1:nrow(table[,"Date_of_count"])){
my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"]%>%pull(1),"%Y-%m-%d")),format = "%Y-%m-%d")
}
The table, as aforementioned, has over 6 Mill rows and 25 variables. I want the list to be filled to append it to the table as a column once finished.
Please let me know if it lacks specificity or concretion, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, realized, had some tibbles inside) into a dataframe and followed the aforementioned keys.
df <- as.data.frame(table)
In this case, by doing this the dates were converted directly to character so I did not have to apply any more conversions.
New execution time over 6 Mill rows: 25.25 sec.

Optimized way of merging ~200k large [many columns but only one row] data frames with differing column names

I have a large number (over 200,000) separate files, each of which contains a single row and many columns (sometimes upwards of several hundred columns). There is one column (id) in common across all of the files. Otherwise, the column names are semi-random, with incomplete overlap between dataframes. Currently, I am using %in% to determine which columns are in common and then merging using those columns. This works perfectly well, although I am confident (maybe hopeful is a better word) that it could be done faster.
For example:
dfone<-data.frame(id="12fgh",fred="1",wilma="2",barney="1")
dftwo<-data.frame(id="36fdl",fred="5",daphne="3")
common<-names(dfone)[names(dfone) %in% names(dftwo)]
merged<-merge(dfone,dftwo,by=common,all=TRUE)
So, since I'm reading a large number of files, here is what I'm doing right now:
fls<-list.files()
first<-fls[1]
merged<-read.csv(first)
for (fl in fls) {
dffl<-read.csv(fl)
common<-names(dffl)[names(dffl) %in% names(merged)]
merged<-merge(dffl,merged,by=common,all=TRUE)
# print(paste(nrow(merged)," rows in dataframe.",sep=""))
# flush.console()
# print(paste("Just did ",fl,".",sep=""))
# flush.console()
}
Obviously, the commented out section is just a way of keeping track of it as it's running. Which it is, although profoundly slowly, and it runs ever more slowly as it is assembling the data frame.
(1) I am confident that a loop isn't the right way to do this, but I can't figure out a way to vectorize this
(2) My hope is that there is some way to do the merge that I'm missing that doesn't involve my column name comparison kludge
(3) All of which is to say that this is running way too slowly to be viable
Any thought on how to optimize this mess? Thank you very much in advance.
A much shorter and cleaner approach is to read them all into a list and then merge.
do.call(merge, lapply(list.files(), read.csv))
It will still be slow though. You could speed it up by replacing read.csv with something faster (e.g., data.table::fread) and possibly by replacing lapply with parallel::mclapply.

Reduce computation time

Most of the data sets that I have worked with has generally been of moderate size (mostly less than 100k rows) and hence my code's execution time has usually not been that big a problem for me.
But I was recently trying to write a function that takes 2 dataframes as arguments (with, say, m & n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So, even with small values of m & n (say around 1000 each ) the resulting dataframe would have more than a million rows.
When I try even simple operations on this dataset, the code takes an intolerably long time to run. Specifically, my resulting dataframe has 2 columns with numeric values and I need to add a new column which will compare the values of these columns and categorize them as - "Greater than", "less than", "Tied"
I am using the following code:
df %>% mutate(compare=ifelse(var1==var2,"tied",
ifelse(var1>var2,"Greater than","lesser then")
And, as I mentioned before, this takes forever to run. I did some research on this, and I figured out that apparently operations on data.table is significantly faster than dataframe, so maybe that's one option I can try.
But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
What other options do you think I can try?
Thanks!
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarassingly parallel" situation.
library(doParallel)
library(foreach)
registerDoParallel() #You could specify the number of cores to use here. See the documentation.
df$compare <- foreach(m=df$m, n=df$n, .combine='c') %dopar% {
#Borrowing from #nicola in the comments because it's a good solution.
c('Less Than', 'Tied', 'Greater Than')[sign(m-n)+2]
}

Yet another apply Questions

I am totally convinced that an efficient R programm should avoid using loops whenever possible and instead should use the big family of the apply functions.
But this cannot happen without pain.
For example I face with a problem whose solution involves a sum in the applied function, as a result the list of results is reduced to a single value, which is not what I want.
To be concrete I will try to simplify my problem
assume N =100
sapply(list(1:N), function(n) (
choose(n,(floor(n/2)+1):n) *
eps^((floor(n/2)+1):n) *
(1- eps)^(n-((floor(n/2)+1):n))))
As you can see the function inside cause length of the built vector to explode
whereas using the sum inside would collapse everything to single value
sapply(list(1:N), function(n) (
choose(n,(floor(n/2)+1):n) *
eps^((floor(n/2)+1):n) *
(1- eps)^(n-((floor(n/2)+1):n))))
What I would like to have is a the list of degree of N.
so what do you think? how can I repair it?
Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:
For loops are not incredibly slow. For loops are incredibly slow when used improperly because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in a field has a tiny cost - but expanding the /length/ of the vector is fairly costly because what you're actually doing is creating an entirely new object, finding space for that object, copying the name over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
But: there are ways to optimise a for loop and make them run quickly. The easiest guidelines are:
Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference.
If you are using a vector and can know the length of the output in advance, it will work a lot faster. Specify the size of the output object before writing to it.
Do not twist yourself into knots avoiding for loops.
So in this case - if you're only producing a single value for each thing in N, you could make that work perfectly nicely with a vector:
#Create output object. We're specifying the length in advance so that writing to
#it is cheap
output <- numeric(length = length(N))
#Start the for loop
for(i in seq_along(output)){
output[i] <- your_computations_go_here(N[i])
}
This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.

Merging dataframes in R on a pre-sorted column?

I usually work with big dataframes that are pretty well sorted (or can be easily sorted).
Given two dataframes, both sorted by 'user'
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
And I run m = merge(some.data,user), I receive the result as:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
And this is fine so.
But merge doesn't take advantage of these dataframes being sorted on the common column making the merge pretty CPU/memory heavy. However, this merge could be done in O(n)
I am wondering if there is a way in R to conduct an efficient merge on sorted datasets?
I don't have any experience with it, but as far as I know, this is one of the issues that package data.tablewas designed to improve.
For most practical purposes, data.table=data.frame + index. As a consequence, when used right, this improves performance of quite a few large operations.
There is a danger that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once you've got it up, functions like merge can easily use the index for better performance.
If your set of common keys/indexes is totally overlapping, that is...
Reduce(`&`, user$user.id %in% some.data$user.id)
...returns TRUE and they are, as you said, sorted,and there are no key duplicates then your merging problem is reduced to adding columns to a data.frame. Something in the lines along...
library(log4r)
t1 <- system.time(z <- merge(user, some.data, by='user.id'))
info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))
t2 <- Sys.time()
r <- data.frame(user.id=user$user.id, V1.x=user$V1, V2.x=user$V2)
r[,names(some.data)] <- some.data[,names(some.data)
t3 <- Sys.time()
info(my.logger, paste('Elapsed time without:', t3-t2))
If the assumptions above do not hold, then it gets slightly messier set union of both key sets, translation function, NA padding) but the merging and overlapping assumption alone gets you a long way ahead.
Notice also that the timing of the seconds approach is biased since it's calling twice Sys.time() unlike the merge() timing which calls system.time() and only once.
(Excuse my lame usage of S.O. mark-up)

Resources