Merging dataframes in R on a pre-sorted column?

I usually work with big dataframes that are pretty well sorted (or can be easily sorted).
Given two dataframes, both sorted by 'user'
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
When I run m = merge(some.data, user), I receive the result as:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
And this is fine.
But merge doesn't take advantage of these dataframes being sorted on the common column, which makes the merge pretty CPU/memory heavy even though it could be done in O(n).
I am wondering if there is a way in R to conduct an efficient merge on sorted datasets?

I don't have any experience with it, but as far as I know, this is one of the issues that the data.table package was designed to improve.
For most practical purposes, data.table = data.frame + index. As a consequence, when used right, this improves the performance of quite a few large operations.
There is a danger that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once you've set it up, functions like merge() can use the index for better performance.
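A minimal sketch of what that might look like, reusing the table and column names from the question:
library(data.table)
# Convert the data.frames and set the join key; setkey() sorts by 'user'
# and stores that ordering with the table.
some.data <- as.data.table(some.data)
user      <- as.data.table(user)
setkey(some.data, user)
setkey(user, user)
# The keyed merge can then use a sorted (binary-search) join instead of
# the generic matching that base merge() performs.
m <- merge(some.data, user, by = "user")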

If your set of common keys/indexes is totally overlapping, that is...
Reduce(`&`, user$user.id %in% some.data$user.id)
...returns TRUE, and they are, as you said, sorted, and there are no key duplicates, then your merging problem is reduced to adding columns to a data.frame. Something along the lines of...
library(log4r)
# my.logger is assumed to have been created beforehand, e.g.:
# my.logger <- create.logger(logfile = 'timings.log', level = 'INFO')
t1 <- system.time(z <- merge(user, some.data, by='user.id'))
info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))
t2 <- Sys.time()
r <- data.frame(user.id=user$user.id, V1.x=user$V1, V2.x=user$V2)
r[,names(some.data)] <- some.data[,names(some.data)]
t3 <- Sys.time()
info(my.logger, paste('Elapsed time without:', t3-t2))
If the assumptions above do not hold, then it gets slightly messier (set union of both key sets, a translation function, NA padding), but the sorting and overlap assumptions alone get you a long way ahead.
Notice also that the timing of the second approach is biased, since it calls Sys.time() twice, unlike the merge() timing, which calls system.time() only once.
(Excuse my lame usage of S.O. mark-up)

Related

Am I using the most efficient (or right) R instructions?

First question, so I'll try to go straight to the point.
I'm currently working with tables and I've chosen R because it has no limit on dataframe sizes and can perform several operations over the data within the tables. I am happy with that, as I can manipulate the data at will; merges, concats, and row and column manipulation all work fine. But I recently had to run a loop at roughly 0.00001 sec/instruction over a table with 6 million rows and it took over an hour.
Maybe my approach to R was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignments instead of c(list, new_element)). But since, as far as I can tell, this is not something you can optimize with an algorithm like graphs or heaps (it's just tables; you have to iterate through them all), I was wondering if there might be other instructions or basic ways of working with tables that I don't know (assign, extract...) that take less time, or some RStudio configuration, to improve performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list", nrow(table[,"Date_of_count"]))
for(i in 1:nrow(table[,"Date_of_count"])){
  my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"] %>% pull(1), "%Y-%m-%d")),
                         format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. I want the list to be filled so that I can append it to the table as a column once finished.
Please let me know if the question lacks specificity or concreteness, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a dataframe and followed the key points above.
df <- as.data.frame(table)
In this case, doing so converted the dates directly to character, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
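For comparison, the loop above can usually be replaced by a single vectorized call over the whole column; a sketch, assuming Date_of_count holds character dates in %Y-%m-%d format as in the question:
df <- as.data.frame(table)
# Parse and re-render the whole column in two vectorized passes instead of
# six million single-row conversions; this reproduces the loop's round-trip.
df$Date_of_count <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"),
                           format = "%Y-%m-%d")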

Optimized way of merging ~200k large [many columns but only one row] data frames with differing column names

I have a large number (over 200,000) of separate files, each of which contains a single row and many columns (sometimes upwards of several hundred columns). There is one column (id) in common across all of the files. Otherwise, the column names are semi-random, with incomplete overlap between dataframes. Currently, I am using %in% to determine which columns are in common and then merging on those columns. This works perfectly well, although I am confident (maybe hopeful is a better word) that it could be done faster.
For example:
dfone<-data.frame(id="12fgh",fred="1",wilma="2",barney="1")
dftwo<-data.frame(id="36fdl",fred="5",daphne="3")
common<-names(dfone)[names(dfone) %in% names(dftwo)]
merged<-merge(dfone,dftwo,by=common,all=TRUE)
So, since I'm reading a large number of files, here is what I'm doing right now:
fls <- list.files()
first <- fls[1]
merged <- read.csv(first)
for (fl in fls) {
  dffl <- read.csv(fl)
  common <- names(dffl)[names(dffl) %in% names(merged)]
  merged <- merge(dffl, merged, by=common, all=TRUE)
  # print(paste(nrow(merged), " rows in dataframe.", sep=""))
  # flush.console()
  # print(paste("Just did ", fl, ".", sep=""))
  # flush.console()
}
Obviously, the commented-out section is just a way of keeping track of it as it runs. Which it does, although profoundly slowly, and it runs ever more slowly as it assembles the data frame.
(1) I am confident that a loop isn't the right way to do this, but I can't figure out a way to vectorize this
(2) My hope is that there is some way to do the merge that I'm missing that doesn't involve my column name comparison kludge
(3) All of which is to say that this is running way too slowly to be viable
Any thought on how to optimize this mess? Thank you very much in advance.
A much shorter and cleaner approach is to read them all into a list and then merge.
Reduce(function(x, y) merge(x, y, all = TRUE), lapply(list.files(), read.csv))
It will still be slow though. You could speed it up by replacing read.csv with something faster (e.g., data.table::fread) and possibly by replacing lapply with parallel::mclapply.
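A sketch of that suggestion (mclapply parallelizes by forking, so it only helps on Unix-alikes, and the mc.cores value here is purely illustrative):
library(data.table)
library(parallel)
# Read every one-row file with the faster fread(), in parallel where possible,
# then fold the resulting list together with merge(all = TRUE) so that
# non-shared columns are kept and padded with NA.
dfs    <- mclapply(list.files(), fread, mc.cores = 4)  # adjust mc.cores for your machine
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), dfs)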

Reduce computation time

Most of the data sets that I have worked with have generally been of moderate size (mostly less than 100k rows), and hence my code's execution time has usually not been that big a problem for me.
But I was recently trying to write a function that takes 2 dataframes as arguments (with, say, m and n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So, even with small values of m and n (say around 1,000 each), the resulting dataframe would have more than a million rows.
When I try even simple operations on this dataset, the code takes an intolerably long time to run. Specifically, my resulting dataframe has 2 columns with numeric values, and I need to add a new column which compares the values of these columns and categorizes them as "Greater than", "Less than", or "Tied".
I am using the following code:
df %>% mutate(compare = ifelse(var1 == var2, "Tied",
                        ifelse(var1 > var2, "Greater than", "Less than")))
And, as I mentioned before, this takes forever to run. I did some research on this, and I figured out that operations on a data.table are apparently significantly faster than on a data.frame, so maybe that's one option I can try.
But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
What other options do you think I can try?
Thanks!
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.
library(doParallel)
library(foreach)
registerDoParallel()  # You could specify the number of cores to use here. See the documentation.
df$compare <- foreach(m=df$var1, n=df$var2, .combine='c') %dopar% {
  # Borrowing from @nicola in the comments because it's a good solution.
  c('Less Than', 'Tied', 'Greater Than')[sign(m-n)+2]
}
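For reference, the indexing trick in the loop body is itself vectorized, so it can also be applied to the whole columns at once without foreach; a sketch using the var1/var2 column names from the question:
# sign(var1 - var2) is -1, 0 or 1; adding 2 maps it to positions 1:3 of the
# label vector, computed for every row in a single vectorized step.
df$compare <- c("Less than", "Tied", "Greater than")[sign(df$var1 - df$var2) + 2]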

R: Why does it take so long to parse this data table

I have a data frame df that has 15 columns and 1000000 rows of all ints. My code is:
for(i in 1:nrow(df))
{
  if(is.null(df$col1[i]) || .... || is.null(df$col9[i]))
    df[-i,]   # to delete the row if one of those columns is null
}
This has been running for an hour and is still going. Why? It seems like it should be relatively fast code to run. How can I speed it up?
The reason it is slow is that R is relatively slow at looping through vectors. Most functions in R are vectorized, which means you can perform them on a whole vector at once, much faster than looping through each element one by one. On a side note, I don't think you have NULLs in your data frame; I think you have NAs, so I'm going to assume that is what you have. Even if you do have NULLs, the following should still work.
This syntax should give you a nice speed boost.
This will take advantage of rowSums producing NA for every row that has missing values in it.
df<-subset(df, !is.na(rowSums(df[,1:10])))
This syntax should also work.
df<-df[rowSums(is.na(df[,1:10]))==0,]
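A base R alternative worth knowing (not part of the original answer) is complete.cases(), which flags the rows with no missing values and does not require the columns to be numeric:
# Keep only the rows where none of the first ten columns is NA.
df <- df[complete.cases(df[, 1:10]), ]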

In R, how do you loop over the rows of a data frame really fast?

Suppose that you have a data frame with many rows and many columns.
The columns have names. You want to access rows by number, and columns by name.
For example, one (possibly slow) way to loop over the rows is
for (i in 1:nrow(df)) {
  print(df[i, "column1"])
  # do more things with the data frame...
}
Another way is to create "lists" for separate columns (like column1_list = df[["column1"]]) and access the lists in one loop. This approach might be fast, but it is also inconvenient if you want to access many columns.
Is there a fast way of looping over the rows of a data frame? Is some other data structure better for looping fast?
I think I need to make this a full answer because I find comments harder to track and I already lost one comment on this... There is an example by nullglob that demonstrates the differences among for and the apply family of functions much better than other examples. When one makes the function very slow, that's where all the speed is consumed and you won't find differences among the variations on looping. But when you make the function trivial, you can see how much the looping itself influences things.
I'd also like to add that some members of the apply family that are unexplored in other examples have interesting performance properties. First, I'll show replications of nullglob's relative results on my machine.
n <- 1e6
sinI <- numeric(n)   # pre-allocate the result vector
system.time(for(i in 1:n) sinI[i] <- sin(i))
user system elapsed
5.721 0.028 5.712
lapply runs much faster for the same result
system.time(sinI <- lapply(1:n,sin))
user system elapsed
1.353 0.012 1.361
He also found sapply much slower. Here are some others that weren't tested.
Plain old apply to a matrix version of the data...
mat <- matrix(1:n, ncol = 1)
system.time(sinI <- apply(mat,1,sin))
user system elapsed
8.478 0.116 8.531
So, the apply() command itself is substantially slower than the for loop. (The for loop is not slowed down appreciably if I use sin(mat[i,1]) instead.)
Another one that doesn't seem to be tested in other posts is tapply.
system.time(sinI <- tapply(1:n, 1:n, sin))
user system elapsed
12.908 0.266 13.589
Of course, one would never use tapply this way, and its utility is far beyond any such speed problem in most cases.
The fastest way is to not loop at all (i.e. use vectorized operations). One of the few instances in which you need to loop is when there are dependencies (i.e. one iteration depends on another). Otherwise, try to do as much vectorized computation outside the loop as possible.
If you do need to loop, then using a for loop is essentially as fast as anything else (lapply can be a little faster, but other apply functions tend to be around the same speed as for).
Exploiting the fact that data.frames are essentially lists of column vectors, one can use do.call to apply a function whose arity matches the number of columns over the columns of the data.frame (similar to "zipping" over lists in other languages).
do.call(paste, data.frame(x=c(1,2), y=c("a","b"), z=c(5,6)))
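When the row-wise function is not vectorized, the same column-zipping idea works with Map(), which walks the column vectors in lock-step and still avoids df[i, ] indexing; a small made-up example:
df <- data.frame(a = c(1, 5, 3), b = c(4, 2, 3))
# Map() calls the function once per row with the paired column elements;
# unlist() collapses the resulting list back into a plain character vector.
df$larger <- unlist(Map(function(a, b) {
  if (a > b) "a" else if (a < b) "b" else "tie"
}, df$a, df$b))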
