Optimized way of merging ~200k large (many columns but only one row) data frames with differing column names in R

I have a large number (over 200,000) of separate files, each of which contains a single row and many columns (sometimes upwards of several hundred columns). There is one column (id) in common across all of the files. Otherwise, the column names are semi-random, with incomplete overlap between dataframes. Currently, I am using %in% to determine which columns are in common and then merging on those columns. This works perfectly well, although I am confident (maybe hopeful is a better word) that it could be done faster.
For example:
dfone<-data.frame(id="12fgh",fred="1",wilma="2",barney="1")
dftwo<-data.frame(id="36fdl",fred="5",daphne="3")
common<-names(dfone)[names(dfone) %in% names(dftwo)]
merged<-merge(dfone,dftwo,by=common,all=TRUE)
So, since I'm reading a large number of files, here is what I'm doing right now:
fls <- list.files()
first <- fls[1]
merged <- read.csv(first)
for (fl in fls[-1]) {  # skip the first file, which is already loaded
  dffl <- read.csv(fl)
  common <- names(dffl)[names(dffl) %in% names(merged)]
  merged <- merge(dffl, merged, by = common, all = TRUE)
  # print(paste(nrow(merged), " rows in dataframe.", sep = ""))
  # flush.console()
  # print(paste("Just did ", fl, ".", sep = ""))
  # flush.console()
}
Obviously, the commented-out section is just a way of keeping track of progress as it runs. Which it does, although profoundly slowly, and it runs ever more slowly as the data frame is assembled.
(1) I am confident that a loop isn't the right way to do this, but I can't figure out a way to vectorize it.
(2) My hope is that there is some way to do the merge that I'm missing that doesn't involve my column-name comparison kludge.
(3) All of which is to say that this is running way too slowly to be viable.
Any thoughts on how to optimize this mess? Thank you very much in advance.

A much shorter and cleaner approach is to read them all into a list and then merge pairwise with Reduce():
Reduce(function(x, y) merge(x, y, all = TRUE), lapply(list.files(), read.csv))
It will still be slow, though. You could speed it up by replacing read.csv with something faster (e.g., data.table::fread) and possibly by replacing lapply with parallel::mclapply.
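A hedged sketch of that combination (the file pattern and core count below are assumptions, and mclapply relies on forking, so it will not parallelize on Windows):
library(data.table)  # for fread
library(parallel)    # for mclapply

# read every single-row file in parallel, then fold the results together
files <- list.files(pattern = "\\.csv$")
dfs <- mclapply(files, fread, mc.cores = 4)
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), dfs)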

Related

Am I using the most efficient (or right) R instructions?

First question; I'll try to go straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on dataframe size and can perform many operations on the data within the tables. I am happy with that, as I can manipulate the data at will; merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop at roughly 0.00001 sec per instruction over a 6-million-row table and it took over an hour.
Maybe the R approach was wrong to begin with. I have tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)), but as far as I can tell this is not something you can optimize with an algorithm like graphs or heaps (it is just tables; you have to iterate through them all). So I was wondering whether there are other instructions or basic ways of working with tables that I do not know about (assign, extract, ...) that take less time, or some RStudio configuration that would improve performance.
This is the loop, just so if it helps to understand the question:
my_list <- vector("list", nrow(table[, "Date_of_count"]))
for (i in 1:nrow(table[, "Date_of_count"])) {
  my_list[[i]] <- format(as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1), "%Y-%m-%d")), format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. I want the list to be filled so I can append it to the table as a column once finished.
Please let me know if it lacks specificity or concreteness, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a dataframe and followed the keys above.
df <- as.data.frame(table)
In this case, the dates were converted directly to character, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
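For illustration, a vectorized sketch of the original loop (assuming Date_of_count holds date strings in "%Y-%m-%d" format; the new column name is just illustrative):
df <- as.data.frame(table)
# one vectorized call over the whole column instead of millions of loop iterations
df$Date_of_count_chr <- format(as.Date(df$Date_of_count, "%Y-%m-%d"), "%Y-%m-%d")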

Reduce computation time

Most of the data sets that I have worked with have generally been of moderate size (mostly less than 100k rows), and hence my code's execution time has usually not been that big a problem for me.
But I was recently trying to write a function that takes 2 dataframes as arguments (with, say, m and n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So, even with small values of m and n (say around 1,000 each), the resulting dataframe would have more than a million rows.
When I try even simple operations on this dataset, the code takes an intolerably long time to run. Specifically, my resulting dataframe has 2 columns with numeric values and I need to add a new column that compares the values of these columns and categorizes each row as "Greater than", "Less than", or "Tied".
I am using the following code:
df %>% mutate(compare = ifelse(var1 == var2, "Tied",
                        ifelse(var1 > var2, "Greater than", "Less than")))
And, as I mentioned before, this takes forever to run. I did some research on this, and I figured out that apparently operations on data.tables are significantly faster than on dataframes, so maybe that's one option I can try.
But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
What other options do you think I can try?
Thanks!
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.
library(doParallel)
library(foreach)

registerDoParallel()  # You could specify the number of cores to use here; see the documentation.

df$compare <- foreach(m = df$var1, n = df$var2, .combine = 'c') %dopar% {
  # Borrowing from @nicola in the comments because it's a good solution.
  c('Less than', 'Tied', 'Greater than')[sign(m - n) + 2]
}
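The same indexing trick works fully vectorized as well, without foreach; a minimal sketch using the question's var1/var2 columns:
df$compare <- c('Less than', 'Tied', 'Greater than')[sign(df$var1 - df$var2) + 2]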

R: Why does it take so long to parse this data table

I have a data frame df that has 15 columns and 1,000,000 rows, all integers. My code is:
for (i in 1:nrow(df)) {
  if (is.null(df$col1[i]) || .... || is.null(df$col9[i]))
    df[-i, ]  # to delete the row if one of those columns is null
}
This has been running for an hour and is still going. Why? It seems like it should be relatively fast code to run. How can I speed it up?
The reason it is slow is that R is relatively slow at looping through vectors. Most functions in R are vectorized, which means you can perform them on a whole vector at once, much faster than looping through each element one by one. On a side note, I don't think you have NULLs in your data frame; I think you have NAs, so I'm going to assume that is what you have. Even if you do have NULLs, the following should still work.
This syntax should give you a nice speed boost.
This will take advantage of rowSums producing NA for every row that has missing values in it.
df<-subset(df, !is.na(rowSums(df[,1:10])))
This syntax should also work.
df<-df[rowSums(is.na(df[,1:10]))==0,]
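An alternative filter, offered as a sketch rather than as the original answer's method, is complete.cases(), which also works when some of those columns are not numeric:
df <- df[complete.cases(df[, 1:10]), ]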

Selecting rows in a long dataframe based on a short list

I'm sure this should be easier to do than the way I know how to do it.
I'd like to apply fields from a short dataframe back into a long one based on matching a common factor.
Example short dataframe, list of valid cases:
$ptid (factor) values 1,2,3,4,5...20
$valid 1/0 (to represent true/false; variable through ptid)
long dataframe has 15k rows, each level of $ptid will have several thousand rows
I want to apply $valid onto those rows where it is 1/true in the list above.
The way I know how to do it is to loop through each row of long dataframe, but this is horribly inelegant and also slow.
I have a niggling feeling there is a much better way with dplyr or similar, and I'd really like to learn how.
Worked this out based on the comments, thank you Colonel.
combination_dataset <- merge(short_dataframe, long_dataframe) worked (very quickly).
Thanks to those who commented.
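Since the question mentions dplyr, here is an equivalent join-based sketch (assuming the column names ptid and valid from the description above):
library(dplyr)
# keep only the valid ids from the short dataframe, then join onto the long one
combination_dataset <- long_dataframe %>%
  inner_join(filter(short_dataframe, valid == 1), by = "ptid")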

Merging dataframes in R on a pre-sorted column?

I usually work with big dataframes that are pretty well sorted (or can be easily sorted).
Given two dataframes, both sorted by 'user'
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
When I run m = merge(some.data, user), I receive the result as:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
And this is fine.
But merge doesn't take advantage of these dataframes being sorted on the common column, which makes the merge pretty CPU/memory heavy, even though it could be done in O(n).
I am wondering if there is a way in R to conduct an efficient merge on sorted datasets?
I don't have any experience with it, but as far as I know, this is one of the issues that the data.table package was designed to improve.
For most practical purposes, data.table = data.frame + index. As a consequence, when used right, this improves the performance of quite a few large operations.
There is a danger that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once you've set it up, functions like merge can easily use the index for better performance.
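A minimal keyed-merge sketch along those lines (assuming the common column is literally named user, as in the schematic above):
library(data.table)
some.dt <- as.data.table(some.data)
user.dt <- as.data.table(user)
setkey(some.dt, user)         # setkey() sorts and indexes on the 'user' column
setkey(user.dt, user)
m <- merge(some.dt, user.dt)  # with both tables keyed, the join uses the key columns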
If your set of common keys/indexes is totally overlapping, that is, if...
Reduce(`&`, user$user.id %in% some.data$user.id)
...returns TRUE, and they are, as you said, sorted, and there are no key duplicates, then your merging problem is reduced to adding columns to a data.frame. Something along the lines of...
library(log4r)
# my.logger is assumed to be a log4r logger created elsewhere

t1 <- system.time(z <- merge(user, some.data, by = 'user.id'))
info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))

t2 <- Sys.time()
r <- data.frame(user.id = user$user.id, V1.x = user$V1, V2.x = user$V2)
r[, names(some.data)] <- some.data[, names(some.data)]
t3 <- Sys.time()
info(my.logger, paste('Elapsed time without:', t3 - t2))
If the assumptions above do not hold, then it gets slightly messier (set union of both key sets, a translation function, NA padding), but the merging and overlapping assumptions alone get you a long way ahead.
Notice also that the timing of the second approach is biased, since it calls Sys.time() twice, unlike the merge() timing, which uses a single system.time() call.
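A more comparable timing would wrap the second approach in a single system.time() call as well; a sketch:
t2 <- system.time({
  r <- data.frame(user.id = user$user.id, V1.x = user$V1, V2.x = user$V2)
  r[, names(some.data)] <- some.data[, names(some.data)]
})
info(my.logger, paste('Elapsed time without merge():', t2['elapsed']))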
