Speed up merging many dataframes in R - r

I am currently using the following code to merge >130 data frames, and it takes many hours to run (I have never actually seen it complete on the full dataset, only on subsets). Each table has two columns: unit (string) and counts (integer). I am merging by unit.
tables <- lapply(files, function(x) read.table(x, col.names = c("unit", x)))
MyMerge <- function(x, y){
df <- merge(x, y, by="unit", all.x= TRUE, all.y= TRUE)
return(df)
}
data <- Reduce(MyMerge, tables)
Is there any way to speed this up easily? Each table/data frame has around 500,000 rows, and many of those are unique to that table. Merging multiple tables therefore quickly pushes the row count of the merged data frame into the millions.
At the end, I will drop rows with too-low summary counts from the big merged table, but I don't want to do that during merging, because then the order of my files would matter.

Here is a small comparison, first with a rather small dataset, then with a larger one:
library(data.table)
library(plyr)
library(dplyr)
library(microbenchmark)
# sample size:
n = 4e3
# create some data.frames:
df_list <- lapply(1:100, function(x) {
out <- data.frame(id = c(1:n),
type = sample(c("coffee", "americano", "espresso"),n, replace=T))
names(out)[2] <- paste0(names(out)[2], x)
out})
# transform dfs into data.tables:
dt_list <- lapply(df_list, function(x) {
out <- as.data.table(x)
setkey(out, "id")
out
})
# set options to outer join for all methods:
mymerge <- function(...) base::merge(..., by="id", all=T)
mydplyr <- function(...) dplyr::full_join(..., by="id")
myplyr <- function(...) plyr::join(..., by="id", type="full")
mydt <- function(...) merge(..., by="id", all=T)
# Compare:
microbenchmark(base = Reduce(mymerge, df_list),
dplyr= Reduce(mydplyr, df_list),
plyr = Reduce(myplyr, df_list),
dt = Reduce(mydt, dt_list), times=50)
This gives the following results:
Unit: milliseconds
expr min lq mean median uq max neval cld
base 944.0048 956.9049 974.8875 962.9884 977.6824 1221.5301 50 c
dplyr 316.5211 322.2476 329.6281 326.9907 332.6721 381.6222 50 a
plyr 2682.9981 2754.3139 2788.7470 2773.8958 2812.5717 3003.2481 50 d
dt 537.2613 554.3957 570.8851 560.5323 572.5592 757.6631 50 b
We can see that the two contenders are dplyr and data.table. Changing the sample size to 5e5 yields the following comparison, showing that data.table is clearly faster at that size. Note that I added this part after @BenBolker's suggestion.
microbenchmark(dplyr= Reduce(mydplyr, df_list),
dt = Reduce(mydt, dt_list), times=50)
Unit: seconds
expr min lq mean median uq max neval cld
dplyr 34.48993 34.85559 35.29132 35.11741 35.66051 36.66748 50 b
dt 10.89544 11.32318 11.61326 11.54414 11.87338 12.77235 50 a
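For the original problem (one counts column per file, keyed by unit), a different route than repeated pairwise joins is to stack all tables into one long table and reshape it once. This is a swapped-in technique, not part of the benchmark above, and only a minimal sketch: the two toy tables and the list names a and b are made up for illustration, and it assumes each unit appears at most once per table.

```r
library(data.table)

# Toy stand-ins for the per-file tables (hypothetical data).
tables <- list(
  a = data.table(unit = c("u1", "u2"), counts = c(1L, 2L)),
  b = data.table(unit = c("u2", "u3"), counts = c(5L, 7L))
)

# Stack everything into one long table; "file" records the source list name.
long <- rbindlist(tables, idcol = "file")

# Reshape once: one row per unit, one column per file, NA where a unit is absent.
wide <- dcast(long, unit ~ file, value.var = "counts")
print(wide)
```

Because the reshape happens in a single step, the result does not depend on the order of the files, and low-count rows can be filtered on the long table before reshaping if memory is tight.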

Related

How do i subtract 2 columns in a .csv file in R?

How do I subtract 2 columns in a .csv file uploaded in R?
I have named the new column using reading <- $started_time- $ended_time
Since you did not post any example data, here is an example based on the built-in iris dataset. You can simply use - to subtract vectors of the same length (if the lengths differ, the shorter vector is recycled).
You can select a column from your dataset with the $ operator or with the [] operator:
data(iris)
#assigning the result to a new column
iris$subtraction <- iris$Sepal.Length-iris$Sepal.Width
iris$subtraction <- iris[,1]-iris[,2]
#assigning the result to a new variable
subtraction <- iris[,1]-iris[,2]
subtraction <- iris$Sepal.Length-iris$Sepal.Width
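Applied to the column names from the question, the same idea might look like the sketch below. The data frame here is a made-up stand-in for the uploaded CSV, and the new column name duration is an assumption; the last line illustrates the recycling behaviour mentioned above.

```r
# Hypothetical stand-in for the data read from the .csv file.
reading <- data.frame(
  started_time = c(10, 20, 30),
  ended_time   = c(15, 28, 31)
)

# Subtract one column from the other and store the result as a new column.
reading$duration <- reading$ended_time - reading$started_time
print(reading$duration)

# Recycling: the shorter vector c(1, 2) is reused as 1, 2, 1, 2.
recycled <- c(10, 20, 30, 40) - c(1, 2)
print(recycled)
```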
EDIT
A microbenchmark of 3 equivalent solutions:
library(microbenchmark)
library(data.table)
library(dplyr)
library(ggplot2)
#prepare simulation ------------------------------------------------------------
#number of rows to be tested
nr <- seq(100000,10000000,100000)
#initialize a list to store results
time <- as.list(rep(NA,100))
#benchmark
for (i in 1:length(nr)) {
set.seed(5)
#create data
df <- data.frame(x=rnorm(nr[i]),y=rnorm(nr[i]))
dt <- data.table(x=rnorm(nr[i]),y=rnorm(nr[i]))
#benchmark
x <- print(microbenchmark(
base=df$new.col <- df$x-df$y,
DT=dt <- dt[,new.col:=x-y],
dplyr=df %>% mutate(new.col=x-y),
times = 10
))
#store results
time[[i]] <- x[,c(1,4)]
}
#discard the first 4 elements because they run in microseconds
bench <- do.call(rbind,time[5:100])
#add the number of rows as column
bench$nrow <- rep(nr[5:100],each=3)
ggplot(bench,aes(x=nrow,y=mean,group=expr,col=expr))+
geom_smooth(se=F)+
theme_minimal()+
xlab("# rows")+
ylab("time (milliseconds)")
As you can see, for this simple task the base and data.table solutions are equivalent, while the mutate solution is a bit slower. However, the entire simulation runs in about a minute, and each single operation takes a few milliseconds.
My PC has 16 GB of RAM and 12 cores.
EDIT
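After the OP asked for a Date case, here is a small example with dates of class POSIXct:
library(chron) # needed for chron()
day <- Sys.Date()
# assumption: last7days is not defined in the original; taken here as the last 7 days
last7days <- day - 0:6
hm <- merge(0:23, seq(0, 45, by = 15))
datetime <- merge(last7days, chron(time = paste(hm$x, ':', hm$y, ':', 0)))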
colnames(datetime) <- c('date', 'time')
# create datetime
dt <- as.POSIXct(paste(datetime$date, datetime$time))
df <- data.frame(x=sample(dt,200000,replace = T),y=sample(dt,200000,replace = T))
microbenchmark(df$x-df$y)
The operation runs in a few milliseconds, as expected:
Unit: milliseconds
expr min lq mean median uq max neval
df$x - df$y 1.459801 1.544301 2.755227 1.624501 1.845401 62.7416 100

Merge nested lists into a data frame maintaining higher order list structure

I have a list composed of nested lists. Each of these nested lists contains data frames that share the same columns. I want to merge the data frames within each nested list, maintaining the higher-order list.
I've tried doing this with lapply and do.call, but it's taking far too long. Indeed, I'm getting the following error:
Error: vector memory exhausted (limit reached?)
my.list <- replicate(100, replicate(10, data.frame(a = 1:5, b = 6:10), simplify = F), simplify = F)
my.list <- lapply(my.list, function(l) do.call("rbind", l))
This gives me exactly the data structure I want, but it runs far too slowly with large data.
Another option would be to use purrr::map with dplyr::bind_rows:
library(purrr)
library(dplyr)
map(my.list, bind_rows)
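To check that the outer list survives the row-binding, a smaller version of the same structure can be inspected as in the sketch below. The name small.list and the id column name source are assumptions made for illustration; .id is an existing argument of dplyr::bind_rows that records which inner data frame each row came from.

```r
library(dplyr)

# A smaller version of the nested structure: 3 outer elements, 2 data frames each.
small.list <- replicate(
  3,
  replicate(2, data.frame(a = 1:5, b = 6:10), simplify = FALSE),
  simplify = FALSE
)

# Bind the inner data frames; "source" records the position of the inner frame.
merged <- lapply(small.list, function(l) bind_rows(l, .id = "source"))

length(merged)     # the outer list keeps its 3 elements
nrow(merged[[1]])  # each element now holds 2 * 5 rows
```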
Here is a microbenchmark comparison of the different methods:
library(purrr)
library(dplyr)
library(data.table)
library(microbenchmark)
res <- microbenchmark(
lapply_do_call_rbind = {
lapply(my.list, function(l) do.call("rbind", l))
},
map_bind_rows = {
map(my.list, bind_rows)
},
lapply_rbindlist = {
lapply(my.list, rbindlist)
}
)
#Unit: milliseconds
# expr min lq mean median uq
# lapply_do_call_rbind 46.104965 49.801469 54.567249 51.815901 54.085547
# map_bind_rows 3.257474 3.490079 4.055779 3.620804 4.002505
# lapply_rbindlist 9.446331 10.009678 11.429870 10.796956 12.252741
library(ggplot2)
autoplot(res)

Why is dplyr so slow?

Like most people, I'm impressed by Hadley Wickham and what he's done for R -- so I figured that I'd move some functions toward his tidyverse ... having done so, I'm left wondering what the point of it all is.
My new dplyr functions are much slower than their base equivalents -- I hope I'm doing something wrong. I'd particularly like some payoff from the effort required to understand non-standard evaluation.
So, what am I doing wrong? Why is dplyr so slow?
An example:
require(microbenchmark)
require(dplyr)
df <- tibble(
a = 1:10,
b = c(1:5, 4:0),
c = 10:1)
addSpread_base <- function() {
df[['spread']] <- df[['a']] - df[['b']]
df
}
addSpread_dplyr <- function() df %>% mutate(spread := a - b)
all.equal(addSpread_base(), addSpread_dplyr())
microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)
Timing results:
Unit: microseconds
expr min lq mean median uq max neval
addSpread_base() 12.058 15.769 22.07805 24.58 26.435 2003.481 10000
addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000
So using dplyr functions to transform the data takes about 30x longer -- surely this isn't the intention?
I figured that perhaps this is too easy a case -- and that dplyr would really shine in a more realistic case where we add a column and subset the data -- but this was worse. As you can see from the timings below, this is ~70x slower than the base approach.
# mutate and subset
addSpreadSub_base <- function(df, col1, col2) {
df[['spread']] <- df[['a']] - df[['b']]
df[, c(col1, col2, 'spread')]
}
addSpreadSub_dplyr <- function(df, col1, col2) {
var1 <- as.name(col1)
var2 <- as.name(col2)
qq <- quo(!!var1 - !!var2)
df %>%
mutate(spread := !!qq) %>%
select(!!var1, !!var2, spread)
}
all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))
microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'),
times = 1e4)
Results:
Unit: microseconds
expr min lq mean median uq max neval
addSpreadSub_base(df, col1 = "a", col2 = "b") 22.725 30.610 44.3874 45.450 53.798 2024.35 10000
addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000
These are microseconds, and your dataset has 10 rows. Unless you plan on looping over millions of 10-row datasets, your benchmark is pretty much irrelevant (and in that case I can't imagine a situation where it wouldn't be wise to bind them together as a first step).
Let's do it with a bigger dataset, say 1 million times bigger:
df <- tibble(
a = 1:10,
b = c(1:5, 4:0),
c = 10:1)
df2 <- bind_rows(replicate(1000000,df,F))
addSpread_base <- function(df) {
df[['spread']] <- df[['a']] - df[['b']]
df
}
addSpread_dplyr <- function(df) df %>% mutate(spread = a - b)
microbenchmark::microbenchmark(
addSpread_base(df2),
addSpread_dplyr(df2),
times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# addSpread_base(df2) 25.85584 26.93562 37.77010 32.33633 35.67604 170.6507 100 a
# addSpread_dplyr(df2) 26.91690 27.57090 38.98758 33.39769 39.79501 182.2847 100 a
Still quite fast and not much difference.
As for the "whys" of the result that you got, it's because you're using a much more complex function, so it has overheads.
Commenters have pointed out that dplyr doesn't try too hard to be fast, and maybe that's true when you compare it to data.table, where the interface is the first concern, but the authors have been working hard on speed as well. Hybrid evaluation, for example, allows (if I got it right) C code to be executed directly on grouped data when aggregating with common functions, which can be much faster than base code. Simple code, however, will always run faster with simple functions.
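One way to see the fixed-overhead argument for yourself is to time the same two functions on a tiny and on a large input side by side, as in the sketch below. The names small, big, base_spread and dplyr_spread are made up for illustration, and no timings are quoted because they depend on the machine and package versions.

```r
library(dplyr)
library(microbenchmark)

# Same columns at two sizes: 10 rows and 1,000,000 rows.
small <- tibble(a = 1:10, b = c(1:5, 4:0))
big   <- tibble(a = rep(small$a, 1e5), b = rep(small$b, 1e5))

base_spread  <- function(d) { d[["spread"]] <- d[["a"]] - d[["b"]]; d }
dplyr_spread <- function(d) mutate(d, spread = a - b)

# Both approaches compute the same column.
same_result <- identical(base_spread(big)$spread, dplyr_spread(big)$spread)

# If the dplyr cost is mostly per-call overhead, the gap that is large in
# relative terms on `small` should shrink to a small fraction on `big`.
timings <- microbenchmark(
  base_spread(small), dplyr_spread(small),
  base_spread(big),   dplyr_spread(big),
  times = 20
)
print(timings)
```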

Make List easier with For-Loop

library(xml2)
library(rvest)
datpackage <- paste0("dat",1:10)
for(i in 1:10){
assign(datpackage[i], runif(2))
}
datlist <- list(dat1, dat2, dat3, dat4, dat5, dat6, dat7, dat8, dat9, dat10)
"datlist" is what I want, but is there an easier way to make a list?
datlist2 <- for (i in 1:10) {
list(paste0("dat",i))
}
datlist3 <- list(datpackage)
I've tried datlist2 and datlist3, but neither is the same as "datlist".
What should I do when I need to make a list from thousands of objects?
We can use paste0 with mget if the objects have already been created:
datlist <- mget(paste0("dat", 1:10))
But, if we need to create a list of random uniform numbers,
datlist <- replicate(10, runif(2), simplify = FALSE)
For creating lists with random numbers I would also suggest:
datlist2 <- lapply(vector("list", 10), function(x) {runif(2)})
Benchmarking
It may be worth adding that the lapply / vector approach appears to be slightly faster:
funA <- function(x) {replicate(10, runif(2), simplify = FALSE)}
funB <- function(x) {lapply(vector("list", 10), function(x) {runif(2)})}
microbenchmark::microbenchmark(funA(), funB(), times = 1e4)
Results
Unit: microseconds
expr min lq mean median uq max neval cld
funA() 24.053 27.3305 37.98530 28.6665 34.4045 2478.510 10000 b
funB() 19.507 21.6400 30.37437 22.9235 27.0500 2547.145 10000 a
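For the "thousands of objects" case, it is usually simpler to skip assign() altogether and build the named list directly, as in the sketch below. The size n and the names dat1, dat2, ... simply mirror the question. The first two lines also show why the datlist2 attempt from the question does not work: a for loop always returns NULL.

```r
# A for loop evaluates to NULL, so nothing is collected here.
datlist2 <- for (i in 1:10) list(paste0("dat", i))
is.null(datlist2)

# Build the list directly and name its elements afterwards.
n <- 1000
datlist <- lapply(seq_len(n), function(i) runif(2))
names(datlist) <- paste0("dat", seq_len(n))

length(datlist)    # 1000 elements
datlist[["dat42"]] # access by name, no global variables needed
```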

How do I sub sample data by group efficiently?

I have a problem similar to the one explained in this question. As in that question, I have a data frame with 3 columns (id, group, value). I want to take n samples with replacement from each group and produce a smaller data frame with n samples from each group.
However, I am doing hundreds of subsamples in simulation code, and the solution based on ddply is too slow to be used there. I tried writing some simple code of my own to see if I could get better performance, but it is still slow (no better than the ddply solution, if not worse). Below is my code. I am wondering if it can be improved for performance.
#Producing example DataFrame
dfsize <- 10
groupsize <- 7
test.frame.1 <- data.frame(id = 1:dfsize, group = rep(1:groupsize,each = ceiling(dfsize/groupsize))[1:dfsize], junkdata = sample(1:10000, size =dfsize))
#Main function for subsampling
sample.from.group<- function(df, dfgroup, size, replace){
outputsize <- 1
newdf <-df # assuming a sample cannot be larger than the original
uniquegroups <- unique(dfgroup)
for (uniquegroup in uniquegroups){
dataforgroup <- which(dfgroup==uniquegroup)
mysubsample <- df[sample(dataforgroup, size, replace),]
sizeofsample <- nrow(mysubsample)
newdf[outputsize:(outputsize+sizeofsample-1), ] <- mysubsample
outputsize <- outputsize + sizeofsample
}
return(newdf[1:(outputsize-1),])
}
#Using the function
sample.from.group(test.frame.1, test.frame.1$group, 100, replace = TRUE)
Here are two plyr-based solutions:
library(plyr)
dfsize <- 1e4
groupsize <- 7
testdf <- data.frame(
id = seq_len(dfsize),
group = rep(1:groupsize, length = dfsize),
junkdata = sample(1:10000, size = dfsize))
sample_by_group_1 <- function(df, dfgroup, size, replace) {
ddply(df, dfgroup, function(x) {
# sample row positions within the group piece x, not the full df
x[sample(nrow(x), size = size, replace = replace), , drop = FALSE]
})
}
sample_by_group_2 <- function(df, dfgroup, size, replace) {
idx <- split_indices(df[[dfgroup]])
subs <- lapply(idx, sample, size = size, replace = replace)
df[unlist(subs, use.names = FALSE), , drop = FALSE]
}
library(microbenchmark)
microbenchmark(
ddply = sample_by_group_1(testdf, "group", 100, replace = TRUE),
plyr = sample_by_group_2(testdf, "group", 100, replace = TRUE)
)
# Unit: microseconds
# expr min lq median uq max neval
# ddply 4488 4723 5059 5360 36606 100
# plyr 443 487 507 536 31343 100
The second approach is much faster because it does the subsetting in a single step. If you can figure out how to do it in one step, that is usually an easy way to get better performance.
I think this is cleaner and possibly faster:
z <- sapply(unique(test.frame.1$group), FUN= function(x){
sample(which(test.frame.1$group==x), 100, TRUE)
})
out <- test.frame.1[z,]
out
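The single-step idea can also be written in base R only, as in the sketch below. The function name sample_by_group and the small test data are made up for illustration. It samples positions with sample.int rather than calling sample() on the indices directly, because sample(x, ...) treats a length-one numeric x as the range 1:x, which would give wrong rows for a group that has a single member.

```r
sample_by_group <- function(df, group_col, size, replace = TRUE) {
  # Row indices of df, split by group.
  idx <- split(seq_len(nrow(df)), df[[group_col]])
  # Draw `size` row indices per group; safe even for single-row groups.
  picks <- lapply(idx, function(i) {
    i[sample.int(length(i), size = size, replace = replace)]
  })
  # Subset the data frame once, using all sampled indices together.
  df[unlist(picks, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
testdf <- data.frame(
  id = 1:70,
  group = rep(1:7, length.out = 70),
  junkdata = sample(1:10000, size = 70)
)
out <- sample_by_group(testdf, "group", size = 5)
table(out$group)
```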

Resources