Why is rbindlist "better" than rbind?

I am going through the documentation of data.table and have also noticed from some of the conversations here on SO that rbindlist is supposed to be better than rbind.
I would like to know why rbindlist is better than rbind, and in which scenarios rbindlist really excels over rbind.
Is there any advantage in terms of memory utilization?

rbindlist is an optimized version of do.call(rbind, list(...)), which is known to be slow because it uses rbind.data.frame under the hood.
Where does it really excel
Some questions that show where rbindlist shines are
Fast vectorized merge of list of data.frames by row
Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
These have benchmarks that show how fast it can be.
rbind.data.frame is slow, for a reason
rbind.data.frame does lots of checking and matches columns by name; that is, it accounts for the fact that columns may be in different orders across data.frames and lines them up accordingly. rbindlist doesn't do this kind of checking and joins columns by position.
For example:
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
Some other limitations of rbindlist
It used to struggle to deal with factors, due to a bug that has since been fixed:
rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names
see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)
rbind.data.frame rownames can be frustrating
rbindlist can handle input lists whose elements are lists, data.frames or data.tables, and will return a data.table without rownames
you can get into a muddle with rownames using do.call(rbind, list(...))
see
How to avoid renaming of rows when using rbind inside do.call?
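A minimal sketch of the difference, using a made-up one-column data.frame:
df <- data.frame(x = 1:2, row.names = c("a", "b"))
rownames(do.call(rbind, list(df, df)))
# [1] "a"  "b"  "a1" "b1"   (rbind.data.frame makes the rownames unique)
rownames(rbindlist(list(df, df)))
# [1] "1" "2" "3" "4"       (rbindlist returns a data.table; rownames are dropped)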
Memory efficiency
In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.
rbind.data.frame is implemented in R. It does lots of assigning and uses attr<- (as well as class<- and rownames<-), all of which (internally) create copies of the created data.frame.
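A rough sketch of that difference on a toy object (this only illustrates setattr, not rbindlist internals):
library(data.table)
x <- list(1L, 2L)
setattr(x, "names", c("a", "b"))   # sets the attribute by reference, no copy of x
attr(x, "class") <- "mylist"       # base R replacement function; may duplicate x internally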

By v1.9.2, rbindlist had evolved quite a bit, implementing many features including:
Choosing the highest SEXPTYPE of columns while binding - implemented in v1.9.2 closing FR #2456 and Bug #4981.
Handling factor columns properly - first implemented in v1.8.10 closing Bug #2650 and extended to binding ordered factors carefully in v1.9.2 as well, closing FR #4856 and Bug #5019.
In addition, in v1.9.2, rbind.data.table also gained a fill argument, which allows binding by filling in missing columns, implemented in R.
Now in v1.9.3, there are even more improvements on these existing features:
rbindlist gains an argument use.names, which by default is FALSE for backwards compatibility.
rbindlist also gains an argument fill, which by default is also FALSE for backwards compatibility.
These features are all implemented in C, and written carefully to not compromise in speed while adding functionalities.
Since rbindlist can now match by names and fill missing columns, rbind.data.table just calls rbindlist now. The only difference is that use.names=TRUE by default for rbind.data.table, for backwards compatibility.
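A quick sketch of how use.names and fill combine (the defaults have changed in later releases, so both are passed explicitly here):
library(data.table)
A <- data.table(a = 1:2, b = 3:4)
B <- data.table(b = 5:6, c = 7:8)
rbindlist(list(A, B), use.names = TRUE, fill = TRUE)
#     a b  c
# 1:  1 3 NA
# 2:  2 4 NA
# 3: NA 5  7
# 4: NA 6  8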
rbind.data.frame slows down quite a bit, mostly due to copies (which @mnel points out as well) that could be avoided (by moving to C). I think that's not the only reason, though. The implementation for checking/matching column names in rbind.data.frame can also get slower when there are many columns per data.frame and many such data.frames to bind (as shown in the benchmark below).
However, the fact that rbindlist lacks (or lacked) certain features (like checking factor levels or matching names) carries very little (or no) weight in explaining why it is faster than rbind.data.frame. It is faster because it was carefully implemented in C, optimised for speed and memory.
Here's a benchmark that highlights the efficient binding while matching by column names as well using rbindlist's use.names feature from v1.9.3. The data set consists of 10000 data.frames each of size 10*500.
NB: this benchmark has been updated to include a comparison to dplyr's bind_rows
library(data.table) # 1.11.5, 2018-06-02 00:09:06 UTC
library(dplyr) # 0.7.5.9000, 2018-06-12 01:41:40 UTC
set.seed(1L)
names = paste0("V", 1:500)
cols = 500L
foo <- function() {
    data = as.data.frame(setDT(lapply(1:cols, function(x) sample(10))))
    setnames(data, sample(names))
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    # equivalent to ll[[i]] <- foo(); uses a data.table internal to avoid copies
    .Call("Csetlistelt", ll, i, foo())
}
system.time(ans1 <- rbindlist(ll))
# user system elapsed
# 1.226 0.070 1.296
system.time(ans2 <- rbindlist(ll, use.names=TRUE))
# user system elapsed
# 2.635 0.129 2.772
system.time(ans3 <- do.call("rbind", ll))
# user system elapsed
# 36.932 1.628 38.594
system.time(ans4 <- bind_rows(ll))
# user system elapsed
# 48.754 0.384 49.224
identical(ans2, setDT(ans3))
# [1] TRUE
identical(ans2, setDT(ans4))
# [1] TRUE
Binding columns as such, without checking for names, took just 1.3 seconds, whereas checking column names and binding appropriately took just 1.5 seconds more. Compared to the base solution, that is 14x faster, and 18x faster than dplyr's version.

Then this probably should be considered a bug?
# let us make a very simple list here
l <- list('a' = 1, 'b' = 2, 'c' = 3)
l
$a
[1] 1
$b
[1] 2
$c
[1] 3
# check that it is a list
class(l)
#[1] "list"
typeof(l)
#[1] "list"
And rbind can handle it without any problems
do.call('rbind', l)
# [,1]
# a 1
# b 2
# c 3
But when using rbindlist, one gets this:
rbindlist(l)
Error in rbindlist(l) :
Item 1 of input is not a data.frame, data.table or list
The error message is more than confusing, since we checked above that the input is a list, didn't we?
Is the correct application of the function for this simplest of cases documented somewhere?
Any hints appreciated. Since I was expecting similar or the same results as with do.call('rbind', l), I am a bit confused about why the data.table function seems to transpose the result of the equivalent call on the same list when I work around that "wrong class" error, e.g. by doing this:
rbindlist(list(l))
# will result in
a b c
1: 1 2 3
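A hedged note on why: rbindlist treats each element of its input list as one table-like set of columns (a list or data.frame), not as one row. So a bare list of scalars is rejected, and list(l) is read as a single one-row table with columns a, b and c. To get one row per element, as do.call('rbind', l) produces, each value can be wrapped as its own single-column row (V1 is just a made-up column name):
rbindlist(lapply(l, function(x) list(V1 = x)))
#    V1
# 1:  1
# 2:  2
# 3:  3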

Related

How can I harness `apply/lapply/sapply` instead of a `for` loop to improve performance?

I would like to speed up the function below (fndf), which calls another function (fn1) based on a character array.
fndf - new function
list_s - character array - chr [1:400]
rdata_i - empty data frame (for initialization)
fn1 - another custom function
rdata2 - data frame with 3000 obs of 40 variables
mdata - data.frame
nm - character
fndf = function(list_s, rdata2){
    rdata_i = data.frame(Date = as.Date(character()),
                         File = character(),
                         User = character(),
                         stringsAsFactors = FALSE)
    for(i in 1:length(list_s)){
        rdata = fn1(list_s[i], rdata2)
        rdata_i = rbind(rdata, rdata_i)
    }
    return(unique(rdata_i))
}
Can we also improve performance of the function below?
fn1 = function(nm, mdata){
    n0 = mdata[mdata$Sign == nm, ]
    cn0 = unique(c(n0$Name))
    repeat{
        n1c = mdata[mdata$Mgr %in% cn0, ]
        n0 = unique(rbind(n0, n1c))
        if(nrow(n1c) == 0){
            return(n0)
            break  # unreachable: the return above already exits
        }
        cn0 = unique(c(n1c$Name))
    }
}
It’s indeed hard to say how to best transform your loop into an *apply statement, and even harder to say whether this will speed it up. But fundamentally, the following transformation is what you’re after, and it definitely makes the function simpler and more readable. It also quite possibly corresponds to a substantial performance gain due to the loss of the repeated rbind, as noted by baptiste:
fndf = function(list_s, rdata2)
    as.data.frame(do.call(rbind, unique(lapply(list_s, fn1, rdata2))))
(Yes. That’s a single statement.)
Also note that I’m now applying the unique directly to the list rather than the data.frame. This changes the semantics – unique is specialised for data.frames – but is probably the right thing for your purposes, and it will be more efficient because it means that we don’t construct a needlessly big data.frame with redundant rows.
It's hard to say without your data/functions, but here is a solution with plyr and some placeholder data:
list_s <- LETTERS
rdata2 <- data.frame(a = rep(LETTERS, 2), b = runif(52), c = runif(52) * 10)
fn1 <- function(a, b = rdata2) b[rdata2$a == a, ]
fn1("A")
require(plyr) # for the ldply function, which takes a list and returns a data.frame
result <- ldply(1:length(list_s), function(x) fn1(list_s[x], rdata2))
head(result)
a b c
1 A 0.281940237 2.7774933
2 A 0.023611392 0.6067029
3 B 0.456547803 9.4219258
4 B 0.645783746 5.3094864
5 C 0.475949523 4.8580622
6 C 0.006063407 2.5851738

temporarily keying a data.table for merging

I have a keyed data.table, x, and realize that I need to merge it using a different multicolumn key.
I want to avoid (i) setting and resetting x's key and (ii) keeping track of copies of x with different keys. Here's some sample data and my current approach:
require(data.table)
options(datatable.verbose=TRUE)
set.seed(1)
n <- 10
m <- 2
samp <- function(n) sample(1:9, n, replace = TRUE)
x <- data.table(A = samp(n), B = samp(n), C = samp(n), key = "A")
y <- x[samp(m), list(B, C, D = samp(m))]
# this works:
x[,.SD,key="B,C"][y]
# B C A D
# 1: 7 6 6 5
# 2: 9 4 6 2
So that approach works, but I get the comment
...j is a named list. It's very inefficient...
The named list is .SD. Is there a better or more standard way to do this?
It seems that using key or keyby without .SD has no effect:
key(x[,,keyby="B,C"]) # A
key(x[,,key="B,C"]) # A
In version 1.9.5, the on argument was added, with this usage note in the changelog:
data.tables can join now without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column 'y' of DT2 with 'x' of DT1. DT1[DT2, on="y"] would join on column 'y' on both data.tables.
In this case, since the merge-column names are the same in x and y, x[y,on=c("B","C")] works.
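A self-contained sketch with small made-up tables:
library(data.table)
x <- data.table(A = 1:4, B = c(2L, 2L, 3L, 3L), C = c(5L, 6L, 5L, 6L), key = "A")
y <- data.table(B = c(2L, 3L), C = c(6L, 5L), D = c("p", "q"))
x[y, on = c("B", "C")]  # joins on B and C; x's key on A is left untouched
#    A B C D
# 1: 2 2 6 p
# 2: 3 3 5 q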
Historical answer (around version 1.8.11): As of version 1.8.11 [.data.table will have a key argument, which is equivalent to calling setkeyv beforehand. It's not exactly what this question is looking for, but I don't see a way of achieving this without copying the entire data (bad imo), so I think this is a reasonable compromise, but please let me know if you think otherwise.
Edit from Matthew
Specifically adding an argument named key to [.data.table was a new suggestion in the last few days that I haven't responded to yet. We've discussed secondary keys in the past, set2key for example. Secondary keys won't copy the data.
We'll discuss it off list but I think key in [.data.table will likely change name or be done differently. Reminder to internet: v1.8.11 is in development, unstable and experimental. When it gets published to CRAN, then it can be relied on.
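As a postscript: secondary indices did later ship as setindex(), which is essentially the secondary-key idea described above. A hedged sketch of that eventual interface, reusing x from this question:
setindex(x, B, C)               # builds a secondary index; no re-sort, no copy of the data
indices(x)                      # "B__C"
x[.(2L, 6L), on = c("B", "C")]  # on= joins/subsets can then reuse the index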

R large dataframe conversions in parallel

Essentially, I've got a large data.frame, 10,000,000 x 900 (rows, columns), and I'm trying to convert the class of each column in parallel. The end result needs to be a data.frame.
Here's what I've got so far:
Pretend df is the data.frame already defined; all columns are a mixture of numeric and character classes.
library(snow)
cl <- makeCluster(50, type = "SOCK")
cl.out <- clusterApplyLB(cl, df, function(x) factor(x, exclude = NULL))
cl.out is a list of what I want, except that I need it to be of class data.frame.
So this is where I get stuck... do I try to combine all of the elements of cl.out into a data.frame, which isn't going to be in parallel? (SLOW, time is an issue)
Can I implement something else with a different package? (foreach?)
Do I have to hand-code some C to get this done efficiently?
Any help would be appreciated.
Thanks,
One useful paradigm is to subset and replace all columns, treating df as list-like
df[] <- lapply(df, factor, exclude=NULL)
Do you really have 50 cores on a single machine, as implied by your call to makeCluster? If you're not on a Windows machine, use the parallel package and mclapply instead
library(parallel)
options(mc.cores=50)
df[] <- mclapply(df, factor, exclude=NULL)
Is this really going to help you in your overall evaluation? It seems to cost as much to distribute and retrieve the data as to do the calculation.
> f = factor(rep("M", 10000000), levels=LETTERS)
> df = data.frame(f, f, f, f, f, f, f, f)
> system.time(lapply(df, factor, exclude=NULL))
user system elapsed
2.676 0.564 3.250
> system.time(clusterApply(cl, df, factor, exclude=NULL))
user system elapsed
1.488 0.752 2.476
> system.time(mclapply(df, factor, exclude=NULL))
user system elapsed
1.876 1.832 1.814
(the multi-core and multi-process timings are probably highly variable).
If you have a data.frame of that size, you will be running into memory issues very quickly. You could use set from data.table, which will be much faster and more memory efficient:
library(data.table)
# to set as a data.table without having to copy
setattr(df, 'class', c('data.table', 'data.frame'))
alloc.col(df)
for (nn in names(df)) {
    set(df, j = nn, value = factor(df[[nn]]))
}
It is worth reading data.table and parallel computing
Since a data.frame is a list with a class attribute, you can convert the list into a data.frame with as.data.frame.
cl.out <- as.data.frame(cl.out)
I notice that the column names are lost; if you are sure that they are in the same order, you can set them back with:
names(cl.out) <- names(df)

R data.table efficient replication by group

I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.
Here is some sample data:
ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")
What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.
The following code works and gives me the answer I want:
library(plyr) # for ddply
good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))
good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]
The issue is that when I scale this up to 40mm+ rows, I get into memory issues for both possible solutions.
If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.
Does anyone have a better solution?
I looked at set() for data.table but was running into issues because I wanted to keep the tokens in the same order for each replication.
One way is:
require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]
Everything except the last line should be straightforward. The last line performs a subset using the key column with the help of J(.). For each value in J(.), the corresponding value is matched against the key column and the matched subset is returned.
That is, if you do dt[J(1)] you'll get the subset where multiple = 1. And if you note carefully, doing dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between passing dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter subsets by matching (1 and 1) against the first key column of the data.table.
So, if we pass the same value of the column 2 times in J(.), it gets duplicated twice. We use this trick to pass 1 once, 2 twice, etc., and that's what the rep(.) part does: rep(.) gives 1,2,2,3,3,3,4,4,4,4.
And if the join results in more rows than max(nrow(dt), nrow(i)) (i is the rep vector that's inside J(.)), you've to explicitly use allow.cartesian = TRUE to perform this join (I guess this is a new feature from data.table 1.8.8).
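A toy illustration of the whole trick, on a made-up three-row table:
library(data.table)
dt <- data.table(multiple = 1:3, token = c("x", "y", "z"), key = "multiple")
idx <- unique(dt$multiple)
dt[J(rep(idx, idx)), allow.cartesian = TRUE]  # each row repeated 'multiple' times
#    multiple token
# 1:        1     x
# 2:        2     y
# 3:        2     y
# 4:        3     z
# 5:        3     z
# 6:        3     z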
Edit: Here's some benchmarking I did on "relatively" big data. I don't see any spike in memory allocation in either method. But I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it slips my mind at the moment. I'll write back again. For now, here's the test data and some preliminary results in case anyone is interested/wants to run it for themselves.
# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)
DF <- data.frame(yr = rep(yr, sz),
                 token = token,
                 multiple = multiple, stringsAsFactors = FALSE)
# Arun's solution
ARUN.DT <- function(dt) {
setkey(dt, "multiple")
idx <- unique(dt$multiple)
dt[J(rep(idx,idx)), allow.cartesian=TRUE]
}
# Ricardo's solution
RICARDO.DT <- function(dt) {
setkey(dt, key="yr")
newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
newDT
}
# create data.table
require(data.table)
DT <- data.table(DF)
# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")
# test replications elapsed relative user.self sys.self
# 1 res1 <- ARUN.DT(DT) 10 9.542 1.000 7.218 1.394
# 2 res2 <- RICARDO.DT(DT) 10 17.484 1.832 14.270 2.888
But as Ricardo says, it may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used in both methods here to say definitively if using Join is better.
You can try allocating the memory for all the rows first, and then populating the table iteratively.
eg:
# make sure `sample_data$multiple` is an integer
sample_data$multiple <- as.integer(sample_data$multiple)
# create data.table
S <- data.table(sample_data, key = 'yr')
# optionally, drop the original data.frame if not needed
rm(sample_data)
## Allocate the memory first (built from S, since sample_data was dropped above)
newDT <- data.table(yr = rep(S$yr, S$multiple), key = "yr")
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := S[.(y)][, token], by = list(y = yr)]
Two notes:
(1) sample_data$multiple is currently a character and thus gets coerced when passed to rep (in your original example). It might be worth double-checking whether that is also the case in your real data.
(2) I used the following to determine the number of rows needed per year
S[, list(rows=length(token) * unique(multiple)), by=yr]

What you can do with a data.frame that you can't with a data.table?

I just started using R, and came across data.table. I found it brilliant.
A very naive question: can I ignore data.frame and use data.table exclusively, to avoid syntax confusion between the two packages?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[, 1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame, and that package can use [.data.frame syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
What are the smaller syntax differences between data.frame and data.table
DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
For this reason we say the comma is optional in DT, but not optional in DF
DT[[3]] == DF[, 3] == DF[[3]]
DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
DT[ , "colA"][[1]] == DF[ , "colA"].
DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA
DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
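A short sketch of that last bullet:
DF <- data.frame(a = 1:3, b = 4:6)
DF[, "a"]                # drops to a vector: 1 2 3
DF[, "a", drop = FALSE]  # stays a one-column data.frame
library(data.table)
DT <- as.data.table(DF)
DT[, "a"]                # always a one-column data.table; there is no drop argument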
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
Small caveat
There will possibly be cases where some packages use code that falls down when given a data.table; however, given that data.table is constantly being maintained to avoid such problems, any problems that arise will be fixed promptly.
For example
see this question and prompt response
From the NEWS for v 1.8.2
base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.
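A hedged sketch of that workaround, with made-up column names (data.table provides an as.POSIXct method for ITime that accepts a date):
library(data.table)
DT <- data.table(t = as.ITime(c("08:30:00", "17:45:00")), y = 1:2)
DT[, t2 := as.POSIXct(t, date = Sys.Date())]  # POSIXct axes work natively in ggplot2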
