temporarily keying a data.table for merging - r

I have a keyed data.table, x, and realize that I need to merge it using a different multicolumn key.
I want to avoid (i) setting and resetting x's key and (ii) keeping track of copies of x with different keys. Here's some sample data and my current approach:
require(data.table)
options(datatable.verbose=TRUE)
set.seed(1)
n <- 10
m <- 2
samp <- function(n) sample(1:9,n,replace=T)
x <- data.table(A = samp(n),B = samp(n),C = samp(n),key="A")
y <- x[samp(m),list(B,C,D=samp(m))]
# this works:
x[,.SD,key="B,C"][y]
# B C A D
# 1: 7 6 6 5
# 2: 9 4 6 2
So that approach works, but I get the comment
...j is a named list. It's very inefficient...
The named list is .SD. Is there a better or more standard way to do this?
It seems that using key or keyby without .SD has no effect:
key(x[,,keyby="B,C"]) # A
key(x[,,key="B,C"]) # A

In version 1.9.5, the on argument was added, with this usage note in the changelog:
data.tables can join now without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column 'y' of DT2 with 'x' of DT1. DT1[DT2, on="y"] would join on column 'y' on both data.tables.
In this case, since the merge-column names are the same in x and y, x[y,on=c("B","C")] works.
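As a minimal sketch of the on= approach (the small tables below are made up for illustration, not the question's random sample), the join runs on B and C while x keeps its original key:

```r
library(data.table)

# Hypothetical stand-ins for the question's x and y
x <- data.table(A = c(1L, 2L, 3L), B = c(7L, 9L, 5L), C = c(6L, 4L, 1L), key = "A")
y <- data.table(B = c(7L, 9L), C = c(6L, 4L), D = c(5L, 2L))

# Join on B and C without touching x's key
res <- x[y, on = c("B", "C")]
print(res)

# x is still keyed on A afterwards
print(key(x))
```

No copy of x with a different key is kept around, which is what the question was trying to avoid.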
Historical answer (around version 1.8.11): As of version 1.8.11 [.data.table will have a key argument, which is equivalent to calling setkeyv beforehand. It's not exactly what this question is looking for, but I don't see a way of achieving this without copying the entire data (bad imo), so I think this is a reasonable compromise, but please let me know if you think otherwise.
Edit from Matthew
Specifically adding an argument named key to [.data.table was a new suggestion in the last few days that I haven't responded to yet. We've discussed secondary keys in the past, set2key for example. Secondary keys won't copy the data.
We'll discuss it off list but I think key in [.data.table will likely change name or be done differently. Reminder to internet: v1.8.11 is in development, unstable and experimental. When it gets published to CRAN, then it can be relied on.
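For readers on a later release: the secondary keys discussed here are, as far as I know, exposed through setindex()/setindexv() and indices() rather than set2key(). The sketch below assumes such a release is installed; the table is hypothetical.

```r
library(data.table)

# Hypothetical table keyed on A
x <- data.table(A = c(3L, 1L, 2L), B = c(7L, 9L, 5L), C = c(6L, 4L, 1L), key = "A")

# Add a secondary index on (B, C); the primary key is left alone
setindexv(x, c("B", "C"))

print(key(x))      # primary key
print(indices(x))  # secondary indices
```

An index stores only an ordering, so the data itself is not copied or physically re-sorted.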

Related

Strange issue with data.table row search

I am a big fan and massive user of data.tables in R. I really use them for a lot of code but have recently encountered a strange bug:
I have a huge data.table with multiple columns, example:
x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c
if I select
dataDT[x=='1']
I end up getting
x y
1: 1 a
whereas
dataDT[(x=='1')]
gives me
x y
1: 1 a
2: 1 b
3: 1 c
Any ideas? x and y are factors, and the data.table is keyed on x using setkey.
ADDITIONAL INFOS AND CODE:
I actually fixed this issue but in a way that is not clear nor intuitive.
My code is structured as follows: I have a function called from my main code where I have to introduce a column in the data.table.
I have previously used the following notation
dataT[,nC:=oC,]
to do the deed.
I have instead found that creating the new column by using
dataT$nC <- dataT$oC
instead fixes the bug completely.
I tried to replicate the exact same bug on a simpler example code but I cannot, possibly because of dependencies related to the size structure of my data.table as well as the specific functions I am running on my table.
With that said, I have a working example that shows that when you insert a column using the dataT[,nC:=oC,] notation, it acts as if the table were passed by reference to the function rather than by value.
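A minimal sketch of that by-reference behaviour, separate from the larger example (the function names are hypothetical): := inside a function modifies the caller's table, while an explicit copy() keeps the change local.

```r
library(data.table)

# Adds a column with := ; the caller's table is modified in place
add_col_by_ref <- function(dt) {
  dt[, z := y]
  invisible(NULL)
}

# Works on an explicit copy, so the caller's table is untouched
add_col_on_copy <- function(dt) {
  dt <- copy(dt)
  dt[, z := y]
  dt
}

DT <- data.table(x = 1:3, y = c(10, 20, 30))

out <- add_col_on_copy(DT)
has_z_after_copy <- "z" %in% names(DT)

add_col_by_ref(DT)
has_z_after_ref <- "z" %in% names(DT)

print(c(after_copy = has_z_after_copy, after_ref = has_z_after_ref))
```

This is documented data.table behaviour rather than a bug, and dataT$nC <- dataT$oC behaves differently because base R's $<- assignment works on a copy.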
Also, interestingly enough, while performing
dataDT[x=='1']
vs
dataDT[(x=='1')]
shows the same result, the latter is 10 times slower, which I have noticed previously. I hope this code can shed some light.
rm(list=ls())
library(data.table)
superParF <- function(dtInput){
dtInputP <- dtInput[a==1]
dtInputN <- dtInput[a==2]
outDT <- rbind(dtInputP[,sum(y),by='x'],
dtInputN[,sum(y),by='x'])
return(outDT)
}
superFunction <- function(dtInput){
#create new column
dtInput[,z:=y,]
#run function
outDT <- rbindlist(lapply(unique(inputDT$x),
function(i)
superParF(inputDT[x==i])))
#output outDT
return(outDT)
}
inputDT <- data.table(x = c(rep(1,100000),
rep(2,100000),
rep(3,100000),
rep(4,100000),
rep(5,100000)),
y= c(rep(1:100000,5)))
inputDT$x <- as.factor(inputDT$x)
inputDT$y <- as.numeric(inputDT$y)
inputDT <- rbind(inputDT,inputDT)
inputDT$a <- c(rep(1,500000),rep(2,500000))
setkey(inputDT,x)
#first observation-> the two searches do not work with the same performance
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
out <- superFunction(inputDT)
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
inputDT
I asked in comments to provide the version number and to follow the guidelines on the Support page. It contains :
Read and search the README.md. Is there a bug fix or a new feature related to your issue? Probably we were aware of the issue or someone else reported it and we have already fixed the issue in the current development version.
So, searching the README.md for the string "index" just using Ctrl-F in the browser, yields :
21 Auto indexing handles logical subset of factor column using numeric value properly, #1361. Thanks @mplatzer.
26 Auto indexing returns order of subset properly when input data.table is already sorted, #1495. Thanks @huashan for the nice reproducible example.
Those are fixed in v1.9.7 easily installed with one command detailed on the Installation page.
The first one (item 21) looks suspiciously close to your issue. So please do try v1.9.7 as requested on the Support page in point 4.
We ask you to state the version number up front to save time, because we want to ensure you are using at least v1.9.6 on CRAN and not v1.9.4, which had this problem:
DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).
So which version are you running please and have you tried v1.9.7 since it looks like it's worth a try?
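A quick way to check both points on your own machine is sketched below, using a small made-up table rather than your data: report the installed version, then compare the subset with automatic indexing switched off against the parenthesised form.

```r
library(data.table)

# Report the installed version up front
print(packageVersion("data.table"))

# Small hypothetical table with a factor column, keyed on x
DT <- data.table(x = factor(c(1, 1, 2)), y = c("a", "b", "c"))
setkey(DT, x)

# Subset with automatic indexing turned off, then restore the option
old <- options(datatable.auto.index = FALSE)
r_plain <- DT[x == "1"]
options(old)

# Parenthesised form, which is evaluated as an ordinary logical vector
r_paren <- DT[(x == "1")]

print(r_plain)
print(r_paren)
```

If the two results differ on your real data only when datatable.auto.index is left on, that points at the auto-indexing fixes listed above.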
Using the dT[,Column:=Value] notation seems to cause the SAME BUG in another post as well!
data.table not recognising logical in filter
Replacing dT[,Column:=Value] with dT$Column <- Value fixes both my bug and this post's bug.
@Matt Dowle: the post that I am linking has much more succinct code than I have, and the bug is the same! You would find it of great help in your quest to fix this issue!

About the new features J() of data.table 1.9.2

I'm happy to find data.table has its new release, and got one question about J(). From data.table NEWS 1.9.2:
x[J(2), a], where a is the key column sees a in j, #2693 and FAQ 2.8. Also, x[J(2)] automatically names the columns from i using the key columns of x. In cases where the key columns of x and i are identical, i's columns can be referred to by using i.name; e.g., x[J(2), i.a]
There're several questions about J() in S.O, and also the introduction to data.table talks about the binary search of J(). But my understanding of J() is still not very clear.
All I know is that, if I want to select rows where "b" in column A and "d" in column B:
DT2 <- data.table(A = letters[1:5], B = letters[3:7], C = 1:5)
setkey(DT2, A, B)
DT2[J("b", "d")]
and if I want to select the rows where A = "a" or "c", I code like this
DT2[A == "a" | A == "c"]
much like the data.frame way. (minor question: how to select using a more data.table way?)
So to my understanding, J() is only used in the case above: selecting a single value from each of two different columns.
Hope my understanding is wrong. There're few documents about J(). I read How is J() function implemented in data.table?. J(.) is detected and simply replaced with list(.)
It seems that every case list(.) can replace J(.)
And back to the question: what is the purpose of this new feature, x[J(2), a]?
It's really appreciated if you can give some detailed explanations!
.() and J(), as the function wrapping the i argument of data.table, are simply replaced by list(), because [.data.table does some programming on the language of the i and j arguments to optimize how things are done internally. Each can be thought of as an alias for list.
The reason they are included is to save time and effort (3 key strokes!)
If I wanted to select key values 'a' or 'c' from the first column of a key I could do
DT[.(c('a','c'))]
# or
DT[J(c('a','c'))]
# or
DT[list(c('a','c'))]
If I wanted A = 'a' or 'c' and B = 'd' then I could use
DT[.(c('a','c'),'d')]
If I wanted A = 'a' or 'c' and B = 'd' or 'e' then I would use CJ (or expand.grid) to create all combinations
DT[CJ(c('a','c'),c('d','e'))]
The help for J,SJ and CJ is quite well written! See also the vignette Keys and fast binary search based subset.
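The forms above can be tried on the question's DT2. This sketch assumes a reasonably recent data.table (nomatch = NULL is used to drop non-matching combinations) and changes the CJ values so that some combinations actually match.

```r
library(data.table)

DT2 <- data.table(A = letters[1:5], B = letters[3:7], C = 1:5)
setkey(DT2, A, B)

# Three equivalent ways to select A == "a" or A == "c" via the key
r_dot  <- DT2[.(c("a", "c"))]
r_J    <- DT2[J(c("a", "c"))]
r_list <- DT2[list(c("a", "c"))]

# All combinations of A in (a, c) and B in (c, e); keep only rows that exist
r_cj <- DT2[CJ(c("a", "c"), c("c", "e")), nomatch = NULL]

print(r_dot)
print(r_cj)
```

Only the (a, c) and (c, e) combinations exist in DT2, so r_cj has two rows; without nomatch = NULL the unmatched combinations would appear with NA in C.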

Why is rbindlist "better" than rbind?

I am going through documentation of data.table and also noticed from some of the conversations over here on SO that rbindlist is supposed to be better than rbind.
I would like to know why is rbindlist better than rbind and in which scenarios rbindlist really excels over rbind?
Is there any advantage in terms of memory utilization?
rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame
Where does it really excel
Some questions that show where rbindlist shines are
Fast vectorized merge of list of data.frames by row
Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
These have benchmarks that show how fast it can be.
rbind.data.frame is slow, for a reason
rbind.data.frame does lots of checking, and will match by name (i.e. rbind.data.frame will account for the fact that columns may be in different orders, and match up by name). rbindlist doesn't do this kind of checking, and will join by position.
eg
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
Some other limitations of rbindlist
It used to struggle to deal with factors, due to a bug that has since been fixed:
rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names
see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)
rbind.data.frame rownames can be frustrating
rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.
You can get in a muddle of rownames using do.call(rbind, list(...))
see
How to avoid renaming of rows when using rbind inside do.call?
Memory efficiency
In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.
rbind.data.frame is implemented in R; it does lots of assigning and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
By v1.9.2, rbindlist had evolved quite a bit, implementing many features including:
Choosing the highest SEXPTYPE of columns while binding - implemented in v1.9.2 closing FR #2456 and Bug #4981.
Handling factor columns properly - first implemented in v1.8.10 closing Bug #2650 and extended to binding ordered factors carefully in v1.9.2 as well, closing FR #4856 and Bug #5019.
In addition, in v1.9.2, rbind.data.table also gained a fill argument, that allows to bind by filling missing columns, implemented in R.
Now in v1.9.3, there are even more improvements on these existing features:
rbindlist gains an argument use.names, which by default is FALSE for backwards compatibility.
rbindlist also gains an argument fill, which by default is also FALSE for backwards compatibility.
These features are all implemented in C, and written carefully to not compromise in speed while adding functionalities.
Since rbindlist can now match by names and fill missing columns, rbind.data.table just calls rbindlist now. The only difference is that use.names=TRUE by default for rbind.data.table, for backwards compatibility.
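A small sketch of the two arguments described above, using made-up tables; it assumes a data.table version in which rbindlist has use.names and fill.

```r
library(data.table)

# Same columns, different order
t1 <- data.table(a = 1:2, b = 2:3)
t2 <- data.table(b = 1:2, a = 2:3)

# Match columns by name instead of by position
by_name <- rbindlist(list(t1, t2), use.names = TRUE)
print(by_name)

# A table that is missing column b entirely
t3 <- data.table(a = 5L)

# fill = TRUE pads the missing column with NA
filled <- rbindlist(list(t1, t3), use.names = TRUE, fill = TRUE)
print(filled)
```

With use.names = TRUE the result matches what do.call(rbind, ...) gives for data.frames in the earlier example.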
rbind.data.frame slows down quite a bit mostly due to copies (which @mnel points out as well) that could be avoided (by moving to C). I think that's not the only reason. The implementation for checking/matching column names in rbind.data.frame could also get slower when there are many columns per data.frame and there are many such data.frames to bind (as shown in the benchmark below).
However, that rbindlist lack(ed) certain features (like checking factor levels or matching names) bears very tiny (or no) weight towards it being faster than rbind.data.frame. It's because they were carefully implemented in C, optimised for speed and memory.
Here's a benchmark that highlights the efficient binding while matching by column names as well using rbindlist's use.names feature from v1.9.3. The data set consists of 10000 data.frames each of size 10*500.
NB: this benchmark has been updated to include a comparison to dplyr's bind_rows
library(data.table) # 1.11.5, 2018-06-02 00:09:06 UTC
library(dplyr) # 0.7.5.9000, 2018-06-12 01:41:40 UTC
set.seed(1L)
names = paste0("V", 1:500)
cols = 500L
foo <- function() {
data = as.data.frame(setDT(lapply(1:cols, function(x) sample(10))))
setnames(data, sample(names))
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
.Call("Csetlistelt", ll, i, foo())
}
system.time(ans1 <- rbindlist(ll))
# user system elapsed
# 1.226 0.070 1.296
system.time(ans2 <- rbindlist(ll, use.names=TRUE))
# user system elapsed
# 2.635 0.129 2.772
system.time(ans3 <- do.call("rbind", ll))
# user system elapsed
# 36.932 1.628 38.594
system.time(ans4 <- bind_rows(ll))
# user system elapsed
# 48.754 0.384 49.224
identical(ans2, setDT(ans3))
# [1] TRUE
identical(ans2, setDT(ans4))
# [1] TRUE
Binding columns as such, without checking names, took just 1.3 seconds, whereas checking column names and binding appropriately took just 1.5 seconds more. Compared to the base solution, this is 14x faster, and it is 18x faster than dplyr's version.
Then this probably should be considered a bug?
# let us make a very simple list here
l <- list('a' = 1, 'b' = 2, 'c' = 3)
l
$a
[1] 1
$b
[1] 2
$c
[1] 3
# check that it is a list
class(l)
#[1] "list"
typeof(l)
#[1] "list"
And rbind can handle it without any problems
do.call('rbind', l)
# [,1]
# a 1
# b 2
# c 3
But when using rbindlist one gets this?
rbindlist(l)
Error in rbindlist(l) :
Item 1 of input is not a data.frame, data.table or list
The error message is more than confusing since we checked above that the input is a list, didn't we?
Is the correct application of the function for this simplest of cases documented somewhere?
Any hints appreciated. Since I was expecting similar or the same results as with do.call('rbind', l), I am a bit confused why the data.table function transposes the result of the equivalent call on the same list when I work around that class error, e.g. by doing this:
rbindlist(list(l))
# will result in
a b c
1: 1 2 3
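One possible reading, offered as a sketch rather than an authoritative answer: rbindlist treats each item of its input as a table-like object (a list of columns), so a bare number is rejected, while list(l) is a single item with three columns. Wrapping each element in its own one-column list gives one row per element; the column names "name" and "value" are my own choice, and idcol assumes a version of data.table that supports it.

```r
library(data.table)

l <- list(a = 1, b = 2, c = 3)

# Each element becomes a one-column, one-row item; idcol keeps the list names
long <- rbindlist(lapply(l, function(v) list(value = v)), idcol = "name")
print(long)

# The workaround from the question: one item with three columns, hence one row
wide <- rbindlist(list(l))
print(wide)
```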

What you can do with a data.frame that you can't with a data.table?

I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally
different from j in [.data.frame. Even something as simple as
DF[,1] would break existing code in many packages and user code.
This is by design, and we want it to work this way for more
complicated syntax to work. There are other differences, too (see FAQ
2.17).
Furthermore, data.table inherits from data.frame. It is a
data.frame, too. A data.table can be passed to any package that
only accepts data.frame and that package can use [.data.frame
syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of
these was accepted as a new feature in R 2.12.0 :
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked
encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements
to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much
faster than a for loop in C. This would improve the way that R copies
data internally (on some measures by 13 times). The thread on r-devel
is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
What are the smaller syntax differences between data.frame and data.table
DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
For this reason we say the comma is optional in DT, but not optional in DF
DT[[3]] == DF[, 3] == DF[[3]]
DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
DT[ , "colA"][[1]] == DF[ , "colA"].
DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns
NA rows for each NA
DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
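A few of the differences listed above can be checked directly. This is a minimal sketch on a made-up three-column table, assuming a data.table release in which DT[, "colA"] returns a one-column data.table.

```r
library(data.table)

DF <- data.frame(colA = 1:3, colB = 4:6, colC = 7:9)
DT <- as.data.table(DF)

# DT[3] is the 3rd row; DF[3] is the 3rd column
dim_dt3 <- dim(DT[3])
dim_df3 <- dim(DF[3])

# [[3]] extracts the 3rd column as a vector in both
same_col <- identical(DT[[3]], DF[[3]])

# NA in a logical row subset: treated as FALSE by data.table, kept as an NA row by data.frame
n_dt <- nrow(DT[c(TRUE, NA, FALSE)])
n_df <- nrow(DF[c(TRUE, NA, FALSE), ])

# Single-column selection: data.table keeps a table, data.frame drops to a vector
dt_keeps_table <- is.data.table(DT[, "colA"])
df_drops <- is.vector(DF[, "colA"])

print(list(dim_dt3 = dim_dt3, dim_df3 = dim_df3, n_dt = n_dt, n_df = n_df))
```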
Small caveat
There will possibly be cases where some packages use code that falls down when given a data.table. However, given that data.table is constantly being maintained to avoid such problems, any problems that may arise will be fixed promptly.
For example
see this question and prompt response
From the NEWS for v 1.8.2
base::unname(DT) now works again, as needed by plyr::melt(). Thanks to
Christoph Jaeckel for reporting. Test added.
An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2
without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added.
ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2
doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.

Subset row and column at the same time

I'm a bit surprised by how data.table works:
> library(data.table)
data.table 1.8.2 For help type: help("data.table")
> dt <- data.table(a=11:20, b=21:30, c=31:40, key="a")
> dt[list(12)]
a b c
1: 12 22 32
> dt[list(12), b]
a b
1: 12 22
> dt[list(12)][,b]
[1] 22
What I'm trying to do is obtain the value of a single column (or expression) in rows matched by a selection. I see that I have to pass the key as a list, since a raw number would indicate a row number and not a key value. So the first of the above is clear to me. But why the second and the third subset expressions yield different results seems rather confusing to me. I'd like to get the third result, but would expect to be able to write it the second way.
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result? Is there a syntactically shorter way to obtain a single result except by double subsetting as above?
I'm using data.table 1.8.2 on R 2.15.1. If you cannot reproduce my example, you might as well consider a factor as key:
dt <- data.table(a=paste("a", 11:20, sep=""), b=21:30, c=31:40, key="a")
dt["a11", b]
Regarding this question:
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result?
I believe that the (good enough for me) reason is simply that Matthew Dowle hasn't yet gotten around to adding that option (likely because he has prioritized work on much more useful features such as ":= with by").
In comments following my answer here, Matthew seemed to indicate that it is on his TODO list, noting that "[this] is what drop=TRUE will do (with a speed advantage) when drop is added".
Until then, any of the following will get the job done:
dt[list(12)][,b]
# [1] 22
dt[list(12)][[2]]
# [1] 22
dt[dt[list(12), which=TRUE], b]
# [1] 22
One possibility is to use:
dt[a == 12]
and
dt[a == 12, b]
This will work as expected, but it prevents binary search and requires sequential search instead (is there a plan to change this behavior ??), making it potentially slower.
UPDATE Sep 2014: now in v1.9.3
From NEWS :
DT[column==values] is now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==values] is much faster. DT[column %in% values] is equivalent; i.e., both == and %in% accept vector values. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).
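Putting the options from this thread side by side, as a sketch on the question's table: each of these should return the single value of b for the row where a is 12. The == form assumes a version with the optimization quoted above; it gives the same answer either way, only the speed differs.

```r
library(data.table)

dt <- data.table(a = 11:20, b = 21:30, c = 31:40, key = "a")

# Double subsetting: key lookup first, then column extraction
v1 <- dt[list(12)][, b]
v2 <- dt[list(12)][[2]]

# Look up the row number via the key, then subset by row number
v3 <- dt[dt[list(12), which = TRUE], b]

# Plain comparison on the key column
v4 <- dt[a == 12, b]

print(c(v1, v2, v3, v4))
```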

Resources