About the new features J() of data.table 1.9.2 - r

I'm happy to find data.table has its new release, and got one question about J(). From data.table NEWS 1.9.2:
x[J(2), a], where a is the key column sees a in j, #2693 and FAQ 2.8. Also, x[J(2)] automatically names the columns from i using the key columns of x. In cases where the key columns of x and i are identical, i's columns can be referred to by using i.name; e.g., x[J(2), i.a]
There're several questions about J() in S.O, and also the introduction to data.table talks about the binary search of J(). But my understanding of J() is still not very clear.
All I know is that, if I want to select rows where "b" in column A and "d" in column B:
DT2 <- data.table(A = letters[1:5], B = letters[3:7], C = 1:5)
setkey(DT2, A, B)
DT2[J("b", "d")]
and if I want to select the rows where A = "a" or "c", I code like this
DT2[A == "a" | A == "c"]
much like the data.frame way. (minor question: how to select using a more data.table way?)
So to my understanding, 'J() only uses in the above case. select two single value from 2 different columns.
Hope my understanding is wrong. There're few documents about J(). I read How is J() function implemented in data.table?. J(.) is detected and simply replaced with list(.)
It seems that every case list(.) can replace J(.)
And back to the question, what the purpose of this new feature? x[J(2), a]
It's really appreciated if you can give some detailed explanations!

.() and J() as the function wrapping the i argument of data.table are simply replaced by list() because [.data.table does some programming on the language of the i and j arguments to optimize how things are done internally. It can be thought of as a alias for list
The reason they are included is to allow save time and effort (3 key strokes!)
If I wanted to select key values 'a' or 'c' from the first column of a key I could do
DT[.(c('a','c'))]
# or
DT[J(c('a','c'))]
# or
DT[list(c('a','c'))]
If I wanted A='a' or 'c' and B = 'd' then I would could use
DT[.(c('a','c'),'d')]
If I wanted A = 'a' or 'c' and B = 'd' or 'e' then I would use CJ (or expand.grid) to create all combinations
DT[CJ(c('a','c'),c('d','e'))]
The help for J,SJ and CJ is quite well written! See also the vignette Keys and fast binary search based subset.

Related

if else command in data.table to add column

Suppose I have a data.table with columns "A","C","byvar" and sometimes "B". I want to summarize it by a variable 'byvar', but only include B if it is present or conditional upon some other criteria.
The following doesn't seem to work, does someone have an idea?
dt[, .(
A=sum(A),
if("B" %in% names(dt)) {B=mean(B)},
C=mean(C),
D=sum(A)/C
), by = .(byvar)]
Try B=ifelse("B"%in%names(dt),mean(B),NA) it'll give you a column with NAs but it is extensible to arbitrary criteria and column names.
dt<-data.table(A=runif(100,1,100), C=runif(100,1,100), byvar=rep(letters[1:10],10))
dt[, .(
A=sum(A),
B=ifelse("B"%in%names(dt),mean(B),NA),
C=mean(C),
D=sum(A)/C
), by = .(byvar)]
In running this I get 100 row response because your D=sum(A)/C has C in it which grabs the original C not the new C and so it gives you 100 rows because there are 100 Cs. If you change your definition of D to sum(A)/mean(C) then it gives what you likely intended.
Edit:
Another way you can do this is to take advantage of the ability to use curly braces in the J expression
dt[, {checkcol='B'
prelimreturn=list(A=sum(A),
C=mean(C),
D=sum(A)/mean(C))
if(checkcol%in%names(dt)) prelimreturn[[checkcol]]<-mean(get(checkcol))
prelimreturn}
, by = .(byvar)]
Here I set a helper variable called checkcol so that we're not putting "B" in two places. Next we make your preliminary result with the columns you know you want. After that we check if whatever is in checkcol exists and if it does we add that column to our prexisting list. Then the last line in the curly braces is what data.table displays which is our prelimresult list which may or may not have a "B" column. You could extend this approach pretty broadly too.
You can try
dt[, lapply(.SD, sum), byvar,,.SDcols = patterns("A|B|C")]

Subsetting columns works on data.frame but not on data.table

I can select a few columns from a data.frame:
> z[c("events","users")]
events users
1 26246016 201816
2 942767 158793
3 29211295 137205
4 30797086 124314
but not from a data.table:
> best[c("events","users")]
Starting binary search ...Error in `[.data.table`(best, c("events", "users")) :
typeof x.pixel_id (integer) != typeof i.V1 (character)
Calls: [ -> [.data.table
What do I do?
Is there a better way than to turn the data.table back into a data.frame?
Given that you're looking for a data.table back you should use list rather than c in the j part of the call.
z[, list(events,users)] # first comma is important
Note that you don't need the quotes around the column names.
Column subsetting should be done in j, not in i. Do instead:
DT[, c("x", "y")]
Check this presentation (slide 4) to get an idea of how to read a data.table syntax (more like SQL). That'll help convince you that it makes more sense for providing columns in j - equivalent of SELECT in SQL.

temporarily keying a data.table for merging

I have a keyed data.table, x, and realize that I need to merge it using a different multicolumn key.
I want to avoid (i) setting and resetting x's key and (ii) keeping track of copies of x with different keys. Here's some sample data and my current approach:
require(data.table)
options(datatable.verbose=TRUE)
set.seed(1)
n <- 10
m <- 2
samp <- function(n) sample(1:9,n,replace=T)
x <- data.table(A = samp(n),B = samp(n),C = samp(n),key="A")
y <- x[samp(m),list(B,C,D=samp(m))]
# this works:
x[,.SD,key="B,C"][y]
# B C A D
# 1: 7 6 6 5
# 2: 9 4 6 2
So that approach works, but I get the comment
...j is a named list. It's very inefficient...
The named list is .SD. Is there a better or more standard way to do this?
It seems that using key or keyby without .SD has no effect:
key(x[,,keyby="B,C"]) # A
key(x[,,key="B,C"]) # A
In version 1.9.5, the on argument was added, with this usage note in the changelog:
data.tables can join now without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column 'y' of DT2 with 'x' of DT1. DT1[DT2, on="y"] would join on column 'y' on both data.tables.
In this case, since the merge-column names are the same in x and y, x[y,on=c("B","C")] works.
Historical answer (around version 1.8.11): As of version 1.8.11 [.data.table will have a key argument, which is equivalent to calling setkeyv beforehand. It's not exactly what this question is looking for, but I don't see a way of achieving this without copying the entire data (bad imo), so I think this is a reasonable compromise, but please let me know if you think otherwise.
Edit from Matthew
Specifically adding an argument named key to [.data.table was a new suggestion in the last few days that I haven't responded to yet. We've discussed secondary keys in the past, set2key for example. Secondary keys won't copy the data.
We'll discuss it off list but I think key in [.data.table will likely change name or be done differently. Reminder to internet: v1.8.11 is in development, unstable and experimental. When it gets published to CRAN, then it can be relied on.

What you can do with a data.frame that you can't with a data.table?

I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally
different from j in [.data.frame. Even something as simple as
DF[,1] would break existing code in many packages and user code.
This is by design, and we want it to work this way for more
complicated syntax to work. There are other differences, too (see FAQ
2.17).
Furthermore, data.table inherits from data.frame. It is a
data.frame, too. A data.table can be passed to any package that
only accepts data.frame and that package can use [.data.frame
syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of
these was accepted as a new feature in R 2.12.0 :
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked
encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements
to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much
faster than a for loop in C. This would improve the way that R copies
data internally (on some measures by 13 times). The thread on r-devel
is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
What are the smaller syntax differences between data.frame and data.table
DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
For this reason we say the comma is optional in DT, but not optional in DF
DT[[3]] == DF[, 3] == DF[[3]]
DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
DT[ , "colA"][[1]] == DF[ , "colA"].
DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns
NA rows for each NA
DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
Small caveat
There will possibly be cases where some packages use code that falls down when given a data.frame, however, given that data.table is constantly being maintained to avoid such problems, any problems that may arise will be fixed promptly.
For example
see this question and prompt response
From the NEWS for v 1.8.2
base::unname(DT) now works again, as needed by plyr::melt(). Thanks to
Christoph Jaeckel for reporting. Test added.
An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2
without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added.
ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2
doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.

Subset row and column at the same time

I'm a bit surprised by how data.table works:
> library(data.table)
data.table 1.8.2 For help type: help("data.table")
> dt <- data.table(a=11:20, b=21:30, c=31:40, key="a")
> dt[list(12)]
a b c
1: 12 22 32
> dt[list(12), b]
a b
1: 12 22
> dt[list(12)][,b]
[1] 22
What I'm trying to do is obtain the value of a single column (or expression) in rows matched by a selection. I see that I've got to pass the key as a list as a raw number would indicate a row number and not a key value. So the first of the above is clear to me. But why the second and the thirs subset expression yield different results seems rather confusing to me. I'd like to get the third result, but would exect being able to write it the second way.
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result? Is there a syntactically shorter way to obtain a single result except by double subsetting as above?
I'm using data.table 1.8.2 on R 2.15.1. If you cannot reproduce my example, you might as well consider a factor as key:
dt <- data.table(a=paste("a", 11:20, sep=""), b=21:30, c=31:40, key="a")
dt["a11", b]
Regarding this question:
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result?
I believe that the (good enough for me) reason is simply that Matthew Dowle hasn't yet gotten around to adding that option (likely because he has prioritized work on much more useful features such as ":= with by").
In comments following my answer here, Matthew seemed to indicate that it is on his TODO list, noting that "[this] is what drop=TRUE will do (with a speed advantage) when drop is added".
Until then, any of the following will get the job done:
dt[list(12)][,b]
# [1] 22
dt[list(12)][[2]]
# [1] 22
dt[dt[list(12), which=TRUE], b]
# [1] 22
One possibility is to use:
dt[a == 12]
and
dt[a == 12, b]
This will work as expected, but it prevents binary search and requires sequential search instead (is there a plan to change this behavior ??), making it potentially slower.
UPDATE Sep 2014: now in v1.9.3
From NEWS :
DT[column==values] is now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==values] is much faster. DT[column %in% values] is equivalent; i.e., both == and %in% accept vector values. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).

Resources