Subset row and column at the same time - r

I'm a bit surprised by how data.table works:
> library(data.table)
data.table 1.8.2 For help type: help("data.table")
> dt <- data.table(a=11:20, b=21:30, c=31:40, key="a")
> dt[list(12)]
a b c
1: 12 22 32
> dt[list(12), b]
a b
1: 12 22
> dt[list(12)][,b]
[1] 22
What I'm trying to do is obtain the value of a single column (or expression) in rows matched by a selection. I see that I've got to pass the key as a list as a raw number would indicate a row number and not a key value. So the first of the above is clear to me. But why the second and the thirs subset expression yield different results seems rather confusing to me. I'd like to get the third result, but would exect being able to write it the second way.
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result? Is there a syntactically shorter way to obtain a single result except by double subsetting as above?
I'm using data.table 1.8.2 on R 2.15.1. If you cannot reproduce my example, you might as well consider a factor as key:
dt <- data.table(a=paste("a", 11:20, sep=""), b=21:30, c=31:40, key="a")
dt["a11", b]

Regarding this question:
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result?
I believe that the (good enough for me) reason is simply that Matthew Dowle hasn't yet gotten around to adding that option (likely because he has prioritized work on much more useful features such as ":= with by").
In comments following my answer here, Matthew seemed to indicate that it is on his TODO list, noting that "[this] is what drop=TRUE will do (with a speed advantage) when drop is added".
Until then, any of the following will get the job done:
dt[list(12)][,b]
# [1] 22
dt[list(12)][[2]]
# [1] 22
dt[dt[list(12), which=TRUE], b]
# [1] 22

One possibility is to use:
dt[a == 12]
and
dt[a == 12, b]
This will work as expected, but it prevents binary search and requires sequential search instead (is there a plan to change this behavior ??), making it potentially slower.
UPDATE Sep 2014: now in v1.9.3
From NEWS :
DT[column==values] is now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==values] is much faster. DT[column %in% values] is equivalent; i.e., both == and %in% accept vector values. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).

Related

R data.table - quick comparison of strings

I would like to find a fast solution to the following problem.
The example is very small, real data big and speed is an important factor.
I have two vectors of strings, currently in data.tables but this not so important. I need to find frequency of occurrences of strings from one vector in the second one and keep these results.
Example
library(data.table)
dt1<-data.table(c("first","second","third"),c(0,0,0))
dt2<-data.table(c("first and second","third and fifth","second and no else","first and second and third"))
Now, for every item in dt1 I need to find in how many items from dt2 it is contained and save the final frequencies to the second column of dt1.
The task itself is not difficult. I have, however, not managed to find reasonably quick solution.
The solution I have now is this:
pm<-proc.time()
for (l in 1:dim(dt2)[1]) {
for (k in 1:dim(dt1)[1]) set(dt1,k,2L,dt1[k,V2]+as.integer(grepl(dt1[k,V1],dt2[l,V1])))
}
proc.time() - pm
Real data are very large and this is pretty slow, on my PC even this larger version takes 2 seconds
dt1<-data.table(rep(c("first","second","third"),10),rep(c(0,0,0),10))
dt2<-data.table(rep(c("first and second","third and fifth","second and no else","first and second and third"),10))
pm<-proc.time()
for (l in 1:dim(dt2)[1]) {
for (k in 1:dim(dt1)[1]) set(dt1,k,2L,dt1[k,V2]+as.integer(grepl(dt1[k,V1],dt2[l,V1])))
}
proc.time() - pm
user system elapsed
1.93 0.06 2.06
Do I miss a better solution to this - I would say quite simple - task?
Actually it is so simple that I am sure that it must be a duplicate, but I have not managed to find it here or anything equivalent.
Cross merging of the data.tables is not possible due to memory problems (in the real situation).
Thank you.
dt1[, V2 := sapply(V1, function(x) sum(grepl(x, dt2$V1)))]
Also you probably can use fixed string matching for speed.
In that case you can use stri_detect_fixed from stringi package:
dt1[, V2 := sapply(V1, function(x) sum(stri_detect_fixed(dt2$V1, x)))]

Strange issue with data.table row search

I am a big fan and massive user of data.tables in R. I really use them for a lot of code but have recently encountered a strange bug:
I have a huge data.table with multiple columns, example:
x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c
if I select
dataDT[x==‘1’]
I end up getting
x y
1: 1 a
whereas
dataDT[(x==‘1’)]
gives me
x y
1: 1 a
2: 1 b
3: 1 c
Any ideas? x and y are factor and the data.table is indexed by setKey by x.
ADDITIONAL INFOS AND CODE:
I actually fixed this issue but in a way that is not clear nor intuitive.
My code is structured as follows: I have a function called from my main code where I have to introduce a column in the data.table.
I have previously used the following notation
dataT[,nC:=oC,]
to do the deed.
I have instead found that creating the new column by using
dataT$nC <- dataT$oC
instead fixes the bug completely.
I tried to replicate the exact same bug on a simpler example code but I cannot, possibly because of dependencies related to the size structure of my data.table as well as the specific functions I am running on my table.
With that said, I have a working example that shows that when you insert a column using the dataT[,nC:=oC,] notation, it acts as if the table were passed by reference to the function rather than by value.
Also, interestingly enough, while performing
dataDT[x==‘1’]
vs
dataDT[(x==‘1’)]
shows the same result, the latter is 10 times slower, which I have noticed previously. I hope this code can shed some light.
rm(list=ls())
library(data.table)
superParF <- function(dtInput){
dtInputP <- dtInput[a==1]
dtInputN <- dtInput[a==2]
outDT <- rbind(dtInputP[,sum(y),by='x'],
dtInputN[,sum(y),by='x'])
return(outDT)
}
superFunction <- function(dtInput){
#create new column
dtInput[,z:=y,]
#run function
outDT <- rbindlist(lapply(unique(inputDT$x),
function(i)
superParF(inputDT[x==i])))
#output outDT
return(outDT)
}
inputDT <- data.table(x = c(rep(1,100000),
rep(2,100000),
rep(3,100000),
rep(4,100000),
rep(5,100000)),
y= c(rep(1:100000,5)))
inputDT$x <- as.factor(inputDT$x)
inputDT$y <- as.numeric(inputDT$y)
inputDT <- rbind(inputDT,inputDT)
inputDT$a <- c(rep(1,500000),rep(2,500000))
setkey(inputDT,x)
#first observation-> the two searches do not work with the same performance
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
out <- superFunction(inputDT)
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
inputDT
I asked in comments to provide the version number and to follow the guidelines on the Support page. It contains :
Read and search the README.md. Is there a bug fix or a new feature related to your issue? Probably we were aware of the issue or someone else reported it and we have already fixed the issue in the current development version.
So, searching the README.md for the string "index" just using Ctrl-F in the browser, yields :
21 Auto indexing handles logical subset of factor column using numeric value properly, #1361. Thanks #mplatzer.
26 Auto indexing returns order of subset properly when input data.table is already sorted, #1495. Thanks #huashan for the nice
reproducible example.
Those are fixed in v1.9.7 easily installed with one command detailed on the Installation page.
The first one (item 21) looks suspiciously close to your issue. So please do try v1.9.7 as requested on the Support page in point 4.
We ask for you state the version number up front to save time because we want to ensure you are using at least v1.9.6 on CRAN and not v1.9.4 which had this problem :
DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).
So which version are you running please and have you tried v1.9.7 since it looks like it's worth a try?
Using the dT[,Column:=Value] notation seems to cause the SAME BUG in another post as well!
data.table not recognising logical in filter
Replacing dT[,Column:=Value] with dT$Column <- Value fixes both my bug and this posts bug.
#Matt Dowle: this post that I am linking has much more succinct code that I have and the bug is the same! You would find it of great help in your quest to fix this issue!

About the new features J() of data.table 1.9.2

I'm happy to find data.table has its new release, and got one question about J(). From data.table NEWS 1.9.2:
x[J(2), a], where a is the key column sees a in j, #2693 and FAQ 2.8. Also, x[J(2)] automatically names the columns from i using the key columns of x. In cases where the key columns of x and i are identical, i's columns can be referred to by using i.name; e.g., x[J(2), i.a]
There're several questions about J() in S.O, and also the introduction to data.table talks about the binary search of J(). But my understanding of J() is still not very clear.
All I know is that, if I want to select rows where "b" in column A and "d" in column B:
DT2 <- data.table(A = letters[1:5], B = letters[3:7], C = 1:5)
setkey(DT2, A, B)
DT2[J("b", "d")]
and if I want to select the rows where A = "a" or "c", I code like this
DT2[A == "a" | A == "c"]
much like the data.frame way. (minor question: how to select using a more data.table way?)
So to my understanding, 'J() only uses in the above case. select two single value from 2 different columns.
Hope my understanding is wrong. There're few documents about J(). I read How is J() function implemented in data.table?. J(.) is detected and simply replaced with list(.)
It seems that every case list(.) can replace J(.)
And back to the question, what the purpose of this new feature? x[J(2), a]
It's really appreciated if you can give some detailed explanations!
.() and J() as the function wrapping the i argument of data.table are simply replaced by list() because [.data.table does some programming on the language of the i and j arguments to optimize how things are done internally. It can be thought of as a alias for list
The reason they are included is to allow save time and effort (3 key strokes!)
If I wanted to select key values 'a' or 'c' from the first column of a key I could do
DT[.(c('a','c'))]
# or
DT[J(c('a','c'))]
# or
DT[list(c('a','c'))]
If I wanted A='a' or 'c' and B = 'd' then I would could use
DT[.(c('a','c'),'d')]
If I wanted A = 'a' or 'c' and B = 'd' or 'e' then I would use CJ (or expand.grid) to create all combinations
DT[CJ(c('a','c'),c('d','e'))]
The help for J,SJ and CJ is quite well written! See also the vignette Keys and fast binary search based subset.

temporarily keying a data.table for merging

I have a keyed data.table, x, and realize that I need to merge it using a different multicolumn key.
I want to avoid (i) setting and resetting x's key and (ii) keeping track of copies of x with different keys. Here's some sample data and my current approach:
require(data.table)
options(datatable.verbose=TRUE)
set.seed(1)
n <- 10
m <- 2
samp <- function(n) sample(1:9,n,replace=T)
x <- data.table(A = samp(n),B = samp(n),C = samp(n),key="A")
y <- x[samp(m),list(B,C,D=samp(m))]
# this works:
x[,.SD,key="B,C"][y]
# B C A D
# 1: 7 6 6 5
# 2: 9 4 6 2
So that approach works, but I get the comment
...j is a named list. It's very inefficient...
The named list is .SD. Is there a better or more standard way to do this?
It seems that using key or keyby without .SD has no effect:
key(x[,,keyby="B,C"]) # A
key(x[,,key="B,C"]) # A
In version 1.9.5, the on argument was added, with this usage note in the changelog:
data.tables can join now without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column 'y' of DT2 with 'x' of DT1. DT1[DT2, on="y"] would join on column 'y' on both data.tables.
In this case, since the merge-column names are the same in x and y, x[y,on=c("B","C")] works.
Historical answer (around version 1.8.11): As of version 1.8.11 [.data.table will have a key argument, which is equivalent to calling setkeyv beforehand. It's not exactly what this question is looking for, but I don't see a way of achieving this without copying the entire data (bad imo), so I think this is a reasonable compromise, but please let me know if you think otherwise.
Edit from Matthew
Specifically adding an argument named key to [.data.table was a new suggestion in the last few days that I haven't responded to yet. We've discussed secondary keys in the past, set2key for example. Secondary keys won't copy the data.
We'll discuss it off list but I think key in [.data.table will likely change name or be done differently. Reminder to internet: v1.8.11 is in development, unstable and experimental. When it gets published to CRAN, then it can be relied on.

What you can do with a data.frame that you can't with a data.table?

I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally
different from j in [.data.frame. Even something as simple as
DF[,1] would break existing code in many packages and user code.
This is by design, and we want it to work this way for more
complicated syntax to work. There are other differences, too (see FAQ
2.17).
Furthermore, data.table inherits from data.frame. It is a
data.frame, too. A data.table can be passed to any package that
only accepts data.frame and that package can use [.data.frame
syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of
these was accepted as a new feature in R 2.12.0 :
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked
encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements
to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much
faster than a for loop in C. This would improve the way that R copies
data internally (on some measures by 13 times). The thread on r-devel
is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
What are the smaller syntax differences between data.frame and data.table
DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
For this reason we say the comma is optional in DT, but not optional in DF
DT[[3]] == DF[, 3] == DF[[3]]
DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
DT[ , "colA"][[1]] == DF[ , "colA"].
DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns
NA rows for each NA
DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
Small caveat
There will possibly be cases where some packages use code that falls down when given a data.frame, however, given that data.table is constantly being maintained to avoid such problems, any problems that may arise will be fixed promptly.
For example
see this question and prompt response
From the NEWS for v 1.8.2
base::unname(DT) now works again, as needed by plyr::melt(). Thanks to
Christoph Jaeckel for reporting. Test added.
An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2
without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added.
ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2
doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.

Resources