I am a big fan and heavy user of data.table in R. I use it in a great deal of my code, but I have recently encountered a strange bug:
I have a huge data.table with multiple columns, for example:
x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c
If I select
dataDT[x=='1']
I end up getting
x y
1: 1 a
whereas
dataDT[(x=='1')]
gives me
x y
1: 1 a
2: 1 b
3: 1 c
Any ideas? x and y are factors, and the data.table is keyed on x via setkey.
ADDITIONAL INFO AND CODE:
I actually fixed this issue, but in a way that is neither clear nor intuitive.
My code is structured as follows: I have a function, called from my main code, in which I have to add a column to the data.table.
I have previously used the following notation
dataT[,nC:=oC,]
to do the deed.
I have found that creating the new column with
dataT$nC <- dataT$oC
instead fixes the bug completely.
I tried to replicate the exact same bug in a simpler example, but I cannot, possibly because of dependencies on the size and structure of my data.table, as well as on the specific functions I am running on it.
With that said, I have a working example showing that when you insert a column using the dataT[,nC:=oC,] notation, the table behaves as if it had been passed to the function by reference rather than by value.
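Here is a minimal sketch of that by-reference behavior (toy table and column names, separate from my real code):
library(data.table)
addByRef <- function(dt) {
  dt[, z := 1]   # := adds the column to the caller's table, by reference
  invisible(NULL)
}
addByCopy <- function(dt) {
  dt$z2 <- 1     # $<- copies first, so the caller's table is untouched
  invisible(NULL)
}
DT <- data.table(x = 1:3)
addByRef(DT);  "z"  %in% names(DT)   # TRUE  -> column appeared in the caller's DT
addByCopy(DT); "z2" %in% names(DT)   # FALSE -> the change stayed local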
Also, interestingly enough, while
dataDT[x=='1']
vs
dataDT[(x=='1')]
give the same result, the latter is about 10 times slower, which I have noticed previously. I hope this code can shed some light.
rm(list=ls())
library(data.table)
superParF <- function(dtInput){
  dtInputP <- dtInput[a==1]
  dtInputN <- dtInput[a==2]
  outDT <- rbind(dtInputP[,sum(y),by='x'],
                 dtInputN[,sum(y),by='x'])
  return(outDT)
}
superFunction <- function(dtInput){
  # create new column by reference
  dtInput[,z:=y,]
  # run the helper on each subset of x
  outDT <- rbindlist(lapply(unique(dtInput$x),
                            function(i)
                              superParF(dtInput[x==i])))
  # output outDT
  return(outDT)
}
inputDT <- data.table(x = c(rep(1,100000),
                            rep(2,100000),
                            rep(3,100000),
                            rep(4,100000),
                            rep(5,100000)),
                      y = rep(1:100000,5))
inputDT$x <- as.factor(inputDT$x)
inputDT$y <- as.numeric(inputDT$y)
inputDT <- rbind(inputDT,inputDT)
inputDT$a <- c(rep(1,500000),rep(2,500000))
setkey(inputDT,x)
# first observation -> the two subsets do not perform the same
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
out <- superFunction(inputDT)
a <- system.time(inputDT[x=='1'])
b <- system.time(inputDT[(x=='1')])
print(a)
print(b)
inputDT
I asked in the comments for the version number, and for the guidelines on the Support page to be followed. That page contains:
Read and search the README.md. Is there a bug fix or a new feature related to your issue? Probably we were aware of the issue or someone else reported it and we have already fixed the issue in the current development version.
So, searching README.md for the string "index" (just using Ctrl-F in the browser) yields:
21 Auto indexing handles logical subset of factor column using numeric value properly, #1361. Thanks @mplatzer.
26 Auto indexing returns order of subset properly when input data.table is already sorted, #1495. Thanks @huashan for the nice reproducible example.
Those are fixed in v1.9.7, easily installed with the single command detailed on the Installation page.
The first one (item 21) looks suspiciously close to your issue, so please do try v1.9.7, as requested in point 4 of the Support page.
We ask you to state the version number up front to save time, because we want to ensure you are using at least v1.9.6 from CRAN and not v1.9.4, which had this problem:
DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).
So, which version are you running, please? And have you tried v1.9.7? It looks like it's worth a try.
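In the meantime, you can test whether auto indexing is the culprit by switching it off (a quick diagnostic, assuming your table is dataDT):
options(datatable.auto.index = FALSE)   # disable the == / %in% optimization
dataDT[x == '1']                        # if this now matches dataDT[(x == '1')], the auto index was at fault
options(datatable.auto.index = TRUE)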
Using the dT[,Column:=Value] notation seems to cause the SAME BUG in another post as well:
data.table not recognising logical in filter
Replacing dT[,Column:=Value] with dT$Column <- Value fixes both my bug and that post's bug.
@Matt Dowle: the post I am linking has much more succinct code than mine, and the bug is the same! You may find it of great help in your quest to fix this issue!
Related
I have a keyed data.table, x, and realize that I need to merge it using a different multicolumn key.
I want to avoid (i) setting and resetting x's key and (ii) keeping track of copies of x with different keys. Here's some sample data and my current approach:
require(data.table)
options(datatable.verbose=TRUE)
set.seed(1)
n <- 10
m <- 2
samp <- function(n) sample(1:9,n,replace=T)
x <- data.table(A = samp(n),B = samp(n),C = samp(n),key="A")
y <- x[samp(m),list(B,C,D=samp(m))]
# this works:
x[,.SD,key="B,C"][y]
# B C A D
# 1: 7 6 6 5
# 2: 9 4 6 2
So that approach works, but I get the comment
...j is a named list. It's very inefficient...
The named list is .SD. Is there a better or more standard way to do this?
It seems that using key or keyby without .SD has no effect:
key(x[,,keyby="B,C"]) # A
key(x[,,key="B,C"]) # A
In version 1.9.5, the on argument was added, with this usage note in the changelog:
data.tables can join now without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column 'y' of DT2 with 'x' of DT1. DT1[DT2, on="y"] would join on column 'y' on both data.tables.
In this case, since the merge-column names are the same in x and y, x[y,on=c("B","C")] works.
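For example, with the x and y defined above (requires data.table 1.9.6 or later):
x[y, on = c("B", "C")]   # joins on B and C without setting keys; same rows as the .SD approach above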
Historical answer (around version 1.8.11): As of version 1.8.11, [.data.table will have a key argument, which is equivalent to calling setkeyv beforehand. It's not exactly what this question is looking for, but I don't see a way of achieving it without copying the entire data (bad imo), so I think this is a reasonable compromise - please let me know if you think otherwise.
Edit from Matthew
Specifically adding an argument named key to [.data.table was a new suggestion in the last few days that I haven't responded to yet. We've discussed secondary keys in the past, set2key for example. Secondary keys won't copy the data.
We'll discuss it off list but I think key in [.data.table will likely change name or be done differently. Reminder to internet: v1.8.11 is in development, unstable and experimental. When it gets published to CRAN, then it can be relied on.
Hi, I'm still trying to figure out data.table. If I have a data.table of values such as those below, what is the most efficient way to replace the values with those from another data.table?
set.seed(123456)
a=data.table(
date_id = rep(seq(as.Date('2013-01-01'),as.Date('2013-04-10'),'days'),5),
px =rnorm(500,mean=50,sd=5),
vol=rnorm(500,mean=500000,sd=150000),
id=rep(letters[1:5],each=100)
)
b=data.table(
date_id=rep(seq(as.Date('2013-01-01'),length.out=600,by='days'),5),
id=rep(letters[1:5],each=600),
px=NA_real_,
vol=NA_real_
)
setkeyv(a,c('date_id','id'))
setkeyv(b,c('date_id','id'))
What I'm trying to do is replace the px and vol in b with those in a where date_id and id match. I'm a little flummoxed by this - I would suppose that something along the lines of the following might be the way to go, but I don't think it will work in practice.
b[which(b$date_id %in% a$date_id & b$id %in% a$id),list(px:=a$px,vol:=a$vol)]
EDIT
I tried the following
t = a[b,roll=T]
t[!is.na(px),list(px.1:=px,vol.1=vol),by=list(date_id,id)]
and got the error message
Error in `:=`(px.1, px) :
:= is defined for use in j only, and (currently) only once; i.e., DT[i,col:=1L] and DT[,newcol:=sum(colB),by=colA] are ok, but not DT[i,col]:=1L, not DT[i]$col:=1L and not DT[,{newcol1:=1L;newcol2:=2L}]. Please see help(":="). Check is.data.table(DT) is TRUE.
If you want to replace the values within b, you can use the prefix i.. From the NEWS regarding version 1.7.10:
The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.
b[a, `:=`(px = i.px, vol = i.vol)]
From your description it doesn't sound like you need the roll, and when you hit that error it seems you want this instead:
t[!is.na(px),`:=`(px.1=px,vol.1=vol),by=list(date_id,id)]
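Putting the i. approach together with the sample data, a quick sketch to verify (not part of the original answers):
b[a, `:=`(px = i.px, vol = i.vol)]   # update join: fills b's px and vol from a, in place
head(b[!is.na(px)])                  # rows with matching date_id and id now carry a's values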
Sorry for the ugly code, but I'm not sure exactly what's going wrong.
for (i in 1:1)
tab_sector[1:48,i] <-
tapply(get(paste("employee",1997-1+i, "[birth<=(1997-1+i)]",sep="")),
ordered(sic2digit[birth<=(1997-1+i)],levels=tab_sector_list))
# Error in get(paste("employee", 1997 - 1 + i,
#   "[birth<=(1997-1+i)]", : object 'employee97[birth<=(1997-1+i)]' not found
But the variable is there:
head(employee97[birth<=(1997-1+i)])
# [1] 1 2 2 1 3 4
A simpler version, in which "employee" is not subset by "birth", works.
It would help if you told us what you are trying to accomplish.
In your code, the get function is looking for a variable whose name is "employee97[birth<=(1997-1+i)]". The code that works finds a variable named "employee97" and then subsets it; those are very different things. The get function does not do subsetting.
Part of what you are trying to do is covered by FAQ 7.21, the most important part of which is the end, where it suggests storing your data in lists to make access easier (see the sketch below).
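A hypothetical sketch of that list-based approach, reusing the employee97 vector from the question:
employees <- list("97" = employee97)     # one element per year, e.g. employees[["98"]] <- employee98
head(employees[["97"]][birth <= 1997])   # subset after retrieval; no get() needed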
You can't get an indexed element, e.g. get("x[i]") fails: you need get("x")[i].
Your code is almost too messy to see what's going on, but here is an attempt at a translation:
for (i in 1:1){
ind <- 1997-1+i
v1 <- get(paste0("employee",ind))
tab_sector[1:48,i] <- tapply(v1[birth<=ind],
ordered(sic2digit[birth<=ind],levels=tab_sector_list))
}
I am scoring a psychometric instrument at work and want to recode a few variables. Basically, each question has five possible responses, worth 0 to 4 respectively. That is how they were coded into our database, so I don't need to do anything except sum those. However, there are three questions that have reversed scores (so, when someone answers 0, we score that as 4). Thus, I am "reversing" those ones.
The data frame basically looks like this:
studyid timepoint date inst_q01 inst_q02 ... inst_q20
1 2 1995-03-13 0 2 ... 4
2 2 1995-06-15 1 3 ... 4
Here's what I've done so far.
# Survey Processing
# Find missing values (-9) and confusions (-1), and sum them
project_f03$inst_nmiss <- rowSums(project_f03[,4:23]==-9)
project_f03$inst_nconfuse <- rowSums(project_f03[,4:23]==-1)
project_f03$inst_nmisstot <- project_f03$inst_nmiss + project_f03$inst_nconfuse
# Recode any missing values into NAs
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
rm(x)
Now, everything so far is fine; I am about to recode the three reversed questions. My initial thought was to do a simple loop through the three variables, with a series of assignment statements something like this:
# Questions 3, 11, and 16 are reversed
for(x in c(3,11,16)+3) {
project_f03[project_f03[,x]==4,x] <- 5
project_f03[project_f03[,x]==3,x] <- 6
project_f03[project_f03[,x]==2,x] <- 7
project_f03[project_f03[,x]==1,x] <- 8
project_f03[project_f03[,x]==0,x] <- 9
project_f03[,x] <- project_f03[,x]-5
}
rm(x)
So, the five assignment statements just reassign new values, and the loop just takes it through all three of the variables in question. Since I was reversing the scale, I thought it was easiest to offset everything by 5 and then just subtract five after all recodes were done. The main issue, though, is that there are NAs and those NAs result in errors in the loop (naturally, NA==4 returns an NA in R). Duh - forgot a basic rule!
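To make the failure concrete, a toy illustration (not my real data):
v <- c(4, NA, 2)
v == 4                         # [1]  TRUE    NA FALSE
df <- data.frame(q = v)
try(df[df$q == 4, "q"] <- 5)   # Error: missing values are not allowed in subscripted assignments of data frames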
I've come up with three alternatives, but I'm not sure which is the best.
First, I could obviously just move the NA-creating code after the loop, and it should work fine. Pros: easiest to implement. Cons: Only works if I am receiving data with no innate (versus created) NAs.
Second, I could change the logic statement to be something like:
project_f03[!is.na(project_f03[,x]) & project_f03[,x]==4,x], which should eliminate the logic conflict. Pros: not too hard, and I know it works. Cons: a lot of extra code; it seems like a kludge.
Finally, I could change the logic from
project_f03[project_f03[,x]==4,x] <- 5 to
project_f03[project_f03[,x] %in% 4,x] <- 5. This seems to work fine, but I'm not sure whether it's good practice, and I wanted to get thoughts. Pros: a quick fix that seems to work; preserves the general syntactic flow of "blah blah LOGIC blah <- bleh". Cons: Might create a black hole? Not sure what the implications of using %in% like this might be.
EDITED TO MAKE CLEAR
This question has one primary component: Is it safe to utilize %in% as described in the third point above when doing logical operations, or are there reasons not to do so?
The second component is: What are recommended ways of reversing the values, like some have described in answers and comments?
The straightforward answer is that there is no black hole to using %in%. But in instances where I want to just discard the NA values, I'd use which: project_f03[which(project_f03[,x]==4),x] <- 5
%in% could shorten that earlier bit of code you had:
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
#could be
for(x in 4:23) {project_f03[project_f03[,x] %in% c(-9,-1), x] <- NA}
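A tiny illustration of why %in% and which() are NA-safe while == alone is not:
v <- c(4, NA, 2)
v %in% 4        # [1]  TRUE FALSE FALSE -> NA never matches, so no NA in the index
which(v == 4)   # [1] 1                 -> which() drops the NA as well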
As @flodel suggested, you can replace that whole block of code in your for-loop with project_f03[,x] <- rev(0:4)[match(project_f03[,x], 0:4, nomatch=10)]. It preserves NA, and there are probably more opportunities to simplify the code.
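For instance, a quick check that the match() recode preserves NA (on the 0-4 scale):
v <- c(0, 4, NA, 2)
rev(0:4)[match(v, 0:4, nomatch = 10)]   # index 10 is out of range, so NA stays NA
# [1]  4  0 NA  2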
It doesn't answer your question, but should fix your problem:
cols <- c(3,11,16)+3
project_f03[, cols] <- abs(project_f03[, cols]-4)
## or, even easier (as @TylerRinker suggested):
project_f03[, cols] <- max(project_f03[, cols], na.rm=TRUE) - project_f03[, cols]
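On a toy vector, both reversals propagate NA (assuming the scale truly runs 0 to 4):
v <- c(0, 1, 4, NA)
abs(v - 4)   # [1]  4  3  0 NA
4 - v        # [1]  4  3  0 NA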
I'm a bit surprised by how data.table works:
> library(data.table)
data.table 1.8.2 For help type: help("data.table")
> dt <- data.table(a=11:20, b=21:30, c=31:40, key="a")
> dt[list(12)]
a b c
1: 12 22 32
> dt[list(12), b]
a b
1: 12 22
> dt[list(12)][,b]
[1] 22
What I'm trying to do is obtain the value of a single column (or expression) for the rows matched by a selection. I see that I've got to pass the key as a list, since a raw number would indicate a row number rather than a key value. So the first of the above is clear to me. But why the second and the third subsetting expressions yield different results is rather confusing. I'd like to get the third result, but would expect to be able to write it the second way.
Is there any good reason why subsetting a data.table by rows and columns at the same time always includes the key value alongside the computed result? And is there a syntactically shorter way to obtain a single result other than by double subsetting, as above?
I'm using data.table 1.8.2 on R 2.15.1. If you cannot reproduce my example, you can instead consider a factor key:
dt <- data.table(a=paste("a", 11:20, sep=""), b=21:30, c=31:40, key="a")
dt["a11", b]
Regarding this question:
Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result?
I believe that the (good enough for me) reason is simply that Matthew Dowle hasn't yet gotten around to adding that option (likely because he has prioritized work on much more useful features such as ":= with by").
In comments following my answer here, Matthew seemed to indicate that it is on his TODO list, noting that "[this] is what drop=TRUE will do (with a speed advantage) when drop is added".
Until then, any of the following will get the job done:
dt[list(12)][,b]
# [1] 22
dt[list(12)][[2]]
# [1] 22
dt[dt[list(12), which=TRUE], b]
# [1] 22
One possibility is to use:
dt[a == 12]
and
dt[a == 12, b]
These work as expected, but they prevent binary search and require a sequential scan instead (is there a plan to change this behavior?), making them potentially slower.
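To see the contrast with the keyed table from the question:
dt[J(12), b]     # binary search on the key; J() is an alias for list(), so this returns a and b as above
dt[a == 12, b]   # vector scan of column a; returns just b
# [1] 22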
UPDATE Sep 2014: now in v1.9.3
From NEWS :
DT[column==values] is now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==values] is much faster. DT[column %in% values] is equivalent; i.e., both == and %in% accept vector values. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).
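Per that NEWS entry, a sketch of the experimental secondary-key calls (these names were experimental and were later superseded by setindex() and indices() in more recent versions):
set2key(dt, a)    # add a secondary key (index) on column a
key2(dt)          # check which secondary keys exist
dt[a == 12, b]    # subsequent == subsets can then use the index automatically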