data.table and "by must evaluate to list" Error - r

I would like to use the data.table package in R to dynamically generate aggregations, but I am running into an error. Below, let my.dt be of type data.table.
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
grouping.vars <- c("sex", "age")
for (i in 1:2) {
my.dt[,sum(dependent.variable), by=grouping.vars[i]]
}
If I run this, I get errors:
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i] :
by must evaluate to list
Yet the following works without error:
my.dt[,sum(dependent.variable), by=sex]
I see why the error is occurring, but I do not see how to use a vector with the by parameter.

[UPDATE] 2 years after question was asked ...
On running the code in the question, data.table is now more helpful and returns this (using 1.8.2) :
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...)
if you can. Otherwise, by=eval(grouping.vars[i]) should work. This is for efficiency
so data.table can detect which columns are needed.
and following the advice in the second sentence of error :
my.dt[,sum(dependent.variable), by=eval(grouping.vars[i])]
sex V1
1: M 2650
2: F 2600
Old answer from Jul 2010 (by can now be double and character, though) :
Strictly speaking the by needs to evaluate to a list of vectors each with storage mode integer, though. So the numeric vector age could also be coerced to integer using as.integer(). This is because data.table uses radix sorting (very fast) but the radix algorithm is specifically for integers only (see wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is of course an integer lookup to unique strings.
The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write expressions of column names directly in the by. A common one is to aggregate by month; for example :
DT[,sum(col1), by=list(region,month(datecol))]
or a very fast way to group by yearmonth is by using a non epoch based date, such as yyyymmddL as seen in some of the examples in the package, like this :
DT[,sum(col1), by=list(region,month=datecol%/%100L)]
Notice how you can name the columns inside the list() like that.
To define and reuse complex grouping expressions :
e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]
Or if you don't want to re-evaluate the by expressions each time, you can save the result once and reuse the result for efficiency; if the by expressions themselves take a long time to calculate/allocate, or you need to reuse it many times :
byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]
Please see http://datatable.r-forge.r-project.org/ for latest info and status. A new presentation will be up there soon and hoping to release v1.5 to CRAN soon too. This contains several bug fixes and new features detailed in the NEWS file. The datatable-help list has about 30-40 posts a month which may be of interest too.

I did two changes to your original code:
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
age<-as.factor(age)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
for ( a in 1:2){
print(my.dt[,sum(dependent.variable), by=list(sex,age)[a]])
}
Numerical vector age should be forced into factors. As to by parameter, do not use quote for column names but group them into list(...). At least this is what the author has suggested.

Related

`$<-.data.frame`(`*tmp*`, Numero, value = numeric(0) error [duplicate]

I have a numeric column ("value") in a dataframe ("df"), and I would like to generate a new column ("valueBin") based on "value." I have the following conditional code to define df$valueBin:
df$valueBin[which(df$value<=250)] <- "<=250"
df$valueBin[which(df$value>250 & df$value<=500)] <- "250-500"
df$valueBin[which(df$value>500 & df$value<=1000)] <- "500-1,000"
df$valueBin[which(df$value>1000 & df$value<=2000)] <- "1,000 - 2,000"
df$valueBin[which(df$value>2000)] <- ">2,000"
I'm getting the following error:
"Error in $<-.data.frame(*tmp*, "valueBin", value = c(NA, NA, NA, :
replacement has 6530 rows, data has 6532"
Every element of df$value should fit into one of my which() statements. There are no missing values in df$value. Although even if I run just the first conditional statement (<=250), I get the exact same error, with "...replacement has 6530 rows..." although there are way fewer than 6530 records with value<=250, and value is never NA.
This SO link notes a similar error when using aggregate() was a bug, but it recommends installing the version of R I have. Plus the bug report says its fixed.
R aggregate error: "replacement has <foo> rows, data has <bar>"
This SO link seems more related to my issue, and the issue here was an issue with his/her conditional logic that caused fewer elements of the replacement array to be generated. I guess that must be my issue as well, and figured at first I must have a "<=" instead of an "<" or vice versa, but after checking I'm pretty sure they're all correct to cover every value of "value" without overlaps.
R error in '[<-.data.frame'... replacement has # items, need #
The answer by #akrun certainly does the trick. For future googlers who want to understand why, here is an explanation...
The new variable needs to be created first.
The variable "valueBin" needs to be already in the df in order for the conditional assignment to work. Essentially, the syntax of the code is correct. Just add one line in front of the code chuck to create this name --
df$newVariableName <- NA
Then you continue with whatever conditional assignment rules you have, like
df$newVariableName[which(df$oldVariableName<=250)] <- "<=250"
I blame whoever wrote that package's error message... The debugging was made especially confusing by that error message. It is irrelevant information that you have two arrays in the df with different lengths. No. Simply create the new column first. For more details, consult this post https://www.r-bloggers.com/translating-weird-r-errors/
You could use cut
df$valueBin <- cut(df$value, c(-Inf, 250, 500, 1000, 2000, Inf),
labels=c('<=250', '250-500', '500-1,000', '1,000-2,000', '>2,000'))
data
set.seed(24)
df <- data.frame(value= sample(0:2500, 100, replace=TRUE))
TL;DR ...and late to the party, but that short explanation might help future googlers..
In general that error message means that the replacement doesn't fit into the corresponding column of the dataframe.
A minimal example:
df <- data.frame(a = 1:2); df$a <- 1:3
throws the error
Error in $<-.data.frame(*tmp*, a, value = 1:3) : replacement
has 3 rows, data has 2
which is clear, because the vector a of df has 2 entries (rows) whilst the vector we try to replace has 3 entries (rows).

Most efficient way (fastest) to modify a data.frame using indexing

Little introduction to the question :
I am developing an ecophysiological model, and I use a reference class list called S that store every object the model need for input/output (e.g. meteo, physiological parameters etc...).
This list contains 5 objects (see example below):
- two dataframes, S$Table_Day (the outputs from the model) and S$Met_c(the meteo in input), which both have variables in columns, and observations (input or output) in row.
- a list of parameters S$Parameters.
- a matrix
- a vector
The model runs many functions with a daily time step. Each day is computed in a for loop that runs from the first day i=1 to the last day i=n. This list is passed to the functions that often take data from S$Met_c and/or S$Parameters in input and compute something that is stored in S$Table_Day, using indexes (the ith day). S is a Reference Class list because they avoid copy on modification, which is very important considering the number of computations.
The question itself :
As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two solutions to store my data. Storing data by indexing in one of the preallocated dataframes is longer than storing it into an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs more operations while modifying by index (probably comparing the length, type etc...).
My question is : is there a better way to perform such operations ? In other words, is there a way for me to use/store more efficiently the inputs/outputs (in a data.frame, a list of vector or else) to keep track of all computations of each day ? For example would it be better to use many vectors (one for each variable) and regroup them in more complex objects (e.g. list of dataframe) at then end ?
By the way, am I right to use Reference Classes to avoid copy of the big objects in S while passing it to functions and modify it from within them ?
Reproducible example for the comparison:
SimulationClass <- setRefClass("Simulation",
fields = list(Table_Day = "data.frame",
Met_c= "data.frame",
PerCohortFruitDemand_c="matrix",
Parameters= "list",
Zero_then_One="vector"))
S= SimulationClass$new()
# Initializing the table with dummy numbers :
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
f1= function(i){
a= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
f2= function(i){
S$Table_Day$Bud_dd[(i-1000):i]= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
res= microbenchmark(f1(1000),f2(1000),times = 10000)
autoplot(res)
And the result :
Also if someone has any experience in programming such models, I am deeply interested in any advice for model development.
I read more about the question, and I'll just write here for prosperity some of the solutions that were proposed on other posts.
Apparently, reading and writing are both worth to consider when trying to reduce the computation time of assignation to a data.frame by index.
The sources are all found in other discussions:
How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)
Faster i, j matrix cell fill
Time in getting single elements from data.table and data.frame objects
Several solutions appeared relevant :
Use a matrix instead of a data.frame if possible to leverage in place modification (Advanced R).
Use a list instead of a data.frame, because [<-.data.frame is not a primitive function (Advanced R).
Write functions in C++ and use Rcpp (from this source)
Use .subset2 to read instead of [ (third source)
Use data.table as recommanded by #JulienNavarre and #Emmanuel-Lin and the different sources, and use either set for data.frame or := if using a data.table is not a problem.
Use [[ instead of [ when possible (index by one value only). This one is not very effective, and very restrictive, so I removed it from the following comparison.
Here is the analysis of performance using the different solutions :
The code :
# Loading packages :
library(data.table)
library(microbenchmark)
library(ggplot2)
# Creating dummy data :
SimulationClass <- setRefClass("Simulation",
fields = list(Table_Day = "data.frame",
Met_c= "data.frame",
PerCohortFruitDemand_c="matrix",
Parameters= "list",
Zero_then_One="vector"))
S= SimulationClass$new()
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
# Transforming data objects into simpler forms :
mat= as.matrix(S$Table_Day)
Slist= as.list(S$Table_Day)
Metlist= as.list(S$Met_c)
MetDT= as.data.table(S$Met_c)
SDT= as.data.table(S$Table_Day)
# Setting up the functions for the tests :
f1= function(i){
S$Table_Day$Bud_dd[i]= cumsum(S$Met_c$DegreeDays[i])
}
f2= function(i){
mat[i,4]= cumsum(S$Met_c$DegreeDays[i])
}
f3= function(i){
mat[i,4]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f4= function(i){
Slist$Bud_dd[i]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f5= function(i){
Slist$Bud_dd[i]= cumsum(Metlist$DegreeDays[i])
}
f6= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", cumsum(S$Met_c$DegreeDays[i]))
}
f7= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", MetDT[i,cumsum(DegreeDays)])
}
f8= function(i){
SDT[i,Bud_dd := MetDT[i,cumsum(DegreeDays)]]
}
i= 6000:6500
res= microbenchmark(f1(i),f3(i),f4(i),f5(i),f7(i),f8(i), times = 10000)
autoplot(res)
And the resulting autoplot :
With f1 the reference base assignment, f2 using a matrix instead of a data.frame, f3 using the combination of .subset2 and matrix, f4 using a list and .subset2, f5 using two lists (both reading and writing), f6 using data.table::set, f7 using data.table::set and data.table for cumulative sum, and f8using data.table :=.
As we can see the best solution is to use lists for reading and writing. This is pretty surprising to see that data.table is the worst solution. I believe I did something wrong with it, because it is supposed to be the best. If you can improve it, please tell me.

Use a string in R to refer to part of an object [duplicate]

I am at the point with R where I would like to start writing my own functions because I tend to need to do the same things over and over. However, I am struggling to see how I can generalize what I write. Looking at source code has not helped me learn very well because often it seems that .Internal or .Primitive functions (or other commands I do not know) are used extensively. I would like to simply start by turning my normal copy-pasted solutions into functions - fancier things can come later!
As an example: I do a lot of data formatting that requires doing some operation, and then filling in a data frame with zeros for all other combinations that did not have any data (e.g., years that did not have observations and were therefore not originally recorded, etc). I need to do this over and over for different data sets that have different sets of variables, but the idea and implementation is always the same.
My non-function way of solving this has been (for a specific implementation and minimal example):
df <- data.frame(County = c(1, 45, 57),
Year = c(2002, 2003, 2003),
Level = c("Mean", "Mean", "Mean"),
Obs = c(1.4, 1.9, 10.2))
#Create expanded version of data frame
Counties <- seq(from = 1, to = 77, by = 2)
Years <- seq(from = 1999, to = 2014, by = 1)
Levels <- c("Max", "Mean")
Expansion <- expand.grid(Counties, Years, Levels)
Expansion[4] <- 0
colnames(Expansion) <- colnames(df)
#Merge and order them so that the observed value is on top
df_full <- merge(Expansion, df, all = TRUE)
df_full$duplicate <- with(df_full,
paste(Year, County, Level))
df_full <- df_full[order(df_full$Year,
df_full$County,
df_full$Level,
-abs(df_full$Obs)), ]
#Deduplicate by taking the first that shows up (the observation)
df_full <- df_full[ !duplicated(df_full$duplicate), ]
df_full$duplicate <- NULL
I would like to generalize this so that I could somehow put in a data frame (and probably select the columns I need to order by since that sometimes changes) and then get the expanded version out. My first implementation consisted of a function with too many arguments (the data-frame and then all the column names I wanted to order/expand.grid by) and it also did not work:
gridExpand <- function(df, col1, col2=NULL, col3=NULL, measure){
#Started with "Expansion" being a global outside of the function
#It is identical the first part of the above code
ex <- merge(Expansion, df, all = TRUE)
ex$dupe <- with(ex,
paste(col1, col2, col3))
ex <- ex[order(with(ex,
col1, col2, col3, -abs(measure)))]
ex <- ex[ !duplicated(ex$dupe)]
ex <- subset(ex, select = -(dupe))
}
df_full <- gridExpand(df, Year, County, Level, Obs)
Error in paste(col1, col2, col3) : object 'Year' not found
I am assuming that this did not work because R has no way to know where 'Year' came from. I could potentially try paste(df, "$Year") but it would create "df$Year" which obviously will not work. And I do not ever see anyone else do this in their functions so clearly I am missing how it is that people reference things in data frame relevant functions.
I would ideally like to know of some resources that could help with thinking about generalization, or if someone can point me in the right direction to solving this particular problem I think it might help me see what I am doing wrong. I do not know of a better way to ask for help - I have been trying to read tutorials on writing functions for about 3 months and it is not clicking.
At a glance, the biggest thing that you can do is to not use non-standard-evaluation shortcuts inside your functions: things like $, subset() and with(). These are functions intended for convenient interactive use, not extensible programmatic use. (See, e.g., the Warning in ?subset which should probably be added to ?with, fortunes::fortune(312), fortunes::fortune(343).)
fortunes::fortune(312)
The problem here is that the $ notation is a magical shortcut and like
any other magic if used incorrectly is likely to do the programmatic
equivalent of turning yourself into a toad. -- Greg Snow (in
response to a user that wanted to access a column whose name is stored
in y via x$y rather than x[[y]])
R-help (February 2012)
fortunes::fortune(343)
Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R
newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable
consequences. It's best to acquire the [[ and [ habit early.
-- Peter Ehlers (about the use of $-extraction)
R-help (March 2013)
When you start writing functions that work on data frames, if you need to reference column names you should pass them in as strings, and then use [ or [[ to get the column based on the string stored in a variable name. This is the simplest way to make functions flexible with user-specified column names. For example, here's a simple stupid function that tests if a data frame has a column of the given name:
does_col_exist_1 = function(df, col) {
return(!is.null(df$col))
}
does_col_exist_2 = function(df, col) {
return(!is.null(df[[col]])
# equivalent to df[, col]
}
These yield:
does_col_exist_1(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_1(mtcars, col = "mpg")
# [1] FALSE
does_col_exist_2(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_2(mtcars, col = "mpg")
# [1] TRUE
The first function is wrong because $ doesn't evaluate what comes after it, no matter what value I set col to when I call the function, df$col will look for a column literally named "col". The brackets, however, will evaluate col and see "oh hey, col is set to "mpg", let's look for a column of that name."
If you want lots more understanding of this issue, I'd recommend the Non-Standard Evaluation Section of Hadley Wickham's Advanced R book.
I'm not going to re-write and debug your functions, but if I wanted to my first step would be to remove all $, with(), and subset(), replacing with [. There's a pretty good chance that's all you need to do.

Learning to write functions in R

I am at the point with R where I would like to start writing my own functions because I tend to need to do the same things over and over. However, I am struggling to see how I can generalize what I write. Looking at source code has not helped me learn very well because often it seems that .Internal or .Primitive functions (or other commands I do not know) are used extensively. I would like to simply start by turning my normal copy-pasted solutions into functions - fancier things can come later!
As an example: I do a lot of data formatting that requires doing some operation, and then filling in a data frame with zeros for all other combinations that did not have any data (e.g., years that did not have observations and were therefore not originally recorded, etc). I need to do this over and over for different data sets that have different sets of variables, but the idea and implementation is always the same.
My non-function way of solving this has been (for a specific implementation and minimal example):
df <- data.frame(County = c(1, 45, 57),
Year = c(2002, 2003, 2003),
Level = c("Mean", "Mean", "Mean"),
Obs = c(1.4, 1.9, 10.2))
#Create expanded version of data frame
Counties <- seq(from = 1, to = 77, by = 2)
Years <- seq(from = 1999, to = 2014, by = 1)
Levels <- c("Max", "Mean")
Expansion <- expand.grid(Counties, Years, Levels)
Expansion[4] <- 0
colnames(Expansion) <- colnames(df)
#Merge and order them so that the observed value is on top
df_full <- merge(Expansion, df, all = TRUE)
df_full$duplicate <- with(df_full,
paste(Year, County, Level))
df_full <- df_full[order(df_full$Year,
df_full$County,
df_full$Level,
-abs(df_full$Obs)), ]
#Deduplicate by taking the first that shows up (the observation)
df_full <- df_full[ !duplicated(df_full$duplicate), ]
df_full$duplicate <- NULL
I would like to generalize this so that I could somehow put in a data frame (and probably select the columns I need to order by since that sometimes changes) and then get the expanded version out. My first implementation consisted of a function with too many arguments (the data-frame and then all the column names I wanted to order/expand.grid by) and it also did not work:
gridExpand <- function(df, col1, col2=NULL, col3=NULL, measure){
#Started with "Expansion" being a global outside of the function
#It is identical the first part of the above code
ex <- merge(Expansion, df, all = TRUE)
ex$dupe <- with(ex,
paste(col1, col2, col3))
ex <- ex[order(with(ex,
col1, col2, col3, -abs(measure)))]
ex <- ex[ !duplicated(ex$dupe)]
ex <- subset(ex, select = -(dupe))
}
df_full <- gridExpand(df, Year, County, Level, Obs)
Error in paste(col1, col2, col3) : object 'Year' not found
I am assuming that this did not work because R has no way to know where 'Year' came from. I could potentially try paste(df, "$Year") but it would create "df$Year" which obviously will not work. And I do not ever see anyone else do this in their functions so clearly I am missing how it is that people reference things in data frame relevant functions.
I would ideally like to know of some resources that could help with thinking about generalization, or if someone can point me in the right direction to solving this particular problem I think it might help me see what I am doing wrong. I do not know of a better way to ask for help - I have been trying to read tutorials on writing functions for about 3 months and it is not clicking.
At a glance, the biggest thing that you can do is to not use non-standard-evaluation shortcuts inside your functions: things like $, subset() and with(). These are functions intended for convenient interactive use, not extensible programmatic use. (See, e.g., the Warning in ?subset which should probably be added to ?with, fortunes::fortune(312), fortunes::fortune(343).)
fortunes::fortune(312)
The problem here is that the $ notation is a magical shortcut and like
any other magic if used incorrectly is likely to do the programmatic
equivalent of turning yourself into a toad. -- Greg Snow (in
response to a user that wanted to access a column whose name is stored
in y via x$y rather than x[[y]])
R-help (February 2012)
fortunes::fortune(343)
Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R
newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable
consequences. It's best to acquire the [[ and [ habit early.
-- Peter Ehlers (about the use of $-extraction)
R-help (March 2013)
When you start writing functions that work on data frames, if you need to reference column names you should pass them in as strings, and then use [ or [[ to get the column based on the string stored in a variable name. This is the simplest way to make functions flexible with user-specified column names. For example, here's a simple stupid function that tests if a data frame has a column of the given name:
does_col_exist_1 = function(df, col) {
return(!is.null(df$col))
}
does_col_exist_2 = function(df, col) {
return(!is.null(df[[col]])
# equivalent to df[, col]
}
These yield:
does_col_exist_1(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_1(mtcars, col = "mpg")
# [1] FALSE
does_col_exist_2(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_2(mtcars, col = "mpg")
# [1] TRUE
The first function is wrong because $ doesn't evaluate what comes after it, no matter what value I set col to when I call the function, df$col will look for a column literally named "col". The brackets, however, will evaluate col and see "oh hey, col is set to "mpg", let's look for a column of that name."
If you want lots more understanding of this issue, I'd recommend the Non-Standard Evaluation Section of Hadley Wickham's Advanced R book.
I'm not going to re-write and debug your functions, but if I wanted to my first step would be to remove all $, with(), and subset(), replacing with [. There's a pretty good chance that's all you need to do.

Porting set operations from R's data frames to data tables: How to identify duplicated rows?

[Update 1: As Matthew Dowle noted, I'm using data.table version 1.6.7 on R-Forge, not CRAN. You won't see the same behavior with an earlier version of data.table.]
As background: I am porting some little utility functions to do set operations on rows of a data frame or pairs of data frames (i.e. each row is an element in a set), e.g. unique - to create a set from a list, union, intersection, set difference, etc. These mimic Matlab's intersect(...,'rows'), setdiff(...,'rows'), etc., which don't appear to have counterparts in R (R's set operations are limited to vectors and lists, but not rows of matrices or data frames). Examples of these little functions are below. If this functionality for data frames already exists in some package or base R, I'm open to suggestions.
I have been migrating these to data tables and one necessary step in the current approach is to find duplicated rows. When duplicated() is executed an error is returned stating that data tables must have keys. This is an unfortunate roadblock - other than setting keys, which isn't a universal solution and adds to computational costs, is there some other way to find duplicated objects?
Here is a reproducible example:
library(data.table)
set.seed(0)
x <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))
y <- as.data.table(matrix(sample(2, 100, replace = TRUE), ncol = 4))
res3 <- dt_intersect(x,y)
Yielding this error message:
Error in duplicated.data.table(z_rbind) : data table must have keys
The code works as-is for data frames, though I've named each function with the pattern dt_operation.
Is there some way to get around this issue? Setting keys only works for integers, which is a constraint I can't assume for the input data. So, perhaps I'm missing a clever way to use data tables?
Example set operation functions, where the elements of the sets are rows of data:
dt_unique <- function(x){
return(unique(x))
}
dt_union <- function(x,y){
z_rbind <- rbind(x,y)
z_unique <- dt_unique(z_rbind)
return(z_unique)
}
dt_intersect <- function(x,y){
zx <- dt_unique(x)
zy <- dt_unique(y)
z_rbind <- rbind(zy,zx)
ixDupe <- which(duplicated(z_rbind))
z <- z_rbind[ixDupe,]
return(z)
}
dt_setdiff <- function(x,y){
zx <- dt_unique(x)
zy <- dt_unique(y)
z_rbind <- rbind(zy,zx)
ixRangeX <- (nrow(zy) + 1):nrow(z_rbind)
ixNotDupe <- which(!duplicated(z_rbind))
ixDiff <- intersect(ixNotDupe, ixRangeX)
diffX <- z_rbind[ixDiff,]
return(diffX)
}
Note 1: One intended use for these helper functions is to find rows where key values in x are not among the key values in y. This way, I can find where NAs may appear when calculating x[y] or y[x]. Although this usage allows for setting of keys for the z_rbind object, I'd prefer not to constrain myself to just this use case.
Note 2: For related posts, here is a post on running unique on data frames, with excellent results for running it with the updated data.table package.
And this is an earlier post on running unique on data tables.
duplicated.data.table needs the same fix unique.data.table got [EDIT: Now done in v1.7.2]. Please raise another bug report: bug.report(package="data.table"). For the benefit of others watching, you're already using v1.6.7 from R-Forge, not 1.6.6 on CRAN.
But, on Note 1, there's a 'not join' idiom :
x[-x[y,which=TRUE]]
See also FR#1384 (New 'not' and 'whichna' arguments?) to make that easier for users, and that links to the keys that don't match thread which goes into more detail.
Update. Now in v1.8.3, not-join has been implemented.
DT[-DT["a",which=TRUE,nomatch=0],...] # old idiom
DT[!"a",...] # same result, now preferred.

Resources