Subset data.table by logical column - r

I have a data.table with a logical column. Why can't the name of the logical column be used directly as the i argument? See the example:
dt <- data.table(x = c(T, T, F, T), y = 1:4)
# Works
dt[dt$x]
dt[!dt$x]
# Works
dt[x == T]
dt[x == F]
# Does not work
dt[x]
dt[!x]

From ?data.table
Advanced: When i is a single variable name, it is not considered an
expression of column names and is instead evaluated in calling scope.
So dt[x] will try to evaluate x in the calling scope (in this case, the global environment). You can get around this by wrapping the name in (, {, or force:
dt[(x)]
dt[{x}]
dt[force(x)]
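The flip side of the calling-scope rule can be sketched like this (the vector name keep is my own, not from the question): a bare name in i is looked up in the calling scope, so a logical vector defined there works directly.

```r
library(data.table)
dt <- data.table(x = c(TRUE, TRUE, FALSE, TRUE), y = 1:4)

# `keep` lives in the calling scope (here, the global environment),
# so a bare `dt[keep]` finds it there -- unlike the column `x`.
keep <- c(FALSE, FALSE, TRUE, TRUE)
dt[keep]   # rows 3 and 4
```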

x is not defined in the global environment. But if you try this,
> with(dt, dt[x])
x y
1: TRUE 1
2: TRUE 2
3: TRUE 4
it works, because with() evaluates the expression with dt's columns visible. Or this:
> attach(dt)
> dt[!x]
x y
1: FALSE 3
EDIT:
According to the documentation, the j parameter takes a column name, and indeed:
> dt[x]
Error in eval(expr, envir, enclos) : object 'x' not found
> dt[j = x]
[1] TRUE TRUE FALSE TRUE
The i parameter, in turn, takes a numerical or logical expression (which x itself should qualify as); however, it seems data.table can't see x as logical without this:
> dt[i = x]
Error in eval(expr, envir, enclos) : object 'x' not found
> dt[i = as.logical(x)]
x y
1: TRUE 1
2: TRUE 2
3: TRUE 4

This should also work and is arguably more natural:
setkey(dt, x)
dt[J(TRUE)]
dt[J(FALSE)]
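A brief sketch of what the keyed approach does, assuming the dt from the question (note that setkey physically sorts dt by x, so row order changes):

```r
library(data.table)
dt <- data.table(x = c(TRUE, TRUE, FALSE, TRUE), y = 1:4)

setkey(dt, x)   # sorts dt by x and marks x as the key
dt[J(TRUE)]     # binary-search join: all rows where x == TRUE
dt[J(FALSE)]    # all rows where x == FALSE
```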

Related

Why does the sample function give me an error

I am doing a simple sampling using the following code in R.
x=1:1000
sample1=sample(x, size=30,replace = F)
It used to work before, but it is not working now. I get this error message:
Error in sample.int(length(x), size, replace, prob) : invalid 'replace' argument
Do not use F and T as boolean values in R! Always use the full names FALSE and TRUE. The variables F and T can be reassigned any value.
Restart R (if it displays [Previously saved workspace restored], also run rm(list=ls())) and try the following:
1 == 1
is.logical(c(TRUE, FALSE))
is.logical(c(T, F))
v <- 1:5
v <= 3
(v <= 3) == T
# [1] TRUE TRUE TRUE FALSE FALSE
F <- 'a'
T <- FALSE
v <- 1:5
v <= 3
(v <= 3) == T
# [1] FALSE FALSE FALSE TRUE TRUE
I've included the important output.

r data.table behaviour with booleans as column selector

I am a bit surprised by the behaviour of data.table. I want to select all non-NA values from a one-row data.table.
With an NA value it works:
t = data.table(a=1,b=NA)
t[, !is.na(t), with=F]
Without NA values it doesn't work:
t = data.table(a=1, b=2)
t[, !is.na(t), with=F]
The basic difference is that t[, !c(F, F), with=F] doesn't work. Interestingly, t[, c(T, T), with=F] works fine.
I know there are many ways to achieve the desired output, but I am only interested in this - for me strange - behaviour of data.table.
I've investigated the data.table:::`[.data.table` source code, and it indeed looks like a bug to me. What basically happens is that the !is.na(t) call is split into ! and is.na(t). data.table then evaluates only the is.na(t) part, converts the resulting logical vector with which(), and returns null.data.table() if the index vector that comes back is empty. The issue is that for dt <- data.table(a = 1, b = 2), is.na(dt) is all FALSE, so which() always returns an empty vector.
Below is shortened code to illustrate what goes on under the hood:
sim_dt <- function(...) {
  ## data.table catches the call
  jsub <- substitute(...)
  cat("This is your call:", paste0(jsub, collapse = ""))
  ## data.table separates the `!` from the call, sets notj = TRUE instead,
  ## and saves `is.na(t)` into `jsub`
  if (is.call(jsub) && deparse(jsub[[1L]], 500L, backtick=FALSE) %in% c("!", "-")) { # TODO is deparse avoidable here?
    notj = TRUE
    jsub = jsub[[2L]]
  } else notj = FALSE
  cat("\nnotj:", notj)
  cat("\nThis is the new jsub: ", paste0(jsub, collapse = "("), ")", sep = "")
  ## data.table evaluates just the `jsub` part, which obviously returns a vector
  ## of `FALSE`s (because the `!` was removed)
  cat("\nevaluated j:", j <- eval(jsub, setattr(as.list(seq_along(dt)), 'names', names(dt)), parent.frame()))
  ## data.table checks if `j` is a logical vector, looks for any TRUEs with
  ## `which`, and gets an empty vector
  if (is.logical(j)) cat("\nj after `which`:", j <- which(j))
  cat("\njs length:", length(j), "\n\n")
  ## data.table checks if `j` is empty (it obviously is) and returns a null.data.table
  if (!length(j)) return(data.table:::null.data.table()) else return(dt[, j, with = FALSE])
}
## Your data.table
dt <- data.table(a = 1, b = 2)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE FALSE
# j after `which`:
# js length: 0
#
# Null data.table (0 rows and 0 cols)
dt <- data.table(a = 1, b = NA)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE TRUE
# j after `which`: 2
# js length: 1
#
#     b
# 1: NA
As @Roland has already mentioned, is.na(t) returns a matrix, whereas you need a vector to select columns.
But column selection should still work in the OP's example, since the data.table has only a single row. All we need to do is wrap the expression in () to get it evaluated, e.g.:
library(data.table)
t = data.table(a=1, b=2)
t[,(!c(FALSE,FALSE)),with=FALSE]
# a b
# 1: 1 2
t[,(!is.na(t)),with=FALSE]
# a b
# 1: 1 2

using eval in data.table

I'm trying to understand the behaviour of eval in a data.table as a "frame".
With following data.table:
set.seed(1)
foo = data.table(var1=sample(1:3,1000,r=T), var2=rnorm(1000), var3=sample(letters[1:5],1000,replace = T))
I'm trying to replicate this instruction
foo[var1==1 , sum(var2) , by=var3]
using a function of eval:
eval1 = function(s) eval( parse(text=s) ,envir=sys.parent() )
As you can see, tests 1 and 3 work, but I don't understand which is the "correct" envir to set in eval for test 2:
var_i="var1"
var_j="var2"
var_by="var3"
# test 1 works
foo[eval1(var_i)==1 , sum(var2) , by=var3 ]
# test 2 doesn't work
foo[var1==1 , sum(eval1(var_j)) , by=var3]
# test 3 works
foo[var1==1 , sum(var2) , by=eval1(var_by)]
The j-expression looks for its variables in the environment of .SD, which stands for Subset of Data. .SD is itself a data.table that holds the columns for that group.
When you do:
foo[var1 == 1, sum(eval(parse(text=var_j))), by=var3]
directly, the j-expression gets internally optimised/replaced to sum(var2). But sum(eval1(var_j)) doesn't get optimised and stays as it is.
Then, when it is evaluated for each group, it has to find var2, which doesn't exist in the parent.frame() from which the function is called, but only in .SD. As an example, let's do this:
eval1 <- function(s) eval(parse(text=s), envir=parent.frame())
foo[var1 == 1, { var2 = 1L; eval1(var_j) }, by=var3]
# var3 V1
# 1: e 1
# 2: c 1
# 3: a 1
# 4: b 1
# 5: d 1
It finds var2 in its parent frame. That is, we have to point to the right environment to evaluate in, by passing .SD through an additional argument.
eval1 <- function(s, env) eval(parse(text=s), envir = env, enclos = parent.frame())
foo[var1 == 1, sum(eval1(var_j, .SD)), by=var3]
# var3 V1
# 1: e 11.178035
# 2: c -12.236446
# 3: a -8.984715
# 4: b -2.739386
# 5: d -1.159506
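As a small side check of the claim that .SD holds the group's non-grouping columns (my own illustrative snippet, not from the answer):

```r
library(data.table)
set.seed(1)
foo <- data.table(var1 = sample(1:3, 1000, replace = TRUE),
                  var2 = rnorm(1000),
                  var3 = sample(letters[1:5], 1000, replace = TRUE))

# For each group, .SD contains every column except the `by` column,
# so here it holds var1 and var2.
foo[, ncol(.SD), by = var3]
```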

How to use data.table inside a function?

As a minimal working example, for instance, I want to be able to dynamically pass expressions to a data.table object to create new columns or modify existing ones:
dt <- data.table(x = 1, y = 2)
dynamicDT <- function(...) {
  dt[, list(...)]
}
dynamicDT(z = x + y)
I was expecting:
z
1: 3
but instead, I get the error:
Error in eval(expr, envir, enclos) : object 'x' not found
So how can I fix this?
Attempts:
I've seen this post, which suggests using quote or substitute, but
> dynamicDT(z = quote(x + y))
Error in `rownames<-`(`*tmp*`, value = paste(format(rn, right = TRUE), :
length of 'dimnames' [1] not equal to array extent
or
> dynamicDT <- function(...) {
+ dt[, list(substitute(...))]
+ }
> dynamicDT(z = x + y)
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
first argument must be atomic
haven't worked for me.
This should be a better alternative to David's answer:
dynamicDT <- function(...) {
  dt[, eval(substitute(...))]
}
dynamicDT(z := x + y)
# x y z
#1: 1 2 3
You will need to use the eval(parse(text = )) combination: parse transforms the string into an expression, while eval evaluates it.
library(data.table)
dt <- data.table(x = 1, y = 2)
dynamicDT <- function(temp = "") {
  dt[, eval(parse(text = temp))]
}
To get your previous desired output:
dynamicDT("z := x + y")
## x y z
## 1: 1 2 3
To get your current desired output:
dynamicDT("z = x + y")
## [1] 3
To parse multiple arguments, you can do:
dynamicDT('c("a","b") := list(x + y, x - y)')
## x y a b
##1: 1 2 3 -1

R ff ffbase ffwhich error in a function call?

Here is my code, calling ffwhich inside a function:
library(ffbase)
rm(a,b)
test <- function(x) {
  a <- 1
  b <- 3
  ffwhich(x, x > a & x < b)
}
x <- ff(1:10)
test(x)
Error in eval(expr, envir, enclos) (from <text>#1) : object 'a' not found
traceback()
6: eval(expr, envir, enclos)
5: eval(e)
4: which(eval(e))
3: ffwhich.ff_vector(x, x > a & x < b)
2: ffwhich(x, x > a & x < b) at #4
1: test(x)
Could it be caused by lazy evaluation? eval() cannot find a and b, which are bound inside the function test. How can I use ffwhich in a function?
R 2.15.2
ffbase 0.6-3
ff 2.2-10
OS opensuse 12.2 64 bit
Yes, it looks like an eval issue, as Arun indicated. I normally use the following pattern with ffwhich, which is like an eval:
library(ffbase)
rm(a,b)
test <- function(x) {
  a <- 1
  b <- 3
  idx <- x > a & x < b
  idx <- ffwhich(idx, idx == TRUE)
  idx
}
x <- ff(1:10)
test(x)
I was having the same problem, and the given answer did not solve it, because we cannot pass the condition as an argument to the function.
I just found a way to do that. Here it is:
library(ffbase)
# the data:
x <- as.ffdf(data.frame(a = c(1:4, 1), b = 5:1))
x[, ]
# Now the function below works:
idx_ffdf <- function(data, condition) {
  # substitute() takes the value of condition (non-evaluated);
  # `%in% TRUE` makes the condition FALSE when there are NAs
  exp <- substitute((condition) %in% TRUE)
  idx <- do.call(ffwhich, list(data, exp))  # here is the trick: do.call!
  return(idx)
}
# testing:
idx <- idx_ffdf(x, a == 1)
idx[]  # gives the correct 1, 5 ...
idx <- idx_ffdf(x, b > 3)
idx[]  # gives the correct 1, 2 ...
Hope this helps somebody!
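The same capture-then-splice idea can be sketched in plain base R, without ff at all. Here my_which is a hypothetical stand-in for ffwhich (like ffwhich, it substitute()s its second argument), and idx_df mirrors idx_ffdf: substitute grabs the caller's condition unevaluated, and do.call splices that expression literally into the call, so the inner substitute sees the real condition rather than the promise name.

```r
# Hypothetical stand-in for ffwhich: it captures its second argument
# unevaluated and evaluates it against the data's columns.
my_which <- function(x, expr) {
  e <- substitute(expr)
  which(eval(e, envir = x))
}

# Base-R mirror of idx_ffdf: capture the condition, then splice it
# into the my_which call via do.call.
idx_df <- function(data, condition) {
  exp <- substitute((condition) %in% TRUE)  # `%in% TRUE` turns NA into FALSE
  do.call(my_which, list(data, exp))
}

x <- data.frame(a = c(1:4, 1), b = 5:1)
idx_df(x, a == 1)  # 1 5
idx_df(x, b > 3)   # 1 2
```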
