I'm trying to understand the behaviour of eval in a data.table as a "frame".
With following data.table:
set.seed(1)
foo = data.table(var1=sample(1:3,1000,r=T), var2=rnorm(1000), var3=sample(letters[1:5],1000,replace = T))
I'm trying to replicate this instruction
foo[var1==1 , sum(var2) , by=var3]
using a function of eval:
eval1 = function(s) eval( parse(text=s) ,envir=sys.parent() )
As you can see, test 1 and 3 are working, but I don't understand which is the "correct" envir to set in eval for test 2:
var_i="var1"
var_j="var2"
var_by="var3"
# test 1 works
foo[eval1(var_i)==1 , sum(var2) , by=var3 ]
# test 2 doesn't work
foo[var1==1 , sum(eval1(var_j)) , by=var3]
# test 3 works
foo[var1==1 , sum(var2) , by=eval1(var_by)]
The j-exp, checks for it's variables in the environment of .SD, which stands for Subset of Data. .SD is itself a data.table that holds the columns for that group.
When you do:
foo[var1 == 1, sum(eval(parse(text=var_j))), by=var3]
directly, the j-exp gets internally optimised/replaced to sum(var2). But sum(eval1(var_j)) doesn't get optimised, and stays as it is.
Then when it gets evaluated for each group, it'll have to find var2, which doesn't exist in the parent.frame() from where the function is called, but in .SD. As an example, let's do this:
eval1 <- function(s) eval(parse(text=s), envir=parent.frame())
foo[var1 == 1, { var2 = 1L; eval1(var_j) }, by=var3]
# var3 V1
# 1: e 1
# 2: c 1
# 3: a 1
# 4: b 1
# 5: d 1
It find var2 from it's parent frame. That is, we have to point to the right environment to evaluate in, with an additional argument with value = .SD.
eval1 <- function(s, env) eval(parse(text=s), envir = env, enclos = parent.frame())
foo[var1 == 1, sum(eval1(var_j, .SD)), by=var3]
# var3 V1
# 1: e 11.178035
# 2: c -12.236446
# 3: a -8.984715
# 4: b -2.739386
# 5: d -1.159506
Related
I want to calculate a column called result using different formulas based on the other variable.
Here is an example.
library(data.table)
# my formulas
formulas <- c('(a+b)/2', 'a*b', 'a/b', 'b^2+a')
temp <- data.table(a = rnorm(n = 100, mean = 0, sd = 1),
b = rnorm(n = 100, mean = 0, sd = 1),
formula = sample(formulas, size = 100, replace = T))
temp[, 'result':=eval(parse(text = formula))]
The problem here is that no matter what the formula column is, all the values in results are calculated using the average, which is the first formula. I know I could always create one column per formula, but I still want to ask if there is a better way?
How can I fix it? Is it the most efficient method?
Here is a snippet of the table:
> temp[1:10]
a b formula result
1: -3.2133845 -0.78415565 b^2+a -1.998770087
2: 1.0723745 -0.31782577 a/b 0.377274341
3: 0.2269515 -0.15369020 a*b 0.036630652
4: 0.3339993 -0.86385430 a*b -0.264927499
5: 2.1118212 0.33736843 a/b 1.224594821
6: 0.9475773 -0.95697168 (a+b)/2 -0.004697187
7: 0.1912716 -1.71286598 a/b -0.760797195
8: 0.7773886 -0.01156844 (a+b)/2 0.382910072
9: -1.3132885 0.42693258 (a+b)/2 -0.443177939
10: 0.4569847 -0.70861707 b^2+a -0.125816205
eval(parse uses only the first expression. so we need to loop by either using a group by sequence of row
temp[, result := eval(parse(text = formula)), 1:nrow(temp)]
head(temp, 10)
# a b formula result
# 1: -0.8745498 1.59139467 b^2+a 1.6579872
# 2: 1.8701160 -0.04637923 a/b -40.3222721
# 3: 0.7160009 1.20070549 a*b 0.8597063
# 4: -0.3374944 1.18557927 a/b -0.2846662
# 5: 1.0452618 -0.60531471 (a+b)/2 0.2199735
# 6: 1.1730676 -0.29384938 a/b -3.9920710
# 7: 1.4913614 -0.44015278 a/b -3.3882811
# 8: 1.6538467 -0.18478224 a/b -8.9502472
# 9: -0.1331562 1.84615754 b^2+a 3.2751415
#10: 1.3096422 -0.39569134 a/b -3.3097570
-checking
1.8701160/-0.04637923
#[1] -40.32227
0.7160009 * 1.20070549
#[1] 0.8597062
Or with Map
temp[, result := unlist(Map(function(x, a, b)
eval(parse(text = x))), formula, a, b)]
I have the following function:
DT <- data.table(col1 = 1:4, col2 = c(2:5))
fun <- function(DT, fct){
DT_out <- DT[,new_col := fct]
return(DT_out)
}
fun(input, fct = function(x = col1, y = col2){y - x})
In reality I have some processing before and after this code snippet, thus I do not wish to use directly the statement DT[,new_col := fct] with a fixed fct (because the fct should be flexible). I know this question is very similar to this one, but I cannot figure out how to reformulate the code such that two columns as arguments for the function are allowed. The code above gives the error:
Error in `[.data.table`(DT, , `:=`(new_col, fct)) :
RHS of assignment is not NULL, not an an atomic vector (see ?is.atomic) and not a list column.
One option if you don't mind adding quotes around the variable names
fun <- function(DT, fun, ...){
fun_args <- c(...)
DT[,new_col := do.call(fun, setNames(mget(fun_args), names(fun_args)))]
}
fun(DT, fun = function(x, y){y - x}, x = 'col1', y = 'col2')
DT
# col1 col2 new_col
# 1: 1 2 1
# 2: 2 3 1
# 3: 3 4 1
# 4: 4 5 1
Or use .SDcols (same result as above)
fun <- function(DT, fun, ...){
fun_args <- c(...)
DT[, new_col := do.call(fun, setNames(.SD, names(fun_args))),
.SDcols = fun_args]
}
I am bit surprised by the behaviour of data.table. I want to select from one row in the data.table all non-NA values.
With NA values it's working:
t = data.table(a=1,b=NA)
t[, !is.na(t), with=F]
Without NA values it doesn't working:
t = data.table(a=1, b=2)
t[, !is.na(t), with=F]
The basic difference is that t[, !c(F, F), with=F] doesn't work. Interestingly t[, c(T, T), with=F] is doing fine.
I know there are many ways to achieve the desired output, but I am only interested in this - for me strange - behaviour of data.table.
I've investigated the data.table:::`[.data.table` source code
And it indeed looks like a bug to me. What basically happens, is that the !is.na() call is divided into ! and is.na() calls. Then, it sums this vector up, and if the length is zero it returns null.data.table(). The issue is, that for dt <- data.table(a = 1, b = 2), sum(is.na(dt)) will always be zero.
Below is a shortened code to illustrate what goes under the hood
sim_dt <- function(...) {
## data.table catches the call
jsub <- substitute(...)
cat("This is your call:", paste0(jsub, collapse = ""))
## data.table separates the `!` from the call and sets notj = TRUE instead
## and saves `is.na(t)` into `jsub`
if (is.call(jsub) && deparse(jsub[[1L]], 500L, backtick=FALSE) %in% c("!", "-")) { # TODO is deparse avoidable here?
notj = TRUE
jsub = jsub[[2L]]
} else notj = FALSE
cat("\nnotj:", notj)
cat("\nThis is the new jsub: ", paste0(jsub, collapse = "("), ")", sep = "")
## data.table evaluates just the `jsub` part which obviously return a vector of `FALSE`s (because `!` was removed)
cat("\nevaluted j:", j <- eval(jsub, setattr(as.list(seq_along(dt)), 'names', names(dt)), parent.frame()))# else j will be evaluated for the first time on next line
## data.table checks if `j` is a logical vector and looks if there are any TRUEs and gets an empty vector
if (is.logical(j)) cat("\nj after `which`:", j <- which(j))
cat("\njs length:", length(j), "\n\n")
## data.table checks if `j` is empty (and it's obviously is) and returns a null.data.table
if (!length(j)) return(data.table:::null.data.table()) else return(dt[, j, with = FALSE])
}
## Your data.table
dt <- data.table(a = 1, b = 2)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluted j: FALSE FALSE
# j after `which`:
# js length: 0
#
# Null data.table (0 rows and 0 cols)
dt <- data.table(a = 1, b = NA)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluted j: FALSE TRUE
# j after `which`: 2
# js length: 1
#
# b
# 1: NA
As #Roland has already mentioned is.na(t) output is a matrix where you need a vector to select column.
But column selection should work in example given by OP as it got only single row in data.table. All we need to do is to wrap it in () to get that evaluated. e.g. :
library(data.table)
t = data.table(a=1, b=2)
t[,(!c(FALSE,FALSE)),with=FALSE]
# a b
# 1: 1 2
t[,(!is.na(t)),with=FALSE]
# a b
# 1: 1 2
I would like to build a function to use within a loop function, however the child cannot find the i from the parent function's for (i in 1:3). How would I avoid this problem? I was thinking it could be be solved with assign(X, envir=sys.frame(which=-1)) , although I'm not sure where to place it or the frame to select in this case...
#simplified issue
parent.fxn = function(X) {
for (i in 1:3)
child.fxn(X)
}
child.fxn = function(X) {
return(X[i])
}
#ex. run_script
#parent.fxn(1:10)
#Error in child.fxn(X) : object 'i' not found
For reference, here is a version of the actual script I tried to write (below)... the error from my script (below) was very slightly different from my simplified example (above), although I'm not sure why.
library(cluster)
library(data.table)
library(magrittr)
data = data.table(x=runif(300), y=runif(300), z=runif(300))
master.fxn = function(DT, nlvls=3, data.cols=c("x", "y", "z")) {
Lvls = paste0("Level_", 1:nlvls);
for (i in 1:nlvls) {
core.fxn(DT, Lvls, return.k=TRUE)
}
}
core.fxn = function(X, mkey, ...) {
X[, c(mkey[i]):=cluster.fxn(.SD, ...), .SDcols=data.cols, by=c(mkey[0:(i-1)])] %>%
setkeyv(c(mkey[1:i]))
}
cluster.fxn = function(X, return.ac=FALSE, return.k=FALSE) {
a = agnes(X, metric="euclidean", method="ward", stand=TRUE)
if (return.ac) return(a$ac)
if (return.k) return(cutree(a, k=2))
}
#ex. run_script
#master.fxn(data)
#Error in eval(bysub, parent.frame(), parent.frame()) : object 'i' not found
And finally, here is a working version of the actual script, which does not use a function that utilizes [i] within the loop, but achieves the desired outcome. These scripts are condensed to exemplify the error. It would really simplify things in the final product if I could get the core.fxn to work within the master.fxn loop.
library(cluster)
library(data.table)
library(magrittr)
data = data.table(x=runif(300), y=runif(300), z=runif(300))
master.fxn = function(DT, nlvls=3, data.cols=c("x", "y", "z")) {
Lvls = paste0("Level_", 1:nlvls);
for (i in 1:nlvls) {
DT[, c(Lvls[i]):=cluster.fxn(.SD, return.k=TRUE), .SDcols=data.cols, by=c(Lvls[0:(i-1)])] %>%
setkeyv(c(Lvls[1:i]))
}
}
cluster.fxn = function(X, return.ac=FALSE, return.k=FALSE) {
a = agnes(X, metric="euclidean", method="ward", stand=TRUE)
if (return.ac) return(a$ac)
if (return.k) return(cutree(a, k=2))
}
#ex. run_script
#> data
# x y z
# 1: 0.1934689 0.67631296 0.3083592
# 2: 0.5267910 0.93186454 0.9583132
# 3: 0.5533244 0.37712457 0.4022132
# 4: 0.1886627 0.07535931 0.1171205
# 5: 0.7499003 0.90682684 0.6104284
# ---
#296: 0.7196245 0.80206991 0.6657839
#297: 0.2453930 0.06807955 0.8126690
#298: 0.3481978 0.23024162 0.4734052
#299: 0.2123976 0.27191432 0.1753336
#300: 0.7312911 0.89491793 0.5417281
#
#>master.fxn(data)
#data[, .SD[1], by=.(Level_1, Level_2, Level_3)]
#Level_1 Level_2 Level_3 x y z
#1: 1 1 1 0.0584953 0.77933040 0.76432541
#2: 1 1 2 0.1814877 0.65263178 0.41425295
#3: 1 2 1 0.9932725 0.99409350 0.96849477
#4: 1 2 2 0.9102010 0.76071068 0.69283525
#5: 2 1 1 0.9040033 0.85361443 0.30636660
#6: 2 1 2 0.8026868 0.08595128 0.43176372
#7: 2 2 1 0.2167962 0.73551203 0.01174373
#8: 2 2 2 0.5592571 0.84508641 0.37382253
#>
R Code
library(data.table)
x <- 4
f1 <- function(){
x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[x==get("x", envir=parent.env(environment()))]
}
f1()
I got this:
x y
1: 1 1
2: 1 5
3: 1 9
f2 is a new function that remove x <- 1 in the function.
f2 <- function(){
#x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[x==get("x", envir=parent.env(environment()))]
}
f2()
I got this:
x y
1: 4 4
2: 4 8
3: 4 12
that's right, my question is how to write a function to replace the get("x", envir=parent.env(environment()))?
Thanks!
I only just realized that the OP is grabbing x from the function's enclosing environment instead of passing it as an argument. I consider this bad practice and don't really have a recommendation for that case. I might delete this answer (which only covers passing x to the function) if it's too much of a distraction.
library(data.table)
dt <- data.table(x=1:4,y=1:12)
ff = function(x, ...){
mDT = data.table(x)
dt[mDT, on=.(x), ...]
}
ff(4L, verbose = TRUE)
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0 secs
# x y
# 1: 4 4
# 2: 4 8
# 3: 4 12
This only addresses the OP's specific example, of DT[x == get("x", ...)], and not broader expressions. For those, constructing and evaluating an expression should work:
fs = function(x, ...){
e = substitute(x == ..x, list(..x = x))
dt[eval(e), ...]
}
fs(4L, verbose = TRUE)
# Creating new index 'x'
# Starting bmerge ...done in 0 secs
# x y
# 1: 4 4
# 2: 4 8
# 3: 4 12
fs(3L, verbose = TRUE)
# Using existing index 'x'
# Starting bmerge ...done in 0 secs
# x y
# 1: 3 3
# 2: 3 7
# 3: 3 11
The verbose output indicates that fs creates indices, which can be helpful for speed. See vignette("datatable-secondary-indices-and-auto-indexing").
Eventually, there might be syntax so we can simply write ...
dt[..x == x]
perhaps using the proposed inherits = TRUE argument from the link for safety (so that x must be a column and either (i) x must exist in the parent environment or ..x must be a column name).
#Frank, Thanks! Based on this post variable usage in data.table, I wrote a function:
`..` <- function(x){
stopifnot(inherits(x, "character"))
stopifnot(length(x)==1)
get(x, parent.frame(4))
}
x <- 4
f1 <- function(){
x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[x==..("x")]
}
f1()
f2 <- function(){
#x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[x==..("x")]
}
f2()
Both f1 and f2 got the correct results!
Why parent.frame(4)?
We see the code first:
current_frame <- sys.nframe()
dt <- data.table()
dt[, sys.nframe() - current_frame]
We got 4, this should be the reason.
I found the old solution does not work for data.table 1.11.4, and I wrote a new one:
.. <- function (x, env = parent.frame())
{
stopifnot(inherits(x, "character"))
stopifnot(length(x) == 1)
get(x, envir = parent.env(env))
}
x <- 4
f1 <- function(){
x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[x==..("x")]
}
f1j <- function(){
x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[, ..("x")]
}
f2 <- function(){
#x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[x==..("x")]
}
f2j <- function(){
#x <- 1
dt <- data.table(x=1:4,y=1:12)
dt[,..("x")]
}
stopifnot(all(f1()$y==c(1,5,9)))
stopifnot(all(f1j()==c(1)))
stopifnot(all(f2()$y==c(4,8,12)))
stopifnot(all(f2j()==c(4)))
I have tested that this worked for data.table_1.10.4-3 and data.table_1.11.4.
Actually, I am confused about R's parent.frame and how to find the right variables in R, I just tested some possible ways until I got the expected results.