correct braces placement in := within data.table - r

Here is an example of a problem I am having. Am I misusing it, or is this a bug?
require(data.table)
x <- data.table(a = 1:4)
# this does not work
x[ , {b = a + 3; `:=`(c = b)}]
# Error in `:=`(c = b) : unused argument(s) (c = b)
# this works fine
x[ ,`:=`(c = a + 3)]

Not a bug; it's just that the placement of the braces should be different.
That is, use the braces to wrap only the RHS argument in `:=`(LHS, RHS)
Example:
# sample data
x <- data.table(a = 1:4)
# instead of:
x[ , {b = a + 3; `:=`(c, b)}] # <~~ Notice braces are wrapping LHS AND RHS
# use this:
x[ , `:=`(c, {b = a + 3; b})] # <~~ Braces wrapping only RHS
x
# a c
# 1: 1 4
# 2: 2 5
# 3: 3 6
# 4: 4 7
However, more succinctly and naturally:
you are probably looking for this:
x[ , c := {b = a + 3; b}]
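The same pattern extends to creating several columns at once: the braces wrap only the RHS, which must evaluate to a list with one element per target column. A small sketch (the column names `c` and `d` here are arbitrary):

```r
library(data.table)
x <- data.table(a = 1:4)
# braces wrap only the RHS; the RHS returns a list, one element per new column
x[ , c("c", "d") := {b = a + 3; list(b, b * 2)}]
x
#    a c  d
# 1: 1 4  8
# 2: 2 5 10
# 3: 3 6 12
# 4: 4 7 14
```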
Update from Matthew
Exactly. Using := in other incorrect ways gives this (long) error :
x := 1
# Error: := is defined for use in j only, and (currently) only once; i.e.,
# DT[i,col:=1L] and DT[,newcol:=sum(colB),by=colA] are ok, but not
# DT[i,col]:=1L, not DT[i]$col:=1L and not DT[,{newcol1:=1L;newcol2:=2L}].
# Please see help(":="). Check is.data.table(DT) is TRUE.
but not in the case that the question showed, giving just :
x[ , {b = a + 3; `:=`(c = b)}]
# Error in `:=`(c = b) : unused argument(s) (c = b)
I've just changed this in v1.8.9. Both these incorrect ways of using := now give a more succinct error :
x[ , {b = a + 3; `:=`(c = b)}]
# Error in `:=`(c = b) :
# := and `:=`(...) are defined for use in j only, in particular ways. See
# help(":="). Check is.data.table(DT) is TRUE.
and we'll embellish ?":=". Thanks @Alex for highlighting!

Related

`:=` used for multiple simultaneous assign in data table does not respect updated values

The column x is incremented, and then I intend to assign the updated value of x to y. Why is the value not equal to the intended one?
> z = data.table(x = 1:5, y= 1:5)
> z[, `:=` (x = x + 1, y = x)]
> # Actual
> z
   x y
1: 2 1
2: 3 2
3: 4 3
4: 5 4
5: 6 5
> # Expected
> z
   x y
1: 2 2
2: 3 3
3: 4 4
4: 5 5
5: 6 6
Here are two more alternatives for you to consider. As noted, data.table doesn't do dynamic scoping the way dplyr::mutate does, so y = x still refers to z$x in the second part of your statement. You can consider filing an issue if you strongly prefer this way.
Explicitly assign the new x inline:
z[, `:=` (x = (x <- x + 1), y = x)]
In the environment where j is evaluated, now an object x is created to overwrite z$x temporarily. This should be very similar to what dplyr is doing internally -- evaluating the arguments of mutate sequentially and updating the column values iteratively.
Switch to LHS := RHS form (see ?set):
z[ , c('x', 'y') := {
  x = x + 1
  .(x, x)
}]
. is shorthand in data.table for list. In LHS := RHS form, RHS must evaluate to a list; each element of that list will be one column in the assignment.
More compactly:
z[ , c('x', 'y') := {x = x + 1; .(x, x)}]
; allows you to write multiple statements on the same line (e.g. 3+4; 4+5 will run 3+4 then 4+5). { creates a way to wrap multiple statements and return the final value, see ?"{". Implicitly you're using this whenever you write if (x) { do_true } else { do_false } or function(x) { function_body }.
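In plain R terms, the same two pieces can be seen outside data.table; a minimal illustration of `{` returning the value of its last expression:

```r
# `;` separates statements on one line; `{` evaluates them in order
# and returns the value of the last one
val <- {b <- 3 + 4; b + 5}
val
# [1] 12
```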
The value of x is not updated while the calculation for y is being done. You could use the same expression for y as was used for x:
library(data.table)
z[, `:=` (x = x + 1, y = x + 1)]
Or update it separately.
z[, x := x + 1][, y:= x]
This behavior is different as compared to mutate from dplyr where the following works.
library(dplyr)
z %>% mutate(x = x + 1, y = x)

r data.table behaviour with booleans as column selector

I am a bit surprised by the behaviour of data.table. I want to select, from a one-row data.table, all non-NA values.
With NA values it works:
t = data.table(a=1,b=NA)
t[, !is.na(t), with=F]
Without NA values it doesn't work:
t = data.table(a=1, b=2)
t[, !is.na(t), with=F]
The basic difference is that t[, !c(F, F), with=F] doesn't work. Interestingly, t[, c(T, T), with=F] works fine.
I know there are many ways to achieve the desired output, but I am only interested in this - for me strange - behaviour of data.table.
I've investigated the data.table:::`[.data.table` source code
And it indeed looks like a bug to me. What basically happens is that the !is.na() call is split into ! and is.na() calls. Then it evaluates just the is.na() part, turns the result into an index vector, and if that vector has length zero it returns null.data.table(). The issue is that for dt <- data.table(a = 1, b = 2), sum(is.na(dt)) will always be zero.
Below is shortened code to illustrate what goes on under the hood:
sim_dt <- function(...) {
  ## data.table catches the call
  jsub <- substitute(...)
  cat("This is your call:", paste0(jsub, collapse = ""))
  ## data.table separates the `!` from the call, sets notj = TRUE instead,
  ## and saves `is.na(t)` into `jsub`
  if (is.call(jsub) && deparse(jsub[[1L]], 500L, backtick = FALSE) %in% c("!", "-")) { # TODO is deparse avoidable here?
    notj = TRUE
    jsub = jsub[[2L]]
  } else notj = FALSE
  cat("\nnotj:", notj)
  cat("\nThis is the new jsub: ", paste0(jsub, collapse = "("), ")", sep = "")
  ## data.table evaluates just the `jsub` part, which obviously returns a vector
  ## of `FALSE`s (because `!` was removed)
  cat("\nevaluated j:", j <- eval(jsub, setattr(as.list(seq_along(dt)), 'names', names(dt)), parent.frame())) # else j will be evaluated for the first time on next line
  ## data.table checks if `j` is a logical vector, keeps only the TRUE positions,
  ## and here gets an empty vector
  if (is.logical(j)) cat("\nj after `which`:", j <- which(j))
  cat("\njs length:", length(j), "\n\n")
  ## data.table checks if `j` is empty (and it obviously is) and returns a null.data.table
  if (!length(j)) return(data.table:::null.data.table()) else return(dt[, j, with = FALSE])
}
## Your data.table
dt <- data.table(a = 1, b = 2)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE FALSE
# j after `which`:
# js length: 0
#
# Null data.table (0 rows and 0 cols)
dt <- data.table(a = 1, b = NA)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE TRUE
# j after `which`: 2
# js length: 1
#
# b
# 1: NA
As @Roland has already mentioned, the output of is.na(t) is a matrix, whereas you need a vector to select columns.
But column selection should work in the example given by the OP, as the data.table has only a single row. All we need to do is wrap the expression in () to get it evaluated, e.g.:
library(data.table)
t = data.table(a=1, b=2)
t[, (!c(FALSE, FALSE)), with = FALSE]
#    a b
# 1: 1 2
t[, (!is.na(t)), with = FALSE]
#    a b
# 1: 1 2
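Note the parenthesis trick relies on the table having a single row, so is.na(t) is effectively a one-row matrix. With several rows one option (a sketch; the colSums idiom is an assumption, not from the original answers) is to select the columns that contain no NAs:

```r
library(data.table)
t2 <- data.table(a = c(1, NA), b = c(2, 3))
is.na(t2)                        # a 2 x 2 matrix, not a vector
keep <- colSums(is.na(t2)) == 0  # one logical per column: TRUE if the column has no NAs
t2[, keep, with = FALSE]
#    b
# 1: 2
# 2: 3
```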

How to use data.table inside a function?

As a minimal working example, for instance, I want to be able to dynamically pass expressions to a data.table object to create new columns or modify existing ones:
dt <- data.table(x = 1, y = 2)
dynamicDT <- function(...) {
  dt[, list(...)]
}
dynamicDT(z = x + y)
I was expecting:
z
1: 3
but instead, I get the error:
Error in eval(expr, envir, enclos) : object 'x' not found
So how can I fix this?
Attempts:
I've seen this post, which suggests using quote or substitute, but
> dynamicDT(z = quote(x + y))
Error in `rownames<-`(`*tmp*`, value = paste(format(rn, right = TRUE), :
length of 'dimnames' [1] not equal to array extent
or
> dynamicDT <- function(...) {
+ dt[, list(substitute(...))]
+ }
> dynamicDT(z = x + y)
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, :
first argument must be atomic
haven't worked for me.
This should be a better alternative to David's answer:
dynamicDT <- function(...) {
  dt[, eval(substitute(...))]
}
dynamicDT(z := x + y)
#    x y z
# 1: 1 2 3
You will need to use the eval(parse(text = ...)) combination. parse transforms the string into an expression, and eval evaluates it.
library(data.table)
dt <- data.table(x = 1, y = 2)
dynamicDT <- function(temp = "") {
  dt[, eval(parse(text = temp))]
}
In order to get your previous desired output
dynamicDT("z := x + y")
## x y z
## 1: 1 2 3
In order to get your current desired output
dynamicDT("z = x + y")
## [1] 3
In order to parse multiple arguments you can do
dynamicDT('c("a","b") := list(x + y, x - y)')
##    x y a  b
## 1: 1 2 3 -1

Can I use variables newly created in `j` in the same `j` argument?

In the j argument of data.table, is there syntax allowing me to reference previously created variables within the same j statement? I'm thinking of something like Lisp's let* construct.
library(data.table)
set.seed(22)
DT <- data.table(a = rep(1:5, each = 10),
                 b = sample(c(0, 1), 50, rep = TRUE))
DT[ , list(attempts = .N,
           successes = sum(b),
           rate = successes / attempts),
   by = a]
This results in
# Error in `[.data.table`(DT, , list(attempts = .N, successes = sum(b), :
# object 'successes' not found
I understand why, but is there a different way to accomplish this in the same j?
This will do the trick:
DT[ , {
  list(attempts = attempts <- .N,
       successes = successes <- sum(b),
       rate = successes / attempts)
}, by = a]
# a attempts successes rate
# 1: 1 10 5 0.5
# 2: 2 10 6 0.6
# 3: 3 10 3 0.3
# 4: 4 10 5 0.5
# 5: 5 10 5 0.5
FWIW, this closely related data.table feature request would make more or less the syntax used in your question possible. Quoting from the linked page:
Summary:
Iterative RHS of := (and `:=`(...)), and multiple := inside j = {...} syntax
Detailed description
e.g. DT[, `:=`( m1 = mean(a), m2 = sd(a), s = m1/m2 ), by = group]
where s can use previous lhs names (using the word 'iterative' tries to convey that).
Try this instead:
DT[ , {
  successes = sum(b)
  attempts = .N
  list(attempts = attempts,
       successes = successes,
       rate = successes / attempts)
}, by = a]
or
DT[ , list(attempts = .N,
           successes = sum(b)),
   by = a][ , rate := successes / attempts]
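For completeness, the chained form can be checked end to end on the question's data (a sketch; with the fixed groups of 10 rows, attempts is always 10, while successes depends on the RNG version):

```r
library(data.table)
set.seed(22)
DT <- data.table(a = rep(1:5, each = 10),
                 b = sample(c(0, 1), 50, replace = TRUE))
# first aggregate per group, then compute rate from the new columns by reference
res <- DT[, list(attempts = .N, successes = sum(b)),
          by = a][, rate := successes / attempts]
res$attempts   # 10 10 10 10 10
```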

Subset data.table by logical column

I have a data.table with a logical column. Why can't the name of the logical column be used directly as the i argument? See the example.
dt <- data.table(x = c(T, T, F, T), y = 1:4)
# Works
dt[dt$x]
dt[!dt$x]
# Works
dt[x == T]
dt[x == F]
# Does not work
dt[x]
dt[!x]
From ?data.table
Advanced: When i is a single variable name, it is not considered an
expression of column names and is instead evaluated in calling scope.
So dt[x] will try to evaluate x in the calling scope (in this case the global environment).
You can get around this by using (, {, or force:
dt[(x)]
dt[{x}]
dt[force(x)]
x is not defined in the global environment. It would work if you evaluated it within the data.table's scope:
> with(dt, dt[x])
      x y
1: TRUE 1
2: TRUE 2
3: TRUE 4
Or this:
> attach(dt)
> dt[!x]
       x y
1: FALSE 3
EDIT:
According to the documentation, the j parameter takes a column name; in fact:
> dt[x]
Error in eval(expr, envir, enclos) : object 'x' not found
> dt[j = x]
[1] TRUE TRUE FALSE TRUE
Then, the i parameter takes either a numerical or a logical expression (which x itself should be); however, it seems data.table can't see x as logical without this:
> dt[i = x]
Error in eval(expr, envir, enclos) : object 'x' not found
> dt[i = as.logical(x)]
      x y
1: TRUE 1
2: TRUE 2
3: TRUE 4
This should also work and is arguably more natural:
setkey(dt, x)
dt[J(TRUE)]
dt[J(FALSE)]
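A quick sketch of the keyed approach on the question's data (note that setkey physically reorders dt by x, so the row order changes):

```r
library(data.table)
dt <- data.table(x = c(TRUE, TRUE, FALSE, TRUE), y = 1:4)
setkey(dt, x)   # sorts dt by x: the FALSE row first, then the TRUE rows
dt[J(TRUE)]     # join on the key: rows where x is TRUE (y = 1, 2, 4)
dt[J(FALSE)]    # rows where x is FALSE (y = 3)
```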
