expression syntax for data.table := in R

I am having some trouble getting an eval within data.table in R to work with an expression. Here is some code:
dtb = data.table(a=1:100, b=100:1, id=1:10)
dtb[,`:=`(c=a+b, d=a/b),by=id] #this works fine
expr = expression({`:=`(c=a+b, d=a/b)}) #try to couch everything in an expression
dtb[,eval(expr),by=id] #this does not work
Error in `:=`(c = a + b, d = a/b) :
unused argument(s) (c = a + b, d = a/b)
expr = expression(`:=`(c=a+b, d=a/b)) #this works fine
dtb[,eval(expr),by=id]
Why does including {} break this?

See the definition of :=:
function (LHS, RHS)
stop(":= is defined for use in j only, and (currently) only once; i.e., DT[i,col:=1L] and DT[,newcol:=sum(colB),by=colA] are ok, but not DT[i,col]:=1L, not DT[i]$col:=1L and not DT[,{newcol1:=1L;newcol2:=2L}]. Please see help(\":=\"). Check is.data.table(DT) is TRUE.")
The assignment of a column doesn't happen within a call of :=; the function itself does nothing besides produce an error. The assignment happens when [.data.table detects that j is an expression of the form `:=`(...) and then sets everything up for a call to the C code. When you enclose the expression in braces, you make the first part of the expression { instead of :=, which slips past the above detection and eventually results in an evaluation of := with arguments c and d.
I guess that leads to the question, why do you need to enclose it in { }?

Issue #376 to trap the { around := has now been implemented in v1.8.11. From NEWS:
o FR #2496 is now implemented to trap and remove the { around := in j to obtain desired result. Now, DT[,{`:=`(...)}] and DT[, {`:=`(...)}, by=(...)] both work as intended but with a warning. Thanks to Alex for reporting on SO: expression syntax for data.table := in R

Related

Using substitute to robustly capture the input code within glue() in a package

I'm building some utility functions to make writing cast(statement as type) in an SQL query easier from R.
The way I'm doing so is through one workhorse function, as_type which is called by several one-use functions (e.g. as_bigint); crucially, I also think calling as_type directly is a valid use case.
The basic structure of the code is like:
as_type = function(x, type) {
if (is.na(match(type, known_types())))
stop("Attempt to cast to unknown type ", type)
sprintf('cast(%s as %s)', deparse(substitute(x, parent.frame())), type)
}
as_bigint = function(x) as_type(x, 'bigint')
known_types = function() 'bigint'
# more complicated than this, but it works for the purposes of this Q
query_encode = glue::glue
With expected usages like
query_encode("select {as_bigint('1')}")
query_encode("select {as_type('1', 'bigint')}")
(in reality there are several more valid types and as_ functions for other valid SQL types; only query_encode is exported)
Unfortunately, calling as_type directly fails because, as noted in ?substitute (h/t Brodie G on Twitter):
If [a component of the parse tree] is not a bound symbol in [the second argument to substitute] env, it is unchanged
query_encode("select {as_bigint('1')}")
# select cast("1" as bigint)
query_encode("select {as_type('1', 'bigint')}")
# select cast(x as bigint)
I've cooked up the following workaround but it hardly feels robust:
as_type = function(x, type) {
if (is.na(match(type, known_types())))
stop("Attempt to cast to unknown Presto type ", type)
prev_call = as.character(tail(sys.calls(), 2L)[[1L]][[1L]])
valid_parent_re = sprintf('^as_(?:%s)$', paste(known_types(), collapse = '|'))
eval_env =
if (grepl(valid_parent_re, prev_call)) parent.frame() else environment()
sprintf(
'cast(%s as %s)',
gsub('"', "'", deparse(substitute(x, eval_env)), fixed = TRUE),
type
)
}
I.e., examine sys.calls() and check if as_type is being called from one of the as_ functions; set env argument to substitute as parent.frame() if so, current environment if not.
This works for now:
query_encode("select {as_bigint('1')}")
# select cast("1" as bigint)
query_encode("select {as_type('1', 'bigint')}")
# select cast("1" as bigint)
The question is, is this the best way of going about this? Phrased as such, it feels like an opinion-based question, but what I mean is -- (1) is this approach as fragile as it feels like at first glance and (2) assuming so, what's an alternative that is more robust?
E.g. it's notable that is.name(x) is FALSE from as_type, but it's not clear to me how to use this to proceed.
Here is the possible alternate approach I allude to in the comments:
.as_type <- function(x_sub, type) {
if(!isTRUE(type %in% known_types()))
stop("Attempt to cast to unknown type ", type)
sprintf('cast(%s as %s)', deparse(paste0(x_sub, collapse='\n')), type)
}
as_bigint <- function(x) .as_type(substitute(x), 'bigint')
as_type <- function(x, type) .as_type(substitute(x), type)
known_types <- function() 'bigint'
query_encode <- glue::glue
Then
query_encode("select {as_bigint('1')}")
## select cast("1" as bigint)
query_encode("select {as_type('1', 'bigint')}")
## select cast("1" as bigint)
In terms of what you actually want to do, I think we're stuck with variations on what you're doing, and I agree it feels a bit dirty. This is dirty in a different way, but not that dirty, and it seems like it might work. The only dirtiness really is the need to have each function call substitute, but that's not that big a deal.
In terms of fragility, to the extent you don't export the as_ functions, then it seems okay, although it does feel odd not to export those functions. I would export them, but if you do that, then you need far more robust checking as then people can rename the functions, etc. One thing to watch out for is that it is possible for the compiler to mess with frame counts. It really shouldn't, but Luke Tierney seems more comfortable doing that than I would be.
I believe you might have overlooked glue transformers. Going from character to call to end up with character again is a big detour that you don't need to take.
Transformers allow you to apply functions to the glue input and output, before and after evaluation; you can read more about them in the glue documentation. Keeping your format, we can build:
library(glue)
cast_transformer <- function(regex = "as_(.*?)\\((.*)\\)$", ...) {
function(text, envir) {
type <- sub(regex, "\\1", text)
known_types <- "bigint"
if(type %in% known_types)
{
val <- sub(regex, "\\2", text)
glue("cast({val} as {type})")
} else {
eval(parse(text = text, keep.source = FALSE), envir)
}
}
}
glue("select {as_bigint('1')}",.transformer = cast_transformer())
#> select cast('1' as bigint)
Because we are now parsing the expression, there is no function as_bigint, you can still keep the syntax if it's convenient to you but nothing stops you from simplifying it to something like :
glue("select {bigint: '1'}",.transformer = cast_transformer("(\\D+): (.*)$"))
#> select cast('1' as bigint)
Choose the default regex that you like and define the wrapper query_encode <- function(query) glue(query, .transformer = cast_transformer()) and you're good to go.

Using invalid character "²" for squared. Extend Julia syntax with custom operators

In my equations we have many expressions with a^2, and so on. I would like to map "²" to ^2, to obtain something like that:
julia> a² == a^2
true
The above is, however, not legal code in Julia. Any idea how I could implement it?
Here is a sample macro @hoo that does what you requested in a simplified scenario (since the code is long, I will start with usage).
julia> x=5
5
julia> @hoo 3x² + 4x³
575
julia> @macroexpand @hoo 2x³+3x²
:(2 * Main.x ^ 3 + 3 * Main.x ^ 2)
Now, let us see the macro code:
const charsdict=Dict(Symbol.(split("¹²³⁴⁵⁶⁷⁸⁹","")) .=> 1:9)
const charsre = Regex("[$(join(String.(keys(charsdict))))]")
function proc_expr(e::Expr)
for i=1:length(e.args)
el = e.args[i]
typeof(el) == Expr && proc_expr(el)
if typeof(el) == Symbol
mm = match(charsre, String(el))
if mm != nothing
a1 = Symbol(String(el)[1:(mm.offset-1)])
a2 = charsdict[Symbol(mm.match)]
e.args[i] = :($a1^$a2)
end
end
end
end
macro hoo(expr)
typeof(expr) != Expr && return expr
proc_expr(expr)
expr
end
Of course, it would be quite easy to expand this concept into a "pure-math" library for Julia.
I don't think that there is any reasonable way of doing this.
When parsing your input, Julia makes no real difference between the Unicode character ² and any other character you might use in a variable name. Attempting to make this into an operator would be similar to trying to make the suffix square into an operator:
julia> asquare == a^2
The a and the ² are not parsed as two separate things, just like the a and the square in asquare would not be.
a^2, on the other hand, is parsed as three separate things. This is because ^ is not a valid character for a variable name and it is therefore parsed as an operator instead.

Why does this happen when a user-defined R function does not return a value?

In the function shown below, there is no return statement. However, after executing it, I can confirm that the value was assigned to d as expected.
Why does this work without a return? Any explanation would be appreciated.
Code
#installed plotly, dplyr
accumulate_by <- function(dat, var) {
var <- lazyeval::f_eval(var, dat)
lvls <- plotly:::getLevels(var)
dats <- lapply(seq_along(lvls), function(x) {
cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
})
dplyr::bind_rows(dats)
}
d <- txhousing %>%
filter(year > 2005, city %in% c("Abilene", "Bay Area")) %>%
accumulate_by(~date)
In the function, the last assignment creates dats, and the final expression, dplyr::bind_rows(dats), is what gets returned. We don't need an explicit return statement. If two objects need to be returned, we can place them in a list.
In some languages, such as Python, generators are used for memory efficiency: they yield values one at a time instead of creating the whole output in memory. Consider two functions in Python:
def get_square(n):
result = []
for x in range(n):
result.append(x**2)
return result
When we run it
get_square(4)
#[0, 1, 4, 9]
The same function can be written as a generator. Instead of returning a list, it yields each value:
def get_square(n):
for x in range(n):
yield(x**2)
Running the function
get_square(4)
#<generator object get_square at 0x0000015240C2F9E8>
By casting with list, we get the same output
list(get_square(4))
#[0, 1, 4, 9]
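To make the laziness concrete, here is a small sketch using the same generator. The variable names first, rest and leftover are only for illustration. It shows that the body runs only as far as the next yield, and that a generator can be consumed only once:

```python
def get_square(n):
    """Generator version: yields squares one at a time."""
    for x in range(n):
        yield x ** 2

gen = get_square(4)
first = next(gen)      # runs the body only up to the first yield
rest = list(gen)       # consumes the remaining values
leftover = list(gen)   # the generator is now exhausted

print(first, rest, leftover)
# 0 [1, 4, 9] []
```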
There is always a return :) You just don't have to be explicit about it.
All R expressions return something. Including control structures and user-defined functions. (Control-structures are just functions, by the way, so you can just remember that everything is a value or a function call, and everything evaluates to a value).
For functions, the return value is the last expression evaluated in the execution of the function. So, for
f <- function(x) 2 + x
when you call f(3) you will invoke the function + with two parameters, 2 and x. These evaluate to 2 and 3, respectively, so `+`(2, 3) evaluates to 5, and that is the result of f(3).
When you call the return function -- and remember, this is a function -- you just leave the control-flow of a function early. So,
f <- function(x) {
if (x < 0) return(0)
x + 2
}
works as follows: When you call f, it will call the if function to figure out what to do in the first statement. The if function will evaluate x < 0 (which means calling the function < with parameters x and 0). If x < 0 is true, if will evaluate return(0). If it is false, it will evaluate its else part (which, because if has a special syntax when it comes to functions, isn't shown, but is NULL). If x < 0 is not true, f will evaluate x + 2 and return that. If x < 0 is true, however, the if function will evaluate return(0). This is a call to the function return, with parameter 0, and that call will terminate the execution of f and make the result 0.
Be careful with return. It is a function so
f <- function(x) {
if (x < 0) return;
x + 2
}
is perfectly valid R code, but it will not return when x < 0. The if call will just evaluate to the function return but not call it.
The return function is also a little special in that it can return from the parent call of control structures. Strictly speaking, return isn't evaluated in the frame of f in the examples above, but from inside the if calls. It just handles this specially so it can return from f.
With non-standard evaluation this isn't always the case.
With this function
f <- function(df) {
with(df, if (any(x < 0)) return("foo") else return("bar"))
"baz"
}
you might think that
f(data.frame(x = rnorm(10)))
should return either "foo" or "bar". After all, we return in either case in the if statement. However, the if statement is evaluated inside with and it doesn't work that way. The function will return baz.
For non-local returns like that, you need to use callCC, and then it gets more technical (as if this wasn't technical enough).
If you can, try to avoid return completely and rely on functions returning the last expression they evaluate.
Update
Just to follow up on the comment below about loops. When you call a loop, you will most likely call one of the built-in primitive functions. And, yes, they return NULL. But you can write your own, and they will follow the rule that they return the last expression they evaluate. You can, for example, implement for in terms of while like this:
`for` <- function(itr_var, seq, body) {
itr_var <- as.character(substitute(itr_var))
body <- substitute(body)
e <- parent.frame()
j <- 1
while (j <= length(seq)) {
assign(x = itr_var, value = seq[[j]], envir = e)
eval(body, envir = e)
j <- j + 1
}
"foo"
}
This function will definitely return "foo", so this
for(i in 1:5) { print(i) }
evaluates to "foo". If you want it to return NULL, you have to be explicit about it (or just let the return value be the result of the while loop -- if that is the primitive while it returns NULL).
The point I want to make is that the rule that functions return the last expression they evaluate has to do with how the functions are defined, not how you call them. The loops use non-standard evaluation, so the last expression in the loop body you provide might be the last expression they evaluate, or it might not. For the primitive loops, it is not.
Except for their special syntax, there is nothing magical about loops. They follow the rules all functions follow. With non-standard evaluation it can get a bit tricky to work out from a function call what the last expression they will evaluate might be, because the function body looks like it is what the function evaluates. It is, to a degree, if the function is sensible, but the loop body is not the function body. It is a parameter. If it wasn't for the special syntax, and you had to provide loop bodies as normal parameters, there might be less confusion.

Evaluate expression with local variables

I'm writing a genetic program in order to test the fitness of randomly generated expressions. Shown here is the function to generate the expression as well as the main function. DIV and GT are defined elsewhere in the code:
function create_single_full_tree(depth, fs, ts)
"""
Creates a single AST with full depth
Inputs
depth Current depth of tree. Initially called from main() with max depth
fs Function Set - Array of allowed functions
ts Terminal Set - Array of allowed terminal values
Output
Full AST of typeof()==Expr
"""
# If we are at the bottom
if depth == 1
# End of tree, return function with two terminal nodes
return Expr(:call, fs[rand(1:length(fs))], ts[rand(1:length(ts))], ts[rand(1:length(ts))])
else
# Not end of expression, recursively go back through and create functions for each new node
return Expr(:call, fs[rand(1:length(fs))], create_single_full_tree(depth-1, fs, ts), create_single_full_tree(depth-1, fs, ts))
end
end
function main()
"""
Main function
"""
# Define functional and terminal sets
fs = [:+, :-, :DIV, :GT]
ts = [:x, :v, -1]
# Create the tree
ast = create_single_full_tree(4, fs, ts)
#println(typeof(ast))
#println(ast)
#println(dump(ast))
x = 1
v = 1
eval(ast) # Error out unless x and v are globals
end
main()
I am generating a random expression based on certain allowed functions and variables. As seen in the code, the expression can only have symbols x and v, as well as the value -1. I will need to test the expression with a variety of x and v values; here I am just using x=1 and v=1 to test the code.
The expression is being returned correctly, however, eval() can only be used with global variables, so it will error out when run unless I declare x and v to be global (ERROR: LoadError: UndefVarError: x not defined). I would like to avoid globals if possible. Is there a better way to generate and evaluate these generated expressions with locally defined variables?
Here is an example for generating an (anonymous) function. The result of eval can be called as a function and your variable can be passed as parameters:
myfun = eval(Expr(:->,:x, Expr(:block, Expr(:call,:*,3,:x) )))
myfun(14)
# returns 42
The dump function is very useful to inspect the expression that the parser has created. For two input arguments, you would use a tuple, for example, as args[1]:
julia> dump(parse("(x,y) -> 3x + y"))
Expr
head: Symbol ->
args: Array{Any}((2,))
1: Expr
head: Symbol tuple
args: Array{Any}((2,))
1: Symbol x
2: Symbol y
typ: Any
2: Expr
[...]
Does this help?
In the Metaprogramming part of the Julia documentation, there is a sentence under the eval() and effects section which says
Every module has its own eval() function that evaluates expressions in its global scope.
Similarly, the REPL help ?eval will give you, on Julia 0.6.2, the following help:
Evaluate an expression in the given module and return the result. Every Module (except those defined with baremodule) has its own 1-argument definition of eval, which evaluates expressions in that module.
I assume, you are working in the Main module in your example. That's why you need to have the globals defined there. For your problem, you can use macros and interpolate the values of x and y directly inside the macro.
A minimal working example would be:
macro eval_line(a, b, x)
isa(a, Real) || (warn("$a is not a real number."); return :(throw(DomainError())))
isa(b, Real) || (warn("$b is not a real number."); return :(throw(DomainError())))
return :($a * $x + $b) # interpolate the variables
end
Here, the @eval_line macro does the following:
Main> @macroexpand @eval_line(5, 6, 2)
:(5 * 2 + 6)
As you can see, the values of macro's arguments are interpolated inside the macro and the expression is given to the user accordingly. When the user does not behave,
Main> @macroexpand @eval_line([1,2,3], 7, 8)
WARNING: [1, 2, 3] is not a real number.
:((Main.throw)((Main.DomainError)()))
a user-friendly warning message is provided to the user at parse-time, and a DomainError is thrown at run-time.
Of course, you can do these things within your functions, again by interpolating the variables; you do not need to use macros. However, what you would like to achieve in the end is to combine eval with the output of a function that returns Expr. This is what the macro functionality is for. Finally, you would simply call your macros with an @ sign preceding the macro name:
Main> @eval_line(5, 6, 2)
16
Main> @eval_line([1,2,3], 7, 8)
WARNING: [1, 2, 3] is not a real number.
ERROR: DomainError:
Stacktrace:
[1] eval(::Module, ::Any) at ./boot.jl:235
EDIT 1. You can take this one step further, and create functions accordingly:
macro define_lines(linedefs)
for (name, a, b) in eval(linedefs)
ex = quote
function $(Symbol(name))(x) # interpolate name
return $a * x + $b # interpolate a and b here
end
end
eval(ex) # evaluate the function definition expression in the module
end
end
Then, you can call this macro to create different line definitions in the form of functions to be called later on:
@define_lines([
("identity_line", 1, 0);
("null_line", 0, 0);
("unit_shift", 0, 1)
])
identity_line(5) # returns 5
null_line(5) # returns 0
unit_shift(5) # returns 1
EDIT 2. You can, I guess, achieve what you would like to achieve by using a macro similar to that below:
macro random_oper(depth, fs, ts)
operations = eval(fs)
oper = operations[rand(1:length(operations))]
terminals = eval(ts)
ts = terminals[rand(1:length(terminals), 2)]
ex = :($oper($ts...))
for d in 2:depth
oper = operations[rand(1:length(operations))]
t = terminals[rand(1:length(terminals))]
ex = :($oper($ex, $t))
end
return ex
end
which will give the following, for instance:
Main> @macroexpand @random_oper(1, [+, -, /], [1,2,3])
:((-)([3, 3]...))
Main> @macroexpand @random_oper(2, [+, -, /], [1,2,3])
:((+)((-)([2, 3]...), 3))
Thanks Arda for the thorough response! This helped, but part of me thinks there may be a better way to do this as it seems too roundabout. Since I am writing a genetic program, I will need to create 500 of these ASTs, all with random functions and terminals from a set of allowed functions and terminals (fs and ts in the code). I will also need to test each function with 20 different values of x and v.
In order to accomplish this with the information you have given, I have come up with the following macro:
macro create_function(defs)
for name in eval(defs)
ex = quote
function $(Symbol(name))(x,v)
fs = [:+, :-, :DIV, :GT]
ts = [x,v,-1]
return create_single_full_tree(4, fs, ts)
end
end
eval(ex)
end
end
I can then supply a list of 500 random function names in my main() function, such as ["func1", "func2", "func3", ...], which I can then evaluate with any x and v values in my main function. This has solved my issue; however, it seems to be a very roundabout way of doing this, and it may make it difficult to evolve each AST with each iteration.

Rewrite recursive BNF rule with iteration

Look at the following recursive BNF rule
(1) X = Xa | b
This produces sentences like
X = b
X = ba
X = baa
X = baaa
...
This can be written as
(2) X = b a*
where the right hand side is not recursive
Now take a look at the following recursive BNF rule
(3) X = { X } | b
This produces sentences like
X = b
X = {b}
X = {{b}}
X = {{{b}}}
...
Is there some way to rewrite rule (3) in a non-recursive way, analogous to how we rewrote rule (1) as rule (2)?
Observe that X = {* b }* is no good, since the braces need to be balanced.
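A note on why that attempt fails: rule (3) generates n opening braces, then b, then exactly n closing braces. Matching the two counts is not something repetition with * alone can express, because the language is not regular. A hand-written recognizer can still avoid recursion by counting. Below is a minimal Python sketch of that idea; the function name matches_rule_3 is hypothetical and it only recognizes input, it does not build a parse tree:

```python
def matches_rule_3(s):
    """Return True if s is n '{' characters, then 'b', then n '}' characters."""
    depth = 0
    i = 0
    # Count the opening braces.
    while i < len(s) and s[i] == "{":
        depth += 1
        i += 1
    # Exactly one 'b' must follow.
    if i >= len(s) or s[i] != "b":
        return False
    i += 1
    # Each closing brace cancels one opening brace.
    while i < len(s) and s[i] == "}":
        depth -= 1
        i += 1
    # Accept only if all input was consumed and the braces balanced.
    return i == len(s) and depth == 0

print(matches_rule_3("{{b}}"), matches_rule_3("{b}}"))
# True False
```

The counter plays the role that the call stack plays in a recursive parser, which is why a loop alone is enough here.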
I do not know if the question above is possible to answer. The reason for the question was that I wanted to avoid an infinite loop in my parser (written in Java). One way was to ensure that the BNF rules are not recursive, hence my question. But another way is to keep the recursive rule and avoid the infinite loop inside my (Java) program. It turns out that you can avoid such loops by lazy instantiation.
For instance look at the following rules:
expression = term ('+' term)*;
term = factor ('*' factor)*;
factor = '(' expression ')' | Num;
expression() calls term(), which calls factor(), which calls expression(), thus we can end up with an infinite loop. To avoid that, we can use lazy instantiation, so instead of writing something like:
public Parser expression() {
expression = new ...
return expression;
}
we instead write:
public Parser expression() {
if (expression == null) {
expression = new ...
}
return expression;
}
Observe that you must declare expression as an instance variable to get this to work.
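The same pattern can be sketched in Python to keep the example short and runnable. The Rule and Grammar classes and the builds counter are hypothetical stand-ins for the elided Java parser objects. The sketch assumes the new object is stored in the instance variable before its sub-parsers are requested; otherwise the cycle would still recurse forever:

```python
class Rule:
    """A named grammar node whose children are filled in after creation."""
    def __init__(self, name):
        self.name = name
        self.children = []

class Grammar:
    def __init__(self):
        self._expression = None
        self._term = None
        self._factor = None
        self.builds = 0  # counts how many Rule objects get created

    def expression(self):
        if self._expression is None:
            self._expression = Rule("expression")  # store before recursing
            self.builds += 1
            self._expression.children = [self.term()]
        return self._expression

    def term(self):
        if self._term is None:
            self._term = Rule("term")
            self.builds += 1
            self._term.children = [self.factor()]
        return self._term

    def factor(self):
        if self._factor is None:
            self._factor = Rule("factor")
            self.builds += 1
            # This call returns the already stored expression rule.
            self._factor.children = [self.expression()]
        return self._factor

g = Grammar()
expr = g.expression()
print(g.builds)
# 3
```

Each rule is created once, and the cycle is closed by reference: following expression, term, factor and then the first child again leads back to the same expression object.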
