Magic in the way R evaluates function arguments

Consider the following R code:
y1 <- dataset %>% dplyr::filter(W == 1)
This works, but there seems to be some magic here. Usually, when we have an expression like foo(bar), we should be able to do this:
baz <- bar
foo(baz)
However, in the presented code snippet, we cannot evaluate W == 1 outside of dplyr::filter()! W is not a defined variable.
What's going on?

dplyr uses a concept called Non-standard Evaluation (NSE) to make columns from the data frame argument accessible to its functions without quoting or using dataframe$column syntax. Basically:
[Non-standard evaluation] is a catch-all term that means they don’t follow the usual R rules of evaluation. Instead, they capture the expression that you typed and evaluate it in a custom way. [1]
In this case, the custom evaluation takes the argument(s) given to dplyr::filter and parses them so that W can be used to refer to dataset$W. The reason you can't then take that variable and use it elsewhere is that NSE applies only within the scope of the function.
NSE makes a trade-off: functions that modify scoping are less safe, and can be hard or impossible to use when programming, i.e. when you're building a program in which functions operate on other functions:
This is an example of the general tension between functions that are designed for interactive use and functions that are safe to program with. A function that uses substitute() might reduce typing, but it can be difficult to call from another function. [2]
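To see that capture in action, here is a toy sketch (not dplyr's actual machinery):
f <- function(x) substitute(x)
f(W == 1)
## W == 1
The expression comes back unevaluated, which is why W never needs to exist.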
For example, if you wanted to write a function which would use the same code, but swap out W == 1 for W == 0 (or some completely different filter), NSE would make that more difficult to accomplish.
In 2017 the tidyverse started to build a solution to this in tidy evaluation.
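To illustrate both the difficulty and the fix, here is a sketch; filter_by is a hypothetical wrapper, and {{ }} is the embracing operator from dplyr/rlang's tidy evaluation:
library(dplyr)
dataset <- data.frame(W = c(0, 1, 1), x = 1:3)

# Naive wrapper: fails, because forcing `col` looks up W in the
# calling environment, where it does not exist
filter_by_bad <- function(data, col, val) {
  filter(data, col == val)
}

# Embracing with {{ }} forwards the caller's expression into the data mask
filter_by <- function(data, col, val) {
  filter(data, {{ col }} == val)
}

filter_by(dataset, W, 1)   # same rows as dataset %>% filter(W == 1)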

Related

Function inside function in R

This thread discusses two basic approaches to using functions inside other functions in R: What are the benefits of defining and calling a function inside another function in R?
The top answer says the second approach (naming the helper externally and just calling it by name in the outer function) is faster: "f2 needs to be redefined every time you call f1, which adds some overhead (not very much overhead, but definitely there)". My question is: is this overhead caused by the assignment itself, or by passing through the function itself?
For example, consider this third option besides the two in that thread:
# Approach 1: define the helper inside the outer function
fun1a <- function(x) {
  fun1b <- function(y) { return(y^2) }
  return(fun1b(x))
}

# Approach 2: define the helper externally and call it by name
fun2a <- function(y) { return(y^2) }
fun2b <- function(x) { return(fun2a(x)) }

# Approach 3: an anonymous function, applied immediately
fun3 <- function(x) {
  return((function(y) { return(y^2) })(x))
}
It was confirmed that Approach 2 is faster than Approach 1, because Approach 1 needs to redefine fun1b every time it runs. But if you use Approach 3 (basically Approach 1, but without assigning the inner function to a name every time you run it), is that always faster?
If so, why would anyone not just use Approach 3 for everything? That is, what disadvantages does it have compared to Approach 2 (or 1)?
Some of these (but not all) are already mentioned in the link in the question, but here is a longer list.
Visibility: Functions defined within functions are not visible outside that function, which increases the modularity of the software if the inner function is not also used elsewhere. It provides a sort of poor man's namespace. For example, an alternative to using an anonymous function in an lapply appearing within a function would be to define it as a named function within the outer function, to keep it from being visible outside the outer function. The name might form a sort of documentation for the inner function.
Scope: Functions defined within functions can access variables defined in the outer function without passing them as arguments.
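A minimal sketch of this:
outer <- function(x) {
  scale <- 10                     # defined in the outer function
  inner <- function(y) y * scale  # `scale` is found by lexical scoping
  inner(x)
}
outer(3)
## [1] 30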
Cache: Functions defined within functions and passed back out can use the outer function's environment to cache results, so that they are remembered the next time the passed-out function is run. Here makeIncr is a factory function which constructs a new counter function each time it is run. The counter functions return the next number in sequence each time they are run.
makeIncr <- function(init) function() { init <<- init + 1; init }
counter1 <- makeIncr(0)
counter1()
## [1] 1
counter1()
## [1] 2
counter2 <- makeIncr(0)
counter2()
## [1] 1
Object orientation: Functions defined within functions can be used to emulate a limited form of object orientation; see an example by running demo(scoping).
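For a flavour of this, a toy sketch similar in spirit to the bank-account example in demo(scoping):
account <- function(balance = 0) {
  list(deposit = function(x) { balance <<- balance + x; balance },
       balance = function() balance)
}
a <- account(100)
a$deposit(50)
## [1] 150
a$balance()
## [1] 150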
Debugging: Debugging can be a bit more awkward with functions within functions. For example, debug(makeIncr) above does not debug the counters, which would have to be debugged separately.
I am not sure that the performance issue discussed is really material, since the functions are byte-compiled the first time the outer function is run. In most cases you would want to make the decision based on other factors.
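If you do want to measure it, a quick sketch, assuming the microbenchmark package is installed:
library(microbenchmark)
microbenchmark(fun1a(10), fun2b(10), fun3(10))
The gap comes from rebuilding the inner function on each call and is typically tiny, which is why other factors should dominate the choice.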

Pipe with additional Arguments

I read in several places that pipes in Julia only work with functions that take only one argument. This is not true, since I can do the following:
function power(a, b = 2) a^b end
3 |> power
> 9
and it works fine.
However, I can't completely get my head around the pipe. E.g., why is this not working?
3 |> power()
> MethodError: no method matching power()
What I would actually like to do is use a pipe and pass additional arguments, e.g. keyword arguments, so that it is clear which argument gets the piped value (namely the only positional one):
function power(a; b = 2) a^b end
3 |> power(b = 3)
Is there any way to do something like this?
I know I could work around this with the Pipe package, but to be honest it feels kind of clunky to write @pipe at the start of half of the lines.
In R, the magrittr package has convincing logic (in my opinion): by default it passes what's on the left of the pipe as the first argument to the function on the right. I'm looking for something similar.
power as defined in the first snippet has two methods: one with one argument, one with two. So the point about |> working only with one-argument methods still holds.
The kind of thing you want to do is called "partial application", and it is very common in functional languages. You can always write
3 |> (a -> power(a, 3))
but that gets clunky quickly. Other languages have syntax like power(%1, 3) to denote that lambda. There's been discussion about adding something similar to Julia, but it's difficult to get right. Pipe is exactly the macro-based fix for that.
If you have control over the defined method, you can also implement methods with an interface that return partially applied versions as you like -- many predicates in Base do this already, e.g., ==(1). There's also the option of Base.Fix2(power, 3), but that's not really an improvement, if you ask me (apart from maybe being nicer to the compiler).
And note that magrittr's pipes are also "macro"-based. The difference is that argument passing in R is far more complicated, and you can't see from the outside whether an argument is used as a value or as an expression (essentially, R passes a thunk containing the expression and a pointer to the parent environment, and automatically evaluates and caches it if you use it as a value; see substitute).

Define new operator for tilde / formula

I'm trying to code a new operator, a double tilde ~~, to denote a different kind of formula to be passed on to another function (e.g., mirroring the functionality of ~~ in the lavaan package's model syntax).
The issue is y ~~ x returns y ~ ~x, where the second ~ is returned with the predictors.
I am at a total loss here. It seems ~ is a primitive function .Primitive("~") with no methods, unlike, say, +. So existing tutorials for S3 methods are useless.
Is this a dead end and am I doing something really against the programming language? Or is there an easy solution I am missing?
I guess, if you accept the comment, I can make an answer out of it:
~ is an operator in R, like +, -, /, *. Although it is possible to use many kinds of characters in variable names by wrapping them in backticks `xxx` or quotes "xxx", you then also need to access them the same way (see ?Reserved). (I'm going to use quotes instead of backticks here, but consider backticks for more accepted style.)
R is a functional programming language, and therefore you can access virtually every language statement as a function, e.g. a + b is the same as "+"(a, b). When you write a + b it is just syntactic sugar; language-wise it is translated into a primitive function call with two arguments.
To complicate things, there is an order of evaluation. If you write a ~~ b, it gets parsed as "~"(a, ~b): because ~ is a primitive operator designed as a single character, the second ~ is read as a unary tilde applied to b. You can still define the function "~~" <- function(a, b) {a + b}, but you can only call it directly as "~~"(a, b).
On the other hand, you need to be able to specify what a binary operator looks like. Having defined a function "asdf" <- function(a, b) {a + b} is not enough, and this will not work: a asdf b
R does have a mechanism for defining new binary operators (R: What are operators like %in% called and how can I learn about them?); see the large family of operators used like magrittr's %>% or doParallel's %dopar%. Thus it is better to stick to the binary operator syntax using %, i.e. `%~~%` <- function(a, b) {a + b}. Then you can easily access it with the syntactic sugar a %~~% b.
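Applied to the original question, a minimal sketch might look like this (the class name "double_tilde" is purely illustrative):
`%~~%` <- function(lhs, rhs) {
  # build an ordinary formula from the unevaluated arguments ...
  f <- eval(bquote(.(substitute(lhs)) ~ .(substitute(rhs))))
  # ... and tag it so downstream code can recognise the double tilde
  class(f) <- c("double_tilde", class(f))
  f
}
f <- y %~~% x   # works even though y and x are undefined
class(f)
## [1] "double_tilde" "formula"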
Strange stuff, I agree. As for magic tricks, try this at home: "for"(a, 1:10, {print(a)}). Bonus question: why is a visible in the parent frame?

Create a promise/lazily evaluated expression in R

I would like to create a promise in R programmatically. I know that the language supports it, but for some reason there does not seem to be a way to do this.
To give more detail: I would like to have components of a list lazily evaluated. E.g.
x <- list(node = i, children = promise(some_expensive_function(i)))
I only want to access the second component of the list for very few values of the list. Pre-populating the list with lazy expressions makes for very clear, compact and readable code. The background of this algorithm is a tree search; essentially, I am trying to emulate coroutine behaviour here. Right now I am using closures for this, but the code lacks elegance.
Is there a third-party package that exposes the hidden promise construction mechanism in R? Or is this mechanism explicitly tied to environment bindings rather than expressions?
P.S. Yes, I am aware of delayedAssign. It does not do what I want. Yes, I can juggle around with intermediate environments, but that's also messy.
Any programming language that has first-class functions (including R) can pretty easily implement lazy evaluation through thunks (Wikipedia entry on this).
The basic idea is that functions are not evaluated until they're called, so just wrap the elements of your list in anonymous functions that return their value when called.
delayed <- list(function() 1, function() 2, function() 3)
lapply(delayed, function(x) x())
Those are just numbers wrapped in there, but you can easily place some_expensive_function(i) in there instead to provide the argument but delay evaluation.
Edit: I noticed the remark about using closures just now, so I assume you're using a similar method currently. Can you elaborate on the "inelegance" of it? This is all in the eye of the beholder, but thunking seems fairly straightforward and a lot less boilerplate if you're just looking for lazy evaluation.
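If you also want a promise's force-once-then-cache behaviour, the thunk can memoise its result; a sketch using a closure:
make_thunk <- function(f) {
  forced <- FALSE
  value <- NULL
  function() {
    if (!forced) {
      value <<- f()   # evaluate on first call only
      forced <<- TRUE
    }
    value             # cached from then on
  }
}
children <- make_thunk(function() some_expensive_function(i))
# some_expensive_function(i) runs the first time children() is called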
At the moment your use case is too vague to get my head around. I'm wondering if one of quote, expression or call is what you are asking for:
x <- list(node=i, children=quote(mean(i)) )
x
#----------
$node
[1] 768
$children
mean(i)
#------------
x <- list(node=i, children=call('mean',i))
x
#-------------
$node
[1] 768
$children
mean(768L)
#-----------
x <- list(node=i, children=expression(mean(i)) )
x
#------------
$node
[1] 768
$children
expression(mean(i))
A test of the last one, which obviously evaluates in the globalenv():
eval( x$children)
#[1] 768
I have ended up using environments and delayedAssign for this one.
node <- new.env()          # environments support delayed bindings
node$name <- X[1, 1]
node$level <- names(X)[1]
delayedAssign('subtaxa', split_taxons_lazy(X[-1]), assign.env = node)
node                       # node$subtaxa is evaluated on first access
This works well for my case, and while I would prefer it use lists, it does not seem to be possible in R. Thanks for the comments!

How to not fall into R's 'lazy evaluation trap'

"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B in some way depends on the parameters with which A was called. The dependency is not as easy to see as in the above example, since the calculations are complex and there are multiple parameters.
Overlooking such an issue leads to difficult-to-debug problems, since all calculations run smoothly, except that the result is incorrect. Only an explicit validation of the results reveals the problem.
On top of that, even when I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into because Curry forces evaluations (there is a version that doesn't, but it wouldn't help here).
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
  1:10, function(i) {
    fun.res <- function() print(i)
    environment(fun.res) <- list2env(as.list(environment())) # force parent env copy
    fun.res
  }
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But one simple option is to just get into the habit of forcing everything. If you do this, realize you don't need to actually call force; just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun <- function(x, y, z) {
  x; y; z   # touch each argument to force it up front
  ## code
}
There is some work in progress to improve how R's higher-order functions (the apply functions, Reduce, and such) handle situations like these. Whether this makes it into R 3.2.0, to be released in a few weeks, depends on how disruptive the changes turn out to be. It should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall().
From the online R help documentation:
forceAndCall is intended to help defining higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.
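A minimal sketch of the difference it makes (make_fn is just an illustrative closure factory):
make_fn <- function(i) function() print(i)

funs_bad <- vector("list", 3)
for (i in 1:3) funs_bad[[i]] <- make_fn(i)   # promise for i is never forced
funs_bad[[1]]()
## [1] 3

funs_ok <- vector("list", 3)
for (i in 1:3) funs_ok[[i]] <- forceAndCall(1, make_fn, i)   # first argument forced at call time
funs_ok[[1]]()
## [1] 1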

Resources