"R passes promises, not values. The promise is forced when it is first evaluated, not when it is passed.", see this answer by G. Grothendieck. Also see this question referring to Hadley's book.
In simple examples such as
> funs <- lapply(1:10, function(i) function() print(i))
> funs[[1]]()
[1] 10
> funs[[2]]()
[1] 10
it is possible to take such unintuitive behaviour into account.
However, I find myself frequently falling into this trap during daily development. I follow a rather functional programming style, which means that I often have a function A returning a function B, where B depends in some way on the parameters with which A was called. The dependency is not as easy to see as in the above example, since calculations are complex and there are multiple parameters.
Overlooking such an issue leads to difficult-to-debug problems, since all calculations run smoothly, except that the result is incorrect. Only an explicit validation of the results reveals the problem.
On top of that, even when I have noticed such a problem, I am never really sure which variables I need to force and which I don't.
How can I make sure not to fall into this trap? Are there any programming patterns that prevent this or that at least make sure that I notice that there is a problem?
You are creating functions with implicit parameters, which isn't necessarily best practice. In your example, the implicit parameter is i. Another way to rework it would be:
library(functional)
myprint <- function(x) print(x)
funs <- lapply(1:10, function(i) Curry(myprint, i))
funs[[1]]()
# [1] 1
funs[[2]]()
# [1] 2
Here, we explicitly specify the parameters to the function by using Curry. Note we could have curried print directly but didn't here for illustrative purposes.
Curry creates a new version of the function with parameters pre-specified. This makes the parameter specification explicit and avoids the potential issues you are running into because Curry forces evaluations (there is a version that doesn't, but it wouldn't help here).
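To illustrate why forcing at creation time avoids the trap, here is a rough base-R sketch. `curry_sketch` is a hypothetical helper written for this example, not the functional package's Curry; it only assumes that collecting the arguments with list(...) evaluates them immediately.

```r
# Hypothetical helper (not functional::Curry): pre-fill arguments of f.
curry_sketch <- function(f, ...) {
  fixed <- list(...)  # list(...) evaluates the supplied arguments right away
  function(...) do.call(f, c(fixed, list(...)))
}

funs <- list()
for (k in 1:3) {
  funs[[k]] <- curry_sketch(identity, k)
}

first <- funs[[1]]()  # 1, even though k is 3 by the time this runs
third <- funs[[3]]()  # 3
```

Because the value of k is stored in `fixed` when the closure is built, later changes to k cannot leak into the returned functions.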
Another option is to capture the entire environment of the parent function, copy it, and make it the parent env of your new function:
funs2 <- lapply(
1:10, function(i) {
fun.res <- function() print(i)
environment(fun.res) <- list2env(as.list(environment())) # force parent env copy
fun.res
}
)
funs2[[1]]()
# [1] 1
funs2[[2]]()
# [1] 2
but I don't recommend this since you will be potentially copying a whole bunch of variables you may not even need. Worse, this gets a lot more complicated if you have nested layers of functions that create functions. The only benefit of this approach is that you can continue your implicit parameter specification, but again, that seems like bad practice to me.
As others pointed out, this might not be the best style of programming in R. But, one simple option is to just get into the habit of forcing everything. If you do this, realize you don't need to actually call force, just evaluating the symbol will do it. To make it less ugly, you could make it a practice to start functions like this:
myfun<-function(x,y,z){
x;y;z;
## code
}
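The effect of forcing can be checked with a small sketch. The names `make_lazy` and `make_forced` are made up for illustration; a for loop is used so that the example does not depend on how a particular R version's lapply treats its arguments.

```r
# Factory without forcing: i stays an unevaluated promise until first use.
make_lazy <- function(i) function() i

# Factory with forcing: i is evaluated while the factory itself runs.
make_forced <- function(i) {
  force(i)
  function() i
}

lazy <- list()
forced <- list()
for (k in 1:3) {
  lazy[[k]] <- make_lazy(k)
  forced[[k]] <- make_forced(k)
}

lazy_first <- lazy[[1]]()      # 3: the promise looks up k only now
forced_first <- forced[[1]]()  # 1: the value was fixed at creation time
```

Writing `i` on a line of its own instead of `force(i)` has the same effect; force() mainly documents the intent.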
There is some work in progress to improve how R's higher-order functions, such as the apply functions and Reduce, handle situations like these. Whether this makes it into R 3.2.0, to be released in a few weeks, depends on how disruptive the changes turn out to be. This should become clear in a week or so.
R has a function that helps safeguard against lazy evaluation, in situations like closure creation: forceAndCall().
From the online R help documentation:
forceAndCall is intended to help defining higher order functions like apply to behave more reasonably when the result returned by the function applied is a closure that captured its arguments.
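A minimal sketch of how this might be used in a hand-written higher-order function follows. `naive_map` and `forcing_map` are hypothetical names, and the example assumes an R version that provides forceAndCall(n, FUN, ...) (3.2.0 or later); the help page describes the function as experimental.

```r
# Hypothetical map-like helper that calls f directly.
naive_map <- function(xs, f) {
  out <- vector("list", length(xs))
  for (j in seq_along(xs)) out[[j]] <- f(xs[[j]])
  out
}

# Same helper, but the first argument of f is forced before f's body runs.
forcing_map <- function(xs, f) {
  out <- vector("list", length(xs))
  for (j in seq_along(xs)) out[[j]] <- forceAndCall(1, f, xs[[j]])
  out
}

make_adder <- function(a) function(b) a + b

naive <- naive_map(1:3, make_adder)
forcing <- forcing_map(1:3, make_adder)

naive_result <- naive[[1]](10)      # 13: a is evaluated late, when j is 3
forcing_result <- forcing[[1]](10)  # 11: a was forced to 1 at call time
```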
Here is a piece of R code that writes to each element of a matrix in a reference class. It runs incredibly slowly, and I’m wondering if I’ve missed a simple trick that will speed this up.
nx = 2000
ny = 10
ref_matrix <- setRefClass(
"ref_matrix", fields = list(data = "matrix")
)
out <- ref_matrix(data = matrix(0.0,nx,ny))
#tracemem(out$data)
for (iy in 1:ny) {
for (ix in 1:nx) {
out$data[ix,iy] <- ix + iy
}
}
It seems that each write to an element of the matrix triggers a check that involves a copy of the entire matrix. (Uncommenting the tracemem() call shows this.) Now, I’ve found a discussion that seems to confirm this:
https://r-devel.r-project.narkive.com/8KtYICjV/rd-copy-on-assignment-to-large-field-of-reference-class
and this also seems to be covered by Speeding up field access in R reference classes
but in both of these, this behaviour can be bypassed by not declaring a class for the field. This works for the example in the first link, which uses a 1D vector, b, that can just be set as b <<- 1:10000. But I’ve not found an equivalent way of creating a 2D array without using an explicit “matrix” instance.
Am I just missing something simple, or is this actually not possible?
Let me add a couple of things. First, I’m very new to R, so could easily have missed something. Second, I’m really just curious about the way reference classes work in this case and whether there’s a simple way to use them efficiently; I’m not looking for a really fast way to set the elements of a matrix - I can do that by not having the matrix in a reference class at all, and if I really care about speed I can write a C routine to do it and call it from R.
Here’s some background that might explain why I’m interested in this, which you’re welcome to ignore.
I got here by wanting to see how different languages, and even different compiler options and different ways of coding the same operation, compared for efficiency when accessing 2D rectangular arrays. I’ve been playing with a test program that creates two 2D arrays of the same size, and calls a subroutine that sets the first to the elements of the second plus their index values. (Almost any operation would do, but this one isn’t completely trivial to optimise.) I have this in a number of languages now, C, C++, Julia, Tcl, Fortran, Swift, etc., even hand-coded assembler (spoiler alert: assembler isn’t worth the effort any more) and thought I’d try R. The obvious implementation in R passes the two arrays to a subroutine that does the work, but because R doesn’t normally pass by reference, that routine has to make a copy of the modified array and return that as the function value. I thought using a reference class would avoid the relatively minor overhead of that copy, so I tried that and was surprised to discover that, far from speeding things up, it slowed them down enormously.
Use outer:
out$data <- outer(1:nx, 1:ny, `+`)
Also, don't use reference classes (or R6 classes) unless you actually need reference semantics. KISS and all that.
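As a quick check, the sketch below compares the element-by-element loop from the question with a single vectorised outer() call, using a plain matrix instead of a reference class. Rows are indexed by ix and columns by iy, as in the question; the smaller nx is an arbitrary choice to keep the example fast.

```r
nx <- 200
ny <- 10

# Element-by-element fill, as in the question, on an ordinary matrix.
looped <- matrix(0.0, nx, ny)
for (iy in seq_len(ny)) {
  for (ix in seq_len(nx)) {
    looped[ix, iy] <- ix + iy
  }
}

# One vectorised call: element [ix, iy] is ix + iy.
vectorised <- outer(seq_len(nx), seq_len(ny), `+`)

same_dims <- identical(dim(looped), dim(vectorised))
same_values <- isTRUE(all.equal(looped, vectorised))
```

The argument order matters: the first argument of outer() determines the rows and the second the columns.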
Some R functions, like nrow, make R copy the object AFTER the function call, while others, like sum, don't.
For example the following code:
x = as.double(1:1e8)
system.time(x[1] <- 100)
y = sum(x)
system.time(x[1] <- 200) ## Fast (takes 0s), after calling sum
foo = function(x) {
return(sum(x))
}
y = foo(x)
system.time(x[1] <- 300) ## Slow (takes 0.35s), after calling foo
Calling foo is NOT slow, because x isn't copied. However, changing x again is very slow, as x is copied. My guess is that calling foo will leave a reference to x, so when changing it after, R makes another copy.
Does anyone know why R does this, even when the function doesn't change x at all? Thanks.
I definitely recommend Hadley's Advanced R book, as it digs into some of the internals that you will likely find interesting and relevant. Most relevant to your question (and as mentioned by @joran and @lmo), the reason for the slow-down was an additional reference that forced copy-on-modify.
An excerpt that might be beneficial from Memory#Modification:
There are two possibilities:
R modifies x in place.
R makes a copy of x to a new location, modifies the copy, and then
uses the name x to point to the new location.
It turns out that R can do either depending on the circumstances. In
the example above, it will modify in place. But if another variable
also points to x, then R will copy it to a new location. To explore
what’s going on in greater detail, we use two tools from the pryr
package. Given the name of a variable, address() will tell us the
variable’s location in memory and refs() will tell us how many names
point to that location.
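The two-name case from the excerpt can be sketched in base R without pryr. The tracemem() part is guarded because it assumes an R build with memory profiling enabled; whether a particular function call leaves an extra reference behind varies between R versions, so this only shows the basic copy-on-modify rule.

```r
x <- c(1, 2, 3)
y <- x        # two names now point at the same vector
x[1] <- 100   # R copies before modifying, so y is left untouched

x_first <- x[1]  # 100
y_first <- y[1]  # 1

# Optional: let R report the copy, where memory profiling is available.
z <- c(1, 2, 3)
if (capabilities("profmem")) {
  invisible(tracemem(z))
  z_alias <- z
  z[1] <- 100   # a tracemem line is expected here, reporting the copy
  untracemem(z)
}
```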
Also of interest are the sections on R's C interface and Performance. The pryr package also has tools for working with these sorts of internals in an easier fashion.
One last note from Hadley's book (same Memory section) that might be helpful:
While determining that copies are being made is not hard, preventing
such behaviour is. If you find yourself resorting to exotic tricks to
avoid copies, it may be time to rewrite your function in C++, as
described in Rcpp.
There are several questions on how to avoid using eval(parse(...))
r-evalparse-is-often-suboptimal
avoiding-the-infamous-evalparse-construct
Which sparks the questions:
Why Specifically should eval(parse()) be avoided?
And most importantly, What are the dangers?
Are there any dangers if the code is not used in production? (I'm thinking, any danger of getting back unintended results. Clearly if you are not careful about what you are parsing, you will have issues. But is that any more dangerous than being sloppy with get()?)
Most of the arguments against eval(parse(...)) arise not because of security concerns, after all, no claims are made about R being a safe interface to expose to the Internet, but rather because such code is generally doing things that can be accomplished using less obscure methods, i.e. methods that are both quicker and more human parse-able. The R language is supposed to be high-level, so the preference of the cognoscenti (and I do not consider myself in that group) is to see code that is both compact and expressive.
So the danger is that eval(parse(..)) is a backdoor method of getting around lack of knowledge and the hope in raising that barrier is that people will improve their use of the R language. The door remains open but the hope is for more expressive use of other features. Carl Witthoft's question earlier today illustrated not knowing that the get function was available, and the question he linked to exposed a lack of understanding of how the [[ function behaved (and how $ was more limited than [[). In both cases an eval(parse(..)) solution could be constructed, but it was clunkier and less clear than the alternative.
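A small sketch of the two alternatives mentioned above, with made-up data: looking up a column by name with [[ and looking up a variable by name with get(), each next to its eval(parse()) equivalent.

```r
df <- data.frame(height = c(150, 160, 170))
col <- "height"

# Column chosen at run time: string pasting versus [[ indexing.
via_parse <- eval(parse(text = paste0("df$", col)))
via_index <- df[[col]]

# Variable chosen at run time: string parsing versus get().
answer <- 42
via_parse_var <- eval(parse(text = "answer"))
via_get <- get("answer")
```

Both pairs give the same values; the [[ and get() versions state the intent directly and do not involve building code as text.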
The security concerns only really arise if you start calling eval on strings that another user has passed to you. This is a big deal if you are creating an application that runs R in the background, but for data analysis where you are writing code to be run by yourself, then you shouldn't need to worry about the effect of eval on security.
There are some other problems with eval(parse(...)), though.
Firstly, code using eval-parse is usually much harder to debug than non-parsed code, which is problematic because debugging software is twice as difficult as writing it in the first place.
Here's a function with a mistake in it.
std <- function()
{
mean(1to10)
}
Silly me, I've forgotten about the colon operator and created my vector wrongly. If I try and source this function, then R notices the problem and throws an error, pointing me at my mistake.
Here's the eval-parse version.
ep <- function()
{
eval(parse(text = "mean(1to10)"))
}
This will source, because the error is inside a valid string. It is only later, when we come to run the code that the error is thrown. So by using eval-parse, we've lost the source-time error checking capability.
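The difference in timing can be made visible with tryCatch(). This is only a sketch; the strings returned by the error handlers are placeholders chosen for the example.

```r
# Defining the function succeeds, because the broken code is just a string.
ep <- function() {
  eval(parse(text = "mean(1to10)"))
}
defined_ok <- is.function(ep)

# The error only surfaces when the function is actually called.
run_result <- tryCatch(ep(), error = function(e) "failed at run time")

# Handing the same text to the parser directly rejects it immediately,
# which is what happens when the non-string version is sourced.
parse_result <- tryCatch(
  parse(text = "mean(1to10)"),
  error = function(e) "failed at parse time"
)
```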
I also think that this second version of the function is much more difficult to read.
The other problem with eval-parse is that it is much slower than directly executed code. Compare
system.time(for(i in seq_len(1e4)) mean(1:10))
user system elapsed
0.08 0.00 0.07
and
system.time(for(i in seq_len(1e4)) eval(parse(text = "mean(1:10)")))
user system elapsed
1.54 0.14 1.69
Usually there's a better way of 'computing on the language' than working with code-strings; evalparse heavy-code needs a lot of safe-guarding to guarantee a sensible output, in my experience.
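As a rough illustration of computing on the language without strings, the sketch below builds the same call three ways. The variable names are invented for the example.

```r
fn_name <- "mean"
values <- 1:10

# String-based: paste code together, parse it, evaluate it.
via_string <- eval(parse(text = paste0(fn_name, "(values)")))

# Language-object based: construct the call mean(values) directly.
built_call <- call(fn_name, quote(values))
via_call <- eval(built_call)

# Often no explicit call object is needed at all.
via_do_call <- do.call(fn_name, list(values))
```

All three give 5.5 here, but only the first one can produce a syntax error at run time.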
The same task can usually be solved by working on R code as a language object directly; Hadley Wickham has a useful guide on meta-programming in R here:
The defmacro() function in the gtools library is my favourite substitute (no half-assed R pun intended) for the evalparse construct
require(gtools)
# both action_to_take & predicate will be subbed with code
F <- defmacro(predicate, action_to_take, expr =
if(predicate) action_to_take)
F(1 != 1, action_to_take = print('arithmetic doesnt work!'))
F(pi > 3, action_to_take = return('good!'))
[1] 'good!'
# the raw code for F
print(F)
function (predicate = stop("predicate not supplied"), action_to_take = stop("action_to_take not supplied"))
{
tmp <- substitute(if (predicate) action_to_take)
eval(tmp, parent.frame())
}
<environment: 0x05ad5d3c>
The benefit of this method is that you are guaranteed to get back syntactically-legal R code. More on this useful function can be found here:
Hope that helps!
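The substitute() and eval() pattern visible in the printed body of F can also be tried in base R without gtools. `when` is a hypothetical name, and this sketch leaves out the argument checking that the printed defaults provide.

```r
# Hypothetical base-R analogue of the macro body shown above.
when <- function(predicate, action) {
  # Splice the unevaluated argument expressions into an if statement...
  tmp <- substitute(if (predicate) action)
  # ...and evaluate the resulting code where `when` was called.
  eval(tmp, parent.frame())
}

hit <- when(pi > 3, "good!")               # "good!"
miss <- when(1 != 1, stop("never runs"))   # NULL; the action is not evaluated
```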
In some programming languages, eval() is a function which evaluates
a string as though it were an expression and returns a result; in
others, it executes multiple lines of code as though they had been
included instead of the line including the eval. The input to eval is
not necessarily a string; in languages that support syntactic
abstractions (like Lisp), eval's input will consist of abstract
syntactic forms.
http://en.wikipedia.org/wiki/Eval
There are all kinds of exploits that one can take advantage of if eval is used improperly.
An attacker could supply a program with the string
"session.update(authenticated=True)" as data, which would update the
session dictionary to set an authenticated key to be True. To remedy
this, all data which will be used with eval must be escaped, or it
must be run without access to potentially harmful functions.
http://en.wikipedia.org/wiki/Eval
In other words, the biggest danger of eval() is the potential for code injection into your application. The use of eval() can also cause performance issues in some languages, depending on what it is being used for.
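In R terms, the injection risk might look like the sketch below. The "user input" strings are invented, and `choose_summary` is a hypothetical helper showing one possible safeguard: looking the name up in a fixed list instead of evaluating it.

```r
threshold <- 10
user_text <- "threshold <- -1"  # pretend this string came from another user

# Evaluating the string runs whatever it contains.
eval(parse(text = user_text))
threshold_after <- threshold    # -1: the input changed our variable

# Safer sketch: accept only names from a fixed list of functions.
allowed <- list(mean = mean, max = max)
choose_summary <- function(name) {
  if (!name %in% names(allowed)) stop("unsupported summary: ", name)
  allowed[[name]]
}

ok <- choose_summary("max")(c(1, 5, 3))  # 5
rejected <- tryCatch(
  choose_summary("system('ls')"),
  error = function(e) "rejected"
)
```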
Specifically in R, it's probably because you can often use get() in place of eval(parse()) and get the same result without having to resort to eval().
I would like to create a promise in R programmatically. I know that the language supports it. But for some reason, there does not seem a way to do this.
To give more detail: I would like to have components of a list lazily evaluated. E.g.
x <- list(node=i, children=promise(some_expensive_function(i)))
I only want to access the second component of the list for very few values of the list. Pre-populating the list with lazy expressions results in very clear, compact and readable code. The background of this algorithm is a tree search. Essentially, I am trying to emulate coroutine behaviour here. Right now I am using closures for this, but the code lacks elegance.
Is there a third-party package that exposes the hidden promise construction mechanism in R? Or is this mechanism explicitly tied to environment bindings rather than expressions?
P.S. Yes, I am aware of delayedAssign. It does not do what I want. Yes, I can juggle around with intermediate environments, but it's also messy.
Any programming language that has first-class functions (including R) can pretty easily implement lazy evaluation through thunks (Wikipedia entry on this).
The basic idea is that functions are not evaluated until they're called, so just wrap the elements of your list in anonymous functions that return their value when called.
delayed <- list(function() 1, function() 2, function () 3)
lapply(delayed, function(x) x())
Those are just numbers wrapped in there, but you can easily place some_expensive_function(i) in there instead to provide the argument but delay evaluation.
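One way to make such a thunk behave more like a promise (evaluated at most once) is to lean on R's own lazy arguments. In this sketch `delay` and `expensive` are hypothetical names, and the counter exists only to show when the work happens.

```r
calls <- 0
expensive <- function(i) {  # stand-in for some_expensive_function
  calls <<- calls + 1
  i * 2
}

# The argument is a promise; the returned function forces it on first use
# and R caches the value afterwards.
delay <- function(expr) function() expr

x <- list(node = 21, children = delay(expensive(21)))

calls_before <- calls    # 0: nothing evaluated yet
first <- x$children()    # 42, computed now
second <- x$children()   # 42, taken from the cached promise
calls_after <- calls     # 1
```

Note that the delayed expression is still subject to the lazy-evaluation trap: any variable it mentions is looked up when the thunk is first called, not when the list is built.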
Edit: noticed the using closures thing just now, so I assume you're using a similar method currently. Can you elaborate on the "inelegance" of it? This is all eye-of-the-beholder, but thunking seems fairly straightforward and a lot less boilerplate if you're just looking for lazy evaluation.
At the moment your use case is too vague to get my head around. I'm wondering if one of quote, expression or call is what you are asking for:
x <- list(node=i, children=quote(mean(i)) )
x
#----------
$node
[1] 768
$children
mean(i)
#------------
x <- list(node=i, children=call('mean',i))
x
#-------------
$node
[1] 768
$children
mean(768L)
#-----------
x <- list(node=i, children=expression(mean(i)) )
x
#------------
$node
[1] 768
$children
expression(mean(i))
A test of the last one, evaluated (obviously) in globalenv():
eval( x$children)
#[1] 768
I have ended up using environments and delayedAssign for this one.
node <- new.env()
node$name <- X[1, 1]
node$level <- names(X)[1]
delayedAssign('subtaxa', split_taxons_lazy(X[-1]), assign.env=node)
node
This works well for my case, and while I would prefer to use lists, that does not seem to be possible in R. Thanks for the comments!
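A self-contained sketch of the same approach, with made-up data in place of X and split_taxons_lazy: `slow_children` is a hypothetical stand-in, and the counter only demonstrates that the value is computed on first access and then reused.

```r
calls <- 0
slow_children <- function() {  # stand-in for the expensive computation
  calls <<- calls + 1
  c("a", "b")
}

node <- new.env()
node$name <- "root"
# Bind a promise in the environment; nothing is computed yet.
delayedAssign("subtaxa", slow_children(), assign.env = node)

calls_before <- calls    # 0
first <- node$subtaxa    # forces the promise
second <- node$subtaxa   # reuses the stored value
calls_after <- calls     # 1
```

Listing the names with ls(node) does not force the promise, whereas converting the environment with as.list(node) would.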