I want to draw an environment diagram for the following code, which produces an error, to understand exactly how R evaluates a function.
# emphasize text
emph <- function(f, style = '**') {
  function(...) {
    if (length(style) == 1) {
      paste(style, f(...), style)
    } else {
      paste(style[1], f(...), style[2])
    }
  }
}
# function to be decorated
tmbg <- function() {
  'tmbg are okay'
}
# a decorator function with self-referencing name
tmbg <- emph(tmbg)
I get an error when evaluating the call expression of the decorated function:
tmbg()
> Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
I understand this is related to the lazy evaluation of function parameters in R. It feels like, when evaluating tmbg() in the global frame, the name f used in the returned anonymous function binds again to tmbg in the global frame, which again returns the anonymous function and calls f, leading to infinitely recursive calls. But this picture is not clear to me, because I don't know exactly what evaluation model R uses, especially with respect to this "lazy evaluation".
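Here is a minimal sketch (a toy example of my own) of the behaviour I suspect is at play: an argument seems to be captured as an unevaluated promise that is only resolved, in the right environment, the first time it is used.
make_getter <- function(x) function() x   # x stays an unevaluated promise
g <- make_getter(y)                       # works even though y does not exist yet
y <- 10
g()   # 10: the promise for x is evaluated only now, in the global environment
y <- 20
g()   # still 10: a promise is evaluated at most once and then cached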
Below I draw the essential parts of the environment diagram and explain the evaluation rules Python uses for the equivalent code. I hope to get such an environment diagram for R as well, or at least the same level of clarity about the environment model R uses.
# This is the equivalent python code
def emph(f, style = ['**']):
    def wrapper(*args):
        if len(style) == 1:
            return style[0] + f(*args) + style[0]
        else:
            return style[0] + f(*args) + style[1]
    return wrapper

def tmbg():
    return 'tmbg are okay'

tmbg = emph(tmbg)
tmbg()
When evaluating the assignment statement tmbg = emph(tmbg), the call expression emph(tmbg) needs to be evaluated first. When the call is applied, its formal parameter f binds to the value of the operand tmbg, i.e., to the function object that the name tmbg refers to in the global frame, as shown in the picture below.
Next, after the evaluation of the call expression emph(tmbg) finishes, the returned function wrapper is bound to the name tmbg in the global frame. However, the binding of f to the original tmbg function is still held in the local frame created by the call to emph (f1 in the diagram below).
Therefore, when evaluating tmbg() in the global frame, there is no confusion about which function is the wrapper returned by the decorator (tmbg in the global frame) and which is the function being decorated (f in the local frame). This is the part that differs from R.
It looks like what R does is change the binding from f -> the original tmbg function to f -> the name tmbg in the global frame, which now refers to the returned anonymous function, which in turn calls f and thus leads to this infinite recursion. But it might also be a completely different model in which R never binds f to any object at all, only to the expression tmbg, ignoring what that name currently refers to. When evaluation starts, it looks up the name tmbg, finds the global one created by tmbg <- emph(tmbg), and recurses forever. That sounds strange, though, because the local scope created by the function call would then no longer count (or only partially count) for the purposes of this "lazy evaluation" as soon as we pass an expression as an argument to that function. There would then have to be some mechanism managing names and scopes that runs in parallel to the environments created by function calls.
In either case, the environment model and evaluation rules of R are not clear to me. I want to be clear on these and, if possible, draw an environment diagram for the R code as clear as the one below.
The problem is not understanding environments. The problem is understanding lazy evaluation.
Due to lazy evaluation, f is just a promise that is not evaluated until the anonymous function is run, and by that time tmbg has been redefined. To make f be evaluated when emph itself runs, add the line marked ### below to force it. No other lines are changed.
In terms of environments, the anonymous function gets f from emph, and in emph f is a promise that is not looked up in the caller until the anonymous function is run, unless we add the force statement.
emph <- function(f, style = '**') {
  force(f) ###
  function(...) {
    if (length(style) == 1) {
      paste(style, f(...), style)
    } else {
      paste(style[1], f(...), style[2])
    }
  }
}
# function to be decorated
tmbg <- function() {
  'tmbg are okay'
}
# a decorator function with self-referencing name
tmbg <- emph(tmbg)
tmbg()
## [1] "** tmbg are okay **"
We can look at the promise using the pryr package.
library(pryr)
emph <- function(f, style = '**') {
  str(promise_info(f))
  force(f)
  cat("--\n")
  str(promise_info(f))
  function(...) {
    if (length(style) == 1) {
      paste(style, f(...), style)
    } else {
      paste(style[1], f(...), style[2])
    }
  }
}
# function to be decorated
tmbg <- function() {
  'tmbg are okay'
}
tmbg <- emph(tmbg)
which results in this output, showing that f is at first unevaluated, but after force is invoked it contains the value of f. Had we not used force, the anonymous function would have accessed f in the state shown in the first promise_info() output, so all it would know is the symbol tmbg and where to look it up (the global environment).
List of 4
$ code : symbol tmbg
$ env :<environment: R_GlobalEnv>
$ evaled: logi FALSE
$ value : NULL
--
List of 4
$ code : symbol tmbg
$ env : NULL
$ evaled: logi TRUE
$ value :function ()
..- attr(*, "srcref")= 'srcref' int [1:8] 1 13 3 5 13 5 1 3
.. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x00000000102c3730>
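The same mechanics can be sketched with base R's delayedAssign, which creates a promise explicitly; this is a rough analogy of my own rather than exactly what happens to function arguments.
tmbg <- function() 'tmbg are okay'
delayedAssign("p", tmbg)   # p is a promise for the symbol tmbg, not for its current value
tmbg <- function() 'redefined'
p()                        # "redefined": the promise is evaluated only here, after the rebinding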
I ran into an issue trying to use %dopar% and foreach() together with an R6 class. Searching around, I could only find two resources related to this: an unanswered SO question and an open GitHub issue on the R6 repository.
In one comment on the GitHub issue, a workaround is suggested: reassign the parent_env of the class with SomeClass$parent_env <- environment(). I would like to understand what exactly environment() refers to when this expression (i.e., SomeClass$parent_env <- environment()) is evaluated within the %dopar% of foreach.
Here is a minimal reproducible example:
Work <- R6::R6Class("Work",
  public = list(
    values = NULL,
    initialize = function() {
      self$values <- "some values"
    }
  )
)
Now, the following Task class uses the Work class in the constructor.
Task <- R6::R6Class("Task",
  private = list(
    ..work = NULL
  ),
  public = list(
    initialize = function(time) {
      private$..work <- Work$new()
      Sys.sleep(time)
    }
  ),
  active = list(
    work = function() {
      return(private$..work)
    }
  )
)
In the Factory class, the Task class is created and the foreach is implemented in ..m.thread().
Factory <- R6::R6Class("Factory",
  private = list(
    ..warehouse = list(),
    ..amount = NULL,
    ..parallel = NULL,
    ..m.thread = function(object, ...) {
      cluster <- parallel::makeCluster(parallel::detectCores() - 1)
      doParallel::registerDoParallel(cluster)
      private$..warehouse <- foreach::foreach(1:private$..amount, .export = c("Work")) %dopar% {
        # What exactly does `environment()` encapsulate in this context?
        object$parent_env <- environment()
        object$new(...)
      }
      parallel::stopCluster(cluster)
    },
    ..s.thread = function(object, ...) {
      for (i in 1:private$..amount) {
        private$..warehouse[[i]] <- object$new(...)
      }
    },
    ..run = function(object, ...) {
      if (private$..parallel) {
        private$..m.thread(object, ...)
      } else {
        private$..s.thread(object, ...)
      }
    }
  ),
  public = list(
    initialize = function(object, ..., amount = 10, parallel = FALSE) {
      private$..amount = amount
      private$..parallel = parallel
      private$..run(object, ...)
    }
  ),
  active = list(
    warehouse = function() {
      return(private$..warehouse)
    }
  )
)
Then, it is called as:
library(foreach)
x = Factory$new(Task, time = 2, amount = 10, parallel = TRUE)
Without the line object$parent_env <- environment(), it throws an error (as mentioned in the other two links): Error in { : task 1 failed - "object 'Work' not found".
I would like to know (1) what some potential pitfalls are when assigning the parent_env inside foreach, and (2) why it works in the first place.
Update 1:
I returned environment() from within foreach(), so that private$..warehouse captures those environments.
Using rlang::env_print() in a debug session (i.e., with a browser() statement placed right after foreach has finished executing), here is what they consist of:
Browse[1]> env_print(private$..warehouse[[1]])
# <environment: 000000001A8332F0>
# parent: <environment: global>
# bindings:
# * Work: <S3: R6ClassGenerator>
# * ...: <...>
Browse[1]> env_print(environment())
# <environment: 000000001AC0F890>
# parent: <environment: 000000001AC20AF0>
# bindings:
# * private: <env>
# * cluster: <S3: SOCKcluster>
# * ...: <...>
Browse[1]> env_print(parent.env(environment()))
# <environment: 000000001AC20AF0>
# parent: <environment: global>
# bindings:
# * private: <env>
# * self: <S3: Factory>
Browse[1]> env_print(parent.env(parent.env(environment())))
# <environment: global>
# parent: <environment: package:rlang>
# bindings:
# * Work: <S3: R6ClassGenerator>
# * .Random.seed: <int>
# * Factory: <S3: R6ClassGenerator>
# * Task: <S3: R6ClassGenerator>
Disclaimer: a lot of what I say here is educated guesswork and inference based on what I know; I can't guarantee everything is 100% correct.
I think there can be many pitfalls, and which one applies really depends on what you do. I think your second question is more important, because if you understand that, you'll be able to evaluate some of the pitfalls by yourself.
The topic is rather complex, but you can probably start by reading about R's lexical scoping. In essence, R has a sort of hierarchy of environments, and when R code is executed, variables whose values are not found in the current environment (which is what environment() returns) are sought in the parent environments (not to be confused with the caller environments).
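As a rough sketch of that last point: a free variable is looked up in the function's enclosing environment, not in the caller's environment.
x <- 1
f <- function() x      # x is free in f, so it is looked up where f was defined
g <- function() {
  x <- 100             # a local x in the caller does not affect f
  f()
}
g()   # 1: f's enclosing environment is the global environment, not g's frame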
Based on the GitHub issue you linked, R6 generators save a "reference" to their parent environments, and they expect that everything their classes may need can be found in said parent or somewhere along the environment hierarchy, starting at that parent and going "up".
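For example, assuming the generator exposes its enclosing environment as $parent_env (which is what the workaround in your question relies on), something like this should hold:
library(R6)
Work <- R6Class("Work", public = list(values = NULL))
# The generator remembers where it was created; methods of its instances
# look up free variables starting from this environment.
identical(Work$parent_env, globalenv())   # TRUE when Work is defined at top level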
The reason the workaround you're using works is that you're replacing the generator's parent environment with the one from the current foreach call inside the parallel worker (which may be a different R process, not necessarily a different thread). Given that your .export specification probably exports the necessary values, R's lexical scoping can then find missing values starting from the foreach call in the separate thread/process.
For the specific example you linked, I found that a simpler way to make it work (at least on my Linux machine) is to do the following:
library(doParallel)
cluster <- parallel::makeCluster(parallel::detectCores() - 1)
doParallel::registerDoParallel(cluster)
parallel::clusterExport(cluster, setdiff(ls(), "cluster"))
x = Factory$new(Task, time = 1, amount = 3)
but leaving the ..m.thread function as:
..m.thread = function(object, amount, ...) {
  private$..warehouse <- foreach::foreach(1:amount) %dopar% {
    object$new(...)
  }
}
(and manually call stopCluster when done).
The clusterExport call should have semantics similar to*: take everything from the main R process' global environment except cluster, and make it available in each parallel worker's global environment. That way, any code inside the foreach call can use the generators when lexical scoping reaches their respective global environments.
foreach can be clever and exports some variables automatically (as shown in the GitHub issue), but it has limitations, and the hierarchy used during lexical scoping can get very messy.
*I say "similar to" because I don't know what exactly R does to distinguish (global) environments if forks are used,
but since that export is needed,
I assume they are indeed independent of each other.
PS: I'd use a call to on.exit(parallel::stopCluster(cluster)) if you create workers inside a function call; that way, if an error occurs, you avoid leaving processes around until they are somehow stopped.
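A rough sketch of that pattern (with made-up function and argument names) might look like this:
library(doParallel)   # also attaches foreach and parallel
run_parallel <- function(n) {
  cluster <- parallel::makeCluster(2)
  on.exit(parallel::stopCluster(cluster))   # the cluster is stopped even if an error occurs below
  doParallel::registerDoParallel(cluster)
  foreach(i = seq_len(n), .combine = c) %dopar% i^2
}
run_parallel(4)   # 1 4 9 16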
In the function shown below, there is no return. However, after executing it, I can confirm that the value was assigned to d normally.
There is no return statement. Any suggestions in this regard would be appreciated.
Code
#installed plotly, dplyr
accumulate_by <- function(dat, var) {
  var <- lazyeval::f_eval(var, dat)
  lvls <- plotly:::getLevels(var)
  dats <- lapply(seq_along(lvls), function(x) {
    cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
  })
  dplyr::bind_rows(dats)
}
d <- txhousing %>%
  filter(year > 2005, city %in% c("Abilene", "Bay Area")) %>%
  accumulate_by(~date)
In the function, the last expression evaluated is dplyr::bind_rows(dats), and that is what gets returned; we don't need an explicit return statement. If there were two objects to be returned, we could place them in a list, as shown below.
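A small sketch (my own example, not from the code above):
summarise_vec <- function(x) {
  list(mean = mean(x), sd = sd(x))   # the list is the last expression, so it is the return value
}
res <- summarise_vec(c(1, 2, 3))
res$mean   # 2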
In some languages, like Python, generators are used for memory efficiency: they yield values one at a time instead of creating the whole output in memory. Consider two functions in Python.
def get_square(n):
    result = []
    for x in range(n):
        result.append(x**2)
    return result
When we run it
get_square(4)
#[0, 1, 4, 9]
The same function can be written as a generator. Instead of returning anything, it yields each value:
def get_square(n):
    for x in range(n):
        yield(x**2)
Running the function
get_square(4)
#<generator object get_square at 0x0000015240C2F9E8>
By casting with list, we get the same output
list(get_square(4))
#[0, 1, 4, 9]
There is always a return :) You just don't have to be explicit about it.
All R expressions return something. Including control structures and user-defined functions. (Control-structures are just functions, by the way, so you can just remember that everything is a value or a function call, and everything evaluates to a value).
For functions, the return value is the last expression evaluated in the execution of the function. So, for
f <- function(x) 2 + x
when you call f(3) you will invoke the function + with two parameters, 2 and x. These evaluate to 2 and 3, respectively, so `+`(2, 3) evaluates to 5, and that is the result of f(3).
When you call the return function -- and remember, this is a function -- you just leave the control-flow of a function early. So,
f <- function(x) {
  if (x < 0) return(0)
  x + 2
}
works as follows: When you call f, it will call the if function to figure out what to do in the first statement. The if function will evaluate x < 0 (which means calling the function < with parameters x and 0). If x < 0 is false, it will evaluate its else part (which, because if has special syntax rather than the usual function-call syntax, isn't written here, and defaults to NULL), and f will then evaluate x + 2 and return that. If x < 0 is true, however, the if function will evaluate return(0). This is a call to the function return, with parameter 0, and that call will terminate the execution of f and make the result 0.
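To make the "if is a call that evaluates to a value" point concrete, here is a small sketch:
x <- if (TRUE) "yes" else "no"   # the if call evaluates to "yes"
y <- if (FALSE) "yes"            # no else part, so the call evaluates to NULL
is.null(y)                       # TRUE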
Be careful with return. It is a function so
f <- function(x) {
  if (x < 0) return;
  x + 2
}
is perfectly valid R code, but it will not return when x < 0. The if call will just evaluate to the function return but not call it.
The return function is also a little special in that it can return from the parent call of control structures. Strictly speaking, return isn't evaluated in the frame of f in the examples above, but from inside the if calls. It just handles this case specially so it can return from f.
With non-standard evaluation this isn't always the case.
With this function
f <- function(df) {
  with(df, if (any(x < 0)) return("foo") else return("bar"))
  "baz"
}
you might think that
f(data.frame(x = rnorm(10)))
should return either "foo" or "bar". After all, we return in either case in the if statement. However, the if statement is evaluated inside with, and it doesn't work that way. The function will return "baz".
For non-local returns like that, you need to use callCC, and then it gets more technical (as if this wasn't technical enough).
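A rough sketch of how callCC could be used for such a non-local exit (my own illustration of the idea, not a recommendation):
f <- function(df) {
  callCC(function(exit) {
    with(df, if (any(x < 0)) exit("foo") else exit("bar"))
    "baz"
  })
}
f(data.frame(x = -1))   # "foo"
f(data.frame(x = 1))    # "bar"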
If you can, try to avoid return completely and rely on functions returning the last expression they evaluate.
Update
Just to follow up on the comment below about loops. When you call a loop, you will most likely call one of the built-in primitive functions. And, yes, they return NULL. But you can write your own, and they will follow the rule that they return the last expression they evaluate. You can, for example, implement for in terms of while like this:
`for` <- function(itr_var, seq, body) {
  itr_var <- as.character(substitute(itr_var))
  body <- substitute(body)
  e <- parent.frame()
  j <- 1
  while (j <= length(seq)) {
    assign(x = itr_var, value = seq[[j]], envir = e)
    eval(body, envir = e)
    j <- j + 1
  }
  "foo"
}
This function will definitely return "foo", so this
for(i in 1:5) { print(i) }
evaluates to "foo". If you want it to return NULL, you have to be explicit about it (or just let the return value be the result of the while loop; if that is the primitive while, it returns NULL).
The point I want to make is that the rule "functions return the last expression they evaluate" has to do with how the functions are defined, not how you call them. The loops use non-standard evaluation, so the last expression in the loop body you provide might or might not be the last value they evaluate. For the primitive loops, it is not.
Except for their special syntax, there is nothing magical about loops. They follow the rules all functions follow. With non-standard evaluation it can get a bit tricky to work out from a function call what the last expression they will evaluate might be, because the function body looks like it is what the function evaluates. It is, to a degree, if the function is sensible, but the loop body is not the function body. It is a parameter. If it wasn't for the special syntax, and you had to provide loop bodies as normal parameters, there might be less confusion.
I have one function inside another like this:
func2 <- function(x=1) {ko+x+1}
func3 <- function(l=1){
  ko <- 2
  func2(2)+l
}
func3(1)
it shows the error: Error in func2(2) : object 'ko' not found. Basically, I want to use the object ko in func2, but it will not be defined until func3 is called. Is there any fix for this?
Yes, it can be fixed:
func2 <- function(x=1) {ko+x+1}
func3 <- function(l=1){
  ko <- 2
  assign("ko", ko, environment(func2))
  res <- func2(2)+l
  rm("ko", envir = environment(func2))
  res
}
func3(1)
#[1] 6
As you see this is pretty complicated. That's often a sign that you are not following good practice. Good practice would be to pass ko as a parameter:
func2 <- function(x=1, ko) {ko+x+1}
func3 <- function(l=1){
  ko <- 2
  func2(2, ko)+l
}
func3(1)
#[1] 6
You don't really have one function "inside" the other currently (you are just calling one function from within another function). If you did move the one function inside the other, then this would work:
func3 <- function(l=1) {
  func2 <- function(x=1) {ko+x+1}
  ko <- 2
  func2(2)+l
}
func3(1)
Functions retain information about the environment in which they were defined. This is called "lexical scoping" and it's how R operates.
But in general I agree with @Roland that it's better to write functions that have explicit arguments.
This is a good case for learning about closures and using a factory.
func3_factory <- function(y) {
  ko <- y
  func2 <- function(x = 1) { ko + x + 1 }
  function(l = 1) { func2(2) + l }
}
ko <- 1
func3_ko_1 <- func3_factory(ko)
ko <- 7
func3_ko_7 <- func3_factory(ko)
# each function stores its own value for ko
func3_ko_1(1) # 5
func3_ko_7(1) # 11
# changing ko in the global scope doesn't affect the internal ko values in the closures
ko <- 100
func3_ko_1(1) # 5
func3_ko_7(1) # 11
When func3_factory returns a function, that new function is coupled with the environment in which it was created, which in this case includes a variable named ko, holding whatever value was passed into the factory, and a function named func2, which can also access that fixed value of ko. This combination of a function and the environment it was defined in is called a closure. Anything that happens inside the returned function can access these values, and they stay the same even if that ko variable is changed outside the closure.
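If you want to see this directly, you can, for example, inspect the environment attached to each returned function:
environment(func3_ko_1)$ko   # 1: the ko captured when the factory was called
environment(func3_ko_7)$ko   # 7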
I would like to get the environment created by a function when it is run WITHOUT modifying the function's source (i.e., from outside of the function). Is this possible?
fn <- function()
{
  # Here a new environment is created at each call; how can I get it?
  # This environment can be accessed with environment(), but (as far as I know)
  # only from inside the function.
  ...
}
I would like something like this:
env=some_function(fn())
where env is the environment created by fn during that call.
You could trace the function to bind the call environment to a symbol in the global environment:
fn <- function() {x <- 2; 1}
trace(fn, quote(efn <<- environment()), at = 1)
fn()
#Tracing fn() step 1
#[1] 1
untrace(fn)
efn$x
#[1] 2
S4 classes allow you to define validity checks using validObject() or setValidity(). However, this does not appear to work for ReferenceClasses.
I have tried adding assert_that() or if (badness) stop(message) clauses to the $initialize() method of a ReferenceClass. However, when I simulate loading the package (using devtools::load_all()), it must try to create some prototype class because the initialize method executes and fails (because no fields have been set).
What am I doing wrong?
Implement a validity method on the reference class
A <- setRefClass("A", fields = list(x = "numeric", y = "numeric"))
setValidity("A", function(object) {
  if (length(object$x) != length(object$y)) {
    "x, y lengths differ"
  } else NULL
})
and invoke the validity method explicitly
> validObject(A())
[1] TRUE
> validObject(A(x=1:5, y=5:1))
[1] TRUE
> validObject(A(x=1:5, y=5:4))
Error in validObject(A(x = 1:5, y = 5:4)) :
invalid class "A" object: x, y lengths differ
Unfortunately, validObject() would need to be called explicitly, e.g., as the penultimate line of an initialize method or constructor.
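For example, a sketch along those lines (my own, following the pattern described above) might be:
A <- setRefClass("A",
  fields = list(x = "numeric", y = "numeric"),
  methods = list(
    initialize = function(...) {
      callSuper(...)       # assigns the fields passed in ...
      validObject(.self)   # run the validity check on construction
    }
  )
)
setValidity("A", function(object) {
  if (length(object$x) != length(object$y)) "x, y lengths differ" else NULL
})
A(x = 1:5, y = 5:1)      # ok
## A(x = 1:5, y = 5:4)   # would signal: invalid class "A" object: x, y lengths differ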
Ok so you can do this in initialize. It should have the form:
initialize = function(...) {
  # If no arguments were supplied (e.g., when a prototype object is
  # created during package loading), skip the checks entirely
  if (!nargs()) return()
  # Capture arguments in a list
  args <- list(...)
  # If the field name is passed to the initialize function,
  # then check whether it is valid and assign it. Otherwise
  # assign a zero-length value (character if field_name has
  # that type)
  if (!is.null(args$field_name)) {
    assert_that(check_field_name(args$field_name))
    field_name <<- args$field_name
  } else {
    field_name <<- character()
  }
  # Make sure you callSuper as this will then assign other
  # fields included in ... that weren't already specially
  # processed like `field_name`
  callSuper(...)
}
This is based on the strategy set out in the lme4 package.