lapply() emptied list step by step while processing - r

First of all, excuse me for the bad title. I'm still so confused about this behavior, that I wasn't able to describe it; however I was able to reproduce it and broke it down to an (goofy) example.
Please, could you be so kind and explain why other.list appears to be full of NULLs after calling lapply()?
some.list <- rep(list(rnorm(1)),33)
other.list <- rep(list(), length = 33)
lapply(seq_along(some.list), function(i, other.list) {
other.list[[i]] <- some.list[[i]]
browser()
}, other.list)
I watched this in debugging mode in RStudio. For certain i, other.list[[i]] gets some.list[[i]] assigned, but it will be NULLed for the next iteration. I want to understand this behavior so bad!

The reason is that the assignment is taking place inside a function, and you've used the normal assignment operator <-, rather than the superassignment operator <<-. When inside a function scope, IOW when a function is executed, the normal assignment operator always assigns to a local variable in the evaluation environment that is created for that particular evaluation of that function (returned by a call to environment() from inside the function with fun=NULL). Thus, your global other.list variable, which is defined in the global environment (returned by globalenv()), will not be touched by such an assignment. The superassignment operator, on the other hand, will follow the closure environment chain (can be followed recursively via parent.env()) back until it finds a variable with the name on the LHS of the assignment, and then it assigns to that. The global environment is always at the base of the closure environment chain. If no such variable is found, the superassignment operator creates one in the global environment.
Thus, if you change <- to <<- in the assignment that takes place inside the function, you will be able to modify the global other.list variable.
See https://stat.ethz.ch/R-manual/R-devel/library/base/html/assignOps.html.
Here, I tried to make a little demo to demonstrate these concepts. In all my assignments, I'm assigning the actual environment that contains the variable being assigned to:
oldGlobal <- environment(); ## environment() is same as globalenv() in global scope
(function() {
newLocal1 <- environment(); ## creates a new local variable in this function evaluation's evaluation environment
print(newLocal1); ## <environment: 0x6014cbca8> (different for every evaluation)
oldGlobal <<- parent.env(environment()); ## target search hits oldGlobal in closure environment; RHS is same as globalenv()
newGlobal1 <<- globalenv(); ## target search fails; creates a new variable in the global environment
(function() {
newLocal2 <- environment(); ## creates a new local variable in this function evaluation's evaluation environment
print(newLocal2); ## <environment: 0x6014d2160> (different for every evaluation)
newLocal1 <<- parent.env(environment()); ## target search hits the existing newLocal1 in closure environment
print(newLocal1); ## same value that was already in newLocal1
oldGlobal <<- parent.env(parent.env(environment())); ## target search hits oldGlobal two closure environments up in the chain; RHS is same as globalenv()
newGlobal2 <<- globalenv(); ## target search fails; creates a new variable in the global environment
})();
})();
oldGlobal; ## <environment: R_GlobalEnv>
newGlobal1; ## <environment: R_GlobalEnv>
newGlobal2; ## <environment: R_GlobalEnv>

I haven't run your code, but two observations:
I usually avoid putting browser() as the last line inside a function because that gets treated as the return value
other.list does not get modified by your lapply. You need to understand the basics of environments and that any bindings you make inside lapply do not hold outside of it. It's a design feature and the whole point is that lapply can't have side effects - you should only use its return value. You can either use the <<- operator instead of <- though I don't recommend that, or you can use the assign function instead. Or you can do it properly the way lapply is meant to be used:
others.list <- lapply(seq_along(some.list), function(i, other.list) {
some.list[[i]]
})
Note that it's generally recommended to not make assignments inside lapply that change variables outside of it. lapply is meant to perform a function on every element and return a list, and that list should be all that lapply is used for

Related

Where are function constants stored if a function is created inside another function?

I am using a parent function to generate a child function by returning the function in the parent function call. The purpose of the parent function is to set a constant (y) in the child function. Below is a MWE. When I try to debug the child function I cannot figure out in which environment the variable is stored in.
power=function(y){
return(function(x){return(x^y)})
}
square=power(2)
debug(square)
square(3)
debugging in: square(3)
debug at #2: {
return(x^y)
}
Browse[2]> x
[1] 3
Browse[2]> y
[1] 2
Browse[2]> ls()
[1] "x"
Browse[2]> find('y')
character(0)
If you inspect the type of an R function, you’ll observe the following:
> typeof(square)
[1] "closure"
And that is, in fact, exactly the answer to your question: a closure is a function that carries an environment around.
R also tells you which environment this is (albeit not in a terribly useful way):
> square
function(x){return(x^y)}
<environment: 0x7ffd9218e578>
(The exact number will differ with each run — it’s just a memory address.)
Now, which environment does this correspond to? It corresponds to a local environment that was created when we executed power(2) (a “stack frame”). As the other answer says, it’s now the parent environment of the square function (in fact, in R every function, except for certain builtins, is associated with a parent environment):
> ls(environment(square))
[1] "y"
> environment(square)$y
[1] 2
You can read more about environments in the chapter in Hadley’s Advanced R book.
Incidentally, closures are a core feature of functional programming languages. Another core feature of functional languages is that every expression is a value — and, by implication, a function’s (return) value is the value of its last expression. This means that using the return function in R is both unnecessary and misleading!1 You should therefore leave it out: this results in shorter, more readable code:
power = function (y) {
function (x) x ^ y
}
There’s another R specific subtlety here: since arguments are evaluated lazily, your function definition is error-prone:
> two = 2
> square = power(two)
> two = 10
> square(5)
[1] 9765625
Oops! Subsequent modifications of the variable two are reflected inside square (but only the first time! Further redefinitions won’t change anything). To guard against this, use the force function:
power = function (y) {
force(y)
function (x) x ^ y
}
force simply forces the evaluation of an argument name, nothing more.
1 Misleading, because return is a function in R and carries a slightly different meaning compared to procedural languages: it aborts the current function exectuion.
The variable y is stored in the parent environment of the function. The environment() function returns the current environment, and we use parent.env() to get the parent environment of a particular environment.
ls(envir=parent.env(environment())) #when using the browser
The find() function doesn't seem helpful in this case because it seems to only search objects that have been attached to the global search path (search()). It doesn't try to resolve variable names in the current scope.

Auto assignment

Recently I have been working with a set of R scripts that I inherited from a colleague. It is for me a trusted source but more than once I found in his code auto-assignments like
x <<- x
Is there any scope where such an operation could make sense?
This is a mechanism for copying a value defined within a function into the global environment (or at least, somewhere within the stack of parent of environments): from ?"<<-"
The operators ‘<<-’ and ‘->>’ are normally only used in functions,
and cause a search to be made through parent environments for an
existing definition of the variable being assigned. If such a
variable is found (and its binding is not locked) then its value
is redefined, otherwise assignment takes place in the global
environment.
I don't think it's particularly good practice (R is a mostly-functional language, and it's generally better to avoid function side effects), but it does do something. (#Roland points out in comments and #BrianO'Donnell in his answer [quoting Thomas Lumley] that using <<- is good practice if you're using it to modify a function closure, as in demo(scoping). In my experience it is more often misused to construct global variables than to work cleanly with function closures.)
Consider this example, starting in an empty/clean environment:
f <- function() {
x <- 1 ## assignment
x <<- x ## global assignment
}
Before we call f():
x
## Error: object 'x' not found
Now call f() and try again:
f()
x
## [1] 1
<<-
is a global assignment operator and I would imagine there should hardly ever be a reason to use it because it effectively causes side effects.
The scope to use it would be in any case when one wants to define a global variable or a variable one level up from current environment.
Alan gives a good answer: Use the superassignment operator <<- to write upstairs.
Hadley also gives a good answer: How do you use "<<-" (scoping assignment) in R?.
For details on the 'superassignment' operator see Scope.
Here is some critical information on the operator from the section on Assignment Operators in the R manual:
"The operators <<- and ->> are normally only used in functions, and cause a search to be made through parent environments for an existing definition of the variable being assigned. If such a variable is found (and its binding is not locked) then its value is redefined, otherwise assignment takes place in the global environment."
Thomas Lumley sums it up nicely: "The good use of superassignment is in conjuction with lexical scope, where an environment stores state for a function or set of functions that modify the state by using superassignment."
For example:
x <- NA
test <- function(x) {
x <<- x
}
> test(5)
> x
#[1] 5
That's a simple use here, <<- will do a parent environment search (case of nested functions declarations) and if not found assign in the global environment.
Usually this is a really bad ideaTM as you have no real control on where the variable will be assigned and you have chances it will overwrite a variable used for another purpose somewhere.

Why does rm inside a function not delete objects?

rel.mem <- function(nm) {
rm(nm)
}
I defined the above function rel.mem -- takes a single argument and passes it to rm
> ls()
[1] "rel.mem"
> x<-1:10
> ls()
[1] "rel.mem" "x"
> rel.mem(x)
> ls()
[1] "rel.mem" "x"
Now you can see what I call rel.mem x is not deleted -- I know this is due to the incorrect environment on which rm is being attempted.
What is a good fix for this?
Criteria for a good fix:
The caller should not have to pass the environment
The callee (rel.mem) should be able to determine the environment by using an R language facility (call stack inspection, aspects, etc.)
The interface of the function rel.mem should be kept simple -- idiot proof: call rel.mem -- then rel.mem takes it from there -- no need to pass environments.
NOTES:
As many commenters have pointed out that one easy fix is to pass the environment.
What I meant by a good fix [and I should have clarified it] is that the callee function (in this case rel.mem) is able to calculate/find out the environment when the caller was referring to and then remove the object from the right environment.
The type of reasoning in "2" can be done in other languages by inspecting the call stack -- for example in Java I would throw a dummy exception -- catch it and then parse the call stack. In other languages still I could use Aspect Oriented techniques. The question is can something like that be done in R?
As one commenter has suggested that there may be multiple objects with the same name and thus the "right" environment is meaningless -- as I've stated above that in other languages it is possible (sometimes with some creative trickery) to interpret the call-stack -- this may not be possible in R
As one commenter has suggested that rm(list=nm, envir = parent.frame()) will remove this from the parent environment. This is correct -- however I'm looking for something that will work for an arbitrary call depth.
The quick answer is that you're in a different environment - essentially picture the variables in a box: you have a box for the function and one for the Global Environment. You just need to tell rm where to find that box.
So
rel_mem <- function(nm) {
# State the environment
rm(list=nm, envir = .GlobalEnv )
}
x = 10
rel_mem("x")
Alternatively, you can use the pos argument, e.g.
rel_mem <- function(nm) {
rm(list=nm, pos=1 )
}
If you type search() you will see a vector of environments, the global is number 1.
Another two options are
envir = parent.frame() if you want to go one level up the call stack
Use inherits = TRUE to go up the call stack until you find something
In the above code, notice that I'm passing the object as a character - I'm passing the "x" not x. We can be clever and avoid this using the substitute function
rel_mem <- function(nm) {
rm(list = as.character(substitute(nm)), envir = .GlobalEnv )
}
To finish I'll just add that deleting things in the .GlobalEnv from a function is generally a bad idea.
Further resources:
Environments:http://adv-r.had.co.nz/Environments.html
Substitute function: http://adv-r.had.co.nz/Computing-on-the-language.html#capturing-expressions
If you are using another function to find the global objects within your function such as ls(), you must state the environment in it explicitly too:
rel_mem <- function(nm) {
# State the environment in both functions
rm(list = ls(envir = .GlobalEnv) %>% .[startsWith(., "plot_")], envir = .GlobalEnv)
}

Assign to an environment by reference id (i.e. without passing env. to child functions)

Programmers often uses multiple small functions inside of larger functions. Along the way we may want to collect things in an environment for later reference. We could create an environment with new.env(hash=FALSE) and pass that along to the smaller functions and assign with assign. Well and dandy. I was wondering if we could use the reference id of the environment and not pass it along to the child functions but still assign to the environment by reference the environment id.
So here I make
myenv <- new.env(hash=FALSE)
## <environment: 0x00000000588cc918>
And as typical could assign like this if I passed along to the child functions the environment.
assign("elem1", 35, myenv)
myenv[["elem1"]]
# 35
What I want is to make the environment in the parent function and pass the reference id along instead so I want to do something like:
assign("elem2", 123, "0x00000000588cc918")
But predictably results in:
## Error in as.environment(pos) :
## no item called "0x00000000588cc918" on the search list
Is it possible to pass along just the environment id and use that instead? This seems cleaner than passing the environment from function to function and returning as a list and then operating on the environment in that list...and maybe more memory efficient too.
I would want to also access this environment by reference as well.
Environments are not like lists. Passing an environment to a function does not copy its contents even if the contents of the environment are modified within the function so you don't have to worry about inefficiency. Also, when an environment is passed to a function which modifies its contents the contents are preserved even after the function completes so unlike the situation with lists there is no need to pass the environment back.
For example, the code below passes environment e to function f and f modifies the contents of it but does not pass it back. After f completes the caller sees the change.
f <- function(x, env) {
env$X <- x
TRUE
}
e <- new.env()
f(1, e)
## [1] TRUE
e$X
## [1] 1
More about enviorments in Hadely's book: http://adv-r.had.co.nz/Environments.html

Query of List of Functions

I have the following example code from a course in coursera
makeVector <- function(x = numeric()) {
m <- NULL
set <- function(y) {
x <<- y
m <<- NULL
}
get <- function() x
setmean <- function(mean) m <<- mean
getmean <- function() m
list(set = set, get = get,
setmean = setmean,
getmean = getmean)
}
However , i dont understand the significant of the last list function. Can some one explain it ? Thank you.
The final call to list() builds a new list object and returns it from makeVector() (because it's the last statement in the function).
The new list object is populated with four components: set, get, setmean, and getmean. The value of each of these four components is populated with a function that was defined dynamically during the execution of the makeVector() function.
All four functions are a little bit special in that they all will end up sharing a pointer to the execution environment that was in effect at the time the makeVector() function was executing for that particular invocation of itself. Because they variously read and write the variables x and m that were local to that invocation, they all end up sharing those variables as pseudo-private variables. This is sort of an implementation of the functional pattern of object-oriented programming, where temporary local variables are closured by a new set of dynamically defined functions within a temporary scope. This pattern is seen in various languages, including R, Perl, and JavaScript.
Also note that the writability of the shared variables only works because the superassignment operator (<<-) was used to assign to them from the scope of the dynamic mutator functions (set() and setmean()). If the normal assignment operator (<-) had been used, then a new local variable would be temporarily created in the scope of the dynamic mutator function itself at the time it would be called, which would not affect the closure variables.

Resources