remove a temporary environment variable and release memory in R

I am working on a job in which a temporary hash table is used repeatedly inside a loop. The hash table is represented by an environment in R. The problem is that, as the loop proceeds, memory usage keeps rising no matter how I try to delete the table (I tried rm() and gc(), but neither was able to free the memory). As a consequence I cannot complete a very long loop, say 10M iterations. It looks like a memory leak, but I have failed to find a solution elsewhere. What is the correct way to completely remove an environment and release all the memory it previously occupied? Thanks in advance for checking the problem for me.
Here is a very simple example. I am using Windows 8 and R version 3.1.0.
fun <- function() {
  H <- new.env()
  for (i in rnorm(100000)) {
    H[[as.character(i)]] <- rnorm(100)
  }
  rm(list = names(H), envir = H, inherits = FALSE)
  rm(H)
  gc()
}

for (k in 1:5) {
  print(k)
  fun()
  gc()
  print(memory.size(F))
}
[1] 1
[1] 40.43
[1] 2
[1] 65.34
[1] 3
[1] 82.56
[1] 4
[1] 100.22
[1] 5
[1] 120.36

Environments in R are not a good choice for situations where the keys can vary a lot during the computation. The reason is that environments require keys to be symbols, and symbols are not garbage collected. So each run of your function is adding to the internal symbol table. Arranging for symbols to be garbage collected would be one possibility, though care would be needed since a lot of internals code assumes they are not. Another option would be to create better hash table support so environments don't have to try to serve this purpose for which they were not originally designed.
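As a practical workaround (not part of the answer above), a hash table that keys on character strings rather than symbols avoids growing the symbol table. A minimal sketch, assuming the fastmap package is installed:

library(fastmap)

fun <- function() {
  H <- fastmap()                  # keys are stored as strings, not symbols
  for (i in rnorm(100000)) {
    H$set(as.character(i), rnorm(100))
  }
  H$reset()                       # drop all entries
  invisible(NULL)
}

for (k in 1:5) {
  fun()
  gc()
  print(memory.size(F))           # memory use should stay roughly flat
}

Because nothing here interns the keys as symbols, the memory used by each run can be reclaimed by the garbage collector once the map goes out of scope.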

Related

Finalizer in R modifying objects - thread safe?

In an R package, I manage some external objects. I want to release them after the corresponding R object (an environment) has been garbage collected. There is a reason why I cannot release the external resources right away, so I need to somehow record that the object has been garbage collected and use this information at a later time. The approach is implemented basically as follows:
Environment objects are created with an ID. I have an environment managingEnv where I collect information about the finalized objects in a vector of their IDs. When the objects are finalized, the finalizer writes their IDs into managingEnv.
managingEnv <- new.env(parent = emptyenv())
managingEnv$garbageCollectedIds <- c()
createExternalObject <- function(id) {
  ret <- new.env(parent = emptyenv())
  ret$id <- id
  reg.finalizer(ret, function(e) {
    managingEnv$garbageCollectedIds <- c(managingEnv$garbageCollectedIds, e$id)
  })
  ret
}
If I then create some objects and run the garbage collection, the approach seems to work: at the end all IDs are collected in managingEnv, and I could later perform my action to release all these objects.
> createExternalObject(1)
<environment: 0x000002307ff36920>
> createExternalObject(2)
<environment: 0x000002307e94d390>
> createExternalObject(3)
<environment: 0x000002307e94ac90>
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  568005 30.4    1299461 69.4  1299461 69.4
Vcells 1519057 11.6    8388608 64.0  2044145 15.6
> managingEnv$garbageCollectedIds
[1] 2 1 3
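For completeness, a hypothetical sketch of the deferred cleanup step; releaseExternal() is a placeholder for whatever actually frees an external resource:

releaseCollectedObjects <- function() {
  ids <- managingEnv$garbageCollectedIds
  managingEnv$garbageCollectedIds <- c()
  for (id in ids) {
    releaseExternal(id)   # placeholder: free the external resource for this ID
  }
  invisible(ids)
}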
Although the approach seems to work, I have encountered some instability in my package and R crashes sometimes randomly. After some research into the problem, I came to the conclusion that the approach that I use might be not safe.
The documentation of reg.finalizer says:
Note:
R's interpreter is not re-entrant and the finalizer could be run
in the middle of a computation. So there are many functions which
it is potentially unsafe to call from ‘f’: one example which
caused trouble is ‘options’. Finalizers are scheduled at garbage
collection but only run at a relatively safe time thereafter.
Is my approach really safe? Can I change this code to make it safe(r), or find another solution to the problem described above?
EDIT: I found the reasons for the instability in other parts of the code. So it seems like the approach described above is safe. I would still be interested in more details about how to know which operations are "potentially unsafe" in finalizers, to be able to reason that the approach described here is safe.

Hide data in global environment

Is there a function that allows for data to be hidden in the global environment but it can still be accessed?
For example, I have a very long script with 100+ lines and my global environment is looking messy; there is too much in it, and it strains my brain to find what is necessary.
I have searched similar questions and they involve creating a package, which, quite frankly, I have no time to learn at this moment.
If you name all the objects you don't want to appear in the global environment starting with a dot (.), for example .foo <- 'bar', the object will be accessible but hidden in the global environment and in any ls() call:
> .foo <- 'bar'
> .foo
[1] "bar"
> ls()
character(0)
>
Possible solutions would be:
- remove objects once they're not needed any more
- put related variables into a list (hello lapply, sapply)
- move them into a separate custom environment created with new.env() (see the sketch after this list)
- simplify the script so it doesn't use as many objects
- run the script in batch mode and skip worrying about what the environment looks like altogether
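A minimal sketch of the "separate custom environment" idea from the list above (run in a fresh session, so ls() only shows the container):

scratch <- new.env(parent = emptyenv())
scratch$tmp1 <- rnorm(10)                    # intermediate objects live in scratch
scratch$tmp2 <- lm(mpg ~ wt, data = mtcars)

ls()            # only "scratch" appears in the global environment
ls(scratch)     # the intermediate objects are listed here instead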
This sounds like an XY problem: 100 lines is quite small, and it's likely that you are using too many temporary variables or numbering objects that should be in a list.
You also don't mention why you don't like your environment to be "messy"; my guess is that you don't like the output of ls()?
Then maybe you'll be happy to learn about the pattern argument of ls(), which allows you to filter the result. It is mostly useful with prefixes or suffixes, as in the following examples:
something <- 1
some_var <- 2
another_var <- 3
ls(pattern ="^some")
#> [1] "some_var" "something"
ls(pattern="var$")
#> [1] "another_var" "some_var"
Created on 2019-11-17 by the reprex package (v0.3.0)

mclapply returns NULL randomly

When I am using mclapply, from time to time (really randomly) it gives incorrect results. The problem is quite thoroughly described in other posts across the Internet, e.g. (http://r.789695.n4.nabble.com/Bug-in-mclapply-td4652743.html). However, no solution is provided. Does anyone know how to fix this problem? Thank you!
The problem reported by Winston Chang that you cite appears to have been fixed in R 2.15.3. There was a bug in mccollect that occurred when assigning the worker results to the result list:
if (is.raw(r)) res[[which(pid == pids)]] <- unserialize(r)
This fails if unserialize(r) returns a NULL, since assigning a NULL to a list in this way deletes the corresponding element of the list. This was changed in R 2.15.3 to:
if (is.raw(r)) # unserialize(r) might be null
    res[which(pid == pids)] <- list(unserialize(r))
which is a safe way to assign an unknown value to a list.
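A tiny illustration of why the change matters: assigning NULL with [[<- drops the list element, while wrapping the value in list() keeps a NULL placeholder in place.

res <- vector("list", 3)
res[[2]] <- NULL        # element 2 is removed; the list shrinks
length(res)             # 2

res <- vector("list", 3)
res[2] <- list(NULL)    # element 2 stays and holds NULL
length(res)             # 3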
So if you're using R <= 2.15.2, the solution is to upgrade to R >= 2.15.3. If you have a problem using R >= 2.15.3, then presumably it's a different problem than the one reported by Winston Chang.
I also read over the issues discussed in the R-help thread started by Elizabeth Purdom. Without a specific test case, my guess is that the problem is not due to a bug in mclapply because I can reproduce the same symptoms with the following function:
work <- function(i, poison) {
  if (i == poison) quit(save = 'no')
  i
}
If a worker started by mclapply dies while executing a task for any reason (receiving a signal, seg faulting, exiting), mclapply will return a NULL for all of the tasks that were assigned to that worker:
> library(parallel)
> mclapply(1:4, work, 3, mc.cores=2)
[[1]]
NULL
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
In this case, NULL's were returned for tasks 1 and 3 due to prescheduling, even though only task 3 actually failed.
If a worker dies when using a function such as parLapply or clusterApply, an error is reported:
> cl <- makePSOCKcluster(3)
> parLapply(cl, 1:4, work, 3)
Error in unserialize(node$con) : error reading from connection
I've seen many such reports, and I think they tend to happen in large programs that use lots of packages that are hard to turn into reproducible test cases.
Of course, in this example, you'll also get an error when using lapply, although the error won't be hidden as it is with mclapply. If the problem doesn't seem to happen when using lapply, it may be because the problem rarely occurs, so it only happens in very large runs that are executed in parallel using mclapply. But it is also possible that the error occurs, not because the tasks are executed in parallel, but because they are executed by forked processes. For example, various graphics operations will fail when executed in a forked process.
I'm adding this answer so others hitting this question won't have to wade through the long thread of comments (I am the bounty granter but not the OP).
mclapply initially populates the list it creates with NULLs. As the worker processes return values, these values overwrite the NULLs. If a process dies without ever returning a value, mclapply will return a NULL for it.
When memory becomes low, the Linux out-of-memory killer (OOM killer, https://lwn.net/Articles/317814/) will start silently killing processes. It does not print anything to the console to let you know what it's doing, although the OOM killer's activities show up in the system log. In this situation the output of mclapply will appear to have been randomly contaminated with NULLs.
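If the root cause cannot be removed, a defensive wrapper can at least surface the problem. This is a minimal sketch, not taken from the answers above, and it assumes that a legitimate task result is never NULL (otherwise those tasks would be re-run unnecessarily):

library(parallel)

safe_mclapply <- function(X, FUN, ..., mc.cores = 2L) {
  res <- mclapply(X, FUN, ..., mc.cores = mc.cores)
  failed <- vapply(res, is.null, logical(1))     # NULLs hint at dead workers
  if (any(failed)) {
    warning(sum(failed), " task(s) returned NULL; re-running them sequentially")
    res[failed] <- lapply(X[failed], FUN, ...)   # sequential retry
  }
  res
}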

Examining contents of .rdata file by attaching into a new environment - possible?

I am interested in listing objects in an RDATA file and loading only selected objects, rather than the whole set (in case some may be big or may already exist in the environment). I'm not quite clear on how to do this when there are conflicts in names, as attach() doesn't work as nicely.
1: For examining the contents of an R data file without loading it: This question is similar to, but different from, the one asked at listing contents of an R data file without loading
In that case, the solution offered was:
attach(filename)
ls(pos = 2)
detach()
If there are naming conflicts between objects in the file and those in the global environment, this warning appears:
The following object(s) are masked _by_ '.GlobalEnv':
I tried creating a new environment, but I cannot seem to attach into that.
For instance, this produces the same error:
lsfile <- function(filename) {
  tmpEnv <- new.env()
  evalq(attach(filename), envir = tmpEnv)
  tmpls <- ls(pos = 2)
  detach()
  return(tmpls)
}
lsfile(filename)
Maybe I've made a mess of things with evalq (or eval). Is there some other way to avoid the naming conflict?
2: If I want to access an object - if there are no naming conflicts, I can just work with the one from the .rdat file, or copy it to a new one. If there are conflicts, how does one access the object in the file's namespace?
For instance, if my file is "sample.rdat", and the object is surveyData, and a surveyData object already exists in the global environment, then how can I access the one from the file:sample.rdat namespace?
I currently solve this problem by loading everything into a temporary environment, and then copy out what's needed, but this is inefficient.
Since this question has just been referenced let's clarify two things:
attach() simply calls load() so there is really no point in using it instead of load
if you want selective access to prevent masking it's much easier to simply load the file into a new environment:
e = local({load("foo.RData"); environment()})
You can then use ls(e) and access contents like e$x. You can still use attach on the environment if you really want it on the search path.
FWIW .RData files have no index (the objects are stored in one big pairlist), so you can't list the contained objects without loading. If you want convenient access, convert it to the lazy-load format instead which simply adds an index so each object can be loaded separately (see Get specific object from Rdata file)
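As a hedged illustration of the lazy-load conversion mentioned above (it relies on the internal helper tools:::makeLazyLoadDB, so treat it as a sketch rather than a supported API; "foo_lazy" is just an example file base name):

e <- new.env()
load("foo.RData", envir = e)              # one-time full load
tools:::makeLazyLoadDB(e, "foo_lazy")     # writes foo_lazy.rdb / foo_lazy.rdx

lazy <- new.env()
lazyLoad("foo_lazy", envir = lazy)        # cheap: only creates promises
ls(lazy)                                  # names are available immediately
lazy$x                                    # the object is read from disk on first access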
I just use an env= argument to load():
> x <- 1; y <- 2; z <- "foo"
> save(x, y, z, file="/tmp/foo.RData")
> ne <- new.env()
> load(file="/tmp/foo.RData", env=ne)
> ls(env=ne)
[1] "x" "y" "z"
> ne$z
[1] "foo"
>
The cost of this approach is that you do read the whole RData file, but on the other hand that seems to be unavoidable anyway, as no other method seems to offer a list of the 'contents' of such a file.
You can suppress the warning by setting warn.conflicts=FALSE on the call to attach. If an object is masked by one in the global environment, you can use get to retrieve it from your attached data.
x <- 1:10
save(x, file="x.rData")
#attach("x.rData", pos=2, warn.conflicts=FALSE)
attach("x.rData", pos=2)
(x <- 1)
# [1] 1
(x <- get("x", pos=2))
# [1] 1 2 3 4 5 6 7 8 9 10
Thanks to #Dirk and #Joshua.
I had an epiphany. The foreach package with the SMP or MC backends seems to produce environments that only inherit from, but do not seem to conflict with, the global environment.
lsfile <- function(list_files) {
  aggregate_ls <- foreach(ix = 1:length(list_files)) %dopar% {
    attach(list_files[ix])
    tmpls <- ls(pos = 2)
    return(tmpls)
  }
  return(aggregate_ls)
}
lsfile("f1.rdat")
lsfile(dir(pattern = "*rdat"))
This is useful to me because I can now parallelize this. This is a bare-bones version, and I will modify it to give more detailed information, but so far it seems to be the only way to avoid conflicts, even without ignoring the warnings.
So, question #1 can be resolved by either ignoring the warnings (as #Joshua suggested) or by using whatever magic foreach summons.
For part 2, loading an object, I think #Joshua has the right idea - "get" will do.
The foreach magic can also work, by using the .noexport option. However, this has risks: whatever isn't specifically excluded will be inherited/exported from the global environment (I could do ls(), but there's always the possibility of attached datasets). For safety, this means that get() must still be used to avoid the risk of a naming conflict. Loading into a subenvironment avoids the naming conflict, but doesn't avoid the loading of unnecessary objects.
#Joshua's answer is far simpler than my foreach detour.

hiding personal functions in R

I have a few convenience functions in my .Rprofile, such as a handy function for returning the size of objects in memory. Sometimes I like to clean out my workspace without restarting, and I do this with rm(list=ls()), which deletes all my user-created objects AND my custom functions. I'd really like not to blow up my custom functions.
One way around this seems to be creating a package with my custom functions so that my functions end up in their own namespace. That's not particularly hard, but is there an easier way to ensure custom functions don't get killed by rm()?
Combine attach and sys.source to source into an environment and attach that environment. Here I have two functions in file my_fun.R:
foo <- function(x) {
  mean(x)
}

bar <- function(x) {
  sd(x)
}
Before I load these functions, they are obviously not found:
> foo(1:10)
Error: could not find function "foo"
> bar(1:10)
Error: could not find function "bar"
Create an environment and source the file into it:
> myEnv <- new.env()
> sys.source("my_fun.R", envir = myEnv)
Still not visible as we haven't attached anything
> foo(1:10)
Error: could not find function "foo"
> bar(1:10)
Error: could not find function "bar"
and when we do so, they are visible, and because we have attached a copy of the environment to the search path the functions survive being rm()-ed:
> attach(myEnv)
> foo(1:10)
[1] 5.5
> bar(1:10)
[1] 3.027650
> rm(list = ls())
> foo(1:10)
[1] 5.5
I still think you would be better off with your own personal package, but the above might suffice in the meantime. Just remember the copy on the search path is just that, a copy. If the functions are fairly stable and you're not editing them then the above might be useful but it is probably more hassle than it is worth if you are developing the functions and modifying them.
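If you do keep editing the file, a small sketch of refreshing the attached copy (reusing the my_fun.R file and myEnv names from above) might look like this:

if ("myEnv" %in% search()) detach("myEnv")   # drop the stale copy from the search path
myEnv <- new.env()
sys.source("my_fun.R", envir = myEnv)        # re-read the edited file
attach(myEnv)                                # attach a fresh copy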
A second option is to just name them all .foo rather than foo as ls() will not return objects named like that unless argument all = TRUE is set:
> .foo <- function(x) mean(x)
> ls()
character(0)
> ls(all = TRUE)
[1] ".foo" ".Random.seed"
Here are two ways:
1) Have each of your function names start with a dot, e.g. .f instead of f. ls will not list such functions unless you use ls(all.names = TRUE), so they won't be passed to your rm command.
or,
2) Put this in your .Rprofile
attach(list(
  f = function(x) x,
  g = function(x) x*x
), name = "MyFunctions")
The functions will appear as a component named "MyFunctions" on your search list rather than in your workspace, and they will be accessible almost the same as if they were in your workspace. search() will display your search list and ls("MyFunctions") will list the names of the functions you attached. Since they are not in your workspace, the rm command you normally use won't remove them. If you do wish to remove them, use detach("MyFunctions").
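A quick usage sketch of that search-path entry, assuming the attach() call above has already run in a fresh session:

search()              # "MyFunctions" appears on the search path
ls("MyFunctions")     # [1] "f" "g"
g(4)                  # 16; callable as if it were in the workspace
rm(list = ls())       # clears the workspace but not the attached functions
g(4)                  # still 16
detach("MyFunctions") # removes them when you really want to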
Gavin's answer is wonderful, and I just upvoted it. Merely for completeness, let me toss in another one:
R> q("no")
followed by
M-x R
to create a new session, which re-reads the .Rprofile. Easy, fast, and cheap.
Other than that, private packages are the way in my book.
Another alternative: keep the functions in a separate file which is sourced within .RProfile. You can re-source the contents directly from within R at your leisure.
I find that often my R environment gets cluttered with various objects when I'm creating or debugging a function. I wanted a way to efficiently keep the environment free of these objects while retaining personal functions.
The simple function below was my solution. It does 2 things:
1) deletes all non-function objects that do not begin with a capital letter and then
2) saves the environment as an RData file
(requires the R.oo package)
cleanup <- function(filename = "C:/mymainR.RData") {
  library(R.oo)
  # create a data frame listing all personal objects
  everything <- ll(envir = 1)
  # get the objects that are not functions
  nonfunction <- as.vector(everything[everything$data.class != "function", 1])
  # non-function objects whose names begin with a lowercase letter should be deleted
  # (pattern anchored with ^ so that names starting with a capital letter are kept)
  trash <- nonfunction[grep('^[[:lower:]]', nonfunction)]
  remove(list = trash, pos = 1)
  # save the R environment
  save.image(filename)
  print(paste("New, CLEAN R environment saved in", filename))
}
In order to use this function 3 rules must always be kept:
1) Keep all data external to R.
2) Use names that begin with a capital letter for non-function objects that I want to keep permanently available.
3) Obsolete functions must be removed manually with rm.
Obviously this isn't a general solution for everyone...and potentially disastrous if you don't live by rules #1 and #2. But it does have numerous advantages: a) fear of my data getting nuked by cleanup() keeps me disciplined about using R exclusively as a processor and not a database, b) my main R environment is so small I can backup as an email attachment, c) new functions are automatically saved (I don't have to manually manage a list of personal functions) and d) all modifications to preexisting functions are retained. Of course the best advantage is the most obvious one...I don't have to spend time doing ls() and reviewing objects to decide whether they should be rm'd.
Even if you don't care for the specifics of my system, the "ll" function in R.oo is very useful for this kind of thing. It can be used to implement just about any set of clean up rules that fit your personal programming style.
Patrick Mohr
A quick and dirty nth option would be to use lsf.str() when calling rm(), to get all the functions in the current workspace and exclude them from removal... and it lets you name the functions however you wish.
pattern <- paste0('^', lsf.str(), '$', collapse = "|")        # regex matching every function name exactly
rm(list = grep(pattern, ls(), value = TRUE, invert = TRUE))   # remove everything that is not a function
I agree, it may not be the best practice, but it gets the job done! (and I have to selectively clean after myself anyway...)
Similar to Gavin's answer, the following loads a file of functions but without leaving an extra environment object around:
if('my_namespace' %in% search()) detach('my_namespace'); source('my_functions.R', attach(NULL, name='my_namespace'))
This removes the old version of the namespace if it was attached (useful for development), then attaches an empty new environment called my_namespace and sources my_functions.R into it. If you don't remove the old version you will build up multiple attached environments of the same name.
Should you wish to see which functions have been loaded, look at the output for
ls('my_namespace')
To unload, use
detach('my_namespace')
These attached functions, like a package, will not be deleted by rm(list=ls()).
