Finalizer in R modifying objects - thread safe?

In an R package, I manage some external objects. I want to release them after the corresponding R object (an environment) has been garbage collected. There is a reason why I cannot release the external resources right away, so I need to somehow record the fact that the object has been garbage collected and use this information at a later time. The approach is implemented basically as follows:
Environment objects are created with some ID. I have an environment managingEnv in which I collect the IDs of the finalized objects in a vector. When an object is finalized, the finalizer writes its ID into managingEnv.
managingEnv <- new.env(parent = emptyenv())
managingEnv$garbageCollectedIds <- c()

createExternalObject <- function(id) {
  ret <- new.env(parent = emptyenv())
  ret$id <- id
  reg.finalizer(ret, function(e) {
    managingEnv$garbageCollectedIds <- c(managingEnv$garbageCollectedIds, e$id)
  })
  ret
}
If I then create some objects and run the garbage collection, the approach seems to work: at the end I have all the IDs collected in managingEnv and could later perform my action to release all these objects.
> createExternalObject(1)
<environment: 0x000002307ff36920>
> createExternalObject(2)
<environment: 0x000002307e94d390>
> createExternalObject(3)
<environment: 0x000002307e94ac90>
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 568005 30.4 1299461 69.4 1299461 69.4
Vcells 1519057 11.6 8388608 64.0 2044145 15.6
> managingEnv$garbageCollectedIds
[1] 2 1 3
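For completeness, the deferred cleanup step I have in mind would look roughly like the following sketch, where releaseExternalResource() is just a placeholder for whatever actually frees the external resource:

releaseCollectedObjects <- function() {
  ids <- managingEnv$garbageCollectedIds
  managingEnv$garbageCollectedIds <- c()
  for (id in ids) {
    releaseExternalResource(id)  # placeholder for the actual release call
  }
  invisible(ids)
}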
Although the approach seems to work, I have encountered some instability in my package and R sometimes crashes randomly. After some research into the problem, I came to the conclusion that the approach I use might not be safe.
The documentation of reg.finalizer says:
Note:
R's interpreter is not re-entrant and the finalizer could be run
in the middle of a computation. So there are many functions which
it is potentially unsafe to call from ‘f’: one example which
caused trouble is ‘options’. Finalizers are scheduled at garbage
collection but only run at a relatively safe time thereafter.
Is my approach really safe? Can I change this code to make it safe(r), or find another solution to the problem described above?
EDIT: I found the reasons for the instability in other parts of the code, so it seems the approach described above is safe. I would still be interested in more details about how to know which operations are "potentially unsafe" in finalizers, to be able to reason that the approach described here is safe.

Related

Is an rJava object exportable in future (package for asynchronous computing in R)?

I'm trying to speed up my R code with the future package, using the multicore plan on Linux. Inside the future definition I create a Java object and try to pass it to .jcall(), but I get a null value for the Java object in the future. Could anyone please help me resolve this? Below is sample code:
library("future")
plan(multicore)
library(rJava)
.jinit()
# preprocess is a user defined function
my_value <- preprocess(a = value){
# some preprocessing task here
# time consuming statistical analysis here
return(lreturn) # return a list of 3 components
}
obj=.jnew("java.custom.class")
f <- future({
.jcall(obj, "V", "CustomJavaMethod", my_value)
})
Basically I'm dealing with large streaming data. In the above code I send a string of streaming data to a user-defined function for statistical analysis, which returns a list of 3 components. I then want to send this list to the custom Java class [java.custom.class] for further processing via the custom Java method [CustomJavaMethod].
Without future my code runs fine, but I receive 12 streaming records per minute and the code cannot keep up; I observe a delay in processing.
I'm currently on Unix with 16 cores. With the future package the processing is fast, but when I trace back through my code, something goes wrong in .jcall.
Hope this clarifies my pain.
(Author of the future package here:)
Unfortunately, there are certain types of objects in R that cannot be sent to another R process for further processing. To clarify, this is a limitation of those types of objects, not of the parallel framework used (here the future framework). The simplest example of such an object is a file connection, e.g. con <- file("my-local-file.txt", open = "wb"). I've documented some examples in Section 'Non-exportable objects' of the 'Common Issues with Solutions' vignette (https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html).
As mentioned in the vignette, you can set an option (*) so that the future framework looks for these types of objects and gives an informative error before attempting to launch the future ("early stopping"). Here is your example with this check activated:
library("future")
plan(multisession)
## Assert that global objects can be sent back and forth between
## the main R process and background R processes ("workers")
options(future.globals.onReference = "error")
library("rJava")
.jinit()
end <- .jnew("java/lang/String", " World!")
f <- future({
start <- .jnew("java/lang/String", "Hello")
.jcall(start, "Ljava/lang/String;", "concat", end)
})
# Error in FALSE :
# Detected a non-exportable reference ('externalptr') in one of the
# globals ('end' of class 'jobjRef') used in the future expression
So, yes, your example actually works when using plan(multicore). The reason is that 'multicore' uses forked processes (available on Unix and macOS, but not on Windows). However, I would try my best not to limit your software to parallelizing only on "forkable" systems; if you can find an alternative approach I would aim for that. That way your code will also work on, say, a huge cloud cluster.
(*) The reason these checks are not enabled by default is that (a) they are still in beta testing, and (b) they come with overhead, because we basically need to scan for non-supported objects among all the globals. Whether these checks will be enabled by default in the future will be discussed over at https://github.com/HenrikBengtsson/future.
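For example, one way around the non-exportable 'jobjRef' is to create the rJava objects inside the future expression itself, so that only plain R values cross the process boundary; a minimal sketch that also works with plan(multisession):

library(future)
plan(multisession)  # background R workers; non-exportable objects cannot be sent to them

f <- future({
  # Create the Java objects on the worker, so no 'jobjRef' (external pointer)
  # has to be exported from the main R process
  library(rJava)
  .jinit()
  start <- .jnew("java/lang/String", "Hello")
  end   <- .jnew("java/lang/String", " World!")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
# [1] "Hello World!"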
The code in the question calls an unknown Method1 method, my_value is undefined, ... it is hard to know what you are really trying to achieve.
Take a look at the following example, maybe you can get inspiration from it:
library(future)
plan(multicore)
library(rJava)
.jinit()
end = .jnew("java/lang/String", " World!")
f <- future({
  start = .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
[1] "Hello World!"

Parallelization using shared memory [bigmemory]

I'm having some difficulties trying to make a parallel scenario [doSNOW] work that involves the use of shared memory [bigmemory]. The summary is that in some of the foreach workers I get the following error: "Error in { : task 1 failed - "cannot open the connection"". More specifically, checking the cluster output log, it is related to "'/temp/x_bigmatrix.desc': Permission denied", as if there were some problem with concurrent access to the big.matrix descriptor file.
Please excuse me: because the code is a bit complex I'm not including a reproducible example, but I'll try to explain the main points of the workflow.
I have a matrix X, which is converted into big.matrix through:
x_bigmatrix <- as.big.matrix(x_matrix,
                             type = "double",
                             separated = FALSE,
                             backingfile = "x_bigmatrix.bin",
                             descriptorfile = "x_bigmatrix.desc",
                             backingpath = "./temp/")
Then, I initialize the sock cluster with doSNOW [I'm on Windows 10 x64]:
cl <- parallel::makeCluster(N-1, outfile= "output.log")
registerDoSNOW(cl)
(showConnections() properly shows the registered connections)
Now I have to explain that there is a main loop (foreach) over the workers and, inside it, an inner loop in which each worker loops over the rows of X. The main idea is that, within the main foreach body, each worker is fed chunks of data sequentially through the inner loop; each worker may keep some of these observations, but instead of storing the observations themselves it stores their row indexes for later retrieval. To complicate things even more, each worker modifies an associated R6 class environment where the indices are stored. I mention this because access to the big.matrix descriptor file takes place in two different places: the main foreach loop and within each R6 environment. The main foreach body is the following:
workersOutput <- foreach(worker = workers, .verbose = TRUE) %dopar% {
  # In the main foreach loop I don't include the .packages argument because
  # I source all the needed libraries at the beginning of the loop
  source(libs)
  # I attach the big.matrix using attach.big.matrix and the descriptor file
  x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  # Now the inner for loop is going to loop over the rows in X.
  # Each worker can store some of these indices for posterior retrieval
  for (i in seq(1, nrow(X))) {
    # Each R6 object associated with each worker is modified, storing indices...
    # Within these environments, there is read-only access through a getter using the
    # same procedure as above: x_bigmatrix <- attach.big.matrix("./temp/x_bigmatrix.desc")
  }
}
stopCluster(cl)
The problem occurs in the inner loop, when trying to access the file-backed big.matrix. If I change the behaviour in these environments to store the observations themselves instead of the row indices (so there is no access to the descriptor file within these objects anymore), it works without any problem. Also, if I run it without parallelization [registerDoSEQ()] but storing the row indices in the objects, there are no errors either. So the problem occurs only when I mix parallelization with double access to the shared big.matrix from the different R6 environments. The weird thing is that some workers run longer than others, and in the end at least one finishes its run... which is what makes me think of a problem with concurrent access to the big.matrix descriptor file.
Am I failing at some basics here?
If the problem is concurrent access to the big.matrix descriptor file, you can just pass the descriptor object (obtained with describe()) rather than the descriptor file which contains the object.
Explanation:
When attaching from the descriptor file, it first creates the big.matrix.descriptor object and then attaches the big.matrix from that object. So, if you use the object directly, it will be copied to all your cluster workers and you can attach the big.matrix from it.
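A minimal sketch of that change, reusing the names from the question (x_matrix, N and workers are assumed to be defined as before):

library(bigmemory)
library(doSNOW)
library(foreach)

x_bigmatrix <- as.big.matrix(x_matrix, type = "double",
                             backingfile = "x_bigmatrix.bin",
                             descriptorfile = "x_bigmatrix.desc",
                             backingpath = "./temp/")
x_desc <- describe(x_bigmatrix)   # descriptor *object*, not the .desc file path

cl <- parallel::makeCluster(N - 1, outfile = "output.log")
registerDoSNOW(cl)

workersOutput <- foreach(worker = workers, .packages = "bigmemory") %dopar% {
  x <- attach.big.matrix(x_desc)  # attach from the exported descriptor object
  # ... loop over rows of x, storing indices as before ...
}
stopCluster(cl)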

Garbage Collection in R

I started to use gc() for garbage collection in R. I have 16 GB RAM and sometimes, up to 10 GB RAM gets freed when using this command.
Does it make sense to use gc() inside functions? Often, the functions I write/use need almost all RAM that is available. Or does R reliably clean up memory that was used only inside a function?
Example:
f <- function(x) {
  # do something
  y <- doStuff(x)
  # do something else
  z <- doMoreStuff(y)
  # garbage collection
  gc()
  # return result
  return(z)
}
Calling gc() is largely pointless, as R calls it automatically when more memory is needed. The only reason I can think of for calling gc() explicitly is if another program needs memory that R is hogging.
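As a rough illustration (the exact numbers depend on your platform), memory used only inside a function becomes unreachable when the function returns and is reclaimed at the next collection without any explicit gc() inside the function:

f <- function() {
  x <- numeric(1e8)  # roughly 800 MB, local to f()
  sum(x)
}
gc()  # baseline memory report
f()   # no gc() inside f() is needed
gc()  # the report shows Vcells back near the baseline: x has been reclaimed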

remove a temporary environment variable and release memory in R

I am working on a job in which a temporary hash table is used repeatedly inside a loop. The hash table is represented by an environment in R. The problem is that, as the loop proceeds, the memory usage keeps rising no matter what method I use to delete the table (I tried rm() and gc(), but neither was able to free the memory). As a consequence I cannot run an extraordinarily long loop, say 10M cycles. It looks like a memory leak, but I have failed to find a solution elsewhere. What is the correct way to completely remove an environment variable and simultaneously release all the memory it previously occupied? Thanks in advance for helping me check the problem.
Here is a very simple example. I am using Windows 8 and R version 3.1.0.
> fun = function(){
    H = new.env()
    for(i in rnorm(100000)){
      H[[as.character(i)]] = rnorm(100)
    }
    rm(list = names(H), envir = H, inherits = FALSE)
    rm(H)
    gc()
  }
>
> for(k in 1:5){
    print(k)
    fun()
    gc()
    print(memory.size(F))
  }
[1] 1
[1] 40.43
[1] 2
[1] 65.34
[1] 3
[1] 82.56
[1] 4
[1] 100.22
[1] 5
[1] 120.36
Environments in R are not a good choice for situations where the keys can vary a lot during the computation. The reason is that environments require keys to be symbols, and symbols are not garbage collected. So each run of your function is adding to the internal symbol table. Arranging for symbols to be garbage collected would be one possibility, though care would be needed since a lot of internals code assumes they are not. Another option would be to create better hash table support so environments don't have to try to serve this purpose for which they were not originally designed.
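As a side note, R 4.2.0 added an experimental hash table to the utils package (hashtab()), whose keys are ordinary R values rather than symbols, so a loop like the one above should no longer grow the symbol table. A minimal sketch, assuming R >= 4.2:

fun <- function() {
  H <- utils::hashtab()               # experimental hash table (R >= 4.2)
  for (i in rnorm(100000)) {
    utils::sethash(H, i, rnorm(100))  # keys are plain R values, not symbols
  }
  # H and its contents become unreachable (and collectable) when fun() returns
  invisible(NULL)
}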

In R, do an operation temporarily using a setting such as working directory

I'm almost certain I've read somewhere how to do this. Instead of having to save the current setting (say, the working directory) to a variable, change the working directory, do an operation, and then revert to what it was, I'd like to do all of this inside a single function, akin to what "with" is relative to attach/detach. A solution just for the working directory is what I need now, but is there perhaps a more generic function that does that sort of thing? Or isn't there?
So to illustrate... The way it is now:
curdir <- getwd()
setwd("../some/place")
# some operation
setwd(curdir)
The way it is in my wildest dreams:
with.dir("../some/place", # some operation)
I know I could write a function for this, I just have the impression there's something more readily available and generalizable to other parameters too.
Thanks
There is an idiom for this in some of R's base plotting functions
op <- par(no.readonly = TRUE)
# par(blah = stuff)
# plot(stuff)
par(op)
that is so unbelievably crude as to be fully portable to options() and setwd().
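The same save-and-restore pattern carries over to options() and setwd(), both of which invisibly return the previous value, for example:

old_opts <- options(digits = 3)   # options() invisibly returns the old values
# ... code that relies on the temporary option ...
options(old_opts)                 # restore

old_wd <- setwd("../some/place")  # setwd() invisibly returns the previous directory
# ... some operation ...
setwd(old_wd)                     # restore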
Fortunately it's also easy to implement a crude wrapper:
with_dir <- function(dir, expr) {
  old_wd <- getwd()
  setwd(dir)
  result <- evalq(expr)
  setwd(old_wd)
  result
}
I'm no wizard with nonstandard evaluation so evalq could be unstable somehow. More on NSE in an old write-up by Lumley and also in Wickham's Advanced R, but it's dense stuff and I haven't wrapped my head around it all yet.
edit: as per Ben Bolker's comment, it's probably better to use on.exit for this:
with_dir <- function(dir, expr) {
  old_wd <- getwd()
  on.exit(setwd(old_wd))
  setwd(dir)
  evalq(expr)
}
From the R docs:
on.exit records the expression given as its argument as needing to be executed when the current function exits (either naturally or as the result of an error). This is useful for resetting graphical parameters or performing other cleanup actions.
What you're describing depends upon two things: detecting when you enter and leave a particular lexical scope, and defining behavior to run on entrance and on exit. Python has these, called "context managers". They were a big deal when introduced, and many parts of Python's standard library now behave like context managers, either defining the "enter" and "exit" behavior explicitly or leveraging some clever inheritance scheme.
with.default
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
<bytecode: 0x07d02ccc>
<environment: namespace:base>
R's with function works sort of like a context manager, because it can pass scopes around easily. That said, this doesn't give you the "enter" and "exit" operations for free. Especially consider that the current working directory isn't an entry in the current scope, but a state of the R interpreter, which can only be queried or changed by function calls behind the .Internal shield.
You can easily define your own object types to have methods that are context manager-like for the with generic function, as well as writing and registering methods for other types you commonly use, but it is not part of the base R language.
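As a rough sketch of that last idea (the class name and method below are made up for illustration), you could wrap a directory path in a small object and give it a with() method that performs the "enter" and "exit" steps:

dir_context <- function(path) structure(list(path = path), class = "dir_context")

with.dir_context <- function(data, expr, ...) {
  old_wd <- getwd()
  on.exit(setwd(old_wd))                         # "exit" behaviour, runs even on error
  setwd(data$path)                               # "enter" behaviour
  eval(substitute(expr), envir = parent.frame())
}

# Usage:
# with(dir_context("../some/place"), list.files())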
