R best practices: do I need to "unset" a RefClass?

BLUP: What are the risks of creating an environment whose parent is .GlobalEnv, and calling setRefClass within this environment?
I have a package (repository on Github) that loads the contents of an R file provided by HDFql. This wrapper contains a call to setRefClass. After trying a lot of different things (most of which failed) I settled on sourcing the wrapper into an environment that is a child of .GlobalEnv. The environment itself lives in an environment contained in the package. This nesting was required to get around binding errors, because the call to setRefClass fails if it is executed inside an environment whose ancestor is the package namespace. The global environment seemed to be the only environment suitable for the setRefClass evaluation.
However, I'm a bit worried about creating and using an environment in a package whose parent is .GlobalEnv, and making calls to setRefClass inside this environment. What are potential pitfalls when doing this? Are there best practices for removing or "unsetting" the RefClass when finished? Is there a better solution I am not thinking of?
I have included some sample code below, although it is not reproducible; if you want a reproducible example, you can clone the package repository and/or install it with devtools. The code in question lives in the function hql_load() in file connect.r.
# "hql" is an empty environment exported by the package
constants = new.env(parent = .GlobalEnv)
source(constants.file, local = constants)
assign("constants", constants, envir = hql)
where constants.file contains the code
hdfql_cursor_ <- setRefClass("hdfql_cursor_",
  field = list(address = "numeric"),
  method = list(
    finalize = function(){
      .Call("_hdfql_cursor_destroy", .self$address, PACKAGE = "HDFqlR")
    }
  )
)

Related

R generic dispatching to attached environment

I have a bunch of functions and I'm trying to keep my workspace clean by defining them in an environment and attaching the environment. Some of the functions are S3 generics, and they don't seem to play well with this approach.
A minimum example of what I'm experiencing requires 4 files:
testfun.R
ttt.xxx <- function(object) print("x")
ttt <- function(object) UseMethod("ttt")
ttt2 <- function() {
yyy <- structure(1, class="xxx")
ttt(yyy)
}
In testfun.R I define an S3 generic ttt and a method ttt.xxx, I also define a function ttt2 calling the generic.
testenv.R
test_env <- new.env(parent=globalenv())
source("testfun.R", local=test_env)
attach(test_env)
In testenv.R I source testfun.R to an environment, which I attach.
test1.R
source("testfun.R")
ttt2()
xxx <- structure(1, class="xxx")
ttt(xxx)
test1.R sources testfun.R to the global environment. Both ttt2 and a direct function call work.
test2.R
source("testenv.R")
ttt2()
xxx <- structure(1, class="xxx")
ttt(xxx)
test2.R uses the "attach" approach. ttt2 still works (and prints "x" to the console), but the direct function call fails:
Error in UseMethod("ttt") :
no applicable method for 'ttt' applied to an object of class "xxx"
However, calling ttt and ttt.xxx without arguments shows that they are known, ls(pos=2) shows they are on the search path, and sloop::s3_dispatch(ttt(xxx)) tells me it should work.
This question is related to "Confusion about UseMethod search mechanism" and the link therein, https://blog.thatbuthow.com/how-r-searches-and-finds-stuff/, but I cannot get my head around what is going on: why is it not working, and how can I get it to work?
I've tried both R Studio and R in the shell.
UPDATE:
Based on the answers below I changed my testenv.R to:
test_env <- new.env(parent=globalenv())
source("testfun.R", local=test_env)
attach(test_env)
if (is.null(.__S3MethodsTable__.))
.__S3MethodsTable__. <- new.env(parent = baseenv())
for (func in grep(".", ls(envir = test_env), fixed = TRUE, value = TRUE))
.__S3MethodsTable__.[[func]] <- test_env[[func]]
rm(test_env, func)
... and this works (I am only using "." as an S3 dispatching separator).
It’s a little-known fact that you must use .S3method() to define methods for S3 generics inside custom environments (outside of packages).[1] The reason almost nobody knows this is that it is not necessary in the global environment; but it is necessary everywhere else since R version 3.6.
There’s virtually no documentation of this change, just a technical blog post by Kurt Hornik about some of the background. Note that the blog post says the change was made in R 3.5.0; however, the actual effect you are observing — that S3 methods are no longer searched in attached environments — only started happening with R 3.6.0; before that, it was somehow not active yet.
… except just using .S3method will not fix your code, since your calling environment is the global environment. I do not understand the precise reason why this doesn’t work, and I suspect it’s due to a subtle bug in R’s S3 method lookup. In fact, using getS3method('ttt', 'xxx') does work, even though that should have the same behaviour as actual S3 method lookup.
I have found that the only way to make this work is to add the following to testenv.R:
if (is.null(.__S3MethodsTable__.)) {
.__S3MethodsTable__. <- new.env(parent = baseenv())
}
.__S3MethodsTable__.$ttt.xxx <- ttt.xxx
… in other words: supply .GlobalEnv manually with an S3 methods lookup table. Unfortunately this relies on an undocumented S3 implementation detail that might theoretically change in the future.
Alternatively, it “just works” if you use ‘box’ modules instead of source. That is, you can replace the entirety of your testenv.R by the following:
box::use(./testfun[...])
This code treats testfun.R as a local module and loads it, attaching all exported names (via the attach declaration [...]).
[1] (and inside packages you need to use the equivalent S3method namespace declaration, though if you’re using ‘roxygen2’ then that’s taken care of for you)
First of all, my advice would be: don't try to reinvent R packages. They solve all the problems you say you are trying to solve, and others as well.
Secondly, I'll try to explain what went wrong in test2.R. It calls ttt on an xxx object, and ttt.xxx is on the search list, but is not found.
The problem is how the search for ttt.xxx happens. The search doesn't look for ttt.xxx on the search list; it looks for it in the environment from which ttt was called, then in an object called .__S3MethodsTable__.. I think there are two reasons for this:
First, it's a lot faster. It only needs to look in one or two places, and the table can be updated whenever a package is attached or detached, a relatively rare operation.
Second, it's more reliable. Each package has its own methods table, because two packages can use the same name for generics that have nothing to do with each other, or can use the same class names that are unrelated. So package code needs to be able to count on finding its own definitions first.
Since your call to ttt() happens at the top level, that's where R looks first for ttt.xxx(), but it's not there. Then it looks in the global .__S3MethodsTable__. (which is actually in the base environment), and it's not there either. So it fails.
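The lookup order described above can be seen in a small, self-contained sketch. It is only an illustration under assumptions: the functions mirror testfun.R but return "x" instead of printing it, and test_env is deliberately not attached, so the method is visible only from code whose enclosure is test_env.

```r
# Stand-in for testfun.R: everything lives in test_env only (nothing attached).
test_env <- new.env(parent = globalenv())
local({
  ttt     <- function(object) UseMethod("ttt")
  ttt.xxx <- function(object) "x"
  ttt2    <- function() ttt(structure(1, class = "xxx"))
}, envir = test_env)

# ttt() is called from ttt2(), whose enclosing environment is test_env,
# so the search that starts in the calling environment finds ttt.xxx.
from_inside <- test_env$ttt2()

# ttt() is called from outside test_env: ttt.xxx is not visible from the
# calling environment and is not registered anywhere, so dispatch fails.
from_outside <- tryCatch(
  test_env$ttt(structure(1, class = "xxx")),
  error = function(e) conditionMessage(e)
)
```

Here from_inside holds the method's return value, while from_outside holds the error message from the failed dispatch.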
There is a workaround that will make your code work. If you run
.__S3MethodsTable__. <- list2env(list(ttt.xxx = ttt.xxx))
as the last line of testenv.R, then you'll create a methods table in the global environment. (Normally there isn't one there, because that's user space, and R doesn't like putting things there unless the user asks for it.)
R will find that methods table, and will find the ttt.xxx method that it defines. I wouldn't be surprised if this breaks some other aspect of S3 dispatch, so I don't recommend doing it, but give it a try if you insist on reinventing the package system.

Difference between environment etc. in testthat::test_check vs testthat::test_dir

This is a somewhat deep R testing question, and as such, I'm not sure if general Stack Overflow is the right place for it, or if there's an R-specific forum that would be better.
Any pointers on that are welcome.
The scenario is: I have a package that uses testthat and has some tests in tests/testthat, and (for reasons that are important but that, to be honest, I don't totally understand) there are some other tests in inst/validation that need to be run as well, as part of a validation script (i.e. the script that this post is about).
I was running test_check(pkg) in my tests folder and it was working fine, but I wasn't getting the extra tests (which makes sense). So then I switched to the following:
test_dirs <- c("tests/testthat", "inst/validation")
for (.t in test_dirs) {
test_dir(.t)
}
Now a bunch of my tests are failing because they can't find some of the constants, etc. that are part of my package! (see note at the bottom for more details...)
So I dig in to the source code and find that test_check() actually calls testthat:::test_package_dir under the hood. Note the ::: this is an unexported function, so I don't really just want to call it in my own code.
testthat:::test_package_dir in turn calls the following, before calling test_dir() itself:
env <- test_pkg_env(package)
withr::local_options(list(topLevelEnvironment = env))
withr::local_envvar(list(
TESTTHAT_PKG = package,
TESTTHAT_DIR = maybe_root_dir(test_path)
))
test_dir(...
Sooooo... it seems like test_check() essentially just does some things to load the package environment (note test_pkg_env is also unexported) and then calls test_dir().
So I guess my question is: why? I've actually noticed this before with test_file() not working because it doesn't have everything in the package environment. Why do these functions not load the package environment like the other testing functions do?
Or really, my question is: is there a way to make them load it? And specifically in my case, is there a way to do what I'm trying to do (run tests in a few different directories) and have it load the package environment?
I notice this in the test_dir() docs:
env -- Environment in which to execute the tests. Expert use only.
which is set to test_env() by default. I have a feeling this is my answer, but I can't figure out how to get the package environment without basically copy/pasting a bunch of code out of functions that are hidden in :::. Perhaps I don't qualify as an "expert"...
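As a rough illustration of what "executing in the package environment" means, here is a base-R sketch. It is not testthat's actual implementation; stats is used purely as a stand-in for "my package", and test_like_env is a hypothetical name. An environment whose parent is a package namespace can see that package's unexported objects without `:::`.

```r
# Stand-in: `stats` plays the role of "my package". t.test.default is defined
# in its namespace but is not exported.
pkg_ns <- asNamespace("stats")
is_exported <- "t.test.default" %in% getNamespaceExports("stats")

# A fresh environment whose parent is the namespace: code evaluated here
# resolves internal objects through the parent chain.
test_like_env <- new.env(parent = pkg_ns)

visible_inside <- exists("t.test.default", envir = test_like_env)
is_fun_inside  <- eval(quote(is.function(t.test.default)), test_like_env)
```

An environment built this way for your own package could in principle be passed as the env argument of test_dir(); whether that reproduces everything test_check() sets up (options, environment variables) is not something this sketch verifies.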
Thanks for any insight and/or solutions!
note at the bottom:
Specifically my issue is that I have some "constants" in my aaa.R that are mostly just hard-coded strings or lists like:
SUMMARY_NAME <- "summary"
SUMMARY_COUNT <- "sum_count"
SUMMARY_PATH <- "sum_path"
SUM_REQ_COLS <- list(
  list(name = SUMMARY_NAME, type = "character"),
  list(name = SUMMARY_COUNT, type = "numeric"),
  list(name = SUMMARY_PATH, type = "character")
)
These are things that I use for checking S3 classes and other purposes so that I don't have hard-coded strings all over my code. The point is: I use some of these in my tests, which works fine for test_check() and devtools::check() and devtools::test() but dies when I try to use test_dir() or test_file() because they can't be found, presumably because the package environment isn't loaded.

R functions that execute functions

I'm trying to break out common lines of code used in a fairly large R script into encapsulated functions...however, they don't seem to be running the intended code when called. I feel like I'm missing some conceptual piece of how R works, or functional programming in general.
Examples:
Here's a piece of code I'd like to call to clear the workspace -
clearWorkSpace <- function() {
rm(list= ls(all=TRUE))
}
The code inside the function works as expected when run on its own; however, when the function itself is called, the environment is not cleared.
Again, here's a function intended to load all dependency files -
loadDependencies <- function() {
dep_files <- list.files(path="./dependencies")
for (file in dep_files) {
file_path <- paste0("./dependencies/",file)
source(file_path,local=TRUE)
}
}
If possible, it'd be great to be able to encapsulate code into easy to read functions. Thanks for your help in advance.
What you are calling workspace is more properly referred to as the global environment.
Functions execute in their own environments. This is, for example, why you don't see the variables defined inside a function in your global environment. It is also how a function knows to use a variable named x defined in the function body rather than some x you might happen to have in your global environment.
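A minimal sketch of that scoping behaviour (show_scope and the variable names are made up for illustration):

```r
x <- "outer"

show_scope <- function() {
  x <- "inner"      # new binding in the function's own environment
  y <- "temporary"  # exists only while the function runs
  x
}

inner_value <- show_scope()  # the function sees its own x
outer_value <- x             # the x outside the function is unchanged
```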
Most functions don't modify the external environments, which is good! It's the functional programming paradigm. Functions that do modify environments, such as rm and source, usually take arguments so that you can be explicit about which environment is modified. If you look at ?rm you'll see an envir argument, and that argument is most of what its Details section describes. source has a local argument:
local - TRUE, FALSE or an environment, determining where the parsed expressions are evaluated. FALSE (the default) corresponds to the user's workspace (the global environment) and TRUE to the environment from which source is called.
You explicitly set local = TRUE when you call source, which explicitly tells source to only modify the local (inside the function) environment, so of course your global environment is untouched!
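To make the effect of the local argument concrete, here is a small sketch. The temporary file, helper, load_local and target are hypothetical names used only for this illustration.

```r
# Write a tiny "dependency" file to a temporary location.
dep_file <- tempfile(fileext = ".R")
writeLines("helper <- function() 'loaded'", dep_file)

# local = TRUE: helper is defined in the function's own environment,
# which disappears when the function returns.
load_local <- function(path) {
  source(path, local = TRUE)
  exists("helper", inherits = FALSE)  # checked inside the function
}
defined_inside <- load_local(dep_file)

# local = <environment>: helper is defined in exactly that environment.
target <- new.env()
source(dep_file, local = target)
defined_in_target <- exists("helper", envir = target, inherits = FALSE)
```

Passing an explicit environment keeps the destination visible at the call site, which may be easier to reason about than relying on TRUE or FALSE.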
To make your functions work as I assume you want them to, you could modify clearWorkSpace like this:
clearWorkSpace <- function() {
rm(list= ls(all=TRUE, envir = .GlobalEnv), envir = .GlobalEnv)
}
And for loadDependencies, simply delete the local = TRUE (or, more explicitly, set local = FALSE or local = .GlobalEnv). You could also rewrite it in a more R-like way:
loadDependencies = function() {
invisible(lapply(list.files(path = "./dependencies", full.names = TRUE), source))
}
For both of these (especially with the simplified dependency running above) I'd question whether you really need these wrapped up in functions. Might be better to just get in the habit of restarting R when you resume work on a project and keeping invisible(lapply(list.files(path = "./dependencies", full.names = TRUE), source)) at the top of your script...
For more reading on environments, there is the Environments chapter of Advanced R. Notably, there are several ways to specify environments that might be useful for different use cases rather than hard-coding the global environment.
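As one example of not hard-coding the global environment, the sketch below takes the target environment as an argument. clear_env and scratch are hypothetical names; defaulting to parent.frame() (the caller's environment) is one possible design choice, not the only one.

```r
# Remove every object (including dot-prefixed ones) from `env`.
clear_env <- function(env = parent.frame()) {
  rm(list = ls(envir = env, all.names = TRUE), envir = env)
  invisible(env)
}

# Try it on a scratch environment rather than the real workspace.
scratch <- new.env()
assign("a", 1, envir = scratch)
assign(".hidden", 2, envir = scratch)
clear_env(scratch)
remaining <- ls(envir = scratch, all.names = TRUE)
```

With this signature, clear_env(globalenv()) would clear the workspace explicitly, and the same function can be exercised safely on a throwaway environment.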
In theory, you just need to do something like:
rm(list= ls(all=TRUE, envir = .GlobalEnv))
That is, you set the environment explicitly (although it would be better here to use the pos argument). But this will also delete the clearWorkSpace function, since it is defined in the global environment, so this will fail with a recursive call.
Personally, I never use rm within a function or a local call. My understanding is that rm is intended to be called from the console to clear the workspace.

Employ environments to handle package-data in package-functions

I recently wrote an R extension. The functions use data contained in the package and must therefore load them. Subroutines also need to access the data.
This is the approach taken:
main<- function(...){
data(data)
sub <- function(...,data=data){...}
...
}
I'm unhappy with the fact that the data resides in .GlobalEnv, so it still hangs around after the function has terminated (which also undermines the idea of passing it down via an argument).
Please put me on the right track! How do you employ environments, when you have to handle package-data in package-functions?
It looks like you are looking for the LazyData field in your package's DESCRIPTION file:
LazyData: yes
Otherwise, data has the envir argument, which you can use to control the environment into which your data is loaded. For example, if you wanted the data to be loaded inside main, you could use:
main<- function(...){
data(data, envir = environment() )
sub <- function(...,data=data){...}
...
}
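A runnable sketch of the envir idea, assuming the built-in mtcars dataset from the datasets package as a stand-in for the package's own data (mean_mpg and store are hypothetical names):

```r
# Load the dataset into the function's own environment only.
mean_mpg <- function() {
  data("mtcars", package = "datasets", envir = environment())
  mean(mtcars$mpg)
}
result <- mean_mpg()

# The same argument accepts any environment, e.g. a dedicated store.
store <- new.env()
data("mtcars", package = "datasets", envir = store)
in_store <- exists("mtcars", envir = store, inherits = FALSE)

# In a fresh session this is FALSE: nothing was copied to the workspace.
leaked <- exists("mtcars", envir = globalenv(), inherits = FALSE)
```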
If the data is needed for your functions, not for the user of the package, it should be saved in a file called sysdata.rda located in the R directory.
From R extensions:
Two exceptions are allowed: if the R subdirectory contains a file
sysdata.rda (a saved image of R objects: please use suitable
compression as suggested by tools::resaveRdaFiles) this will be
lazy-loaded into the namespace/package environment – this is intended
for system datasets that are not intended to be user-accessible via
data.
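Creating such a file could look roughly like the sketch below. The package directory (a folder under tempdir()) and the lookup_table object are made up for illustration; in a real package the file would go into the package's own R directory.

```r
# Hypothetical package root, placed in a temporary directory for this sketch.
pkg_dir <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# Hypothetical internal data needed by the package's functions.
lookup_table <- data.frame(code = c("a", "b"), value = c(1, 2))

# Save it as R/sysdata.rda with compression.
sysdata <- file.path(pkg_dir, "R", "sysdata.rda")
save(lookup_table, file = sysdata, compress = "xz")

# Sanity check: load the file back into a scratch environment.
check <- new.env()
loaded_names <- load(sysdata, envir = check)
```

If you use the usethis package, usethis::use_data(lookup_table, internal = TRUE) is a convenience wrapper that writes R/sysdata.rda for you.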

Creating and serializing / saving global variable from within a NAMESPACE in R

I would like to create a function within a package with a NAMESPACE that will save some variables. The problem is that when load is called on the .Rdata file it tries to load the namespace of the package that contained the function that created the .Rdata file, but this package need not be loaded.
This example function is in a package in a namespace :
create.global.function <- function(x, FUN, ...) {
  environment(FUN) <- .GlobalEnv
  assign(".GLOBAL.FUN", function(x) { FUN(x, ...) }, env=.GlobalEnv)
  environment(.GLOBAL.FUN) <- .GlobalEnv
  save(list = ls(envir = .GlobalEnv, all.names = TRUE),
       file = "/tmp/.Rdata",
       envir = .GlobalEnv)
}
The environment(.GLOBAL.FUN) <- .GlobalEnv calls are not sufficient, and attaching gdb to the R process confirms it is serializing a NAMESPACESXP here with the name of the package namespace; the load then fails because it is unable to load that namespace.
Is it possible to fully strip the namespace out of the .GLOBAL.FUN before I save it such that it can be loaded into other R instances without trying to load the namespace?
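One possible reading of the problem, not verified against the package in question: a function created inside another function keeps that call's frame as its enclosure (and for a package function, the frame's parent is the package namespace). In addition, environment(x) <- value is a replacement call, so it rebinds x in the current frame rather than changing a copy stored elsewhere. The sketch below uses made-up names (make_fun_leaky, make_fun_clean, store) and a plain environment instead of .GlobalEnv as the destination; it does not handle FUN or ... from the original function.

```r
# The closure is created inside the function, so its enclosure is the
# function's execution frame, not the global environment.
make_fun_leaky <- function(store) {
  assign("g", function(x) x + 1, envir = store)
}

# Reset the enclosure on the local object first, then store that object.
make_fun_clean <- function(store) {
  g <- function(x) x + 1
  environment(g) <- globalenv()
  assign("g", g, envir = store)
}

store <- new.env()

make_fun_leaky(store)
leaky_is_global <- identical(environment(store$g), globalenv())

make_fun_clean(store)
clean_is_global <- identical(environment(store$g), globalenv())
```

Only the second variant stores a function whose enclosure is the global environment, which is the property that matters when the function is later serialized.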
@JorisMeys: snowfall and the others do not offer exactly this functionality.
snowfall uses sfExport ( from clusterFunctions.R in snowfall) to export local and global objects to the slave nodes, and this in turn uses sfClusterCall which is a wrapper around the clusterCall function from snow.
res <- sfClusterCall( assign, name, val, env = globalenv(),
stopOnError = FALSE )
The snow library is loaded on the clients, which gets around any namespace issues; but as I mentioned in the last sentence of my question, I would like to avoid loading the namespace there.
Furthermore, it seems to make simplified assumptions such as that the nodes will share an NFS mount point for shared data (e.g. sfSource function in clusterFunctions.R).
I am more interested in something like a case where a node saves an .Rdata file then scp's it to another node that need not have the package namespace loaded.
It seems I can for now solve my original problem by using eval.parent and substitute:
assign(".GLOBAL.FUN",
eval.parent(substitute(function(y) { FUN(y, ...) })),
env=.GlobalEnv)
I apologize for the posting snafu, but I do not have an edit link although I posted this question, nor is there any place for me to leave a "comment" in the same way that I have this big text field for an answer. I've flagged this for moderation so I can get some help with that, and have referenced the FAQ, which talks about buttons for leaving comments that do not appear for me. There is some problem with this new account.
