Global Variables as Function Parameters - r

In an R project, we have a global dataframe df that is to be used inside a function my_func(). The dataframe will not be changed, but it will be used as a "read-only" table.
Can you please assist me, on, what is the best practice:
Include the dataframe in the parameters of the function, as in
my_func(df)
{
a <- df[1,2]
}
OR
Not include it in the parameters, just use it (read it) in the function body, as in
my_func()
{
a <- df[1,2]
}

In an ideal world, data enters a function as an argument and leaves it as a return value. That is a good principle. Besides it is prefereable for code reuse. Right now you may be conviced, that you will only ever call this code on df (bad name by the way, as there is a function calles df already in R and that can lead to terrible error messages).
The only exception from this rule, and the reason, why <<- exist(*), may rarely be performance.
However in the read-only case, there are no performance gains, as R does behave cleverly.
Will will need to install the microbenchmark package for the following code to run:
expl <- data.frame(a = rep("Hello world.", 1e8),
b = rep(1, 1e8))
fun1 <- function(dataframe) return(sum(dataframe$b))
fun2 <- function() return(sum(expl$b))
microbenchmark::microbenchmark(fun1(expl), fun2())
Try it and you will see, that there is no performance gain in fun2over fun1, even though the dataframe has considerable size.
Edit:
(*) as I have learned from Konrad Rudolph's comment below, <<- can be usefull, when giving data to the parent, not necessarily the global namespace. Very interesting read even if not strictly on topic here: http://adv-r.had.co.nz/Functional-programming.html#mutable-state

Related

Unit testing functions with global variables in R

Preamble: package structure
I have an R package that contains an R/globals.R file with the following content (simplified):
utils::globalVariables("COUNTS")
Then I have a function that simply uses this variable. For example, R/addx.R contains a function that adds a number to COUNTS
addx <- function(x) {
COUNTS + x
}
This is all fine when doing a devtools::check() on my package, there's no complaining about COUNTS being out of the scope of addx().
Problem: writing a unit test
However, say I also have a tests/testthtat/test-addx.R file with the following content:
test_that("addition works", expect_gte(fun(1), 1))
The content of the test doesn't really matter here, because when running devtools::test() I get an "object 'COUNTS' not found" error.
What am I missing? How can I correctly write this test (or setup my package).
What I've tried to solve the problem
Adding utils::globalVariables("COUNTS") to R/addx.R, either before, inside or after the function definition.
Adding utils::globalVariables("COUNTS") to tests/testthtat/test-addx.R in all places I could think of.
Manually initializing COUNTS (e.g., with COUNTS <- 0 or <<- 0) in all places of tests/testthtat/test-addx.R I could think of.
Reading some examples from other packages on GitHub that use a similar syntax (source).
I think you misunderstand what utils::globalVariables("COUNTS") does. It just declares that COUNTS is a global variable, so when the code analysis sees
addx <- function(x) {
COUNTS + x
}
it won't complain about the use of an undefined variable. However, it is up to you to actually create the variable, for example by an explicit
COUNTS <- 0
somewhere in your source. I think if you do that, you won't even need the utils::globalVariables("COUNTS") call, because the code analysis will see the global definition.
Where you would need it is when you're doing some nonstandard evaluation, so that it's not obvious where a variable comes from. Then you declare it as a global, and the code analysis won't worry about it. For example, you might get a warning about
subset(df, Col1 < 0)
because it appears to use a global variable named Col1, but of course that's fine, because the subset() function evaluates in a non-standard way, letting you include column names without writing df$Col.
#user2554330's answer is great for many things.
If I understand correctly, you have a COUNTS that needs to be updateable, so putting it in the package environment might be an issue.
One technique you can use is the use of local environments.
Two alternatives:
If it will always be referenced in one function, it might be easiest to change the function from
myfunc <- function(...) {
# do something
COUNTS <- COUNTS + 1
}
to
myfunc <- local({
COUNTS <- NA
function(...) {
# do something
COUNTS <<- COUNTS + 1
}
})
What this does is create a local environment "around" myfunc, so when it looks for COUNTS, it will be found immediately. Note that it reassigns using <<- instead of <-, since the latter would not update the different-environment-version of the variable.
You can actually access this COUNTS from another function in the package:
otherfunc <- function(...) {
COUNTScopy <- get("COUNTS", envir = environment(myfunc))
COUNTScopy <- COUNTScopy + 1
assign("COUNTS", COUNTScopy, envir = environment(myfunc))
}
(Feel free to name it COUNTS here as well, I used a different name to highlight that it doesn't matter.)
While the use of get and assign is a little inconvenient, it should only be required twice per function that needs to do this.
Note that the user can get to this if needed, but they'll need to use similar mechanisms. Perhaps that's a problem; in my packages where I need some form of persistence like this, I have used convenience getter/setter functions.
You can place an environment within your package, and then use it like a named list within your package functions:
E <- new.env(parent = emptyenv())
myfunc <- function(...) {
# do something
E$COUNTS <- E$COUNTS + 1
}
otherfunc <- function(...) {
E$COUNTS <- E$COUNTS + 1
}
We do not need the get/assign pair of functions, since E (a horrible name, chosen for its brevity) should be visible to all functions in your package. If you don't need the user to have access, then keep it unexported. If you want users to be able to access it, then exporting it via the normal package mechanisms should work.
Note that with both of these, if the user unloads and reloads the package, the COUNTS value will be lost/reset.
I'll list provide a third option, in case the user wants/needs direct access, or you don't want to do this type of value management within your package.
Make the user provide it at all times. For this, add an argument to every function that needs it, and have the user pass an environment. I recommend that because most arguments are passed by-value, but environments allow referential semantics (pass by-reference).
For instance, in your package:
myfunc <- function(..., countenv) {
stopifnot(is.environment(countenv))
# do something
countenv$COUNT <- countenv$COUNT + 1
}
otherfunc <- function(..., countenv) {
countenv$COUNT <- countenv$COUNT + 1
}
new_countenv <- function(init = 0) {
E <- new.env(parent = emptyenv())
E$COUNT <- init
E
}
where new_countenv is really just a convenience function.
The user would then use your package as:
mycount <- new_countenv()
myfunc(..., countenv = mycount)
otherfunc(..., countenv = mycount)

How to check if a function has been called from the console?

I am trying to track the number of times certain functions are called from the console.
My plan is to add a simple function such as "trackFunction" in each function that can check whether they have been called from the console or as underlying functions.
Even though the problem sounds straight-forward I can't find a good solution to this problem as my knowledge in function programming is limited. I've been looking at the call stack and rlang::trace_back but without a good solution to this.
Any help is appreciated.
Thanks
A simple approach would be to see on which level the current frame lies. That is, if a function is called directly in the interpreter, then sys.nframe() returns 1, otherwise 2 or higher.
Relate:
Rscript detect if R script is being called/sourced from another script
myfunc <- function(...) {
if (sys.nframe() == 1) {
message("called from the console")
} else {
message("called from elsewhere")
}
}
myfunc()
# called from the console
g <- function() myfunc()
g()
# called from elsewhere
Unfortunately, this may not always be intuitive:
ign <- lapply(1, myfunc)
# called from elsewhere
for (ign in 1) myfunc()
# called from the console
While for many things the lapply-family and for loops are similar, they behave separately here. If this is a problem, perhaps the only way to mitigate this is the analyze/parse the call stack and perhaps "ignore" certain functions. If this is what you need, then perhaps this is more appropriate:
R How to check that a custom function is called within a specific function from a certain package

In R, do an operation temporarily using a setting such as working directory

I'm almost certain I've read somewhere how to do this. Instead of having to save the current option (say working directory) to a variable, change the w.d, do an operation, and then revert back to what it was, doing this inside a function akin to "with" relative to attach/detach. A solution just for working directory is what I need now, but there might be a more generic function that does that sort of things? Or ain't it?
So to illustrate... The way it is now:
curdir <- getwd()
setwd("../some/place")
# some operation
setwd(curdir)
The way it is in my wildest dreams:
with.dir("../some/place", # some operation)
I know I could write a function for this, I just have the impression there's something more readily available and generalizable to other parameters too.
Thanks
There is an idiom for this in some of R's base plotting functions
op <- par(no.readonly = TRUE)
# par(blah = stuff)
# plot(stuff)
par(op)
that is so unbelievably crude as to be fully portable to options() and setwd().
Fortunately it's also easy to implement a crude wrapper:
with_dir <- function(dir, expr) {
old_wd <- getwd()
setwd(dir)
result <- evalq(expr)
setwd(old_wd)
result
}
I'm no wizard with nonstandard evaluation so evalq could be unstable somehow. More on NSE in an old write-up by Lumley and also in Wickham's Advanced R, but it's dense stuff and I haven't wrapped my head around it all yet.
edit: as per Ben Bolker's comment, it's probably better to use on.exit for this:
with_dir <- function(dir, expr) {
old_wd <- getwd()
on.exit(setwd(old_wd))
setwd(dir)
evalq(expr)
}
From the R docs:
on.exit records the expression given as its argument as needing to be executed when the current function exits (either naturally or as the result of an error). This is useful for resetting graphical parameters or performing other cleanup actions.
What you're describing depends upon two things: detecting when you enter and leave a particular lexical scope, and defining a behavior to do on entrance and on exit. Python has these, called "Context Managers". This was a big deal when it was released, and many parts of Python's standard library now behave like context managers, and have to define the "enter" and "exit" behavior in explicitly, or by leveraging some clever inheritance scheme.
with.default
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
<bytecode: 0x07d02ccc>
<environment: namespace:base>
R's with function works sort of like a context manager, because it can pass scopes around easily. That said, this doesn't give you the "enter" and "exit" operations for free. Especially consider that the current working directory isn't an entry in the current scope, but a state of the R interpreter, which can only be queried or changed by function calls behind the .Internal shield.
You can easily define your own object types to have methods that are context manager-like for the with generic function, as well as writing and registering methods for other types you commonly use, but it is not part of the base R language.

Method initialisation in R reference classes

I've noticed some strange behaviour in R reference classes when trying to implement some optimisation algorithm. There seems to be some behind-the-scenes parsing magic involved in initialising methods in a particular which makes it difficult to work with anonymous functions.
Here's an example that illustrates the difficulty: I define a function to optimise (f_opt), a function that runs optim on it, and a reference class that has these two as methods. The odd behaviour will be clearer in the code
f_opt <- function(x) (t(x)%*%x)
do_optim_opt <- function(x) optim(x,f)
do_optim2_opt <- function(x)
{
f(x) #Pointless extra evaluation
optim(x,f)
}
optClass <- setRefClass("optClass",methods=list(do_optim=do_optim_opt,
do_optim2=do_optim2_opt,
f=f_opt))
oc <- optClass$new()
oc$do_optim(rep(0,2)) #Doesn't work: Error in function (par) : object 'f' not found
oc$do_optim2(rep(0,2)) #Works.
oc$do_optim(rep(0,2)) #Parsing magic has presumably happened, and now this works too.
Is it just me, or does this look like a bug to other people too?
This post in R-devel seems relevant, with workaround
do_optim_opt <- function(x, f) optim(x, .self$f)
Seems worth a post to R-devel.

Why can't I pass a dataset to a function?

I'm using the package glmulti to fit models to several datasets. Everything works if I fit one dataset at a time.
So for example:
output <- glmulti(y~x1+x2,data=dat,fitfunction=lm)
works just fine.
However, if I create a wrapper function like so:
analyze <- function(dat)
{
out<- glmulti(y~x1+x2,data=dat,fitfunction=lm)
return (out)
}
simply doesn't work. The error I get is
error in evaluating the argument 'data' in selecting a method for function 'glmulti'
Unless there is a data frame named dat, it doesn't work. If I use results=lapply(list_of_datasets, analyze), it doesn't work.
So what gives? Without my said wrapper, I can't lapply a list of datasets through this function. If anyone has thoughts or ideas on why this is happening or how I can get around it, that would be great.
example 2:
dat=list_of_data[[1]]
analyze(dat)
works fine. So in a sense it is ignoring the argument and just literally looking for a data frame named dat. It behaves the same no matter what I call it.
I guess this is -yet another- problem due to the definition of environments in the parse tree of S4 methods (one of the resons why I am not a big fan of S4...)
It can be shown by adding quotes around the dat :
> analyze <- function(dat)
+ {
+ out<- glmulti(y~x1+x2,data="dat",fitfunction=lm)
+ return (out)
+ }
> analyze(test)
Initialization...
Error in eval(predvars, data, env) : invalid 'envir' argument
You should in the first place send this information to the maintainers of the package, as they know how they deal with the environments internally. They'll have to adapt the functions.
A -very dirty- workaround for yourself, is to put "dat" in the global environment and delete it afterwards.
analyze <- function(dat)
{
assign("dat",dat,envir=.GlobalEnv) # put the dat in the global env
out<- glmulti(y~x1+x2,data=dat,fitfunction=lm)
remove(dat,envir=.GlobalEnv) # delete dat again from global env
return (out)
}
EDIT:
Just for clarity, this is really about the worst solution possible, but I couldn't manage to find anything better. If somebody else gives you a solution where you don't have to touch your global environment, by all means use that one.

Resources