I have a long calculation that I am trying to parallelize using the future package (specifically within an R Shiny application).
I have used the dplyr package for all data manipulation, and a lot of column names are referred to without quotation marks. When I try to run a function that I know works when not parallelized, I get the following error:
Warning: Error in : Identified global objects via static code inspection ({if (i > maxProgress) {; maxProgress <- i; shinyWidgets::updateProgressBar(session = session, id = "pb",; value = 100 * i/length(scanList)); ...; }; return(resultInstance); }). Failed to locate global object in the relevant environments: ‘Months in Service’
because static code inspection thinks ‘Months in Service’ is a global variable and fails to find it (it is in fact a column name of a dplyr tibble), so my code never runs.
If I do the following globally before calling the function:
`Months in Service` <- NULL
it gives the same error for another column name. So one solution is to define each of those names as a global variable and hope dplyr will work the same. However, is there another simple way around this, such as telling R to evaluate everything as it would have without parallelization? (Each iteration is completely independent.)
Edit 1: I simplified the operation and tested whether NULL would actually work. It doesn't; it complains because it thinks the column name is NULL, for example:
no applicable method for 'rename_' applied to an object of class "NULL"
Edit 2: example reproduction
library("dplyr")
library("listenv")
exampleTibb <- tibble(`col 1`=c(1,2,3))
exampleFuture <- listenv()
exampleFuture[[1]] %<-% future({ rename(exampleTibb, `col 2` = `col 1`) })
exampleFuture <- as.list(exampleFuture)
Author of the future package here: This is due to a bug that has been fixed for future 1.8.0. I'm preparing to submit it to CRAN, but in the meantime you can either do:
options(future.globals.onMissing = "ignore")
or, better, install the development version:
remotes::install_github("HenrikBengtsson/future@develop")
UPDATE 2018-04-08: future 1.8.0 is now on CRAN.
UPDATE 2018-04-07: This error only occurs when using nested futures. Note that you introduce nested futures by mistake when you use both %<-% and future() together; that is clearly a mistake. You want to use either:
exampleFuture[[1]] %<-% { rename(exampleTibb, `col 2` = `col 1`) }
or
exampleFuture[[1]] <- future({ rename(exampleTibb, `col 2` = `col 1`) })
and certainly not both.
Preamble: package structure
I have an R package that contains an R/globals.R file with the following content (simplified):
utils::globalVariables("COUNTS")
Then I have a function that simply uses this variable. For example, R/addx.R contains a function that adds a number to COUNTS
addx <- function(x) {
COUNTS + x
}
This is all fine when doing a devtools::check() on my package, there's no complaining about COUNTS being out of the scope of addx().
Problem: writing a unit test
However, say I also have a tests/testthat/test-addx.R file with the following content:
test_that("addition works", expect_gte(fun(1), 1))
The content of the test doesn't really matter here, because when running devtools::test() I get an "object 'COUNTS' not found" error.
What am I missing? How can I correctly write this test (or set up my package)?
What I've tried to solve the problem
Adding utils::globalVariables("COUNTS") to R/addx.R, either before, inside or after the function definition.
Adding utils::globalVariables("COUNTS") to tests/testthtat/test-addx.R in all places I could think of.
Manually initializing COUNTS (e.g., with COUNTS <- 0 or <<- 0) in all places of tests/testthtat/test-addx.R I could think of.
Reading some examples from other packages on GitHub that use a similar syntax (source).
I think you misunderstand what utils::globalVariables("COUNTS") does. It just declares that COUNTS is a global variable, so when the code analysis sees
addx <- function(x) {
COUNTS + x
}
it won't complain about the use of an undefined variable. However, it is up to you to actually create the variable, for example by an explicit
COUNTS <- 0
somewhere in your source. I think if you do that, you won't even need the utils::globalVariables("COUNTS") call, because the code analysis will see the global definition.
Where you would need it is when you're doing some nonstandard evaluation, so that it's not obvious where a variable comes from. Then you declare it as a global, and the code analysis won't worry about it. For example, you might get a warning about
subset(df, Col1 < 0)
because it appears to use a global variable named Col1, but of course that's fine, because the subset() function evaluates its condition in a non-standard way, letting you use column names without writing df$Col1.
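Putting the two pieces together, a minimal R/globals.R for the package above might look like this (the file name and the extra Col1 entry are just illustrations of the convention, not part of the original question):
# R/globals.R (illustrative)
# Declare names used via non-standard evaluation so the code analysis
# run by R CMD check does not flag them as undefined globals.
utils::globalVariables(c("COUNTS", "Col1"))
# Actually create the variable -- globalVariables() only silences the
# check; it does not define anything.
COUNTS <- 0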
@user2554330's answer is great for many things.
If I understand correctly, you have a COUNTS that needs to be updateable, so putting it in the package environment might be an issue.
One technique you can use is the use of local environments.
Two alternatives:
If it will always be referenced in one function, it might be easiest to change the function from
myfunc <- function(...) {
# do something
COUNTS <- COUNTS + 1
}
to
myfunc <- local({
COUNTS <- NA
function(...) {
# do something
COUNTS <<- COUNTS + 1
}
})
What this does is create a local environment "around" myfunc, so when it looks for COUNTS, it will be found immediately. Note that it reassigns using <<- instead of <-, since the latter would not update the different-environment-version of the variable.
You can actually access this COUNTS from another function in the package:
otherfunc <- function(...) {
COUNTScopy <- get("COUNTS", envir = environment(myfunc))
COUNTScopy <- COUNTScopy + 1
assign("COUNTS", COUNTScopy, envir = environment(myfunc))
}
(Feel free to name it COUNTS here as well, I used a different name to highlight that it doesn't matter.)
While the use of get and assign is a little inconvenient, it should only be required twice per function that needs to do this.
Note that the user can get to this if needed, but they'll need to use similar mechanisms. Perhaps that's a problem; in my packages where I need some form of persistence like this, I have used convenience getter/setter functions.
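For instance, the getter/setter convenience functions mentioned above might look like this (a sketch; the names get_counts and set_counts are mine, not from the answer):
# Convenience wrappers around myfunc's local-environment COUNTS.
get_counts <- function() {
  get("COUNTS", envir = environment(myfunc))
}
set_counts <- function(value) {
  assign("COUNTS", value, envir = environment(myfunc))
}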
You can place an environment within your package, and then use it like a named list within your package functions:
E <- new.env(parent = emptyenv())
E$COUNTS <- 0  # initialize it, otherwise E$COUNTS + 1 below is numeric(0)
myfunc <- function(...) {
# do something
E$COUNTS <- E$COUNTS + 1
}
otherfunc <- function(...) {
E$COUNTS <- E$COUNTS + 1
}
We do not need the get/assign pair of functions, since E (a horrible name, chosen for its brevity) should be visible to all functions in your package. If you don't need the user to have access, then keep it unexported. If you want users to be able to access it, then exporting it via the normal package mechanisms should work.
Note that with both of these, if the user unloads and reloads the package, the COUNTS value will be lost/reset.
I'll provide a third option, in case the user wants/needs direct access, or you don't want to do this type of value management within your package.
Make the user provide it at all times. For this, add an argument to every function that needs it, and have the user pass an environment. I recommend an environment because most R objects are passed by value, whereas environments allow referential semantics (pass by reference).
For instance, in your package:
myfunc <- function(..., countenv) {
stopifnot(is.environment(countenv))
# do something
countenv$COUNT <- countenv$COUNT + 1
}
otherfunc <- function(..., countenv) {
countenv$COUNT <- countenv$COUNT + 1
}
new_countenv <- function(init = 0) {
E <- new.env(parent = emptyenv())
E$COUNT <- init
E
}
where new_countenv is really just a convenience function.
The user would then use your package as:
mycount <- new_countenv()
myfunc(..., countenv = mycount)
otherfunc(..., countenv = mycount)
I'm developing a package in R. I have a bunch of functions, some of which need some global variables. How do I manage global variables in packages?
I've read something about environments, but I do not understand how they would work, or if this is even the way to go about things.
You can use package-local variables through an environment. These variables will be available to multiple functions in the package, but not (easily) accessible to the user, and they will not interfere with the user's workspace. A quick and simple example is:
pkg.env <- new.env()
pkg.env$cur.val <- 0
pkg.env$times.changed <- 0
inc <- function(by=1) {
pkg.env$times.changed <- pkg.env$times.changed + 1
pkg.env$cur.val <- pkg.env$cur.val + by
pkg.env$cur.val
}
dec <- function(by=1) {
pkg.env$times.changed <- pkg.env$times.changed + 1
pkg.env$cur.val <- pkg.env$cur.val - by
pkg.env$cur.val
}
cur <- function(){
cat('the current value is', pkg.env$cur.val, 'and it has been changed',
pkg.env$times.changed, 'times\n')
}
inc()
inc()
inc(5)
dec()
dec(2)
inc()
cur()
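Run in a fresh session, the final cur() call should print:
the current value is 5 and it has been changed 6 times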
You could set an option, e.g.
options("mypkg-myval"=3)
1+getOption("mypkg-myval")
[1] 4
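If the option should get a sensible default whenever the package is loaded, a common idiom is an .onLoad() hook; here is a sketch (the file name R/zzz.R is just the usual convention, and "mypkg-myval" is the option from above):
# In R/zzz.R (sketch)
.onLoad <- function(libname, pkgname) {
  op <- options()
  defaults <- list("mypkg-myval" = 3)
  # only set the option if the user has not already set it
  toset <- !(names(defaults) %in% names(op))
  if (any(toset)) options(defaults[toset])
  invisible()
}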
In general, global variables are evil. The underlying principle of why they are evil is that you want to minimize the interconnections in your package. These interconnections often cause functions to have side effects, i.e. the outcome depends not only on the input arguments but also on the value of some global variable. Especially when the number of functions grows, this can be hard to get right and hell to debug.
For global variables in R see this SO post.
Edit in response to your comment:
An alternative could be to just pass around the needed information to the functions that need it. You could create a new object which contains this info:
token_information = list(token1 = "087091287129387",
token2 = "UA2329723")
and require all functions that need this information to have it as an argument:
do_stuff = function(arg1, arg2, token) {
  # ... use token$token1 and token$token2 here ...
}
do_stuff(arg1, arg2, token = token_information)
In this way it is clear from the code that token information is needed in the function, and you can debug the function on its own. Furthermore, the function has no side effects, as its behavior is fully determined by its input arguments. A typical user script would look something like:
token_info = create_token(token1, token2)
do_stuff(arg1, arg2, token_info)
I hope this makes things more clear.
The question is unclear:
Just one R process or several?
Just on one host, or across several machine?
Is there common file access among them or not?
In increasing order of complexity, I'd use a file, an SQLite backend via the RSQLite package, or (my favourite :) the rredis package to set to / read from a Redis instance.
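For the simplest of those, a shared file, a sketch might look like this (the path and function names are illustrative):
# Persist a shared value to a file that several R processes can read.
state_file <- "shared_state.rds"  # any path all processes can reach
set_shared <- function(value) saveRDS(value, state_file)
get_shared <- function() {
  if (file.exists(state_file)) readRDS(state_file) else NULL
}
set_shared(42)
get_shared()
[1] 42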
You could also create a list of tokens and add it to R/sysdata.rda with usethis::use_data(..., internal = TRUE). The data in this file is internal, but accessible by all functions. The only problem would arise if you only want some functions to access the tokens, which would be better served by:
the environment solution already proposed above; or
creating a hidden helper function that holds the tokens and returns them. Then just call this hidden function inside the functions that use the tokens, and (assuming it returns a list) you can inject them into their environment with list2env(..., envir = environment()), as sketched below.
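A sketch of that hidden-helper approach (the function names are illustrative; the token values are the ones from the earlier example):
# Hidden helper: not exported, simply returns the tokens.
.get_tokens <- function() {
  list(token1 = "087091287129387", token2 = "UA2329723")
}
api_call <- function(...) {
  # Inject token1 and token2 into this function's local environment.
  list2env(.get_tokens(), envir = environment())
  # ... token1 and token2 can now be used directly ...
}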
If you don't mind adding a dependency to your package, you can use an R6 object from the package of the same name, as suggested in the comments to @greg-snow's answer.
R6 objects are actual environments with the possibility of adding public and private methods, are very lightweight, and can be a good, more rigorous option for sharing a package's global variables without polluting the global environment.
Compared to @greg-snow's solution, it allows stricter control of your variables (you can add methods that check types, for example). The drawbacks can be the dependency and, of course, learning the R6 syntax.
library(R6)
MyPkgOptions = R6::R6Class(
"mypkg_options",
public = list(
get_option = function(x) private$.options[[x]]
),
active = list(
var1 = function(x){
if(missing(x)) private$.options[['var1']]
else stop("This is an environment parameter that cannot be changed")
}
,var2 = function(x){
if(missing(x)) private$.options[['var2']]
else stop("This is an environment parameter that cannot be changed")
}
),
private = list(
.options = list(
var1 = 1,
var2 = 2
)
)
)
# Create an instance
mypkg_options = MyPkgOptions$new()
# Fetch values from active fields
mypkg_options$var1
#> [1] 1
mypkg_options$var2
#> [1] 2
# Alternative way
mypkg_options$get_option("var1")
#> [1] 1
mypkg_options$get_option("var3")
#> NULL
# Variables are locked unless you add a method to change them
mypkg_options$var1 = 3
#> Error in (function (x) : This is an environment parameter that cannot be changed
Created on 2020-05-27 by the reprex package (v0.3.0)
Suppose there is a set of functions, drawn from a package not written by me, to which I want to assign a special behavior on error. My current concern is with the _impl family of functions in dplyr. Take mutate_impl, for example. When I get an error from mutate, traceback() almost always leads me to mutate_impl, but it is usually a ways up the call stack -- I have seen it be as many as 15 calls from the call to mutate. So what I want to know at that point is typically how the arguments to mutate_impl relate to the arguments I originally supplied to mutate (or think I did).
So, this code is probably wrong in too many ways to count -- certainly it does not work -- but I hope it at least helps to express my intent. The idea is that I could wrap it around mutate_impl, and if it produces an error, it saves the error message and a description of the arguments and returns them as a list:
str_impl <- function(f) {
  tryCatch(f, error = function(c) {
    msg <- conditionMessage(c)
    args <- capture.output(str(as.list(match.call(call(f)))))
    list(message = msg, arguments = args)
  })
}
assign(str_impl(mutate_impl), .GlobalEnv)
Although this still falls short of what I really want, because even without the constraint of working code, I could not figure out how to produce a draft. What I really want is to be able to identify a function or list of functions that should have this behavior on error, and then have it occur whenever and wherever those functions are called. I could not think of any way to even start to do that without rewriting functions in the dplyr package environment, which struck me as a really bad idea.
The final assignment to the global environment is supposed to get the error object back to somewhere I can find it, even if the call to mutate_impl happens somewhere inaccessible, like in an environment that ceases to exist after the error.
Probably the best way of achieving what you want is via the trace functionality. It's surely worth reading the help about trace, but here is a working example:
library(dplyr)
trace("mutate_impl", exit = quote({
if (class(returnValue())[1]=="NULL") {
cat("df\n")
print(head(df))
cat("\n\ndots\n")
print(dots)
} else {
# no problem, nothing to do
}
}), where = mutate, print = FALSE)
# ok
xx <- mtcars %>% mutate(gear = gear * 2)
# not ok, extra output
xx <- mtcars %>% mutate(gear = hi * 2)
It should be fairly simple to adjust this to your specific needs, e.g. if you want to log to a file instead:
trace("mutate_impl", exit = quote({
if (class(returnValue())[1]=="NULL") {
sink("error.log")
cat("df\n")
print(head(df))
cat("\n\ndots\n")
print(dots)
sink()
} else {
# no problem, nothing to do
}
}), where = mutate, print = FALSE)
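Once you are done debugging, you can remove the hook again; untrace() mirrors the trace() call:
untrace("mutate_impl", where = mutate)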
I have solved variants of the "no visible binding" notes that one gets when checking a package. However, I am unable to solve the situation when it applies to the '<<-' assignment.
Specifically, I had defined and used a local variable in several functions, such as:
fName = function(df){
eval({vName<<-0}, envir=environment(fName))
}
However when I run check() from devtools, I get the error:
fName: no visible binding for '<<-' assignment to 'vName'
So, I tried using a different syntax, as follows:
fName = function(df){
assign("vName",0, envir=environment(fName))
}
But got the check() error:
cannot add bindings to a locked environment
When I tried:
fName = function(df){
assign("vName",0, envir=environment(-1))
}
I got the error:
use of NULL environment is defunct
So, my question is how I can accomplish the assignment operator <<- without getting a note in the check() from devtools.
Thank you.
The easy answer is - Don't use <<- in your package. One alternative way you can make assignments to an environment (but not a meaningful one) is to create a locked binding.
e <- new.env()
e$vName <- 0L
lockBinding("vName", e)
vName
# Error: object 'vName' not found
with(e, vName)
# [1] 0
e$vName <- 5
# Error in e$vName <- 5 : cannot change value of locked binding for 'vName'
You can also lock an environment, but not a meaningful one.
lockEnvironment(e)
rm(vName, envir = e)
# Error in rm(vName, envir = e) :
# cannot remove bindings from a locked environment
Have a look at help(bindenv), it's a good read.
Update: since you mentioned you might be wanting to make assignments at run time rather than at load time, have a read of help(globalVariables) -- it's another best-seller. From that page:
For globalVariables, the names supplied are of functions or other objects that should be regarded as defined globally when the check tool is applied to this package. The call to globalVariables will be included in the package's source. Repeated calls in the same package accumulate the names of the global variables.
I don't know if it helps after all these years. I had the same problem, and what I did to solve it was:
utils::globalVariables(c("global_var"))
Put this R code somewhere inside the R directory (save it as an R file). Whenever you assign a global variable, also assign it locally, like so:
global_var<<-1
global_var<-1
That worked for me.
I am creating an R package containing a dataset and an R function which uses the data.
The R function looks like this:
myFun <- function(iobs){
data(MyData)
return(MyData[iobs,])
}
When I do the usual "R CMD check myPack" business, it gives me a note saying:
* checking R code for possible problems ... NOTE
myFun: no visible binding for global variable ‘MyData’
Is there a way to fix this problem?
You can use lazy-loading for this.
Just put
LazyData: yes
in your DESCRIPTION file and remove
data(MyData)
from your function.
Due to lazy-loading, your MyData object will be available in your namespace, so there is no need to call data().
Two alternatives to the lazy-data approach. Both rely on using the list argument to data():
data(list = 'MyData')
Define it as a default argument of the function (perhaps not ideal, as the caller can then change it):
myFun <- function(iobs, myData = get(data(list = 'MyData'))){
  # data() loads the dataset and returns its *name*; get() fetches the object
  return(myData[iobs, ])
}
Load into an empty environment then extract using [[.
myFun2 <- function(iobs){
e <- new.env(parent = emptyenv())
data(list='MyData', envir = e)
e[['MyData']][iobs,]
}
Note that
e$MyData[iobs, ] should also work.
I would also suggest using drop = FALSE as safe practice to retain the same class as MyData, e.g.
MyData[iobs, , drop = FALSE]. This may not be an issue given the specifics of this function and the structure of MyData, but it is good programming practice, especially within packages where you want robust code.
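A quick illustration of why drop = FALSE is the safe choice (a throwaway data frame, just for demonstration):
df <- data.frame(a = 1:3, b = letters[1:3])
class(df[1, , drop = FALSE])
[1] "data.frame"
class(df[1, , drop = TRUE])
[1] "list"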