Manipulate data.table objects within user defined function - r

I'm trying to write functions that use data.table methods to add and edit columns by reference. However, the same code that work in the console does not work when called from within a function. Here is a simple examples:
> dt <- data.table(alpha = c("a","b","c"), numeric = 1:3)
> foo <- function(x) {
print(is.data.table(x))
x[,"uppercase":=toupper(alpha)]
}
When I call this function, I get the following error.
> test = foo(dt)
[1] TRUE
Error in `:=`("uppercase", toupper(alpha)) :
Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are
defined for use in j, once only and in particular ways. See help(":=").
Yet if I type the same code directly into the console, it works perfectly fine.
> dt[,"uppercase":=toupper(alpha)]
> dt
alpha numeric uppercase
1: a 1 A
2: b 2 B
3: c 3 C
I've scoured stackoverflow and the only clues I could find suggest that the function might be looking for alpha in a different environment or parent frame.
EDIT: More information to reproduce the error. The function is not within a package, but declared directly in the global environment. I didn't realize this before, but in order to reproduce the error I need to dump the function to a file, then load it. I've saved all of my personal functions via dput and load them into R with dget, so that's why I get this error often.
> dput(foo, "foo.R")
> foo <- dget("foo.R")
> foo(dt)
Error in `:=` etc...

This problem is a different flavor of the one described in the post Function on data.table environment errors. It's not exactly a problem, just how dget is designed. But for those curious, this happens because dget assigns the object to parent environment base, and the namespace base isn't data.table aware.
If x is a function the associated environment is stripped. Hence scoping information can be lost.
One workaround is to assign the function to the global enviornment:
> environment(foo) <- .GlobalEnv
But I think the best solution here is to use saveRDS to transfer R objects, which is what ?dget recommends.

Related

Unit testing functions with global variables in R

Preamble: package structure
I have an R package that contains an R/globals.R file with the following content (simplified):
utils::globalVariables("COUNTS")
Then I have a function that simply uses this variable. For example, R/addx.R contains a function that adds a number to COUNTS
addx <- function(x) {
COUNTS + x
}
This is all fine when doing a devtools::check() on my package, there's no complaining about COUNTS being out of the scope of addx().
Problem: writing a unit test
However, say I also have a tests/testthtat/test-addx.R file with the following content:
test_that("addition works", expect_gte(fun(1), 1))
The content of the test doesn't really matter here, because when running devtools::test() I get an "object 'COUNTS' not found" error.
What am I missing? How can I correctly write this test (or setup my package).
What I've tried to solve the problem
Adding utils::globalVariables("COUNTS") to R/addx.R, either before, inside or after the function definition.
Adding utils::globalVariables("COUNTS") to tests/testthtat/test-addx.R in all places I could think of.
Manually initializing COUNTS (e.g., with COUNTS <- 0 or <<- 0) in all places of tests/testthtat/test-addx.R I could think of.
Reading some examples from other packages on GitHub that use a similar syntax (source).
I think you misunderstand what utils::globalVariables("COUNTS") does. It just declares that COUNTS is a global variable, so when the code analysis sees
addx <- function(x) {
COUNTS + x
}
it won't complain about the use of an undefined variable. However, it is up to you to actually create the variable, for example by an explicit
COUNTS <- 0
somewhere in your source. I think if you do that, you won't even need the utils::globalVariables("COUNTS") call, because the code analysis will see the global definition.
Where you would need it is when you're doing some nonstandard evaluation, so that it's not obvious where a variable comes from. Then you declare it as a global, and the code analysis won't worry about it. For example, you might get a warning about
subset(df, Col1 < 0)
because it appears to use a global variable named Col1, but of course that's fine, because the subset() function evaluates in a non-standard way, letting you include column names without writing df$Col.
#user2554330's answer is great for many things.
If I understand correctly, you have a COUNTS that needs to be updateable, so putting it in the package environment might be an issue.
One technique you can use is the use of local environments.
Two alternatives:
If it will always be referenced in one function, it might be easiest to change the function from
myfunc <- function(...) {
# do something
COUNTS <- COUNTS + 1
}
to
myfunc <- local({
COUNTS <- NA
function(...) {
# do something
COUNTS <<- COUNTS + 1
}
})
What this does is create a local environment "around" myfunc, so when it looks for COUNTS, it will be found immediately. Note that it reassigns using <<- instead of <-, since the latter would not update the different-environment-version of the variable.
You can actually access this COUNTS from another function in the package:
otherfunc <- function(...) {
COUNTScopy <- get("COUNTS", envir = environment(myfunc))
COUNTScopy <- COUNTScopy + 1
assign("COUNTS", COUNTScopy, envir = environment(myfunc))
}
(Feel free to name it COUNTS here as well, I used a different name to highlight that it doesn't matter.)
While the use of get and assign is a little inconvenient, it should only be required twice per function that needs to do this.
Note that the user can get to this if needed, but they'll need to use similar mechanisms. Perhaps that's a problem; in my packages where I need some form of persistence like this, I have used convenience getter/setter functions.
You can place an environment within your package, and then use it like a named list within your package functions:
E <- new.env(parent = emptyenv())
myfunc <- function(...) {
# do something
E$COUNTS <- E$COUNTS + 1
}
otherfunc <- function(...) {
E$COUNTS <- E$COUNTS + 1
}
We do not need the get/assign pair of functions, since E (a horrible name, chosen for its brevity) should be visible to all functions in your package. If you don't need the user to have access, then keep it unexported. If you want users to be able to access it, then exporting it via the normal package mechanisms should work.
Note that with both of these, if the user unloads and reloads the package, the COUNTS value will be lost/reset.
I'll list provide a third option, in case the user wants/needs direct access, or you don't want to do this type of value management within your package.
Make the user provide it at all times. For this, add an argument to every function that needs it, and have the user pass an environment. I recommend that because most arguments are passed by-value, but environments allow referential semantics (pass by-reference).
For instance, in your package:
myfunc <- function(..., countenv) {
stopifnot(is.environment(countenv))
# do something
countenv$COUNT <- countenv$COUNT + 1
}
otherfunc <- function(..., countenv) {
countenv$COUNT <- countenv$COUNT + 1
}
new_countenv <- function(init = 0) {
E <- new.env(parent = emptyenv())
E$COUNT <- init
E
}
where new_countenv is really just a convenience function.
The user would then use your package as:
mycount <- new_countenv()
myfunc(..., countenv = mycount)
otherfunc(..., countenv = mycount)

R matrix/dataframe doesn't seem to have an environment

I seem to be having the same issue as is seen here so I started checking the environments that my frames/matrices are in. I have a character matrix, and a table that was imported as a list. I have been able to create a user-defined function that I have debugged and I can confirm that it runs through step by step assigning values in the character matrix to those needing change in the list.
{
i = 1
j = NROW(v)
while (i < j) {
if (v[i] %in% Convert[, 1]) {
n <- match(v[i], Convert[, 1])
v[i] <- Convert[n, 2]
}
i = i + 1
}
}
That is the code in case you need to see what I am doing.
The problem is whenever I check the environment of either of the list or the matrix, I get NULL (using environment()). I tried using assign() to create a new matrix. It seems, based on the link above, that this is an environment issue, but if the lists/matrices used have no environment, what is one to do?
Post Note: I have tried converting these to different formats (using as.character or as.list), but I don't know if this is even relevant if I can't get the environment issue resolved above.
environment() works only for functions and not for variables.
In fact the environment function gives the enclosing environment that makes sense only for function and not the binding environment you are interested in case of a variable.
If you want the binding environment use pryr package
library(pryr)
where("V")
Here an example
e<-new.env()
e$test<-function(x){x}
environment(e$test)
yu can see the environment here is the global environment because you defined the function there, but obviously the binding environment(that is the environment where you find the name of the variabile) is e.
http://adv-r.had.co.nz/Environments.html
Here to understand better the problem

No visible binding for '<<-' Assignment

I have solved variants of the "no visible binding" notes that one gets when checking their package. However, I am unable to solve the situation when applied to the '<<-' assignment.
Specifically, I had defined and used a local variable in several functions, such as:
fName = function(df){
eval({vName<<-0}, envir=environment(fName))
}
However when I run check() from devtools, I get the error:
fName: no visible binding for '<<-' assignment to 'vName'
So, I tried using a different syntax, as follows:
fName = function(df){
assign("vName",0, envir=environment(fName))
}
But got the check() error:
cannot add bindings to a locked environment
When I tried:
fName = function(df){
assign("vName",0, envir=environment(-1))
}
I got the error:
use of NULL environment is defunct
So, my question is how I can accomplish the assignment operator <<- without getting a note in the check() from devtools.
Thank you.
The easy answer is - Don't use <<- in your package. One Alterna way you can make assignments to an environment (but not a meaningful one) is to create a locked binding.
e <- new.env()
e$vName <- 0L
lockBinding("vName", e)
vName
# Error: object 'vName' not found
with(e, vName)
# [1] 0
e$vName <- 5
# Error in e$vName <- 5 : cannot change value of locked binding for 'vName'
You can also lock an environment, but not a meaningful one.
lockEnvironment(e)
rm(vName, envir = e)
# Error in rm(vName, envir = e) :
# cannot remove bindings from a locked environment
Have a look at help(bindenv), it's a good read.
Updated Since you mentioned you might be waiting to make assignment rather than at load times, have a read of help(globalVariables) It's another best-seller at ?
For globalVariables, the names supplied are of functions or other objects that should be regarded as defined globally when the check tool is applied to this package. The call to globalVariables will be included in the package's source. Repeated calls in the same package accumulate the names of the global variables.
i dont know if it helps after that many years.I had the same problem,and what i did to solve it is :
utils::globalVariables(c("global_var"))
Write this r code somewhere inside the R directory(save it as R file).Whenever you assign a global var assign it also locally like so:
global_var<<-1
global_var<-1
That worked for me.

Examining contents of .rdata file by attaching into a new environment - possible?

I am interested in listing objects in an RDATA file and loading only selected objects, rather than the whole set (in case some may be big or may already exist in the environment). I'm not quite clear on how to do this when there are conflicts in names, as attach() doesn't work as nicely.
1: For examining the contents of an R data file without loading it: This question is similar, but different from, the one asked at listing contents of an R data file without loading
In that case, the solution offered was:
attach(filename)
ls(pos = 2)
detach()
If there are naming conflicts between objects in the file and those in the global environment, this warning appears:
The following object(s) are masked _by_ '.GlobalEnv':
I tried creating a new environment, but I cannot seem to attach into that.
For instance, this produces the same error:
lsfile <- function(filename){
tmpEnv <- new.env()
evalq(attach(filename), envir = tmpEnv)
tmpls <- ls(pos = 2)
detach()
return(tmpls)
}
lsfile(filename)
Maybe I've made a mess of things with evalq (or eval). Is there some other way to avoid the naming conflict?
2: If I want to access an object - if there are no naming conflicts, I can just work with the one from the .rdat file, or copy it to a new one. If there are conflicts, how does one access the object in the file's namespace?
For instance, if my file is "sample.rdat", and the object is surveyData, and a surveyData object already exists in the global environment, then how can I access the one from the file:sample.rdat namespace?
I currently solve this problem by loading everything into a temporary environment, and then copy out what's needed, but this is inefficient.
Since this question has just been referenced let's clarify two things:
attach() simply calls load() so there is really no point in using it instead of load
if you want selective access to prevent masking it's much easier to simply load the file into a new environment:
e = local({load("foo.RData"); environment()})
You can then use ls(e) and access contents like e$x. You can still use attach on the environment if you really want it on the search path.
FWIW .RData files have no index (the objects are stored in one big pairlist), so you can't list the contained objects without loading. If you want convenient access, convert it to the lazy-load format instead which simply adds an index so each object can be loaded separately (see Get specific object from Rdata file)
I just use an env= argument to load():
> x <- 1; y <- 2; z <- "foo"
> save(x, y, z, file="/tmp/foo.RData")
> ne <- new.env()
> load(file="/tmp/foo.RData", env=ne)
> ls(env=ne)
[1] "x" "y" "z"
> ne$z
[1] "foo"
>
The cost of this approach is that you do read the whole RData file---but on the other hand that seems to be unavoidable anyway as no other method seems to offer a list of the 'content' of such a file.
You can suppress the warning by setting warn.conflicts=FALSE on the call to attach. If an object is masked by one in the global environment, you can use get to retreive it from your attached data.
x <- 1:10
save(x, file="x.rData")
#attach("x.rData", pos=2, warn.conflicts=FALSE)
attach("x.rData", pos=2)
(x <- 1)
# [1] 1
(x <- get("x", pos=2))
# [1] 1 2 3 4 5 6 7 8 9 10
Thanks to #Dirk and #Joshua.
I had an epiphany. The command/package foreach with SMP or MC seems to produce environments that only inherit, but do not seem to conflict with, the global environment.
lsfile <- function(list_files){
aggregate_ls = foreach(ix = 1:length(list_files)) %dopar% {
attach(list_files[ix])
tmpls <- ls(pos = 2)
return(tmpls)
}
return(aggregate_ls)
}
lsfile("f1.rdat")
lsfile(dir(pattern = "*rdat"))
This is useful to me because I can now parallelize this. This is a bare-bones version, and I will modify it to give more detailed information, but so far it seems to be the only way to avoid conflicts, even without ignore.
So, question #1 can be resolved by either ignoring the warnings (as #Joshua suggested) or by using whatever magic foreach summons.
For part 2, loading an object, I think #Joshua has the right idea - "get" will do.
The foreach magic can also work, by using the .noexport option. However, this has risks: whatever isn't specifically excluded will be inherited/exported from the global environment (I could do ls(), but there's always the possibility of attached datasets). For safety, this means that get() must still be used to avoid the risk of a naming conflict. Loading into a subenvironment avoids the naming conflict, but doesn't avoid the loading of unnecessary objects.
#Joshua's answer is far simpler than my foreach detour.

Protecting function names in R

Is it possible in R to protect function names (or variables in general) so that they cannot be masked.
I recently spotted that this can be a problem when creating a data frame with the name "new", which masked a function used by lmer and thus stopped it working. (Recovery is easy once you know what the problem is, here "rm(new)" did it.)
There is an easy workaround for your problem, without worrying about protecting variable names (though playing with lockBinding does look fun). If a function becomes masked, as in your example, it is still possible to call the masked version, with the help of the :: operator.
In general, the syntax is packagename::variablename.
(If the function you want has not been exported from the package, then you need three colons instead, :::. This shouldn't apply in this case however.)
Maybe use environments! This is a great way to separate namespaces. For example:
> a <- new.env()
> assign('printer', function(x) print(x), envir=a)
> get('printer', envir=a)('test!')
[1] "test!"
#hdallazuanna recommends (via Twitter)
new <- 1
lockBinding('new', globalenv())
this makes sense when the variable is user created but does not, of course, prevent overwriting a function from a package.
I had the reverse problem from the OP, and I wanted to prevent my custom functions in .Rprofile from being overridden when I defined a variable with the same name as a function, but I ended up putting my functions to ~/.R.R and I added these lines to .Rprofile:
if("myfuns"%in%search())detach("myfuns")
source("~/.R.R",attach(NULL,name="myfuns"))
From the help page of attach:
One useful ‘trick’ is to use ‘what = NULL’ (or equivalently a
length-zero list) to create a new environment on the search path
into which objects can be assigned by assign or load or
sys.source.
...
## create an environment on the search path and populate it
sys.source("myfuns.R", envir = attach(NULL, name = "myfuns"))

Resources