Calling objects from a list in R

I'm having some trouble accessing an element of a list that I store in a variable created within my for loop.
for (i in 1:10)
{
  # create variables and run through function
  varName <- paste("var", i, sep = "")
  assign(varName, rnmf(data, k = i, showprogress = FALSE))
  # create new variable using object 3 from varName output
  varNF <- paste("varNF", i, sep = "")
  assign(varNF, (data - varName[[3]])^2)
}
My problem is with the second part of my for loop. I am attempting to use the third object from the output of my first created variable, in the calculation of my second variable. If I use varName[[3]] I get "subscript out of bounds", and if I use varName$fit, I get "$ operator is invalid for atomic vectors".
It looks like varName in my second part is not referring to the incrementing names (var1, var2, var3, etc.) that I am creating, but to the character variable varName itself. To try and get around that, I instead tried
assign(varNF, (data-get(paste("var",i,"[[3]]",sep="")))^2)
Which gave me the error "object 'var1[[3]]' not found". But, if I simply call var1[[3]] in my R console, it does exist. I'm not quite sure where to go from here. Any help would be great!

A very useful rule of thumb in R is:
If you find yourself using either assign() or get() in your code, it's a strong indicator that you are approaching the problem with the wrong tools. If you still think you should use those functions, think again. The tools that you are missing are most likely R lists and subsetting of lists.
(and tell everyone that you know about the above)
In your case, I would do something like:
library("rNMF")
[...]
var <- list()
varNF <- list()
for (i in 1:10) {
  res <- rnmf(data, k = i, showprogress = FALSE)
  var[[i]] <- res                    # keep the full rnmf result
  varNF[[i]] <- (data - res$fit)^2   # squared residuals from its fit
}
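Afterwards, each result is just a list element, so for example (assuming the loop above has run):
var[[3]]$fit    # the fit component of the third rnmf result
varNF[[3]]      # the corresponding squared residuals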

Related

Unit testing functions with global variables in R

Preamble: package structure
I have an R package that contains an R/globals.R file with the following content (simplified):
utils::globalVariables("COUNTS")
Then I have a function that simply uses this variable. For example, R/addx.R contains a function that adds a number to COUNTS
addx <- function(x) {
  COUNTS + x
}
This is all fine when doing a devtools::check() on my package, there's no complaining about COUNTS being out of the scope of addx().
Problem: writing a unit test
However, say I also have a tests/testthat/test-addx.R file with the following content:
test_that("addition works", expect_gte(fun(1), 1))
The content of the test doesn't really matter here, because when running devtools::test() I get an "object 'COUNTS' not found" error.
What am I missing? How can I correctly write this test (or set up my package)?
What I've tried to solve the problem
Adding utils::globalVariables("COUNTS") to R/addx.R, either before, inside or after the function definition.
Adding utils::globalVariables("COUNTS") to tests/testthat/test-addx.R in all places I could think of.
Manually initializing COUNTS (e.g., with COUNTS <- 0 or <<- 0) in all places of tests/testthat/test-addx.R I could think of.
Reading some examples from other packages on GitHub that use a similar syntax (source).
I think you misunderstand what utils::globalVariables("COUNTS") does. It just declares that COUNTS is a global variable, so when the code analysis sees
addx <- function(x) {
  COUNTS + x
}
it won't complain about the use of an undefined variable. However, it is up to you to actually create the variable, for example by an explicit
COUNTS <- 0
somewhere in your source. I think if you do that, you won't even need the utils::globalVariables("COUNTS") call, because the code analysis will see the global definition.
Where you would need it is when you're doing some nonstandard evaluation, so that it's not obvious where a variable comes from. Then you declare it as a global, and the code analysis won't worry about it. For example, you might get a warning about
subset(df, Col1 < 0)
because it appears to use a global variable named Col1, but of course that's fine, because the subset() function evaluates in a non-standard way, letting you include column names without writing df$Col1.
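For example, a package using that kind of non-standard evaluation might declare the name in R/globals.R, just as the question already does for COUNTS (Col1 here is only the illustrative column from above):
# R/globals.R -- declare names used via non-standard evaluation so that
# R CMD check does not flag them as undefined global variables
utils::globalVariables(c("Col1"))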
@user2554330's answer is great for many things.
If I understand correctly, you have a COUNTS that needs to be updateable, so putting it in the package environment might be an issue.
One technique you can use is the use of local environments.
Two alternatives:
If it will always be referenced in one function, it might be easiest to change the function from
myfunc <- function(...) {
  # do something
  COUNTS <- COUNTS + 1
}
to
myfunc <- local({
  COUNTS <- NA
  function(...) {
    # do something
    COUNTS <<- COUNTS + 1
  }
})
What this does is create a local environment "around" myfunc, so when it looks for COUNTS, it will be found immediately. Note that it reassigns using <<- instead of <-, since the latter would not update the different-environment-version of the variable.
You can actually access this COUNTS from another function in the package:
otherfunc <- function(...) {
  COUNTScopy <- get("COUNTS", envir = environment(myfunc))
  COUNTScopy <- COUNTScopy + 1
  assign("COUNTS", COUNTScopy, envir = environment(myfunc))
}
(Feel free to name it COUNTS here as well, I used a different name to highlight that it doesn't matter.)
While the use of get and assign is a little inconvenient, it should only be required twice per function that needs to do this.
Note that the user can get to this if needed, but they'll need to use similar mechanisms. Perhaps that's a problem; in my packages where I need some form of persistence like this, I have used convenience getter/setter functions.
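A hedged sketch of what such getter/setter helpers could look like (the names are mine, purely illustrative):
# hypothetical convenience accessors for the COUNTS kept in myfunc's
# enclosing environment
get_counts <- function() {
  get("COUNTS", envir = environment(myfunc))
}
set_counts <- function(value) {
  assign("COUNTS", value, envir = environment(myfunc))
  invisible(value)
}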
You can place an environment within your package, and then use it like a named list within your package functions:
E <- new.env(parent = emptyenv())
E$COUNTS <- 0   # initialize the counter so the first increment works

myfunc <- function(...) {
  # do something
  E$COUNTS <- E$COUNTS + 1
}

otherfunc <- function(...) {
  E$COUNTS <- E$COUNTS + 1
}
We do not need the get/assign pair of functions, since E (a horrible name, chosen for its brevity) should be visible to all functions in your package. If you don't need the user to have access, then keep it unexported. If you want users to be able to access it, then exporting it via the normal package mechanisms should work.
Note that with both of these, if the user unloads and reloads the package, the COUNTS value will be lost/reset.
I'll provide a third option, in case the user wants/needs direct access, or you don't want to do this type of value management within your package.
Make the user provide it at all times. For this, add an argument to every function that needs it, and have the user pass an environment. I recommend an environment because most arguments are passed by value, whereas environments have reference semantics (effectively pass-by-reference).
For instance, in your package:
myfunc <- function(..., countenv) {
  stopifnot(is.environment(countenv))
  # do something
  countenv$COUNT <- countenv$COUNT + 1
}

otherfunc <- function(..., countenv) {
  countenv$COUNT <- countenv$COUNT + 1
}

new_countenv <- function(init = 0) {
  E <- new.env(parent = emptyenv())
  E$COUNT <- init
  E
}
where new_countenv is really just a convenience function.
The user would then use your package as:
mycount <- new_countenv()
myfunc(..., countenv = mycount)
otherfunc(..., countenv = mycount)

R parLapply: How to (or Can we) access an object within the parallel code

I am trying to use parLapply to run a custom function. Since my actual code and data is not very reader friendly, I am creating a pseudo code for reference. I do the following:
a) First, I create a custom function. This function takes an argument say "Argument1". Argument1 is a list object which is what I use to run the parLapply on later.
b) Inside the function, based on Argument1, I create a subset called subset_data (subsetting on the full dataset which is supplied while calling parLapply).
c) After getting subset_data, I obtain a list of unique items for Variable2 and then further subset it depending on the number of unique items in Variable2.
d) Finally I run a function (SomeOtherFunction) which takes subset_data2 as the argument.
SomeCustomFunction = function(Argument1){
  subset_data = OriginalData[which(OriginalData$Variable1 == Argument1), ]
  some_other_variable = unique(subset_data$Variable2)
  for (object in some_other_variable){
    subset_data2 = subset_data[which(subset_data$Variable2 == object), ]
    FinalOutput = SomeOtherFunction(subset_data2)
  }
  return(FinalOutput)
}

SomeOtherFunction = function(subset_data2){
  # Do some computation here
}
Next I can create clusters in this way:
cl=parallel::makeCluster(2,type="PSOCK")
registerDoParallel(cl)
And supply the objects Argument1, OriginalData by calling clusterExport and then finally run parLapply by supplying SomeCustomFunction and a list for Argument1 (suppose Argument1_list).
clusterExport(cl=cl, list("Argument1","OriginalData"),envir=environment())
zz=parLapply(cl=cl,fun=SomeCustomFunction,Argument1=Argument1_list)
However, in this case, when I run parLapply, I get an error saying
Error in get(name, envir = envir) : object 'subset_data2' not found
In this case, I was assuming that since subset_data2 is being created within the first function, the object subset_data2 will get supplied automatically. Clearly this is not happening.
Is there a way for me to supply this 2nd subset (subset_data2) within the function SomeCustomFunction, without passing it to the cluster when calling clusterExport?
If the question is not clear, please let me know and I can modify it accordingly. Thanks in advance.
P.S. I read this question: using parallel's parLapply: unable to access variables within parallel code, but in my case I do not call parLapply inside my function.
In the related question you mention, the top answer passes clusterExport a character vector of variable names, whereas you pass a list. Also, help(clusterExport) reveals: "varlist: character vector of names of objects to export".
Also, double-check the quoting in list("Argument1","OriginalData"); a missing closing quote there would cause a syntax error, but I'm guessing that's only an issue in the sample code you posted, not in your real code.
PS: It's a step in the right direction that you put some code, but your question will get more responses if you put sample data and code that can be directly pasted and run to reproduce the error.
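For reference, a minimal sketch of what the corrected calls might look like, assuming OriginalData, SomeOtherFunction, SomeCustomFunction and Argument1_list exist in the calling environment (exactly which objects need exporting depends on what SomeCustomFunction references):
library(parallel)

cl <- makeCluster(2, type = "PSOCK")
# varlist must be a character vector of object names, not a list;
# Argument1 itself does not need exporting because it is supplied via X
clusterExport(cl, varlist = c("OriginalData", "SomeOtherFunction"), envir = environment())
zz <- parLapply(cl, X = Argument1_list, fun = SomeCustomFunction)
stopCluster(cl)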

R: syntax of svychisq and summary on a svytable

I'm working on a "svydesigned" database and having trouble using svychisq.
Here's what I tried which worked:
AxB<-svytable(~A+B, surveydesign, Ntotal=100)
AxB
svychisq(~A+B, surveydesign)
And what I would like to make work:
svychisq(AxB, surveydesign)
returns "$ operator is invalid for atomic vectors"
svychisq(~AxB, surveydesign)
returns "Error in formula [[2]][[2]] : Object of type symbol is not subsettable"
summary(AxB)
returns the table and the chisq, but with integers in the table (so only 0 and 1 since my values are in 0.xx format due to Ntotal=100)
What bugs me is that the help states that "summary on svytable calls svychisq". I'm still new to R syntax and can't figure out how to make svychisq return a result using the table instead of typing again the whole formula I just used to create the table.
I'd also like to be able to see the decimals when using "summary"; is there a way? I tried to use digits=4 but nothing changed.
Thanks.
svychisq expects a formula and a svydesign object as arguments. It is just the way it was created, you won't be able to feed it a svytable argument. You could work around by writing your own function:
FOO <- function(x){
  # retrieve the original svytable() call stored in the object:
  # element 2 is the formula, element 3 is the design
  temp <- as.character(attr(x, "call"))[2:3]
  svychisq(as.formula(temp[1]), design = eval(parse(text = temp[2])))
}
You feed it a svytable object, it retrieves the call of the object and feeds it back to svychisq.
FOO(AxB) should work as expected.
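A quick way to try it out, using the survey package's built-in api example data (my choice of example data, not from the question):
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
AxB <- svytable(~sch.wide + comp.imp, dclus1, Ntotal = 100)
FOO(AxB)   # should match svychisq(~sch.wide + comp.imp, dclus1)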

error using loop on custom function

I am having some trouble debugging this issue; can someone please let me know where I am going wrong?
I have created this simple function, which will be used on multiple data frames to get the same information:
TransCleaning <- function(df){
  # requires dplyr for select() and filter()
  x <- select(df, a, b, c, d, e, f, g) %>% filter(e != "$0.00")
  return(x)
}
Since the names of the data frames this function will be used on should stay the same, I could easily just hard-code it, but I wanted to use a loop.
So I make a list of my data frames after making their names shorter:
files2 <- c(substr(files, 5, 10))
Then I try to run through this loop:
for(i in 1:length(files2))
{
  clean <- TransCleaning(files2[i])
  assign(files2[i], clean)
}
I get the following error. It has something to do with calling the files2 list, because
TransCleaning(files2[1])
does not work either, while
TransCleaning(df)
works fine.
The error I am getting when I run the loop and TransCleaning(files2[1]) is as follows:
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character"
files2 contains only the names of the data frames as strings, not the data frames themselves, so we can use get to fetch each object by name:
for(i in 1:length(files2)){
  clean <- TransCleaning(get(files2[i]))
  assign(files2[i], clean)
}
Though it is better not to create objects in the global environment: the data frames can be read directly into a list and the function applied over that list, instead of having lots of objects in the global environment.
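For example, a hedged sketch of that list-based approach, assuming the data frames named in files2 already exist and TransCleaning is defined as in the question:
df_list <- mget(files2)                       # collect the data frames into a named list
clean_list <- lapply(df_list, TransCleaning)  # apply the cleaning function to each one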

Why can't I pass a dataset to a function?

I'm using the package glmulti to fit models to several datasets. Everything works if I fit one dataset at a time.
So for example:
output <- glmulti(y~x1+x2,data=dat,fitfunction=lm)
works just fine.
However, if I create a wrapper function like so:
analyze <- function(dat)
{
  out <- glmulti(y~x1+x2, data=dat, fitfunction=lm)
  return(out)
}
it simply doesn't work. The error I get is
error in evaluating the argument 'data' in selecting a method for function 'glmulti'
Unless there is a data frame named dat in my workspace, it doesn't work; results <- lapply(list_of_datasets, analyze) fails in the same way.
So what gives? Without my said wrapper, I can't lapply a list of datasets through this function. If anyone has thoughts or ideas on why this is happening or how I can get around it, that would be great.
example 2:
dat=list_of_data[[1]]
analyze(dat)
works fine. So in a sense it is ignoring the argument and just literally looking for a data frame named dat. It behaves the same no matter what I call it.
I guess this is -yet another- problem due to the definition of environments in the parse tree of S4 methods (one of the reasons why I am not a big fan of S4...).
It can be shown by adding quotes around the dat:
> analyze <- function(dat)
+ {
+ out<- glmulti(y~x1+x2,data="dat",fitfunction=lm)
+ return (out)
+ }
> analyze(test)
Initialization...
Error in eval(predvars, data, env) : invalid 'envir' argument
You should in the first place send this information to the maintainers of the package, as they know how they deal with the environments internally. They'll have to adapt the functions.
A -very dirty- workaround for yourself, is to put "dat" in the global environment and delete it afterwards.
analyze <- function(dat)
{
  assign("dat", dat, envir = .GlobalEnv)  # put dat in the global env
  out <- glmulti(y~x1+x2, data=dat, fitfunction=lm)
  remove(dat, envir = .GlobalEnv)         # delete dat again from the global env
  return(out)
}
EDIT:
Just for clarity, this is really about the worst solution possible, but I couldn't manage to find anything better. If somebody else gives you a solution where you don't have to touch your global environment, by all means use that one.
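With the workaround in place, iterating over several datasets as in the question should then work, e.g.:
# assuming list_of_datasets from the question
results <- lapply(list_of_datasets, analyze)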
