Error using loop on custom function - R

I am having some trouble debugging this issue; can someone please let me know where I am going wrong?
I have created this simple function, which will be used on multiple data frames to extract the same information:
TransCleaning <- function(df) {
  # requires dplyr for select() and filter()
  x <- select(df, a, b, c, d, e, f, g) %>% filter(e != "$0.00")
  return(x)
}
Since the names of the data frames this function will be used on should stay the same, I could easily just hard-code it, but I wanted a loop.
So I make a vector of my data frame names after shortening them:
files2 <- substr(files, 5, 10)
Then I try to run through this loop:
for (i in 1:length(files2)) {
  clean <- TransCleaning(files2[i])
  assign(files2[i], clean)
}
I get the following error. It has something to do with calling the files2 vector, because
TransCleaning(files2[1])
does not work either, while
TransCleaning(df)
works fine.
The error I get when I run the loop or TransCleaning(files2[1]) is as follows:
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character"

files2 contains the names of the data frames as character strings, not the data frames themselves, so use get() to retrieve each object by name:
for (i in 1:length(files2)) {
  clean <- TransCleaning(get(files2[i]))
  assign(files2[i], clean)
}
That said, it is better not to create objects in the global environment: the data can be read directly into a list and the function applied to the list, instead of filling the global environment with lots of objects.
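For example, a minimal sketch of that list-based approach, assuming files2 holds the (shortened) names of data frames that already exist in the workspace:
# gather the data frames into a named list instead of assign()-ing them
df_list <- mget(files2)
# apply the cleaning function to every element; results stay in one list
clean_list <- lapply(df_list, TransCleaning)
# an individual result is then clean_list[["name"]] or clean_list$name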


R parLapply: How to (or Can we) access an object within the parallel code

I am trying to use parLapply to run a custom function. Since my actual code and data are not very reader-friendly, I am creating pseudocode for reference. I do the following:
a) First, I create a custom function. This function takes an argument, say "Argument1". Argument1 is a list object, which is what I run parLapply over later.
b) Inside the function, based on Argument1, I create a subset called subset_data (subsetting on the full dataset which is supplied while calling parLapply).
c) After getting subset_data, I obtain a list of unique items for Variable2 and then further subset it depending on the number of unique items in Variable2.
d) Finally I run a function (SomeOtherFunction) which takes subset_data2 as the argument.
SomeCustomFunction <- function(Argument1) {
  subset_data <- OriginalData[which(OriginalData$Variable1 == Argument1), ]
  some_other_variable <- unique(subset_data$Variable2)
  for (object in some_other_variable) {
    subset_data2 <- subset_data[which(subset_data$Variable2 == object), ]
    FinalOutput <- SomeOtherFunction(subset_data2)
  }
  return(FinalOutput)
}
SomeOtherFunction <- function(subset_data2) {
  # do some computation here
}
Next I create the clusters in this way:
library(doParallel)  # provides registerDoParallel()
cl <- parallel::makeCluster(2, type = "PSOCK")
registerDoParallel(cl)
I then supply the objects Argument1 and OriginalData by calling clusterExport, and finally run parLapply by supplying SomeCustomFunction and a list for Argument1 (say, Argument1_list).
clusterExport(cl = cl, list("Argument1", "OriginalData"), envir = environment())
zz <- parLapply(cl = cl, fun = SomeCustomFunction, Argument1 = Argument1_list)
However, in this case, when I run parLapply, I get an error saying
Error in get(name, envir = envir) : object 'subset_data2' not found
In this case, I was assuming that since subset_data2 is being created within the first function, the object subset_data2 will get supplied automatically. Clearly this is not happening.
Is there a way for me to supply this second subset (subset_data2) within the function SomeCustomFunction, without passing it to the cluster when calling clusterExport?
If the question is not clear, please let me know and I can modify it accordingly. Thanks in advance.
P.S. I read this question: using parallel's parLapply: unable to access variables within parallel code, but in my case I do not call parLapply inside my function.
In the related question you mention, the top answer passes clusterExport a character vector of variable names, whereas you pass a list. Indeed, help(clusterExport) says: "varlist: character vector of names of objects to export".
Also, in the original version of your code you were missing a " after Argument1 (list("Argument1,"OriginalData), but I'm guessing that was only a typo in the sample code you posted, not in your real code.
PS: It's a step in the right direction that you put some code, but your question will get more responses if you put sample data and code that can be directly pasted and run to reproduce the error.
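For what it's worth, here is a minimal sketch of the corrected calls under the same setup (OriginalData, SomeCustomFunction, and Argument1_list as in your post). Note that parLapply() passes each element of its X argument as the first argument of fun, so Argument1 itself never needs to be exported:
library(parallel)
cl <- makeCluster(2, type = "PSOCK")
# varlist must be a character vector of object names; SomeOtherFunction
# must be exported too, because the workers call it inside SomeCustomFunction
clusterExport(cl = cl, varlist = c("OriginalData", "SomeOtherFunction"),
              envir = environment())
# each element of Argument1_list becomes the Argument1 of one call
zz <- parLapply(cl = cl, X = Argument1_list, fun = SomeCustomFunction)
stopCluster(cl)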

Dispatch of `rbind` and `cbind` for a `data.frame`

Background
The dispatch mechanism of the R functions rbind() and cbind() is non-standard. I explored some possibilities of writing rbind.myclass() or cbind.myclass() functions when one of the arguments is a data.frame, but so far I do not have a satisfactory approach. This post concentrates on rbind, but the same holds for cbind.
Problem
Let us create an rbind.myclass() function that simply echoes when it has been called.
rbind.myclass <- function(...) "hello from rbind.myclass"
We create an object of class myclass; the following calls to rbind then all properly dispatch to rbind.myclass():
a <- "abc"
class(a) <- "myclass"
rbind(a, a)
rbind(a, "d")
rbind(a, 1)
rbind(a, list())
rbind(a, matrix())
However, when one of the arguments is a data frame (it need not be the first one), rbind() will call base::rbind.data.frame() instead:
rbind(a, data.frame())
This behavior is a little surprising, but it is actually documented in the dispatch section of ?rbind. The advice given there is:
If you want to combine other objects with data frames,
it may be necessary to coerce them to data frames first.
In practice, this advice may be difficult to implement. Conversion to a data frame may remove essential class information. Moreover, a user who is unaware of the advice may be stuck with an error or an unexpected result after issuing the command rbind(a, x).
Approaches
Warn the user
A first possibility is to warn the user that the call to rbind(a, x) should not be made when x is a data frame. Instead, the user of package mypackage should make an explicit call to a hidden function:
mypackage:::rbind.myclass(a, x)
This can be done, but the user has to remember to make the explicit call when needed. Calling the hidden function is something of a last resort, and should not be regular policy.
Intercept rbind
Alternatively, I tried to shield the user by intercepting dispatch. My first try was to provide a local definition of base::rbind.data.frame():
rbind.data.frame <- function(...) "hello from my rbind.data.frame"
rbind(a, data.frame())
rm(rbind.data.frame)
This fails, as rbind() is not fooled into calling rbind.data.frame from the .GlobalEnv and calls the base version as usual.
Another strategy is to override rbind() by a local function, which was suggested in S3 dispatching of `rbind` and `cbind`.
rbind <- function(...) {
  # inherits() avoids errors when the first argument has no class attribute
  if (inherits(list(...)[[1]], "myclass")) {
    rbind.myclass(...)
  } else {
    base::rbind(...)
  }
}
This works perfectly for dispatching to rbind.myclass(), so the user can now type rbind(a, x) for any type of object x.
rbind(a, data.frame())
The downside is that after library(mypackage) we get the message: The following objects are masked from 'package:base': rbind.
While technically everything works as expected, there should be better ways than overriding a base function.
Conclusion
None of the above alternatives is satisfactory. I have read about alternatives using S4 dispatch, but so far I have not located any implementations of the idea. Any help or pointers?
As you mention yourself, using S4 would be one good solution that works nicely. I have not investigated recently with data frames, as I am much more interested in other generalized matrices, in both of my long-time CRAN packages 'Matrix' (= "recommended", i.e. part of every R distribution) and 'Rmpfr'.
There are actually even two different ways:
1) Rmpfr uses the new way of defining methods for the '...' argument of rbind()/cbind(). This is well documented in ?dotsMethods (mnemonic: '...' = dots) and implemented in Rmpfr/R/array.R, line 511 ff. (e.g. https://r-forge.r-project.org/scm/viewvc.php/pkg/R/array.R?view=annotate&root=rmpfr).
2) Matrix uses the older approach of defining (S4) methods for rbind2() and cbind2(): if you read ?rbind, it mentions when rbind2/cbind2 are used. The idea there: "2" means you define S4 methods with a signature for two ("2") matrix-like objects, and rbind/cbind use them recursively on two of their potentially many arguments. A rough sketch of this route follows.
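To give the flavour of 2), here is a rough sketch with a hypothetical class "mymat" (the real methods in Matrix are far more careful); it assumes R >= 3.2.0, where base rbind() falls back to pairwise rbind2() calls when at least one argument is an S4 object (see ?rbind2):
library(methods)
setClass("mymat", slots = c(x = "character"))
setMethod("rbind2", signature("mymat", "data.frame"),
          function(x, y, ...) {
            # hypothetical coercion: represent the S4 object as a
            # one-column data frame, then combine it with the data frame
            rbind(data.frame(x = x@x), y)
          })
# rbind(new("mymat", x = "b"), data.frame(x = "d")) then yields a
# two-row data frame via the rbind2() method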
The dotsMethods approach was suggested by Martin Maechler and implemented in the Rmpfr package. We need to define a new generic, a class, and a method using S4.
setGeneric("rbind", signature = "...")
mychar <- setClass("myclass", slots = c(x = "character"))
b <- mychar(x = "b")
rbind.myclass <- function(...) "hello from rbind.myclass"
setMethod("rbind", "myclass",
function(..., deparse.level = 1) {
args <- list(...)
if(all(vapply(args, is.atomic, NA)))
return( base::cbind(..., deparse.level = deparse.level) )
else
return( rbind.myclass(..., deparse.level = deparse.level))
})
# these work as expected
rbind(b, "d")
rbind(b, b)
rbind(b, matrix())
# this fails in R 3.4.3
rbind(b, data.frame())
Error in rbind2(..1, r) :
no method for coercing this S4 class to a vector
I haven't been able to resolve the error. See
R: Shouldn't generic methods work internally within a package without it being attached?
for a related problem.
As this approach overrides rbind(), we get the warning: The following objects are masked from 'package:base': rbind.
I don't think you're going to be able to come up with something completely satisfying. The best you can do is export rbind.myclass so that users can call it directly without doing mypackage:::rbind.myclass. You can call it something else if you want (dplyr calls its version bind_rows), but if you choose to do so, I'd use a name that evokes rbind, like rbind_myclass; a sketch follows below.
Even if you could get R-core to agree to change the dispatch behavior so that rbind dispatches on its first argument, there would still be cases where users want to rbind multiple objects together with the myclass object somewhere other than first. How else could a call like rbind(df, df, myclass_obj) dispatch to rbind.myclass()?
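A minimal sketch of that route, reusing rbind.myclass() from the question (the guard is only illustrative):
# exported under a name that evokes rbind, so users call it directly;
# works wherever the myclass object sits in the argument list
rbind_myclass <- function(...) {
  if (!any(vapply(list(...), inherits, logical(1), what = "myclass")))
    stop("rbind_myclass() expects at least one myclass argument")
  rbind.myclass(...)
}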
The data.table solution seems dangerous; I would not be surprised if the CRAN maintainers put in a check and disallow this at some point.

Calling objects from list

I'm having some trouble accessing an element of a list stored in a variable created within my for loop.
for (i in 1:10) {
  # create variables and run through function
  varName <- paste("var", i, sep = "")
  assign(varName, rnmf(data, k = i, showprogress = FALSE))
  # create new variable using object 3 from varName output
  varNF <- paste("varNF", i, sep = "")
  assign(varNF, (data - varName[[3]])^2)
}
My problem is with the second part of my for loop. I am attempting to use the third object from the output of my first created variable in the calculation of my second variable. If I use varName[[3]], I get "subscript out of bounds", and if I use varName$fit, I get "$ operator is invalid for atomic vectors".
It looks like varName in the second part is not referring to the incrementing variables (var1, var2, var3, etc.) that I am creating, but to the character string held in varName itself. To try and get around that, I instead tried
assign(varNF, (data - get(paste("var", i, "[[3]]", sep = "")))^2)
which gave me the error "object 'var1[[3]]' not found". But if I simply type var1[[3]] in my R console, it does exist. I'm not quite sure where to go from here. Any help would be great!
A very useful rule of thumb in R is:
If you find yourself using either assign() or get() in your code, it's a strong indicator that you are approaching the problem with the wrong tools. If you still think you should use those functions, think again. The tools that you are missing are most likely R lists and subsetting of lists.
(and tell everyone that you know about the above)
In your case, I would do something like:
library("rNMF")
[...]
var <- list()
varNF <- list()
for (i in 1:10) {
  res <- rnmf(data, k = i, showprogress = FALSE)
  var[[i]] <- res
  varNF[[i]] <- (data - res$fit)^2
}

How do I load objects to the current environment from a function in R?

Instead of doing
a <- loadBigObject("a")
b <- loadBigObject("b")
I'd like to call a function like
loadBigObjects(list("a","b"))
And be able to access the a and b objects.
It is not clear what loadBigObjects() does or where it will look for a and b. Does it load the objects from a file, or by sourcing code?
There are lots of options in general:
sys.source() allows an R file to be sourced into a given environment
load(), which will load an .Rdata file into a given environment
assign(), in combination with an object created by loadBigObjects() or a call to readRDS(), can also load an object into a given environment
From within your function, you'll want to specify the environment in which to load objects as the global environment, using globalenv(). If you don't do that, the object will only exist in the evaluation frame of the running loadBigObjects(). E.g.
loadBigObjects <- function(objs) {
  # objs: character vector of object names / paths for readRDS()
  lapply(objs, function(x) assign(x, readRDS(x), envir = globalenv()))
}
(As per your comment to @GSee's answer, and assuming list("a", "b") is sufficient information for readRDS() to locate and open the object.)
Without knowing anything about what loadBigObject is or does, you can use lapply to apply a function to a list of objects
lapply(list("a", "b"), loadBigObject)
If you provided the code for loadBigObject or at least describe what it is supposed to do, a better loadBigObjects function could probably be written.
The assign function can be used to define a variable in an environment other than the current one.
loadBigObjects <- function(lst) {
  lapply(lst, function(l) {
    assign(l, loadBigObject(l), envir = globalenv())
  })
  lst
}
(Not that this is necessarily a good idea.)
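A list-based alternative avoids writing to the global environment entirely; this sketch assumes, as above, that each name is enough for readRDS() to locate the file:
# read every object into one named list rather than the global environment
loadBigObjects <- function(nms) {
  setNames(lapply(nms, readRDS), nms)
}
objs <- loadBigObjects(c("a", "b"))
# access the objects as objs$a and objs$b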

Why can't I pass a dataset to a function?

I'm using the package glmulti to fit models to several datasets. Everything works if I fit one dataset at a time.
So for example:
output <- glmulti(y~x1+x2,data=dat,fitfunction=lm)
works just fine.
However, if I create a wrapper function like so:
analyze <- function(dat) {
  out <- glmulti(y ~ x1 + x2, data = dat, fitfunction = lm)
  return(out)
}
then calling it simply doesn't work. The error I get is:
error in evaluating the argument 'data' in selecting a method for function 'glmulti'
Unless there is a data frame named dat in the workspace, it doesn't work. If I use results <- lapply(list_of_datasets, analyze), it doesn't work either.
So what gives? Without the wrapper, I can't lapply a list of datasets through this function. If anyone has thoughts on why this is happening or how I can get around it, that would be great.
Example 2:
dat=list_of_data[[1]]
analyze(dat)
works fine. So in a sense it is ignoring the argument and literally looking for a data frame named dat. It behaves the same no matter what I name the argument.
I guess this is (yet another) problem due to the definition of environments in the parse tree of S4 methods (one of the reasons why I am not a big fan of S4...).
It can be shown by adding quotes around dat:
> analyze <- function(dat)
+ {
+ out<- glmulti(y~x1+x2,data="dat",fitfunction=lm)
+ return (out)
+ }
> analyze(test)
Initialization...
Error in eval(predvars, data, env) : invalid 'envir' argument
You should in the first place send this information to the maintainers of the package, as they know how they deal with the environments internally. They'll have to adapt the functions.
A (very dirty) workaround is to put dat in the global environment and delete it afterwards:
analyze <- function(dat) {
  assign("dat", dat, envir = .GlobalEnv)  # put dat in the global env
  out <- glmulti(y ~ x1 + x2, data = dat, fitfunction = lm)
  remove(dat, envir = .GlobalEnv)  # delete dat again from the global env
  return(out)
}
EDIT:
Just for clarity, this is really about the worst solution possible, but I couldn't manage to find anything better. If somebody else gives you a solution where you don't have to touch your global environment, by all means use that one.
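One untested alternative that at least avoids the global environment, assuming glmulti fails only because it cannot find the data when it re-evaluates the call, is to build the call with do.call() so the actual data frame, rather than the name dat, is embedded in it:
analyze <- function(dat) {
  # do.call() splices the data frame itself into the evaluated call,
  # so glmulti never needs to look up an object named "dat"
  do.call(glmulti, list(y ~ x1 + x2, data = dat, fitfunction = lm))
}
If that does not help, the global-environment trick above remains the fallback.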
