Here is the structure of a function I have created, inside which a parallelization step is implemented.
parallelized.function <- function(...){
  # Function used in the parallelization
  used.in.par <- function(...)
  # Functions needed by used.in.par (auxiliaries)
  aux1 <- function(...)
  aux2 <- function(...)
  #---------------------------------------------------#
  # PARALLELIZATION PROCESS
  suppressMessages(library(foreach))
  suppressMessages(library(doParallel))
  ..................................
  %dopar%{ used.in.par(...) }
  #---------------------------------------------------#
  return(something)
}
The code works, but it requires defining aux1 and aux2 inside parallelized.function (which takes a lot of lines of code).
Is there any way to call the aux1 and aux2 functions instead of writing all their code inside parallelized.function?
I tried creating separate scripts with the aux functions and writing source(".../aux1.R") and source(".../aux2.R") inside parallelized.function, without success.
Thank you,
The foreach package can do that for you. To access a function that was not defined in the workers' environment, use the .export argument; loading the required packages on each of the workers is possible with the .packages option.
foreach(
...,
.export = c('aux1', 'aux2'),
.packages = c(...)
) %dopar% {
...
}
Note that loading the foreach and doParallel packages inside the foreach loop is not required. But you have missed the part where you register the cluster.
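A minimal sketch of how the pieces could fit together (the worker count, the loop range and the .packages list are placeholders; used.in.par, aux1 and aux2 are the names from the question):
suppressMessages(library(foreach))
suppressMessages(library(doParallel))

# Register a parallel backend before using %dopar%
cl <- makeCluster(2)
registerDoParallel(cl)

result <- foreach(
  i = 1:10,
  .export   = c("used.in.par", "aux1", "aux2"),  # functions defined outside the loop
  .packages = c("stats")                         # packages every worker should load
) %dopar% {
  used.in.par(i)
}

stopCluster(cl)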
Many great R packages exist. However, they often use slightly different names for the same behaviour. As I often use different packages, the different names get in my way. Thus, I would like to extend the original package by adding local functions. E.g.
in the package "rethinking" we use the function "extract.samples()" to obtain the samples from the posterior distribution.
in the package "rstanarm" we use the function "as.matrix()" instead.
It would be nice to add a function "extract.samples()" to my local repository and to define that it is called only if the input parameter is an rstanarm object. Thus, I really would like to extend the package: if I load "rethinking", rethinking::extract.samples() is used, and if I load "rstanarm", my extract.samples() for rstanarm objects is used.
What I currently do is the following:
extract.samples = function(object, n=1000, ...){
if ( class(object)[[1]] == 'stanreg' ){
# rstanarm object:
SIMULATIONS = as.matrix(object)
} else if ( attr(class(object), "package") == 'rethinking' ){
SIMULATIONS = rethinking::extract.samples(object, n=n, ...)
}
return(invisible( SIMULATIONS ))
}
Thus, I explicitly have to take care of all the possible objects and parameter settings. This becomes messy if a third package defines the function "extract.samples" or if the two packages use different parameters. I wonder if there is a more robust method.
The proper way to do this is to create your own package that exports a generic function and several methods for it. If you can edit rethinking, then just do that, otherwise create a third package. I'll assume you're doing that.
Here's what the code could look like:
# Generic: dispatches on the class of x
extract.samples <- function(x, ...) {
  UseMethod("extract.samples")
}

# Method for rstanarm fits (objects of class "stanreg")
extract.samples.stanreg <- function(x, ...) {
  as.matrix(x, ...)
}

# Fallback: delegate to the rethinking implementation
extract.samples.default <- function(x, ...) {
  rethinking::extract.samples(x, ...)
}
Here I have a function named f(), for example, inside which I load some packages I need. I also define a function named g() inside f(). But I find that g() can't use the functions defined in those packages when I try to run g in parallel.
f = function()
{
  command 1...
  library(pkg1)...   # there is a function named t(), for example
  library(pkg2)...
  g = function(x)
  {
    t()              # function from pkg1
  }
  library(parallel)
  cl <- makeCluster(core, outfile = "")
  result = parLapply(cl, x, g)   # error: t is not defined
  stopCluster(cl)
}
You have to export the function and load the libraries on the cluster, for example:
cl <<- makeCluster(length(Tasks), type = "PSOCK")
clusterEvalQ(cl,c(library(httr),library(XML),library(magrittr),library(xml2)))
clusterEvalQ(cl, expr) evaluates expr on each node of the cluster, so the libraries listed above get loaded on every worker.
You have to do the same thing with variables (and functions), only using clusterExport(cl, varlist) instead.
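Applied to the structure in the question, a minimal sketch could look like this (pkg1, t(), x and core are the placeholder names from the question):
library(parallel)

f <- function(x, core = 2) {
  g <- function(v) {
    t(v)                                          # t() comes from pkg1, loaded on the workers below
  }

  cl <- makeCluster(core, outfile = "")
  on.exit(stopCluster(cl), add = TRUE)

  clusterEvalQ(cl, library(pkg1))                 # load the needed package on every worker
  clusterExport(cl, "g", envir = environment())   # ship g from f's local environment

  parLapply(cl, x, g)
}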
Let's suppose that I want to apply, in a parallel fashion, myfunction to each row of myDataFrame. Suppose that otherDataFrame is a data frame with two columns, COLUMN1_odf and COLUMN2_odf, used for some reason in myfunction. So I would like to write code using parApply like this:
clus <- makeCluster(4)
clusterExport(clus, list("myfunction","%>%"))
myfunction <- function(fst, snd) {
#otherFunction and aGlobalDataFrame are defined in the global env
otherFunction(aGlobalDataFrame)
# some code to create otherDataFrame **INTERNALLY** to this function
otherDataFrame %>% filter(COLUMN1_odf==fst & COLUMN2_odf==snd)
return(otherDataFrame)
}
do.call(bind_rows, parApply(clus, myDataFrame, 1, function(r) { myfunction(r[1], r[2]) }))
The problem here is that R doesn't recognize COLUMN1_odf and COLUMN2_odf even if I add them to clusterExport. How can I solve this problem? Is there a way to "export" all the objects that snow needs, so that I don't have to enumerate each of them?
EDIT 1: I've added a comment (in the code above) in order to specify that otherDataFrame is created internally to myfunction.
EDIT 2: I've added some pseudo-code in order to generalize myfunction: it now uses a global data frame (aGlobalDataFrame) and another function (otherFunction).
After some experiments, I solved my problem (following Benjamin's suggestion and considering the edit I added to the question) with:
clus <- makeCluster(4)
clusterEvalQ(clus, {library(dplyr); library(magrittr)})
clusterExport(clus, c("myfunction", "otherFunction", "aGlobalDataFrame"))
myfunction <- function(fst, snd) {
#otherFunction and aGlobalDataFrame are defined in the global env
otherFunction(aGlobalDataFrame)
# some code to create otherDataFrame **INTERNALLY** to this function
otherDataFrame %>% dplyr::filter(COLUMN1_odf==fst & COLUMN2_odf==snd)
return(otherDataFrame)
}
do.call(bind_rows, parApply(clus, myDataFrame, 1,
                            function(r) { myfunction(r[1], r[2]) }))
In this way I've exported aGlobalDataFrame, myfunction and otherFunction: in short, all the functions and data used by the function that parallelizes the job (myfunction itself).
Now that I'm not looking at this on my phone, I can see a couple of issues.
First, you are not actually creating otherDataFrame in your function. You are trying to pipe an existing otherDataFrame into filter, and if otherDataFrame doesn't exist in the environment, the function will fail.
Second, unless you have already loaded the dplyr package into your cluster environments, you will be calling the wrong filter function.
Lastly, when you've called parApply, you haven't specified anywhere what fst and snd are supposed to be. Give the following a try:
clus <- makeCluster(4)
clusterEvalQ(clus, {library(dplyr); library(magrittr)})
clusterExport(clus, "myfunction")
myfunction <- function(otherDataFrame, fst, snd) {
dplyr::filter(otherDataFrame, COLUMN1_odf==fst & COLUMN2_odf==snd)
}
do.call(bind_rows, parApply(clus, myDataFrame, 1,
                            function(r, fst, snd) { myfunction(otherDataFrame, r[fst], r[snd]) },
                            "[fst]", "[snd]"))
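Note that the anonymous function above references otherDataFrame; assuming it lives in the master's global environment, it would also have to be exported to the workers, e.g.:
clusterExport(clus, c("myfunction", "otherDataFrame"))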
Is there an R function that lists all the functions in an R script file along with their arguments?
i.e. an output of the form:
func1(var1, var2)
func2(var4, var10)
.
.
.
func10(varA, varB)
Using [sys.]source has the very undesirable side-effect of executing the code inside the file. At worst this has security implications, but even “benign” code may simply have unintended side-effects when executed. At best it just takes unnecessary time (and potentially a lot of it).
It’s actually unnecessary to execute the code, though: it is enough to parse it, and then do some syntactical analysis.
The actual code is trivial:
file_parsed = parse(filename)
functions = Filter(is_function, file_parsed)
function_names = unlist(Map(function_name, functions))
And there you go, function_names contains a vector of function names. Extending this to also list the function arguments is left as an exercise to the reader. Hint: there are two approaches. One is to eval the function definition (now that we know it’s a function definition, this is safe); the other is to “cheat” and just get the list of arguments to the function call.
The implementation of the functions used above is also not particularly hard. There’s probably even something already in R core packages (‘utils’ has a lot of stuff) but since I’m not very familiar with this, I’ve just written them myself:
is_function = function (expr) {
if (! is_assign(expr)) return(FALSE)
value = expr[[3L]]
is.call(value) && as.character(value[[1L]]) == 'function'
}
function_name = function (expr) {
as.character(expr[[2L]])
}
is_assign = function (expr) {
is.call(expr) && as.character(expr[[1L]]) %in% c('=', '<-', 'assign')
}
This correctly recognises function declarations of the forms
f = function (…) …
f <- function (…) …
assign('f', function (…) …)
It won’t work for more complex code, since assignments can be arbitrarily complex and in general are only resolvable by actually executing the code. However, the three forms above probably account for ≫ 99% of all named function definitions in practice.
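For completeness, here is a sketch of the “cheat” approach to the exercise above: instead of evaluating the definition, read the argument names straight off the parsed function call (this only covers the three assignment forms listed above, and default values are ignored):
function_args = function (expr) {
  # For `f <- function (a, b) …`, expr[[3L]] is the call to `function`;
  # its second element is the pairlist of formal arguments.
  fun_def = expr[[3L]]
  names(fun_def[[2L]])
}

function_signatures = vapply(
  functions,
  function (expr) {
    sprintf('%s(%s)', function_name(expr),
            paste(function_args(expr), collapse = ', '))
  },
  character(1L)
)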
UPDATE: Please refer to the answer by @Konrad Rudolph instead.
You can create a new environment, source your file in that environment and then list the functions in it using lsf.str() e.g.
test.env <- new.env()
sys.source("myfile.R", envir = test.env)
lsf.str(envir=test.env)
rm(test.env)
or if you want to wrap it as a function:
listFunctions <- function(filename) {
temp.env <- new.env()
sys.source(filename, envir = temp.env)
functions <- lsf.str(envir=temp.env)
rm(temp.env)
return(functions)
}
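Usage would then look something like this (the file name and the printed output are only illustrative):
listFunctions("myfile.R")
# func1 : function (var1, var2)
# func2 : function (var4, var10)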
The language is R.
I have a couple of files:
utilities.non.foo.R
utilities.foo.R
utilities.R
foo is an in-house package that has been cobbled together (for image processing, although this is irrelevant). It works great, but only on Linux machines, and it is a huge pain to try and compile it even on those.
Basically, utilities.foo.R contains a whole lot of functions that require package foo.
The functions in here are called functionname.foo.
I'm about to start sharing this code with external collaborators who don't have this package or Linux, so I've written a file utilities.non.foo.R, which contains all the functions in utilities.foo.R, except the dependency on package foo has been removed.
These functions are all called functionname.non.foo.
The file utilities.R contains a whole heap of blocks like this, one for each function:
functionname <- function(...) {
if ( fooIsLoaded() ) {
functionname.foo(...)
} else {
functionname.non.foo(...)
}
}
The idea is that one only needs to load utilities.R and if you happen to have package foo (e.g. my internal collaborators), you will use that backend. If you don't have foo (external collaborators), you'll use the non-foo backend.
My question is: is there some way to do the redirection for each function name without explicitly writing the above bit of code for every single function name?
This reminds me of how (e.g.) there is a print generic with methods print.table, print.data.frame, etc., but the user only needs to call print, and the appropriate method is chosen automatically.
I'd like to have that, except the method.class would be more like method.depends_on_which_package_is_loaded.
Is there any way to avoid writing a redirection function per function in my utilities.R file?
As Dirk says, just use a package. In this case, put all your new *.non.foo functions in a new package, which is also called foo. Distribute this foo to your collaborators, instead of your in-house version. That way your utilities code can just be
functionname <- function(...) functionname.foo(...)
without having to make any checks at all.
Here is an idea: write a function that sets f to either f.foo or f.non.foo. It could be called in a loop, over all functions in a given namespace (or all functions whose name ends in .foo).
dispatch <- function(s) {
if ( fooIsLoaded() ) {
f <- get( paste(s, "foo", sep=".") )
} else {
f <- get( paste(s, "non.foo", sep=".") )
}
assign( s, f, envir=.GlobalEnv ) # You may want to use a namespace
}
f.foo <- function() cat("foo\n")
f.non.foo <- function() cat("non-foo\n")
fooIsLoaded <- function() TRUE
dispatch("f")
f()
fooIsLoaded <- function() FALSE
dispatch("f")
f()
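As mentioned above, dispatch() can then be called in a loop over every function whose name ends in .foo (a sketch, assuming those functions live in the global environment):
for (nm in ls(.GlobalEnv)) {
  if (grepl("\\.foo$", nm) && !grepl("\\.non\\.foo$", nm)) {
    dispatch(sub("\\.foo$", "", nm))
  }
}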
A simpler solution would be to give the same name
to both functions, but put them in different namespaces/packages.
This sounds quite inefficient and inelegant, but how about
funify = function(f, g, package = "ggplot2") {
  if (paste("package:", package, sep = "") %in% search()) {
    f
  } else {
    message("how do you intend to work without ", package)
    g
  }
}
detach(package:ggplot2)
foo = funify(paste, function(x) letters[x])
foo(1:10)
library(ggplot2)
foo = funify(paste, function(x) letters[x])
foo(1:10)