R optimParallel - could not find function

I'm trying to understand how optimParallel deals with embedded functions:
fn1 <- function(x){x^2-x+1}
fn2 <- function(x){fn1(x)} # effectively fn1
cl <- parallel::makeCluster(parallel::detectCores()-1)
parallel::setDefaultCluster(cl = cl)
optimParallel::optimParallel(par = 0, fn = fn1) # Worked
optimParallel::optimParallel(par = 0, fn = fn2) # Not working
parallel::setDefaultCluster(cl=NULL)
parallel::stopCluster(cl)
Why does the second call not work? The error message is:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: could not find function "fn1"
How can I fix it?

This problem is specific to the cluster type in use.
If a FORK cluster is created after the function definitions, it works, because the forked workers inherit the master's workspace.
FORK clusters are available on Unix-like systems (Linux, macOS) only:
library(optimParallel)
fn1 <- function(x) x^2-x+1
fn2 <- function(x) fn1(x)
cl <- makeCluster(detectCores()-1, type="FORK")
setDefaultCluster(cl=cl)
optimParallel(par=0, fn=fn2)[[1]]
## [1] 0.5
For other cluster types, one can make fn1 an argument of fn2 and pass it through the ... argument of optimParallel():
fn1 <- function(x) x^2-x+1
fn2 <- function(x, fn1) fn1(x)
cl <- makeCluster(detectCores()-1)
setDefaultCluster(cl=cl)
optimParallel(par=0, fn=fn2, fn1=fn1)[[1]]
## [1] 0.5
Alternatively, one can export fn1 to all R processes in the cluster:
fn1 <- function(x) x^2-x+1
fn2 <- function(x) fn1(x)
cl <- makeCluster(detectCores()-1)
setDefaultCluster(cl=cl)
clusterExport(cl, "fn1")
optimParallel(par=0, fn=fn2)[[1]]
## [1] 0.5

I believe the problem is that your function fn1 cannot be found on the workers. I have not tested this, but I believe something along these lines should work:
fn1 <- function(x){x^2-x+1}
fn2 <- function(x){fn1(x)} # effectively fn1
cl <- parallel::makeCluster(parallel::detectCores()-1)
parallel::setDefaultCluster(cl = cl)
# Export the function to the cluster such that it exists in the global environment
# of the worker
parallel::clusterExport(cl, "fn1")
optimParallel::optimParallel(par = 0, fn = fn2) # should now find fn1 on the workers
parallel::setDefaultCluster(cl=NULL)
parallel::stopCluster(cl)
The key is that your function fn1 does not exist in the global environment of the worker.
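To check this on your own machine, a minimal sketch along these lines shows that fn1 is missing on the workers until clusterExport() has run. It uses only the parallel package; the 2-worker cluster and the envir = environment() argument are my assumptions (the latter only makes the snippet work when it is sourced into a non-global environment).

```r
library(parallel)

fn1 <- function(x) x^2 - x + 1

cl <- makeCluster(2)  # PSOCK cluster: each worker is a fresh R session

# Ask every worker whether it knows an object called fn1.
before <- unlist(clusterEvalQ(cl, exists("fn1")))

# Copy fn1 from this session into the global environment of each worker.
clusterExport(cl, "fn1", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("fn1")))

stopCluster(cl)

print(before)  # one logical per worker, before the export
print(after)   # one logical per worker, after the export
```

The same check is handy for any other helper that the objective function calls.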

Related

How to Write R Package Documentation for a Function with Parallel Backend

I want to write this function as an R package
Edit
#' create suns package
#'
#' More detailed Description
#'
#' @describeIn This sums helps to
#'
#' @importFrom foreach foreach
#'
#' @importFrom doParallel registerDoParallel
#'
#' @param x Numeric Vector
#'
#' @importFrom doParallel `%dopar%`
#'
#' @importFrom parallel parallel
#'
#' @export
sums <- function(x){
plan(multisession)
n_cores <- detectCores()# check for howmany cores present in the Operating System
cl <- parallel::makeCluster(n_cores)# use all the cores pdectected
doParallel::registerDoParallel(cores = detectCores())
ss <- function(x){
`%dopar%` <- foreach::`%dopar%`
foreach::foreach(i = x, .combine = "+") %dopar% {i}
}
sss <- function(x){
`%dopar%` <- foreach::`%dopar%`
foreach::foreach(i = x, .combine = "+") %dopar% {i^2}
}
ssq <- function(x){
`%dopar%` <- foreach::`%dopar%`
foreach::foreach(i = x, .combine = "+") %dopar% {i^3}
}
sums <- function(x, methods = c("sum", "squaredsum", "cubedsum")){
output <- c()
if("sum" %in% methods){
output <- c(output, ss = ss(x))
}
if("squaredsum" %in% methods){
output <- c(output, sss = sss(x))
}
if("cubedsum" %in% methods){
output <- c(output, ssq = ssq(x))
}
return(output)
}
parallel::stopCluster(cl = cl)
x <- 1:10
sums(x)
What I Need
Assume my vector x is so large that serial processing would take about 5 hours to complete the task, e.g. x <- 1:9e9, so that parallel processing can help.
How do I include:
n_cores <- detectCores()
#cl <- makeCluster(n_cores)
#registerDoParallel(cores = detectCores())
in my .R file and DESCRIPTION file such that it will be worthy of R package documentation?
Even though the scope of the question is not entirely clear, I'll try to make relevant suggestions. I understand that you have problems running check on your package with examples/tests that use parallel computation.
First of all, remember that check applies CRAN standards, and CRAN does not allow examples or tests to use more than 2 cores. So your examples must be simple enough to run on 2 cores.
Then there is a problem in your code: you create a cluster but never pass it to doParallel::registerDoParallel().
Next, your code uses the parallel and doParallel packages, so both must be declared in the DESCRIPTION file. You can do this by running in your console:
usethis::use_package("parallel")
usethis::use_package("doParallel")
This will add both packages to the "Imports" section of your DESCRIPTION. You then won't load these libraries explicitly in your package.
You should also qualify the functions in your example with "::" after the name of the relevant package, which would make your example look like:
n_cores <- 2
cl <- parallel::makeCluster(n_cores)
doParallel::registerDoParallel(cl = cl)
...
parallel::stopCluster(cl = cl)
You can also refer to the registerDoParallel documentation for a similar piece of code; you will find that it is also limited to 2 cores.
To be complete, I do not think you really need the foreach package, since R's built-in parallelization is very powerful. If you want to be able to use your function with detectCores, I would suggest adding a limiting parameter. This function should do what you want in a more "R-like" manner:
sums <- function(x, methods, maxcores) {
n_cores <- min(maxcores,
parallel::detectCores()) # cap the number of detected cores at maxcores
cl <- parallel::makeCluster(n_cores) # start one worker per allowed core
outputs <- sapply(
X = methods,
FUN = function(method) {
if ("sum" == method) {
output <- parallel::parSapply(
cl = cl,
X = x,
FUN = function(i)
i
)
}
if ("squaredsum" == method) {
output <-
parallel::parSapply(
cl = cl,
X = x,
FUN = function(i)
i ** 2
)
}
if ("cubedsum" == method) {
output <-
parallel::parSapply(
cl = cl,
X = x,
FUN = function(i)
i ** 3
)
}
return(sum(output))
}
)
parallel::stopCluster(cl = cl)
return(outputs)
}
x <- 1:10000000
sums(x = x, c("sum", "squaredsum"), 2)
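As a variation on the function above, here is a minimal sketch that splits x into one chunk per worker, so each worker sums its own chunk instead of handling elements one by one. The names sum_powers() and chunk_power_sum() are hypothetical and not part of any package; on.exit() is used so the cluster is stopped even if an error occurs.

```r
# Hypothetical worker function: sum of one chunk raised to a power.
chunk_power_sum <- function(values, power) sum(values^power)

# Hypothetical wrapper: never uses more than maxcores workers
# (default 2, which is what CRAN checks allow).
sum_powers <- function(x, power = 1, maxcores = 2) {
  n_cores <- max(1L, min(maxcores, parallel::detectCores(), na.rm = TRUE))
  cl <- parallel::makeCluster(n_cores)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # always clean up

  # One chunk of x per worker.
  idx <- parallel::splitIndices(length(x), n_cores)
  chunks <- lapply(idx, function(i) x[i])

  # Each worker returns a partial sum; combine them on the master.
  partial <- parallel::parSapply(cl, chunks, chunk_power_sum, power = power)
  sum(partial)
}

sum_powers(1:10, power = 2)
```

Because only one value per chunk travels back to the master, this should scale better for long vectors than returning every transformed element.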

R - How to pass an expression to a cluster (in the parallel package)

I am trying to create a parallelized version of replicate on top of the parallel package. An issue I'm running into is that it keeps evaluating my expressions before handing them to the replicate function. Reproducible code:
par_replicate <- function(cl, n, expr){
parallel::clusterCall(
cl = cl,
function() replicate(n , expr)
)
}
cl <- parallel::makeCluster(2)
par_replicate(cl, 3, rnorm(1))
parallel::stopCluster(cl)
[[1]]
[1] -1.312669 -1.312669 -1.312669
[[2]]
[1] 0.5598533 0.5598533 0.5598533
As you can see, the expression is evaluated within the cluster before it's given to the replicate function, so replicate just returns multiple copies of the same number. I am at a complete loss as to how to solve this, so any help would be appreciated.
In case anyone is interested in the answer, I managed to fix this using the match.call() and eval() functions:
par_replicate <- function(cl, n, expr){
x <- match.call()
parallel::clusterCall(
cl = cl,
function() replicate(n , eval(x$expr))
)
}
cl <- parallel::makeCluster(2)
par_replicate(cl, 3, {rnorm(1)})
parallel::stopCluster(cl)
Not sure if this is the best solution but it seems to work for me :)
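A possible alternative, sketched here under the assumption that capturing the unevaluated expression is all that is needed, is to use substitute() and pass the captured call to the workers as an ordinary argument. This is a variation on the answer above, not a tested replacement for it.

```r
par_replicate <- function(cl, n, expr) {
  # Capture the expression as an unevaluated call, e.g. rnorm(1).
  expr <- substitute(expr)
  # Ship n and the call to every worker; eval() runs it afresh
  # on each of the n replications.
  parallel::clusterCall(
    cl,
    function(n, expr) replicate(n, eval(expr)),
    n, expr
  )
}

cl <- parallel::makeCluster(2)
res <- par_replicate(cl, 3, rnorm(1))
parallel::stopCluster(cl)

print(res)  # one vector of 3 draws per worker
```

Each worker should now return three different draws rather than three copies of one value.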

Using different simulation functions in each core with parallel::clusterApply() in R

I want to use the parallel package for my study. However, each simulation has different parameters. I tried the code below, but it did not work.
require(snow)
library(parallel)
tasks = list(
job1 = function(t, n) sim(t=5, n=30),
job2 = function(t, n) sim(t=5, n=50)
)
cl = makeCluster( length(tasks) )
clusterExport(cl, ls())
out = clusterApply(cl, tasks, function(f) f(t, n))
Any help and suggestions will be appreciated
You could use parallel::parLapply() for this:
library(parallel)
cl <- makeCluster(2)
## simulation function
sim <- function(t, n){
## do things
paste0("input args are: t=", t, ", n=", n)
}
clusterExport(cl, "sim")
## argument configurations
args <- list(config1=list(t=10, n=1),
config2=list(t=20, n=2))
## run the function for all argument configurations in parallel
parLapply(cl=cl, X=args, fun=function(x) do.call(sim, x))
$config1
[1] "input args are: t=10, n=1"
$config2
[1] "input args are: t=20, n=2"
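Another option, shown here only as a sketch, is parallel::clusterMap(), which takes one vector per argument and calls the function once per position. Since sim is passed directly as the function argument, it does not need to be exported. The sim() body is the same placeholder as above.

```r
library(parallel)

## simulation function (placeholder body)
sim <- function(t, n) {
  paste0("input args are: t=", t, ", n=", n)
}

cl <- makeCluster(2)

## i-th call uses the i-th element of each argument vector:
## sim(t = 10, n = 1) and sim(t = 20, n = 2)
out <- clusterMap(cl, sim, t = c(10, 20), n = c(1, 2))

stopCluster(cl)
print(out)
```

This keeps the parameter grid as plain vectors, which can be easier to read than a list of lists when every run uses the same argument names.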

R - parallelisation error, checkCluster(cl) - not a valid cluster

This code brings me an error: Error in checkCluster(cl): not a valid cluster
library(parallel)
numWorkers <-8
cl <-makeCluster(numWorkers, type="PSOCK")
res.mat <- parLapply(1:10, function(x) my.fun(x))
stopCluster(cl)
Without parallelisation attempts this works totally fine:
res.mat <- lapply(1:10, function(x) my.fun(x))
And this example works very well too:
workerFunc <- function(n){return(n^2)}
library(parallel)
numWorkers <-8
cl <-makeCluster(numWorkers, type ="PSOCK")
res <- parLapply(cl, 1:100, workerFunc)
stopCluster(cl)
print(unlist(res))
How can I solve my problem?
I found, for example:
class(cl)
[1] "SOCKcluster" "cluster"
and cl is:
cl
socket cluster with 8 nodes on host ‘localhost’
library(parallel)
numWorkers <- 8
cl <-makeCluster(numWorkers, type="PSOCK")
res.mat <- parLapply(cl,1:10, function(x) my.fun(x))
stopCluster(cl)
Just to be excessively specific, the problem with
res.mat <- parLapply(1:10, function(x) my.fun(x))
is not necessarily the order of the arguments, but that the argument cl is not specified.
cl <-makeCluster(numWorkers, type ="PSOCK")
res.mat <- parLapply(X = 1:10, # the argument is named X (upper case)
fun = function(x) my.fun(x),
cl = cl
)
should work, because all required arguments are specified. Alternatively, ?parLapply indicates that parLapply uses the default cluster if cl is not specified. A default cluster can be set using parallel::setDefaultCluster(), which then allows parLapply to revert to default behaviour when cl is not included in the user input.
cl <-makeCluster(numWorkers, type ="PSOCK")
parallel::setDefaultCluster(cl)
res.mat <- parallel::parLapply(X = 1:10, # cl = NULL by default, so the default cluster is used
fun = function(x) my.fun(x)
)
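A self-contained sketch of both variants follows. The function my_fun is a hypothetical stand-in for my.fun, and the cluster has 2 workers instead of 8 so it runs on small machines. The function is passed directly as fun, so nothing has to be exported.

```r
library(parallel)

my_fun <- function(x) x^2  # hypothetical stand-in for my.fun

cl <- makeCluster(2, type = "PSOCK")

# Variant 1: pass the cluster explicitly as the first argument.
res_explicit <- parLapply(cl, 1:10, my_fun)

# Variant 2: register a default cluster and leave cl out.
setDefaultCluster(cl)
res_default <- parLapply(X = 1:10, fun = my_fun)

# Unregister the default cluster before stopping it.
setDefaultCluster(NULL)
stopCluster(cl)

print(identical(res_explicit, res_default))
```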

Can't find variable when parallel

When I run this snippet of R code, I have a problem with the parallel part:
# include library
require(stats)
library(GMD)
library(parallel)
# include function
source('~/Workspaces/Projects/RProject/MovielensCluster/readData.R'); # contain readtext.convert() function
###
elbow.k <- function(mydata){
## determine a "good" k using elbow
dist.obj <- dist(mydata);
hclust.obj <- hclust(dist.obj);
css.obj <- css.hclust(dist.obj,hclust.obj);
elbow.obj <- elbow.batch(css.obj);
# print(elbow.obj)
k <- elbow.obj$k
return(k)
}
# include file
filePath <- "dataset/u.user";
data.convert <- readtext.convert(filePath);
data.clustering <- data.convert[,c(-1,-4)];
# find k value
no_cores <- detectCores();
cl<-makeCluster(no_cores);
clusterExport(cl, list("data.clustering", "data.original", "elbow.k", "clustering.kmeans"));
start.time <- Sys.time();
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));
end.time <- Sys.time();
cat('Time to find k using Elbow method is',(end.time - start.time),'seconds with k value:', k.clusters);
I get the following errors:
Error in get(name, envir = envir) : object 'data.original' not found
Error in checkForRemoteErrors(val) :
one node produced an error: could not find function "elbow.k"
Can anyone help me fix it? Thanks a lot.
I think your problem relates to variable scope. On Mac/Linux you have the option of using makeCluster(no_cores, type="FORK"), whose workers automatically inherit the whole environment of the master process. On Windows you have to use a Parallel Socket Cluster (PSOCK), whose workers start out with only the base packages loaded. Thus, you must always specify exactly which variables and libraries the workers need for the parallel function to work. clusterExport() and clusterEvalQ() make the needed variables and packages, respectively, visible to the function. Note that any changes made to a variable after clusterExport() are ignored. Coming back to your problem, you must add the following:
clusterEvalQ(cl, library(GMD));
and your full code:
# include library
require(stats)
library(GMD)
library(parallel)
# include function
source('~/Workspaces/Projects/RProject/MovielensCluster/readData.R'); # contain readtext.convert() function
###
elbow.k <- function(mydata){
## determine a "good" k using elbow
dist.obj <- dist(mydata);
hclust.obj <- hclust(dist.obj);
css.obj <- css.hclust(dist.obj,hclust.obj);
elbow.obj <- elbow.batch(css.obj);
# print(elbow.obj)
k <- elbow.obj$k
return(k)
}
# include file
filePath <- "dataset/u.user";
data.convert <- readtext.convert(filePath);
data.clustering <- data.convert[,c(-1,-4)];
# find k value
no_cores <- detectCores();
cl<-makeCluster(no_cores);
clusterEvalQ(cl, library(GMD));
# data.original is dropped here: it is not defined in this session, which is what triggers the get() error
clusterExport(cl, c("data.clustering", "elbow.k", "clustering.kmeans"));
start.time <- Sys.time();
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));
end.time <- Sys.time();
cat('Time to find k using Elbow method is',(end.time - start.time),'seconds with k value:', k.clusters);
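To see how the two error messages in the question are connected, here is a small sketch with made-up objects (toy_data, elbow_stub and no_such_object are hypothetical names, and splines only serves as an example of a package that ships with R). clusterExport() looks up every name on the master first, so an undefined name aborts the export, and names listed after it never reach the workers.

```r
library(parallel)

cl <- makeCluster(2)

toy_data   <- data.frame(a = 1:4, b = 5:8)  # stand-in for data.clustering
elbow_stub <- function(d) nrow(d)           # stand-in for elbow.k

# An undefined name makes clusterExport() fail on the master side.
export_error <- tryCatch({
  clusterExport(cl, c("toy_data", "no_such_object", "elbow_stub"),
                envir = environment())
  NA_character_
}, error = function(e) conditionMessage(e))

# The name listed after the undefined one was never exported.
stub_before <- unlist(clusterEvalQ(cl, exists("elbow_stub")))

# Export only objects that exist, and attach packages on every worker.
clusterExport(cl, c("toy_data", "elbow_stub"), envir = environment())
clusterEvalQ(cl, library(splines))
stub_after <- unlist(clusterEvalQ(cl, exists("elbow_stub")))
pkg_loaded <- unlist(clusterEvalQ(cl, "package:splines" %in% search()))

res <- parSapply(cl, 1, function(i) elbow_stub(toy_data))

stopCluster(cl)
print(export_error)
print(res)
```

In the question, the same pattern would explain why elbow.k is reported as missing right after the data.original error.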
