When I tried this snippet of R code. I have problem in parallel
# include library
require(stats)
library(GMD)
library(parallel)
# include function
source('~/Workspaces/Projects/RProject/MovielensCluster/readData.R'); # contain readtext.convert() function
###
elbow.k <- function(mydata){
## determine a "good" k using elbow
dist.obj <- dist(mydata);
hclust.obj <- hclust(dist.obj);
css.obj <- css.hclust(dist.obj,hclust.obj);
elbow.obj <- elbow.batch(css.obj);
# print(elbow.obj)
k <- elbow.obj$k
return(k)
}
# include file
filePath <- "dataset/u.user";
data.convert <- readtext.convert(filePath);
data.clustering <- data.convert[,c(-1,-4)];
# find k value
no_cores <- detectCores();
cl<-makeCluster(no_cores);
clusterExport(cl, list("data.clustering", "data.original", "elbow.k", "clustering.kmeans"));
start.time <- Sys.time();
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));
end.time <- Sys.time();
cat('Time to find k using Elbow method is',(end.time - start.time),'seconds with k value:', k.clusters);
I has an error notification:
Error in get(name, envir = envir) : object 'data.original' not found
Error in checkForRemoteErrors(val) :
one node produced an error: could not find function "elbow.k"
Can anyone help me to fix it ? Thanks a lot.
I think your problem relate to "variable scope". On Mac/Linux you have the option of using makeCluster(no_core, type="FORK") that automatically contains all environment variables. On Windows you have to use the Parallel Socket Cluster (PSOCK) that starts out with only the base packages loaded. Thus, you always specifiy exactly what variables as well as library that you include for parallel function to work. clusterExport() and clusterEvalQ() are necessary so as to the function to see the needed variables and packages respectively. Note that any changes to the variable after clusterExport are ignored. Comeback to your problem. You must use as following:
clusterEvalQ(cl, library(GMD));
and your full code:
# include library
require(stats)
library(GMD)
library(parallel)
# include function
source('~/Workspaces/Projects/RProject/MovielensCluster/readData.R'); # contain readtext.convert() function
###
elbow.k <- function(mydata){
## determine a "good" k using elbow
dist.obj <- dist(mydata);
hclust.obj <- hclust(dist.obj);
css.obj <- css.hclust(dist.obj,hclust.obj);
elbow.obj <- elbow.batch(css.obj);
# print(elbow.obj)
k <- elbow.obj$k
return(k)
}
# include file
filePath <- "dataset/u.user";
data.convert <- readtext.convert(filePath);
data.clustering <- data.convert[,c(-1,-4)];
# find k value
no_cores <- detectCores();
cl<-makeCluster(no_cores);
clusterEvalQ(cl, library(GMD));
clusterExport(cl, list("data.clustering", "data.original", "elbow.k", "clustering.kmeans"));
start.time <- Sys.time();
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));
end.time <- Sys.time();
cat('Time to find k using Elbow method is',(end.time - start.time),'seconds with k value:', k.clusters);
Related
I'm working with raster package and I try to switch to terra but for some reasons that I don't understand, terra cannot reproduce the same operation of raster when working in parallel with packages such snowfall and future.apply. Here is a reproducible example.
library(terra)
r <- rast()
r[] <- 1:ncell(r)
m <- rast()
m[] <- c(rep(1,ncell(m)/5),rep(2,ncell(m)/5),rep(3,ncell(m)/5),rep(4,ncell(m)/5),rep(5,ncell(m)/5))
ms <- separate(m,other=NA)
plot(ms)
mymask <- function(ind){
tipo <- tipo_tav[ind]
mask <- ms[[ind]]
masked <-
terra::mask(
r,
mask
)
richard <- function(x){
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
return(y)
}
pred <- richard(masked)
pred <- clamp(pred,lower=0)
return(pred)
}
#the sequential usage works fine, faster than the `raster` counterpart
system.time(x <- mymask(1))#0.03
#when I try to run my function in parallel I receive an error
plan(multisession,workers=5)
system.time(pred_list <- future_lapply(1:5, FUN = mymask))
Error in .External(list(name = "CppMethod__invoke_notvoid", address = <pointer: (nil)>, :
NULL value as symbol address.
the exactly same code works well if I change rast with raster and terra::mask with raster::mask. See below:
library(raster)
r <- raster(r)
ms <- stack(ms)
mymask <- function(ind){
tipo <- tipo_tav[ind]
mask <- ms[[ind]]
masked <-
raster::mask(
r,
mask
)
richard <- function(x){
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
return(y)
}
pred <- richard(masked)
pred <- clamp(pred,lower=0)
return(pred)
}
#this works fine
system.time(x <- mymask(1))#0.06
#this works too
plan(multisession,workers=5)
system.time(pred_list <- future_lapply(1:5, FUN = mymask))#15.48
The same behavior if I use snowfall instead of future
library(snowfall)
sfInit(parallel = TRUE, cpus =5)
sfLibrary(terra)
sfExportAll()
system.time(pred_list <- sfLapply(1:5, fun = mymask))
sfStop()
this return the same error of future_lapply
Why is this happening? I've never seen such an error. I was hoping to take advantage of the higher speed of terra but so I'm stuck.
A SpatRaster cannot be serialized, you cannot send it to parallel compute nodes. Have a look here for more discussion.
Instead you can (a) send and receive filenames; (b) parallelize your custom function that you supply to app or lapp; (c) use the cores=n argument (where available, e.g. app and predict); (d) use a mechanism like wrap; (e) send a filename and a vector to make a SpatExtent to process and create a virtual raster from the output tiles (see ?vrt).
For example, you could do use a function like this (Option "a")
prich <- function(filein, fileout) {
r <- rast(filein)
richard <- function(x) {
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
y[y<0] <- 0
return(y)
}
x <- app(masked, richard, filename=fileout, overwrite=TRUE)
return(TRUE)
}
I use app because it is much more efficient for large rasters --- as it could avoid writing temp files for each of the 10 arithmetic operations with a SpatRaster. Given that you want to parallelize this relatively simple function, I assume the files are very large.
Or option "c":
richard <- function(x) {
k <-0.2
v <-0.3
a <-200
y0 <-2
y <- k/v*x*(1-((x/a)^v))+y0
y[y<0] <- 0
return(y)
}
x <- app(masked, richard, cores=12)
In neither case I included the masking. You could include it in option "a" but mask is disk I/O intensive, not computationally intensive, so it might be as efficient to do it in one step rather than in parallel.
With wrap you could do something like this
f <- function(w) {
x <- rast(w)
y <- richard(x)
wrap(y)
}
r <- rast(nrow=10, ncol=10, vals=1:100)
x <- f(wrap(r))
x <- rast(x)
Where f would be run in parallel. That only works for small rasters, but you could parallelize over tiles, and you can create tiles with terra::makeTiles.
More internal parallelization options will be coming, but don't hold your breath.
I'm trying to understand how optimParallel deals with embedded functions:
fn1 <- function(x){x^2-x+1}
fn2 <- function(x){fn1(x)} # effectively fn1
cl <- parallel::makeCluster(parallel::detectCores()-1)
parallel::setDefaultCluster(cl = cl)
optimParallel::optimParallel(par = 0, fn = fn1) # Worked
optimParallel::optimParallel(par = 0, fn = fn2) # Not working
parallel::setDefaultCluster(cl=NULL)
parallel::stopCluster(cl)
Why the 2nd not working? the error message is
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: could not find function "fn1"
How to fix it?
This problem is specific to the used cluster type:
If a FORK cluster is created after the function definitions, it works.
FORK cluster are available on Linux like systems only:
library(optimParallel)
fn1 <- function(x) x^2-x+1
fn2 <- function(x) fn1(x)
cl <- makeCluster(detectCores()-1, type="FORK")
setDefaultCluster(cl=cl)
optimParallel(par=0, fn=fn2)[[1]]
## [1] 0.5
For other cluster types one can add fn1 as an argument to fn2 and add it as ... argument to optimParallel():
fn1 <- function(x) x^2-x+1
fn2 <- function(x, fn1) fn1(x)
cl <- makeCluster(detectCores()-1)
setDefaultCluster(cl=cl)
optimParallel(par=0, fn=fn2, fn1=fn1)[[1]]
## [1] 0.5
Alternatively, one can export fn1 to all R processes in the cluster:
fn1 <- function(x) x^2-x+1
fn2 <- function(x) fn1(x)
cl <- makeCluster(detectCores()-1)
setDefaultCluster(cl=cl)
clusterExport(cl, "fn1")
optimParallel(par=0, fn=fn2)[[1]]
## [1] 0.5
I believe the problem is that your function fn1 cannot be found on the cores. There is no reproducible example here for me to test, but I believe something along the these lines should work:
fn1 <- function(x){x^2-x+1}
fn2 <- function(x){fn1(x)} # effectively fn1
cl <- parallel::makeCluster(parallel::detectCores()-1)
parallel::setDefaultCluster(cl = cl)
# Export the function to the cluster such that it exists in the global environment
# of the worker
parallel::clusterExport(cl, "fn1")
optimParallel::optimParallel(par = 0, fn = fn2) # Not working
parallel::setDefaultCluster(cl=NULL)
parallel::stopCluster(cl)
The key is that your function fn1 does not exist in the global environment of the worker.
I'd like to know whether the cpv function within the trotter package works with %dopar%? I'm getting the following error:
task 1 failed - "object of type 'S4' is not subsettable"
Here's a small example:
library(doParallel)
library(trotter)
registerDoParallel(cores = 2)
x <- letters
combos <- cpv(2, 1:4)
print(combos)
num_combos <- length(combos)
results_list <- foreach(combo_num=1:num_combos) %dopar% { # many iterations
y <- x[combos[combo_num]]
# time consuming stuff follows that involves using y
}
Replacing %dopar% with %do% (or simply using a for loop) and it works fine.
Depending on the cluster type one needs to explicitly specify the used packages via the .packages argument. The following should work:
library(doParallel)
library(trotter)
cl <- makePSOCKcluster(2)
registerDoParallel(cl=cl)
x <- letters
combos <- cpv(2, 1:4)
num_combos <- length(combos)
rl <- foreach(combo_num=1:num_combos, .packages="trotter") %dopar% {
x[combos[combo_num]]
}
This code brings me an error: Error in checkCluster(cl): not a valid cluster
library(parallel)
numWorkers <-8
cl <-makeCluster(numWorkers, type="PSOCK")
res.mat <- parLapply(1:10, function(x) my.fun(x))
stopCluster(cl)
Without parallelisation attempts this works totally fine:
res.mat <- lapply(1:10, function(x) my.fun(x))
And this example works very well too:
workerFunc <- function(n){return(n^2)}
library(parallel)
numWorkers <-8
cl <-makeCluster(numWorkers, type ="PSOCK")
res <- parLapply(cl, 1:100, workerFunc)
stopCluster(cl)
print(unlist(res))
How can i solve my problem?
I found for example
class(cl)
[1] "SOCKcluster" "cluster"
an cl is:
cl
socket cluster with 8 nodes on host ‘localhost’
library(parallel)
numWorkers <- 8
cl <-makeCluster(numWorkers, type="PSOCK")
res.mat <- parLapply(cl,1:10, function(x) my.fun(x))
stopCluster(cl)
Just to be excessively specific, the problem with
res.mat <- parLapply(1:10, function(x) my.fun(x))
is not necessarily the order of the arguments, but that the argument cl is not specified.
cl <-makeCluster(numWorkers, type ="PSOCK")
res.mat <- parLapply(x = 1:10,
fun = function(x) my.fun(x),
cl = cl
)
should work, because all required arguments are specified. Alternatively, ?parLapply indicates that parLapply uses the default cluster if cl is not specified. A default cluster can be set using parallel::setDefaultCluster(), which then allows parLapply to revert to default behaviour when cl is not included in the user input.
cl <-makeCluster(numWorkers, type ="PSOCK")
parallel::setDefaultCluster(cl)
res.mat <- parallel::parLapply(x = 1:10,#by default cl = NULL if not specified
fun = function(x) my.fun(x),
)
I would like to do bootstrap of residuals for nls fits in a loop. I use nlsBoot and in order to decrease computation time I would like to do that in parallel (on a Windows 7 system at the moment). Here is some code, which reproduces my problem:
#function for fitting
Falge2000 <- function(GP2000,alpha,PAR) {
(GP2000*alpha*PAR)/(GP2000+alpha*PAR-GP2000/2000*PAR)
}
#some data
PAR <- 10:1600
GPP <- Falge2000(-450,-0.73,PAR) + rnorm(length(PAR),sd=0.0001)
df1 <- data.frame(PAR,GPP)
#nls fit
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
#bootstrap of residuals
library(nlstools)
summary(nlsBoot(mod,niter=5))
#works
#now do it several times
#and in parallel
library(foreach)
library(doParallel)
cl <- makeCluster(1)
registerDoParallel(cl)
ttt <- foreach(1:5, .packages='nlstools',.export="df1") %dopar% {
res <- nlsBoot(mod,niter=5)
summary(res)
}
#Error in { :
#task 1 failed - "Procedure aborted: the fit only converged in 1 % during bootstrapping"
stopCluster(cl)
I suspect this an issue with environments and after looking at the code of nlsBoot the problem seems to arise from the use of an anonymous function in a lapply call:
l1 <- lapply(1:niter, function(i) {
data2[, var1] <- fitted1 + sample(scale(resid1, scale = FALSE),
replace = TRUE)
nls2 <- try(update(nls, start = as.list(coef(nls)), data = data2),
silent = TRUE)
if (inherits(nls2, "nls"))
return(list(coef = coef(nls2), rse = summary(nls2)$sigma))
})
if (sum(sapply(l1, is.null)) > niter/2)
stop(paste("Procedure aborted: the fit only converged in",
round(sum(sapply(l1, is.null))/niter), "% during bootstrapping"))
Is there a way to use nlsBoot in a parallel loop? Or do I need to modify the function? (I could try to use a for loop instead of lapply.)
By moving the creation of the mod object into the %dopar% loop, it looks like everything works OK. Also, this automatically exports the df1 object, so you can remove the .export argument.
ttt <- foreach(1:5, .packages='nlstools') %dopar% {
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
res <- nlsBoot(mod,niter=5)
capture.output(summary(res))
}
However, you might need to work out what you want returned. Using capture.output was just to see if things were working, since summary(res) seemed to only return NULL.