I am trying to run code on several cores (I tried both the snow and parallel packages). I have
cl <- makeCluster(2)
y <- 1:10
sapply(1:5, function(x) x + y) # Works
parSapply(cl, 1:5, function(x) x + y)
The last line returns the error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: object 'y' not found
Clearly parSapply isn't finding y in the global environment. Is there any way to get around this? Thanks.
The worker nodes don't know about the y in the global environment on the master, so you need to tell them about it somehow. Two options:
library(parallel)
cl <- makeCluster(2)
y <- 1:10
# add y to function definition and parSapply call
parSapply(cl, 1:5, function(x,y) x + y, y)
# export y to the global environment of each node
# then call your original code
clusterExport(cl, "y")
parSapply(cl, 1:5, function(x) x + y)
It is worth mentioning that your example would work if parSapply were called from within a function; the real issue is where the anonymous function function(x) x + y is created. For example, the following code works correctly:
library(parallel)
fun <- function(cl, y) {
parSapply(cl, 1:5, function(x) x + y)
}
cl <- makeCluster(2)
fun(cl, 1:10)
stopCluster(cl)
This is because functions that are created inside other functions are serialized along with the local environment in which they were created, while functions created in the global environment are not serialized along with the global environment. This can be useful at times, but it can also lead to a variety of problems if you're not aware of the issue.
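For instance, here is a minimal sketch of the difference (makeAdder is a hypothetical helper used only for illustration):
library(parallel)
cl <- makeCluster(2)
# A function created inside another function carries its enclosing environment,
# so the workers can see the local y
makeAdder <- function(y) function(x) x + y
f <- makeAdder(1:10)
parSapply(cl, 1:5, f)      # works
# A function created at the top level does not carry the global environment,
# so the workers cannot see the global y
y <- 1:10
g <- function(x) x + y
# parSapply(cl, 1:5, g)    # fails: object 'y' not found
stopCluster(cl)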
Related
I am trying to optimize a function using the genoud genetic optimizer in R, which is part of library(rgenoud). I have limited experience with setting up parallel processing in R.
genoud has a built-in cluster option that takes either an existing cluster object or a character vector of machine names. I want to set up parallel processing on a single multi-core machine. I understand the local machine's name is localhost and that repeating the name will use more than one core on the machine.
This approach seems to be working fine for simple functions such as:
genout <- genoud(sin, 1, cluster=c('localhost','localhost','localhost'))
However, when I optimize more complex function structures, some of the machines or environments do not find all of the functions. Here is an example:
fun1 <- function(x) {sin(x)}
fun2 <- function(x) {
x <- fun1(x)
ret <- x + cos(x)
return(ret)
}
genout <- genoud(fun2, 1, cluster=c('localhost','localhost','localhost'))
This gives the error:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: could not find function "fun1"
One solution seems to be to embed fun1 into fun2. However, this seems inefficient if fun2 is run a large number of times (in real cases, unlike this silly example). Is it true that the only way to solve this problem is to embed fun1 in fun2?
Edit: Even when embedding, there are more problems when objects need to be passed to fun2 via genoud, see:
y=1
fun2 <- function(x,z) {
fun1 <- function(x) {sin(x)}
x <- fun1(x)
ret <- x + cos(x)*z
return(ret)
}
genout <- genoud(fun2, 1, cluster=c('localhost','localhost','localhost'), z=y)
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: object 'y' not found
I think the problem is that the genoud function is not properly evaluating and sending the additional arguments to the workers. In this case, the argument z is sent to the workers without being evaluated, so the workers have to evaluate it, but they fail because the variable y wasn't sent to the workers.
A work-around is to export the necessary variables to the workers using clusterExport, which requires you to explicitly create the cluster object for genoud to use:
library(rgenoud)
library(parallel)
cl <- makePSOCKcluster(3)
fun1 <- function(x) {sin(x)}
fun2 <- function(x, z) {
x <- fun1(x)
x + cos(x) * z
}
y <- 1
clusterExport(cl, c('fun1', 'y'))
genout <- genoud(fun2, 1, cluster=cl, z=y)
This also exports fun1, which is necessary since genoud doesn't know that fun2 depends on it.
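If fun2 depends on many helpers, an alternative sketch is to define them all in a file and source it on every worker (helpers.R below is a hypothetical file containing fun1 and any other dependencies):
clusterEvalQ(cl, source("helpers.R"))  # define the helper functions on each worker
clusterExport(cl, "y")                 # data objects still have to be exported
genout <- genoud(fun2, 1, cluster=cl, z=y)
stopCluster(cl)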
I am fond of the parallel package in R and how easy and intuitive it is to do parallel versions of apply, sapply, etc.
Is there a similar parallel function for replicate?
You can just use the parallel versions of lapply or sapply: instead of saying to replicate this expression n times, you apply over 1:n, and instead of giving an expression, you wrap that expression in a function that ignores the argument sent to it.
Possibly something like:
#create cluster
library(parallel)
cl <- makeCluster(detectCores()-1)
# get library support needed to run the code
clusterEvalQ(cl,library(MASS))
# put objects in place that might be needed for the code
myData <- data.frame(x=1:10, y=rnorm(10))
clusterExport(cl,c("myData"))
# Set a different seed on each member of the cluster (just in case)
clusterSetRNGStream(cl)
#... then parallel replicate...
parSapply(cl, 1:10000, function(i,...) { x <- rnorm(10); mean(x)/sd(x) } )
#stop the cluster
stopCluster(cl)
as the parallel equivalent of:
replicate(10000, {x <- rnorm(10); mean(x)/sd(x) } )
Using clusterEvalQ as a model, I think I would implement a parallel replicate as:
parReplicate <- function(cl, n, expr, simplify=TRUE, USE.NAMES=TRUE)
parSapply(cl, integer(n), function(i, ex) eval(ex, envir=.GlobalEnv),
substitute(expr), simplify=simplify, USE.NAMES=USE.NAMES)
The arguments simplify and USE.NAMES are compatible with sapply rather than replicate, but they make it a better wrapper around parSapply in my opinion.
Here's an example derived from the replicate man page:
library(parallel)
cl <- makePSOCKcluster(3)
hist(parReplicate(cl, 100, mean(rexp(10))))
The future.apply package provides future_replicate(), a plug-in replacement for replicate() that runs in parallel and uses statistically sound parallel random number generation out of the box:
library(future.apply)
plan(multisession, workers = 4)
y <- future_replicate(100, mean(rexp(10)))
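When you are done, you can switch back to sequential processing, which also shuts down the background workers:
plan(sequential)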
Let's say I'm trying to run the following code:
library(gregmisc)
library(parallel)
myfunction <- function(x){
combinations(10, x, 1:10)
}
cl <- makeCluster(getOption("cl.cores", 2))
parLapply(cl, 3, myfunction)
I'm getting the error
#Error in checkForRemoteErrors(val) :
#one node produced an error: could not find function "combinations"
If I load the gregmisc package within the function, it works:
myfunction <- function(x){
library(gregmisc)
combinations(10, x, 1:10)
}
cl <- makeCluster(getOption("cl.cores", 2))
parLapply(cl, 3, myfunction)
The question is: how can I avoid loading packages within the function?
I saw that similar questions about "snow" and "snowfall" were already asked here and here,
but I couldn't get their solutions to work with the "parallel" package.
I've tried (without success):
library(snow)
library(snowfall)
sfExport(list=list("combinations"))
sfLibrary(gregmisc)
clusterEvalQ(cl, library(gregmisc))
I don't see any combinations function in gregmisc. Could that be your actual problem?
Loading packages on each node with clusterEvalQ() should work, and always has worked for me.
The following code is lifted nearly verbatim from page 8 of vignette("parallel"):
require(parallel)
cl <- makeCluster(4)
junk <- clusterEvalQ(cl, library(boot)) ## Discard result
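Applied to your case, a sketch along the same lines (assuming the combinations() you want is the one from gtools, which the old gregmisc bundle loads):
library(parallel)
cl <- makeCluster(getOption("cl.cores", 2))
clusterEvalQ(cl, library(gtools))   # load the package on every worker
myfunction <- function(x) combinations(10, x, 1:10)
parLapply(cl, 3, myfunction)
stopCluster(cl)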
The environment on the individual nodes is not the same as on the master. Try specifying the package explicitly:
myfunction <- function(x){
gregmisc::combinations(10, x, 1:10)
}
I'm trying to use the snow package to score an elastic net model in R, but I can't figure out how to get the predict function to run across multiple nodes in the cluster. The code below contains both a timing benchmark and the actual code producing the error:
##############
#Snow example#
##############
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas<-as.numeric(BostonHousing$chas)
ind<-as.matrix(BostonHousing[,1:13],col.names=TRUE)
dep<-as.matrix(BostonHousing[,14],col.names=TRUE)
fit_lambda<-cv.glmnet(ind,dep)
#fit elastic net
fit_en<<-glmnet(ind,dep,family="gaussian",alpha=0.5,lambda=fit_lambda$lambda.min)
ind_exp<-rbind(ind,ind)
#single thread baseline
i<-0
while(i < 2000){
ind_exp<-rbind(ind_exp,ind)
i = i+1
}
system.time(st<-predict(fit_en,ind_exp))
#formula for parallel execution
pred_en<-function(x){
x<-as.matrix(x)
return(predict(fit_en,x))
}
#make the cluster
cl<-makeSOCKcluster(4)
clusterExport(cl,"fit_en")
clusterExport(cl,"pred_en")
#parallel baseline
system.time(mt<-parRapply(cl,ind_exp,pred_en))
I have been able to parallelize via forking on a Linux box using multicore, but I ended up having to use a poorly performing mclapply combined with unlist, and I was looking for a better way to do it with snow (one that would, incidentally, work on both my dev Windows PC and my prod Linux servers). Thanks, SO.
I should start by saying that the predict.glmnet function doesn't seem to be compute intensive enough to be worth parallelizing. But this is an interesting example, and my answer may be helpful to you, even if this particular case isn't worth parallelizing.
The main problem is that the parRapply function is a parallel wrapper around apply, which in turn calls your function on the individual rows of the submatrices; that isn't what you want. You want your function to be called directly on the submatrices. Snow doesn't contain a convenience function that does that, but it's easy to write one:
rowchunkapply <- function(cl, x, fun, ...) {
do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
Another problem in your example is that you need to load glmnet on the workers so that the correct predict function is called. You also don't need to explicitly export the pred_en function, since that is handled for you.
Here's my version of your example:
library(snow)
library(glmnet)
library(mlbench)
data(BostonHousing)
BostonHousing$chas <- as.numeric(BostonHousing$chas)
ind <- as.matrix(BostonHousing[,1:13], col.names=TRUE)
dep <- as.matrix(BostonHousing[,14], col.names=TRUE)
fit_lambda <- cv.glmnet(ind, dep)
fit_en <- glmnet(ind, dep, family="gaussian", alpha=0.5,
lambda=fit_lambda$lambda.min)
ind_exp <- do.call("rbind", rep(list(ind), 2002))
# make and initialize the cluster
cl <- makeSOCKcluster(4)
clusterEvalQ(cl, library(glmnet))
clusterExport(cl, "fit_en")
# execute a function on row chunks of x and rbind the results
rowchunkapply <- function(cl, x, fun, ...) {
do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
# worker function
pred_en <- function(x) {
predict(fit_en, x)
}
mt <- rowchunkapply(cl, ind_exp, pred_en)
You may also be interested in using the cv.glmnet parallel option, which uses the foreach package.
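For instance, a minimal sketch (assuming the doParallel backend; any registered foreach backend should work):
library(doParallel)
registerDoParallel(cores = 4)
fit_lambda <- cv.glmnet(ind, dep, parallel = TRUE)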
I would like to speed up my bootstrap function, which works perfectly fine by itself. I read that since R 2.14 there is a package called parallel, but I find it very hard for somebody with little computer-science knowledge to really implement it. Maybe somebody can help.
So here we have a bootstrap:
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b<-numeric()
for(i in 1:boot){
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
print(paste('Run',i,sep=" "))
}
The goal is to use parallel processing / exploit the multiple cores of my PC. I am running R under Windows. Thanks!
EDIT (after reply by Noah)
The following syntax can be used for testing:
library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
start1<-Sys.time()
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
unname(lm(y~x,bootstrap_data)$coef[2])
}
end1<-Sys.time()
boot_b<-numeric()
start2<-Sys.time()
for(i in 1:boot){
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
}
end2<-Sys.time()
start1-end1
start2-end2
as.numeric(start1-end1)/as.numeric(start2-end2)
However, on my machine the simple R code is quicker. Is this one of the known side effects of parallel processing, i.e. does the overhead of forking the process add to the time for 'simple tasks' like this one?
Edit: On my machine the parallel code takes about 5 times longer than the 'simple' code. This factor apparently does not change as I increase the complexity of the task (e.g. increase boot or n). So maybe there is an issue with the code or my machine (Windows-based processing?).
Try the boot package. It is well-optimized, and contains a parallel argument. The tricky thing with this package is that you have to write new functions to calculate your statistic, which accept the data you are working on and a vector of indices to resample the data. So, starting from where you define data, you could do something like this:
# Define a function to resample the data set from a vector of indices
# and return the slope
slopeFun <- function(df, i) {
  # df must be a data frame
  # i is the vector of row indices that boot will pass
  xResamp <- df[i, ]
  lm(y ~ x, data=xResamp)$coef[2]
}
# Then carry out the resampling
library(boot)
b <- boot(data, slopeFun, R=1000, parallel="multicore")
b$t is a vector of the resampled statistic, and boot has lots of nice methods to easily do stuff with it, for instance plot(b).
Note that the parallel methods depend on your platform. On your Windows machine, you'll need to use parallel="snow".
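A sketch of the Windows variant (parallel, ncpus and cl are all arguments of boot()):
library(boot)
library(parallel)
cl <- makeCluster(2)
b <- boot(data, slopeFun, R=1000, parallel="snow", ncpus=2, cl=cl)
stopCluster(cl)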
I haven't tested foreach with the parallel backend on Windows, but I believe this will work for you:
library(foreach)
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl=cl)
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
unname(lm(y~x,bootstrap_data)$coef[2])
}
I think the main problem is that you have a lot of small tasks. In some cases, you can improve your performance by using task chunking, which results in fewer, but larger data transfers between the master and workers, which is often more efficient:
library(iterators)   # idiv() comes from the iterators package
boot_b <- foreach(b=idiv(boot, chunks=getDoParWorkers()), .combine='c') %dopar% {
sapply(1:b, function(i) {
bdata <- data[sample(nrow(data), nrow(data), replace=T),]
lm(y~x, bdata)$coef[[2]]
})
}
I like using the idiv function (from the iterators package) for this, but you could use b=rep(boot/detectCores(), detectCores()) if you prefer.
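A sketch of that variant (it assumes boot divides evenly by the number of cores):
library(parallel)   # for detectCores()
chunks <- rep(boot/detectCores(), detectCores())
boot_b <- foreach(b=chunks, .combine='c') %dopar% {
  sapply(1:b, function(i) {
    bdata <- data[sample(nrow(data), nrow(data), replace=T),]
    lm(y~x, bdata)$coef[[2]]
  })
}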
This is an old question, but I think a lot of this can be made more efficient using data.table. The benefits will not really be noticed until larger data sets are used; I'm putting this answer here to help others that may have to bootstrap larger datasets.
library(data.table)
setDT(data) # convert data.frame to data.table by reference
system.time({
b <- rbindlist(
lapply(
1:boot,
function(i) {
data.table(
# store the statistic
'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
# store the iteration
'iteration' = i
)
}
)
)
})
# 1.66 seconds on my system
library(ggplot2)
ggplot(b) + geom_density(aes(x = statistic))
You could then further improve performance by making use of the parallel package.
library(parallel)
cl <- makeCluster(detectCores()) # use all cores on machine, can change this
clusterExport( # give it the variables it needs #nolint
cl,
c(
"data"
),
envir = environment()
)
clusterEvalQ( # give it libraries needed #nolint
cl,
c(
library(data.table)
)
)
system.time({
b <- rbindlist(
parLapply( # this is changed to be in parallel
cl, # give it the cluster you created earlier
1:boot,
function(i) {
data.table(
'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
'iteration' = i
)
}
)
)
})
stopCluster(cl)
# .47 seconds on my machine