I'm currently developing an R package that will be using parallel computing to solve some tasks, through means of the "parallel" package.
I'm getting some really awkward behavior when utilizing clusters defined inside functions of my package, where the parLapply function assigns a job to a worker and waits for it to finish to assign a job to next worker.
Or at least this is what appears to be happening, through the observation of the log file "cluster.log" and the list of running processes in the unix shell.
Below is a mockup version of the original function declared inside my package:
.parSolver <- function( varMatrix, var1 ) {
no_cores <- detectCores()
#Rows in varMatrix
rows <- 1:nrow(varMatrix[,])
# Split rows in n parts
n <- no_cores
parts <- split(rows, cut(rows, n))
# Initiate cluster
cl <- makePSOCKcluster(no_cores, methods = FALSE, outfile = "/home/cluster.log")
clusterEvalQ(cl, library(raster))
clusterExport(cl, "varMatrix", envir=environment())
clusterExport(cl, "var1", envir=environment())
rParts <- parLapply(cl = cl, X = 1:n, fun = function(x){
part <- rasterize(varMatrix[parts[[x]],], raster(var1), .....)
print(x)
return(part)
})
do.call(merge, rParts)
}
NOTES:
I'm using makePSOCKcluster because i want the code to run on windows and unix systems alike although this particular problem is only manifesting itself in a unix system.
Functions rasterize and raster are defined in library(raster), exported to the cluster.
The weird part to me is if I execute the exact same code of the function parSolver in a global environment every thing works smoothly, all workers take one job at the same time and the task completes in no time.
However if I do something like:
library(myPackage)
varMatrix <- (...)
var1 <- (...)
result <- parSolver(varMatrix, var1)
the described problem appears.
It appears to be a load balancing problem however that does not explain why it works ok in one situation and not in the other.
Am I missing something here?
Thanks in advance.
I don't think parLapply is running sequentially. More likely, it's just running inefficiently, making it appear to run sequentially.
I have a few suggestions to improve it:
Don't define the worker function inside parSolver
Don't export all of varMatrix to each worker
Create the cluster outside of parSolver
The first point is important, because as your example now stands, all of the variables defined in parSolver will be serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any function, the serialization won't capture any unwanted variables.
The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.
Here's a fake, but self-contained example that is similar to yours that demonstrates my suggestions:
# Define worker function outside of any function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
library(raster)
mat * var1
}
parSolver <- function(cl, varMatrix, var1) {
parts <- splitIndices(nrow(varMatrix), length(cl))
varMatrixParts <- lapply(parts, function(i) varMatrix[i,,drop=FALSE])
rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
do.call(rbind, rParts)
}
library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)
Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code, as well as making it a bit more efficient.
Related
I have a large database and I wrote a code which executes the same calculations on that in a rolling manner by nesting it in a for loop. My problem is that the code runs pretty long. As I read, this is probably caused by R using a single-threaded method as default. As far as I know, foreach package would make it possible to speed up the execution by considerable time, however, I am unsure how to implement it. Currently, my code looks like this, in every iteration I subset a chunk of the large database and do various stuff with these subsets. At the end of an iteration, I collect the output in a time series. Is it possible to apply foreach in this situation?
(k in seq(1,5284, 21)) {
fdata <- data[k:(k+251),]
tdata <- data[(k+252):(k+377),]
}
Thanks!
This is certainly doable using foreach. Depending on your OS you would first have to load a suitable backend (e.g. SNOW on a windows machine) and then set up a cluster.
Example:
library(foreach)
library(doSNOW)
# set number of cores/CPUs to be used
(n_cores <- parallel::detectCores() - 1)
# some example data
dat <- matrix(1:1e3, ncol = 10)
# a set you iterate over
k <- 1:99
# run stuff in parallel
cl <- makeCluster(n_cores)
registerDoSNOW(cl)
result <- foreach(k) %dopar% {
fdata <- dat[k:(k+1), ]
# do computationally expensive stuff with `fdata`
# ... and return something
cumsum(fdata[1,] + fdata[2,])
}
stopCluster(cl)
By default result will be a list of the results. There are, however, ways to combine into an array or similar. Look at details on the .combine argument in ?foreach.
I'm running the following code (extracted from doParallel's Vignettes) on a PC (OS Linux) with 4 and 8 physical and logical cores, respectively.
Running the code with iter=1e+6 or less, every thing is fine and I can see from CPU usage that all cores are employed for this computation. However, with larger number of iterations (e.g. iter=4e+6), it seems parallel computing does not work in which case. When I also monitor the CPU usage, just one core is involved in computations (100% usage).
Example1
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
Do you have any idea what could be the reason? Could memory be the cause?
I googled around and I found THIS relevant to my question but the point is that I'm not given any kind of error and the OP seemingly has came up with a solution by providing necessary packages inside foreach loop. But no package is used inside my loop, as can be seen.
UPDATE1
My problem still is not solved. As per my experiments, I don't think that memory could be the reason. I have 8GB of memory on the system on which I run the following simple parallel (over all 8 logical cores) iteration:
Example2
require("doParallel")
require("foreach")
registerDoParallel(cores=8)
iter=4e+6
ptime <- system.time({
r <- foreach(i=1:iter, .combine=rbind) %dopar% {
i
}
})[3]
I do not have problem with running of this code but when I monitor the CPU usage, just one core (out of 8) is 100%.
UPDATE2
As for Example2, #SteveWeston (thanks for pointing this out) stated that (in comments) : "The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations."
However, Example1 still remains unsolved. When I run it and I monitor the processes with htop, here is what happens in more detail:
Let's name all 8 created processes p1 through p8. The status (column S in htop) for p1 is R meaning that it's running and remains unchanged. However, for p2 up to p8, after some minutes, the status changes to D (i.e. uninterruptible sleep) and, after some minutes, again changes to Z (i.e. terminated but not reaped by its parent). Do you have any idea why this happens?
I think you're running low on memory. Here's a modified version of that example that should work better when you have many tasks. It uses doSNOW rather than doParallel because doSNOW allows you to process the results with the combine function as they're returned by the workers. This example writes those results to a file in order to use less memory, however it reads the results back into memory at the end using a ".final" function, but you could skip that if you don't have enough memory.
library(doSNOW)
library(tcltk)
nw <- 4 # number of workers
cl <- makeSOCKcluster(nw)
registerDoSNOW(cl)
x <- iris[which(iris[,5] != 'setosa'), c(1,5)]
niter <- 15e+6
chunksize <- 4000 # may require tuning for your machine
maxcomb <- nw + 1 # this count includes fobj argument
totaltasks <- ceiling(niter / chunksize)
comb <- function(fobj, ...) {
for(r in list(...))
writeBin(r, fobj)
fobj
}
final <- function(fobj) {
close(fobj)
t(matrix(readBin('temp.bin', what='double', n=niter*2), nrow=2))
}
mkprogress <- function(total) {
pb <- tkProgressBar(max=total,
label=sprintf('total tasks: %d', total))
function(n, tag) {
setTkProgressBar(pb, n,
label=sprintf('last completed task: %d of %d', tag, total))
}
}
opts <- list(progress=mkprogress(totaltasks))
resultFile <- file('temp.bin', open='wb')
r <-
foreach(n=idiv(niter, chunkSize=chunksize), .combine='comb',
.maxcombine=maxcomb, .init=resultFile, .final=final,
.inorder=FALSE, .options.snow=opts) %dopar% {
do.call('c', lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}))
}
I included a progress bar since this example takes several hours to execute.
Note that this example also uses the idiv function from the iterators package to increase the amount of work in each of the tasks. This technique is called chunking, and often improves the parallel performance. However, using idiv messes up the task indices, since the variable i is now a per-task index rather than a global index. For a global index, you can write a custom iterator that wraps idiv:
idivix <- function(n, chunkSize) {
i <- 1
it <- idiv(n, chunkSize=chunkSize)
nextEl <- function() {
m <- nextElem(it) # may throw 'StopIterator'
value <- list(i=i, m=m)
i <<- i + m
value
}
obj <- list(nextElem=nextEl)
class(obj) <- c('abstractiter', 'iter')
obj
}
The values emitted by this iterator are lists, each containing a starting index and a count. Here's a simple foreach loop that uses this custom iterator:
r <-
foreach(a=idivix(10, chunkSize=3), .combine='c') %dopar% {
do.call('c', lapply(seq(a$i, length.out=a$m), function(i) {
i
}))
}
Of course, if the tasks are compute intensive enough, you may not need chunking and can use a simple foreach loop as in the original example.
At first I thought you were running into memory problems because submitting many tasks does use more memory, and that can eventually cause the master process to get bogged down, so my original answer shows several techniques for using less memory. However, now it sounds like there's a startup and shutdown phase where only the master process is busy, but the workers are busy for some period of time in the middle. I think the issue is that the tasks in this example aren't really very compute intensive, and so when you have a lot of tasks, you start to really notice the startup and shutdown times. I timed the actual computations and found that each task only takes about 3 milliseconds. In the past, you wouldn't get any benefit from parallel computing with tasks that small, but now, depending on your machine, you can get some benefit but the overhead is significant, so when you have a great many tasks you really notice that overhead.
I still think that my other answer works well for this problem, but since you have enough memory, it's overkill. The most important technique to use chunking. Here is an example that uses chunking with minimal changes to the original example:
require("doParallel")
nw <- 8
registerDoParallel(nw)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
niter <- 4e+6
r <- foreach(n=idiv(niter, chunks=nw), .combine='rbind') %dopar% {
do.call('rbind', lapply(seq_len(n), function(i) {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}))
}
Note that this does the chunking slightly differently than my other answer. It only uses one task per worker by using the idiv chunks option, rather than the chunkSize option. This reduces the amount of work done by the master and is a good strategy if you have enough memory.
I need to multi-thread my R application as it takes 5 minutes to run and is only using 15% of the computers available CPU.
An example of a process which takes a while to run is calculating the mean of a very large raster stack containing n layers:
mean = cellStats(raster_layers[[n]], stat='sd', na.rm=TRUE)
Using the parallel library, I can create a new cluster and pass a function to it:
cl <- makeCluster(8, type = "SOCK")
parLapply(cl, raster_layers[[1]], mean_function)
stopCluster(cl)
where mean function is:
mean_function <- function(raster_object)
{
result = cellStats(raster_object, stat='mean', na.rm=TRUE)
return(result)
}
This method works fine except that it can't see the 'raster' package which is required to use cellStats. So it fails saying no function for cellStats. I have tried including the library within the function but this doesnt help.
The raster package comes with a cluster function, and it CAN see the function cellStats, however as far as I can tell, the cluster function must return a raster object and must be passed a single raster object which isn't flexible enough for me, I need to be able to pass a list of objects and return a numeric variable... which I can do with normal clustering using the parallel library if only it can see the raster package functions.
So, does anybody know how I can pass a package to a node with multi-threading in R? Or, how I can return a single value from the raster cluster function perhaps?
The solution came from Ben Barnes, thank you.
The following code works fine:
mean_function <- function(variable)
{
result = cellStats(variable, stat='mean', na.rm=TRUE)
return(result)
}
cl <- makeCluster(procs, type = "SOCK")
clusterEvalQ(cl, library(raster))
result = parLapply(cl, a_list, mean_function)
stopCluster(cl)
Where procs is the number of processors you wish to use, which must be the same value as the length of the list you are passing (in this case called a_list).
a_list in this case needs to be a list containing rasters which can be operated on to calculate the mean using the cellStats function. So, a_list is simply a list of rasters, containing procs number of rasters.
I'm trying to run a NetLogo simulation (using RNetLogo package) in R using parallel processing on my laptop. I'm trying to assess "t-feeding of females" using 3 (i.e., 0, 25, and 50) different "minimum-separation" values. For each "minimum-separation" value, I'd like to replicate the simulation 10 times. I can run everything correctly just using lapply but I'm having trouble with parLapply. I've just started using the package "parallel" so I'm sure it is something in the syntax.
#Set up clusters for parallel
processors <- detectCores()
cl <- makeCluster(processors)
#Simulation
sim3 <- function(min_sep) {
NLCommand("set minimum-separation ", min_sep, "setup")
ret <- NLDoReport(720, "go", "[t-feeding] of females", as.data.frame=TRUE)
tot <- sum(ret[,1])
return(tot)
}
#Replicate simulations 10 times using lapply and create boxplots. This one works.
rep.sim3 <- function(min_sep, rep) {
return(
lapply(min_sep, function(min_sep) {
replicate(rep, sim3(min_sep))
})
)
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
boxplot(res,names=d, xlab="Minimum Separation", ylab="Time spent feeding")
#Replicate simulations 10 times using parLapply. This one does not work.
rep.sim3 <- function(min_sep, rep) {
return(
parLapply(cl, min_sep, function(min_sep) {
replicate(rep, sim3(min_sep))
})
)
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
# Error in checkForRemoteErrors(val) : 3 nodes produced errors; first error: could not find function "sim3"
#Replicate simulations 10 times using parLapply. This one does work but creates a list of the wrong length and therefore the boxplot cannot be plotted correctly.
rep.sim3 <- function(min_sep, rep) {
return(
parLapply(cl, replicate(rep, d), sim3))
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
Ideally I'd like to make the first parLapply work. Alternatively, I guess I could modify res from the parLapply that works so that the list has a length of max_sep instead of 30. However, I can't seem to do that. Any help would be much appreciated!
Thanks in advance.
You need to initialize the cluster workers before executing rep.sim3. The error message indicates that your workers can't execute the sim3 function because you haven't exported it to them. Also, I noticed that you haven't loaded the RNetlogo package on the workers, either.
The easiest way to initialize the workers is with the clusterEvalQ and clusterExport functions:
clusterEvalQ(cl, library(RNetLogo))
clusterExport(cl, 'sim3')
Note that you shouldn't do this in your rep.sim3 function, since that would be inefficient and unnecessary. Do it just once after creating the cluster object and sim3 has been defined.
This initialization is necessary because the workers started via makeCluster don't know anything about your variables or functions, or anything else about your R session. And parLapply doesn't analyze the function that you pass to it any more than lapply does. The difference is that lapply executes in your local R session where sim3 is defined and the RNetLogo package is loaded. parLapply executes the specified function in remote R sessions that have not been initialized by executing your R script.
Is there a problem when accessing/writing to global variable in using doSNOW package on multiple cores?
In the below program, each of the MyCalculations(ii) writes to the ii-th column of the matrix "globalVariable"...
Do you think the result will be correct? Will there be hidden catches?
Thanks a lot!
p.s. I have to write out to the global variable because this is a simplied example, in fact I have lots of outputs that need to be transported from within the parallel loops... therefore, probably the only way is to write out to global variables...
library(doSNOW)
MaxSearchSpace=44*5
globalVariable=matrix(0, 10000, MaxSearchSpace)
cl<-makeCluster(7)
registerDoSNOW(cl)
foreach (ii = 2:nMaxSearchSpace, .combine=cbind, .verbose=F) %dopar%
{
MyCalculations(ii)
}
stopCluster(cl)
p.s. I am asking - within the DoSnow framework, is there any danger of accessing/writing global variables... thx
Since this question is a couple months old, I hope you've found an answer by now. However, in case you're still interested in feedback, here's something to consider:
When using foreach with a parallel backend, you won't be able to assign to variables in R's global environment in the way you're attempting (you probably noticed this). Using a sequential backend, assignment will work, but not using a parallel one like with doSNOW.
Instead, save all the results of your calculations for each iteration in a list and return this to an object, so that you can extract the appropriate results after all calculations have been completed.
My suggestion starts similarly to your example:
library(doSNOW)
MaxSearchSpace <- 44*5
cl <- makeCluster(parallel::detectCores())
# do not create the globalVariable object
registerDoSNOW(cl)
# Save the results of the `foreach` iterations as
# lists of lists in an object (`theRes`)
theRes <- foreach (ii = 2:MaxSearchSpace, .verbose=F) %dopar%
{
# do some calculations
theNorms <- rnorm(10000)
thePois <- rpois(10000, 2)
# store the results in a list
list(theNorms, thePois)
}
After all iterations have been completed, extract the results from theRes and store them as objects (e.g., globalVariable, globalVariable2, etc.)
globalVariable1 <- do.call(cbind, lapply(theRes, "[[", 1))
globalVariable2 <- do.call(cbind, lapply(theRes, "[[", 2))
With this in mind, if you are performing calculations with each iteration that are dependent on the results of calculations from previous iterations, then this type of parallel computing is not the approach to take.