I need to multi-thread my R application, as it takes 5 minutes to run and uses only 15% of the computer's available CPU.
An example of a process which takes a while to run is calculating the mean of a very large raster stack containing n layers:
mean_value <- cellStats(raster_layers[[n]], stat = 'mean', na.rm = TRUE)
Using the parallel library, I can create a new cluster and pass a function to it:
cl <- makeCluster(8, type = "SOCK")
parLapply(cl, raster_layers[[1]], mean_function)
stopCluster(cl)
where mean_function is:
mean_function <- function(raster_object) {
  result <- cellStats(raster_object, stat = 'mean', na.rm = TRUE)
  return(result)
}
This method would work, except that the worker processes can't see the 'raster' package, which is required to use cellStats, so it fails saying there is no function cellStats. I have tried loading the library within the function, but this doesn't help.
The raster package comes with its own cluster function, and that CAN see cellStats. However, as far as I can tell, the raster cluster function must be passed a single raster object and must return a raster object, which isn't flexible enough for me. I need to be able to pass a list of objects and return a numeric value, which I can do with normal clustering using the parallel library, if only the workers could see the raster package functions.
So, does anybody know how I can make a package available to the worker nodes when multi-threading in R? Or how I can return a single value from the raster cluster function, perhaps?
The solution came from Ben Barnes, thank you.
The following code works fine:
mean_function <- function(variable) {
  result <- cellStats(variable, stat = 'mean', na.rm = TRUE)
  return(result)
}
cl <- makeCluster(procs, type = "SOCK")
clusterEvalQ(cl, library(raster))
result = parLapply(cl, a_list, mean_function)
stopCluster(cl)
Where procs is the number of processors you wish to use. In this example it matches the length of the list being passed (here called a_list), so each worker handles one raster, although parLapply will happily split a longer list across the workers as well.
a_list simply needs to be a list of rasters that cellStats can operate on to calculate the mean.
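For completeness, here is a small self-contained sketch of the same pattern. The dummy rasters and the value of procs are made up purely for illustration; the clusterEvalQ call is the essential part.
library(raster)
library(parallel)

# build a small list of in-memory dummy rasters for demonstration
a_list <- lapply(1:4, function(i) setValues(raster(nrows = 10, ncols = 10), runif(100)))

mean_function <- function(variable) {
  cellStats(variable, stat = 'mean', na.rm = TRUE)
}

procs <- 4
cl <- makeCluster(procs)                # default PSOCK cluster; the code above uses type = "SOCK"
clusterEvalQ(cl, library(raster))       # make 'raster' visible on every worker
result <- parLapply(cl, a_list, mean_function)
stopCluster(cl)

unlist(result)                          # one numeric mean per raster in a_list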
This question is specifically about using multiple cores to run a function that requires both a package and additional arguments.
I have a large dataset of the following form:
Event_ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Type=c("A","B","C","D","E","A","B","C","D","E","A","B","C","D")
Revenue1=c(24,9,51,7,22,15,86,66,0,57,44,93,34,37)
Revenue2=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(Event_ID,Type,Revenue1,Revenue2)
I have a fairly complex function and am trying to run it on multiple cores. Below I present a simplified function that essentially takes the sum of two revenue columns and subtracts the product of two matrices (my apologies if the function is over-simplified, but I am trying to understand how parallel processing works). Here is the function:
set.seed(100)
library(truncnorm)

alpha_old <- matrix(c(1, 5), nrow = 1)

Total_Revenue <- function(data, alpha_old) {
  for (i in 1:nrow(data)) {
    # generate beta for each row
    beta_old <- matrix(rtruncnorm(2, a = 1, b = 10, mean = 5, sd = 1), ncol = 1)
    # compute the adjustment factor for each row
    adjustment_factor <- alpha_old %*% beta_old
    data[i, 'Total_Rev'] <- data[i, 'Revenue1'] + data[i, 'Revenue2'] - adjustment_factor
  }
  return(data)
}
Total <- Total_Revenue(data = z, alpha_old = alpha_old)
print(Total)
Running the function regularly and printing the results out provides the expected output (output shown at the end).
Now I want to implement this using multiple cores with parSapply. I tried the following:
library(parallel)
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(cores=no_cores)
cl2 <- makeCluster(no_cores)
invisible(clusterEvalQ(cl2, library(truncnorm)))
clusterExport(cl=cl2, varlist=c("alpha_old","z"), envir=environment())
result1 = parSapply(cl2, X= 1:nrow(z),FUN=Total_Revenue,data=z,alpha_old=alpha_old)
stopCluster(cl2)
I get the following message:
Error in checkForRemoteErrors(val) : 14 nodes produced errors; first error: unused argument (X[[i]])
This is the first time I am trying to use multicore processing, and I am not very familiar with the parallel and doParallel packages. The actual dataset I am working with has around 5 million observations, and the function involves additional steps (comparing values against other values in the dataset) which I removed from the example function. Any help with this will be greatly appreciated. Thanks in advance.
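For reference, here is a minimal sketch of how parSapply hands over its arguments; it is not the poster's final code, and Total_Revenue_row is a hypothetical row-wise variant. parSapply passes each element of X as the first argument of FUN, so FUN needs a parameter to receive it; anything given via ... (here data and alpha_old) is passed on unchanged.
# hypothetical row-wise version: the first parameter receives X[[i]]
Total_Revenue_row <- function(i, data, alpha_old) {
  beta_old <- matrix(rtruncnorm(2, a = 1, b = 10, mean = 5, sd = 1), ncol = 1)
  adjustment_factor <- alpha_old %*% beta_old
  drop(data[i, 'Revenue1'] + data[i, 'Revenue2'] - adjustment_factor)
}

result1 <- parSapply(cl2, X = 1:nrow(z), FUN = Total_Revenue_row,
                     data = z, alpha_old = alpha_old)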
P.S. The output that I get by running the function on one core:
P.P.S. The example data is taken from another question that I had posted here:
Gpu processing R (How to use Gpu processing to run a function on subsets of a dataset)
I have a function called DTW, from a similarity measures package, which takes two matrices or data frames as its arguments and returns the dynamic time warping distance between them. Those data frames contain the longitudes and latitudes of trajectories.
My program looks like this, and all the data frames (df1, df2, df3 and so on) are already available:
distance <- function(arg1, arg2) {
  DTW(arg1, arg2)
}

# ddist holds the pairwise distances (initialised up front)
ddist <- matrix(0, length(LIST), length(LIST))

for (i in 1:length(LIST)) {
  for (j in 1:length(LIST)) {
    a <- get(paste0("df", i))
    b <- get(paste0("df", j))
    ddist[i, j] <- distance(a, b)
    print(ddist)
  }
}
I am filling the matrix ddist with the values returned by the distance function. The program works fine, but I want to make it faster using parallel programming, for example with parApply or parLapply.
Here is a simple method to give you an idea of how to make it parallel
k <- length(LIST)
ddist <- matrix(0, k, k)

library("doParallel")

cl <- makeCluster(4, outfile = '')
registerDoParallel(cl)

for (i in 1:k) {
  a <- get(paste0("df", i))
  # each row is filled in parallel; the workers need the df objects exported
  # and the package that provides DTW loaded (package name assumed here)
  ddist[i, ] <- foreach(j = 1:k, .combine = 'cbind',
                        .export = paste0("df", 1:k),
                        .packages = 'SimilarityMeasures') %dopar% {
    b <- get(paste0("df", j))
    distance(a, b)
  }
}
stopCluster(cl)
Having said that, a few things to evaluate:
- Only use parallel execution if the distance function takes more than about 2 seconds per call.
- Separate objects df1, df2, etc. may not be a good idea; storing each data frame as df[[1]], df[[2]], ... in a single list is better than using the get function (see the sketch after this list).
- If k is very large, the time taken to transfer the exported df1, df2, etc. to the workers becomes significant, so experiment with a few iteration counts to hit the performance sweet spot.
- Consider data.table, which supports in-place editing; using it instead of ddist might be faster.
- If this code is called from within a function, you might also need to export ddist, e.g. .export = c("ddist", paste0("df", 1:k)).
- Change the "4" in makeCluster to choose the number of cores you want; as a rule of thumb, keep it at detectCores() - 1.
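As a sketch of the second point, here is a hedged, list-based alternative. It assumes the trajectories are stored in a single list called dfs and that DTW comes from the SimilarityMeasures package; adjust both names to your setup.
library(parallel)

k <- length(dfs)
# one (i, j) pair per list element
pair_list <- split(expand.grid(i = 1:k, j = 1:k), seq_len(k * k))

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(SimilarityMeasures))  # assumed package providing DTW
clusterExport(cl, "dfs")                       # ship the trajectory list once

dvals <- parLapply(cl, pair_list, function(p) DTW(dfs[[p$i]], dfs[[p$j]]))
stopCluster(cl)

ddist <- matrix(unlist(dvals), k, k)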
I'm currently developing an R package that will use parallel computing to solve some tasks, by means of the "parallel" package.
I'm getting some really awkward behavior when utilizing clusters defined inside functions of my package: the parLapply function assigns a job to a worker and waits for it to finish before assigning a job to the next worker.
Or at least this is what appears to be happening, judging from the log file "cluster.log" and the list of running processes in the Unix shell.
Below is a mockup version of the original function declared inside my package:
.parSolver <- function( varMatrix, var1 ) {

    no_cores <- detectCores()

    # rows in varMatrix
    rows <- 1:nrow(varMatrix[,])

    # split rows in n parts
    n <- no_cores
    parts <- split(rows, cut(rows, n))

    # initiate cluster
    cl <- makePSOCKcluster(no_cores, methods = FALSE, outfile = "/home/cluster.log")
    clusterEvalQ(cl, library(raster))
    clusterExport(cl, "varMatrix", envir = environment())
    clusterExport(cl, "var1", envir = environment())

    rParts <- parLapply(cl = cl, X = 1:n, fun = function(x){
        part <- rasterize(varMatrix[parts[[x]],], raster(var1), .....)
        print(x)
        return(part)
    })

    do.call(merge, rParts)
}
NOTES:
I'm using makePSOCKcluster because I want the code to run on Windows and Unix systems alike, although this particular problem only manifests itself on a Unix system.
The functions rasterize and raster are defined in library(raster), which is loaded on the cluster workers.
The weird part to me is that if I execute the exact same code as the body of parSolver in the global environment, everything works smoothly: all workers take a job at the same time and the task completes in no time.
However if I do something like:
library(myPackage)
varMatrix <- (...)
var1 <- (...)
result <- parSolver(varMatrix, var1)
the described problem appears.
It appears to be a load-balancing problem; however, that does not explain why it works fine in one situation and not in the other.
Am I missing something here?
Thanks in advance.
I don't think parLapply is running sequentially. More likely, it's just running inefficiently, making it appear to run sequentially.
I have a few suggestions to improve it:
Don't define the worker function inside parSolver
Don't export all of varMatrix to each worker
Create the cluster outside of parSolver
The first point is important, because as your example now stands, all of the variables defined in parSolver will be serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any function, the serialization won't capture any unwanted variables.
The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.
Here's a fake, but self-contained example that is similar to yours that demonstrates my suggestions:
# Define worker function outside of any function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
library(raster)
mat * var1
}
parSolver <- function(cl, varMatrix, var1) {
parts <- splitIndices(nrow(varMatrix), length(cl))
varMatrixParts <- lapply(parts, function(i) varMatrix[i,,drop=FALSE])
rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
do.call(rbind, rParts)
}
library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)
Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code, as well as making it a bit more efficient.
I'm trying to run a NetLogo simulation (using the RNetLogo package) in R with parallel processing on my laptop. I'm trying to assess "t-feeding of females" using 3 different "minimum-separation" values (i.e., 0, 25, and 50). For each "minimum-separation" value, I'd like to replicate the simulation 10 times. I can run everything correctly just using lapply, but I'm having trouble with parLapply. I've just started using the "parallel" package, so I'm sure it is something in the syntax.
#Set up clusters for parallel
processors <- detectCores()
cl <- makeCluster(processors)
#Simulation
sim3 <- function(min_sep) {
NLCommand("set minimum-separation ", min_sep, "setup")
ret <- NLDoReport(720, "go", "[t-feeding] of females", as.data.frame=TRUE)
tot <- sum(ret[,1])
return(tot)
}
#Replicate simulations 10 times using lapply and create boxplots. This one works.
rep.sim3 <- function(min_sep, rep) {
return(
lapply(min_sep, function(min_sep) {
replicate(rep, sim3(min_sep))
})
)
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
boxplot(res,names=d, xlab="Minimum Separation", ylab="Time spent feeding")
#Replicate simulations 10 times using parLapply. This one does not work.
rep.sim3 <- function(min_sep, rep) {
return(
parLapply(cl, min_sep, function(min_sep) {
replicate(rep, sim3(min_sep))
})
)
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
# Error in checkForRemoteErrors(val) : 3 nodes produced errors; first error: could not find function "sim3"
#Replicate simulations 10 times using parLapply. This one does work but creates a list of the wrong length and therefore the boxplot cannot be plotted correctly.
rep.sim3 <- function(min_sep, rep) {
return(
parLapply(cl, replicate(rep, d), sim3))
}
d <- seq(0,50,25)
res <- rep.sim3(d,10)
Ideally I'd like to make the first parLapply work. Alternatively, I guess I could modify res from the parLapply that does work so that the list has length length(d) (i.e. 3) instead of 30. However, I can't seem to do that. Any help would be much appreciated!
Thanks in advance.
You need to initialize the cluster workers before executing rep.sim3. The error message indicates that your workers can't execute the sim3 function because you haven't exported it to them. Also, I noticed that you haven't loaded the RNetLogo package on the workers, either.
The easiest way to initialize the workers is with the clusterEvalQ and clusterExport functions:
clusterEvalQ(cl, library(RNetLogo))
clusterExport(cl, 'sim3')
Note that you shouldn't do this in your rep.sim3 function, since that would be inefficient and unnecessary. Do it just once after creating the cluster object and sim3 has been defined.
This initialization is necessary because the workers started via makeCluster don't know anything about your variables or functions, or anything else about your R session. And parLapply doesn't analyze the function that you pass to it any more than lapply does. The difference is that lapply executes in your local R session where sim3 is defined and the RNetLogo package is loaded. parLapply executes the specified function in remote R sessions that have not been initialized by executing your R script.
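To illustrate the initialization order only, here is a minimal self-contained sketch that uses a toy stand-in for sim3 (the real code additionally needs a NetLogo session started on every worker, which is omitted here):
library(parallel)

sim3_toy <- function(min_sep) sum(rnorm(720, mean = min_sep))   # toy stand-in for sim3

cl <- makeCluster(detectCores())
# one-time worker initialization, done after the cluster and the function exist;
# for the real code this would be clusterEvalQ(cl, library(RNetLogo)) and clusterExport(cl, 'sim3')
clusterExport(cl, 'sim3_toy')

rep.sim3 <- function(min_sep, rep) {
  parLapply(cl, min_sep, function(m) replicate(rep, sim3_toy(m)))
}

d <- seq(0, 50, 25)
res <- rep.sim3(d, 10)   # a list of length 3, each element holding 10 replicates
stopCluster(cl)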
Is there a problem with accessing/writing to a global variable when using the doSNOW package on multiple cores?
In the program below, each call MyCalculations(ii) writes to the ii-th column of the matrix "globalVariable"...
Do you think the result will be correct? Will there be hidden catches?
Thanks a lot!
P.S. I have to write out to a global variable because this is a simplified example; in fact, I have lots of outputs that need to be transported out of the parallel loops, so writing to global variables seemed like the only way...
library(doSNOW)

MaxSearchSpace <- 44 * 5
globalVariable <- matrix(0, 10000, MaxSearchSpace)

cl <- makeCluster(7)
registerDoSNOW(cl)

foreach (ii = 2:MaxSearchSpace, .combine = cbind, .verbose = FALSE) %dopar%
{
  MyCalculations(ii)
}

stopCluster(cl)
P.S. To be clear, I am asking whether, within the doSNOW framework, there is any danger in accessing/writing global variables... thanks.
Since this question is a couple months old, I hope you've found an answer by now. However, in case you're still interested in feedback, here's something to consider:
When using foreach with a parallel backend, you won't be able to assign to variables in R's global environment in the way you're attempting (you probably noticed this). Using a sequential backend, assignment will work, but not using a parallel one like with doSNOW.
Instead, save all the results of your calculations for each iteration in a list and return this to an object, so that you can extract the appropriate results after all calculations have been completed.
My suggestion starts similarly to your example:
library(doSNOW)
MaxSearchSpace <- 44*5
cl <- makeCluster(parallel::detectCores())
# do not create the globalVariable object
registerDoSNOW(cl)
# Save the results of the `foreach` iterations as
# lists of lists in an object (`theRes`)
theRes <- foreach (ii = 2:MaxSearchSpace, .verbose=F) %dopar%
{
# do some calculations
theNorms <- rnorm(10000)
thePois <- rpois(10000, 2)
# store the results in a list
list(theNorms, thePois)
}
After all iterations have been completed, extract the results from theRes and store them as objects (e.g., globalVariable1, globalVariable2, etc.):
globalVariable1 <- do.call(cbind, lapply(theRes, "[[", 1))
globalVariable2 <- do.call(cbind, lapply(theRes, "[[", 2))
With this in mind, if you are performing calculations with each iteration that are dependent on the results of calculations from previous iterations, then this type of parallel computing is not the approach to take.