I would like to parallelize a portion of a package I am working on. Which packages and what syntax should I use to make the package flexible and usable on different architectures? My problem sits in a single sapply() call as shown in this mock code:
.heavyStuff <- function(x) {
# do a lot of work
Sys.sleep(1)
}
listOfX <- 1:20
userFunc1 <- function(listOfX) {
res <- sapply(listOfX, .heavyStuff)
return(res)
}
Based on different guides, I have concocted the following:
userFunc2 <- function(listOfX, dopar.arg=2) {
if(requireNamespace("doParallel")) {
doParallel::registerDoParallel(dopar.arg)
res <- foreach(i=1:length(listOfX)) %dopar% {
.heavyStuff(listOfX[[i]])
}
names(res) <- names(listOfX)
} else {
res <- sapply(listOfX, .heavyStuff)
}
return(res)
}
Questions:
Can I safely use such a code in a package? Will it work well on a range of platforms?
Is there a way to avoid the foreach() construct? I'd much prefer to use a sapply- or lapply-like function. However, the constructs in the parallel library appear to be much more platform specific.
The above code doesn't work if dopar.arg==NULL, even though the introduction to doParallel says that without any arguments "you will get three workers and on Unix-like systems
you will get a number of workers equal to approximately half the number of cores on your system."
As the author of the future framework, I suggest that you have a look at the future.apply package, e.g.
library(future.apply)
userFunc2 <- function(listOfX) {
res <- future_sapply(listOfX, .heavyStuff)
return(res)
}
The default is that things run sequentially, but if the user wishes, they can use whatever parallel future backend they'd like, e.g.
library(future)
plan(multisession) # parallel on local machine - all cores by default
library(future.batchtools)
plan(batchtools_sge) # parallel on an SGE compute cluster
library(future)
plan(sequential) # sequentially
The design pattern is that you decide what to parallelize, whereas the user decides how to parallelize.
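Putting the two pieces together, here is a minimal, self-contained sketch of how an end user might drive such a package (reusing the .heavyStuff mock function from the question):
library(future.apply)
.heavyStuff <- function(x) {
  Sys.sleep(1)   # stand-in for real work
  x
}
userFunc2 <- function(listOfX) {
  future_sapply(listOfX, .heavyStuff)
}
# The user, not the package, chooses the backend:
library(future)
plan(multisession)   # run on all local cores
res <- userFunc2(1:20)
plan(sequential)     # back to sequential execution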
Related
I have an R API that makes use of 5 different R files that define different metrics that I use. Each of those files has a number of tasks that I run using the parallel package since they all use the same data, but with different groupings. To avoid having to create and close the clusters in each file, I took out those commands and put them into a cluster.R file. So the structure I have is basically:
cluster.R —
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, {
  library(dplyr)
  source('helpers.R')
})
.Last <- function() {
  stopCluster(cl)
}
Metric1.R —
metric1.function <- function(x, y, z) {
  # ... dplyr transformations ...
}
some_date <- date_from_api_input
tasks <- list(job1 = function() {metric1.function(data, grouping1, some_date)},
              job2 = function() {metric1.function(data, grouping2, some_date)},
              job3 = function() {metric1.function(data, grouping3, some_date)}
)
clusterExport(cl, c('data', 'metric1.function', 'some_date'), envir = environment())
out <- clusterApplyLB(
  cl,
  tasks,
  function(f) f()
)
bind_rows(out)
This API just creates different metrics that then fill a database table that holds them all. So each metric file contains different functions and inputs but outputs the same columns and groupings.
Metrics 2-5 are all the same, except that the custom function is different for each file and defined at the beginning of each file. The problem I'm having is that all metrics are also run in parallel, and I'm having issues working with the environments. What ends up happening is that a job will say that some_date isn't found, or that metric2.function isn't found in metric5.R.
I use plumber to expose R and each time it starts, it sources the cluster.R file, starts up the clusters with their initializations, and listens for any requests that come in.
When running in series, it works just fine for testing and everything passes as expected, but in production, when our server runs all the scripts in parallel, the variables and functions I've exported via clusterExport either don't get passed in or get mixed up.
Should I be structuring it in a different fashion or am I using the parallel package incorrectly for my purpose?
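To make the setup above easier to follow, here is a compact, self-contained sketch of the same pattern with toy data (the data frame, grouping and metric function are placeholders, not the real ones):
library(parallel)
library(dplyr)
cl <- makeCluster(2)
clusterEvalQ(cl, library(dplyr))
data <- data.frame(g = rep(c("a", "b"), each = 5), value = 1:10)
some_date <- Sys.Date()
metric1.function <- function(df, grouping, date) {
  out <- df %>%
    group_by(across(all_of(grouping))) %>%
    summarise(metric = mean(value), .groups = "drop")
  out$as_of <- date
  out
}
tasks <- list(job1 = function() metric1.function(data, "g", some_date),
              job2 = function() metric1.function(data, "g", some_date))
clusterExport(cl, c("data", "metric1.function", "some_date"))
out <- clusterApplyLB(cl, tasks, function(f) f())
bind_rows(out)
stopCluster(cl)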
How do I get parallelization of code to work in R on Windows? Include a simple example. Posting this self-answered question because this was rather unpleasant to get working. You'll find that package parallel does NOT work on its own, but package snow works very well.
Posting this because this took me bloody forever to figure out. Here's a simple example of parallelization in R that will let you test whether things are working right for you and get you on the right path.
library(snow)
z <- 1:4
system.time(lapply(z, function(x) Sys.sleep(1)))
cl <- makeCluster(### YOUR NUMBER OF CORES GOES HERE ###, type="SOCK")
system.time(clusterApply(cl, z, function(x) Sys.sleep(1)))
stopCluster(cl)
You should also use the doSNOW library to register foreach with the snow cluster; this will cause many packages to parallelize automatically. The command to register is registerDoSNOW(cl) (with cl being the return value from makeCluster()), and the command that undoes registration is registerDoSEQ(). Don't forget to turn off your clusters.
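For example, a minimal sketch of registering a snow cluster with foreach via doSNOW (assuming 2 cores here; adjust as needed):
library(snow)
library(doSNOW)
library(foreach)
cl <- makeCluster(2, type="SOCK")
registerDoSNOW(cl)                # foreach's %dopar% now uses this cluster
res <- foreach(x=1:4) %dopar% {   # each task sleeps on a worker
  Sys.sleep(1)
  x^2
}
registerDoSEQ()                   # undo the registration
stopCluster(cl)                   # don't forget to turn off the cluster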
This worked for me; I used the doParallel package, and it required just 3 lines of code:
# process in parallel
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)
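# ... the parallel work goes here (e.g. the random forest fit mentioned below),
# executed via foreach %dopar% or a package that uses the registered backend ...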
# turn parallel processing off and run sequentially again:
registerDoSEQ()
Calculation of a random forest decreased from 180 secs to 120 secs (on a Windows computer with 4 cores).
Based on the information here, I was able to convert the following code into a parallelised version that worked under RStudio on Windows 7.
Original code:
#
# Basic elbow plot function
#
wssplot <- function(data, nc=20, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i, iter.max=30)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of clusters",
       ylab="Within groups sum of squares")
}
Parallelised code:
library("parallel")
workerFunc <- function(nc) {
set.seed(1234)
return(sum(kmeans(my_data_frame, centers=nc, iter.max=30)$withinss)) }
num_cores <- detectCores()
cl <- makeCluster(num_cores)
clusterExport(cl, varlist=c("my_data_frame"))
values <- 1:20 # this represents the "nc" variable in the wssplot function
system.time(
result <- parLapply(cl, values, workerFunc) ) # paralel execution, with time wrapper
stopCluster(cl)
plot(values, unlist(result), type="b", xlab="Number of clusters", ylab="Within groups sum of squares")
Not suggesting it's perfect or even best, just a beginner demonstrating that parallel does seem to work under Windows. Hope it helps.
I think these libraries will help you:
foreach (facilitates executing the loop in parallel)
doSNOW (I think you already use it)
doMC (a foreach backend built on the multicore, i.e. forking, functionality of the parallel package; forking is not available on Windows)
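To see how these fit together, here is a minimal sketch using doParallel as the registered backend (doSNOW or doMC can be registered the same way):
library(foreach)
library(doParallel)
cl <- makeCluster(2)        # or detectCores() - 1
registerDoParallel(cl)      # register the backend so %dopar% runs in parallel
res <- foreach(i=1:4, .combine=c) %dopar% {
  Sys.sleep(1)              # stand-in for real work
  i^2
}
stopCluster(cl)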
These articles may also help you:
http://vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/
http://www.joyofdata.de/blog/parallel-computing-r-windows-using-dosnow-foreach/
I'm posting a cross-platform answer here because all the other answers I found were overcomplicated for what I needed to accomplish. I'm using an example where I read in all sheets of an Excel workbook.
# read in the spreadsheet
parallel_read <- function(file){
  # detect available cores and use 70%
  numCores <- round(parallel::detectCores() * .70)
  # check if OS is Windows and use parLapply
  if(.Platform$OS.type == "windows") {
    cl <- parallel::makePSOCKcluster(numCores)
    dfs <- parallel::parLapply(cl, readxl::excel_sheets(file),
                               readxl::read_excel, path = file)
    parallel::stopCluster(cl)
    return(dfs)
  # if not Windows, use mclapply
  } else {
    dfs <- parallel::mclapply(readxl::excel_sheets(file),
                              readxl::read_excel,
                              path = file,
                              mc.cores = numCores)
    return(dfs)
  }
}
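A quick usage sketch (the workbook name is just a placeholder):
# returns a list of data frames, one element per sheet
sheets <- parallel_read("my_workbook.xlsx")
names(sheets) <- readxl::excel_sheets("my_workbook.xlsx")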
For what it is worth: I was running into the same problem but couldn't get any of these to work. I eventually learned that RStudio has a 'Jobs' pane and can run models in the background, each on its own core. So what I did was divvy up my model into 10 segments (it was iterative over 100 vectors, so 10 scripts of 10 vectors each) and run each as a separate job. That way, when one finished, I could use its output immediately, and I could keep working on my script without waiting for each model to finish. Here is the link all about using jobs: https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/
I'm testing the doRedis package by running a worker on one machine and the master/server on another. The code on my master looks like this:
#Register ...
r <- foreach(a=1:numreps, .export=c(...)) %dopar% {
train <- func1(..)
best <- func2(...)
weights <- func3(...)
return ...
}
In every function, a global variable is accessed, but not modified. I've exported the global variable in the .export portion of the foreach loop, but whenever I run the code, an error occurs stating that the variable was not found. Interestingly, the code works when all my workers are on one machine, but crashes when I have an "outside" worker. Any ideas why this error is occurring, and how to correct it?
Thanks!
UPDATE: I have a gist of some code here: https://gist.github.com/liangricha/fbf29094474b67333c3b
UPDATE 2: I asked another doRedis-related question: "Would it be possible to allow each worker machine to utilize all of its cores?"
@Steve Weston responded: "Starting one redis worker per core will often fully utilize a machine."
This kind of code was a problem for the doParallel, doSNOW, and doMPI packages in the past, but they were improved in the last year or so to handle it better. The problem is that variables are exported to a special "export" environment, not to the global environment. That is preferable in various ways, but it means that the backend has to do more work so that the exported variables are in the scope of the exported functions. It looks like doRedis hasn't been updated to use these improvements.
Here is a simple example that illustrates the problem:
library(doRedis)
registerDoRedis('jobs')
startLocalWorkers(3, 'jobs')
glob <- 6
f1 <- function() {
glob
}
f2 <- function() {
foreach(1:3, .export=c('f1', 'glob')) %dopar% {
f1()
}
}
f2() # fails with the error: "object 'glob' not found"
If the doParallel backend is used, it succeeds:
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
f2() # works with doParallel
One workaround is to define the function "f1" inside function "f2":
f2 <- function() {
f1 <- function() {
glob
}
foreach(1:3, .export=c('glob')) %dopar% {
f1()
}
}
f2() # works with doParallel and doRedis
Another solution is to use some mechanism to export the variables to the global environment of each of the workers. With doParallel or doSNOW, you could do that with the clusterExport function, but I'm not sure how to do that with doRedis.
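For instance, here is a rough sketch of that workaround with a doSNOW backend, reusing the glob/f1 example from above; .noexport keeps foreach from re-exporting the copies that clusterExport has already placed in each worker's global environment:
library(doSNOW)
cl <- makeCluster(3)
registerDoSNOW(cl)
glob <- 6
f1 <- function() glob
clusterExport(cl, c("glob", "f1"))   # push both into each worker's global environment
res <- foreach(i=1:3, .noexport=c("glob", "f1")) %dopar% f1()
res
stopCluster(cl)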
I'll report this issue to the author of the doRedis package and suggest that he update doRedis to handle exported functions like doParallel.
I've run into a strange error.
Suppose I have 10 xts objects in a list called data. I now build every combination of three of them using
data_names <- names(data)
combs <- combn(data_names, 3)
My basic goal is to do a PCA on each of those 120 triples.
To speed things up I wanted to use the doParallel package. So here is the snippet, shortened to the point where the error occurs:
list <- foreach(i = 1:ncol(combs)) %dopar% {
  tmp_triple <- combs[, i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all = FALSE)
}
Here, the merge function seems to be the problem. The error is
task 1 failed - "cannot coerce class 'c("xts", "zoo")' into a data.frame"
However, when changing %dopar% to a normal serial %do%, everything works as expected.
So far I have not been able to find any solution to this problem, and I'm not even sure what to look for.
A better solution than explicitly loading the libraries within the loop is to use the .packages argument of the foreach() function:
list <- foreach(i = 1:ncol(combs), .packages = c("xts", "zoo")) %dopar% {
  tmp_triple <- combs[, i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all = FALSE)
}
The problem is likely that you haven't called library(xts) on each of the workers. You don't say what backend you're using, so I can't be 100% sure.
If that's the problem, then this code will fix it:
list <- foreach(i = 1:ncol(combs)) %dopar% {
  library(xts)
  tmp_triple <- combs[, i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all = FALSE)
}
A quick fix for problems with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
These packages are responsible for parallelism in R. A bug which existed in old versions of these packages has since been removed. It worked in my case.
I'm trying to use the parallel package in R for parallel operations rather than doSNOW since it's built-in and ostensibly the way the R Project wants things to go. I'm doing something wrong that I can't pin down though. Take for example this:
library(plyr)
a <- rnorm(50)
b <- rnorm(50)
arr <- matrix(cbind(a, b), nrow = 50)
aaply(arr, .margins = 1, function(x){ x[1] + x[2] }, .parallel = FALSE)
This works just fine, producing the sums of my two columns. But if I try to bring in the parallel package:
library(parallel)
nodes <- detectCores()
cl <- makeCluster(nodes)
setDefaultCluster(cl)
aaply(arr, .margins = 1, function(x){ x[1] + x[2] }, .parallel = TRUE)
It throws the warnings
2: In setup_parallel() : No parallel backend registered
3: executing %dopar% sequentially: no parallel backend registered
Am I initializing the backend wrong?
Try this setup:
library(doParallel)
library(plyr)
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
aaply(ozone, 1, mean, .parallel = TRUE)
stopCluster(cl)
Since I have never used plyr for parallel computing, I have no idea why this issues warnings. The result is correct anyway.
The documentation for aaply states
.parallel: if ‘TRUE’, apply function in parallel, using parallel
backend provided by foreach
so presumably you need to register a foreach backend (e.g. with doParallel, as in the answer above) rather than use the parallel package on its own.