How do I parallelize in R on Windows - example?

How do I get parallelization of code to work in R on Windows? Include a simple example. Posting this self-answered question because this was rather unpleasant to get working. You'll find that package parallel does NOT work on its own, but package snow works very well.

Posting this because this took me bloody forever to figure out. Here's a simple example of parallelization in R that will let you test whether things are working right for you and get you on the right path.
library(snow)

z <- 1:4

# sequential baseline: about 4 seconds
system.time(lapply(z, function(x) Sys.sleep(1)))

# parallel version: about 1 second on 4 cores
cl <- makeCluster(4, type = "SOCK")  # replace 4 with your number of cores
system.time(clusterApply(cl, z, function(x) Sys.sleep(1)))
stopCluster(cl)
You should also use library doSNOW to register foreach to the snow cluster; this will cause many packages to parallelize automatically. The command to register is registerDoSNOW(cl) (with cl being the return value from makeCluster()), and the command that undoes the registration is registerDoSEQ(). Don't forget to turn off your clusters.
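For example, here is a minimal sketch of that registration flow (the worker count of 4 is just a placeholder, not from the original answer):
library(snow)
library(doSNOW)
library(foreach)

cl <- makeCluster(4, type = "SOCK")  # replace 4 with your core count
registerDoSNOW(cl)                   # foreach's %dopar% now runs on the snow cluster

res <- foreach(i = 1:4) %dopar% sqrt(i)

registerDoSEQ()                      # undo the registration
stopCluster(cl)                      # and shut the cluster down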

This worked for me; I used package doParallel, and it required 3 lines of code:
# process in parallel
library(doParallel)
cl <- makeCluster(detectCores(), type = "PSOCK")
registerDoParallel(cl)

# turn parallel processing off and run sequentially again:
registerDoSEQ()
stopCluster(cl)
Calculation of a random forest decreased from 180 secs to 120 secs (on a Windows computer with 4 cores).
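As an illustration only (not the poster's actual code), a common sketch for parallelizing a random forest with foreach once doParallel is registered, assuming the randomForest package and the built-in iris data:
library(doParallel)
library(foreach)
library(randomForest)

cl <- makeCluster(detectCores(), type = "PSOCK")
registerDoParallel(cl)

# grow 500 trees as 4 chunks of 125 in parallel and combine the forests
rf <- foreach(ntree = rep(125, 4),
              .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}

stopCluster(cl)
registerDoSEQ()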

Based on the information here I was able to convert the following code into a parallelised version that worked under RStudio on Windows 7.
Original code:
#
# Basic elbow plot function
#
wssplot <- function(data, nc = 20, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i, iter.max = 30)$withinss)
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of clusters",
       ylab = "Within groups sum of squares")
}
Parallelised code:
library(parallel)

workerFunc <- function(nc) {
  set.seed(1234)
  sum(kmeans(my_data_frame, centers = nc, iter.max = 30)$withinss)
}

num_cores <- detectCores()
cl <- makeCluster(num_cores)

# make the data available on each worker
clusterExport(cl, varlist = c("my_data_frame"))

values <- 1:20  # this represents the "nc" variable in the wssplot function

# parallel execution, with a timing wrapper
system.time(
  result <- parLapply(cl, values, workerFunc)
)

stopCluster(cl)

plot(values, unlist(result), type = "b",
     xlab = "Number of clusters",
     ylab = "Within groups sum of squares")
Not suggesting it's perfect or even best, just a beginner demonstrating that parallel does seem to work under Windows. Hope it helps.

I think these libraries will help you:
foreach (facilitates executing the loop in parallel)
doSNOW (I think you already use it)
doMC (multicore functionality of the parallel package; note that forking is not available on Windows, so doMC effectively runs on a single core there)
These articles may also help you:
http://vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/
http://www.joyofdata.de/blog/parallel-computing-r-windows-using-dosnow-foreach/

I'm posting a cross-platform answer here because all the other answers I found were over-complicated for what I needed to accomplish. I'm using an example where I read in all sheets of an Excel workbook.
# read in all sheets of the spreadsheet
parallel_read <- function(file) {
  # detect available cores and use 70% of them
  numCores <- round(parallel::detectCores() * 0.70)
  # check if the OS is Windows and use parLapply
  if (.Platform$OS.type == "windows") {
    cl <- parallel::makePSOCKcluster(numCores)
    dfs <- parallel::parLapply(cl,
                               readxl::excel_sheets(file),
                               readxl::read_excel,
                               path = file)
    parallel::stopCluster(cl)
    return(dfs)
  # if not Windows, use mclapply
  } else {
    dfs <- parallel::mclapply(readxl::excel_sheets(file),
                              readxl::read_excel,
                              path = file,
                              mc.cores = numCores)
    return(dfs)
  }
}
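A hypothetical call, assuming a workbook named workbook.xlsx exists in the working directory (the file name is a placeholder, not from the original answer):
dfs <- parallel_read("workbook.xlsx")
str(dfs)  # a list with one data frame (tibble) per sheet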

For what it is worth: I was running into the same problem but couldn't get any of these to work. I eventually learned that RStudio has a 'Jobs' pane and can run models in the background, each on its own core. So what I did was divvy up my model into 10 segments (it was iterative over 100 vectors, so 10 scripts of 10 vectors each) and run each as a separate job. That way, when one finished I could use its output immediately, and I could keep working on my script without waiting for each model to finish. Here is the link all about using jobs: https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/
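As an illustration only (not from the original answer), background jobs can also be launched programmatically via the rstudioapi package; the script names below are hypothetical placeholders for the 10 segments:
library(rstudioapi)

# one script per model segment (hypothetical file names)
scripts <- sprintf("model_part_%02d.R", 1:10)
for (s in scripts) {
  jobRunScript(s, importEnv = TRUE)  # each job runs in its own background R process
}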

Related

RStudio refuses to employ more cores [not about code]

I would like to do a multithreaded calculation in R. Below is my code. It examines the "Quality" column of NGS data and remembers the rows that contain a comma (",").
It worked fine but didn't save any time. Now I have figured out the reason: multiple threads (11 of them) were successfully created, but only 1 actually did any work. See Update 1. I also tried doMC in place of doParallel; the example code worked, but not when I inserted my own code into that shell. See Update 2.
Forgive me for introducing Update 3: there are occasions when the program runs as planned on a 4-thread computer under Windows, but it's not consistent.
library(microseq)
library(foreach)
library(doParallel)
library(stringr)

cn <- detectCores() - 1  ## i.e. 11
read1 <- readFastq("pairIGK.fastq")

Cluster <- makeCluster(cn, type = "SOCK", methods = FALSE)
registerDoParallel(Cluster)

del_list <- foreach(i = 1:nrow(read1), .inorder = FALSE, .packages = c("stringr")) %dopar% {
  if (str_count(read1$Quality[i], ",") != 0) i
}

stopCluster(Cluster)

del_list <- unlist(del_list)
read1 <- read1[c(-(del_list)), ]
Update 1. I tested the code on the same 12-thread computer, but under Linux. I found that basically only one core was working for R. I saw 10 or 11 rsession processes in the system monitor, but they were not doing any work at all.
Update 2. I found a slide deck from Microsoft about multi-threaded calculation with R:
https://blog.revolutionanalytics.com/downloads/Speed%20up%20R%20with%20Parallel%20Programming%20in%20the%20Cloud%20%28useR%20July2018%29.pdf
It highlights the package doMC with an example that calculates the likelihood of two classmates sharing a birthday for varying class sizes. I tested the provided code as-is, and it did use all cores. The code goes:
pbirthdaysim <- function(n) {
  ntests <- 100000
  pop <- 1:365
  anydup <- function(i)
    any(duplicated(sample(pop, n, replace = TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}

library(doMC)
registerDoMC(11)
bdayp <- foreach(n = 1:100) %dopar% pbirthdaysim(n)
It took ~20 s to finish on my 12-thread machine, which agrees with the slide.
However, when I inserted my own function into this shell, the same thing happened: multiple threads got created, but only one was actually doing any work. My code goes like this:
library(microseq)
library(foreach)
library(doParallel)
library(stringr)
library(doMC)

cn <- detectCores() - 1
read1 <- readFastq("pairIGK.fastq")

registerDoMC(cn)

del_list <- foreach(i = 1:nrow(read1), .inorder = FALSE, .packages = c("stringr")) %dopar% {
  if (str_count(read1$Quality[i], ",") != 0) i
}

del_list <- unlist(del_list)
read1 <- read1[c(-(del_list)), ]
Update 3.
I'm really confused and will go on investigating.

Dynamic library dependencies not recognized when run in parallel under R foreach()

I'm using the Rfast package, which imports the package RcppZiggurat. I'm running R 3.6.3 on a Linux cluster (Red Hat 6.1). The packages are installed on my local directory but R is installed system-wide.
The Rfast functions (e.g. colsums()) work well when I call them directly. But when I call them in a foreach() loop like the following (EDIT: I added the code to register the cluster, as pointed out by Rui Barradas, but it didn't fix the problem),
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4, .packages = 'Rfast') %dopar% colmeans(A)
stopCluster(cl)
then I get an error:
unable to load shared object '/home/users/sutd/R/x86_64-pc-linux-gnu-library/3.6/RcppZiggurat/libs/RcppZiggurat.so':
libgsl.so.0: cannot open shared object file: No such file or directory
Somehow, the dynamic library is recognized when called directly but not when called under foreach().
I know that libgsl.so is located in /usr/lib64/, so I've added the following line at the beginning of my R script
Sys.setenv(LD_LIBRARY_PATH=paste("/usr/lib64/", Sys.getenv("LD_LIBRARY_PATH"), sep = ":"))
But it didn't work.
I have also tried to do dyn.load('/usr/lib64/libgsl.so') but I get the following error:
Error in dyn.load("/usr/lib64/libgsl.so") : unable to load shared object '/usr/lib64/libgsl.so':
/usr/lib64/libgsl.so: undefined symbol: cblas_ctrmv
How do I make the dependencies available in the foreach() parallel loops?
NOTE
In the actual use case I am using the genetic algorithm package GA, and have GA::ga() which handles the foreach() loop, and within the loop I use a function in my own package which calls the Rfast functions. So I'm hoping that there is a solution where I don't have to modify the foreach() call.
The following works with no problems. Unlike the code in the question, it starts by detecting the number of available cores, creates a cluster, and makes it available to foreach.
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
set.seed(2020)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4,
              .combine = rbind,
              .packages = "Rfast") %dopar% {
  colmeans(A)
}
stopCluster(cl)
str(cm)
#num [1:4, 1:1000] -0.02668 -0.02668 -0.02668 -0.02668 0.00172 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:4] "result.1" "result.2" "result.3" "result.4"
# ..$ : NULL
The foreach package was great for its time. But now, parallel computations should be done with future, if only for the static code analysis that handles exporting the right objects to the workers. As a result, under the future approach, registering a package with .packages= isn't needed. Moreover, future mirrors usual R code, with only a slight change: the output variable is set up as a listenv. For example, we have:
library("future")
library("listenv")
library("Rfast")

plan(tweak(multiprocess, workers = 2L))
# For all cores, directly use:
# plan(multiprocess)

# Generate matrix once
A <- matrix(rnorm(1e6), 1000, 1000)

# Setup output
x <- listenv()

# Iterate 4 times
for (i in 1:4) {
  # On each core, compute the colmeans()
  x[[i]] %<-% {
    colmeans(A)
    # For better control over function applies, use a namespace call
    # e.g. Rfast::colmeans(A)
  }
}

# Switch from listenv to list
output <- as.list(x)
Thanks to the answers by @RuiBarradas and @coatless, I realize that the problem is not with foreach(), because (1) the problem occurred when I ran the code with future too, and (2) it occurred with the foreach() code even with the wrong call, when I didn't register the cluster. When no cluster is registered, foreach() throws a warning and runs in sequential mode instead. But that didn't happen.
Therefore, I realize that the problem must have occurred even before the foreach() call. In the log, it appeared right after the message Loading package RcppZiggurat. Something must have gone wrong when this package was loaded.
I then checked the dependencies of RcppZiggurat and found that it depends on another package called RcppGSL, which interfaces R and the GSL library. Bingo: that's where libgsl.so.0 is needed when RcppZiggurat is loaded.
So I made an R script named test-gsl.R, which has the following two lines:
library(RcppZiggurat)
print("OK")
Now, I run the following on the head node:
$ module load R/3.6.3
$ Rscript test-gsl.R
And everything works fine. The "OK" is printed.
But this doesn't work if I submit the job to the compute node. First, the PBS script, called test.sh, is as follows:
### Resources request
#PBS -l select=1:ncpus=1:mem=1GB
### Walltime
#PBS -l walltime=00:01:00
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
### Run R
module load R/3.6.3
Rscript test-gsl.R
Then I ran
qsub test.sh
And the error popped out. This means that there is something different between the compute node and the head node on my system, and it has nothing to do with the packages. I contacted the system administrator, who explained that the GSL library is available on the head node at the default path, but not on the compute node. So in my shell script, I need to add module load gsl/2.1 before running my R script. I tested that and everything worked.
The solution seems simple enough, but I know too little about Linux administration to have realized it. Only after asking around and trying (rather blindly) many things did I finally come to this solution. So thanks to those who've offered help, and mea culpa for not being able to describe the problem accurately at the beginning.

Exposing warnings from nodes using snow

I have an lapply operation that I've parallelised using snow. This works fine except that any warnings generated seem to just get ignored and are hence never shown to the user. Is there a way of exposing warnings on individual nodes so they come through in the main R process?
My best idea at the moment is to have all nodes write their warnings to files, and read those at the end, but there must be a better way!
Here's a reprex:
library(snow)
f <- function(x) {
  warning("mywarning")
  return(NULL)
}

cl <- makeCluster(2, type = "SOCK")
lapply(1:2, f)            # Gives me warnings, as desired
clusterApply(cl, 1:2, f)  # Gives me the same output, faster, but with no warnings
stopCluster(cl)
In the end I switched from snow to the future.apply package (in conjunction with parallel). future.apply now has this behaviour by default.
Unfortunately, in most cases the messages/warnings don't appear until the whole run has finished, but that's a whole new issue.
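For reference, a sketch of the same reprex with future.apply (multisession is assumed here as the Windows-friendly backend); the warnings raised on the workers are relayed to the main session:
library(future.apply)

plan(multisession, workers = 2)

f <- function(x) {
  warning("mywarning")
  NULL
}

res <- future_lapply(1:2, f)  # both "mywarning" warnings are shown in the main session

plan(sequential)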

How to parallelize a function for a package in R

I would like to parallelize a portion of a package I am working on. Which packages and what syntax should I use to make the package flexible and usable on different architectures? My problem sits in a single sapply() call as shown in this mock code:
.heavyStuff <- function(x) {
  # do a lot of work
  Sys.sleep(1)
}

listOfX <- 1:20

userFunc1 <- function(listOfX) {
  res <- sapply(listOfX, .heavyStuff)
  return(res)
}
Based on different guides, I have concocted the following:
userFunc2 <- function(listOfX, dopar.arg = 2) {
  if (requireNamespace("doParallel")) {
    doParallel::registerDoParallel(dopar.arg)
    res <- foreach(i = 1:length(listOfX)) %dopar% {
      .heavyStuff(listOfX[[i]])
    }
    names(res) <- names(listOfX)
  } else {
    res <- sapply(listOfX, .heavyStuff)
  }
  return(res)
}
Questions:
Can I safely use such code in a package? Will it work well on a range of platforms?
Is there a way to avoid the foreach() construct? I'd much prefer to use a sapply- or lapply-like function. However, the constructs in the parallel library appear to be much more platform-specific.
The above code doesn't work if dopar.arg == NULL, even though the introduction to doParallel says that without any arguments "you will get three workers and on Unix-like systems you will get a number of workers equal to approximately half the number of cores on your system."
As the author of the future framework, I suggest that you have a look at the future.apply package, e.g.
library(future.apply)

userFunc2 <- function(listOfX) {
  res <- future_sapply(listOfX, .heavyStuff)
  return(res)
}
The default is that things run sequentially, but if the user wishes, they can use whatever parallel future backend they'd like, e.g.
library(future)
plan(multiprocess) # parallel on local machine - all cores by default
library(future.batchtools)
plan(batchtools_sge) # parallel on an SGE compute cluster
library(future)
plan(sequential) # sequentially
The design pattern is that you decide what to parallelize, whereas the user decides how to parallelize.
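For instance, using the definitions above, the user-side code might look like the following sketch (multisession is assumed here as the cross-platform backend; the plan(multiprocess) shown earlier works the same way):
library(future)

plan(multisession)      # the user chooses how to parallelize
res <- userFunc2(1:20)  # the package code itself is unchanged
plan(sequential)        # back to sequential execution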

Why doesn't the plyr package use my parallel backend?

I'm trying to use the parallel package in R for parallel operations rather than doSNOW since it's built-in and ostensibly the way the R Project wants things to go. I'm doing something wrong that I can't pin down though. Take for example this:
library(plyr)

a <- rnorm(50)
b <- rnorm(50)
arr <- matrix(cbind(a, b), nrow = 50)
aaply(arr, .margins = 1, function(x) { x[1] + x[2] }, .parallel = FALSE)
This works just fine, producing the sums of my two columns. But if I try to bring in the parallel package:
library(parallel)
nodes <- detectCores()
cl <- makeCluster(nodes)
setDefaultCluster(cl)
aaply(arr, .margins = 1, function(x) { x[1] + x[2] }, .parallel = TRUE)
It throws the error
2: In setup_parallel() : No parallel backend registered
3: executing %dopar% sequentially: no parallel backend registered
Am I initializing the backend wrong?
Try this setup:
library(doParallel)
library(plyr)

nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)

aaply(ozone, 1, mean, .parallel = TRUE)

stopCluster(cl)
Since I have never used plyr for parallel computing I have no idea why this issues warnings. The result is correct anyway.
The documentation for aaply states:
.parallel: if 'TRUE', apply function in parallel, using parallel backend provided by foreach
so presumably you need to use the foreach package rather than the parallel package.
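For completeness, a sketch applying that advice to the matrix example from the question, registering a doParallel backend for foreach before calling aaply():
library(plyr)
library(doParallel)

a <- rnorm(50)
b <- rnorm(50)
arr <- matrix(cbind(a, b), nrow = 50)

cl <- makeCluster(detectCores())
registerDoParallel(cl)  # registers the foreach backend that .parallel = TRUE uses

aaply(arr, .margins = 1, function(x) x[1] + x[2], .parallel = TRUE)

stopCluster(cl)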
