MPI cluster based parallel calculation in R on WestGrid (pbs file) - r

I am now dealing with a large dataset and I want to use parallel computation to speed up the processing. WestGrid is a Canadian computing system which has clusters with interconnect.
I use two packages doSNOW and parallel to do parallel jobs. My question is how I should write the pbs file. When I submit the job using qsub, an error occurs: mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
Here is the R script code:
install.packages("fume_1.0.tar.gz")
library(fume)
library(foreach)
library(doSNOW)
load("spei03_df.rdata",.GlobalEnv)
cl <- makeCluster(mpi.universe.size(), type='MPI' )
registerDoSNOW(cl)
MK_grid <-
foreach(i=1:6000, .packages="fume",.combine='rbind') %dopar% {
abc <- mkTrend(as.matrix(spei03_data)[i,])
data.frame(P_value=abc$`Corrected p.value`, Slope=abc$`Sen's Slope`*10,Zc=abc$Zc)
}
stopCluster(cl)
save(MK_grid,file="MK_grid.rdata")
mpi.exit()
The "fume" package was downloaded from https://cran.r-project.org/src/contrib/Archive/fume/ .
Here is the pbs file:
#!/bin/bash
#PBS -l nodes=2:ppn=12
#PBS -l walltime=2:00:00
module load application/R/3.3.1
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=1
mpirun -np 1 -hostfile $PBS_NODEFILE R CMD BATCH Trend.R
Can anyone help? Thanks a lot.

It's difficult to give advice on how to use a compute cluster that I've never used, since each cluster is set up somewhat differently, but I can give you some general advice that may help.
Your job script looks reasonable to me. It's very similar to what I use on one of our Torque/Moab clusters. It's a good idea to verify that you're able to load all of the necessary R packages interactively because sometimes additional module files may need to be loaded. If you need to install packages yourself, make sure you install them in the standard "personal library" which is called something like "~/R/x86_64-pc-linux-gnu-library/3.3". That often avoids errors loading packages in the R script when executing in parallel.
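One way to run that check from an interactive session on the cluster, as a minimal sketch: the helper name missing_packages is hypothetical, and the package list is simply the one used in the question's script plus Rmpi.

```r
# Hypothetical helper: report which packages cannot be loaded in this session.
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# Packages used by the script in the question, plus Rmpi.
needed <- c("Rmpi", "doSNOW", "foreach", "fume")
print(missing_packages(needed))

# The per-user library location and the library search path actually in use.
print(Sys.getenv("R_LIBS_USER"))
print(.libPaths())
```

If anything is reported as missing, installing it once by hand into the personal library (rather than from the job script) is probably the simplest fix.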
I have more to say about your R script:
You need to load the Rmpi package in your R script using library(Rmpi). It isn't automatically loaded when loading doSNOW, so you will get an error when calling mpi.universe.size().
I don't recommend installing R packages in the R script itself. That will fail if install.packages needs to prompt you for the CRAN repository, for example, since you can't execute interactive functions from an R script executed via mpirun.
I suggest starting mpi.universe.size() - 1 cluster workers when calling makeCluster. Since mpirun starts one worker, it may not be safe for makeCluster to spawn mpi.universe.size() additional workers, since that would result in a total of mpi.universe.size() + 1 MPI processes. That works on some clusters, but it fails on at least one of our clusters.
While debugging, try using the makeCluster outfile='' option. Depending on your MPI installation, that may let you see error messages that would otherwise be hidden.
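Put together, the suggestions above might look roughly like the following. This is a sketch, not a tested job: worker_count and run_mpi_job are hypothetical names, the loop body is a placeholder rather than the mkTrend computation, and the guard on PBS_NODEFILE is only an assumption about how to tell that the script is running inside a PBS job.

```r
# mpirun has already started this (master) process, so leave one slot for it.
worker_count <- function(universe_size) {
  max(1L, as.integer(universe_size) - 1L)
}

run_mpi_job <- function() {
  library(Rmpi)    # must be loaded explicitly; doSNOW does not load it
  library(doSNOW)  # attaches foreach and snow as well

  # outfile = "" may let worker error messages show up in the job log.
  cl <- makeCluster(worker_count(mpi.universe.size()), type = "MPI", outfile = "")
  registerDoSNOW(cl)

  # Placeholder loop; the real script would call mkTrend() here.
  res <- foreach(i = 1:10, .combine = "c") %dopar% sqrt(i)

  stopCluster(cl)
  res
}

# Assumption: only start the MPI cluster when running inside a PBS job.
if (nzchar(Sys.getenv("PBS_NODEFILE"))) {
  print(run_mpi_job())
  Rmpi::mpi.quit()
}
```

Note that there is no install.packages call in the script; the packages are expected to be installed beforehand.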

Related

How to release R's prompt when using 'system'?

I am writing R code on a Linux system using RStudio. At some point in the code, I need to use a system call to a command that will download a few thousand files listed in the lines of a text file:
down.command <- paste0("parallel --gnu -a links.txt wget")
system(down.command)
However, this command takes a while to run (a couple of hours), and the R prompt stays locked while the command runs. I would like to keep using R while the command runs in the background.
I tried to use nohup like this:
down.command <- paste0("nohup parallel --gnu -a links.txt wget > ~/down.log 2>&1")
system(down.command)
but the R prompt still gets "locked" waiting for the end of the command.
Is there any way to circumvent this? Is there a way to submit system commands from R and keep them running in the background?
Using ‘processx’, here’s how to create a new process that redirects both stdout and stderr to the same file:
args = c('--gnu', '-a', 'links.txt', 'wget')
p = processx::process$new('parallel', args, stdout = '~/down.log', stderr = '2>&1')
This launches the process and resumes the execution of the R script. You can then interact with the running process via the p name. Notably you can signal to it, you can query its status (e.g. is_alive()), and you can synchronously wait for its completion (optionally with a timeout after which to kill it):
p$wait()
result = p$get_exit_status()
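As a small illustration of those interactions, here is a sketch in which a short "sleep" command stands in for the long download. It assumes processx is installed and a Unix-like system; the log file location is a hypothetical temporary file.

```r
# Sketch only: "sleep 1" stands in for the long-running download command.
if (requireNamespace("processx", quietly = TRUE)) {
  log_file <- tempfile(fileext = ".log")  # hypothetical log location
  p <- processx::process$new("sleep", "1", stdout = log_file, stderr = "2>&1")

  # The prompt is free again; query the process instead of blocking on it.
  print(p$is_alive())

  # Wait at most 5000 ms, then kill the process if it is still running.
  p$wait(timeout = 5000)
  if (p$is_alive()) p$kill()

  exit_status <- p$get_exit_status()
  print(exit_status)
} else {
  exit_status <- NA_integer_  # processx not available; nothing was run
}
```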
Based on the comment by @KonradRudolph, I became aware of the processx R package that very smartly deals with system process submissions from within R.
All I had to do was:
library(processx)
down.command <- c("parallel","--gnu", "-a", "links.txt", "wget", ">", "~/down.log", "2>&1")
processx::process$new("nohup", down.command, cleanup=FALSE)
As simple as that, and very effective.

Several questions on running Rmpi and foreach on a HPC cluster

I am queueing and running an R script on a HPC cluster via sbatch and mpirun; the script is meant to use foreach in parallel. To do this I've used several useful questions & answers from StackOverflow: R Running foreach dopar loop on HPC MPIcluster, Single R script on multiple nodes, Slurm: Use cores from multiple nodes for R parallelization.
It seems that the script completes, but a couple of strange things happen. The most important is that the slurm job keeps on running afterwards, doing nothing(?). I'd like to understand if I'm doing things properly. I'll first give some more specific information, then explain the strange things I'm seeing, then I'll ask my questions.
– Information:
R is loaded as a module, which also calls an OpenMPI module. The packages Rmpi, doParallel, snow, foreach were already compiled and included in the module.
The cluster has nodes with 20 CPUs each. My sbatch file books 2 nodes and 20 CPUs per node.
The R script myscript.R is called in the sbatch file like this:
mpirun -np 1 Rscript -e "source('myscript.R')"
My script calls several libraries in this order:
library('snow')
library('Rmpi')
library('doParallel')
library('foreach')
and then sets up parallelization as follows at the beginning:
workers <- mpi.universe.size() - 1
cl <- makeMPIcluster(workers, outfile='', type='MPI')
registerDoParallel(cl)
Then several foreach-dopar are called in succession – that is, each starts after the previous has finished. Finally
stopCluster(cl)
mpi.quit()
are called at the very end of the script.
mpi.universe.size() correctly gives 40, as expected. Also, getDoParWorkers() gives doParallelSNOW. The slurm log encouragingly says
39 slaves are spawned successfully. 0 failed.
starting MPI worker
starting MPI worker
...
Also, calling print(clusterCall(cl, function() Sys.info()[c("nodename","machine")])) from within the script correctly reports the node names shown in the slurm queue.
– What's strange:
The R script completes all its operations, the last one being saving a plot as pdf, which I do see and is correct. But the slurm job doesn't end, it remains in the queue indefinitely with status "running".
The slurm log shows very many lines with
Type: EXEC. I can't find any relation between their number and the number of foreach called. At the very end the log shows 19 lines with Type: DONE (which make sense to me).
– My questions:
Why does the slurm job run indefinitely after the script has finished?
Why the numerous Type: EXEC messages? Are they normal?
There is some masking between packages snow and doParallel. Am I calling the right packages and in the right order?
Some answers to the StackOverflow questions mentioned above recommend to call the script with
mpirun -np 1 R --slave -f 'myscript.R'
instead of using Rscript as I did. What's the difference? Note that the problems I mentioned remain even if I call the script this way, though.
I thank you very much for your help!

R plumber package for node.js parallel processing

I would like to use the plumber package to carry out some flexible parallel processing and was hoping it would work within a node.js framework such that it is non-blocking...
I have the following plumber file.
# myfile.R
#* @get /mean
normalMean <- function(samples=10){
Sys.sleep(5)
data <- rnorm(samples)
mean(data)
}
I have also installed pm2 as suggested here http://plumber.trestletech.com/docs/hosting/
I have also made the same run-myfile.sh file i.e.
#!/bin/bash
R -e "library(plumber); pr <- plumb('myfile.R'); pr\$run(port=4000)"
and made it executable as suggested...
I have started up pm2 using
pm2 start /path/to/run-myfile.sh
and wanted to test to see if it could carry out a non-blocking node.js framework...
by opening up another R console and running the following...
foo <- function(){
con <- curl::curl('http://localhost:4000/mean?samples=10000',handle = curl::new_handle())
on.exit(close(con))
return(readLines(con, n = 1, ok = FALSE, warn = FALSE))
}
system.time(for (i in seq(5)){
print(foo())
})
Perhaps it is my misunderstanding of how a node.js non-blocking framework is meant to work, but in my head the last loop should take only a bit over 5 seconds. Instead it takes about 25 seconds, suggesting everything is sequential rather than parallel.
How could I use the plumber package to carry out that non-blocking nature?
pm2 can't load-balance R processes for you, unfortunately. R is single-threaded and doesn't really have libraries that allow it to behave in asynchronous fashion like NodeJS does (yet), so there aren't many great ways to parallelize code like this in plumber today. The best option would be to run multiple plumber R back-ends and distribute traffic across them. See the "load balancing" section here: http://plumber.trestletech.com/docs/docker-advanced
Basically, concurrent requests are queued by httpuv, so plumber is not performant by itself. The author recommends multiple Docker containers, but that can be complicated as well as resource-demanding.
There are other technologies, e.g. Rserve and rApache. Rserve forks processes, and it is possible to configure rApache to pre-fork so as to handle concurrent requests.
See the following posts for comparison
https://www.linkedin.com/pulse/api-development-r-part-i-jaehyeon-kim/
https://www.linkedin.com/pulse/api-development-r-part-ii-jaehyeon-kim/
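The queuing effect can be imitated without plumber. In the sketch below, slow_task is a hypothetical stand-in for the /mean handler (with a shortened sleep), a plain loop stands in for a single R process working through queued requests, and parallel::mclapply stands in for several independent R back-ends. mclapply relies on forking, so this assumes Linux or macOS.

```r
library(parallel)

# Hypothetical stand-in for the /mean handler, with a shorter sleep.
slow_task <- function(samples = 10) {
  Sys.sleep(0.5)
  mean(rnorm(samples))
}

# One R process: five requests are handled one after another.
t_single <- system.time(
  for (i in seq(5)) slow_task()
)[["elapsed"]]

# Five forked R processes: each request gets its own worker.
t_multi <- system.time(
  mclapply(seq(5), function(i) slow_task(), mc.cores = 5)
)[["elapsed"]]

print(c(single = t_single, multi = t_multi))
```

The single-process timing grows with the number of requests, which matches the 25 seconds seen in the question; the multi-process timing should stay close to the cost of one request.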

Performing Parallel computation using doSNOW and multiple servers in R

I am trying to do a multi-server (not multi-core) computation using doSNOW and foreach packages.
I have 2 Windows servers and I want to start a parallel computation on both of these Windows machines.
I have the following code:
library(foreach)
library(doSNOW)
winOptionsServer1 <-
list(host="Server1",
rscript="C:/Program Files/R/R-3.1.2/bin/Rscript.exe",
snowlib="C:/Program Files/R/R-3.1.2/library")
winOptionsServer2 <-
list(host="Server2",
rscript="C:/Program Files/R/R-3.1.2/bin/Rscript.exe",
snowlib="C:/Program Files/R/R-3.1.2/library")
cl <- makeCluster(c(rep(winOptionsServer1, 2), rep(winOptionsServer1, 2)), type="SOCK")
After calling makeCluster my machine does something, but never actually completes the call. When I hit Stop in RStudio I get the following error message:
running command 'ssh -l mypc Server1 C:/PROGRA~1/R/R-31~1.2/bin/Rscript.exe "C:/Program Files/R/R-3.1.2/library/snow/RSOCKnode.R" MASTER=MY-PC PORT=11764 OUT=/dev/null SNOWLIB=C:/Program Files/R/R-3.1.2/library' had status 127
Does it mean that I have to configure something on these remote servers? What exactly should I configure? ssh? And how do I do it? Maybe I should open some ports on my remote machines? Which ones?
Does anyone have a full example of the steps I need to take to run R across 2 or more machines?
P.S. doSNOW works really well when running on multiple cores, no problem with that. My problems are with running on multiple servers.

R no longer runs in parallel

I run (k)Ubuntu 12.04.2 and R 3.0.1
I wrote a bunch of code that used to run in parallel, but now it no longer does. Not even this runs in parallel any more:
library(doMC)
registerDoMC(4)
Results = foreach (i = 1:1e6, .combine = "c") %dopar% {
sqrt(i)
}
And that definitely should. What I think broke it is either the R 3.0.1 update or a -dev, -devel BLAS package I installed. (openBLAS I think)
I've tried system(sprintf("taskset -p 0xffffffff %d", Sys.getpid())) as suggested elsewhere, and get this result:
pid 2415's current affinity mask: 1
pid 2415's new affinity mask: f
I've also tried running R with:
taskset 0xffff R
However after either of these steps running the loop still only uses one core.
I want parallel processing back! How can I get it?
I found the solution! Ironically, to get parallel processing back I had to do both of the steps I mentioned in the question at the same time.
So, start R with
taskset 0xffff R
Then run
system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
Within R.
Voila, parallel processing returns
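The second step can be wrapped in a small helper so it is easy to run at the top of a script. This is only a sketch: reset_affinity is a hypothetical name, and it assumes a Linux system where the taskset utility is on the PATH.

```r
# Sketch: reset this R process's CPU affinity mask using taskset, if present.
reset_affinity <- function(mask = "0xffffffff") {
  if (!nzchar(Sys.which("taskset"))) return(FALSE)  # taskset not available
  status <- system(sprintf("taskset -p %s %d", mask, Sys.getpid()),
                   ignore.stdout = TRUE, ignore.stderr = TRUE)
  status == 0L
}

affinity_reset <- reset_affinity()
print(affinity_reset)

# Number of cores R can see, for comparison with what the loop actually uses.
print(parallel::detectCores())
```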

Resources