How to create a cluster for multiple processes?

I have a 31-CPU machine available for parallel computations. I would like to create a single 31-node cluster that would then serve parallel computations for several different R processes. How can this be done?
I am currently using makeCluster like this:
library(doParallel)   # also attaches parallel and foreach

cl <- makeCluster(5)
registerDoParallel(cl)
but this will only serve the current R process. How can I connect to a cluster created in a different R process?
PS: The reason I want multiple processes to access one cluster is that I want to keep adding new sets of computations, which would wait in a queue until the running computations finish. I hope it would work this way? I have used doRedis for this in the past, but there were some problems, and I would like to use a plain cluster for this purpose.

Related

R package future - why does a loop with remote workers hang the local R session?

Please let me know if you need an example, but I don't think it is necessary.
I've written a for loop that creates futures and stores the result of each in a list. The plan is remote, say, made up of 4 nodes on a machine reached over the internet.
After the 4th future is deployed and all cores of the remote machine are busy, R hangs until one of them is free. Since I'm not using any of my local cores, why does it have to hang? Is there a way to change this behavior?
Author of the future framework here. This behavior is by design.
Your main R session has a certain number of workers available. The number of workers depends on what future plan you have set up. You can always check the number of workers by calling nbrOfWorkers(). In your case, you have four remote workers, which means that nbrOfWorkers() returns 4.
You can have this many futures (= nbrOfWorkers()) active at any time without blocking. When you attempt to create one more future, there is no worker available to take it on, and at that point the only option is to block.
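To make the blocking point concrete, here is a minimal sketch using two local multisession workers as a stand-in for your four remote ones (the Sys.sleep() calls are just placeholders for real work):
library(future)
plan(multisession, workers = 2)
nbrOfWorkers()                # 2

f1 <- future(Sys.sleep(5))    # dispatched to worker 1, returns immediately
f2 <- future(Sys.sleep(5))    # dispatched to worker 2, returns immediately
f3 <- future(Sys.sleep(5))    # no free worker left, so this call blocks until f1 or f2 resolves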
Now, it could be that you are asking: How can I make use of my local machine when the remote workers are all busy?
The easiest way to achieve this is by adding one or more local workers to the mix of remote workers. For example, if you allow yourself to use two workers on your local machine, you can do this as:
library(future)
remote_workers <- makeClusterPSOCK(c("n1.remote.org", "n2.remote.org"))
local_workers <- makeClusterPSOCK(2)
plan(cluster, workers = c(remote_workers, local_workers))
or even just
library(future)
remote_workers <- c("n1.remote.org", "n2.remote.org")
local_workers <- rep("localhost", times = 2)
plan(cluster, workers = c(remote_workers, local_workers))
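With either plan, nbrOfWorkers() returns 4 and each new future is dispatched to whichever of the four workers is free. As a rough sketch mirroring your loop (my_simulation() is a hypothetical stand-in for your own function):
library(future)
nbrOfWorkers()                           # 4: two remote + two local

fs <- vector("list", 10)
for (i in seq_len(10)) {
  fs[[i]] <- future(my_simulation(i))    # blocks only when all 4 workers are busy
}
results <- lapply(fs, value)             # value() waits for each future to resolve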

Do I have to registerDoParallel() and stopCluster() every time I want to use foreach() in R?

I read that you have to use stopCluster() after running the parallel function foreach() in R. However, I can get away with calling registerDoParallel() and then running foreach() as many times as I want without ever using stopCluster(). So do I need stopCluster() or not?
Does not using stopCluster() mean your cores stay occupied with your current task? So if I am doing parallel programming with only a little bit of single-core sequential work in between, then I don't need stopCluster()? I understand there is also significant time overhead in setting up the parallel workers.
parallel::makeCluster() and doParallel::registerDoParallel() create a set of copies of R running in parallel. The copies are called workers.
parallel::stopCluster() and doParallel::stopImplicitCluster() are safe ways of shutting down the workers. From the help page ?stopCluster:
It is good practice to shut down the workers by calling ‘stopCluster’: however the workers will terminate themselves once the socket on which they are listening for commands becomes unavailable, which it should if the master R session is completed (or its process dies).
Indeed, CPU usage of unused workers is often negligible. However, if the workers load large R objects, e.g., large datasets, they may use large parts of the memory and, as a consequence, slow down computations. In that case, it is more efficient to shut down the unused workers.
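As a minimal sketch of the usual lifecycle (the four workers and the toy expressions are just placeholders):
library(doParallel)   # also attaches parallel and foreach

cl <- makeCluster(4)          # start 4 worker processes
registerDoParallel(cl)        # register them as the foreach backend

res1 <- foreach(i = 1:10) %dopar% sqrt(i)   # the same workers are reused
res2 <- foreach(i = 1:10) %dopar% i^2       # across as many foreach() calls as you like

stopCluster(cl)               # shut the workers down once you are done with them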

How to combine doMPI and doMC on a cluster?

I am used to doing parallel computation with doMC and foreach, and I now have access to a cluster. My problem is similar to this one, Going from multi-core to multi-node in R, but there is no answer on that post.
Basically, I can request a number of tasks (-n) and a number of cores per task (-c) from my batch queuing system. I do manage to use doMPI to run parallel simulations across the number of tasks I request, but I now want to use the maxcores option of startMPIcluster so that each MPI process uses multicore functionality.
Something I have noticed is that parallel::detectCores() does not seem to see how many cores I have been allocated and returns the maximum number of cores of a node.
For now I have tried:
ncore <- 3  # same number as the one I pass with the -c option
library(Rmpi)
library(doMPI)

cl <- startMPIcluster(maxcores = ncore)
registerDoMPI(cl)

## now some parallel simulations
foreach(icount(10), .packages = c('foreach', 'iterators', 'doParallel')) %dopar% {
  ## here I'd like to use the `ncore` cores for each simulation of `myfun()`
  registerDoParallel(cores = ncore)
  myfun()
}
(myfun() does indeed have a foreach loop inside), but if I set ncore > 1 I get an error:
Error in { : task 1 failed - "'mckill' failed"
thanks
EDIT
The machine I have access to is http://www-ccrt.cea.fr/fr/moyen_de_calcul/airain.htm, where it is specified: "MPI libraries: BullxMPI, a Bull MPI distribution optimised for and compatible with OpenMPI".
You are trying to use a lot of different concepts at the same time. You are using an MPI-based cluster to launch on different computers, but are trying to use multi-core processing at the same time. This makes things needlessly complicated.
The cluster you are using is probably spread out over multiple nodes. You need some way to transfer data between these nodes if you want to do parallel processing.
In comes MPI. This is a way to easily connect workers on different machines, without needing to specify IP addresses or ports. And this is indeed why you want to launch your process using mpirun or ccc_mprun (which is probably a script with some extra arguments for your specific cluster).
How do we now use this system in R? (see also https://cran.r-project.org/web/packages/doMPI/vignettes/doMPI.pdf)
Launch your script using mpirun -n 24 R --slave -f myScriptMPI.R to launch 24 worker processes. The cluster management system will decide where to run these processes: it might put all 24 of them on the same (powerful) machine, or spread them over 24 different machines, depending on things like workload, available resources, available memory, machines currently in sleep mode, etc.
The above command launches 24 copies of myScriptMPI.R, possibly on different machines. How do these processes now collaborate?
library(doMPI)

cl <- startMPIcluster()
# the "master" process continues past this point;
# the "worker" processes wait here until they receive work from the master
registerDoMPI(cl)

## now some parallel simulations
foreach(icount(24), .packages = c('foreach', 'iterators'), .export = 'myfun') %dopar% {
  myfun()
}

closeCluster(cl)   # shut down the MPI workers when all work is done
mpi.quit()
Your data will get transferred automatically from master to workers using the MPI protocol.
If you want more control over your allocation, including making "nested" MPI clusters for multicore vs. inter-node parallelization, I suggest you read the doMPI vignette.

How to see how many nodes a process is using on a cluster with Sun grid engine?

I am (trying to) run R on a multicore computing cluster with Sun Grid Engine. I would like to run R in parallel using the MPI environment and the snow / snowfall parLapply() functions. My code works, at least on my laptop, but to make sure it also does what it is supposed to on the cluster, I have the following questions.
If I request a number of slots / nodes, say 4, how can I check whether a running process actually uses the full number of requested CPUs? Is there a command that shows details about CPU usage on the requested nodes for a given process?
In order to verify that the cluster workers really started on the appropriate nodes, I often use the following command right after creating the cluster object:
clusterEvalQ(cl, Sys.info()['nodename'])
This should match the list of allocated nodes reported by the qstat command.
To actually get details on the CPU usage, I often ssh to each node and use commands like top and ps, but that can be painful if there are many nodes to check. We have the Ganglia monitoring system set up on our clusters, so I can use Ganglia's web interface to check various node statistics. You might want to check with your system administrators to see if they have set anything up for monitoring.
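A minimal sketch of that check, assuming a snow cluster with 4 MPI workers (adapt the cluster creation to however you actually start yours, e.g. via RMPISNOW and getMPIcluster()):
library(snow)

cl <- makeCluster(4, type = "MPI")            # requires Rmpi; or cl <- getMPIcluster()
clusterEvalQ(cl, Sys.info()[["nodename"]])    # one hostname per worker; compare with qstat output
stopCluster(cl)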

Passing parameters to parallel R jobs

I am trying to run parallel R jobs using the multicore package. Every job is the execution of the same script.R with different arguments.
The general idea is to define a function that takes the args and then calls source("script.R"). The problem is that I cannot pass the args to script.R: since I am running in parallel, the args cannot be defined in the global scope.
Any help is welcome.
Since you are running parallel R instances, possibly even on different nodes/computers, using an external database to store the parameters might be a good option.
I would use Redis, as it is extremely fast and fully accessible from R, and, for parallel runs, its companion package doRedis.
So you could have a Redis server (or even a replicated slave database on every host) that the workers fetch their parameters from. You could instantly update the parameters, even from outside of R, making them available to all workers, and you could easily add new workers to the task with doRedis.
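A minimal sketch of that setup with doRedis, assuming a Redis server on localhost and a hypothetical run_job() that does the actual work for one parameter set:
library(doRedis)

registerDoRedis("jobs")                    # use the "jobs" queue as the foreach backend
startLocalWorkers(n = 2, queue = "jobs")   # workers can also be started on other machines

params <- list(list(a = 1), list(a = 2), list(a = 3))
results <- foreach(p = params) %dopar% run_job(p)   # each worker pulls parameter sets from the queue

removeQueue("jobs")                        # clean up the queue when everything has finished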
