I have a 31-CPU machine available for parallel computations. I would like to create a single 31-node cluster that would then serve several different R processes for parallel computation. How can this be done?
I am currently using makeCluster like this:
cl <- makeCluster(5)
registerDoParallel(cl)
but this will only serve the current R process. How can I connect to a cluster created in a different R process?
PS: The reason I want multiple processes to access one cluster is that I want to be constantly adding new sets of computations, which would wait in the queue for the running processes to finish. I hope it will work this way? I have used doRedis for this in the past, but there were some problems, and I would like to use a simple cluster for this purpose.
I read that you have to use stopCluster() after running a parallel foreach() in R. However, I can get away with registerDoParallel() and then running foreach() as many times as I want without ever using stopCluster(). So do I need stopCluster() or not?
Does not using stopCluster() mean that your cores stay occupied with your current task? So if I am doing parallel programming with only a little single-core sequential work in between, do I not need stopCluster()? I understand there is also significant time overhead in setting up a parallel backend.
parallel::makeCluster() and doParallel::registerDoParallel() create a set of copies of R running in parallel. The copies are called workers.
parallel::stopCluster() and doParallel::stopImplicitCluster() are safe ways of shutting down the workers. From the help page ?stopCluster:
It is good practice to shut down the workers by calling ‘stopCluster’: however the workers will terminate themselves once the socket on which they are listening for commands becomes unavailable, which it should if the master R session is completed (or its process dies).
Indeed, CPU usage of unused workers is often negligible. However, if the workers load large R objects, e.g., large datasets, they may use large parts of the memory and, as a consequence, slow down computations. In that case, it is more efficient to shut down the unused workers.
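For example, a typical setup/teardown pattern might look like this (a minimal sketch; the worker count of 4 is arbitrary):
library(doParallel)
# Start four workers and register them as the foreach backend
cl <- parallel::makeCluster(4)
registerDoParallel(cl)
# ... run foreach(...) %dopar% { ... } computations here ...
# Shut the workers down once they are no longer needed
stopCluster(cl)
If the backend was registered with registerDoParallel(cores = 4) instead, the implicit cluster can be shut down with doParallel::stopImplicitCluster().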
I have recently inherited a legacy R script that at some point trains a gradient boosting model with a large regression matrix. This task is parallelised using the doParallel::registerDoParallel function. Originally, the script started the parallel back end with:
cl = makeCluster(10)
doParallel::registerDoParallel(cl)
The workstation has 12 CPUs and 28 GB of RAM. Since the regression matrix is just over 2 GB, I thought this setup would be manageable; however, it launched dozens of sub-processes, exhausting the memory and crashing R within a few seconds. Eventually I understood that on Linux a more predictable setup can be achieved by using the cores parameter:
doParallel::registerDoParallel(cores = 1)
Here cores appears to be the number of sub-processes per CPU, i.e., with cores = 2 it launches 24 sub-processes on this 12-CPU architecture. The problem is that with such a large regression matrix, even 12 sub-processes are too many.
How can the number of sub-processes launched by registerDoParallel be restricted to, say, 8? Would it help to use a different parallelisation library?
Update: To ascertain the memory required by each sub-process, I ran the script without starting a cluster or using registerDoParallel; still, 12 back-end sub-processes were spawned. The culprit is the caret::train function; it seems to be managing parallelisation on its own. I opened a new issue on GitHub asking for instructions on how to limit the resources used by this function.
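In the meantime, a sketch of the two knobs I am experimenting with (the data and model names are placeholders; allowParallel is an actual argument of caret::trainControl that controls whether train() uses a registered parallel backend):
library(caret)
library(doParallel)
# Register an explicit backend with exactly 8 workers
cl <- parallel::makeCluster(8)
registerDoParallel(cl)
# allowParallel = FALSE would force sequential training instead
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
# model <- train(y ~ ., data = df, method = "gbm", trControl = ctrl)
stopCluster(cl)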
I am parallelizing my R code with snow using clusters (SOCK type) on Ubuntu 16 LTS. A simplified code example is below:
library(snow)

# Make cluster of type SOCK (hostsList: host names defined elsewhere)
cl <- makeCluster(hostsList, type = "SOCK")
clusterExport(cl, "task")
# Compute long-running tasks, load-balanced across the workers
result <- clusterApplyLB(cl, 1:50, function(x) task(x))
# Stop cluster
stopCluster(cl)
The task function can take a long time (minutes or hours), but when for some reason my application no longer needs to compute the tasks, it is not able to stop all the slave processes.
I can kill the master R process, but the R slave processes keep running until they finish (i.e., they keep consuming CPU for quite some time).
I cannot kill the slave processes because their parent process is the system one (PPID = 1), so I don't know which slaves belong to the master process I want to stop. I also tried to use a kind of interrupt to make the master R process execute the stopCluster function, without success.
After an in-depth search, I have not found a solution for this. So, does anybody know a way to stop/kill the slaves, or have an idea how to solve this?
Thanks in advance!
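One workaround I am considering is to record each worker's PID right after the cluster starts, so the slaves can be killed by PID later (a sketch; it only helps for slaves running on the local machine):
library(snow)
cl <- makeCluster(hostsList, type = "SOCK")
# Ask every worker for its PID right after startup
workerPids <- unlist(clusterCall(cl, Sys.getpid))
# Later, if stopCluster(cl) cannot be reached, the local slaves
# can be terminated directly:
# for (p in workerPids) tools::pskill(p, tools::SIGTERM)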
What is the difference between cluster and cores in registerDoParallel when using the doParallel package?
Is my understanding correct that on a single machine these are interchangeable and I will get the same results for:
cl <- makeCluster(4)
registerDoParallel(cl)
and
registerDoParallel(cores = 4)
The only difference I see is that a cluster made with makeCluster() has to be stopped explicitly using stopCluster().
I think the chosen answer is too general and actually not accurate, since it doesn't touch on the details of the doParallel package itself. If you read the package vignette, it is actually pretty clear:
The parallel package is essentially a merger of the multicore
package, which was written by Simon Urbanek, and the snow package,
which was written by Luke Tierney and others. The multicore
functionality supports multiple workers only on those operating
systems that support the fork system call; this excludes Windows. By
default, doParallel uses multicore functionality on Unix-like systems
and snow functionality on Windows.
We will use snow-like functionality in this vignette, so we start by
loading the package and starting a cluster
To use multicore-like functionality, we would specify the number of
cores to use instead
In summary, this is system dependent. Cluster is the more general mode and covers all platforms; cores is only for Unix-like systems.
To make the interface consistent, the package uses the same function for both modes:
> library(doParallel)
> cl <- makeCluster(4)
> registerDoParallel(cl)
> getDoParName()
[1] "doParallelSNOW"
> registerDoParallel(cores=4)
> getDoParName()
[1] "doParallelMC"
Yes, from the software point of view, it is right that on a single machine these are interchangeable and you will get the same results.
To understand 'cluster' and 'cores' clearly, I suggest thinking at the 'hardware' and 'software' levels.
At the hardware level, 'cluster' means network-connected machines that can work together through communication such as sockets (which needs more init/stop operations, like the stopCluster() you pointed out), while 'cores' means several hardware cores in the local CPU, which typically work together through shared memory (no need to explicitly send a message from A to B).
At the software level, the boundary between cluster and cores is sometimes not that clear. The program can be run locally on cores or remotely on a cluster, and the high-level software doesn't need to know the details. So we can mix the two modes, e.g., using explicit communication locally by setting up cl on one machine, or running multiple cores on each of several remote machines, as sketched below.
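For instance, a socket cluster can mix local and remote workers just by listing host names (the host name "node1" is hypothetical and assumes passwordless SSH access):
library(parallel)
library(doParallel)
# Two workers on this machine and two on a remote host
cl <- makeCluster(c("localhost", "localhost", "node1", "node1"))
registerDoParallel(cl)
# ... foreach(...) %dopar% ... as usual ...
stopCluster(cl)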
Back to your question: is setting cl the same as setting cores?
From the software side, it is the same: the program will be run by the same number of workers and will produce the same results.
From the hardware side, it may be different: cl means explicit communication and cores means shared memory. But if the high-level software is well optimized, both settings may go through the same flow on a local machine. I have not looked into doParallel deeply enough to be sure whether these two end up identical.
In practice, it is better to specify cores for a single machine and cl for a cluster.
Hope this helps you.
The behavior of doParallel::registerDoParallel(<numeric>) depends on the operating system, see print(doParallel::registerDoParallel) for details.
On Windows machines,
doParallel::registerDoParallel(4)
effectively does
cl <- makeCluster(4)
doParallel::registerDoParallel(cl)
i.e. it sets up four ("PSOCK") workers that run in background R sessions. %dopar% will then basically utilize the parallel::parLapply() machinery. With this setup, you do have to worry about global variables and packages being attached on each of the workers.
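For instance, a package used inside the loop must be shipped to the workers explicitly via foreach's .packages argument (a minimal sketch using the recommended Matrix package):
library(doParallel)
cl <- parallel::makeCluster(4)
registerDoParallel(cl)
# .packages attaches Matrix on every worker; without it the
# workers would fail to find Diagonal() and nnzero()
res <- foreach(i = 1:4, .combine = c, .packages = "Matrix") %dopar% {
  nnzero(Diagonal(i))
}
stopCluster(cl)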
However, on non-Windows machines,
doParallel::registerDoParallel(4)
will result in %dopar% utilizing the parallel::mclapply() machinery, which in turn relies on forked processes. Since forking is used, you don't have to worry about globals and packages.
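A quick sketch of the difference in practice (Unix only; the variable x is made up):
library(doParallel)
# Fork-based backend (multicore); Unix only
registerDoParallel(cores = 4)
x <- 42  # forked workers inherit globals from the master
res <- foreach(i = 1:4, .combine = c) %dopar% {
  x + i  # no .export or .packages needed
}
stopImplicitCluster()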