The documentation for doMC seems very sparse, listing only doMC-package and registerDoMC(). The problem I'm encountering is I'll spawn several workers via doMC/foreach, but then when the job is done they just sit there taking up memory. I can go and hunt their process IDs, but I often kill the master process by accident.
library(doMC)
library(foreach)
registerDoMC(32)
foreach(i=1:32) %dopar% foo()
##kill command here?
I've tried following with registerDoSEQ() but it doesn't seem to kill off the processes.
The doMC package is essentially a wrapper around the mclapply function, and mclapply forks workers that should exit before it returns. It doesn't use persistent workers like the snow package or the snow-derived functions in the parallel package, so it doesn't need a function like stopCluster to shut down the workers.
Do you see the same problem when using mclapply directly? Does it work any better when you call registerDoMC with a smaller value for cores?
Are you using doMC from an IDE such as RStudio or R.app on a Mac? If so, you might want to try running R from a terminal to see if that makes a difference. There can be problems calling fork from within an IDE.
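As a quick check, here is a minimal sketch that calls mclapply directly on a Unix-like system; the forked workers should exit as soon as the call returns, so no processes should be left behind:

```r
library(parallel)

# mclapply forks workers only for the duration of the call; they should
# exit before it returns. Note: mc.cores > 1 is not supported on Windows.
res <- mclapply(1:4, function(i) i^2, mc.cores = 2)
unlist(res)  # 1 4 9 16
```

If running this leaves processes behind too, the problem is in mclapply (or the environment it runs in) rather than in doMC itself.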
I never did find a suitable solution for doMC, so for a while I've been doing the following:
library(doParallel)
cl <- makePSOCKcluster(4) # number of cores to use
registerDoParallel(cl)
## computation
stopCluster(cl)
Works every time.
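For completeness, here is a self-contained version of that pattern; wrapping it in a function (the name run_parallel is just for illustration) with on.exit guarantees the workers are stopped even if the computation errors:

```r
library(doParallel)
library(foreach)

run_parallel <- function(n_workers = 2) {
  cl <- makePSOCKcluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)  # workers shut down even on error
  registerDoParallel(cl)
  foreach(i = 1:4, .combine = c) %dopar% { i^2 }
}

res <- run_parallel()
res  # 1 4 9 16
```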
If you are using the doParallel package and registered a backend with a number, as in registerDoParallel(8), you can call unloadNamespace("doParallel") to kill the worker processes. And if you have a handle to the cluster, you can call stopCluster(cl) to remove the extra workers.
By calling registerDoSEQ() you simply register the sequential backend, so all parallel workers should stop. This is not a complete solution, but it should work in some cases.
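registerDoSEQ() comes from the foreach package itself; after calling it, %dopar% runs in the master process, so no new workers are forked:

```r
library(foreach)

registerDoSEQ()  # register the sequential backend: %dopar% runs in-process
res <- foreach(i = 1:3, .combine = c) %dopar% { i * 2 }
res  # 2 4 6
```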
Related
I have a 31 CPU machine available for parallel computations. I would like to create a single 31-node cluster which would then serve for parallel computations to several different R processes. How can this be done?
I am currently using makeCluster in a way like this:
cl <- makeCluster(5)
registerDoParallel(cl)
but this will only serve the current R process. How can I connect to a cluster created in different R process?
PS: The reason why I want multiple processes to access one cluster is that I want to be constantly adding new sets of computations which will be waiting in the queue for the running processes to finish. I hope it will work this way? I have used doRedis for this in the past, but there were some problems and I would like to use a simple cluster for the purpose.
I read you had to use stopCluster() after running parallel function: foreach() in R. However, I can get away with registerDoParallel() and then running foreach() as many times as I want without ever using stopCluster(). So do I need stopCluster() or not?
Does not using stopCluster() mean your cores stay occupied with your current task? So if I am using parallel programming with only a little single-core sequential work in between, do I still need stopCluster()? I understand there is also significant time overhead in setting up the parallel workers.
parallel::makeCluster() and doParallel::registerDoParallel() create a set of copies of R running in parallel. The copies are called workers.
parallel::stopCluster() and doParallel::stopImplicitCluster() are safe ways of shutting down the workers. From the help page ?stopCluster:
It is good practice to shut down the workers by calling ‘stopCluster’: however the workers will terminate themselves once the socket on which they are listening for commands becomes unavailable, which it should if the master R session is completed (or its process dies).
Indeed, CPU usage of unused workers is often negligible. However, if the workers load large R objects, e.g., large datasets, they may use large parts of the memory and, as a consequence, slow down computations. In that case, it is more efficient to shut down the unused workers.
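A sketch of the implicit-cluster case: when you register with a core count rather than a cluster object, doParallel::stopImplicitCluster() is the matching shutdown call (on Unix-like systems, where forked workers are used instead, it is safe to call and does nothing):

```r
library(doParallel)

registerDoParallel(2)                  # implicit backend: forks on Unix-likes,
                                       # background PSOCK workers on Windows
res <- foreach(i = 1:2, .combine = c) %dopar% { i + 1 }
stopImplicitCluster()                  # shuts down implicit PSOCK workers, if any
res  # 2 3
```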
I used to do parallel computation with doMC and foreach, and I now have access to a cluster. My problem is similar to this one, Going from multi-core to multi-node in R, but there is no response on that post.
Basically, I can request a number of tasks with -n and a number of cores per task with -c from my batch queuing system. I do manage to use doMPI to run parallel simulations across the number of tasks I request, but I now want to use the maxcores option of startMPIcluster so that each MPI process uses multicore functionality.
Something I have noticed is that parallel::detectCores() does not seem to see how many cores I have been allocated and instead returns the maximum number of cores of a node.
For now I have tried:
ncore = 3 # same number as passed with the -c option
library(Rmpi)
library(doMPI)
cl <- startMPIcluster(maxcores = ncore)
registerDoMPI(cl)
## now some parallel simulations
foreach(icount(10), .packages = c('foreach', 'iterators', 'doParallel')) %dopar% {
## here I'd like to use the `ncore` cores on each simulation of `myfun()`
registerDoParallel(cores = ncore)
myfun()
}
(myfun indeed has some foreach loops inside), but if I set ncore > 1, I get an error:
Error in { : task 1 failed - "'mckill' failed"
thanks
EDIT
the machine I have access to is http://www-ccrt.cea.fr/fr/moyen_de_calcul/airain.htm, where it is specified: "MPI libraries: BullxMPI, a Bull MPI distribution, optimised and compatible with OpenMPI"
You are trying to use a lot of different concepts at the same time. You are using an MPI-based cluster to launch on different computers, but are trying to use multi-core processing at the same time. This makes things needlessly complicated.
The cluster you are using is probably spread out over multiple nodes. You need some way to transfer data between these nodes if you want to do parallel processing.
In comes MPI. This is a way to easily connect between different workers on different machines, without needing to specify IP addresses or ports. And this is indeed why you want to launch your process using mpirun or ccc_mprun (which is probably a script with some extra arguments for your specific cluster).
How do we now use this system in R? (see also https://cran.r-project.org/web/packages/doMPI/vignettes/doMPI.pdf)
Launch your script using mpirun -n 24 R --slave -f myScriptMPI.R to start 24 worker processes. The cluster management system decides where to launch these worker processes: it might launch all 24 of them on the same (powerful) machine, or spread them over 24 different machines, depending on things like workload, available resources, available memory, machines currently in sleep mode, etc.
The above command will launch myScriptMPI.R in 24 different processes. How do they now collaborate?
library(doMPI)
cl <- startMPIcluster()
#the "master" process will continue
#the "worker" processes will wait here until receiving work from the master
registerDoMPI(cl)
## now some parallel simulations
foreach(icount(24), .packages = c('foreach', 'iterators'), .export='myfun') %dopar% {
myfun()
}
Your data will get transferred automatically from master to workers using the MPI protocol.
If you want more control over your allocation, including making "nested" MPI clusters for multicore vs. inter-node parallelization, I suggest you read the doMPI vignette.
I am using the parallel package to run a server function multiple times at once. The server function loops until the session is manually stopped by the user.
It looks like:
library(parallel)
cluster <- makeCluster(3)
clusterCall(cluster, f)
On Windows, parallel works by creating an Rscript process for each worker in a cluster. However, these processes do not get closed when terminating the R session; they must be manually removed in the task manager. With a dozen or so workers, this is quickly becoming a hassle.
Is it possible to set these processes to close when the parent R session closes?
You must always close the connections after parallel processing. Try the following example:
library(parallel)
cluster <- makeCluster(3)
clusterCall(cluster, f)
stopCluster(cluster) # always add this line in the end of the script
What is the difference between cluster and cores in registerDoParallel when using doParallel package?
Is my understanding correct that on single machine these are interchangeable and I will get same results for :
cl <- makeCluster(4)
registerDoParallel(cl)
and
registerDoParallel(cores = 4)
The only difference I see that makeCluster() has to be stopped explicitly using stopCluster().
I think the accepted answer is too general and actually not accurate, since it doesn't touch on the details of the doParallel package itself. If you read the vignette, it's actually pretty clear.
The parallel package is essentially a merger of the multicore
package, which was written by Simon Urbanek, and the snow package,
which was written by Luke Tierney and others. The multicore
functionality supports multiple workers only on those operating
systems that support the fork system call; this excludes Windows. By
default, doParallel uses multicore functionality on Unix-like systems
and snow functionality on Windows.
We will use snow-like functionality in this vignette, so we start by
loading the package and starting a cluster
To use multicore-like functionality, we would specify the number of
cores to use instead
In summary, this is system dependent. A cluster is the more general mode covering all platforms, while cores applies only to Unix-like systems.
To keep the interface consistent, the package uses the same function for both modes.
> library(doParallel)
> cl <- makeCluster(4)
> registerDoParallel(cl)
> getDoParName()
[1] "doParallelSNOW"
> registerDoParallel(cores=4)
> getDoParName()
[1] "doParallelMC"
Yes, from the software point of view it is correct that on a single machine these are interchangeable and you will get the same results.
To understand 'cluster' and 'cores' clearly, I suggest thinking at the 'hardware' and 'software' levels.
At the hardware level, 'cluster' means network-connected machines that work together by communicating, for example over sockets (this needs extra init/stop operations, such as the stopCluster you pointed out). 'Cores' means several hardware cores in the local CPU, which typically work together through shared memory (no need to explicitly send messages from A to B).
At the software level, the boundary between cluster and cores is sometimes not that clear. The program can be run locally on cores or remotely on a cluster, and high-level software doesn't need to know the details. So we can mix the two modes, for example using explicit communication locally by setting up cl on one machine, while also running multiple cores on each of the remote machines.
Back to your question: is setting cl or cores equivalent?
From the software side, it is the same: the program will be run by the same number of clients/servers and produce the same results.
From the hardware side, it may differ: cl means explicit communication and cores means shared memory, although if the high-level software is well optimized, both settings may go through the same flow on a local machine. I have not looked into doParallel deeply, so I am not sure whether the two are identical there.
But in practice, it is better to specify cores for a single machine and cl for a cluster.
Hope this helps you.
The behavior of doParallel::registerDoParallel(<numeric>) depends on the operating system, see print(doParallel::registerDoParallel) for details.
On Windows machines,
doParallel::registerDoParallel(4)
effectively does
cl <- makeCluster(4)
doParallel::registerDoParallel(cl)
i.e. it sets up four ("PSOCK") workers that run in background R sessions. Then, %dopar% will basically utilize the parallel::parLapply() machinery. With this setup, you do have to worry about global variables and packages being attached on each of the workers.
However, on non-Windows machines,
doParallel::registerDoParallel(4)
%dopar% will instead utilize the parallel::mclapply() machinery, which in turn relies on forked processes. Since forking is used, you don't have to worry about globals and packages.
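A quick way to check which backend you got on your platform:

```r
library(doParallel)

registerDoParallel(2)
w <- getDoParWorkers()     # 2
backend <- getDoParName()  # "doParallelMC" on Unix-likes, "doParallelSNOW" on Windows
stopImplicitCluster()      # harmless on Unix-likes; stops PSOCK workers on Windows
```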