What is the difference between cluster and cores in registerDoParallel when using doParallel package?
Is my understanding correct that on a single machine these are interchangeable and I will get the same results for:
cl <- makeCluster(4)
registerDoParallel(cl)
and
registerDoParallel(cores = 4)
The only difference I see that makeCluster() has to be stopped explicitly using stopCluster().
I think the chosen answer is too general and actually not accurate, since it doesn't touch on the details of the doParallel package itself. If you read the vignette, it's actually pretty clear.
The parallel package is essentially a merger of the multicore
package, which was written by Simon Urbanek, and the snow package,
which was written by Luke Tierney and others. The multicore
functionality supports multiple workers only on those operating
systems that support the fork system call; this excludes Windows. By
default, doParallel uses multicore functionality on Unix-like systems
and snow functionality on Windows.
We will use snow-like functionality in this vignette, so we start by
loading the package and starting a cluster
To use multicore-like functionality, we would specify the number of
cores to use instead
In summary, this is system dependent. Cluster is the more general mode and covers all platforms, while cores only applies to Unix-like systems.
To keep the interface consistent, the package uses the same function for both modes.
> library(doParallel)
> cl <- makeCluster(4)
> registerDoParallel(cl)
> getDoParName()
[1] "doParallelSNOW"
> registerDoParallel(cores=4)
> getDoParName()
[1] "doParallelMC"
Yes, it's right from the software point of view:
on a single machine these are interchangeable and I will get the same results.
To understand 'cluster' and 'cores' clearly, I suggest thinking about them at the 'hardware' and the 'software' level.
At the hardware level, 'cluster' means network-connected machines that work together by communicating, for example over sockets (hence the extra init/stop operations such as the stopCluster() you pointed out). 'cores' means several hardware cores in the local CPU, which typically cooperate through shared memory (no need to explicitly send a message from A to B).
At the software level, the boundary between cluster and cores is sometimes not that clear. A program can be run locally on cores or remotely on a cluster, and the high-level software doesn't need to know the details. So we can mix the two modes, for example using explicit communication locally by setting cl on one machine,
and also running multiple cores on each of the remote machines.
Back to your question: is setting cl or cores equivalent?
From the software side, the program will be run by the same number of clients/servers and will therefore give the same results.
From the hardware side, it may be different: cl implies explicit communication and cores implies shared memory. But if the high-level software is well optimized, both settings will go down the same code path on a local machine. I haven't looked into doParallel very deeply, so I am not sure whether the two end up identical.
In practice, it is better to specify cores for a single machine and cl for a cluster.
Hope this helps you.
The behavior of doParallel::registerDoParallel(<numeric>) depends on the operating system; see print(doParallel::registerDoParallel) for details.
On Windows machines,
doParallel::registerDoParallel(4)
effectively does
cl <- makeCluster(4)
doParallel::registerDoParallel(cl)
i.e. it sets up four ("PSOCK") workers that run in background R sessions. Then, %dopar% will basically utilize the parallel::parLapply() machinery. With this setup, you do have to worry about global variables and packages being attached on each of the workers.
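For example, a minimal sketch of what that means in practice with a PSOCK cluster (the variable and object names below are just illustrative):
library(doParallel)
cl <- makeCluster(4)                 # four background ("PSOCK") R sessions
registerDoParallel(cl)
my_threshold <- 0.5                  # a global the workers need
clusterExport(cl, "my_threshold")    # copy it into each worker's workspace
res <- foreach(i = 1:4, .packages = "stats") %dopar% {  # .packages attaches packages on each worker
  mean(rnorm(100)) > my_threshold
}
stopCluster(cl)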
However, on non-Windows machines,
doParallel::registerDoParallel(4)
will result in %dopar% utilizing the parallel::mclapply() machinery, which in turn relies on forked processes. Since forking is used, you don't have to worry about globals and packages.
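By contrast, a rough illustration of the forked setup on a Unix-alike (again, names are made up):
library(doParallel)
registerDoParallel(cores = 4)    # forked ("multicore") workers; Linux/macOS only
my_threshold <- 0.5              # defined only in the master session
res <- foreach(i = 1:4) %dopar% {
  # forked workers inherit my_threshold (and attached packages) automatically
  mean(rnorm(100)) > my_threshold
}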
Related
I want to run R in parallel on HPC clusters using MPI. I understand the makeCluster function from the snow package can be used to specify the number of nodes. For example, makeCluster(2, type="MPI") means 2 nodes. Without specifying the type as "MPI", makeCluster(2) would mean 2 cores on a single node.
Is there a way to specify both the number of nodes and cores?
How many cores on each node would be used by default when specifying the type as MPI?
The snow and parallel packages do not, to the best of my knowledge, expose this -- it gets hardware-dependent very quickly.
No -- but if your MPI implementation is, say, OpenMPI, then you can specify this in your server's MPI configuration. There is a pretty rich grammar for this, and the hwloc library should give you hardware locality.
But R at the very end of this only knows 'number of worker nodes' and passes the how and which down to the particular implementation you pick.
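As a rough illustration (the hostnames, slot counts, and script name below are made up), with OpenMPI you could describe the nodes and per-node slots in a hostfile and let mpirun decide placement; R then only sees the resulting workers:
# contents of a hostfile called "hosts" (OpenMPI syntax): two nodes, four slots (cores) each
node01 slots=4
node02 slots=4
# launch 8 R workers; MPI decides which slot each one lands on
mpirun --hostfile hosts -np 8 R --slave -f myscript.R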
I'm running models in the R package 'secr'. The simplest models take days to complete on a 4 GB MacBook and I've already done everything possible within the model's setup to decrease run time. Parallel (multicore) processing is possible and straightforward in secr, but the benefits are minimal and run time may actually increase. Am I likely to see an improvement in run time if I switch to a high-powered virtual machine in the cloud (e.g. AWS's EC2 with 16 GB RAM and 4 vCPUs), or do the EC2's four vCPUs function like a multicore system (in which case I would only benefit from one vCPU despite having 4)?
I've asked this question in a couple of different forums and received conflicting answers.
You can think of the vCPUs just like a multicore system. They would appear as multiple cores to any software running on the system.
Good question. It depends. You may see an improvement in runtime if you switch to an EC2 instance type with better virtual hardware specifications. AWS runs a custom version of the Xen hypervisor, and you're getting vCPUs, as you pointed out. Performance will depend on the variability of the other guests' workloads. If the vCPUs are all assigned to instances, and each instance is running CPU-heavy workloads, you're going to see a downward trend in performance. It depends on the usage pattern of all the instances running on the hypervisor. This article from Citrix explains some of the nuances of balancing vCPU time between instances on Xen and why performance will vary:
Citrix on Xen vCPU Performance
The instance type matters, not only the vCPUs and RAM. Avoid the T2 instances because they are 'burstable' and CPU performance will certainly vary. This article from AWS recommends trying M4 instance types for parallelization with R:
Running R on AWS
For specific types of EC2 instances you can control the C-state (sleep levels a core can enter when it is idle) and P-state (desired performance in frequency from a core). This would allow you to tune your instance's performance for your workload. The following link explains in detail which instance types allow C-state and P-state control and shows you how to use the utility "stress" to benchmark and tune different configurations.
EC2: Processor State Control
It would be best to design a test when you first provision the instance to see if the instance type meets your performance requirements, and then run the test again later to see if the performance benchmark holds.
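For instance, a quick-and-dirty benchmark along these lines (the toy workload is only a stand-in for your secr fits) will tell you how much the vCPUs actually buy you; re-running it later shows whether the result holds:
library(parallel)
n <- detectCores()     # on EC2 this reports the number of vCPUs
work <- function(i) { set.seed(i); sum(replicate(200, mean(rnorm(1e4)))) }
t_serial   <- system.time(lapply(1:8, work))["elapsed"]
t_parallel <- system.time(mclapply(1:8, work, mc.cores = n))["elapsed"]
# (on Windows, use a cluster and parLapply() instead of mclapply())
t_serial / t_parallel   # rough speed-up factor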
I use doMC, which uses the multicore package. It has happened (several times) that when I was debugging (in the console) things went sideways and fork-bombed.
Does R have the setrlimit() syscall?
In Python I would use resource.RLIMIT_NPROC for this.
Ideally I'd like to restrict the number of running R processes to a fixed number.
EDIT: the OS is Linux (CentOS 6).
There should be several choices. Here is the relevant section from Writing R Extensions, Section 1.2.1.1:
Packages are not stand-alone programs, and an R process could
contain more than one OpenMP-enabled package as well as other components
(for example, an optimized BLAS) making use of OpenMP. So careful
consideration needs to be given to resource usage. OpenMP works with
parallel regions, and for most implementations the default is to use as
many threads as 'CPUs' for such regions. Parallel regions can be
nested, although it is common to use only a single thread below the
first level. The correctness of the detected number of 'CPUs' and the
assumption that the R process is entitled to use them all are both
dubious assumptions. The best way to limit resources is to limit the
overall number of threads available to OpenMP in the R process: this can
be done via environment variable 'OMP_THREAD_LIMIT', where
implemented.(4) Alternatively, the number of threads per region can be
limited by the environment variable 'OMP_NUM_THREADS' or API call
'omp_set_num_threads', or, better, for the regions in your code as part
of their specification. E.g. R uses
#pragma omp parallel for num_threads(nthreads) ...
That way you only control your own code and not that of other OpenMP
users.
One of my favourite tools is a package controlling this: RhpcBLASctl. Here is its Description:
Control the number of threads on 'BLAS' (Aka 'GotoBLAS', 'ACML' and
'MKL'). and possible to control the number of threads in 'OpenMP'. get
a number of logical cores and physical cores if feasible.
After all, you need to control the number of parallel sessions as well as the number of BLAS cores allocated to each of them. There is a reason the parallel package has a default of 2 threads per session...
All of this should be largely independent of the flavour of Linux or Unix you are running. Well, apart from the fact that OS X of course (still !!) does not give you OpenMP.
And at the very outer level, you can control things from doMC and friends.
You can use registerDoMC (see the doc here)
registerDoMC(cores=<some number>)
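Putting the two levels together, here is a rough sketch (it assumes RhpcBLASctl is installed; the matrix work is just a BLAS-heavy toy stand-in for real code):
library(doMC)
library(RhpcBLASctl)
registerDoMC(cores = 2)      # outer level: at most 2 forked R workers
blas_set_num_threads(1)      # inner level: 1 BLAS thread per process
res <- foreach(i = 1:4) %dopar% {
  blas_set_num_threads(1)    # pin it inside each worker too, to be safe
  crossprod(matrix(rnorm(1e4), nrow = 100))
}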
Another option is to use the ulimit command before running the R script:
ulimit -u <some number>
to limit the number of processes R will be able to spawn.
If you want to limit the total number of CPUs that several R processes use at the same time, you will need to use cgroups or cpusets and attach the R processes to the cgroup or cpuset. They will then be confined to the physical CPUs defined in the cgroup or cpuset. cgroups allow more control (for instance over memory as well) but are more complex to set up.
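A rough sketch of the cpuset route on CentOS 6 (it assumes the libcgroup tools are installed; the group name, CPU list, and script name are made up):
# create a cpuset group pinned to CPUs 0-3, then run R inside it;
# R and everything it forks will only see those CPUs
cgcreate -g cpuset:/rlimited
cgset -r cpuset.cpus=0-3 rlimited
cgset -r cpuset.mems=0 rlimited
cgexec -g cpuset:rlimited Rscript my_analysis.R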
I was used to doing parallel computation with doMC and foreach, and I now have access to a cluster. My problem is similar to this one, Going from multi-core to multi-node in R, but there is no response on that post.
Basically I can request a number of tasks (-n) and a number of cores per task (-c) from my batch queuing system. I do manage to use doMPI to run parallel simulations on the number of tasks I request, but I now want to use the maxcores option of startMPIcluster so that each MPI process uses multicore functionality.
Something I have noticed is that parallel::detectCores() does not seem to see how many cores I have been allocated and returns the total number of cores of a node.
For now I have tried:
ncore = 3 #same number as the one I put with -c option
library(Rmpi)
library(doMPI)
cl <- startMPIcluster(maxcores = ncore)
registerDoMPI(cl)
## now some parallel simulations
foreach(icount(10), .packages = c('foreach', 'iterators', 'doParallel')) %dopar% {
## here I'd like to use the `ncore` cores on each simulation of `myfun()`
registerDoParallel(cores = ncore)
myfun()
}
(myfun does indeed contain a foreach loop), but if I set ncore > 1 then I get an error:
Error in { : task 1 failed - "'mckill' failed"
thanks
EDIT
the machine I have access to is http://www-ccrt.cea.fr/fr/moyen_de_calcul/airain.htm, where it is specified "MPI libraries: BullxMPI, Bull's MPI distribution, optimised and compatible with OpenMPI"
You are trying to use a lot of different concepts at the same time: you are using an MPI-based cluster to launch work on different computers, but you are also trying to use multi-core processing at the same time. This makes things needlessly complicated.
The cluster you are using is probably spread out over multiple nodes. You need some way to transfer data between these nodes if you want to do parallel processing.
In comes MPI. This is a way to easily connect between different workers on different machines, without needing to specify IP addresses or ports. And this is indeed why you want to launch your process using mpirun or ccc_mprun (which is probably a script with some extra arguments for your specific cluster).
How do we now use this system in R? (see also https://cran.r-project.org/web/packages/doMPI/vignettes/doMPI.pdf)
Launch your script using mpirun -n 24 R --slave -f myScriptMPI.R to start 24 worker processes. The cluster management system will decide where to launch these worker processes. It might launch all 24 of them on the same (powerful) machine, or it might spread them over 24 different machines. This depends on things like workload, available resources, available memory, machines currently in SLEEP mode, etc.
The above command will launch myScriptMPI.R in 24 worker processes, possibly spread over different machines. How do they now collaborate?
library(doMPI)
cl <- startMPIcluster()
#the "master" process will continue
#the "worker" processes will wait here until receiving work from the master
registerDoMPI(cl)
## now some parallel simulations
foreach(icount(24), .packages = c('foreach', 'iterators'), .export='myfun') %dopar% {
myfun()
}
Your data will get transferred automatically from master to workers using the MPI protocol.
If you want more control over your allocation, including making "nested" MPI clusters for multicore vs. inter-node parallelization, I suggest you read the doMPI vignette.
I'm running a foreach loop with the snow back-end on a Windows machine. I have 8 cores to work with. The R script is executed via a system call embedded in a Python script, so there would be an active Python instance too.
Is there any benefit to not having #workers = #cores and instead #workers < #cores, so there is always an opening for system processes or the Python instance?
It runs successfully with #workers = #cores, but do I take a performance hit by saturating the cores (max possible threads) with the R worker instances?
It will depend on:
1) Your processor (specifically hyperthreading)
2) How much info has to be copied to/from the different images
3) Whether you're implementing this over multiple boxes (LAN)
For 1), hyperthreading helps. I know my machine does it, so I typically have twice as many workers as cores, and my code completes in about 85% of the time compared to matching the number of workers to the number of cores. It won't improve beyond that.
For 2), if you're not forking (using sockets, for instance), you're working in a distributed-memory paradigm, which means creating one copy in memory for every worker. This can take a non-trivial amount of time. Also, multiple images on the same machine may take up a lot of memory, depending on what you're working on. I often match the number of workers to the number of cores because doubling the workers would make me run out of memory.
This is compounded by 3), network speeds over multiple workstations. Between machines locally, our switch transfers things at about 20 MB/second, which is 10x faster than my internet download speed at home but a snail's pace compared to making copies within the same box.
You might consider increasing R's nice value so that Python has priority when it needs to do something.