I use doMC, which builds on the multicore package. Several times while debugging in the console things went sideways and it fork-bombed.
Does R expose the setrlimit() syscall?
In Python I would use resource.RLIMIT_NPROC for this.
Ideally I'd like to cap the number of running R processes at a fixed number.
EDIT: the OS is Linux, CentOS 6.
There should be several choices. Here is the relevant passage from Writing R Extensions, Section 1.2.1.1:
Packages are not stand-alone programs, and an R process could
contain more than one OpenMP-enabled package as well as other components
(for example, an optimized BLAS) making use of OpenMP. So careful
consideration needs to be given to resource usage. OpenMP works with
parallel regions, and for most implementations the default is to use as
many threads as 'CPUs' for such regions. Parallel regions can be
nested, although it is common to use only a single thread below the
first level. The correctness of the detected number of 'CPUs' and the
assumption that the R process is entitled to use them all are both
dubious assumptions. The best way to limit resources is to limit the
overall number of threads available to OpenMP in the R process: this can
be done via environment variable 'OMP_THREAD_LIMIT', where
implemented. Alternatively, the number of threads per region can be
limited by the environment variable 'OMP_NUM_THREADS' or API call
'omp_set_num_threads', or, better, for the regions in your code as part
of their specification. E.g. R uses
#pragma omp parallel for num_threads(nthreads) ...
That way you only control your own code and not that of other OpenMP
users.
One of my favourite tools is a package for controlling this: RhpcBLASctl. Here is its Description:
Control the number of threads on 'BLAS' (Aka 'GotoBLAS', 'ACML' and
'MKL'). and possible to control the number of threads in 'OpenMP'. get
a number of logical cores and physical cores if feasible.
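For illustration, a minimal sketch of how it is typically used (the single-thread settings here are just an example):
library(RhpcBLASctl)

get_num_cores()          # physical cores, where detectable
get_num_procs()          # logical cores
blas_get_num_procs()     # threads the BLAS is currently set to use

blas_set_num_threads(1)  # pin the BLAS to a single thread
omp_set_num_threads(1)   # likewise for OpenMP regions, where supported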
In the end you need to control both the number of parallel sessions and the number of BLAS cores allocated to each of those parallel workers. There is a reason the parallel package defaults to 2 threads per session...
All of this should be largely independent of the flavour of Linux or Unix you are running. Well, apart from the fact that OS X of course (still!) does not give you OpenMP.
And the outermost level you can control from doMC and friends.
You can use registerDoMC (see the doc here)
registerDoMC(cores=<some number>)
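A hedged sketch of how that outer level is typically capped (the worker count of 4 and the toy loop are purely illustrative; each worker should also keep its own BLAS/OpenMP usage to one thread, as discussed above):
library(doMC)
library(foreach)

registerDoMC(cores = 4)   # illustrative cap on the number of forked workers

res <- foreach(i = 1:8, .combine = c) %dopar% {
  sum(rnorm(1e5))         # stand-in for the real per-task work
}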
Another option is to use the ulimit command before running the R script:
ulimit -u <some number>
to limit the number of processes R will be able to spawn.
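For example (the value 200 and the script name are purely illustrative); running it in a subshell keeps the limit from affecting the rest of your session:
# soft-limit the number of user processes, then start R under that limit;
# the parentheses scope the limit to the subshell only
(ulimit -S -u 200; Rscript myscript.R)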
If you want to limit the total number of CPUs that several R processes use at the same time, you will need cgroups or cpusets and to attach the R processes to the cgroup or cpuset. They will then be confined to the physical CPUs defined in the cgroup or cpuset. cgroups allow more control (for instance over memory as well) but are more complex to set up.
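As a rough sketch using the libcgroup tools available on CentOS 6 (the group name, CPU range, and script name are placeholders, and you will typically need root for the setup steps):
# create a cpuset cgroup restricted to four physical CPUs
cgcreate -g cpuset:/rjobs
cgset -r cpuset.cpus=0-3 rjobs
cgset -r cpuset.mems=0 rjobs      # a cpuset also needs a memory node

# run R inside that cgroup; all forked children inherit the restriction
cgexec -g cpuset:rjobs Rscript myscript.R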
Related
I am using mclapply in my R script for parallel computing. It keeps overall memory usage down and it is fast, so I want to keep it in my script. However, one thing I noticed is that the number of child processes generated while the script runs is larger than the number of cores I specified with mc.cores. Specifically, I am running my script on a server with 128 cores, and I set mc.cores to 18. While the script was running, I checked the processes related to it using htop. First, I can find 18 processes like this:
[htop screenshot: the 18 worker processes]
3_GA_optimization.R is my script. This all looks good. But I also found more than 100 further processes running at the same time with similar memory and CPU usage. The screenshot below shows some of them:
[htop screenshot: some of the additional processes]
The problem is that although I only asked for 18 cores, the script actually uses all 128 cores on the server, which makes the server very slow. So my first question is: why is this happening? And what is the difference between the processes shown in green compared to the 18 shown in black?
My second question is that I tried to use ulimit -Su 100 to set the soft limit on the maximum number of processes before running Rscript 3_GA_optimization.R. I chose 100 based on the number of processes I was already using before running the script and the number of cores I wanted to use while running it. However, I got an error saying:
Error in mcfork():
unable to fork, possible reason: Resource temporarily unavailable
So it seems that mclapply has to generate many more processes than mc.cores in order for the script to run, which is confusing to me. Why does mclapply behave this way? And is there any other way to fix the total number of cores mclapply can use?
OP followed up in a comment on 2021-05-17 and confirmed that the problem was that their parallelization via mclapply() called functions from the ranger package, which in turn parallelized using all available CPU cores. This nested parallelism caused R to use many more CPU cores than are available on the machine.
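A minimal sketch of the usual fix under that diagnosis: keep each forked worker single-threaded inside ranger via its num.threads argument, so the outer mclapply() level is the only source of parallelism (the iris data and the worker count of 18 are purely illustrative):
library(parallel)
library(ranger)

results <- mclapply(1:18, function(i) {
  # pin the inner random-forest fit to a single thread so the 18
  # forked workers are the only parallelism in play
  fit <- ranger(Species ~ ., data = iris, num.threads = 1)
  fit$prediction.error
}, mc.cores = 18)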
I'd like to know the optimal number of cores needed to build a project with GNU make.
I can use --max-load to tune for an existing system, but I want to know if doubling or tripling the core count and memory would improve build wall clock times.
If I could collect statistics on how many recipes make holds waiting for a free core to execute and how long they occupy the core, this could be turned into a standard job scheduling problem.
I don't think there's any way to answer your question, really. Maybe you can be more specific about what you'd like to know.
Obviously the more cores you have, assuming sufficient memory to support them, then the more recipes make can invoke in parallel without crushing your system.
If you have 2 cores and you run make -j300 then make will dutifully invoke 300 jobs at once and your operating system will dutifully attempt to run all of them at the same time. Most likely, your system will be swapping and context switching so much that it will make very little progress and it would take less wall clock time to run make -j2 instead.
On the other hand, if you have 256 cores then make -j300 is probably quite reasonable... assuming you have enough memory to ensure that all those jobs don't wait swapping memory out.
And of course, at some point (but probably far away from any reasonable number of cores you have unless you have a lot of money to spend) you will run into disk IO issues with so many compiler processes running at the same time trying to read source from the disk to compile.
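One common compromise, for what it's worth, is to cap the job count at the core count and also pass a load limit so make stops launching new jobs when the machine is already busy (the numbers are illustrative):
# at most one recipe per logical CPU, and no new jobs while the
# load average is above that same number
make -j"$(nproc)" -l"$(nproc)"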
My go-to number is the number of CPUs + 1. This is based on a lot of informal benchmarks, and is usually very close to the optimal number: -j9 on a hyper-threaded four-core laptop, and -j49 on my usual production build server.
The + 1 means that make keeps all the CPUs occupied, even as jobs are being retired, and is usually a teensy-weensy bit faster than without the increment.
It also means that other users can use the same multiplier without melting the machine.
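On Linux that rule of thumb can be written directly on the command line (nproc counts logical CPUs, so it already includes hyper-threads):
# one job per logical CPU, plus one to keep the pipeline full
make -j"$(($(nproc) + 1))"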
Be aware though, that although -j49 ensures there are only 49 processes actually running, the parent make will potentially have many more child processes than that. For instance, a single compile may mean the shell is called, which calls a shell script, which calls the compiler driver, which calls the correct compiler stage. On some toolchains my -j49 builds have a peak of 245 child processes. A bit annoying when my ulimit max user processes is only 512.
I want to run R in parallel on HPC clusters using MPI. I understand the makeCluster function from the snow package can be used to specify the number of nodes. For example, makeCluster(2, type="MPI") means 2 nodes; without setting type to "MPI", makeCluster(2) would mean 2 cores on a single node.
Is there a way to specify both the number of nodes and cores?
How many cores on each node would be used by default when the type is set to MPI?
The snow and parallel packages do not, to the best of my knowledge, expose this -- it gets hardware-dependent very quickly.
Now, if your MPI implementation is, say, Open MPI, then you can specify this in your server's MPI configuration. There is a pretty rich grammar for this, and the hwloc library should give you hardware locality.
But R, at the very end of this, only knows the 'number of worker nodes' and passes the how and which down to the particular implementation you pick.
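As a hedged illustration with Open MPI (hostnames, slot counts, and the script name are placeholders, and exact placement depends on your MPI setup): the hostfile decides how many ranks may land on each node, while snow only ever sees the total worker count you pass to makeCluster().
# hostfile: four slots on each of two nodes
cat > hosts.txt <<EOF
node01 slots=4
node02 slots=4
EOF

# launch a single master R process; the eight MPI workers spawned by
# makeCluster(8, type = "MPI") are then placed according to the hostfile
mpirun -np 1 --hostfile hosts.txt Rscript my_mpi_script.R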
I'm running a foreach loop with the snow back-end on a Windows machine. I have 8 cores to work with. The R script is executed via a system call embedded in a Python script, so there is an active Python instance too.
Is there any benefit to having #workers < #cores rather than #workers = #cores, so there is always an opening for system processes or the Python instance?
It runs successfully with #workers = #cores, but do I take a performance hit by saturating the cores (the maximum possible threads) with the R worker instances?
It will depend on
Your processor (specifically hyperthreading)
How much info has to be copied to/from the different images
If you're implementing this over multiple boxes (LAN)
For 1), hyperthreading helps. I know my machine does it, so I typically have twice as many workers as cores, and my code completes in about 85% of the time it takes when I match the number of workers to the cores. It won't improve more than that.
For 2), if you're not forking (using sockets, for instance), you're working as if in a distributed-memory paradigm, which means creating one copy in memory for every worker. This can take a non-trivial amount of time. Also, multiple images on the same machine may take up a lot of space, depending on what you're working on. I often match the number of workers to the number of cores because doubling the workers would make me run out of memory.
This is compounded by 3), network speed over multiple workstations. Locally between machines, our switch transfers at about 20 MB/s, which is 10x faster than my internet download speed at home, but a snail's pace compared to making copies within the same box.
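A common compromise is simply to leave one core free for the OS and the calling Python process; a sketch with the snow back-end (the socket cluster type and the worker count are illustrative):
library(doSNOW)      # snow back-end for foreach, as in the question

n_workers <- max(1, parallel::detectCores() - 1)   # leave one core spare
cl <- snow::makeCluster(n_workers, type = "SOCK")
registerDoSNOW(cl)
# ... run the foreach %dopar% loop here ...
snow::stopCluster(cl)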
You might consider increasing R's nice value so that the Python process has priority when it needs to do something.
I am (trying to) run R on a multicore computing cluster with Sun Grid Engine. I would like to run R in parallel using the MPI environment and the snow / snowfall parLapply() functions. My code works, at least on my laptop, but to be sure it does what it is supposed to on the cluster as well, I have the following question.
If I request a number of slots / nodes, say 4, how can I check whether a running process actually uses the full number of requested CPUs? Is there a command that can show details about the CPU usage on the requested nodes for a process?
In order to verify that the cluster workers really started on the appropriate nodes, I often use the following command right after creating the cluster object:
clusterEvalQ(cl, Sys.info()['nodename'])
This should match the list of allocated nodes reported by the qstat command.
To actually get details on the CPU usage, I often ssh to each node and use commands like top and ps, but that can be painful if there are many nodes to check. We have the Ganglia monitoring system set up on our clusters, so I can use Ganglia's web interface to check various node statistics. You might want to check with your system administrators to see if they have set anything up for monitoring.
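If you just need a quick look without a monitoring system, something along these lines can be scripted (the node names are placeholders and the ps fields are a matter of taste):
# print CPU and memory usage of your R processes on each allocated node
for node in node01 node02 node03 node04; do
  echo "== $node =="
  ssh "$node" "ps -u $USER -o pid,pcpu,pmem,etime,comm | grep -E 'R$|Rscript'"
done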