I have recently inherited a legacy R script that at some point trains a gradient boosting model on a large regression matrix. This task is parallelised using the doParallel::registerDoParallel function. Originally, the script started the parallel back end with:
cl = makeCluster(10)
doParallel::registerDoParallel(cl)
The workstation has 12 CPUs and 28 GB of RAM. With the regression matrix being just over 2 GB, I thought this set-up would be manageable; however, it launched dozens of sub-processes, exhausting the memory and crashing R within a few seconds. Eventually I understood that, on Linux, a more predictable set-up can be achieved by using the cores parameter:
doParallel::registerDoParallel(cores = 1)
Here cores apparently acts as the number of sub-processes per CPU, i.e. with cores = 2 it launches 24 sub-processes on this 12-CPU machine. The problem is that with such a large regression matrix, even 12 sub-processes is too many.
How can the number of sub-processes launched by registerDoParallel be restricted to, say, 8? Would it help to use a different parallelisation library?
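What I would like is effectively the following sketch, with no extra sub-processes beyond the 8 explicit workers:

library(doParallel)
cl <- parallel::makeCluster(8)   # explicit 8-worker cluster
registerDoParallel(cl)
# ... run the foreach/caret training step here ...
parallel::stopCluster(cl)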
Update: To ascertain the memory required by each sub-process, I ran the script without starting a cluster or calling registerDoParallel; still, 12 back-end sub-processes were spawned. The culprit is the caret::train function, which seems to manage parallelisation without oversight. I opened a new issue on GitHub asking for instructions on how to limit the resources used by this function.
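In the meantime, one workaround, assuming the extra workers come from caret's internal foreach loop rather than from the underlying model library, is to switch that loop off via trainControl. A rough sketch, where training_data and method = "gbm" are placeholders for the real script:

library(caret)
# allowParallel = FALSE keeps train() from farming work out to the registered back end
ctrl <- trainControl(method = "cv", number = 5, allowParallel = FALSE)
fit  <- train(y ~ ., data = training_data, method = "gbm", trControl = ctrl)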
Related
I am using mclapply in my R script for parallel computing. It reduces overall memory usage and it is fast, so I want to keep it in my script. However, I noticed that the number of child processes generated while the script runs is larger than the number of cores I specified with mc.cores. Specifically, I am running my script on a server with 128 cores, and I set mc.cores to 18. While the script was running, I checked the processes related to it using htop. First, I can find 18 processes like this:
[htop screenshot: 18 processes running 3_GA_optimization.R]
3_GA_optimization.R is my script. This all looks good. But I also found more than 100 processes running at the same time, with similar memory and CPU usage. The screenshot below shows some of them:
[htop screenshot: some of the additional processes, with similar memory and CPU usage]
The problem is that although I only asked for 18 cores, the script actually uses all 128 cores on the server, and this makes the server very slow. So my first question is: why is this happening? And what is the difference between the processes shown in green and the 18 processes shown in black?
For my second question, some context: I tried to use ulimit -Su 100 to set the soft limit on the maximum number of processes before running Rscript 3_GA_optimization.R. I chose 100 based on the number of processes I was already using before running the script plus the number of cores I want to use while it runs. However, I got an error saying:
Error in mcfork():
unable to fork, possible reason: Resource temporarily unavailable
So it seems that mclapply has to generate many more processes than mc.cores in order for the script to run, which is confusing to me. So my second question is: why does mclapply behave this way? Is there any other way to cap the total number of cores mclapply can use?
The OP followed up in a comment on 2021-05-17 and confirmed that the problem was that their parallelization via mclapply() called functions from the ranger package, which in turn parallelized using all available CPU cores. This nested parallelism caused R to spawn many more processes than there are cores on the machine.
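A minimal sketch of the fix implied above, assuming the per-task work is a ranger fit (the data frame name my_data is a placeholder): force each forked worker to keep the inner library single-threaded.

library(parallel)
library(ranger)

fit_one <- function(i, dat) {
  # num.threads = 1 stops ranger from spawning its own threads inside each fork
  ranger(y ~ ., data = dat, num.threads = 1, seed = i)
}

# 18 forked workers in total, each limited to a single ranger thread
fits <- mclapply(seq_len(18), fit_one, dat = my_data, mc.cores = 18)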
I'm running into memory issues using the bnlearn package's structure learning algorithms. Specifically, I notice that the score-based methods (e.g. hc and tabu) use a lot of memory, especially when given a non-empty starting network.
Memory usage wouldn't be an issue except that it continually brings down both my laptop (16 GB RAM) and a VM I'm using (128 GB RAM), yet the data set in question is a discrete BN with 41 nodes and ~250 rows (69 KB in memory). The issue occurs both when running sequentially with 16 GB of RAM and in parallel on a VM (32 GB/core).
One last bit of detail: Occasionally I can get 100-200 nets with a random start to run successfully, but then one net will randomly get too big and bring the system down.
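For concreteness, those random-restart runs look roughly like this sketch (discrete_data stands in for the 41-node, ~250-row data set):

library(bnlearn)

nodes  <- names(discrete_data)
starts <- random.graph(nodes, num = 200)   # random starting networks
nets   <- lapply(starts, function(g) hc(discrete_data, start = g))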
My question: I'm fairly new to BNs, so is this just inherent to the method, or is it a memory management issue with the package?
I am currently trying to run a parallelized RStan job on a computing cluster in R. I am doing this by specifying the following two options:
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
The above makes all 48 available cores usable, and the total RAM I have is 180 GB. I always thought that, in theory, more cores and more RAM were better. I am running very long jobs and I am getting insufficient-memory errors on my cluster. I am wondering if I am perhaps not giving enough memory to each core. Is it possible that the 48 cores are splitting the 180 GB between them and each core is then maxed out?
If I instead used the 180 GB of RAM with only 3 cores, would this get around the memory errors? Or, no matter how many cores I have, will the total memory always be used up at some point if it is a long job? Thanks!
RStan is only going to utilize as many cores as there are chains (by default, 4). And if you are using a shell on a Linux cluster, it will fork rather than make copies of the data. So, if you have idle cores, you are better off utilizing them to parallelize computations within a Stan program using the map_rect function, if possible.
But all of that is probably unrelated to your memory problems. It should not require even 1 GB in most cases.
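Concretely, capping the workers at the number of chains rather than at detectCores() would look roughly like this sketch (the model file and data list names are placeholders):

library(rstan)

chains <- 4                  # RStan will not use more workers than chains
options(mc.cores = chains)   # instead of parallel::detectCores()
rstan_options(auto_write = TRUE)

fit <- stan(file = "model.stan", data = stan_data, chains = chains, iter = 2000)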
I am trying to use caret to cross-validate an elastic net model using the glmnet implementation on an Ubuntu machine with 8 CPU cores & 32 GB of RAM. When I train sequentially, I am maxing out CPU usage on one core, but using 50% of the memory on average.
When I use registerDoMC(cores = xxx), do I need to worry about only registering xxx = floor(100/y) cores, where y is the memory usage of the model when using a single core (in %), in order not to run out of memory?
Does caret have any heuristics that allow it to figure out the max. number of cores to use?
Is there any set of heuristics that I can use to dynamically adjust the number of cores to use my computing resources optimally across different sizes of data and model complexities?
Edit:
FWIW, attempting to use 8 cores made my machine unresponsive. Clearly caret does not check whether spawning xxx processes is likely to be problematic. How can I then choose the number of cores dynamically?
Clearly caret does not check whether spawning xxx processes is likely to be problematic.
True; it cannot predict future performance of your computer.
You should get an understanding of how much memory the modeling uses when running sequentially. Start the training, use top or other methods to estimate the amount of RAM used, then kill the process. If you use X GB of RAM sequentially, running on M cores will require roughly X(M+1) GB of RAM.
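That heuristic can be turned into a small sketch; the 4 GB and 32 GB figures below are assumptions standing in for whatever you measure on your own machine:

library(doMC)
library(parallel)

mem_per_run_gb <- 4    # X: RAM observed for one sequential run (measure with top)
total_ram_gb   <- 32   # machine RAM

# X * (M + 1) <= total RAM  =>  M <= total/X - 1, capped at the physical core count
workers <- min(parallel::detectCores(), floor(total_ram_gb / mem_per_run_gb) - 1)
registerDoMC(cores = workers)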
I'm running an MCMC algorithm, and Microsoft R Open on Windows 7 has improved my speed a lot. But right now I need to run tons of simulations using my algorithm, so I used the R snow package to parallelize my code. However, it doesn't work.
To be specific, Microsoft R Open on my PC uses 4 cores for calculation, while there are 8 cores in total. So I'm thinking I will run 2 processes in parallel on my PC, since each will need 4 cores for the MKL library. But the parallelism isn't real at all. I set up all 8 of my cores when parallelising. My test program needs 5 minutes to run, so if I run it in parallel alongside a copy of itself, I would hope the 2 processes take 5 minutes as well. But actually it took 10 minutes, just like running the 2 processes sequentially.
The same thing happened when I tried to open two R sessions and run the programs in the two sessions. Usually each would only need 5 minutes, but now each of them takes 10 minutes.
So where am I messing up? Is the problem the two layers of parallelism, one at my level and the other at the Intel MKL level?
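For what it's worth, splitting the 8 cores explicitly between two workers would look roughly like this sketch, assuming MKL honours the MKL_NUM_THREADS environment variable inside each snow worker:

library(snow)

# two workers, each asked to keep MKL at 4 threads so the jobs do not
# compete for all 8 cores at once
cl <- makeCluster(2, type = "SOCK")
clusterEvalQ(cl, Sys.setenv(MKL_NUM_THREADS = "4"))

run_simulation <- function(i) {
  # placeholder for the 5-minute MCMC test program
  sum(rnorm(1e6))
}

results <- clusterApply(cl, 1:2, run_simulation)
stopCluster(cl)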
There are way too many factors at play here to figure it out without knowing certain details about your code. For example, what is the affinity mask in effect for each process? What is the Thread Ideal Processor for threads in concurrent processes? It is possible that your processes are trying to compete for the same cores. You can find more details by looking at the SetThreadIdealProcessor and SetProcessAffinityMask APIs. It is also possible that your code is using a shared resource protected by a critical section or other synchronization object. I would start by downloading Process Explorer from Sysinternals and looking at the thread list for each process. This would tell you how many physical threads are running and how many context switches there are for each thread. That will give you something to start with.