Parallel processing in R with "parallel" package - unpredictable runtime

I've been learning to parallelize code in R using the parallel package, and specifically, the mclapply() function with 14 cores.
Something I noticed, just from a few runs, is that repeated calls of mclapply() (with the same arguments, the same input, and the same number of cores) take significantly different lengths of time. For example, when I ran the code three times back to back, the first run took 18s, the next took 23s, and the third took 34s. After waiting a minute and running it again, it was back down to 18s.
Is there some equivalent of "the computer needs a second to cool down" after running the code, which would mean that running separate calls of mclapply() back to back might take longer and longer amounts of time, but waiting for a minute or so and then running mclapply() again gets it back to normal?
I don't have much experience with parallelizing in R, but this is the only ad-hoc explanation I can think of. It would be very helpful to know if my reasoning checks out, and hear in more detail about why this might be happening. Thanks!
To clarify, my calls are like:
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(training_data, simulation, testing_data, mc.cores=14, mc.set.seed = TRUE)
Running this twice in a row takes a lot longer the second time for me. Waiting for a minute and then running it again, it becomes fast again.

I haven't used mclapply(), but I have used the parallel, foreach and pbapply packages. I think the inconsistency comes from the small overheads involved in firing up workers and in communicating the progress of tasks running in parallel.
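For what it's worth, here is a minimal sketch for watching that variability yourself (a toy task standing in for the real simulation, and mc.cores = 14 assumes a Unix-alike with that many cores): time a few back-to-back calls and compare the elapsed times.
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(1)

work <- function(i) { Sys.sleep(0.01); sqrt(i) }   # stand-in for the real per-element task

for (run in 1:3) {
  elapsed <- system.time(
    mclapply(1:500, work, mc.cores = 14, mc.set.seed = TRUE)
  )["elapsed"]
  cat(sprintf("run %d: %.2f s elapsed\n", run, elapsed))
}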

Related

protection from stack overflow in R with a lot of free RAM

I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (~80000 columns, between 40 and 180 rows). The analyses involve several feature selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a DELL Precision T7910 machine, with two Intel Xeon E5-2695 v3 2.30 GHz processors, 192 GB of RAM, and the Windows 7 x64 operating system.
To speed up the running time of my analysis I thought I would use the doParallel package in combination with foreach. I set up the cluster as follows:
library(doParallel)   # also attaches foreach and parallel
cl <- makeCluster(number_of_cores, type = 'PSOCK')
registerDoParallel(cl)
with number_of_cores set to various numbers between 2 and 10 (detectCores() tells me that I have 56 cores in total).
My problem is that even when setting number_of_cores to just 2, I get a "protection from stack overflow" error message. The thing is that I monitor RAM usage while the script is running, and not even 20 GB of my 192 GB of RAM are being used.
If I run the script in a sequential way it takes its sweet time (~ 3 hours with 42 rows and ~ 80000 columns), but it does run until the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait more than 3 hours every time I run the analyses? And more generally: how is it possible to have a stack overflow problem when there is a lot of free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work using the same machine: since I am running a 10-fold cross-validation scheme, I am opening 5 different instances of Rgui and running 2 folds in each instances. Proceeding in this way, everything run smoothly, and the process indeed take 10 times less than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, this means that the machine has the computational resources needed. Hence I can not really get my head around the fact that %dopar% with 10 clusters does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available heap memory, so calling gc() will have no impact, and it does not matter how much physical memory the machine has. Please do not call gc() explicitly at all: even if there were a problem with heap usage, it would just make the program run slower without helping. If there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability, and it also indirectly reduces the scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
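For example (my_analysis.R is a placeholder name; the default protection stack size is 50000 pointers and, at least in the R versions discussed here, the maximum accepted value is 500000):
R --max-ppsize=500000                    # start an interactive session with a larger protection stack
R --max-ppsize=500000 -f my_analysis.R   # or run a script non-interactively with the same setting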

When (if ever) should I tell R parallel to not use all cores?

I've been using this code:
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterCall(cl, function(){ library(imager) })   # load imager on every worker
then I have a wrapper function looking something like this:
d <- matrix #Loading a batch of data into a matrix
res <- parApply(cl, d, 1, FUN, ...)
# Upload `res` somewhere
I tested on my notebook, which has 8 logical cores (4 physical cores with hyperthreading). When I ran it on a 50,000-row, 800-column matrix, it took 177.5s to complete, and for most of that time the 7 cores were kept at near 100% (according to top); then it sat there for the last 15 or so seconds, which I guess was combining results. According to system.time(), user time was 14s, so that matches.
Now I'm running on EC2, a 36-core c4.8xlarge, and I'm seeing it spending almost all of its time with just one core at 100%. More precisely: There is an approx 10-20 secs burst where all cores are being used, then about 90 secs of just one core at 100% (being used by R), then about 45 secs of other stuff (where I save results and load the next batch of data). I'm doing batches of 40,000 rows, 800 columns.
The long-term load average, according to top, is hovering around 5.00.
Does this seem reasonable? Or is there a point where R's parallelism spends more time on communication overhead, and I should be limiting it to, e.g., 16 cores? Are there any rules of thumb here?
Ref: CPU spec. I'm using "Linux 4.4.5-15.26.amzn1.x86_64 (amd64)", R version 3.2.2 (2015-08-14).
UPDATE: I tried with 16 cores. For the smallest data, run-time increased from 13.9s to 18.3s. For the medium-sized data:
With 16 cores:
   user  system elapsed
 30.424   0.580  60.034
With 35 cores:
   user  system elapsed
 30.220   0.604  54.395
I.e. the overhead part took the same amount of time either way, but with 16 cores the parallel part had fewer cores to run on, so it took longer, and the run took longer overall.
I also tried using mclapply(), as suggested in the comments. It did appear to be a bit quicker (something like 330s vs. 360s on the particular test data I tried it on), but that was on my notebook, where other processes, or overheating, could affect the results. So I'm not drawing any conclusions from that yet.
There are no useful rules of thumb — the number of cores that a parallel task is optimal for is entirely determined by said task. For a more general discussion see Gustafson’s law.
The high single-core portion that you’re seeing in your code probably comes from the end phase of the algorithm (the “join” phase), where the parallel results are collated into a single data structure. Since this far surpasses the parallel computation phase, this may indeed be an indication that fewer cores could be beneficial.
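If you want to find that point empirically, a minimal benchmarking sketch along these lines (toy data and a toy row-wise function standing in for the real d and FUN) times the same parApply() call over a range of cluster sizes:
library(parallel)

time_with_cores <- function(n_cores, d, row_fun) {
  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl))
  unname(system.time(parApply(cl, d, 1, row_fun))["elapsed"])
}

d       <- matrix(rnorm(4000 * 800), nrow = 4000)   # toy batch, not the real data
row_fun <- function(row) sum(sort(row)^2)           # toy per-row work
sapply(c(2, 4, 8, 16), time_with_cores, d = d, row_fun = row_fun)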
I'd add that, in case you are not aware of this wonderful resource for parallel computing in R, you may find Norman Matloff's recent book Parallel Computing for Data Science: With Examples in R, C++ and CUDA very helpful. I'd highly recommend it (I learnt a lot, not coming from a CS background).
The book answers your question in depth (Chapter 2 specifically), giving a high-level overview of the sources of overhead that create bottlenecks in parallel programs.
Quoting section 2.1, which implicitly partially answers your question:
There are two main performance issues in parallel programming:
Communications overhead: Typically data must be transferred back and forth between processes. This takes time, which can take quite a toll on performance. In addition, the processes can get in each other's way if they all try to access the same data at once. They can collide when trying to access the same communications channel, the same memory module, and so on. This is another sap on speed. The term granularity is used to refer, roughly, to the ratio of computation to overhead. Large-grained or coarse-grained algorithms involve large enough chunks of computation that the overhead isn't much of a problem. In fine-grained algorithms, we really need to avoid overhead as much as possible.
^ When overhead is high, fewer cores for the problem at hand can give shorter total computation time.
Load balance: As noted in the last chapter, if we are not careful in the way in which we assign work to processes, we risk assigning much more work to some than to others. This compromises performance, as it leaves some processes unproductive at the end of the run, while there is still work to be done.
When, if ever, should you not use all cores? One example from my personal experience: I run daily cron jobs in R on data that amounts to 100-200 GB in RAM, with multiple cores crunching blocks of that data, and I've indeed found that running with, say, 6 of the 32 available cores is faster than using 20-30 of them. A major reason was the memory requirement of the child processes: after a certain number of child processes were in action, memory usage got high and things slowed down considerably.
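A minimal sketch of that trade-off on a Unix-alike (toy blocks and a toy crunching function, not the actual cron job): capping mc.cores limits how many children are forked at once, which keeps peak memory in check.
library(parallel)

# stand-ins for the real data: 20 blocks of a larger matrix
blocks <- replicate(20, matrix(rnorm(1e5), ncol = 10), simplify = FALSE)

process_block <- function(block) colMeans(block)   # stand-in for the real crunching

# deliberately fewer workers than available cores, to limit per-child memory
res <- mclapply(blocks, process_block, mc.cores = 6)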

Whether to use the detectCores function in R to specify the number of cores for parallel processing?

In the help for detectCores() it says:
This is not suitable for use directly for the mc.cores argument of mclapply nor specifying the number of cores in makeCluster. First because it may return NA, and second because it does not give the number of allowed cores.
However, I've seen quite a bit of sample code like the following:
library(parallel)
k <- 1000
m <- lapply(1:7, function(X) matrix(rnorm(k^2), nrow=k))
cl <- makeCluster(detectCores() - 1, type = "FORK")
test <- parLapply(cl, m, solve)
stopCluster(cl)
where detectCores() is used to specify the number of cores in makeCluster.
My use cases involve running parallel processing both on my own multicore laptop (OSX) and on various multicore servers (Linux). So I wasn't sure whether there is a better way to specify the number of cores, or whether that advice about not using detectCores() was aimed more at package developers whose code is meant to run across a wide range of hardware and OS environments.
So in summary:
Should you use the detectCores function in R to specify the number of cores for parallel processing?
What does the distinction between detected and allowed cores mean, and when is it relevant?
I think it's perfectly reasonable to use detectCores as a starting point for the number of workers/processes when calling mclapply or makeCluster. However, there are many reasons that you may want or need to start fewer workers, and even some cases where you can reasonably start more.
On some hyperthreaded machines it may not be a good idea to set mc.cores=detectCores(), for example. Or if your script is running on an HPC cluster, you shouldn't use any more resources than the job scheduler has allocated to your job. You also have to be careful in nested parallel situations, as when your code may be executed in parallel by a calling function, or you're executing a multithreaded function in parallel. In general, it's a good idea to run some preliminary benchmarks before starting a long job to determine the best number of workers. I usually monitor the benchmark with top to see if the number of processes and threads makes sense, and to verify that the memory usage is reasonable.
The advice that you quoted is particularly appropriate for package developers. It's certainly a bad idea for a package developer to always start detectCores() workers when calling mclapply or makeCluster, so it's best to leave the decision up to the end user. At least the package should allow the user to specify the number of workers to start, but arguably detectCores() isn't even a good default value. That's why the default value for mc.cores changed from detectCores() to getOption("mc.cores", 2L) when mclapply was included in the parallel package.
I think the real point of the warning that you quoted is that R functions should not assume that they own the whole machine, or that they are the only function in your script that is using multiple cores. If you call mclapply with mc.cores=detectCores() in a package that you submit to CRAN, I expect your package will be rejected until you change it. But if you're the end user, running a parallel script on your own machine, then it's up to you to decide how many cores the script is allowed to use.
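To illustrate that advice with a hedged sketch (my_par_fun and its interface are made up for illustration, and mclapply() with more than one core assumes a Unix-alike): a package-style function exposes the worker count and defaults it conservatively, so the end user decides how many cores it may use.
library(parallel)

# hypothetical package function: the caller, not the package, picks the core count
my_par_fun <- function(X, FUN, mc.cores = getOption("mc.cores", 2L)) {
  mclapply(X, FUN, mc.cores = mc.cores)
}

# the end user opts in to more cores explicitly
options(mc.cores = 4)
res <- my_par_fun(1:8, function(i) i^2)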
Author of the parallelly package here: The parallelly::availableCores() function acknowledges various HPC environment variables (e.g. NSLOTS, PBS_NUM_PPN, and SLURM_CPUS_PER_TASK) and system and R settings that are used to specify the number of cores available to the process, and if not specified, it'll fall back to parallel::detectCores(). As I, or others, become aware of more settings, I'll be happy to add automatic support also for those; there is an always open GitHub issue for this over at https://github.com/HenrikBengtsson/parallelly/issues/17 (there are some open requests for help).
Also, if the sysadm sets environment variable R_PARALLELLY_AVAILABLECORES_FALLBACK=1 sitewide, then parallelly::availableCores() will return 1, unless explicitly overridden by other means (by the job scheduler, by the user settings, ...). This further protects against software tools taking over all cores by default.
In other words, if you use parallelly::availableCores() rather than parallel::detectCores() you can be fairly sure that your code plays nice in multi-tenant environments (if it turns out it's not enough, please let us know in the above GitHub issue) and that any end user can still control the number of cores without you having to change your code.
EDIT 2021-07-26: availableCores() was moved from the future package to parallelly in October 2020. For now, and for backward-compatibility reasons, availableCores() is re-exported by the 'future' package.
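A minimal sketch of the pattern described above (the worker task here is just illustrative):
library(parallelly)
library(parallel)

n   <- availableCores()   # respects scheduler and user limits; falls back to detectCores()
cl  <- makeCluster(n)
res <- parLapply(cl, seq_len(n), function(i) Sys.getpid())
stopCluster(cl)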
Better in my case (I use a Mac) is future::availableCores(), because detectCores() shows 160, which is obviously wrong.

R package parallel stopCluster method seems to hang

I am currently working on a project for my company where I am trying to forecast demand for certain flows in parallel. For that, I'm using the following statements from the R parallel package:
library(parallel)
cl <- makeCluster(number_of_sessions)
parRapply(cl, range_list_small, context = context, fun = forecastDemand)
stopCluster(cl)
The context object in this case is an environment that contains certain objects.
The problem is the following: I tried the script on a small sample of flows and it works perfectly. However, when I ran the script on a big number of flows, it hangs for a long time (sometimes a few hours) on the stopCluster(cl) statement. I googled around, but it seems that nobody has had the same problem before. Does somebody recognize the problem? Or is there another way to close the cluster object? After this first parallel session my script has to do another parallel session for other calculations, and that second session does not start until the stopCluster() call has finished.

Running R jobs on a grid computing environment

I am running some large regression models in R in a grid computing environment. As far as I know, the grid just gives me more memory and faster processors, so I think this question would also apply to those who are using R on a powerful computer.
The regression models I am running have lots of observations and several factor variables that each have many (tens or hundreds of) levels. As a result, the regression can get computationally intensive. I have noticed that when I line up 3 regressions in a script and submit it to the grid, it exits (crashes) due to memory constraints. However, if I run them as 3 different scripts, it runs fine.
I'm doing some cleanup: after each model runs, I save the model object to a separate file, call rm(list=ls()) to clear all memory, then run gc() before the next model is run. Still, running all three in one script seems to crash, but breaking up the job seems to be fine.
The sysadmin says that breaking it up is important, but I don't see why, if I'm cleaning up after each run. Running 3 in one script just runs them in sequence anyway. Does anyone have an idea why running three individual scripts works, but running all the models in one script causes R to have memory issues?
thanks! EXL
Similar questions that are worth reading through:
Forcing garbage collection to run in R with the gc() command
Memory Usage in R
My experience has been that R isn't superb at memory management. You can try putting each regression in a function in the hope that letting variables go out of scope works better than gc(), but I wouldn't hold your breath. Is there a particular reason you can't run each in its own batch? More information as Joris requested would help as well.
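A hedged sketch of that suggestion (toy data, toy formulas, and placeholder file names, not the original regressions): fit each model inside a function so its intermediates go out of scope when the function returns, and keep only the saved file.
run_and_save <- function(fit_fun, data, out_file) {
  model <- fit_fun(data)
  saveRDS(model, out_file)
  invisible(NULL)   # nothing large is returned to the calling environment
}

d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = gl(10, 10))   # toy data
run_and_save(function(d) lm(y ~ x1 + x2, data = d), d, "model1.rds")
run_and_save(function(d) lm(y ~ x1,      data = d), d, "model2.rds")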
