Protection from stack overflow in R with a lot of free RAM

I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (~80,000 columns and between 40 and 180 rows). The analyses involve several feature selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a DELL Precision T7910 machine with two Intel Xeon E5-2695 v3 2.30 GHz processors, 192 GB of RAM and a Windows 7 x64 operating system.
To speed up the running time of my analysis I thought I would use the doParallel package in combination with foreach. I set up the cluster as follows:
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_cores set to various numbers between 2 and 10 (detectCores() tells me that I have 56 cores in total).
My problem is that even when setting number_of_cores to just 2, I get a "protection from stack overflow" error message. I monitor the RAM usage while the script is running, and not even 20 GB of my 192 GB of RAM are being used.
If I run the script sequentially it takes its sweet time (~3 hours with 42 rows and ~80,000 columns), but it does run to the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait more than 3 hours every time I run the analyses? And more generally: how is it possible to have a stack overflow problem when there is plenty of free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work on the same machine: since I am running a 10-fold cross-validation scheme, I opened 5 different instances of Rgui and ran 2 folds in each instance. Done this way, everything runs smoothly, and the process indeed takes about 10 times less than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, the machine clearly has the computational resources needed. Hence I cannot really get my head around the fact that %dopar% with 10 workers does not work.
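For concreteness, a minimal sketch of the kind of foreach/%dopar% loop in question (run_one_fold(), fold_indices and data_matrix are hypothetical placeholders, not the actual code):
library(doParallel)
library(foreach)
cl <- makeCluster(10, type = 'PSOCK')
registerDoParallel(cl)
# run_one_fold() does the feature selection, clustering and SVM fit for one
# fold; fold_indices holds the row indices of each cross-validation fold
results <- foreach(fold = 1:10, .packages = c('CORElearn', 'RWeka')) %dopar% {
  run_one_fold(data_matrix, fold_indices[[fold]])
}
stopCluster(cl)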

The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it is not important how much physical memory the machine has. Please do not call gc() explicitly at all, even if there was a problem with the heap usage, it just makes the program run slower but does not help: if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability and it also indirectly reduces scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
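For example (a hedged sketch: my_analysis.R is illustrative, and 500000 is the maximum value currently accepted for the option; per ?makeCluster, the rscript_args option should also let you pass the flag on to PSOCK workers):
# At startup from a shell: R --max-ppsize=500000
# or, for a script:        Rscript --max-ppsize=500000 my_analysis.R

# For PSOCK workers the same option can be passed via rscript_args:
cl <- makeCluster(number_of_cores, type = 'PSOCK',
                  rscript_args = '--max-ppsize=500000')
registerDoParallel(cl)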

Related

When (if ever) should I tell R parallel to not use all cores?

I've been using this code:
library(parallel)
cl <- makeCluster( detectCores() - 1)
clusterCall(cl, function(){library(imager)})
then I have a wrapper function looking something like this:
d <- matrix #Loading a batch of data into a matrix
res <- parApply(cl, d, 1, FUN, ...)
# Upload `res` somewhere
I tested on my notebook, with 8 cores (4 physical cores with hyperthreading). When I ran it on a 50,000-row, 800-column matrix, it took 177.5s to complete, and for most of that time the 7 cores were kept at near 100% (according to top); then it sat there for the last 15 or so seconds, which I guess was combining the results. According to system.time(), user time was 14s, so that matches.
Now I'm running on EC2, a 36-core c4.8xlarge, and I'm seeing it spend almost all of its time with just one core at 100%. More precisely: there is an approx 10-20 second burst where all cores are being used, then about 90 seconds of just one core at 100% (being used by R), then about 45 seconds of other stuff (where I save results and load the next batch of data). I'm doing batches of 40,000 rows and 800 columns.
The long-term load average, according to top, is hovering around 5.00.
Does this seem reasonable? Or is there a point where R's parallelism spends more time on communication overhead, so that I should be limiting it to e.g. 16 cores? Any rules of thumb here?
For reference, the machine spec is "Linux 4.4.5-15.26.amzn1.x86_64 (amd64)", with R version 3.2.2 (2015-08-14).
UPDATE: I tried with 16 cores. For the smallest data, run-time increased from 13.9s to 18.3s. For the medium-sized data:
With 16 cores:
user system elapsed
30.424 0.580 60.034
With 35 cores:
user system elapsed
30.220 0.604 54.395
I.e. the overhead part took the same amount of time, but the parallel bit had fewer cores so took longer, and so it took longer overall.
I also tried using mclapply(), as suggested in the comments. It did appear to be a bit quicker (something like 330s vs. 360s on the particular test data I tried it on), but that was on my notebook, where other processes, or over-heating, could affect the results. So, I'm not drawing any conclusions on that yet.
There are no useful rules of thumb — the number of cores that a parallel task is optimal for is entirely determined by said task. For a more general discussion see Gustafson’s law.
The high single-core portion that you’re seeing in your code probably comes from the end phase of the algorithm (the “join” phase), where the parallel results are collated into a single data structure. Since this far surpasses the parallel computation phase, this may indeed be an indication that fewer cores could be beneficial.
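There being no general rule, a quick empirical check is usually the most reliable guide. A minimal sketch (reusing the d and FUN placeholders from the question; the core counts tested are illustrative):
library(parallel)
# d and FUN stand for the matrix and the per-row function from the question
for (n in c(4, 8, 16, 35)) {
  cl <- makeCluster(n)
  clusterCall(cl, function() library(imager))
  t <- system.time(res <- parApply(cl, d, 1, FUN))
  stopCluster(cl)
  cat(n, "cores:", t["elapsed"], "seconds elapsed\n")
}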
I'd add that, in case you are not aware of this wonderful resource for parallel computing in R, you may find Norman Matloff's recent book Parallel Computing for Data Science: With Examples in R, C++ and CUDA a very helpful read. I'd highly recommend it (I learnt a lot, not coming from a CS background).
It answers your question in depth (Chapter 2 specifically) and gives a high-level overview of the sources of overhead that create bottlenecks for parallel programs.
Quoting section 2.1, which implicitly partially answers your question:
There are two main performance issues in parallel programming:
Communications overhead: Typically data must be transferred back and forth between processes. This takes time, which can take quite a toll on performance. In addition, the processes can get in each other's way if they all try to access the same data at once. They can collide when trying to access the same communications channel, the same memory module, and so on. This is another sap on speed. The term granularity is used to refer, roughly, to the ratio of computation to overhead. Large-grained or coarse-grained algorithms involve large enough chunks of computation that the overhead isn't much of a problem. In fine-grained algorithms, we really need to avoid overhead as much as possible.
^ When overhead is high, fewer cores for the problem at hand can give a shorter total computation time.
Load balance: As noted in the last chapter, if we are not careful in the way in which we assign work to processes, we risk assigning much more work to some than to others. This compromises performance, as it leaves some processes unproductive at the end of the run, while there is still work to be done.
When, if ever, should you not use all cores? One example from my personal experience: I run daily cron jobs in R on data that amounts to 100-200 GB in RAM, with multiple cores used to crunch blocks of data. I've found that running with, say, 6 out of the 32 available cores can be faster than using 20-30 of them. A major reason was the memory requirements of the child processes (after a certain number of child processes were in action, memory usage got high and things slowed down considerably).
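A hedged sketch of what capping the worker count looks like in practice (blocks and process_block() are hypothetical stand-ins for the chunked data and the per-chunk work described above):
library(parallel)
# mc.cores caps the number of simultaneous child processes, and hence the
# peak memory the children can consume together
res <- mclapply(blocks, process_block, mc.cores = 6)

# the foreach/doParallel equivalent is to register only a few workers:
library(doParallel)
registerDoParallel(cores = 6)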

R code failed with: "Error: cannot allocate buffer"

Compiling an RMarkdown script overnight failed with the message:
Error: cannot allocate buffer
Execution halted
The code chunk that it died on was training a caretEnsemble list of 10 machine learning algorithms. I know it takes a fair bit of RAM and computing time, but I previously succeeded in running the same code in the console. Why did it fail in RMarkdown? I'm fairly sure that even if it ran out of free RAM, there was enough swap.
I'm running Ubuntu with 3GB RAM and 4GB swap.
I found a blog article about memory limits in R, but it only applies to Windows: http://www.r-bloggers.com/memory-limit-management-in-r/
Any ideas on solving/avoiding this problem?
One reason why it may be backing up is that knitr and RMarkdown add a layer of computing complexity on top of things, and they take some memory. The console is the most streamlined implementation.
Also, Caret is fat, slow and unapologetic about it. If the machine learning algorithm is complex, the data set is large and you have limited RAM, it can become problematic.
Some things you can do to reduce the burden:
If there are unused variables in the data set, keep a subset of only the ones you want and then clear the old set from memory using rm(), with the name of the data frame in the parentheses.
After removing variables, run a garbage collect; it reclaims the memory space that your removed variables and interim sets were taking up.
R has no native means of memory purging, so if a function is not written with a garbage collect and you do not do it yourself, all your past executed refuse persists in memory, making life hard.
To do this just type gc() with nothing in the parentheses. Also clear out the memory with gc() between the 10 ML runs. And if you import data with XLConnect, the Java implementation is nastily inefficient... that alone could tap out your memory, so gc() after using it every time.
After setting up the training, testing and validation sets, save the testing and validation files in CSV format on the hard drive and REMOVE THEM from your memory, then run, you guessed it, gc(). Load them again when you need them after the first model.
Once you have decided which of the algorithms to run, try installing their original packages separately instead of running Caret. require() each by name as you get to it and clean up after each one with detach(package:packagenamehere) and gc().
There are two reasons for this.
One, Caret is a collection of other ML algorithms, and it is inherently slower than ALL of them in their native environments. An example: I was running a data set through random forest in Caret, and after 30 minutes I was less than 20% done. It had crashed twice already at about the one-hour mark. I loaded the original independent package and in about 4 minutes had a completed analysis.
Two, if you require, detach and garbage collect, you have less resident memory to worry about bogging you down. Otherwise you have ALL of Caret's functions in memory at once... that is wasteful.
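A minimal sketch of the require / fit / detach / gc() pattern described above, using randomForest purely as an illustration (the train_x / train_y objects and the file name are hypothetical):
library(randomForest)
fit <- randomForest(x = train_x, y = train_y)
saveRDS(fit, "rf_fit.rds")   # keep the fitted model on disk, not in RAM
rm(fit)
detach("package:randomForest", unload = TRUE)
gc()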
There are some general things that you can do to make it go better that you might not initially think of but that could be useful. Depending on your code they may or may not work, or work to varying degrees, but try them and see where it gets you.
I. Use lexical scoping to your advantage. Run the whole script in a clean RStudio environment and make sure that all of the pieces and parts are living in your workspace. Then garbage collect the remnants. Then go to knitr & RMarkdown and call the pieces and parts from your existing workspace. It is available to you in Markdown under the same RStudio shell, as long as nothing was created inside a loop without saving it to the global environment.
II. In Markdown, set your code chunks up so that you cache the stuff that would need to be calculated multiple times, so that it lives somewhere ready to be called upon instead of taxing memory multiple times.
If you call a variable from a data frame and do something as simple as multiplying it against each observation in one column and saving it back into that same frame, you could end up with as many as 3 copies in memory. If the file is large, that is a killer. So make a clean copy, garbage collect and cache the pure frame.
Caching intuitively seems like it would waste memory, and done wrong it will, but if you rm() the unnecessary objects from the environment and gc() regularly, you will probably benefit from tactical caching.
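As a small, hedged illustration of chunk caching: it can be switched on globally from the document's setup chunk (individual chunks can also set cache=TRUE in their headers):
# in the Rmd's setup chunk; expensive chunks are then re-run only when
# their code changes, instead of on every knit
knitr::opts_chunk$set(cache = TRUE)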
III. If things are still getting bogged down, you can try to save results in CSV files, send them to the hard drive and call them back up as needed, to move them out of memory if you do not need all of the data at one time.
I am pretty certain that you can set the program up to load and unload libraries, data and results as needed. But honestly the best thing you can do, based on my own biased experience, is to move away from Caret on big multi-algorithm processes.
I was getting this error when I was inadvertently running the 32-bit version of R on my 64-bit machine.

mclapply cores spending lots of time in uninterruptible sleep

This is a somewhat generic question, for which I apologize, but I can't generate a code example that reproduces the behavior. My question is this: I'm scoring a largish data set (~11 million rows with 274 dimensions) by subdividing it into a list of data frames and then running a scoring function on 16 cores of a 24-core Linux server using mclapply. Each data frame on the list is allocated to a spawned instance and scored, returning a list of data frames of predictions. While mclapply is running, the various R instances spend a lot of time in uninterruptible sleep, more than they spend running. Has anyone else experienced this using mclapply? I'm a Linux neophyte; from an OS perspective, does this make any sense? Thanks.
You need to be careful when using mclapply to operate on large data sets. It's easy to create too many workers for the amount of memory on your computer and the amount of memory used by your computation. It's hard to predict the memory requirements due to the complexity of R's memory management, so it's best to monitor memory usage carefully using a tool such as "top" or "htop".
You may be able to decrease the memory usage by splitting your work into more but smaller tasks, since that may reduce the memory needed by the computation. I don't think that the choice of prescheduling affects the memory usage much, since mclapply will never fork more than mc.cores workers at a time, regardless of the value of mc.preschedule.
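A hedged sketch of the "more but smaller tasks" idea (big_df and score_chunk() are hypothetical stand-ins for the data and the scoring function; the chunk count and core count are illustrative):
library(parallel)
# split the rows into many small chunks so that each worker only
# materialises a small piece of the data at a time
idx <- seq_len(nrow(big_df))
chunks <- split(idx, cut(idx, breaks = 200, labels = FALSE))
res <- mclapply(chunks,
                function(i) score_chunk(big_df[i, ]),
                mc.cores = 8, mc.preschedule = FALSE)
scored <- do.call(rbind, res)   # assuming score_chunk() returns data frames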

Does multicore computing using R's doParallel package use more memory?

I just tested an elastic net with and without a parallel backend. The call is:
enetGrid <- data.frame(.lambda=0,.fraction=c(.005))
ctrl <- trainControl( method="repeatedcv", repeats=5 )
enetTune <- train( x, y, method="enet", tuneGrid=enetGrid, trControl=ctrl, preProc=NULL )
I ran it without a parallel backend registered (and got the warning message from %dopar% when the train call finished), and then again with one registered for 7 cores (of 8). The first run took 529 seconds, the second 313. But the first took 3.3 GB of memory at peak (reported by the Sun cluster system), and the second took 22.9 GB. I've got 30 GB of RAM, and the task only gets more complicated from here.
Questions:
1) Is this a general property of parallel computation? I thought they shared memory....
2) Is there a way around this while still using enet inside train? If doParallel is the problem, are there other architectures that I could use with %dopar%--no, right?
Because I am interested in whether this is the expected result, this question is closely related to, but not exactly the same as, the one below; I'd be fine closing this and merging my question into that one (or marking that as a duplicate and pointing to this one, since this has more detail) if that's the consensus:
Extremely high memory consumption of new doParallel package
In multithreaded programs, threads share lots of memory. It's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.
However, memory is shared between the processes that are forked by mclapply (as long as the processes don't modify it, which triggers a copy of the memory region in the operating system). That is one reason that the memory footprint can be smaller when using the "multicore" API versus the "snow" API with parallel/doParallel.
In other words, using:
registerDoParallel(7)
may be much more memory efficient than using:
cl <- makeCluster(7)
registerDoParallel(cl)
since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
However, the "snow" API allows you to use multiple machines, and that means that your memory size increases with the number of CPUs. This is a great advantage since it can allow programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster since they have access to more memory.
So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
There is a minimum number of characters, otherwise I would simply have typed: 1) Yes. 2) No, er, maybe. There are packages that use a "shared memory" model for parallel computation, but R's more thoroughly tested packages don't use it.
http://www.stat.berkeley.edu/scf/paciorek-parallelWorkshop.pdf
http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
http://heather.cs.ucdavis.edu/Rdsm/BARUGSlides.pdf

Running R jobs on a grid computing environment

I am running some large regression models in R in a grid computing environment. As far as I know, the grid just gives me more memory and faster processors, so I think this question would also apply for those who are using R on a powerful computer.
The regression models I am running have lots of observations, and several factor variables that have many (10s or 100s) of levels each. As a result, the regression can get computationally intensive. I have noticed that when I line up 3 regressions in a script and submit it to the grid, it exits (crashes) due to memory constraints. However, if I run it as 3 different scripts, it runs fine.
I'm doing some clean up, so after each model runs, I save the model object to a separate file, rm(list=ls()) to clear all memory, then run gc() before the next model is run. Still, running all three in one script seems to crash, but breaking up the job seems to be fine.
The sysadmin says that breaking it up is important, but I don't see why, if I'm cleaning up after each run. Running all 3 in one script just runs them in sequence anyway. Does anyone have an idea why running three individual scripts works, but running all the models in one script causes R to have memory issues?
thanks! EXL
Similar questions that are worth reading through:
Forcing garbage collection to run in R with the gc() command
Memory Usage in R
My experience has been that R isn't superb at memory management. You can try putting each regression in a function in the hope that letting variables go out of scope works better than gc(), but I wouldn't hold your breath. Is there a particular reason you can't run each in its own batch? More information as Joris requested would help as well.
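For what it's worth, a minimal sketch of the "one regression per function" idea (lm(), dat and the formulas are placeholders for the actual models): keeping each fit inside a function lets its intermediates go out of scope as soon as the function returns, and only the saved file remains.
run_model <- function(formula, data, out_file) {
  fit <- lm(formula, data = data)   # or whichever regression routine you use
  saveRDS(fit, out_file)            # keep the result on disk, not in the workspace
  invisible(NULL)
}
run_model(y ~ x1 + f1 + f2, dat, "model1.rds")
run_model(y ~ x1 + f3 + f4, dat, "model2.rds")
run_model(y ~ x2 + f1 + f5, dat, "model3.rds")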
