Figuring out the maximum memory requirement for an R function

How can I find the maximum memory requirement for an R function? I am trying to improve the resource requirements for a function, but am running into difficulty figuring out the maximum memory footprint.
It appears that gc() reports the maximum memory used, but that figure depends on when garbage collection happens to run during the function. The best I have been able to do is set a memory cap with ulimit -v before starting R and running the script, and then decrease that limit until the script fails. This is a rather slow, iterative process.
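For reference, the gc()-based pattern described above looks roughly like this (a minimal sketch; my_function() is a hypothetical stand-in for the code being measured):
gc(reset = TRUE)        # reset the "max used" counters
result <- my_function() # run the code whose peak memory use is being measured
gc()                    # the "max used" columns now report the peak since the reset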
Is there a way to figure out the resource requirements in a single R session?

Take a look at the documentation for profiling R code for memory use. The example in the help file for Rprofmem() is also helpful.
First call Rprofmem() with a file for the output and a lower limit for when to write the stack trace:
Rprofmem("Rprofmem.out", threshold = 1000)
Then run some code:
<your function>
Then turn off profiling and look at the file:
Rprofmem(NULL)
noquote(readLines("Rprofmem.out", n = <some integer>))
The largest are at the top.
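Putting this together, a minimal end-to-end sketch might look as follows (Rprofmem() only records allocations if R was compiled with memory profiling enabled; the threshold and the allocation code here are purely illustrative):
Rprofmem("Rprofmem.out", threshold = 10000) # log allocations of at least 10000 bytes
x <- replicate(5, rnorm(1e6))               # stand-in for <your function>
Rprofmem(NULL)                              # stop profiling
noquote(readLines("Rprofmem.out", n = 10))  # inspect the first recorded entries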

Related

Restricting loess' multicore usage in R

I'm trying to fit roughly 70,000 values as a function of two variables using the loess() function several times. I want to use this fit to de-trend the data. My problem is that once I start the loess function, the R session takes up all available cores on the system, which would be inconsiderate towards other users on the same computing cluster.
The relevant code would be analogous to the following:
# Approximation of the data
df <- data.frame(y = rpois(70000, rnorm(70000, 10, 2)), # y is count data
                 x = 50000 - rpois(70000, 100),
                 z = runif(70000))
# The problematic operation
fit <- loess(y ~ x + z, data = df)
When I run this example on my local machine, it only takes up 1 core, but on the cluster it takes as many cores as it can get (up to 48). Ideally, I would like loess() to run on only 1 core.
I've tried to find any multicore parameters in the code of loess, but couldn't find any. I know that loess calls stats:::simpleLoess, which in turn calls C code, which in turn calls Fortran code. I have no experience in C or Fortran and I haven't been able to figure out how I can restrict the CPU usage for this function.
Does anyone have any suggestions on how I can limit the CPU usage of the loess function?
I am not knowledgeable enough to comment on the specifics of how all of this works, but I know that C/C++ and Fortran code for R is usually built using the OpenMP framework for multi-threaded programming. Empirically, I do know that your issue can be resolved if you set the OMP_NUM_THREADS environment variable before you launch R, or if you set it from within an R session.
Let's say you wanted to use 2 threads for the loess function. Before you launch R, you would do this ($ to signify typing this in a shell session):
$ OMP_NUM_THREADS=2 R [whatever other options you use to launch R]
Here's how to do it from within R (> to indicate an interactive R session):
> Sys.setenv("OMP_NUM_THREADS" = 2)
If you ever need to check the variable from within R, you can do the following (this will return a character vector with the number):
> Sys.getenv("OMP_NUM_THREADS")
# The result in our example will be "2"
For completeness, be sure to use ?Sys.setenv or ?Sys.getenv if you wish to get more information about those functions, and see the OpenMP documentation for details about OMP_NUM_THREADS.
Hope that helps!
So McG led me down a path that eventually gave me the ability to control the number of cores, which I'll post as another answer.
There were a few details I foolishly neglected to mention, namely that I was working on an RStudio server. For all other purposes, I indeed think that McG's answer would be excellent.
That answer helped me find the right terms to google, and while strolling through the search results I stumbled upon a thread suggesting that the RhpcBLASctl package has a function to set the number of cores, as follows:
blas_set_num_threads(2)
Setting this in an RMarkdown document before running loess kept my CPU usage at 200% during the loess call that had been problematic before.
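For anyone in the same situation, the full pattern looks roughly like this (a hedged sketch; it assumes the RhpcBLASctl package is installed and reuses the df from the question above; the package also exposes omp_set_num_threads(), shown here for completeness):
library(RhpcBLASctl)
blas_set_num_threads(2) # cap the number of threads used by the BLAS
omp_set_num_threads(2)  # cap the number of OpenMP threads used by compiled code
fit <- loess(y ~ x + z, data = df)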

protection from stack overflow in R with a lot of free RAM

I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (number of columns ~ 80000, number of rows between 40 and 180). The analyses involve several feature selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a DELL Precision T7910 machine, with two Intel Xeon E5-2695 v3 2.30 GHz processors, 192 GB of RAM and a Windows 7 x64 operating system.
To speed up the running time of my analysis I thought I would use the doParallel package in combination with foreach. I set up the cluster as follows:
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_cores set to various numbers between 2 and 10 (detectCores() tells me that I have 56 cores in total).
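The cross-validation work is then dispatched roughly like this (a hedged sketch; run_fold() is a hypothetical placeholder for the in-house feature selection, clustering and RWeka SVM steps applied to a single fold):
library(foreach)
results <- foreach(fold = 1:10, .packages = c("CORElearn", "RWeka")) %dopar% {
    run_fold(fold) # hypothetical per-fold pipeline
}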
My problem is that even when setting number_of_cores to just 2, I get a 'protection from stack overflow' error message. The thing is that I monitor the RAM usage while the script is running, and not even 20 GB of my 192 GB of RAM is being used.
If I run the script in a sequential way it takes its sweet time (~ 3 hours with 42 rows and ~ 80000 columns), but it does run until the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait > 3 hours every time I run the analyses? And more generally: how is it possible to have a stack overflow problem when there is plenty of free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work using the same machine: since I am running a 10-fold cross-validation scheme, I am opening 5 different instances of Rgui and running 2 folds in each instance. Proceeding in this way, everything runs smoothly, and the process indeed takes about a tenth of the time of running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, the machine must have the computational resources needed. Hence I cannot really get my head around the fact that %dopar% with 10 workers does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it does not matter how much physical memory the machine has. Please do not call gc() explicitly at all: even when there is a problem with heap usage, it just makes the program run slower and does not help; if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability, and it also indirectly reduces the scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
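For example, to launch R from the shell with a larger protection stack (the value is illustrative; the default size is 50000 pointers):
$ R --max-ppsize=500000
Since PSOCK workers are separate R processes, they may need a larger protection stack as well; one possibility (hedged, via the rscript_args option of makeCluster) would be:
> cl <- makeCluster(number_of_cores, type = 'PSOCK', rscript_args = '--max-ppsize=500000')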

What are the risks of "Reached total allocation of 31249Mb: see help(memory.size)"

A few times when modifying large objects (~5 GB) on a Windows machine with 30 GB of RAM, I have received the error
Reached total allocation of 31249Mb: see help(memory.size). However, the process seems to complete, i.e. I get a file with what looks like the right values. Checking every part of a large file for exactly the right output, by cutting it up and comparing it to the correct section, is time consuming, but when I have done it the returned objects appear to match my expectations.
What risks/side effects can I expect from this error? What should I be checking? Is the process automatically recovering because I'm getting back the returns I'm expecting, or are the errors going to be more subtle? My entire analysis is written using the tidyverse; does this mean I can rely on good error handling from Hadley et al., and is that why my process warns but still completes?
N.B. I have not included any attempt at an MWE, as every machine will have different limitations of what memory is available, though I am happy to be shown an MWE for this kind of process if there are suggestions.
Use memory.limit(size = x), where x is the amount of memory in MB to give the R process.
See link for more details:
Increasing (or decreasing) the memory available to R processes
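A minimal sketch of that (memory.limit() is Windows-only; the size is given in MB and the value here is illustrative):
memory.limit()             # report the current limit in MB
memory.limit(size = 40000) # raise the limit to roughly 40 GB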

R code failed with: "Error: cannot allocate buffer"

Compiling an RMarkdown script overnight failed with the message:
Error: cannot allocate buffer
Execution halted
The code chunk that it died on was training a caretEnsemble list of 10 machine learning algorithms. I know it takes a fair bit of RAM and computing time, but I did previously succeed in running that same code in the console. Why did it fail in RMarkdown? I'm fairly sure that even if it ran out of free RAM, there was enough swap.
I'm running Ubuntu with 3GB RAM and 4GB swap.
I found a blog article about memory limits in R, but it only applies to Windows: http://www.r-bloggers.com/memory-limit-management-in-r/
Any ideas on solving/avoiding this problem?
One reason why it may be backing up is that knitr and RMarkdown add a layer of computational complexity and take some memory of their own. The console is the most streamlined implementation.
Also, caret is fat, slow, and unapologetic about it. If the machine learning algorithm is complex, the data set is large, and you have limited RAM, it can become problematic.
Some things you can do to reduce the burden:
If there are unused variables in the data set, keep a subset of only the ones you want, then clear the old set from memory using rm() with the name of the old data frame in the parentheses.
After removing variables, run the garbage collector; it reclaims the memory space that your removed variables and interim objects were taking up.
R has no native means of memory purging, so if a function is not written with a garbage collect and you do not do it yourself, all the refuse from your past executions persists in memory, making life hard.
To do this, just type gc() with nothing in the parentheses. Also clear out memory with gc() between the 10 ML runs. And if you import data with XLConnect, the Java implementation is nastily inefficient; that alone could tap out your memory, so run gc() after using it every time.
After setting up the training, testing and validation sets, save the testing and validation files in CSV format on the hard drive, REMOVE THEM from your memory, and run, you guessed it, gc(). Load them again when you need them after the first model (a short sketch of this save/remove/reload pattern appears after this list).
Once you have decided which of the algorithms to run, try installing their original packages separately instead of running caret; require() each by name as you get to it, and clean up after each one with detach(package:packagenamehere) followed by gc().
There are two reasons for this.
One, caret is a collection of other ML algorithms, and it is inherently slower than ALL of them in their native environments. An example: I was running a data set through random forest in caret; after 30 minutes I was less than 20% done. It had already crashed twice at about the one-hour mark. I loaded the original independent package and had a completed analysis in about 4 minutes.
Two, if you require, detach and garbage collect, you have less resident memory to worry about bogging you down. Otherwise you have ALL of caret's functions in memory at once; that is wasteful.
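A hedged sketch of the save/remove/reload and require/detach hygiene described above (object, file and package names are purely illustrative):
write.csv(test_set, "test_set.csv", row.names = FALSE)       # park the test set on disk
write.csv(validation_set, "validation_set.csv", row.names = FALSE)
rm(test_set, validation_set)                                  # drop them from memory
gc()                                                          # reclaim the space
require(randomForest)                                         # load only the package for the current model
# ... train the first model on the training set ...
detach(package:randomForest)                                  # unload the package when done
gc()
test_set <- read.csv("test_set.csv")                          # reload only when it is actually needed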
There are some general things that you can do to make it go better that you might not initially think of but could be useful. Depending on your code they may or may not work or work to varying degrees, but try them and see where it gets you.
I. Use lexical scoping to your advantage. Run the whole script in a clean RStudio environment and make sure that all of the pieces and parts are living in your workspace. Then garbage collect the remnants. Then go to knitr & RMarkdown and call pieces and parts from your existing workspace. They are available to you in Markdown under the same RStudio session, as long as nothing was created inside a loop without being saved to the global environment.
II. In Markdown, set your code chunks up so that you cache the things that would need to be calculated multiple times, so that they live somewhere ready to be called upon instead of taxing memory multiple times.
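One way to do that (a minimal sketch; the chunk name is illustrative) is through knitr's cache option, either globally in a setup chunk or per chunk in its header:
knitr::opts_chunk$set(cache = TRUE)  # in a setup chunk: cache every chunk by default
# or per chunk, in the chunk header: {r expensive-step, cache=TRUE}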
If you pull a variable from a data frame, do something as simple as multiplying each observation in one column by something, and save it back into the original frame, you could end up with as many as 3 copies in memory. If the file is large, that is a killer. So make a clean copy, garbage collect, and cache the pure frame.
Caching intuitively seems like it would waste memory, and done wrong it will, but if you rm() the unnecessary objects from the environment and gc() regularly, you will probably benefit from tactical caching.
III. If things are still getting bogged down, you can try saving results in CSV files on the hard drive and calling them back up as needed, to move them out of memory if you do not need all of the data at one time.
I am pretty certain that you can set the program up to load and unload libraries, data and results as needed. But honestly the best thing you can do, based on my own biased experience, is to move away from caret for big multi-algorithm processes.
I was getting this error when I was inadvertently running the 32-bit version of R on my 64-bit machine.

R memory limit warning vs "unable to allocate..."

Does a memory warning affect my R analysis?
When running a large data analysis script in R I get a warning something like:
In '... '
reached total allocation of ___Mb: see help...
But my script continues without error, just the warning. With other data sets I get an error something like:
Error: cannot allocate vector of size ___Mb:
I know the error breaks my data analysis, but is there anything wrong with just getting the warning? I have not noticed anything missing in my data set, but it is very large and I have no good means of checking everything. I am at 18000 Mb allocated to memory and cannot reasonably allocate more.
Way back in the R 2.5.1 news I found this reference to memory allocation warnings:
malloc.c has been updated to version 2.8.3. This version has a slightly different allocation strategy, and is likely to work a little better close to address space limits but may give more warnings about reaching the total allocation before successfully allocating.
Based on this note, I hypothesize (without any advanced knowledge of the inner implementation) that the warning is given when the memory allocation call in R (malloc.c) failed an attempt to allocate memory. Multiple attempts are made to allocate memory, possibly using different methods, and possibly with calls to the garbage collector. Only when malloc is fairly certain that the allocation cannot be made will it return an error.
Warnings do not compromise existing R objects. They just inform the user that R is nearing the limits of computer memory.
(I hope a more knowledgeable user can confirm this...)
