R memory issue with memory.limit()

I am running some simulations on a machine with 16GB memory. First, I met some errors:
Error: cannot allocate vector of size 6000.1 Mb (the number might not be accurate)
Then I tried to allocate more memory to R by using:
memory.limit(1E10)
The reason for choosing such a big number is that memory.limit() would not let me select a number smaller than my system's total memory:
In memory.size(size) : cannot decrease memory limit: ignored
After doing this, I can finish my simulations, but R took around 15 GB of memory, which stopped me from doing any post-analysis.
I used object.size() to estimate the total memory used by all the generated variables, which only came to around 10 GB. I could not figure out where R put the rest of the memory. So my question is: how do I reasonably allocate memory to R without exploding my machine?
Thanks!

R is interpreted so WYSINAWYG (what you see is not always what you get). As mentioned in the comments, you need more memory than is required to store your objects, due to copying of said objects. Also, it is possible that, besides being inefficient, nested for loops are a bad idea because gc won't run in the innermost loop. If you have any of these, I suggest you try to remove them using vectorised methods, or manually call gc() in your loops to force garbage collection, but be warned this will slow things down somewhat.
The issue of the memory required for simple objects can be illustrated by the following example. This code grows a data.frame object. Watch the memory use before, after, and the resulting object size. A lot of garbage is allowed to accumulate before gc is invoked. I think garbage collection is more problematic on Windows than on *nix systems. I am not able to replicate the example at the bottom on Mac OS X, but I can do so repeatedly on Windows. The loop and more explanations can be found in The R Inferno, page 13...
# Current memory usage in Mb
memory.size()
# [1] 130.61
n <- 1000
# Run loop overwriting the current object on each iteration
my.df <- data.frame(a = character(0), b = numeric(0))
for (i in 1:n) {
  this.N <- rpois(1, 10)
  my.df <- rbind(my.df,
                 data.frame(a = sample(letters, this.N, replace = TRUE),
                            b = runif(this.N)))
}
# Current memory usage afterwards (in Mb)
memory.size()
# [1] 136.34
# BUT... size of my.df
print(object.size(my.df), units = "Mb")
# 0.1 Mb
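As suggested above, avoiding the incremental rbind() keeps that garbage from accumulating in the first place. A minimal sketch of the same loop, collecting the pieces in a list and binding once at the end (reusing n from the example above):

# Build the pieces in a list, then rbind once - far less copying and garbage
pieces <- vector("list", n)
for (i in 1:n) {
  this.N <- rpois(1, 10)
  pieces[[i]] <- data.frame(a = sample(letters, this.N, replace = TRUE),
                            b = runif(this.N))
}
my.df2 <- do.call(rbind, pieces)
print(object.size(my.df2), units = "Mb")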

Related

R's gc() on parallel runs seems to dramatically under-report peak memory

In R I have a task that I'm trying to parallelize. Part of this is comparing run-times and peak memory usage for different implementations of the task at hand. I'm using the peakRAM library to determine peak memory, which I think just uses gc() under the surface, since if I do it manually I get the same peak memory results.
The problem is that the results from peakRAM are different from the computer's task manager (or top on Linux). If I run on a single core, these numbers are in the same ballpark, but even with 2 cores they are really different.
I'm parallelizing using pblapply in a manner similar to this:
times_parallel = peakRAM(
  pblapply(X = 1:10,
           FUN = \(x) data[iteration == x] %>% parallel_task(),
           cl = makeCluster(numcores, type = "FORK"))
)
With a single core, this process requires a peak of 30 GB of memory. But with 2 cores, peakRAM reports only about 3 GB of memory. Looking at top, however, shows that each of the 2 workers is using around 20-30 GB of memory at a time.
The only thing I can think of is that peakRAM is only reporting the memory of the main process, but I see nothing in the gc() details that suggests this is happening.
The time reported from peakRAM seems appropriate. Sub-linear gains at different core levels.
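One rough way to check that suspicion (a sketch under the assumption that only the main process is measured; parallel_task and data are the hypothetical objects from the question) is to have each forked worker report its own gc() peak, which the master otherwise never sees:

library(parallel)

# Each worker resets its own gc counters, runs the task, and returns its
# per-process peak ("max used", Ncells + Vcells, in Mb) alongside the result.
worker_with_mem <- function(x, data) {
  gc(reset = TRUE)
  res <- parallel_task(data[data$iteration == x, ])
  list(result = res, peak_mb = sum(gc()[, 6]))
}

cl <- makeCluster(2, type = "FORK")
out <- parLapply(cl, 1:10, worker_with_mem, data = data)
stopCluster(cl)
sapply(out, `[[`, "peak_mb")  # per-worker peaks, invisible to gc() in the master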

Memory profiling in R: how to find the place of maximum memory usage?

My code eats up to 3GB of memory at a single time. I figured it out using gc():
gc1 <- gc(reset = TRUE)
graf(...) # the code
gc2 <- gc()
cat(sprintf("mem: %.1fMb.\n", sum(gc2[,6] - gc1[,2])))
# mem: 3151.7Mb.
Which I guess means that there is one single time, when 3151.7 MB are allocated at once.
My goal is to minimize the maximum memory allocated at any single time. How do I figure out which part of my code is responsible for the maximum usage of those 3GB of memory? I.e. the place where those 3GB are allocated at once.
I tried memory profiling with Rprof and profvis, but both seem to show different information (which seems undocumented, see my other question). Maybe I need to use them with different parameters (or use different tool?).
I've been looking at Rprofmem... but:
in the profmem vignette they wrote: "with utils::Rprofmem() it is not possible to quantify the total memory usage at a given time because it only logs allocations and does therefore not reflect deallocations done by the garbage collector."
how do I output the result of Rprofmem? This source speaks for itself: "Summary functions for this output are still being designed".
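For completeness, a minimal sketch of the profmem package mentioned in the quote above; note that it logs individual allocations (via Rprofmem underneath), not the peak amount held at any one time:

library(profmem)  # install.packages("profmem")

p <- profmem({
  x <- numeric(1e6)                   # ~8 MB allocation
  m <- matrix(rnorm(1e4), 100, 100)   # another small allocation
})
p           # one row per allocation, with bytes and the call that made it
total(p)    # total bytes allocated - not the peak in use at any one time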
My code eats up to 3GB of memory at a single time.
While it looks like your code is consuming a lot of RAM at once by calling one function, you can break the memory consumption down into the implementation details of the function (and its sub-calls) by using RStudio's built-in profiling (based on profvis) to see the execution time and rough memory consumption. E.g. if I use my demo code:
# graf code taken from the tutorial at
# https://rawgit.com/goldingn/intecol2013/master/tutorial/graf_workshop.html
library(dismo) # install.packages("dismo")
library(GRaF) # install_github('goldingn/GRaF')
data(Anguilla_train)
# loop to call the code under test several times to get better profiling results
for (i in 1:5) {
  # keep the first n records of SegSumT, SegTSeas and Method as covariates
  covs <- Anguilla_train[, c("SegSumT", "SegTSeas", "Method")]
  # use the presence/absence status to fit a simple model
  m1 <- graf(Anguilla_train$Angaus, covs)
}
Start profiling with the Profile > Start Profiling menu item, source the above code and stop the profiling via the above menu.
After Profile > Stop Profiling, RStudio shows the result as a flame graph, but what you are looking for is hidden in the Data tab of the profile result (I have unfolded all function calls that show heavy memory consumption):
The numbers in the memory column indicate the memory allocated (positive) and deallocated (negative numbers) for each called function and the values should include the sum of the whole sub call tree + the memory directly used in the function.
My goal is to minimize the maximum memory allocated at any single time.
Why do you want to do that? Do you run out-of-memory or do you suspect that repeated memory allocation is causing long execution times?
High memory consumption (or repeated allocations/deallocations) often comes together with slow execution, since copying memory costs time.
So look at the Memory or Time column depending on your optimization goals to find function calls with high values.
If you look into the source code of the GRaF package you can find a loop in the graf.fit.laplace function (up to 50 "newton iterations") that calls "slow" R-internal functions like chol, backsolve, forwardsolve but also slow functions implemented in the package itself (like cov.SE.d1).
Now you can try to find faster (or less memory consuming) replacements for these functions... (sorry, I can't help here).
PS: profvis uses Rprof internally so the profiling data is collected by probing the current memory consumption in regular time intervals and counting it for the currently active function (call stack).
Rprof has limitations: mainly it does not give an exact profiling result, since the garbage collector triggers at non-deterministic times and the freed memory is attributed to whichever function the next probing interval stops in, and it does not recognize memory allocated directly from the OS via C/C++ code or libraries that bypass R's memory management API.
Still it is the easiest and normally good enough indication of memory and performance problems...
For an introduction to profvis see https://rstudio.github.io/profvis/
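If you prefer not to click through the RStudio menus, the same profile can be collected programmatically; a minimal sketch reusing the demo code above (Anguilla_train, covs and graf come from that snippet):

library(profvis)

p <- profvis({
  for (i in 1:5) {
    covs <- Anguilla_train[, c("SegSumT", "SegTSeas", "Method")]
    m1 <- graf(Anguilla_train$Angaus, covs)
  }
})
print(p)  # opens the flame graph; the Data tab shows per-call memory allocations/deallocations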

Error: cannot allocate vector of size X Mb in R

I have a question regarding memory usage in R. I am running R code on our entire database in a for loop. However, the code stops at some point, saying that it cannot allocate a vector of size 325.7 Mb. When I looked at the task manager I saw that R was using 28 GB of RAM on our server.
I am familiar with the gc() function in R, but this does not seem to work. E.g. the code stopped working on the 15th iteration, saying that it cannot allocate the vector. However, if I only run the 15th iteration (and nothing else) there is no problem at all. Moreover, for each new iteration I delete my DT, which is by far the largest object in my environment.
Code sample:
DT <- data.table()
items <- as.character(seq(1:10))
for (i in items) {
  DT <- sample(x = 5000, replace = T)
  write.csv(DT, paste0(i, ".csv"))
  gc()
  rm(DT)
}
I have the feeling that this gc function does not work properly in a for loop. Is that correct or are there any other possible issues, i.e. are there reasons why my memory is full after a few iterations?
View the memory limit using the command memory.limit() and then expand it using memory.limit(size=XXX)
Note this is just a temporary approach, and I think this question on R memory management ("cannot allocate vector of size n Mb") gives a much better explanation of how to tackle these issues.
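For reference, a minimal sketch of that approach (Windows-only, and I believe no longer supported in R >= 4.2, so it only applies to older versions):

memory.limit()              # current limit in Mb
memory.limit(size = 16000)  # raise to ~16 GB; the limit cannot be lowered within a session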

R - Memory allocation besides objects in ls()

I have loaded a fairly large set of data using data.table. I then want to add around 30 columns using instructions of the form:
DT[, x5:=cumsum(y1), by=list(x1, x2)]
DT[, x6:=cummean(y2), by=x1]
At some point I start to get "warnings" like this:
1: In structure(.Call(C_objectSize, x), class = "object_size") :
Reached total allocation of 8072Mb: see help(memory.size)
I check tracemem(DT) every now and then to make sure that no copies are made. The only output I ever get is:
"<0000000005E8E700>"
Also I check ls() to see which objects are in use and object.size() to see how much of my RAM is allocated by the object. The only output of ls() is my data.table and the object size after the first error is 5303.1 Mb.
I am on a Windows 64-bit machine running 64-bit R and have 8 GB of RAM. Of these 8 GB, only 80% are in use when I get the warning. Of that, R is using 5214.0 Mb (strange, since the table is bigger than this).
My question is: if the only RAM R is using is 5303.1 Mb and I still have around 2 GB of free memory, why do I get the error that R has reached the limit of 8 GB, and is there anything I can do against it? If not, what are other options? I know I could use bigmemory, but then I would have to rewrite my whole code and would lose the sweet by-reference modifications which data.table offers.
The problem is that the operations require RAM beyond what the object itself takes up. You could verify that Windows is using a page file. If it is, you could try increasing its size. http://windows.microsoft.com/en-us/windows/change-virtual-memory-size
If that fails you could try to run a live environment of Lubuntu linux to see if its memory overhead is small enough to allow the operation. http://lubuntu.net/
Ultimately, I suspect you're going to have to use bigmemory or similar.

tracking memory usage and garbage collection in R

I am running functions which are deeply nested and consume quite a bit of memory as reported by the Windows task manager. The output variables are relatively small (1-2 orders of magnitude smaller than the amount of memory consumed), so I am assuming that the difference can be attributed to intermediate variables assigned somewhere in the function (or within sub-functions being called) and a delay in garbage collection. So, my questions are:
1) Is my assumption correct? Why or why not?
2) Is there any sense in simply nesting calls to functions more deeply rather than assigning intermediate variables? Will this reduce memory usage?
3) Suppose a scenario in which R is using 3GB of memory on a system with 4GB of RAM. After running gc(), it's now using only 2GB. In such a situation, is R smart enough to run garbage collection on its own if I had, say, called another function which used up 1.5GB of memory?
There are certain datasets I am working with which are able to crash the system as it runs out of memory when they are processed, and I'm trying to alleviate this. Thanks in advance for any answers!
Josh
1) Memory used to represent objects in R and memory marked by the OS as in-use are separated by several layers (R's own memory handling, when and how the OS reclaims memory from applications, etc.). I'd say (a) I don't know for sure but (b) at times the task manager's notion of memory use might not accurately reflect the memory actually in use by R, but that (c) yes, probably the discrepancy you describe reflects memory allocated by R to objects in your current session.
2) In a function like
f = function() { a = 1; g=function() a; g() }
invoking f() prints 1, implying that memory used by a is still being marked as in use when g is invoked. So nesting functions doesn't help with memory management, probably the reverse.
Your best bet is to clean-up or re-use variables representing large allocations before making more large allocations. Appropriately designed functions can help with this, e.g.,
f = function() { m = matrix(0, 10000, 10000); 1 }
g = function() { m = matrix(0, 10000, 10000); 1 }
h = function() { f(); g() }
The large memory of f is no longer needed by the time f returns, and so is available for garbage collection if the large memory required for g necessitates this.
3) If R tries to allocate memory for a variable and can't, it'll run its garbage collector and try again. So you don't gain anything by running gc() yourself.
I'd make sure that you've written memory efficient code, and if there are still issues I'd move to a 64bit platform where memory is less of an issue.
R has facilities for memory profiling, but it needs to be built with that enabled. While we enable that for Debian / Ubuntu, I do not know what the default for Windows is.
Usage of memory profiling is discussed (briefly) in the 'Writing R Extensions' manual.
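A minimal sketch of what that looks like in practice, assuming your R build has memory profiling enabled as described above:

# Profile time and memory together, then summarise both
Rprof("prof.out", memory.profiling = TRUE)
x <- lapply(1:100, function(i) rnorm(1e5))
Rprof(NULL)
summaryRprof("prof.out", memory = "both")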
Coping with (limited) memory on a 32-bit system (and particularly Windows) has its challenges. Most people will recommend that you switch to a system with as much RAM as possible running a 64-bit OS.
