Garbage Collection in R

I started to use gc() for garbage collection in R. I have 16 GB RAM and sometimes, up to 10 GB RAM gets freed when using this command.
Does it make sense to use gc() inside functions? Often, the functions I write/use need almost all RAM that is available. Or does R reliably clean up memory that was used only inside a function?
Example:
f <- function(x) {
  # do something
  y <- doStuff(x)
  # do something else
  z <- doMoreStuff(y)
  # garbage collection
  gc()
  # return result
  return(z)
}

Calling gc() is largely pointless, as R calls it automatically when more memory is needed. The only reason I can think of for calling gc() explicitly is if another program needs memory that R is hogging.
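A small sketch (not from the original answer) of why the explicit gc() inside f() is unnecessary: objects created only inside a function become unreachable when the function returns, and R reclaims them on its own the next time it needs memory.
g <- function(n) {
  y <- rnorm(n)   # large temporary, reachable only inside g()
  sum(y)          # the small result that is returned
}
res <- g(1e7)
# y is now unreachable; R will reclaim it automatically when it needs the
# space, or immediately if gc() is called by hand:
gc()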

Related

Finalizer in R modifying objects - thread safe?

In an R package, I manage some external objects. I want to release them after the corresponding R object (an environment) has been garbage collected. There is a reason why I cannot release the external resources right away, so I need to somehow record that the object has been garbage collected and use this information at a later time. The approach is implemented basically as follows:
Environment objects are created with some ID. I have an environment managingEnv where I collect information about the finalized objects in a vector of object IDs. When the objects are finalized, the finalizer writes their IDs into managingEnv.
managingEnv <- new.env(parent = emptyenv())
managingEnv$garbageCollectedIds <- c()
createExternalObject <- function(id) {
  ret <- new.env(parent = emptyenv())
  ret$id <- id
  reg.finalizer(ret, function(e) {
    managingEnv$garbageCollectedIds <- c(managingEnv$garbageCollectedIds, e$id)
  })
  ret
}
If I then create some objects and run the garbage collector, the approach seems to work: in the end all IDs are collected in managingEnv, and I could later perform my action to release all these objects.
> createExternalObject(1)
<environment: 0x000002307ff36920>
> createExternalObject(2)
<environment: 0x000002307e94d390>
> createExternalObject(3)
<environment: 0x000002307e94ac90>
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  568005 30.4    1299461 69.4  1299461 69.4
Vcells 1519057 11.6    8388608 64.0  2044145 15.6
> managingEnv$garbageCollectedIds
[1] 2 1 3
Although the approach seems to work, I have encountered some instability in my package, and R sometimes crashes randomly. After some research into the problem, I came to the conclusion that the approach I use might not be safe.
The documentation of reg.finalizer says:
Note:
R's interpreter is not re-entrant and the finalizer could be run
in the middle of a computation. So there are many functions which
it is potentially unsafe to call from ‘f’: one example which
caused trouble is ‘options’. Finalizers are scheduled at garbage
collection but only run at a relatively safe time thereafter.
Is my approach really safe? Can I change this code to make it safe(r), or find another solution to the problem described above?
EDIT: I found the reasons for the instability in other parts of the code. So it seems like the approach described above is safe. I would still be interested in more details about how to know which operations are "potentially unsafe" in finalizers, to be able to reason that the approach described here is safe.
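One way to keep the finalizer itself minimal, in line with the approach above, is to defer all real cleanup to ordinary code that runs later from a safe context. A hedged sketch (releaseExternalResource() is a hypothetical placeholder for the package's own cleanup routine, not something from the original post):
releaseCollectedObjects <- function() {
  ids <- managingEnv$garbageCollectedIds
  managingEnv$garbageCollectedIds <- c()
  for (id in ids) {
    # releaseExternalResource() is hypothetical: whatever the package
    # actually does to free the external resource belonging to this ID
    releaseExternalResource(id)
  }
}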

R memory puzzle on ECDF environments

I have a massive list of ECDF objects.
Similar to:
vals <- rnorm(10000)
x <- ecdf(vals)
ecdfList <- lapply(1:10000, function(i) ecdf(vals))
save(ecdfList, file='mylist.rda')
class(ecdfList[[1]])
[1] "ecdf" "stepfun" "function"
Let's quit R and start a fresh session (I'm on a server running Ubuntu, R 3.4.4).
q()
R
Now, the problem is that, starting with a fresh environment, loading and then deleting ecdfList doesn't free the memory.
load('mylist.rda')
rm(ecdfList)
gc()
top and free still show the memory as being used by R.
So I thought I would be clever and load them to a new environment.
e = new.env()
load('mylist.rda', envir=e)
rm(e)
gc()
But, same thing happens. top and free still show the memory as being used.
Where are those ecdf objects? How can I safely remove that list of ecdfs from memory?
Maybe the memory is just being held by R, just in case? This doesn't happen with other data objects.
Here's an example of watching the memory with 'free'.
From Rstudio, I'll create a list of vectors and then release them, checking the memory used before and after.
dave@yoga:~$ free
              total        used        free      shared   available
Mem:       16166812     1548680    11725452      932416
Then make a list of vectors.
x <- lapply(1:10000, function(a) rnorm(n=10000))
Then check the free memory.
davidgibbs@gibbs-yoga:~$ free
              total        used        free      shared   available
Mem:       16166812     2330068    10954372      921956
From within Rstudio, rm the vectors.
rm(x)
gc()
Check the memory again,
davidgibbs@gibbs-yoga:~$ free
              total        used        free      shared   available
Mem:       16166812     1523252    11750620      932528
OK, so the memory is returned.
Now we'll try it with a list of ECDFs.
# already saved the list as above
e = new.env()
open('mylist.rda', envir=e)
And check the memory
dave@yoga:~$ free
              total        used        free      shared
Mem:       16166812     1752808    10213168     1166136
e <- new.env()
load('mylist.rda', envir = e)
And we'll check the memory
dave#yoga:~$ free
total used free shared
Mem: 16166812 3365536 8667616 1096236
Now we'll rm that env.
rm(e)
gc()
Final memory check.
dave@yoga:~$ free
              total        used        free      shared   available
Mem:       16166812     3321584     8726964
And still being used until we reset R.
Thank you!!
-dave
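A hedged diagnostic sketch (not part of the original question): comparing R's own accounting with the OS-level numbers can show whether the memory is still reachable inside R or merely not handed back to the operating system. gc() reports what R itself still considers in use, while free reports what the whole process holds at the OS level.
load('mylist.rda')
gc()   # note the Vcells "used" column while ecdfList is loaded
rm(ecdfList)
gc()   # "used" should drop sharply here even if 'free' at the OS level
       # still reports the memory as belonging to the R process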

Memory leak and C wrapper

I am currently using the sbrl() function from the sbrl library. The function does the job of any supervised statistical learning algorithm: it takes data, and generates a predictive model.
I have a memory leak issue when using it.
If I run the function in a loop, my RAM will get filled more and more, although I am always pointing to the same object.
Eventually, my computer will reach the RAM limit and crash.
Calling gc() will never help. Only closing the R session releases the memory.
Below is a minimal reproducible example. Keep an eye on the system's memory monitor while it runs.
Importantly, the sbrl() function calls C code (from what I can tell) and also makes use of Rcpp. I guess this relates to the memory leak.
Would you know how to force memory to be released?
Configuration: Windows 10, R 3.5.0 (Rstudio or R.exe)
install.packages("sbrl")
library(sbrl)
# Getting / prepping data
data("tictactoe")
# Looping over sbrl
for (i in 1:1e3) {
  rules <- sbrl(
    tdata = tictactoe, iters = 30000, pos_sign = "1",
    neg_sign = "0", rule_minlen = 1, rule_maxlen = 3,
    minsupport_pos = 0.10, minsupport_neg = 0.10,
    lambda = 10.0, eta = 1.0, alpha = c(1, 1), nchain = 20
  )
  invisible(gc())
  cat("Rules object size in Mb:", object.size(rules) / 1e6, "\n")
}
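Not part of the original question, but building on the observation that only closing the R session releases the memory: a hedged workaround is to run each sbrl() call in a disposable child R process, so whatever the compiled code leaks is returned to the OS when the child exits. This sketch assumes the callr package is available:
library(callr)
library(sbrl)
data("tictactoe")

for (i in 1:10) {
  # run sbrl() in a fresh child process; only the returned model object
  # crosses back into this session, and the child's memory is released
  # when it exits, regardless of leaks in the underlying C/Rcpp code
  rules <- callr::r(function(td) {
    library(sbrl)
    sbrl(tdata = td, iters = 30000, pos_sign = "1", neg_sign = "0",
         rule_minlen = 1, rule_maxlen = 3,
         minsupport_pos = 0.10, minsupport_neg = 0.10,
         lambda = 10.0, eta = 1.0, alpha = c(1, 1), nchain = 20)
  }, args = list(td = tictactoe))
  cat("Rules object size in Mb:", object.size(rules) / 1e6, "\n")
}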

Not all RAM is released after gc() when using an ffdf object in R

I am running the script as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x<-cbind(rnorm(1:100000000),rnorm(1:100000000),1:100000000)
system.time(write.csv2(x,"test.csv",row.names=FALSE))
#make ffdf object with minimal RAM overheads
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE, first.rows=1000, next.rows=10000,levels=NULL))
# increase column #1 of the ffdf object 'x' by 5, using a chunked approach
chunk_size<-100
m<-numeric(chunk_size)
#list of chunks
chunks <- chunk(x, length.out=chunk_size)
#FOR loop to increase column#1 by 5
system.time(
  for (i in seq_along(chunks)) {
    x[chunks[[i]], ][[1]] <- x[chunks[[i]], ][[1]] + 5
  }
)
# output of x
print(x)
#clear RAM used
rm(list = ls(all = TRUE))
gc()
#another option to run garbage collector explicitly.
gc(reset=TRUE)
The issue is that some RAM remains unreleased even though all objects and functions have been swept away from the current environment.
Moreover, each subsequent run of the script increases the portion of unreleased RAM, as if it accumulated (according to Task Manager on Windows 7 64-bit).
However, if I create a non-ffdf object and remove it, rm() and gc() behave as expected.
So my guess is that the unreleased RAM is connected with the specifics of ffdf objects and the ff package.
The only effective way to clear the RAM seems to be quitting the current R session and starting it again, but that is not very convenient.
I have scanned a bunch of posts about cleaning up memory, including this one:
Tricks to manage the available memory in an R session
But I have not found a clear explanation of this situation or an effective way to overcome it (without restarting the R session).
I would be very grateful for your comments.
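Not part of the original post, and only a hedged sketch: ffdf columns are backed by on-disk ff files that are memory-mapped into the process, so removing the R-side object alone may not shrink what the process appears to hold. Assuming ff's close() and delete() generics apply to ffdf objects as documented, explicitly closing (and, if the data are no longer needed, deleting) the backing files before rm() may help:
library(ff)
x <- read.csv2.ffdf(file = "test.csv", header = TRUE,
                    first.rows = 1000, next.rows = 10000)
# ... chunked processing as above ...
close(x)    # assumption: unmaps the ff files backing the ffdf
delete(x)   # assumption: removes the backing files from disk
rm(x)
gc()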

remove a temporary environment variable and release memory in R

I am working on a job in which a temporary hash table is used repeatedly inside a loop. The hash table is represented by an environment in R. The problem is that as the loop proceeds, the memory cost keeps rising no matter what method I use to delete the table (I tried rm() and gc(), but neither was able to free the memory). As a consequence I cannot complete an extraordinarily long loop, say 10M cycles. It looks like a memory leak, but I have failed to find a solution elsewhere. What is the correct way to completely remove an environment and release all the memory it previously occupied? Thanks in advance for helping me check the problem.
Here is a very simple example. I am using Windows 8 and R version 3.1.0.
fun = function(){
  H = new.env()
  for(i in rnorm(100000)){
    H[[as.character(i)]] = rnorm(100)
  }
  rm(list=names(H), envir=H, inherits=FALSE)
  rm(H)
  gc()
}

for(k in 1:5){
  print(k)
  fun()
  gc()
  print(memory.size(F))
}
[1] 1
[1] 40.43
[1] 2
[1] 65.34
[1] 3
[1] 82.56
[1] 4
[1] 100.22
[1] 5
[1] 120.36
Environments in R are not a good choice for situations where the keys can vary a lot during the computation. The reason is that environments require keys to be symbols, and symbols are not garbage collected. So each run of your function is adding to the internal symbol table. Arranging for symbols to be garbage collected would be one possibility, though care would be needed since a lot of internals code assumes they are not. Another option would be to create better hash table support so environments don't have to try to serve this purpose for which they were not originally designed.
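As an illustration of that last point, here is a hedged sketch using the fastmap package (an assumption on my part, not something from the original answer), which stores keys as strings rather than interned R symbols, so discarded keys do not accumulate in the symbol table:
library(fastmap)

fun <- function() {
  m <- fastmap()
  for (i in rnorm(100000)) {
    m$set(as.character(i), rnorm(100))
  }
  m$reset()   # drop all keys and values; no symbols were interned
}

for (k in 1:5) {
  fun()
  gc()
}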
