I am using furrr which is built on top of future.
I have a very simple question. I have a list of files, say list('/mydata/file1.csv.gz', '/mydata/file2.csv.gz'), and I am processing them in parallel with a simple function that loads the data, does some filtering, and writes the result to disk.
In essence, my function is
processing_func <- function(file){
  mydata <- readr::read_csv(file)
  mydata <- mydata %>% dplyr::filter(var == 1)
  data.table::fwrite(mydata, 'myfolder/processed.csv.gz')
  rm(mydata)
  gc()
}
and so I am simply running
listfiles %>% furrr::future_map(~ processing_func(.x))
This works, but despite my gc() and rm() calls, the RAM keeps filling up until the session crashes.
What is the conceptual issue here? Why would some residual objects remain somehow in memory when I explicitly discard them?
Thanks!
You can try using a callr future plan; it may be less memory-hungry.
As quoted from the future.callr vignette:
When using callr futures, each future is resolved in a fresh background R session which ends as soon as the value of the future has been collected. In contrast, multisession futures are resolved in background R worker sessions that serve multiple futures over their life spans. The advantage of using a new R process for each future is that the R environment is guaranteed not to be contaminated by previous futures, e.g. memory allocations, finalizers, modified options, and loaded and attached packages. The disadvantage is the added overhead of launching a new R process.
library("future.callr")
plan(callr)
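With that plan in place, the original call from the question could, for example, simply become:
furrr::future_walk(listfiles, processing_func)  # each file is handled in a fresh R process that exits (and frees its memory) when done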
Assuming you're using 64-bit R on Windows, R is bound only by available RAM by default. You can use memory.limit() to increase the amount of memory your R session can use; the line memory.limit(50*1024) would allow your R session to use 50 GB of memory. Also, R automatically calls gc() whenever it's running low on space, so that line isn't helping you.
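For example (memory.limit() is Windows-only and works in MB):
memory.limit()           # report the current limit in MB
memory.limit(50 * 1024)  # raise the limit to roughly 50 GB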
With future multisession:
future::plan(multisession)
processing_func <- function(file){
  readr::read_csv(file) |>
    dplyr::filter(var == 1) |>
    data.table::fwrite('...csv.gz')
  gc()
}
listfiles |> furrr::future_walk(processing_func)
Note that I am:
- Not creating any variables in processing_func, so there is nothing to rm.
- Using future_walk(), not future_map(), as we don't need the resolved values.
- Using gc() inside the future.
Passing files to functions in futures is a nice way to parallelize things. I also like to use multicore instead of multisession to share some objects from the parent environment.
These sessions seem to run out of memory if you aren't careful; a gc() call in the future function often helps.
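A minimal sketch of that multicore variant, reusing listfiles and processing_func from the answer above (multicore relies on forked workers and is only available on Unix-alikes; the worker count is just an example):
future::plan(future::multicore, workers = 4)
listfiles |> furrr::future_walk(processing_func)  # processing_func() already calls gc() inside each future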
Related
I use the bigmemory package to put a very large matrix into shared memory (see the script below), so it can be accessed in parallel by scripts in other R sessions.
I now want to execute the script in a non-interactive way. The problem is that if I run it with Rscript, the matrix is removed from shared memory right after the Rscript process ends. I could add Sys.sleep(99999) to the end of the script, but I am wondering if there is a better way to accomplish this. Any ideas?
library(bigmemory)
m <- read.big.matrix("matrix.txt", type = 'double', shared = TRUE, header = FALSE, sep = "\t")
sign <- describe(m)            # descriptor needed to attach the matrix from other sessions
dput(sign, "matrix.signature") # write the descriptor to disk
If you have the descriptor sign on disk, then you can just use attach.big.matrix() in another session:
m <- attach.big.matrix("matrix.signature")
As long as the matrix is attached in at least one R session, it remains in an isolated part of RAM. The best practice to prevent the matrix from being lost is to file-back it. There is no performance penalty involved, but be aware that the data then reside at the designated location on your hard drive and remain there until you delete them explicitly; even closing all R sessions won't delete them. The finalizer is deactivated in this case.
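A minimal sketch of the file-backed approach (the backing and descriptor file names are just examples):
library(bigmemory)
m <- read.big.matrix("matrix.txt", type = "double", header = FALSE, sep = "\t",
                     backingfile = "matrix.bin", descriptorfile = "matrix.desc")
# Any other session, including one started later by Rscript, can attach it:
m2 <- attach.big.matrix("matrix.desc")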
So I just finished doing some heavy lifting with R on a ~200 GB dataset, for which I used the following packages (I don't know if it's relevant or not):
library(stringdist)
library(tidyverse)
library(data.table)
Afterwards I want to clear memory so I can move on to the next step, for this I use:
remove(list = ls())  # drop all objects from the global environment
dev.off()            # close the graphics device
gc(full = TRUE)      # force a full garbage collection
cat("\f")            # clear the console
What I am looking for with these commands is "a fresh start", in which my environment is as if I had just opened R for the first time (with the relevant packages loaded).
However, checking my task manager reveals that R is still using ~55 GB of memory. Needless to say, this is way too much for an "empty" R session to occupy, so my guess is that R is holding on to something. Why does this happen, and how can I reduce this memory usage to the few MB R normally uses?
Thanks!
I have rewritten my program many times to avoid hitting memory limits, yet it again fills up the virtual memory (VIRT), which does not make sense to me. I do not keep any objects; I write to disk each time I am done with a calculation.
The code (simplified) looks like
library(parallel)

lapply(foNames,  # folder names like "~/datasets/xyz", "~/datasets/xyy"
  function(foName){
    Filepath <- paste(foName, "somefile.rds", sep = "")
    CleanDataObject <- readRDS(Filepath)  # reads the data
    cl <- makeCluster(CONF$CORES2USE)     # spins up a cluster (it does not matter whether I use
                                          # the cluster or not; the problem is independent of it, imho)
    mclapply(c(1:noOfDataSets2Generate), function(x, CleanDataObject){
      bootstrapper(CleanDataObject)
    }, CleanDataObject)
    stopCluster(cl)
  })
The bootstrap function simply samples the data and saves the sampled data to disk.
bootstrapper <- function(CleanDataObject){
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
  saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))
  return(newCPADataObject)
}
I do not get how this can accumulate to over 60 GB of RAM. The code is highly simplified, but imho there is nothing else that could be problematic. I can paste more code details if needed.
How does R manage to successively eat up my memory, even though I have already rewritten the software to store the generated objects on disk?
I have had this problem with loops in the past. It is more complicated to address in functions and apply calls, but what I have done is use two things in combination to fix the problem.
Within each function that generates temporary objects, use rm(object_name) to remove the temporary object and then run gc(), which forces a garbage collection, before exiting the function. This will slow the process somewhat but reduce memory pressure; this way each iteration of apply will purge before moving on to the next step. With nested functions you may have to go back to your first function to accomplish this well. It takes experimentation to figure out where the system is getting backed up.
I find this to be especially necessary if you use ANY methods from packages built on top of rJava: they are extremely wasteful of resources, R has no way of running garbage collection on the Java heap, and most authors of Java packages do not seem to account for the need to collect in their methods.
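As a minimal sketch of that pattern, applied to the bootstrapper() from the question (the invisible(NULL) return is my own addition, so that the apply call does not accumulate the sampled objects):
bootstrapper <- function(CleanDataObject){
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
  saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))
  rm(newCPADataObject)  # remove the temporary object before leaving the function
  gc()                  # force a garbage collection inside the worker
  invisible(NULL)       # return nothing, so mclapply() does not collect the results
}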
I found lots of questions here asking how to deal with "cannot allocate vector of size **" and tried the suggestions, but I still can't find out why RStudio crashes every time.
I'm using 64-bit R on Windows 10, and my memory.limit() is 16287.
I'm working with a bunch of large data files (mass spectra) that take up 6-7 GB of memory each, so I've been reading in individual files one at a time and saving each as a variable with the XCMS package, like below.
msdata <- xcmsRaw(datafile1, profstep = 0.01, profmethod = "bin", profparam = list(), includeMSn = FALSE, mslevel = NULL, scanrange = NULL)
I do a series of additional operations to clean up the data and make some plots using rawEIC (also in the XCMS package), which increases my memory.size() to 7738.28. Then I remove all the variables I created in my global environment using rm(list = ls()). But when I try to read in a new file, it tells me it cannot allocate a vector of size **Gb. With the empty environment my memory.size() is 419.32, and I also checked with gc() to confirm that the used memory (on the Vcells row) is of the same order as when I first open a new R session.
I couldn't find any information on why R still thinks that something is taking up a bunch of memory when the environment is completely empty. But if I terminate the session and reopen the program, I can import the data file; I just have to re-open the session every single time one data file is processed, which is getting really annoying. Does anyone have suggestions on this issue?
I would like to check the peak memory usage while running code in R. Does anyone know of such a function?
The only thing I have found so far is the mem_change() function from the pryr package, which checks the change in memory before and after running a piece of code.
I work on Linux.
gc() will tell you the maximum memory used. So if you start a new R session, run your code, and then call gc(), you should find what you need. Alternatives include the profiling functions Rprof and Rprofmem, as referenced in James's comment above.
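A minimal sketch of that approach (the rnorm() call just stands in for your real code):
gc(reset = TRUE)  # reset the "max used" counters first
x <- rnorm(1e8)   # ... run your code here ...
rm(x)
gc()              # the "max used" column now shows the peak since the reset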