R occupying virtual Memory completely - r

I rewrote my program many times to not hit any memory limits. It again takes up full VIRT which does not make any sense to me. I do not save any objects. I write to disk each time I am done with a calculation.
The code (simplified) looks like
lapply(foNames, # these are just folder names like ["~/datastes/xyz","~/datastes/xyy"]
function(foName){
Filepath <- paste(foName,"somefile.rds",sep="")
CleanDataObject <- readRDS(Filepath) # reads the data
cl <- makeCluster(CONF$CORES2USE) # spins up a cluster (it does not matter whether I use the cluster or not; the problem is independent of it, imho)
mclapply(c(1:noOfDataSets2Generate),function(x,CleanDataObject){
bootstrapper(CleanDataObject)
},CleanDataObject)
stopCluster(cl)
})
The bootstrap function simply samples the data and saves the sampled data to disk.
bootstrapper <- function(CleanDataObject){
newCPADataObject <- sample(CleanDataObject)
newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo="sha1")
saveRDS(newCPADataObject, paste(newCPADataObject$sha1 ,".rds", sep = "") )
return(newCPADataObject)
}
I do not get how this can now accumulate to over 60 GB of RAM. The code is highly simplified but imho there is nothing else which could be problematic. I can paste more code details if needed.
How does R manage to successively eat up my memory, even though I already re-wrote the software to store the generated object on disk?

I have had this problem with loops in the past. It is more complicated to address in functions and apply.
But, what I have done is used two things in combination to fix the problem.
Within each function that generates temporary objects, use rm(object_name) to remove the temporary object and then run gc(), which forces a garbage collection, before exiting the function. This will slow the process somewhat, but it reduces memory pressure. This way each iteration of apply purges its temporaries before moving on to the next step. You may have to go back to the innermost of your nested functions to do this well. It takes experimentation to figure out where the system is getting backed up.
I find this to be especially necessary if you use ANY methods called from packages built on rJava. It is extremely wasteful of resources, R has no way of running garbage collection on the Java heap, and most authors of Java packages do not seem to account for the need to collect in their methods.
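The rm() plus gc() pattern described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: bootstrap_to_disk, the data frame, and the output directory are hypothetical stand-ins. It also returns only the file path, because lapply() and mclapply() collect every return value into a list, so returning the full resampled object from each iteration (as the question's bootstrapper does) keeps all of them in memory, which is one thing worth checking.

```r
# Hypothetical stand-in for the question's bootstrapper(): resample, save,
# clean up, and return only the file path instead of the resampled object.
bootstrap_to_disk <- function(clean_data, out_dir) {
  resampled <- clean_data[sample(nrow(clean_data), replace = TRUE), , drop = FALSE]
  out_file <- tempfile(pattern = "boot_", tmpdir = out_dir, fileext = ".rds")
  saveRDS(resampled, out_file)
  rm(resampled)        # drop the large temporary object
  gc(verbose = FALSE)  # ask R to release the memory now
  out_file             # small return value: the caller keeps only this string
}

set.seed(1)
clean_data <- data.frame(x = rnorm(1000), y = runif(1000))  # toy data
out_dir <- file.path(tempdir(), "bootstraps")               # assumed location
dir.create(out_dir, showWarnings = FALSE)

# Five bootstrap samples; 'paths' holds five short strings, not five data frames.
paths <- vapply(1:5, function(i) bootstrap_to_disk(clean_data, out_dir), character(1))
```

Whether this removes the growth in a given program depends on where the memory is actually held, so it is a starting point for experimentation rather than a guaranteed fix.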

Related

R bigmemory: how to access matrix/keep in shared memory after script got executed

I use the bigmemory package to put a very large matrix into shared memory (see script below), so it can be accessed in parallel by scripts in other R sessions.
I now want to execute the script in a non-interactive way. The problem is that if I run it with Rscript, the matrix is removed from shared memory right after the Rscript process ends. I could add Sys.sleep(99999) to the end of the script, but I am wondering if there is any better way to accomplish this. Any ideas?
library(bigmemory)
m = read.big.matrix("matrix.txt", type='double', shared = TRUE, header = FALSE, sep = "\t")
sign = describe(m)
dput(sign, "matrix.signature")
If you have the descriptor sign on disk, then you can just use attach.big.matrix() in another session:
m <- attach.big.matrix("matrix.signature")
As long as the matrix is attached in at least one R session it remains in an isolated part of the RAM. The best practice to prevent the matrix from being lost is to fileback it. There is no performance penalty involved, but you have to be aware that in the designated location on your hard drive, these data reside and remain there until you delete them explicitly. Even closing all R sessions won't delete them. The finalizer is deactivated in this case.
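A file-backed version of the question's script might look like the sketch below. It assumes the bigmemory package is installed; the file names matrix.bin and matrix.desc and the use of tempdir() are illustrative choices, and the small input file is generated only so the example runs on its own. In practice the backing directory would be a permanent location, since tempdir() is removed when the session ends.

```r
library(bigmemory)

# Write a small tab-separated matrix so the sketch is self-contained.
store <- tempdir()  # assumption: use a permanent directory in real use
txt <- file.path(store, "matrix.txt")
write.table(matrix(as.numeric(1:12), nrow = 3), txt,
            sep = "\t", row.names = FALSE, col.names = FALSE)

# Read into a file-backed big.matrix; the data live in matrix.bin on disk
# and the descriptor is written to matrix.desc in the same directory.
m <- read.big.matrix(txt, sep = "\t", header = FALSE, type = "double",
                     backingfile = "matrix.bin",
                     descriptorfile = "matrix.desc",
                     backingpath = store)

# Later, from any other R session: attach via the descriptor file.
m2 <- attach.big.matrix(file.path(store, "matrix.desc"))
```

With this layout no process has to stay alive to keep the matrix available; the backing file persists until it is deleted explicitly.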

how to avoid filling the RAM when doing multiprocessing in R (future)?

I am using furrr which is built on top of future.
I have a very simple question. I have a list of files, say list('/mydata/file1.csv.gz', '/mydata/file1.csv.gz'), and I am processing them in parallel with a simple function that loads the data, does some filtering, and writes the result to disk.
In essence, my function is
processing_func <- function(file){
mydata <- readr::read_csv(file)
mydata <- mydata %>% dplyr::filter(var == 1)
data.table::fwrite(mydata, 'myfolder/processed.csv.gz')
rm()
gc()
}
and so I am simply running
listfiles %>% furrr::future_map(., processing_func(.x))
This works, but despite my gc() and rm() calls, the RAM keeps filling up until the session crashes.
What is the conceptual issue here? Why would some residual objects remain somehow in memory when I explicitly discard them?
Thanks!
You can try using a callr future plan, it may be less memory hungry.
As quoted from the future.callr vignette
When using callr futures, each future is resolved in a fresh background R session which ends as soon as the value of the future has been collected. In contrast, multisession futures are resolved in background R worker sessions that serve multiple futures over their life spans. The advantage with using a new R process for each future is that it is that the R environment is guaranteed not to be contaminated by previous futures, e.g. memory allocations, finalizers, modified options, and loaded and attached packages. The disadvantage, is an added overhead of launching a new R process
library("future.callr")
plan(callr)
Assuming you're using 64-bit R on Windows, R is only bound by the available RAM by default. You can use memory.limit() to increase the amount of memory your R session can use. The line "memory.limit(50*1024)" would allow your R session to use 50GB of memory. (memory.limit() is Windows-only, and from R 4.2.0 onward it is a stub that no longer changes anything.) Also, R automatically calls gc() whenever it's running low on space, so that line isn't helping you.
With future multisession:
future::plan(future::multisession)
processing_func <- function(file){
readr::read_csv(file) |>
dplyr::filter(var == 1) |>
data.table::fwrite('...csv.gz')
gc()
}
listfiles |> furrr::future_walk(processing_func)
Note that I am:
Not creating any variables in processing_func, so there is nothing to rm.
Using furrr::future_walk rather than future_map, as we don't need the resolved value.
Using gc() inside the future.
Passing files to functions in futures is a nice way to parallelize things. I also like to use multicore instead of multisession to share some objects from the parent environment.
It seems like these sessions run out of memory if you aren't careful. A gc call in the future function seems to help pretty often.

Why does loading saved R file increase CPU usage?

I have an R script that I want to run frequently. A few months ago, when I wrote it and first ran it, there was no problem.
Now my script is consuming almost all (99%) of the CPU and it is slower than it used to be. I am running the script on a server, and other users experience slow responses from the server while the script is running.
I tried to find the piece of code where it is slow. The following loop takes almost all of the time and CPU used by the script.
for (i in 1:100){
load (paste (saved_file, i, ".RData", sep=""))
# Do something (which is fast)
assign (paste ("var", i, sep=""), vector)
}
The loaded data is about 11 MB in each iteration. When I run the above script for an arbitrary "i", the file-loading step takes longer than the other commands.
I spent a few hours reading forum posts but could not find any hint about my problem. It would be great if you could point out whether there is something I am missing, or suggest a more effective way to load a file in R.
EDIT: Added space in the codes to make it easier to read.
paste(saved_file, i, ".RData", sep = "")
This loads an object at each iteration, with names xxx1, xxx2, and so on.
Did you try to rm the object at the end of the loop? I guess the object stays in memory, regardless of your variable being reused.
Just a tip: add spaces in your code (like I did); it's much easier to read and debug.
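One way to apply the rm suggestion is to load each file into a throwaway environment and discard that environment at the end of the iteration. The sketch below is an assumption-laden illustration: the file prefix, the object name big_object, and the "fast" step (taking a length) are all hypothetical, and the three small files are created only so the loop can run on its own.

```r
# Create three small .RData files so the sketch is runnable by itself.
saved_file <- file.path(tempdir(), "chunk_")  # hypothetical file prefix
for (i in 1:3) {
  big_object <- rnorm(1000)
  save(big_object, file = paste0(saved_file, i, ".RData"))
}
rm(big_object)

results <- numeric(3)
for (i in 1:3) {
  scratch <- new.env()  # loaded objects land here, not in the workspace
  loaded <- load(paste0(saved_file, i, ".RData"), envir = scratch)
  # Stand-in for the fast "do something" step: keep only a small summary.
  results[i] <- length(scratch[[loaded[1]]])
  rm(scratch)           # the loaded data become collectable again
}
```

Because load() returns the names of the objects it restored, the loop does not need to know in advance what each file contains.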

Speed up RData load

I've checked several related questions, such as this one:
How to load data quickly into R?
I'm quoting specific part of the most rated answer
It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your harddrive, not R. The binary form is the internal representation of the dataframe in the workspace, so there is no transformation needed anymore
I really thought that. However, life is about experimenting. I have a 1.22 GB file containing an igraph object. That said, I don't think what I found here is related to the object class, mainly because you can load('file.RData') even before you call "library".
The disks in this server are pretty fast, as you can see from the time it takes to read the file into memory:
user#machine data$ pv mygraph.RData > /dev/null
1.22GB 0:00:03 [ 384MB/s] [==================================>] 100%
However when I load this data from R
>system.time(load('mygraph.RData'))
user system elapsed
178.533 16.490 202.662
So it seems loading *.RData files is 60 times slower than disk limits, which should mean R actually does something while "load".
I've had the same impression using different R versions on different hardware; it's just that this time I had the patience to do some benchmarking (mainly because, with such fast disk storage, it was terrible how long the load actually takes).
Any ideas on how to overcome this?
After ideas in answers
save(g,file="test.RData",compress=F)
Now the file is 3.1GB against 1.22GB before. In my case, loading the uncompressed file is a bit faster (disk is not my bottleneck by far).
> system.time(load('test.RData'))
user system elapsed
126.254 2.701 128.974
Reading the uncompressed file into memory takes about 12 seconds, so I can confirm that most of the time is spent setting up the environment.
I'll be back with RDS results; that sounds interesting.
Here we are, as promised:
system.time(saveRDS(g,file="test2.RData",compress=F))
user system elapsed
7.714 2.820 18.112
And I get a 3.1GB just like "save" uncompressed, although md5sum is different, probably because save also stores the object name
Now reading...
> system.time(a<-readRDS('test2.RData'))
user system elapsed
41.902 2.166 44.077
So combining both ideas (uncompress and RDS) runs 5 times faster. Thanks for your contributions!
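The comparison above can be reproduced on a small scale with a sketch like the following. The data frame g here is a made-up stand-in for the igraph object, so the absolute timings and the size of any speedup will differ from the numbers reported in the question.

```r
# Toy object standing in for the large graph from the question.
g <- data.frame(id = seq_len(2e5), value = rnorm(2e5))

f_rdata   <- tempfile(fileext = ".RData")
f_rds     <- tempfile(fileext = ".rds")
f_rds_raw <- tempfile(fileext = ".rds")

save(g, file = f_rdata)                   # compressed by default
saveRDS(g, f_rds)                         # compressed by default
saveRDS(g, f_rds_raw, compress = FALSE)   # no compression

# user, system and elapsed time for each way of reading the object back
timings <- rbind(
  load_compressed      = system.time(load(f_rdata, envir = new.env()))[1:3],
  readRDS_compressed   = system.time(readRDS(f_rds))[1:3],
  readRDS_uncompressed = system.time(readRDS(f_rds_raw))[1:3]
)
sizes <- file.size(c(f_rdata, f_rds, f_rds_raw))

print(timings)
print(sizes)
```

The uncompressed file is larger on disk, so whether it loads faster depends on whether decompression or disk I/O is the bottleneck on a given machine.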
save compresses by default, so it takes extra time to uncompress the file. Then it takes a bit longer to load the larger file into memory. Your pv example is just copying the compressed data to memory, which isn't very useful to you. ;-)
UPDATE:
I tested my theory and it was incorrect (at least on my Windows XP machine with 3.3Ghz CPU and 7200RPM HDD). Loading compressed files is faster (probably because it reduces disk I/O).
The extra time is spent in RestoreToEnv (in saveload.c) and/or R_Unserialize (in serialize.c). So you could make loading faster by changing those files, or maybe by using saveRDS to individually save the objects in myGraph.RData and then somehow using readRDS across multiple R processes to load the data into shared memory...
For variables that big, I suspect that most of the time is taken up inside the internal C code (http://svn.r-project.org/R/trunk/src/main/saveload.c). You can run some profiling to see if I'm right. (All the R code in the load function does is check that your file is non-empty and hasn't been corrupted.)
As well as reading the variables into memory, they (amongst other things) need to be stored inside an R environment.
The only obvious way of getting a big speedup in loading variables would be to rewrite the code in a parallel way to allow simultaneous loading of variables. This presumably requires a substantial rewrite of R's internals, so don't hold your breath for such a feature.
The main reason why RData files take a while to load is that the de-compression step is single-threaded.
The fastSave R package allows using parallel tools for saving and restoring R sessions:
https://github.com/barkasn/fastSave
But it only works on UNIX (You should still be able to open the files on other platforms though).

Cache expensive operations in R

A very simple question:
I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.
This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row databases) that I'd better cache in the R environment rather than re-run every time I run the script (which is usually many times as I progress and test the new lines of code).
Is there a way to cache what a script does up to a certain point so every time I am only running the incremental lines of code (just as I would do by running R interactively)?
Thanks.
## load the file from disk only if it
## hasn't already been read into a variable
if(!exists("mytable")){
mytable=read.csv(...)
}
Edit: fixed typo - thanks Dirk.
Some simple ways are doable with some combinations of
exists("foo") to test if a variable exists, else re-load or re-compute
file.info("foo.Rd")$ctime, which you can compare to Sys.time(): if the file is newer than a given age you can load it, else recompute.
There are also caching packages on CRAN that may be useful.
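The file-timestamp idea above could be wrapped in a small helper along these lines. This is only a sketch: cached_read, slow_step, and the one-hour default are hypothetical names and values, and it uses the file's modification time (mtime) and an .rds cache file rather than the ctime and file name mentioned in the answer.

```r
# Return the cached value if the cache file is fresh enough,
# otherwise run compute(), store the result, and return it.
cached_read <- function(cache_file, compute, max_age_secs = 3600) {
  if (file.exists(cache_file)) {
    age <- as.numeric(difftime(Sys.time(), file.info(cache_file)$mtime,
                               units = "secs"))
    if (age <= max_age_secs) return(readRDS(cache_file))
  }
  value <- compute()
  saveRDS(value, cache_file)
  value
}

cache_file <- tempfile(fileext = ".rds")  # assumed cache location
calls <- 0
slow_step <- function() {                 # stand-in for read.csv / reshape
  calls <<- calls + 1
  Sys.sleep(0.1)
  data.frame(a = 1:3)
}

first  <- cached_read(cache_file, slow_step)  # computes and writes the cache
second <- cached_read(cache_file, slow_step)  # served from the cache file
```

Unlike the exists() check, this survives restarting R, at the cost of having to decide how old a cache file may be before it is rebuilt.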
After you do something you discover to be costly, save the results of that costly step in an R data file.
For example, if you loaded a csv into a data frame called myVeryLargeDataFrame and then created summary stats from that data frame into a df called VLDFSummary then you could do this:
save(myVeryLargeDataFrame, VLDFSummary,
file="~/myProject/cachedData/VLDF.RData",
compress="bzip2")
The compress option there is optional and to be used if you want to compress the file being written to disk. See ?save for more details.
After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:
load("~/myProject/cachedData/VLDF.RData")
This answer is not editor dependent. It works the same for Emacs, TextMate, etc. You can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you can always know where your RData file came from and be able to recreate it from the source data if needed.
(Belated answer, but I began using SO a year after this question was posted.)
This is the basic idea behind memoization (or memoisation). I've got a long list of suggestions, especially the memoise and R.cache packages, in this query.
You could also take advantage of checkpointing, which is also addressed as part of that same list.
I think your use case mirrors my second: "memoization of monstrous calculations". :)
Another trick I use a lot is to store data in memory-mapped files. The nice thing about this is that multiple R instances can access the shared data, so I can have a lot of instances cracking at the same problem.
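To make the memoization idea concrete, here is a hand-rolled sketch in base R. It is not the API of the memoise or R.cache packages; memoize, slow_square, and fast_square are hypothetical names, and the cache key is simply the argument converted to a string, so it only suits functions of a single scalar argument.

```r
# Wrap a one-argument function so repeated calls with the same
# argument return a stored result instead of recomputing it.
memoize <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

slow_square <- function(x) {  # stand-in for an expensive computation
  Sys.sleep(0.2)
  x^2
}
fast_square <- memoize(slow_square)

first_time  <- system.time(a <- fast_square(12))["elapsed"]  # computed
second_time <- system.time(b <- fast_square(12))["elapsed"]  # from cache
```

This cache lives only for the current session; the packages mentioned above also cover persistence and more general argument handling.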
I want to do this too when I'm using Sweave. I'd suggest putting all of your expensive functions (loading and reshaping data) at the beginning of your code. Run that code, then save the workspace. Then, comment out the expensive functions, and load the workspace file with load(). This is, of course, riskier if you make unwanted changes to the workspace file, but in that event, you still have the code in comments if you want to start over from scratch.
Without going into too much detail, I usually follow one of three approaches:
Use assign to assign a unique name for each important object throughout my execution. Then include an if(exists(...)) get(...) at the top of each function to get the value or else recompute it. (same as Dirk's suggestion)
Use cacheSweave with my Sweave documents. This does all the work for you of caching computations and retrieves them automatically. It's really trivial to use: just use the cacheSweave driver and add this flag to each block: <<..., cache=true>>=
Use save and load to save the environment at crucial moments, again making sure that all names are unique.
The 'mustashe' package is great for this kind of problem. In addition to caching the results, it also can include links to dependencies so that the code is re-run if the dependencies change.
Disclosure: I wrote this tool ('mustashe'), though I do not make any financial gains from others using it. I made it for this exact purpose for my own work and want to share it with others.
Below is a simple example. The foo variable is created and "stashed" for later. If the same code is re-run, the foo variable is loaded from disk and added to the global environment.
library(mustashe)
stash("foo", {
foo <- some_long_running_opperation(1e3)
})
#> Stashing object.
The documentation has additional examples of more complex use-cases and a detailed explanation of how it works under the hood.

Resources