cannot allocate vector but my environment is empty - r

I found lots of questions here asking how to deal with "cannot allocate vector of size **" and tried the suggestions, but I still can't find out why RStudio crashes every time.
I'm using 64-bit R on Windows 10, and my memory.limit() is 16287.
I'm working with a bunch of large data files (mass spectra) that take up 6-7 GB of memory each, so I've been loading the files one at a time and saving each one as a variable with the XCMS package, like below.
msdata <- xcmsRaw(datafile1,profstep=0.01,profmethod="bin",profparam=list(),includeMSn=FALSE,mslevel=NULL, scanrange=NULL)
I do a series of additional operations to clean up the data and make some plots using rawEIC (also in the XCMS package), which increases my memory.size() to 7738.28. Then I remove all the variables I created in my global environment using rm(list=ls()). But when I try to load a new file, R tells me it cannot allocate a vector of size ** Gb. With the empty environment, my memory.size() is 419.32, and I also checked with gc() to confirm that the used memory (on the Vcells row) is of the same order as when I first open a new R session.
I couldn't find any information on why R still thinks that something is taking up a lot of memory when the environment is completely empty. If I terminate the session and reopen the program, I can import the data file. But that means I have to restart the session every time I finish processing one data file, which is getting really annoying. Does anyone have suggestions on this issue?
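The check described above (clear the object, then read the Vcells row of gc()) can be sketched as follows. This is only a minimal illustration: the vector x is a made-up stand-in for the XCMS object, not the real data.

```r
# Stand-in for a large data object: about 80 MB of doubles.
x <- numeric(1e7)

before <- gc()   # matrix with rows "Ncells" and "Vcells"
rm(x)            # drop the only reference to the vector
after <- gc()    # collect again and report usage

# Column 2 holds the "used" figure in Mb.
vcells_mb_before <- before["Vcells", 2]
vcells_mb_after  <- after["Vcells", 2]
cat("Vcells used (Mb):", vcells_mb_before, "->", vcells_mb_after, "\n")
```

Note that gc() only reports memory that R itself tracks; it says nothing about memory held by compiled code inside packages.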

Related

Issue with applying str_length to a dataframe

I created a simple R Script that is run on a monthly basis by colleagues.
This script brings in a fairly chunky RDS file that has around 2.6M observations and 521 variables.
Against this file the following two commands are run:
Latest$MFU <- substr(Latest$SUB_BUSINESS_UNIT_CODE, 1, 2)
Latest$LENGTH <- str_length(Latest$POLICYHOLDER_COMPANY_NAME_LAST_NAME)
This script has run perfectly for the last three years, but today, for some reason, it is failing for all three people tasked with running it, and it has fallen over for me too.
The error message received is
Error: cannot allocate vector of size 10.0 Mb
At first I assumed that their computers were running out of memory, or that they were not using 64-bit R, or that there was some other reason such as not restarting their computers.
It turns out, though, that they have plenty of memory available, have restarted their computers, are using 64-bit R in RStudio, and are all using different versions of RStudio/R.
I tried running the process myself. My computer has 32 GB of RAM and 768 GB of free hard drive space, and I get the same error message.
So it must be a corrupt source file, I figured. I tried last month's file, which ran just fine for everyone last month, and got the same error.
Maybe just try the stringr package instead, then, and work around the problem that way. Nope, no dice, exact same error message.
I have to admit I'm stumped. I have tried gc(), tried previous versions of the file, and tried cutting the file in half and running it that way. It just flat out refuses to run.
Anyone know of an alternative to stringr/base R commands to get the length of a character string as a new variable and to get a substring as a new variable?
What about rm(list=ls()) before running, and memory.limit(size = 16265*4) (or another big number) ?

R how to clear memory

Here's my problem:
I'm working on a Linux system and run R in a console.
Within a loop in an R program, I load a really huge data file, several GB in size.
Pseudocode:
for(i in 1:n){
  data=read(Hugefile[i])
  ...
  # do some stuff
  ...
  rm(data)
}
When R is freshly started from the console, the first iteration loads the data successfully. But in the second iteration, I get an allocation error. Even when I clear everything using rm(list=ls()) and gc(), I get the same error when trying to load this file manually. Only after I close R and open it again can I load another file of that size.
Does anyone know how to clear the memory of R within a loop and without restarting R?
Thanks for your help :)
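For reference, a runnable version of the pseudocode loop above might look like the sketch below. It assumes the files are RDS files read with readRDS() (the question does not say which reader is used), and it creates three small temporary files as stand-ins for the huge ones.

```r
# Create three small stand-in files; the real files would be several GB each.
paths <- replicate(3, tempfile(fileext = ".rds"))
for (p in paths) saveRDS(rnorm(1e5), p)

results <- numeric(length(paths))
for (i in seq_along(paths)) {
  dat <- readRDS(paths[i])    # load one file
  results[i] <- mean(dat)     # "do some stuff", keeping only a small result
  rm(dat)                     # drop the reference to the big object
  invisible(gc())             # ask R to run a garbage collection now
}
unlink(paths)                 # clean up the stand-in files
```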

Why does R keep using so much memory after clearing all the environment?

So I just finished doing some heavy lifting with R on a ~200 GB dataset, for which I used the following packages (I don't know whether that's relevant or not):
library(stringdist)
library(tidyverse)
library(data.table)
Afterwards I want to clear memory so I can move on to the next step, for this I use:
remove(list = ls())
dev.off()
gc(full = T)
cat("\f")
What I am looking for with these commands is "a fresh start", in which my environment is as if I had just opened R for the first time (and loaded the relevant packages).
However, checking my task manager reveals that R is still using ~55 GB of memory. Needless to say, this is way too much for an "empty" R to occupy. So my guess is that R is holding on to something. Why does this happen, and how can I reduce this memory usage to the few MB that R normally uses?
Thanks!

R occupying virtual Memory completely

I rewrote my program many times to not hit any memory limits. It again takes up full VIRT which does not make any sense to me. I do not save any objects. I write to disk each time I am done with a calculation.
The code (simplified) looks like
lapply(foNames, # these are just folder names like ["~/datastes/xyz","~/datastes/xyy"]
  function(foName){
    Filepath <- paste(foName,"somefile.rds",sep="")
    CleanDataObject <- readRDS(Filepath) # reads the data
    cl <- makeCluster(CONF$CORES2USE) # spins up a cluster (it does not matter whether I use the cluster or not; the problem is independent of it, imho)
    mclapply(c(1:noOfDataSets2Generate),function(x,CleanDataObject){
      bootstrapper(CleanDataObject)
    },CleanDataObject)
    stopCluster(cl)
  })
The bootstrap function simply samples the data and saves the sampled data to disk.
bootstrapper <- function(CleanDataObject){
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo="sha1")
  saveRDS(newCPADataObject, paste(newCPADataObject$sha1 ,".rds", sep = "") )
  return(newCPADataObject)
}
I do not get how this can now accumulate to over 60 GB of RAM. The code is highly simplified but imho there is nothing else which could be problematic. I can paste more code details if needed.
How does R manage to successively eat up my memory, even though I already re-wrote the software to store the generated object on disk?
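One thing that may be worth checking in the simplified code: bootstrapper() returns the whole sampled object, and lapply/mclapply collect every return value into a list, so all generated objects stay in memory even though they were also written to disk. The sketch below is a hypothetical variant, not the original code. The function name bootstrap_to_disk, the row resampling, and the temporary output folder are all assumptions for illustration. It returns only the file path.

```r
# Hypothetical variant of bootstrapper(): write the sample to disk and
# return only the file path, so the caller's result list stays small.
bootstrap_to_disk <- function(clean_data, out_dir) {
  # Resample rows with replacement (an assumption about what "sample" means here).
  rows <- sample(nrow(clean_data), replace = TRUE)
  resampled <- clean_data[rows, , drop = FALSE]
  path <- tempfile(pattern = "boot_", tmpdir = out_dir, fileext = ".rds")
  saveRDS(resampled, path)
  path  # a short character string instead of the full object
}

out_dir <- tempfile("boot_out_")
dir.create(out_dir)
clean <- data.frame(x = rnorm(100), y = runif(100))  # stand-in data

# The collected results are five file paths, not five data frames.
paths <- vapply(1:5, function(i) bootstrap_to_disk(clean, out_dir), character(1))
```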
I have had this problem with loops in the past. It is more complicated to address in functions and apply.
What I have done is use two things in combination to fix the problem.
Within each function that generates temporary objects, use rm(object-name) to remove the temporary object, and then run gc() to force a garbage collection before exiting the function. This will slow the process somewhat, but it reduces memory pressure. This way, each iteration of apply will purge before moving on to the next step. With nested functions, you may have to go all the way back to your first function to accomplish this well. It takes experimentation to figure out where the system is getting backed up.
I find this to be especially necessary if you use ANY methods called from packages built on rJava. It is extremely wasteful of resources, R has no way of running garbage collection on the Java heap, and most authors of Java-based packages do not seem to account for the need to collect in their methods.
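A minimal sketch of the rm() plus gc() pattern described in this answer, applied inside a function that is called from lapply. The function process_one and its inputs are made up for illustration.

```r
# Hypothetical worker: builds a temporary object, keeps only a small result.
process_one <- function(n) {
  tmp <- rnorm(n)     # temporary object, stands in for data read from disk
  out <- mean(tmp)    # the small value we actually want to keep
  rm(tmp)             # remove the temporary object
  gc()                # force a garbage collection before leaving the function
  out
}

# Each call purges its temporary object before the next one starts.
res <- lapply(c(1e4, 1e5, 1e6), process_one)
```

The explicit gc() call costs some time on every iteration, which matches the slowdown mentioned above.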

Speed up RData load

I've checked several related questions, such as this one:
How to load data quickly into R?
I'm quoting specific part of the most rated answer
It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your harddrive, not R. The binary form is the internal representation of the dataframe in the workspace, so there is no transformation needed anymore.
I really thought so too. However, life is about experimenting. I have a 1.22 GB file containing an igraph object. That said, I don't think what I found here is related to the object class, mainly because you can load('file.RData') even before you call "library".
The disks in this server are pretty fast, as you can see from the time it takes to read the file into memory:
user#machine data$ pv mygraph.RData > /dev/null
1.22GB 0:00:03 [ 384MB/s] [==================================>] 100%
However when I load this data from R
>system.time(load('mygraph.RData'))
user system elapsed
178.533 16.490 202.662
So it seems that loading *.RData files is about 60 times slower than the disk limit, which should mean that R is actually doing some work during "load".
I've had the same feeling using different R versions on different hardware; it's just that this time I had the patience to do some benchmarking (mainly because, with such fast disk storage, it was terrible how long the load actually takes).
Any ideas on how to overcome this?
After ideas in answers
save(g,file="test.RData",compress=F)
Now the file is 3.1GB, against 1.22GB before. In my case, loading the uncompressed file is a bit faster (disk is not my bottleneck by far)
> system.time(load('test.RData'))
user system elapsed
126.254 2.701 128.974
Reading the uncompressed file into memory takes about 12 seconds, so I can confirm that most of the time is spent setting up the environment.
I'll be back with RDS results; that sounds interesting.
Here we are, as promised
system.time(saveRDS(g,file="test2.RData",compress=F))
user system elapsed
7.714 2.820 18.112
And I get a 3.1GB just like "save" uncompressed, although md5sum is different, probably because save also stores the object name
Now reading...
> system.time(a<-readRDS('test2.RData'))
user system elapsed
41.902 2.166 44.077
So combining both ideas (uncompress and RDS) runs 5 times faster. Thanks for your contributions!
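The comparison above can be reproduced on a small scale with the sketch below. The data frame g is a made-up stand-in for the igraph object, so the timings will be tiny and will not mirror the numbers reported here; the point is only the shape of the benchmark.

```r
# Stand-in object; the real case is a 1.22 GB igraph object.
g <- data.frame(a = rnorm(2e5),
                b = sample(letters, 2e5, replace = TRUE),
                stringsAsFactors = FALSE)

f_rdata <- tempfile(fileext = ".RData")
f_rds   <- tempfile(fileext = ".rds")

save(g, file = f_rdata)                      # save() compresses by default
saveRDS(g, file = f_rds, compress = FALSE)   # uncompressed single-object file

e <- new.env()                               # load() restores objects by name
t_load <- system.time(load(f_rdata, envir = e))
t_rds  <- system.time(h <- readRDS(f_rds))   # readRDS() returns the object

print(rbind(load = t_load, readRDS = t_rds)[, 1:3])
print(c(RData = file.size(f_rdata), rds = file.size(f_rds)))
```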
save compresses by default, so it takes extra time to uncompress the file. Then it takes a bit longer to load the larger file into memory. Your pv example is just copying the compressed data to memory, which isn't very useful to you. ;-)
UPDATE:
I tested my theory and it was incorrect (at least on my Windows XP machine with 3.3GHz CPU and 7200RPM HDD). Loading compressed files is faster (probably because it reduces disk I/O).
The extra time is spent in RestoreToEnv (in saveload.c) and/or R_Unserialize (in serialize.c). So you could make loading faster by changing those files, or maybe by using saveRDS to individually save the objects in myGraph.RData then somehow using readRDS across multiple R processes to load the data into shared memory...
For variables that big, I suspect that most of the time is taken up inside the internal C code (http://svn.r-project.org/R/trunk/src/main/saveload.c). You can run some profiling to see if I'm right. (All the R code in the load function does is check that your file is non-empty and hasn't been corrupted.)
As well as reading the variables into memory, they (amongst other things) need to be stored inside an R environment.
The only obvious way of getting a big speedup in loading variables would be to rewrite the code in a parallel way to allow simultaneous loading of variables. This presumably requires a substantial rewrite of R's internals, so don't hold your breath for such a feature.
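The profiling suggested in this answer could be done with R's built-in sampling profiler, roughly as sketched below. The object and file are small made-up stand-ins. Note that Rprof samples R-level calls, so time spent in internal C code shows up under the R function that called it.

```r
# Stand-in object and file; the real case is a multi-GB .RData file.
obj <- data.frame(a = rnorm(5e5),
                  b = sample(letters, 5e5, replace = TRUE),
                  stringsAsFactors = FALSE)
f <- tempfile(fileext = ".RData")
save(obj, file = f)

prof_file <- tempfile(fileext = ".out")
target <- new.env()

Rprof(prof_file, interval = 0.005)   # start the sampling profiler
load(f, envir = target)              # the call being profiled
Rprof(NULL)                          # stop profiling

# Very short runs may record no samples, in which case summaryRprof() errors.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
unlink(f)
```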
The main reason why RData files take a while to load is that the de-compression step is single-threaded.
The fastSave R package allows using parallel tools for saving and restoring R sessions:
https://github.com/barkasn/fastSave
But it only works on UNIX (You should still be able to open the files on other platforms though).

Resources