Speed up RData load - r

I've checked several related questions, such as this one:
How to load data quickly into R?
I'm quoting a specific part of the top-rated answer:
It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your harddrive, not R. The binary form is the internal representation of the dataframe in the workspace, so there is no transformation needed anymore
I really thought that. However, life is about experimenting. I have a 1.22 GB file containing an igraph object. That said, I don't think what I found here is related to the object class, mainly because you can load('file.RData') even before you call library().
The disks in this server are pretty fast, as you can see from the time it takes to read the file into memory:
user#machine data$ pv mygraph.RData > /dev/null
1.22GB 0:00:03 [ 384MB/s] [==================================>] 100%
However when I load this data from R
> system.time(load('mygraph.RData'))
   user  system elapsed
178.533  16.490 202.662
So it seems loading *.RData files is about 60 times slower than the disk limit, which means R must actually be doing something during load().
I've had the same feeling with different R versions on different hardware; it's just that this time I had the patience to do some benchmarking (mainly because with such fast disk storage, it was terrible how long the load actually takes).
Any ideas on how to overcome this?
After the ideas in the answers:
save(g,file="test.RData",compress=F)
Now the file is 3.1GB, against 1.22GB before. In my case, loading the uncompressed file is a bit faster (the disk is not my bottleneck, by far):
> system.time(load('test.RData'))
   user  system elapsed
126.254   2.701 128.974
Reading the uncompressed file into memory takes about 12 seconds, so I can confirm most of the time is spent setting up the environment.
I'll be back with RDS results, which sound interesting.
Here we are, as promised:
> system.time(saveRDS(g,file="test2.RData",compress=F))
   user  system elapsed
  7.714   2.820  18.112
And I get a 3.1GB file, just like with save() uncompressed, although the md5sum is different, probably because save() also stores the object name.
Now reading...
> system.time(a<-readRDS('test2.RData'))
   user  system elapsed
 41.902   2.166  44.077
So combining both ideas (no compression and RDS) runs about 5 times faster. Thanks for your contributions!

save compresses by default, so it takes extra time to uncompress the file. Then it takes a bit longer to load the larger file into memory. Your pv example is just copying the compressed data to memory, which isn't very useful to you. ;-)
UPDATE:
I tested my theory and it was incorrect (at least on my Windows XP machine with a 3.3 GHz CPU and 7200 RPM HDD). Loading compressed files is faster (probably because it reduces disk I/O).
The extra time is spent in RestoreToEnv (in saveload.c) and/or R_Unserialize (in serialize.c). So you could make loading faster by changing those files, or maybe by using saveRDS to save the objects in myGraph.RData individually and then somehow using readRDS across multiple R processes to load the data into shared memory...
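For the per-object idea above, a minimal sketch might look like the following (the object and file names are illustrative assumptions, not from the original post):
# Split an existing workspace file into one uncompressed RDS file per object,
# so each object can later be read back independently (e.g. by separate R processes).
e <- new.env()
load("myGraph.RData", envir = e)
for (nm in ls(e)) {
  saveRDS(get(nm, envir = e), file = paste0(nm, ".rds"), compress = FALSE)
}
# Later, read back only what you need:
g <- readRDS("g.rds")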

For variables that big, I suspect that most of the time is taken up inside the internal C code (http://svn.r-project.org/R/trunk/src/main/saveload.c). You can run some profiling to see if I'm right. (All the R code in the load function does is check that your file is non-empty and hasn't been corrupted.)
As well as reading the variables into memory, they (amongst other things) need to be stored inside an R environment.
The only obvious way of getting a big speedup in loading variables would be to rewrite the code in a parallel way to allow simultaneous loading of variables. This presumably requires a substantial rewrite of R's internals, so don't hold your breath for such a feature.
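If you want to check where the time goes, a minimal profiling sketch (file names are illustrative) would be:
# Profile load() and inspect where the time is attributed; if almost all of it
# shows up under load itself, it is being spent in the internal C code.
Rprof("load-profile.out")
load("mygraph.RData")
Rprof(NULL)
summaryRprof("load-profile.out")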

The main reason why RData files take a while to load is that the de-compression step is single-threaded.
The fastSave R package allows using parallel tools for saving and restoring R sessions:
https://github.com/barkasn/fastSave
But it only works on UNIX (you should still be able to open the files on other platforms, though).

Related

R occupying virtual Memory completely

I rewrote my program many times so as not to hit any memory limits, yet it again takes up the full VIRT, which does not make any sense to me. I do not save any objects; I write to disk each time I am done with a calculation.
The code (simplified) looks like:
lapply(foNames,  # these are just folder names like c("~/datasets/xyz", "~/datasets/xyy")
  function(foName){
    Filepath <- paste(foName, "somefile.rds", sep = "")
    CleanDataObject <- readRDS(Filepath)  # reads the data
    cl <- makeCluster(CONF$CORES2USE)     # spins up a cluster (it does not matter whether
                                          # I use the cluster or not; the problem is
                                          # independent of it, imho)
    mclapply(c(1:noOfDataSets2Generate), function(x, CleanDataObject){
      bootstrapper(CleanDataObject)
    }, CleanDataObject)
    stopCluster(cl)
  })
The bootstrap function simply samples the data and saves the sampled data to disk:
bootstrapper <- function(CleanDataObject){
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
  saveRDS(newCPADataObject, paste(newCPADataObject$sha1, ".rds", sep = ""))
  return(newCPADataObject)
}
I do not get how this can accumulate to over 60 GB of RAM. The code is highly simplified, but IMHO there is nothing else that could be problematic. I can paste more code details if needed.
How does R manage to successively eat up my memory, even though I already re-wrote the software to store the generated object on disk?
I have had this problem with loops in the past. It is more complicated to address in functions and apply.
But what I have done is use two things in combination to fix the problem.
Within each function that generates temporary objects, use rm(object_name) to remove them and then run gc(), which forces a garbage collection, before exiting the function. This will slow the process down some, but it reduces memory pressure. This way each iteration of apply will purge before moving on to the next step. You may have to go back to your first function in nested functions to accomplish this well. It takes experimentation to figure out where the system is getting backed up.
I find this to be especially necessary if you use ANY methods from packages built on top of rJava: they are extremely wasteful of resources, R has no way of running garbage collection on the Java heap, and most authors of Java packages do not seem to account for the need to collect in their methods.
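As a sketch, applying that pattern to the bootstrapper from the question (the rm()/gc() lines and the empty return are the additions; everything else is unchanged):
bootstrapper <- function(CleanDataObject){
  newCPADataObject <- sample(CleanDataObject)
  newCPADataObject$sha1 <- digest::sha1(newCPADataObject, algo = "sha1")
  saveRDS(newCPADataObject, paste0(newCPADataObject$sha1, ".rds"))
  rm(newCPADataObject)  # drop the temporary object
  gc()                  # force garbage collection before returning
  invisible(NULL)       # return nothing, so the caller accumulates nothing
}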

cannot allocate vector but my environment is empty

I found lots of questions here asking how to deal with "cannot allocate vector of size **" and tried the suggestions, but I still can't find out why RStudio crashes every time.
I'm using 64bit R in Windows 10, and my memory.limit() is 16287.
I'm working with a bunch of large data files (mass spectra) that take up 6-7 GB of memory each, so I've been loading individual files one at a time and saving each as a variable with the XCMS package, like below:
msdata <- xcmsRaw(datafile1, profstep = 0.01, profmethod = "bin", profparam = list(),
                  includeMSn = FALSE, mslevel = NULL, scanrange = NULL)
I do a series of additional operations to clean up the data and make some plots using rawEIC (also in the XCMS package), which increases my memory.size() to 7738.28. Then I removed all the variables I created in my global environment using rm(list=ls()). But when I try to call in a new file, it tells me it cannot allocate a vector of size **Gb. With the empty environment, my memory.size() is 419.32, and I also checked with gc() to confirm that the used memory (on the Vcells row) is on the same order as when I first open a new R session.
I couldn't find any information on why R still thinks that something is taking up a bunch of memory space when the environment is completely empty. But if I terminate the session and reopen the program, I can import the data file. I just have to re-open the session every single time one data file is processed, which is getting really annoying. Does anyone have suggestions on this issue?

R console unexpectedly slow, lags long after job (PDF output) is finished

When I run a large R script (it works nicely as expected and basically produces a correct PDF at the end; base plotting plus beeswarm, and the last line of the script is dev.off()), I notice that the PDF is finished after ~3 seconds and can even be opened in other applications, long before the console output (merely a few integer values and the echo of ~400 lines of code) is finished (~20 seconds). There are no errors reported. In between, the echo stops and does nothing for seconds.
I work with RStudio v0.97.551 and R 3.0.1 on Windows 7.
gc(), and closing and restarting R, did not help, and the data structures used are not big anyway (5 data frames with up to 60 obs and 64 numeric or short character variables). The available memory should be sufficient (around 4 GB throughout, according to the task manager), but the CPU is busy during that time.
I agree this is not reproducible for other people without the script, which is however too large to post, but maybe someone has experienced the same problem or even has an explanation or a suggestion of what to check? Thanks in advance!
EDIT:
I ran exactly the same code directly in R 3.0.1 (without RStudio), and the problem was gone, suggesting the problem is related to RStudio. I added the RStudio tag, but I am not sure if I am now supposed to move this question somewhere else?
Recently I came across a similar problem: running from RStudio becomes very slow, even when it is executing something as simple as example('plot'). After searching around, this post pointed me to the right place, which eventually led to a workaround: resetting RStudio by renaming the RStudio-Desktop directory. The exact way to do so depends on the OS you are using, and you can find detailed instructions here. I just tried it, and it works.

write.table(...,append=T) : Cannot open the connection

I'm wondering if anyone else has ever encountered this problem. I'm writing a fairly small amount of data to a csv file. It's about 30 lines, 50 times.
I'm using a for loop to write data to the file.
It seems "finicky": sometimes the operation completes successfully, and other times it stops after the first ten iterations (300 lines), other times after 3, or 5... telling me
"cannot open connection".
I imagine it is some type of timeout. Is there a way to tell R to "slow down" when writing tables?
Before you ask: there's just too much code to provide an example here.
Code would help, despite your objections. R has a fixed-size connection pool and I suspect you are running out of connections.
So make sure you follow the three-step pattern of:
open connection (and check for error as a bonus)
write using the connection
close the connection
I can't reproduce it with R 2.11.1 32-bit on Windows 7 64-bit. For these kinds of things, please provide more info on your system (see e.g. ?R.version, ?Sys.info).
Memory is a lot faster than disk access. 1500 lines are perfectly manageable in memory and can be written to file in one go. If they are different sets of data, add an extra factor variable indicating the set (set1 to set50). All your data is then easily manageable in one data frame, and you avoid having to access the disk many times; a sketch of that idea is shown below.
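As a minimal sketch of the single-data-frame approach (the data and file path are illustrative stand-ins):
# Collect the 50 chunks in a list, tag each with a 'set' factor,
# combine them, and write everything to disk once at the end.
allData <- vector("list", 50)
for (i in 1:50) {
  chunk <- data.frame(x = rnorm(30), y = runif(30))  # stand-in for the real data
  chunk$set <- paste0("set", i)
  allData[[i]] <- chunk
}
combined <- do.call(rbind, allData)
write.table(combined, file = "C:/Mydir/Myfile.txt")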
In case you really do need 50 separate writes, this code illustrates the valuable advice of Dirk:
for(i in 1:50){
  ...
  ff <- file("C:/Mydir/Myfile.txt", open = "at")
  write.table(myData, file = ff)
  close(ff)
}
See also the help: ?file
EDIT: you should use open="at" instead of open="wt". "at" is append mode; "wt" is write mode. append=T is the same as open="at".
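For completeness, a minimal sketch of the append=TRUE form (same illustrative file path as above):
# Let write.table() manage the connection itself and append on each call;
# col.names = FALSE avoids re-writing the header line on every append.
write.table(myData, file = "C:/Mydir/Myfile.txt",
            append = TRUE, col.names = FALSE)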

Cache expensive operations in R

A very simple question:
I am writing and running my R scripts using a text editor to make them reproducible, as has been suggested by several members of SO.
This approach is working very well for me, but I sometimes have to perform expensive operations (e.g. read.csv or reshape on 2M-row databases) that I'd better cache in the R environment rather than re-run every time I run the script (which is usually many times as I progress and test the new lines of code).
Is there a way to cache what a script does up to a certain point so every time I am only running the incremental lines of code (just as I would do by running R interactively)?
Thanks.
## load the file from disk only if it
## hasn't already been read into a variable
if(!exists("mytable")){
  mytable <- read.csv(...)
}
Edit: fixed typo - thanks Dirk.
Some simple approaches are doable with some combination of
exists("foo") to test whether a variable exists, else re-load or re-compute
file.info("foo.Rd")$ctime, which you can compare to Sys.time() to see whether the cache file is newer than a given age: if so, load it, else recompute (a sketch combining both checks follows below).
There are also caching packages on CRAN that may be useful.
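A minimal sketch combining both checks (the file name, object name, and one-day threshold are illustrative assumptions):
# Reload the cached object only if it is not already in memory and the cache
# file on disk is less than a day old; otherwise recompute and refresh the cache.
cacheFile <- "foo.RData"
if (!exists("foo")) {
  if (file.exists(cacheFile) &&
      difftime(Sys.time(), file.info(cacheFile)$ctime, units = "days") < 1) {
    load(cacheFile)                  # restores 'foo'
  } else {
    foo <- expensiveComputation()    # hypothetical slow step
    save(foo, file = cacheFile)
  }
}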
After you do something you discover to be costly, save the results of that costly step in an R data file.
For example, if you loaded a csv into a data frame called myVeryLargeDataFrame and then created summary stats from that data frame into a df called VLDFSummary then you could do this:
save(myVeryLargeDataFrame, VLDFSummary,
     file="~/myProject/cachedData/VLDF.RData",
     compress="bzip2")
The compress option there is optional and to be used if you want to compress the file being written to disk. See ?save for more details.
After you save the RData file you can comment out the slow data loading and summary steps as well as the save step and simply load the data like this:
load("~/myProject/cachedData/VLDF.RData")
This answer is not editor dependent. It works the same for Emacs, TextMate, etc. You can save to any location on your computer. I recommend keeping the slow code in your R script file, however, so you can always know where your RData file came from and be able to recreate it from the source data if needed.
(Belated answer, but I began using SO a year after this question was posted.)
This is the basic idea behind memoization (or memoisation). I've got a long list of suggestions, especially the memoise and R.cache packages, in this query.
You could also take advantage of checkpointing, which is also addressed as part of that same list.
I think your use case mirrors my second: "memoization of monstrous calculations". :)
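As a minimal sketch of memoisation with the memoise package (the wrapped function is an illustrative stand-in):
library(memoise)

# Wrap the slow function once; repeated calls with the same arguments
# return the cached result instead of recomputing it.
slowRead <- function(path) read.csv(path)
fastRead <- memoise(slowRead)

d1 <- fastRead("big.csv")   # computed and cached
d2 <- fastRead("big.csv")   # returned from the cache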
Another trick I use is memory-mapped files, which I use a lot, to store data. The nice thing about this is that multiple R instances can access shared data, so I can have many instances cracking away at the same problem.
I want to do this too when I'm using Sweave. I'd suggest putting all of your expensive functions (loading and reshaping data) at the beginning of your code. Run that code, then save the workspace. Then, comment out the expensive functions, and load the workspace file with load(). This is, of course, riskier if you make unwanted changes to the workspace file, but in that event, you still have the code in comments if you want to start over from scratch.
Without going into too much detail, I usually follow one of three approaches:
Use assign to assign a unique name to each important object throughout my execution. Then include an if(exists(...)) get(...) at the top of each function to fetch the value or else recompute it (same as Dirk's suggestion); a sketch of this follows after the list.
Use cacheSweave with my Sweave documents. This does all the work for you of caching computations and retrieves them automatically. It's really trivial to use: just use the cacheSweave driver and add this flag to each block: <<..., cache=true>>=
Use save and load to save the environment at crucial moments, again making sure that all names are unique.
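A minimal sketch of that first approach (object names are illustrative):
# Return a cached value from the global environment under a unique name,
# or compute it, cache it there, and return it.
getOrCompute <- function(name, computeFun) {
  if (exists(name, envir = .GlobalEnv)) {
    get(name, envir = .GlobalEnv)
  } else {
    value <- computeFun()
    assign(name, value, envir = .GlobalEnv)
    value
  }
}

summaryStats <- getOrCompute("summaryStats",
                             function() summary(read.csv("big.csv")))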
The 'mustashe' package is great for this kind of problem. In addition to caching the results, it also can include links to dependencies so that the code is re-run if the dependencies change.
Disclosure: I wrote this tool ('mustashe'), though I do not make any financial gains from others using it. I made it for this exact purpose for my own work and want to share it with others.
Below is a simple example. The foo variable is created and "stashed" for later. If the same code is re-run, the foo variable is loaded from disk and added to the global environment.
library(mustashe)
stash("foo", {
  foo <- some_long_running_operation(1e3)
})
#> Stashing object.
The documentation has additional examples of more complex use-cases and a detailed explanation of how it works under the hood.
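The dependency tracking mentioned above is, as far as I understand the package documentation, declared through a depends_on argument; a hedged sketch (check the mustashe docs for the exact interface):
# 'bar' is re-computed whenever the declared dependency 'x' changes,
# otherwise it is loaded from the stash on disk.
x <- 42
stash("bar", depends_on = "x", {
  bar <- x * 2
})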
