Locating temporary files from raster processes in R: 140 GB missing - r

I recently ran a script that was meant to stack multiple large rasters and run a randomForest classification on the stack. I've done this numerous times with success, though it always consumes a tremendous amount of storage.
I'm aware of the ways to check and clear the temporary folder used by the raster package: rasterTmpFile(prefix='r_tmp_'), showTmpFiles(), removeTmpFiles(h=24), tmpDir().
Typically, when the process is complete and I no longer need the temp files, I go to the folder and delete them. Last night the process ran and 140 GB of storage space was consumed, but there is no temp data (in the raster tmp folder or anywhere else I have looked), and the files were never written to .tif.
I do not understand what is happening. Where is the data? How can I remove it?
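A minimal sketch of where to look, assuming the raster package's default temp-file handling and using only the functions mentioned above plus base R's tempdir():
library(raster)
# Where raster is currently writing temporary files for this session
rasterOptions()          # prints tmpdir, chunksize, maxmemory, etc.
tmpDir()                 # the temp directory the raster package is using
# List raster temp files and remove the ones no longer needed
showTmpFiles()
removeTmpFiles(h = 0)    # h = 0 removes all raster temp files, regardless of age
# Base R's per-session temp directory; raster's tmpdir usually sits under it,
# and a crashed or interrupted session gets a new one with a different name
tempdir()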

Related

RDS format weighs more than the CSV for the same dataframe

So, I saved a dataframe in both CSV and RDS formats, but the RDS one weighs significantly more than the CSV alternative (40 GB vs. 10 GB). According to this blog:
[RDs format] creates a serialized version of the dataset and then saves it with gzip compression
So, if the RDS data is compressed while the CSV is not, why is the RDS version so much heavier? I would understand the difference if the dataset were small, but it is 140,000 by 42,000, so fixed overheads shouldn't be the issue.
What command did you use to save the file as RDS? If you used readr::write_rds(), then the RDS file is not compressed by default.
write_rds() does not compress by default as space is generally cheaper than time.
(https://readr.tidyverse.org/reference/read_rds.html)
From this article (https://waterdata.usgs.gov/blog/formats/) it seems that uncompressed RDS files are about 20 times bigger, so this could explain the difference in size that you see.
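A minimal sketch of the difference (the compress arguments are the documented options for base saveRDS() and readr::write_rds(); the file names are made up):
df <- data.frame(x = rnorm(1e6), y = sample(letters, 1e6, replace = TRUE))
# Base R: gzip compression by default
saveRDS(df, "df_base.rds")
# readr: no compression by default; ask for it explicitly
library(readr)
write_rds(df, "df_readr_raw.rds")                   # uncompressed
write_rds(df, "df_readr_gz.rds", compress = "gz")   # gzip-compressed
file.size(c("df_base.rds", "df_readr_raw.rds", "df_readr_gz.rds"))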
So, I believe this is an issue related to integer overflow in R when computing the indices of the new dataframe. Although I could not find any reference to overflow as a possible cause of such errors in the documentation, I did run into similar issues with Python, whose docs do indicate overflow as a possible cause. I couldn't find any other way of fixing this and had to reduce the size of my dataset, after which everything worked fine.

Read only rows meeting a condition from a compressed RData file in a Shiny App?

I am trying to make a Shiny app that can be hosted for free on shinyapps.io. Free hosting requires that all uploaded data/code be <1 GB, and that the app use <1 GB of memory at any point while running.
The data
The underlying data (that I'm uploading) is 1000 iterations of a network with ~3050 nodes. Each interaction between nodes (~415,000 interactions per network) has 9 characteristics--of the origin, destination, and the interaction itself--that I need to keep track of. The app needs to read in data from all 1000 networks for user-selected node(s) meeting user-input criteria (those 9 characteristics) and summarize it (in a map & table). I can use 1000 one-per-network RData files (more on format below) and the app works, but it takes ~10 minutes to load, and I'd like to speed that up.
A couple notes about what I've done/tried, but I'm not tied to any of this if you have better ideas.
The data is too large to store as CSVs (and stay under the 1 GB upload limit), so I've been saving it as RData files of a data.frame with "xz" compression.
To further reduce size, I've turned the data into frequency tables of the 9 variables of interest.
In a desktop version, I created 10 summary files that each contained the data for 100 networks (~5 minutes to load), but these are too large to be read into memory in a free shiny app.
I tried making RData files for each node (instead of splitting by network), but they're too large for the 1GB upload limit.
I'm not sure there are better ways to package the data (but again, happy to hear ideas!), so I'm looking to optimize processing it.
Finally, a question
Is there a way to read only certain rows from a compressed RData file, based on some value (e.g. nodeID)? This post (quickly load a subset of rows from data.frame saved with `saveRDS()`) makes me think that might not be possible because the file is compressed. In looking at other options, awk keeps coming up, but I'm not sure that would work with an RData file (I only seem to see data.frame/data.table/CSV implementations).
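For what it's worth, load() has to decompress and rebuild the entire saved object before anything can be subset, so for a single RData file the pattern is load, filter, then drop the full object. A minimal sketch (the file name, the object name net_data, and the nodeID values are all hypothetical):
selected_nodes <- c(12, 47)              # user-selected node IDs (hypothetical)
e <- new.env()
load("network_001.RData", envir = e)     # the whole object is read into memory here
subset_rows <- e$net_data[e$net_data$nodeID %in% selected_nodes, ]
rm(e); gc()                              # drop the full object as soon as possible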

R code failed with: "Error: cannot allocate buffer"

Compiling an RMarkdown script overnight failed with the message:
Error: cannot allocate buffer
Execution halted
The code chunk that it died on was training a caretEnsemble list of 10 machine learning algorithms. I know it takes a fair bit of RAM and computing time, but I had previously succeeded in running the same code in the console. Why did it fail in RMarkdown? I'm fairly sure that even if it ran out of free RAM, there was enough swap.
I'm running Ubuntu with 3GB RAM and 4GB swap.
I found a blog article about memory limits in R, but it only applies to Windows: http://www.r-bloggers.com/memory-limit-management-in-r/
Any ideas on solving/avoiding this problem?
One reason it may be backing up is that knitr and RMarkdown add a layer of computing complexity on top of things, and that layer takes some memory. The console is the most streamlined implementation.
Also Caret is fat, slow and unapologetic about it. If the machine learning algorithm is complex, the data set is large and you have limited RAM it can become problematic.
Some things you can do to reduce the burden (a short sketch of this workflow follows below):
If there are unused variables in the data set, keep a subset of just the ones you want and then clear the old set from memory using rm(), with the name of the data frame in the parentheses.
After removing variables, run a garbage collect; it reclaims the memory space that your removed variables and interim sets were taking up.
R has no native means of memory purging, so if a function is not written with a garbage collect and you do not do it yourself, all of your past executed refuse persists in memory, making life hard.
To do this, just type gc() with nothing in the parentheses. Also clear out the memory with gc() between the 10 ML runs. And if you import data with XLConnect, the Java implementation is nastily inefficient...that alone could tap your memory; gc() after using it every time.
After setting up the training, testing and validation sets, save the testing and validation files in CSV format on the hard drive, REMOVE THEM from your memory, and run, you guessed it, gc(). Load them again when you need them after the first model.
Once you have decided which of the algorithms to run, try installing their original packages separately instead of running Caret, require() each by name as you get to it, and clean up after each one with detach(package:packagenamehere) followed by gc().
There are two reasons for this.
One, Caret is a collection of other ML algorithms, and it is inherently slower than ALL of them in their native environments. An example: I was running a data set through random forest in Caret; after 30 minutes I was less than 20% done, and it had already crashed twice at about the one-hour mark. I loaded the original independent package and had a completed analysis in about 4 minutes.
Two, if you require, detach and garbage collect, you have less resident memory to worry about bogging you down. Otherwise you have ALL of caret's functions in memory at once...that is wasteful.
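A minimal sketch of that rm()/gc()/write-to-disk workflow (all object, column, and file names are made up for illustration):
# Toy stand-in for the real data
full_data <- data.frame(outcome = rnorm(1000), var1 = rnorm(1000),
                        var2 = rnorm(1000), unused1 = rnorm(1000), unused2 = rnorm(1000))
# 1. Keep only the variables you model on, drop the big frame, garbage collect
model_data <- full_data[, c("outcome", "var1", "var2")]
rm(full_data); gc()
# 2. Split, park the hold-out set on disk, and free the memory until it is needed
idx       <- sample(nrow(model_data), 0.7 * nrow(model_data))
train_set <- model_data[idx, ]
test_set  <- model_data[-idx, ]
write.csv(test_set, "test_set.csv", row.names = FALSE)
rm(test_set); gc()
# ... train the first model on train_set ...
# 3. Reload the hold-out set only when you actually need it
test_set <- read.csv("test_set.csv")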
There are some general things you can do to make it go better that you might not initially think of but that could be useful. Depending on your code they may or may not work, or work to varying degrees, but try them and see where it gets you.
I. Use lexical scoping to your advantage. Run the whole script in a clean RStudio environment and make sure that all of the pieces and parts are living in your workspace. Then garbage collect the remnants. Then go to knitr & RMarkdown and call the pieces and parts from your existing workspace. They are available to you in Markdown under the same RStudio shell, as long as nothing was created inside a loop without saving it to the global environment.
II. In Markdown, set your code chunks up so that you cache the things that would otherwise need to be calculated multiple times, so that they live somewhere ready to be called upon instead of taxing memory repeatedly (a minimal chunk sketch follows point III below).
If you call a column from a data frame, do something as simple as multiplying each observation in it, and save the result back into that same frame, you could end up with as many as three copies in memory. If the file is large, that is a killer. So make a clean copy, garbage collect and cache the pure frame.
Caching intuitively seems like it would waste memory, and done wrong it will, but if you rm() the unnecessary objects from the environment and gc() regularly, you will probably benefit from tactical caching.
III. If things are still getting bogged down, you can try saving results to CSV files on the hard drive and calling them back up as needed, so that they stay out of memory when you do not need all of the data at one time.
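For point II, a minimal sketch of what a cached chunk might look like (the chunk name and objects are made up; knitr writes the cached result to disk and reuses it on later knits instead of recomputing it):
```{r heavy_prep, cache=TRUE}
raw_frame   <- data.frame(value = rnorm(1e6))
clean_frame <- transform(raw_frame, scaled = value / max(value))
rm(raw_frame); gc()   # keep only the clean copy in memory
```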
I am pretty certain that you can set the program up to load and unload libraries, data and results as needed. But honestly, the best thing you can do, based on my own biased experience, is to move away from Caret on big multi-algorithm processes.
I was getting this error when I was inadvertently running the 32-bit version of R on my 64-bit machine.

How to quickly read a large txt data file (5 GB) into R (RStudio) (Centrino 2 P8600, 4 GB RAM)

I have a large data set; one of the files is 5 GB. Can someone suggest how to quickly read it into R (RStudio)? Thanks
If you only have 4 GB of RAM you cannot put 5 GB of data 'into R'. You can alternatively look at the 'Large memory and out-of-memory data' section of the High Performance Computing task view in R. Packages designed for out-of-memory processing, such as ff, may help you. Otherwise you can use Amazon AWS services to buy computing time on a larger computer.
My package filematrix is made for working with matrices while storing them in files in binary format. The function fm.create.from.text.file reads a matrix from a text file and stores it in a binary file without loading the whole matrix into memory. It can then be accessed by parts using the usual subscripting, fm[1:4, 1:3], or loaded into memory as a whole with fm[].
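A minimal sketch of that workflow (the file names are made up; check ?fm.create.from.text.file for the delimiter, header, and row-name defaults, which are not verified here):
library(filematrix)
# Convert the text file to an on-disk binary filematrix without loading it all into RAM
fm <- fm.create.from.text.file(textfilename = "big_table.txt",
                               filenamebase = "big_table_fm")
dim(fm)                  # dimensions without reading the data
block <- fm[1:4, 1:3]    # read only a small block into memory
close(fm)
# Later sessions can reopen the binary files directly
fm  <- fm.open("big_table_fm")
all <- fm[]              # only do this if the whole matrix fits in RAM
close(fm)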

Running two instances of R in order to improve large data reading performance

I would like to read in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 MB), I was wondering if it might be more efficient to open two instances of R, reading in half the CSVs in one instance and half in the other. Then I would write each to a large CSV, read in both CSVs, and combine them into a master CSV. Does anyone know if running two instances of R will improve the time it takes to read in all the CSVs?
I'm using a MacBook Pro (OS X 10.6) with 4 GB RAM.
If the majority of your code execution time is spent reading the files, then it will likely be slower because the two R processes will be competing for disk I/O. But it would be faster if the majority of the time is spent "running a number of operations".
read.table() and related can be quite slow.
The best way to tell if you can benefit from parallelization is to time your R script, and the basic reading of your files. For instance, in a terminal:
time cat *.csv > /dev/null
If the "cat" time is significantly lower, your problem is not I/O bound and you may
parallelize. In which case you should probably use the parallel package, e.g
library(parallel)
csv_files <- c(.....)
my_tables <- mclapply(csv_files, read.csv)
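The combine-and-write step the question describes could then be, assuming all files share the same columns:
master <- do.call(rbind, my_tables)   # my_tables comes from the mclapply() call above
write.csv(master, "master.csv", row.names = FALSE)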
