saveRDS inflating size of object - r

This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.
Essentially I have a function that pulls a large quantity of data from a DB, cleans and reduces the size and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10mb.
It's then supposed to be saved as an RDS file on AWS S3, where it's retrieved in a production environment to build predictions.
e.g.
db.connection <- db.connection.object
build_model_list <- function(db.connection) {
clean_and_build_models <- function(db.connection, other.parameters) {
get_db_data <- function(db.connection, some.parameters) {# Retrieve db data} ## Externally defined
db.data <- get_db_data()
build_models <- function(db.data, some.parameters) ## Externally defined
clean_data <- function(db.data, some.parameters) {# Cleans and filters data based on parameters} ## Externally defined
clean.data <- clean_data()
lm_model <- function(clean.data) {# Builds lm model based on clean.data} ## Externally defined
lm.model <- lm_model()
return(list(lm.model, other.parameters))} ## Externally defined
looped.model.object <- llply(some.parameters, clean_and_build_models)
return(looped.model.object)}
model.list <- build_model_list()
saveRDS(model.list, "~/a_place/model_list.RDS")
The issue I'm getting is that the 'model.list' object, which is only 10MB in memory, inflates to many GBs when I save it locally as RDS or try to upload it to AWS S3.
I should note that though the function processes very large quantities of data (~ 5 million rows), the data used in the outputs is no larger than a few hundred rows.
Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (as part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.
This however has some big disadvantages.
Firstly, it's trial and error and follows no clear logical order. With frequent crashes and a couple of hours needed to build the list object, it's a very long debugging cycle.
Secondly, it'll mean my main function will be many hundreds of lines long which will make future alterations and debugging much more tricky.
My question to you is:
Has anyone encountered this issue before?
Any hypotheses as to what's causing it?
Has anyone found a logical non-trial-and-error solution to this?
Thanks for your help.

It took a bit of digging but I did actually find a solution in the end.
It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:
https://blogs.oracle.com/R/entry/is_the_size_of_your
It turns out that the lm.object$terms component includes an environment component that references the objects present in the global environment when the model was built. Under certain circumstances, when you call saveRDS, R will try to draw those environment objects into the saved object.
As I had ~0.5GB sitting in the global environment and a list of ~200 lm model objects, this caused the RDS object to inflate dramatically as it was actually trying to compress ~100GB of data.
To test whether this is what's causing the problem, execute the following code:
as.matrix(lapply(lm.object, function(x) length(serialize(x,NULL))))
This will tell you if the $terms component is inflating.
The following code will remove the environmental references from the $terms component:
rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment"))
Be warned though it'll also remove all the global environmental objects it references.
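One way to see the effect in isolation is the sketch below. It is not the original poster's code: `fit_in_function` and the unused `big` vector are hypothetical, chosen so that the model's formula is created inside a function whose environment holds a large object. As far as I know, the global environment itself is serialized only as a reference, so the sketch uses a function environment to make the inflation visible.

```r
# Hypothetical helper: fits a tiny model next to a large, unrelated local object.
fit_in_function <- function() {
  big <- rnorm(1e5)                      # roughly 800 KB, never used by the model
  d <- data.frame(x = 1:10, y = (1:10) + rnorm(10))
  lm(y ~ x, data = d)                    # the formula captures this function's environment
}

fit <- fit_in_function()

# Serialized size of each component of the lm object, in bytes.
component_sizes <- sapply(fit, function(x) length(serialize(x, NULL)))
print(sort(component_sizes, decreasing = TRUE))

# Names of the objects that the captured environment still holds.
captured <- ls(attr(fit$terms, ".Environment"))
print(captured)
```

If the environment is the culprit, the components that carry it (here `terms`, and `model` through its own terms attribute) should dwarf small components such as `coefficients`.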

For model objects you could also simply delete the reference to the environment.
For example:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
attr(lm.D9$terms, ".Environment") <- NULL
saveRDS(lm.D9, file = "path_to_save.RDS")
This unfortunately breaks the model, but you can add an environment manually after loading it again.
lm.D9 <- readRDS("path_to_save.RDS")
attr(lm.D9$terms, ".Environment") <- globalenv()
This helped me in my specific use case and looks a bit safer to me...
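A round trip of this approach might look like the sketch below. The helper names `strip_model_env`, `restore_model_env` and `build_fit` are hypothetical, and the file is written to a temporary path. The model frame stored in `$model` carries its own copy of the terms attribute, so the sketch clears and restores the environment in both places; for a model fitted with `model = FALSE` only the first assignment applies.

```r
# Hypothetical helpers: drop and re-attach the environment of an lm object.
strip_model_env <- function(fit) {
  attr(fit$terms, ".Environment") <- NULL
  if (!is.null(fit$model)) {
    attr(attr(fit$model, "terms"), ".Environment") <- NULL
  }
  fit
}

restore_model_env <- function(fit, env = globalenv()) {
  attr(fit$terms, ".Environment") <- env
  if (!is.null(fit$model)) {
    attr(attr(fit$model, "terms"), ".Environment") <- env
  }
  fit
}

# A model built inside a function that also holds a large, unrelated object.
build_fit <- function() {
  big <- rnorm(1e5)
  d <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
  lm(y ~ x, data = d)
}

fit <- build_fit()
new_data <- data.frame(x = c(21, 22))
expected <- predict(fit, new_data)

# Save the stripped model, then read it back and re-attach an environment.
path <- tempfile(fileext = ".RDS")
saveRDS(strip_model_env(fit), path)
restored <- restore_model_env(readRDS(path))
actual <- predict(restored, new_data)

print(file.size(path))          # size on disk without the captured environment
print(all.equal(expected, actual))
```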

Neither of these two solutions worked for me.
Instead I have used:
downloaded_object <- storage_download(connection, "path")
read_RDS <- readRDS(downloaded_object)

The answer by mhwh mostly solved my problem, but with the additional step of creating an empty list and copying into it from the model object what was relevant. This might be due to additional (undocumented) environment references associated with using the model class I used.
mm <- felm(formula=formula, data=data, keepX=TRUE, ...)
# Make an empty list and copy into it what we need:
mm_cp <- list()
mm_cp$coefficients <- mm$coefficients
# mm_cp$ <- something else from mm you might need ...
mm_cp$terms <- terms(mm)
attr(mm_cp$terms, ".Environment") <- NULL
saveRDS(mm_cp, file = "path_to_save.RDS")
Then when we need to use it:
mm_cp <- readRDS("path_to_save.RDS")
attr(mm_cp$terms, ".Environment") <- globalenv()
In my case the file went from 5.5G to 13K. Additionally, when reading in the file it used to allocate >32G of memory, more than 6 times the file-size. This also reduced execution time significantly (no need to recreate various environments?).
Environmental references sounds like an excellent contender for a new chapter in the R Inferno book.

Related

How to delete temporary files in parallel task in R

Is it possible to delete temporary files from within a parallelized R task?
I rely on parallelization with doParallel and foreach in R to perform various calculations on small subsets of a huge raster file. This involves cropping a subset of the large raster many times. My basic syntax looks similar to this:
grid <- raster::raster("grid.tif")
data <- raster::raster("data.tif")
cl <- parallel::makeCluster(32)
doParallel::registerDoParallel(cl)
m <- foreach(col=seq_len(ncol(grid))) %:% foreach(row=seq_len(nrow(grid))) %dopar% {
# get extent of subset
cell <- raster::cellFromRowCol(grid, row, col)
ext <- raster::extentFromCells(grid, cell)
# crop main raster to subset extent
subset <- raster::crop(data, ext)
# ...
# perform some processing steps on the raster subset
# ...
# save results to a separate file
saveRDS(subset, paste0("output_folder/", row, "_", col))
}
The algorithm works perfectly fine and achieves what I want it to. However, raster::crop(data, ext) creates a small temporary file every time it is called. This seems to be standard behavior of the raster package, but it becomes a problem because these temp files are only deleted after the whole code has been executed, and take up way too much disk space in the meantime (hundreds of GB).
In a serial execution of the task I can simply delete the temporary file with file.remove(subset@file@name). However, this no longer works when running the task in parallel. Instead, the command is simply ignored and the temp file stays where it is until the whole task is done.
Any ideas as to why this is the case and how I could solve this problem?
There is a function for this: removeTmpFiles.
You should be able to use f <- filename(subset), which avoids reading from slots (@). I do not see why you would not be able to remove it. But perhaps it needs some fiddling with the path?
Temp files are only created when the raster package deems it necessary, based on the RAM available and required. See canProcessInMemory( , verbose=TRUE). The default settings are somewhat conservative, and you can change them with rasterOptions() (memfrac and maxmemory).
Another approach is to provide a filename argument to crop. Then you know what the filename is, and you can delete it. Of course you need to take care of not overwriting data from different tasks, so you may need to use some unique id associated with it.
saveRDS( ) won't work if the raster is backed by a temp file (as the file will disappear).

R not remembering objects written within functions

I'm struggling to clearly explain this problem.
Essentially, something seems to have happened within the R environment: none of the code I write inside my functions works and no data is being saved. If I type a command directly into the console it works (e.g. Monkey <- 0), but if I put it inside a function, the object isn't stored when I run the function.
It could be that I'm missing a glaring error in the code, but I noticed the problem after I accidentally clicked on the debugger and tried to exit out of the browser[1] prompt which appeared.
Any ideas? This is driving me nuts.
corr <- function(directory, threshold=0) {
directory <- paste(getwd(),"/",directory,"/",sep="")
file.list <- list.files(directory)
number <- 1:length(file.list)
monkey <- c()
for (i in number) {
x <- paste(directory,file.list[i],sep="")
y <- read.csv(x)
t <- sum(complete.cases(y))
if (t >= threshold) {
correl <- cor(y$sulfate, y$nitrate, use='pairwise.complete.obs')
monkey <- append(monkey,correl)}
}
#correl <- cor(newdata$sulfate, newdata$nitrate, use='pairwise.complete.obs')
#summary(correl)
}
corr('specdata', 150)
monkey
It's a scoping issue. Each function call creates its own 'environment', which is separate from the global environment.
Using <- will assign in the local environment. To save an object to the global environment, use <<-
Here's some information on R environments.
I suggest you give a look at some tutorial on using functions in R.
Briefly (and sorry for my horrible explanation), objects that you define within a function are ONLY defined within that function, unless you explicitly pass them back, for example with the return() function.
browser() is indeed used for debugging: it keeps you inside the function and lets you access objects created inside it.
In addition, to increase your chances of getting useful answers, I suggest posting a self-contained, working piece of code that lets others quickly reproduce the issue. Here you are reading some files we have no access to.
It seems to me you have to store the output yourself when you run your script:
corr_out <- corr('specdata', 150)
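The difference between a local variable and a returned value can be shown without any data files. In the sketch below, `collect_above` is a hypothetical stand-in for corr(): it builds a vector inside the function and hands it back as the value of the last expression.

```r
# Hypothetical stand-in for corr(): collects values at or above a threshold.
collect_above <- function(values, threshold = 0) {
  kept <- c()                        # exists only while the function runs
  for (v in values) {
    if (v >= threshold) kept <- append(kept, v)
  }
  kept                               # the last expression is the return value
}

# Calling without assignment: the result is not stored anywhere.
collect_above(c(1, 5, 10), threshold = 4)

# Calling with assignment: the result is kept under a name of your choice.
kept_values <- collect_above(c(1, 5, 10), threshold = 4)

# The function's local variable never shows up in the global environment.
local_leaked <- exists("kept", envir = globalenv(), inherits = FALSE)
print(kept_values)
print(local_leaked)
```

Applied to the question, this would mean ending corr() with monkey as its last expression and storing the result of the call, rather than expecting monkey to exist afterwards.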

R script, programmatically batch import multiple csv files as list of data frames (solution)

I'm relatively new to R but experienced in traditional programming languages (e.g., C, Java). I've recently run into the situation where I had so many data files to load that I was spending almost as much time on that one task as I was on the actual analysis. I spent a little time googling this but didn't run across any solutions that I found directly relevant (I might have missed something, I'm impatient that way). Despite that I came up with a simple solution to my problem that I wanted to share with the community in case anyone else found themselves in similar circumstances.
A bit of background info: The data I'm analyzing is real-time performance and diagnostic metrics for an experimental system that is driven by real-time data feeds (i.e., complicated). The upshot is that between trials filenames don't change and the data is written out directly to csv files (I wrote the logging code so I get to be my own best friend like that ;). There are dozens of files generated during a single trial and we have potentially hundreds of trials to look forward to.
I had a few ideas and after playing around with the code a bit I came up with the following solution:
# Create mapping that associates files with a handle that the loader will use to
# generate a named list of data frames (don't even try this on the cmdline)
createDataFileMapping <- function() {
list(
c(file = "file1.csv", descr = "descriptor1"),
c(file = "file2.csv", descr = "descriptor2"),
...
)
}
# Batch load csv files and return as list of data frames
loadTrialData <- function(load.dir, mapping) {
dfList <- list()
for (item in mapping) {
file <- paste(load.dir, item[["file"]], sep = "/")
df <- read.csv(file)
dfList[[ item[["descr"]] ]] <- df
}
return(dfList)
}
Invoking is as simple as loadTrialData("~/data/directory", createDataFileMapping()).
I'm sure there are other ways to solve this problem but the above gets the job done in my case. I'm sure this is slightly less memory-efficient than loading the files directly into data frames in the global environment, and the syntax for passing individual data frames to analysis/plotting functions isn't as elegant as it could be, but I'm not choosy. If you have a more flexible/generalizable solution then please don't hesitate to post!
What you have is sound, I would add only two comments:
Don't worry about extra memory usage, assuming the data frames are of nontrivial size you won't lose much putting them in a big list.
You might add ... as an argument to your function and pass it through to read.csv, so that if another user needs to specify extra arguments because their file wasn't in quite the same format (or wants stringsAsFactors=FALSE or something) then they have the flexibility to do that.
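A minimal sketch of that second suggestion follows, assuming the same mapping structure as in the question. The demo directory and the two small CSV files are made up so that the example runs on its own.

```r
# Variant of loadTrialData that forwards extra arguments to read.csv().
loadTrialData <- function(load.dir, mapping, ...) {
  dfList <- list()
  for (item in mapping) {
    path <- file.path(load.dir, item[["file"]])
    dfList[[item[["descr"]]]] <- read.csv(path, ...)
  }
  dfList
}

# Made-up demo data written to a temporary directory.
demo.dir <- tempfile("trial")
dir.create(demo.dir)
write.csv(data.frame(id = 1:2, label = c("a", "b")),
          file.path(demo.dir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, label = c("c", "d")),
          file.path(demo.dir, "file2.csv"), row.names = FALSE)

mapping <- list(
  c(file = "file1.csv", descr = "descriptor1"),
  c(file = "file2.csv", descr = "descriptor2")
)

# Extra arguments after the mapping are passed straight through to read.csv().
trial <- loadTrialData(demo.dir, mapping, stringsAsFactors = TRUE)
str(trial)
```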

Package a large data set

Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses from the raw data to the results reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when first accessing it. What would be the best approach to package this kind of data so that I can lazy-load at the column level? (Only those columns which I'm actually accessing are loaded, the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?
I think, I would store the data in inst/extdata. Then create a couple of functions in your package that can read and return parts of that data. In your functions you can get the path to your data using: system.file("extdata", "yourfile", package = "yourpackage"). (As on the page you linked to).
The question then is in what format you store your data and how do you obtain selections from it without reading the data in memory. For that, there are a large number of options. To name some:
sqlite: Store your data in a sqlite database. You can then perform queries on this data using the RSQLite package.
ff: store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors). And in theory the files are not cross platform, although as long as you stay on Intel platforms you should be ok.
CSV: store your data in a plain old csv file. You can then make selections from this file using the LaF package. The performance will probably be less than with ff but might be good enough.
RDS: store each of your columns in a separate RDS file (using saveRDS) and load them using readRDS. The advantage is that you do not depend on any R packages, and it is fast. The disadvantage is that you cannot do row selections (but you do not seem to need those).
If you only want to select columns, I would go with RDS.
A rough example using RDS
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
result <- vector("list", length(columns));
for (i in seq_along(columns)) {
col <- columns[i]
fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazyload")
result[[i]] <- readRDS(fn)
}
names(result) <- columns
as.data.frame(result)
}
store_data <- function(package, name, data) {
dir <- file.path(package, "inst", "extdata", name)
dir.create(dir, recursive = TRUE)
for (col in names(data)) {
saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
}
}
packagename <- "lazyload"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation, e.g. delete it) you can do:
library(lazyload)
data <- load_data("iris", "Sepal.Width")
To load the Sepal.Width column of the iris data set.
Of course this is a very simple implementation of load_data: no error handling, it assumes all columns exist, it does not know which columns exist, it does not know which data sets exist.
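Some of those gaps can be closed with a few extra lines. The sketch below is an assumption-laden variant, not part of the answer's package: it works on a plain directory instead of an installed package so that it runs on its own, and the names `store_columns`, `available_columns` and `load_columns` are hypothetical.

```r
# Write each column of a data frame to its own RDS file in `dir`.
store_columns <- function(dir, data) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  for (col in names(data)) {
    saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
  }
  invisible(dir)
}

# List the columns that are available on disk.
available_columns <- function(dir) {
  sub("\\.RDS$", "", list.files(dir, pattern = "\\.RDS$"))
}

# Load only the requested columns, failing early on unknown names.
load_columns <- function(dir, columns) {
  unknown <- setdiff(columns, available_columns(dir))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  result <- lapply(columns, function(col) {
    readRDS(file.path(dir, paste0(col, ".RDS")))
  })
  names(result) <- columns
  as.data.frame(result)
}

iris.dir <- file.path(tempfile("lazy"), "iris")
store_columns(iris.dir, iris)
print(available_columns(iris.dir))
sepal <- load_columns(iris.dir, c("Sepal.Width", "Species"))
str(sepal)
```

Inside a package, the directory would come from system.file("extdata", ...) as in the answer above.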

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (inherits(t.sf, 'try-error'))
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
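On question (ii), one possible simplification is sketched below, as an illustration rather than a drop-in replacement. capture.output() collects the printed metadata as text, so the function does not have to open and close a sink() itself. `write_sidecar` is a hypothetical name, and the git hash and source-file lookup from MetaInfo are left out to keep the example self-contained.

```r
# Hypothetical, reduced variant of MetaInfo: writes <filename>.txt next to an output.
write_sidecar <- function(filename, message = NULL) {
  meta <- list(
    Message = message,
    Time    = format(Sys.time(), "%Y/%m/%d %H:%M:%S"),
    System  = Sys.info(),
    Session = sessionInfo()
  )
  out <- paste0(filename, ".txt")
  # capture.output() returns the printed lines; writeLines() stores them.
  writeLines(capture.output(print(meta)), out)
  invisible(out)
}

# Demo: write the sidecar for a made-up output name in the temp directory.
plot_base <- file.path(tempdir(), "random_plot")
sidecar <- write_sidecar(plot_base, message = "Generated from random_plot data")
cat(head(readLines(sidecar), 5), sep = "\n")
```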
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like this one: Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.

Resources