Can I cache data loading in R? - r

I'm working on a R script which has to load data (obviously). The data loading takes a lot of effort (500MB) and I wonder if I can avoid having to go through the loading step every time I rerun the script, which I do a lot during the development.
I appreciate that I could do the whole thing in the interactive R session, but developing multi-line functions is just so much less convenient on the R prompt.
Example:
#!/usr/bin/Rscript
d <- read.csv("large.csv", header=T) # 500 MB ~ 15 seconds
head(d)
How, if possible, can I modify the script, such that on subsequent executions, d is already available? Is there something like a cache=T statement as in R markdown code chunks?

Sort of. There are a few answers:
Use a faster csv read: fread() in the data.table() package is beloved by many. Your time may come down to a second or two.
Similarly, read once as csv and then write in compact binary form via saveRDS() so that next time you can do readRDS() which will be faster as you do not have to load and parse the data again.
Don't read the data but memory-map it via package mmap. That is more involved but likely very fast. Databases uses such a technique internally.
Load on demand, and eg the package SOAR package is useful here.
Direct caching, however, is not possible.
Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that as clearly reproducible script which make the loading explicit are preferably in our view -- but R can help via the load() / save() mechanism (which lots several objects at once where saveRSS() / readRDS() work on a single object.

Package ‘R.cache’ R.cache
start_year <- 2000
end_year <- 2013
brics_countries <- c("BR","RU", "IN", "CN", "ZA")
indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
"GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")
key <- list(brics_countries, indics, start_year, end_year)
brics_data <- loadCache(key)
if (is.null(brics_data)) {
brics_data <- WDI(country=brics_countries, indicator=indics,
start=start_year, end=end_year, extra=FALSE, cache=NULL)
saveCache(brics_data, key=key, comment="brics_data")
}

I use exists to check if the object is present and load conditionally, i.e.:
if (!exists(d))
{
d <- read.csv("large.csv", header=T)
# Any further processing on loading
}
# The rest of the script
If you want to load/process the file again, just use rm(d) before sourcing. Just be careful that you do not use object names that are already used elsewhere, otherwise it will pick that up and not load.

I wrote up some of the common ways of caching in R in "Caching in R" and published it to R-Bloggers. For your purpose, I would recommend just using saveRDS() or qs() from the 'qs' (quick serialization) package. My package, 'mustashe', uses qs() for reading and writing files, so you could just use mustashe::stash(), too.

Related

Create data tables using SPSS in R

Using expss package I am creating cross tabs by reading SPSS files in R. This actually works perfectly but the process takes lots of time to load. I have a folder which contains various SPSS files(usually 3 files only) and through R script I am fetching the last modified file among the three.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
#get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
#use file.info to get the index of the file most recently modified
pass<-all_sav[with(file.info(all_sav), which.max(mtime))]
mydata = read_spss(pass,reencode = TRUE) # read SPSS file mydata
w <- data.frame(mydata)
args <- commandArgs(TRUE)
Everything is perfect and works absolutely fine but it generally takes too much time to load large files(112MB,48MB for e.g) which isn't good.
Is there a way I can make it more time-efficient and takes less time to create the table. The dropdowns are created using PHP.
I have searched for this and found another library called 'haven' but I am not sure whether that can give me significance as well. Can anyone help me with this? I would really appreciate that. Thanks in advance.
As written in the expss vignette (https://cran.r-project.org/web/packages/expss/vignettes/labels-support.html) you can use in the following way:
# we need to load packages strictly in this order to avoid conflicts
library(haven)
library(expss)
spss_data = haven::read_spss("spss_file.sav")
# add missing 'labelled' class
spss_data = add_labelled_class(spss_data)

R script, programmatically batch import multiple csv files as list of data frames (solution)

I'm relatively new to R but experienced in traditional programming languages (e.g., C, Java). I've recently run into the situation where I had so many data files to load that I was spending almost as much time on that one task as I was on the actual analysis. I spent a little time googling this but didn't run across any solutions that I found directly relevant (I might have missed something, I'm impatient that way). Despite that I came up with a simple solution to my problem that I wanted to share with the community in case anyone else found themselves in similar circumstances.
A bit of background info: The data I'm analyzing is real-time performance and diagnostic metrics for an experimental system that is driven by real-time data feeds (i.e., complicated). The upshot is that between trials filenames don't change and the data is written out directly to csv files (I wrote the logging code so I get to be my own best friend like that ;). There are dozens of files generated during a single trial and we have potentially hundreds of trials to look forward to.
I had a few ideas and after playing around with the code a bit I came up with the following solution:
# Create mapping that associates files with a handle that the loader will use to
# generate a named list of data frames (don't even try this on the cmdline)
createDataFileMapping <- function() {
list(
c(file = "file1.csv", descr = "descriptor1"),
c(file = "file2.csv", descr = "descriptor2"),
...
)
}
# Batch load csv files and return as list of data frames
loadTrialData <- function(load.dir, mapping) {
dfList <- list()
for (item in mapping) {
file <- paste(load.dir, item[["file"]], sep = "/")
df <- read.csv(file)
dfList[[ item[["descr"]] ]] <- df
}
return(dfList)
}
Invoking is as simple as loadTrialData("~/data/directory", createDataFileMapping()).
I'm sure there are other ways to solve this problem but the above gets the job done in my case. I'm sure this is slightly less memory-efficient than loading the files directly into data frames in the global environment, and the syntax for passing individual data frames to analysis/plotting functions isn't as elegant as it could be, but I'm not choosy. If you have a more flexible/generalizable solution then please don't hesitate to post!
What you have is sound, I would add only two comments:
Don't worry about extra memory usage, assuming the data frames are of nontrivial size you won't lose much putting them in a big list.
You might add ... as an argument to your function and pass it through to read.csv, so that if another user needs to specify extra arguments because their file wasn't in quite the same format (or wants stringsAsFactors=FALSE or something) then they have the flexibility to do that.

Testing Against Stored R Objects

Is there a simple workflow to write tests that store objects as .rds or .rda so that future runs of a test can compare the result of code execution vs. the stored object? This would make it easy to check that functions that return somewhat complex values are still behaving as they should.
For example, something like:
test_obj(res <- lm(y ~ x, data.frame(x=1:3, y=5:7)))
which, if *extdata/test_obj.res.rds* doesn't exist, would create it in *inst/extdata/test_obj.res.rds*, with res from above, but if it does exist, would identical/all.equal etc. the newly generated object with the one recovered from the rds.
I would find such tests super useful, and I am a bit surprised that RUnit/svUnit / testthat don't implement something of the sort (I'm hoping they do, and I just haven't found it).
testthat::make_expectation is close, but I'd prefer to have an automated store/retrieve rds rather than copy paste the text representation to a file, which I think is how you're supposed to use testthat::make_expectation (I guess I could pipe stdout() to a .R file, but even then there is a bit of automation that could facilitate the process).
It only took me three years, but I wrote unitizer to resolve this issue. It is a unit testing framework with an interactive UI that allows you to review test output and store it / reject it with a single keystroke. It also streamlines the update/test/debug cycle by showing you a proper diff of failing tests, and dropping you into those tests evaluation environments for debugging in the interactive UI.
For example, if we have a matrix rotation function (courtesy #MatthewLundberg) we want to test:
# mx-rotate.R
rotate <- function(x) t(apply(x, 2, rev))
And a script with some tests:
# mx-test.R
mx <- matrix(1:9, 3)
rotate(mx)
rotate(rotate(mx))
rotate(rotate(rotate(mx)))
Then:
library(unitizer)
unitize('mx-test.R')
Will kick-off an interactive session that will allow you to review the results of the three rotation calls and accept them as tests if they work as expected.
There is a screencast demo available.
As of 2017, testthat has the feature expect_equal_to_reference, which does exactly what the question asks. I guess Hadley W. figured out a way.

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by #hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (class(t.sf) == 'try-error')
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There is also a number of helpful documents like this one Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions #noah and #alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like it's predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is package developed by RStudio called pins that might address this problem.

Is there a built-in function for sampling a large delimited data set?

I have a few large data files I'd like to sample when loading into R. I can load the entire data set, but it's really too large to work with. sample does roughly the right thing, but I'd like to have to take random samples of the input while reading it.
I can imagine how to build that with a loop and readline and what-not but surely this has been done hundreds of times.
Is there something in CRAN or even base that can do this?
You can do that in one line of code using sqldf. See part 6e of example 6 on the sqldf home page.
No pre-built facilities. Best approach would be to use a database management program. (Seems as though this was addressed in either SO or Rhelp in the last week.)
Take a look at: Read csv from specific row , and especially note Grothendieck's comments. I consider him a "class A wizaRd". He's got first hand experience with sqldf. (The author IIRC.)
And another "huge files" problem with a Grothendieck solution that succeeded:
R: how to rbind two huge data-frames without running out of memory
I wrote the following function that does close to what I want:
readBigBz2 <- function(fn, sample_size=1000) {
f <- bzfile(fn, "r")
rv <- c()
repeat {
lines <- readLines(f, sample_size)
if (length(lines) == 0) break
rv <- append(rv, sample(lines, 1))
}
close(f)
rv
}
I may want to go with sqldf in the long-term, but this is a pretty efficient way of sampling the file itself. I just don't quite know how to wrap that around a connection for read.csv or similar.

Resources