Testing Against Stored R Objects - r

Is there a simple workflow to write tests that store objects as .rds or .rda so that future runs of a test can compare the result of code execution vs. the stored object? This would make it easy to check that functions that return somewhat complex values are still behaving as they should.
For example, something like:
test_obj(res <- lm(y ~ x, data.frame(x=1:3, y=5:7)))
which, if *extdata/test_obj.res.rds* doesn't exist, would create it in *inst/extdata/test_obj.res.rds*, with res from above, but if it does exist, would identical/all.equal etc. the newly generated object with the one recovered from the rds.
I would find such tests super useful, and I am a bit surprised that RUnit/svUnit / testthat don't implement something of the sort (I'm hoping they do, and I just haven't found it).
testthat::make_expectation is close, but I'd prefer to have an automated store/retrieve rds rather than copy paste the text representation to a file, which I think is how you're supposed to use testthat::make_expectation (I guess I could pipe stdout() to a .R file, but even then there is a bit of automation that could facilitate the process).

It only took me three years, but I wrote unitizer to resolve this issue. It is a unit testing framework with an interactive UI that allows you to review test output and store it / reject it with a single keystroke. It also streamlines the update/test/debug cycle by showing you a proper diff of failing tests, and dropping you into those tests evaluation environments for debugging in the interactive UI.
For example, if we have a matrix rotation function (courtesy #MatthewLundberg) we want to test:
# mx-rotate.R
rotate <- function(x) t(apply(x, 2, rev))
And a script with some tests:
# mx-test.R
mx <- matrix(1:9, 3)
rotate(mx)
rotate(rotate(mx))
rotate(rotate(rotate(mx)))
Then:
library(unitizer)
unitize('mx-test.R')
Will kick-off an interactive session that will allow you to review the results of the three rotation calls and accept them as tests if they work as expected.
There is a screencast demo available.

As of 2017, testthat has the feature expect_equal_to_reference, which does exactly what the question asks. I guess Hadley W. figured out a way.

Related

R not remembering objects written within functions

I'm struggling to clearly explain this problem.
Essentially, something has seemed to have happened within the R environment and none of the code I write inside my functions are working and not data is being saved. If I type a command line directly into the console it works (i.e. Monkey <- 0), but if I type it within a function, it doesn't store it when I run the function.
It could be I'm missing a glaring error in the code, but I noticed the problem when I accidentally clicked on the debugger and tried to excite out of the browser[1] prompt which appeared.
Any ideas? This is driving me nuts.
corr <- function(directory, threshold=0) {
directory <- paste(getwd(),"/",directory,"/",sep="")
file.list <- list.files(directory)
number <- 1:length(file.list)
monkey <- c()
for (i in number) {
x <- paste(directory,file.list[i],sep="")
y <- read.csv(x)
t <- sum(complete.cases(y))
if (t >= threshold) {
correl <- cor(y$sulfate, y$nitrate, use='pairwise.complete.obs')
monkey <- append(monkey,correl)}
}
#correl <- cor(newdata$sulfate, newdata$nitrate, use='pairwise.complete.obs')
#summary(correl)
}
corr('specdata', 150)
monkey```
It's a namespace issue. Functions create their own 'environment', that isn't necessarily in the global environment.
Using <- will assign in the local environment. To save an object to the global environment, use <<-
Here's some information on R environments.
I suggest you give a look at some tutorial on using functions in R.
Briefly (and sorry for my horrible explanation) objects that you define within functions will ONLY be defined within functions, unless you explicitly export them using (one of the possible approaches) the return() function.
browser() is indeed used for debugging, keeps you inside the function, and allows you accessing objects created inside the function.
In addition, to increase the probability to have useful answers, I suggest that you try to post a self-contained, working piece of code allowing quickly reproducing the issue. Here you are reading some files we have no access to.
It seems to me you have to store the output yourself when you run your script:
corr_out <- corr('specdata', 150)

Force sys.source() to put functions in specified environment

Question:
I'm using sys.source to source a script's output into a new environment. However, that script itself source()'s some things as well.
When it sources functions, they (and their output) get loaded into R_GlobalEnv instead of into the environment specified by sys.source(). It seems the functions enclosing and binding environments end up being under R_GlobalEnv instead of what you specify in sys.source().
Is there a way like sys.source() to run a script and keep everything it makes in a separate environment? An ideal solution would not require modifying the scripts I'm sourcing and still have "chdir = TRUE" style functionality.
Example:
Running this should show you what I mean:
# setup an external folder
other.folder = tempdir()
# make a functions script, it just adds "1" to the argument.
# Note: the strange-looking "assign(x=" bit is important
# to what I'm actually doing, so any solution needs to be
# robust to this.
functions = file.path(other.folder, "functions.R")
writeLines("myfunction = function(a){assign(x=c('function.output'), a+1, pos = 1)}", functions)
# make a parent script, which source()'s functions.R
# and invokes it on some data, and then modifies that data
parent = file.path(other.folder, "parent.R")
writeLines("source('functions.R')\n
original.data=1\n
myfunction(original.data)\n
resulting.data = function.output + 1", parent)
# make a separate environment
myenv = new.env()
# source parent.R into that new environment,
# using chdir=TRUE so parent.R can find functions.R
sys.source(parent, myenv, chdir = TRUE)
# You can see "myfunction" and "function.output"
# end up in R_GlobalEnv.
# Whereas "original.data" and "resulting.data" end up in the intended environment.
ls(myenv)
More information (what I'm actually trying to do):
I have data from several similar experiments. I'm trying to keep everything in line with "reproducible research" ideals (for my own sanity if nothing else). So what I'm doing is keeping each experiment in its own folder. The folder contains the raw data, and all the metadata which describes each sample (treatment, genotype, etc.). The folder also contains the necessary R scripts to read the raw data, match it with metadata, process it, and output graphs and summary statistics. These are tied into a "mother script" which will do the whole process for each experiment.
This works really well but if I want to do some meta-analysis or just compare results between experiments there are some difficulties. Right now I am thinking the best way would be to run each experiment's "mother script" in its own environment, and then pull out the data from each environment to do my meta-analysis. An alternative approach might be running each mother script in its own instance and then saving the .RData files separately and then re-loading them into a new environment in a new instance. This seems kinda hacky though and I feel like there's a more elegant solution.

Can I cache data loading in R?

I'm working on a R script which has to load data (obviously). The data loading takes a lot of effort (500MB) and I wonder if I can avoid having to go through the loading step every time I rerun the script, which I do a lot during the development.
I appreciate that I could do the whole thing in the interactive R session, but developing multi-line functions is just so much less convenient on the R prompt.
Example:
#!/usr/bin/Rscript
d <- read.csv("large.csv", header=T) # 500 MB ~ 15 seconds
head(d)
How, if possible, can I modify the script, such that on subsequent executions, d is already available? Is there something like a cache=T statement as in R markdown code chunks?
Sort of. There are a few answers:
Use a faster csv read: fread() in the data.table() package is beloved by many. Your time may come down to a second or two.
Similarly, read once as csv and then write in compact binary form via saveRDS() so that next time you can do readRDS() which will be faster as you do not have to load and parse the data again.
Don't read the data but memory-map it via package mmap. That is more involved but likely very fast. Databases uses such a technique internally.
Load on demand, and eg the package SOAR package is useful here.
Direct caching, however, is not possible.
Edit: Actually, direct caching "sort of" works if you save your data set with your R session at the end. Many of us advise against that as clearly reproducible script which make the loading explicit are preferably in our view -- but R can help via the load() / save() mechanism (which lots several objects at once where saveRSS() / readRDS() work on a single object.
Package ‘R.cache’ R.cache
start_year <- 2000
end_year <- 2013
brics_countries <- c("BR","RU", "IN", "CN", "ZA")
indics <- c("NY.GDP.PCAP.CD", "TX.VAL.TECH.CD", "SP.POP.TOTL", "IP.JRN.ARTC.SC",
"GB.XPD.RSDV.GD.ZS", "BX.GSR.CCIS.ZS", "BX.GSR.ROYL.CD", "BM.GSR.ROYL.CD")
key <- list(brics_countries, indics, start_year, end_year)
brics_data <- loadCache(key)
if (is.null(brics_data)) {
brics_data <- WDI(country=brics_countries, indicator=indics,
start=start_year, end=end_year, extra=FALSE, cache=NULL)
saveCache(brics_data, key=key, comment="brics_data")
}
I use exists to check if the object is present and load conditionally, i.e.:
if (!exists(d))
{
d <- read.csv("large.csv", header=T)
# Any further processing on loading
}
# The rest of the script
If you want to load/process the file again, just use rm(d) before sourcing. Just be careful that you do not use object names that are already used elsewhere, otherwise it will pick that up and not load.
I wrote up some of the common ways of caching in R in "Caching in R" and published it to R-Bloggers. For your purpose, I would recommend just using saveRDS() or qs() from the 'qs' (quick serialization) package. My package, 'mustashe', uses qs() for reading and writing files, so you could just use mustashe::stash(), too.

R: Improving workflow and keeping track of output

I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by #hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (class(t.sf) == 'try-error')
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There is also a number of helpful documents like this one Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions #noah and #alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like it's predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is package developed by RStudio called pins that might address this problem.

How do I know the detail about specific R object?

for example
data <- read.csv ("data.csv")
a <- mean(data)
b <- sd(data)
and I save the workspace, and then quit.
Later, I open this workspace and forget what a and b were.
I want R to show me that a is mean of the data and b is standard deviation of the data.
How do I do that?
Thank you.
You could always store some attributes with your data like so:
x <- 1:10
a <- mean(x)
attr(a,"info") <- "mean of x"
> a
[1] 5.5
attr(,"info")
[1] "mean of x"
> attributes(a)
$info
[1] "mean of x"
An alternative noted by #mnel below is to use comment. These will not be printed by default but can be accessed later in a similar fashion like so:
comment(a) <- "mean of x"
> comment(a)
[1] "mean of x"
A suggestion is to use the script feature of the R environment, rather than typing directly commands in the console.
The idea is that you can type commands, comments and even gibberish text (stuff that doesn't conform to R syntax), in a script window, and using Ctrl-R (or one of Run commands from the Edit menu) you send the the current line, or whatever portion of the text that is currently selected, to the R Console window (just like if had typed it directly there).
In this fashion, you can:
add voluminous comments as to the nature of the variables that you create
save the script along with the environment or independently.
In addition to implicitly saving a memory of the genesis of the variables, the scripts have several advantages, in particular they can save a lot of typing and they can also allow to recreate everything "from scratch", verbatim or with a few modifications.
In general you won't be able to find out how an object was created from the object itself. Some object types will have a call element that may save the call used to create them.
lm objects have this property.
eg
dd <- data.frame(y=runif(10), x= rnorm(10))
model <- lm(y~x,dd)
model$call
lm(formula = y ~ x, data = dd)
In this case mean and sd will not as they will return atomic vectors.
You could look at the history to see if you can find commands that created them (this is not ideal, it is dependent your IDE and how some environmental variables are set up).
Rstudio has a history tab that shows some subset of the previous commands called within a project.
You may also be able to press the up key (this works in the RGui on windows at least), to scroll through the previously called commands.
These commands based on the history require that you used the same computer and version of R.
Reproducible research or literate programming are the best ways to overcome these issues.

Resources