Question:
I'm using sys.source to source a script's output into a new environment. However, that script itself source()'s some things as well.
When it sources functions, they (and their output) get loaded into R_GlobalEnv instead of into the environment specified by sys.source(). It seems the functions enclosing and binding environments end up being under R_GlobalEnv instead of what you specify in sys.source().
Is there a way like sys.source() to run a script and keep everything it makes in a separate environment? An ideal solution would not require modifying the scripts I'm sourcing and still have "chdir = TRUE" style functionality.
Example:
Running this should show you what I mean:
# setup an external folder
other.folder = tempdir()
# make a functions script, it just adds "1" to the argument.
# Note: the strange-looking "assign(x=" bit is important
# to what I'm actually doing, so any solution needs to be
# robust to this.
functions = file.path(other.folder, "functions.R")
writeLines("myfunction = function(a){assign(x=c('function.output'), a+1, pos = 1)}", functions)
# make a parent script, which source()'s functions.R
# and invokes it on some data, and then modifies that data
parent = file.path(other.folder, "parent.R")
writeLines("source('functions.R')\n
original.data=1\n
myfunction(original.data)\n
resulting.data = function.output + 1", parent)
# make a separate environment
myenv = new.env()
# source parent.R into that new environment,
# using chdir=TRUE so parent.R can find functions.R
sys.source(parent, myenv, chdir = TRUE)
# You can see "myfunction" and "function.output"
# end up in R_GlobalEnv.
# Whereas "original.data" and "resulting.data" end up in the intended environment.
ls(myenv)
More information (what I'm actually trying to do):
I have data from several similar experiments. I'm trying to keep everything in line with "reproducible research" ideals (for my own sanity if nothing else). So what I'm doing is keeping each experiment in its own folder. The folder contains the raw data, and all the metadata which describes each sample (treatment, genotype, etc.). The folder also contains the necessary R scripts to read the raw data, match it with metadata, process it, and output graphs and summary statistics. These are tied into a "mother script" which will do the whole process for each experiment.
This works really well but if I want to do some meta-analysis or just compare results between experiments there are some difficulties. Right now I am thinking the best way would be to run each experiment's "mother script" in its own environment, and then pull out the data from each environment to do my meta-analysis. An alternative approach might be running each mother script in its own instance and then saving the .RData files separately and then re-loading them into a new environment in a new instance. This seems kinda hacky though and I feel like there's a more elegant solution.
Related
I'm using
sapply(list.files('scritps/', full.names=TRUE),source)
to run 80 scripts at once in the folder "scripts/" and I do not know exactly how does this work. There are "intermediate" objects equally named across scripts (they are iterative scritps across 80 different biological populations). Does each script only use its own objects? Is there any risk of an script taking the objects of other "previous" script that has not been yet deleted out of the memory, or does this process works exactly like if was run manually sequentially one by one?
Many thanks in advance.
The quick answer is: each script runs independently. Imagine you run a for loop iterating through all the script files instead of using sapply - it should be the same in results.
To prove my thoughts, I just did an experiment:
# This is foo.R
x <- mtcars
write.csv(x, "foo.csv")
# This is bar.R
x <- iris
write.csv(x, "bar.csv")
# Run them at once
sapply(list.files(), source)
Though the default of "local" argument in source is FALSE, it turns out that I have two different csv files in my working directory, one named "foo.csv" with mtcars data frame, and the other named "bar.csv" with iris data frame.
There are global variables that you can declare out a function. As it's name says they are global and can be re-evaluated. If you declare a var into a function it will be local variable and only will take effect inside this concrete function, it will not exists out of its own function.
Example:
Var globalVar = 'i am global';
Function foo(){
Var localVar = 'i don't exist out of foo function';
}
If you declared globalVar on the first script, and you call it on the latest one, it will answer. If you declared localVar on some script and you call it into another or out of the functions or in another function you'll get an error (var localVar is not declared / can't be found).
Edit:
Perhaps, if there aren't dependences between scripts (you don't need one to finish to continue with another) there's no matter on running them on parallel or running them secuentialy. The behaviour will be the same.
You've only to take care with global vars, local ones can't infer into another script/function.
Introduction:
I have an RStudio project where I'm researching (fairly) big data sets. Though I'm trying to keep global environment clean, after some time it becomes filled with huge objects.
Problem:
RStudio always refreshes Environment pane after debugging (probably iterates global environment and calls summary() on each object), and it takes tens of seconds on my global environment. Although the refresh itself is async, R session is busy and you must wait for it to finish before you can continue working. That makes debugging very annoying. And there's no way I know of to disable Environment pane in RStudio.
Question:
Can someone suggest any beautiful workaround of that? I see following possibilities:
Customize RStudio sources to add an option to disable Environment
pane.
Frequently clean global environment (not convinient because raw data needs time-consuming preprocessing and I often change preprocessing logic).
Maybe there're specific types of objects that causing the lag not because of their size but because of their structure?
I'm working on reproducible example now, but it's not clear which objects causing the issue.
I've emailed RStudio support about that issue some time ago, but didn't get any answer yet.
While it's not yet available in a public release of RStudio, the v1.3 daily builds of RStudio allow you to disable the automatic-updates of the environment pane:
Selecting Manual Refresh Only would disable the automatic refresh of the environment pane.
I can reproduce the problem with lots of small nested list variables.
# Populate global environment with lots of nested list variables
invisible(
replicate(
1000,
assign(
paste0(sample(letters, 10, replace = TRUE), collapse = ""),
list(a = 1, b = list(ba = 2.1, bb = list(bba = 2.21, bbb = 2.22))),
envir = globalenv()
)
)
)
f <- function() browser()
f() # hit ENTER in the console once you hit the browser
This suggests that the problem is RStudio running its equivalent of ls.str() on the global environment.
I suspect that the behaviour is implemented in one of the functions listed by ls("tools:rstudio", all.names = TRUE), but I'm not sure which. If you find it, you can override it.
Alternatively, your best bet is to rework your code so that you aren't assigning so many variables in the global environment. Wrap most of your code into functions (so most variables only exist for the lifetime of the function call). You can also define a new environment
e <- new.env(parent = globalenv())
Then assign all your results inside e. That way the refresh only takes a few microseconds.
Is there a simple workflow to write tests that store objects as .rds or .rda so that future runs of a test can compare the result of code execution vs. the stored object? This would make it easy to check that functions that return somewhat complex values are still behaving as they should.
For example, something like:
test_obj(res <- lm(y ~ x, data.frame(x=1:3, y=5:7)))
which, if *extdata/test_obj.res.rds* doesn't exist, would create it in *inst/extdata/test_obj.res.rds*, with res from above, but if it does exist, would identical/all.equal etc. the newly generated object with the one recovered from the rds.
I would find such tests super useful, and I am a bit surprised that RUnit/svUnit / testthat don't implement something of the sort (I'm hoping they do, and I just haven't found it).
testthat::make_expectation is close, but I'd prefer to have an automated store/retrieve rds rather than copy paste the text representation to a file, which I think is how you're supposed to use testthat::make_expectation (I guess I could pipe stdout() to a .R file, but even then there is a bit of automation that could facilitate the process).
It only took me three years, but I wrote unitizer to resolve this issue. It is a unit testing framework with an interactive UI that allows you to review test output and store it / reject it with a single keystroke. It also streamlines the update/test/debug cycle by showing you a proper diff of failing tests, and dropping you into those tests evaluation environments for debugging in the interactive UI.
For example, if we have a matrix rotation function (courtesy #MatthewLundberg) we want to test:
# mx-rotate.R
rotate <- function(x) t(apply(x, 2, rev))
And a script with some tests:
# mx-test.R
mx <- matrix(1:9, 3)
rotate(mx)
rotate(rotate(mx))
rotate(rotate(rotate(mx)))
Then:
library(unitizer)
unitize('mx-test.R')
Will kick-off an interactive session that will allow you to review the results of the three rotation calls and accept them as tests if they work as expected.
There is a screencast demo available.
As of 2017, testthat has the feature expect_equal_to_reference, which does exactly what the question asks. I guess Hadley W. figured out a way.
I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid the common issue of having a folder full of output (plots, RData files, csv, etc.), without, after some time, having a clue where they came from or how they were produced? In part, it surely involves trying to be intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is. So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function metainfo (see below) that writes a text file with metadata, with a given file name. The idea is that if a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. Particularly, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by #hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
# message - character string - Any message to be written into the information
# file (e.g., data used).
# filename - character string - the name of the txt file (including relative
# path). Should be the same as the output file it describes (RData,
# csv, pdf).
#
if (is.null(filename))
{
stop('Provide an output filename - parameter filename.')
}
filename <- paste(filename, '.txt', sep='')
# Try to get as close as possible to getting the file name from which the
# function is called.
source.file <- lapply(sys.frames(), function(x) x$ofile)
source.file <- Filter(Negate(is.null), source.file)
t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
silent=TRUE)
if (class(t.sf) == 'try-error')
{
source.file <- NULL
}
func <- deparse(sys.call(-1))
# MetaInfo isn't always called from within another function, so func could
# return as NULL or as general environment.
if (any(grepl('eval', func, ignore.case=TRUE)))
{
func <- NULL
}
time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
meta <- list(Message=message,
Source=paste(source.file, ' on ', time, sep=''),
Functions=func,
System=Sys.info(),
Session=sessionInfo(),
Git.hash=git.h)
sink(file=filename)
print(meta)
sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
fn <- 'random_plot'
pdf(file=paste(fn, '.pdf', sep=''))
plot(x, y)
MetaInfo(message=NULL, filename=fn)
dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project, each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called Project Template that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There is also a number of helpful documents like this one Workflow of statistical data analysis by Oliver Kirchkamp.
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question belies a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file is produced. You should in fact be able to wipe out any output, and reproduce it by rerunning code. So while I might still use the above function for extra information, it really is a question of being ruthless and indeed cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or Project Template, which I will try to pick up on. Thanks again for the suggestions #noah and #alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like it's predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is package developed by RStudio called pins that might address this problem.
I'm getting behaviour I don't understand when saving environments. The code below demonstrates the problem. I would have expected the two files (far-too-big.RData, and right-size.RData) to be the same size, and also very small because the environments they contain are empty.
In fact, far-too-big.RData ends up the same size as bigfile.RData.
I get the same results using 2.14.1 and 2.15.2, both on WinXP 5.1 SP3. Can anyone explain why this is happening?
Both far-too-big.RData and right-size.RData, when loaded into a new R session, appear to contain nothing. ie they return character(0) in response to ls(). However, if I switch the saves to include ascii=TRUE, and open the result in a text editor, I can see that far-too-big.RData contains the data in bigfile.RData.
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")
fn <- function() {
load("bigfile.RData")
test <- new.env()
save(test, file="far-too-big.RData")
test1 <- new.env(parent=globalenv())
save(test1, file="right-size.RData")
}
fn()
This is not my area of expertise but I belive environments work like this.
Any environment inherits everything in its parent environment.
All function calls create their own environment.
The result of the above in your case is:
When you run fn() it creates its own local environment (green), whose parent by default is globalenv() (grey).
When you create the environment test (red) inside fn() its parent defaults to fn()'s environment (green). test will therefore include the object a.
When you create the environment test1 (blue) and explicitly states that its parent is globalenv() it is separated from fn()'s environment and does not inherit the object a.
So when saving test you also save a (somewhat hidden) copy of the object a. This does not happen when you save test1 as it does not include the object a.
Update
Apparently this is a more complicated topic than I used to believe. Although I might just be quoting #joris-mays answer now I'd like to take a final go at it.
To me the most intuitive visualization of environments would be a tree structure, see below, where each node is an environment and the arrows point to its respective enclosing environment (which I would like to believe is the same as its parent, but that has to do with frames and is beyond my corner of the world). A given environment encloses all objects you can reach by moving down the tree and it can access all objects you can reach by moving up the tree. When you save an environment it appears you save all objects and environments that are both enclosed by it and accessible from it (with the exception of globalenv()).
However, the take home message is as Joris already stated: save your objects as lists and you don't need to worry.
If you want to know more I can recommend Norman Matloff's excellent book the art of R programming. It is aimed at software development in R rather than primary data analysis and assumes you have a fair bit of programming experience. I must admit I haven't fully digested the environment part yet, but as the rest of the book is very well written and pedagogical I assume this one is too.
Actually, it's the other way around than #Backlin shows: the parent environment is the one that encloses the other ones. So in the case you define, the enclosing environment of test is the local environment of fn, and the enclosing environment of test1 is the global environment, like this:
Environments behave different from other objects in R, in the sense that they don't get copied when passed to functions or used in assignments. The environment object itself consists internally of pointers to :
a frame (which is a pairlist containing the values)
the enclosing environment (as explained above)
a hash table (which is either a list or NULL if the environment is not hashed)
The fact that an environment contains pointers, makes all the difference. Environments are not all that easy to deal with, they're actually very tricky. Take a look at the code below :
> test <- new.env()
> test$a <- 1
> test2 <- test
> test2$a <- 2
> test$a
[1] 2
So the only thing you copied from test in test2, is the pointers. If you change a value in test2, you change that in test as well. (Actually, you change that value only once, but test and test2 point both to the same frame).
When you try to save an environment, R has no choice but to get the values for the frame, the hash table AND the enclosing environment and save those. As the enclosing environment is an environment in itself, R will also save all enclosing environments until it reaches the global environment. As the global environment is treated in a special way in the internal code, that one is (luckily) not saved in the file.
Note the difference between an enclosing environment and a parent frame:
Say we define our functions a bit different :
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")
fn <- function() {
load("bigfile.RData")
test <- new.env()
save(test, file="far-too-big.RData")
test1 <- new.env(parent=globalenv())
save(test1, file="right-size.RData")
}
fn2 <- function(){
z <- matrix(runif(1000000,0,1),ncol=1000)
fn()
}
fn2()
Now we have the following situation :
One would think that the file "far-too-big.RData" contains both matrix a and matrix z, but that's not the case. It contains only the matrix a. This is because the enclosing environment of fn is the global environment. The parent frame of fn is the environment of fn2, but the environment object created by fn contains a pointer to the global environment.
On the other hand, if we do the following:
fn <- function() {
load("bigfile.RData")
test <- new.env()
test$b <- a
test2 <- new.env(parent=test)
save(test2, file="far-too-big.RData")
}
test2 is now enclosed in two environments (being test and the environment of fun), and both environments are saved in the file as well. So you get this situation :
Regardless of this, I personally avoid saving environments as environments, because there are more things that can go wrong. In my opinion, saving an environment as a list is in 99.9% of the cases the better choice :
fn2 <- function(){
load("bigfile.RData")
test <- new.env()
test$x <- "something"
test$fn <- ls
testlist <- as.list(test)
save(testlist, file="right-size.RData")
}
fn2()
If you need it to be an environment, you can convert it back when loading.
load("right-size.RData")
test <- as.environment(testlist)