Doubts on running multiple scripts at once in R

I'm using
sapply(list.files('scripts/', full.names=TRUE), source)
to run 80 scripts at once from the folder "scripts/", and I do not know exactly how this works. There are "intermediate" objects named identically across scripts (they are iterative scripts over 80 different biological populations). Does each script only use its own objects? Is there any risk of a script picking up the objects of a "previous" script that have not yet been deleted from memory, or does this process work exactly as if the scripts were run manually, one by one?
Many thanks in advance.

The quick answer is: the scripts run one after another, exactly as if you sourced them manually in sequence. Imagine you run a for loop iterating over all the script files instead of using sapply - the results should be the same.
To prove my thoughts, I just did an experiment:
# This is foo.R
x <- mtcars
write.csv(x, "foo.csv")
# This is bar.R
x <- iris
write.csv(x, "bar.csv")
# Run them at once
sapply(list.files(), source)
Although the default for the "local" argument of source is FALSE (so everything is evaluated in the global environment), I end up with two different csv files in my working directory: "foo.csv" containing the mtcars data frame, and "bar.csv" containing the iris data frame. Each script overwrites x and uses it immediately, so the second script never picks up a stale value from the first.
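To see the leakage the asker worries about, here is a hedged sketch (file names are hypothetical; run it in a fresh session): an object left behind by one script is visible to the next unless it is overwritten or removed first.
writeLines('x <- mtcars', "foo.R")
writeLines('print(exists("x"))', "baz.R")  # baz.R does not define x itself
sapply(c("foo.R", "baz.R"), source)        # baz.R prints TRUE: it sees the x left by foo.R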

There are global variables, which you declare outside any function. As the name says, they are global and can be reassigned anywhere. If you declare a variable inside a function it is a local variable, and it only exists inside that particular function; it does not exist outside of it.
Example:
globalVar <- 'I am global'                 # defined at top level: global
foo <- function() {
  localVar <- "I don't exist outside foo"  # local: visible only inside foo()
}
If you declare globalVar in the first script and refer to it in the last one, it will still be there. If you declare localVar inside a function in one script and refer to it in another script, outside the function, or in another function, you will get an error (object 'localVar' not found).
Edit:
Also, if there are no dependencies between the scripts (you don't need one to finish before another starts), it doesn't matter whether you run them in parallel or sequentially; the behaviour will be the same.
You only have to take care with global variables; local ones cannot interfere with another script/function.

Related

Using delayed assignment in a function: How do I send the promise back to the parent environment?

I would like to use delayedAssign to load a series of data from a set of files only when the data is needed. But since these files will always be in the same directory (which may be moved around), instead of hard-coding the location of each file (which would be tedious to change later if the directory were moved), I would like to simply make a function that accepts the file path of the directory.
loadLayers <- function(filepath) {
  delayedAssign("dataset1", readRDS(file.path(filepath, "experiment1.rds")))
  delayedAssign("dataset2", readRDS(file.path(filepath, "experiment2.rds")))
  delayedAssign("dataset3", readRDS(file.path(filepath, "experiment3.rds")))
  return(list(setOne = dataset1, setTwo = dataset2, setThree = dataset3))
}
So instead of loading all the data sets at the start, I'd like to have each data set loaded only when needed (which speeds up the shiny app).
However, I'm having trouble doing this in a function. It works when the delayedAssign calls are not in a function, but when I put them in one, all the objects in the list simply return NULL, and the "promise" to evaluate them when needed doesn't seem to be fulfilled.
What would be the correct way to achieve this?
Thanks.
Your example cannot work as intended, even conceptually: you're using delayedAssign and then immediately resolving the promises by referencing them in return(), so you end up loading everything anyway. To make it clear: assignments bind a symbol to a value in an environment. So for this to make any sense, your function must return the environment, not a list. Alternatively, you can simply use the global environment, in which case the function doesn't need to return anything, as you use it for its side effect.
loadLayers <- function(filepath, where = .GlobalEnv) {
  delayedAssign("dataset1", readRDS(file.path(filepath, "experiment1.rds")),
                assign.env = where)
  delayedAssign("dataset2", readRDS(file.path(filepath, "experiment2.rds")),
                assign.env = where)
  delayedAssign("dataset3", readRDS(file.path(filepath, "experiment3.rds")),
                assign.env = where)
  where
}
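A quick usage sketch (the directory path is hypothetical): the files are read only when the corresponding object is first touched.
layers <- loadLayers("path/to/data")  # a directory containing the .rds files
ls(layers)                            # lists the promises without forcing them
layers$dataset1                       # experiment1.rds is read only at this point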

Force sys.source() to put functions in specified environment

Question:
I'm using sys.source to source a script into a new environment. However, that script itself source()'s some things as well.
When it sources functions, they (and their output) get loaded into R_GlobalEnv instead of into the environment specified by sys.source(). It seems the functions' enclosing and binding environments end up under R_GlobalEnv instead of the environment you specify in sys.source().
Is there a way, like sys.source(), to run a script and keep everything it creates in a separate environment? An ideal solution would not require modifying the scripts I'm sourcing and would still have "chdir = TRUE" style functionality.
Example:
Running this should show you what I mean:
# setup an external folder
other.folder = tempdir()
# make a functions script, it just adds "1" to the argument.
# Note: the strange-looking "assign(x=" bit is important
# to what I'm actually doing, so any solution needs to be
# robust to this.
functions = file.path(other.folder, "functions.R")
writeLines("myfunction = function(a){assign(x=c('function.output'), a+1, pos = 1)}", functions)
# make a parent script, which source()'s functions.R
# and invokes it on some data, and then modifies that data
parent = file.path(other.folder, "parent.R")
writeLines("source('functions.R')\n
original.data=1\n
myfunction(original.data)\n
resulting.data = function.output + 1", parent)
# make a separate environment
myenv = new.env()
# source parent.R into that new environment,
# using chdir=TRUE so parent.R can find functions.R
sys.source(parent, myenv, chdir = TRUE)
# You can see "myfunction" and "function.output"
# end up in R_GlobalEnv.
# Whereas "original.data" and "resulting.data" end up in the intended environment.
ls(myenv)
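One workaround, not from the original question (a sketch): since sys.source() evaluates each top-level call in the target environment, you can shadow source() there with a wrapper that sources into the caller's environment. Note that this only redirects where myfunction is defined; the assign(x=..., pos=1) inside it still targets position 1 of the search path, i.e. R_GlobalEnv.
myenv2 <- new.env()
myenv2$source <- function(file, ...) base::source(file, local = parent.frame(), ...)
sys.source(parent, myenv2, chdir = TRUE)
ls(myenv2)  # now includes "myfunction" alongside the data objects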
More information (what I'm actually trying to do):
I have data from several similar experiments. I'm trying to keep everything in line with "reproducible research" ideals (for my own sanity if nothing else). So what I'm doing is keeping each experiment in its own folder. The folder contains the raw data, and all the metadata which describes each sample (treatment, genotype, etc.). The folder also contains the necessary R scripts to read the raw data, match it with metadata, process it, and output graphs and summary statistics. These are tied into a "mother script" which will do the whole process for each experiment.
This works really well, but if I want to do some meta-analysis or just compare results between experiments, there are some difficulties. Right now I am thinking the best way would be to run each experiment's "mother script" in its own environment, and then pull the data out of each environment to do my meta-analysis. An alternative approach might be running each mother script in its own R instance, saving the .RData files separately, and then re-loading them into a new environment in a new instance. That seems kind of hacky, though, and I feel like there's a more elegant solution.
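For what it's worth, the per-environment idea might look like the following sketch (folder and script names are hypothetical):
experiments <- c("exp1", "exp2", "exp3")  # one folder per experiment
results <- lapply(experiments, function(ex) {
  env <- new.env(parent = globalenv())
  sys.source(file.path(ex, "mother.R"), envir = env, chdir = TRUE)
  as.list(env)  # pull everything out for comparison across experiments
})
names(results) <- experiments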

Efficiently cleaning R from attached objects

I want to initialise some variables from an external data file. One way is to set up a file like the following foo.csv:
var1,var2,var3
value1,value2,value3
Then issue:
attach(read.csv('foo.csv'))
The problem is that this way var1, var2, var3 are not shown by ls(), and above all rm(list=ls()) no longer cleans everything: var1, var2, var3 are still there.
Since the default position for newly attached objects is 2, I can remove the workspace where these variables live via:
detach(pos=2)
or simply
detach()
since pos=2 is the default for detach too.
But detach() is "too" powerful: it can also remove things R attaches by default. This means that if one attaches many datasets, removing them with repeated detach() calls can end up detaching default packages as well, forcing you to restart R. Besides, the simplicity of a single rm(list=ls()) is gone.
One solution would be to attach var1, var2, var3 straight into the global environment.
Do you know how to do that?
attach(read.csv('foo.csv'), pos=1)
issues a warning (future error).
attach(read.csv('foo.csv'), pos=-1)
seems ineffective.
If you want to read the variables straight into the global environment, you can do this:
local({
  foo <- read.csv('foo.csv')
  for (n in names(foo)) assign(n, foo[[n]], envir = globalenv())
})
Wrapping the code in local() keeps foo itself out of the global environment (plain braces would not, since { } does not create a new scope in R). You can also make this into a function if you want.
Use the named variant of attach and detach:
attach(read.csv(text='var1,var2,var3\nvalue1,value2,value3'),
       name = 'some_name')
and
detach('some_name')
This will prevent mistakes. You'd obviously wrap these two calls into functions and generate the names automatically in an appropriate manner (the easiest being a monotonically increasing counter).
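A sketch of such wrappers (the function names are mine, not from the answer):
new_attach_name <- local({ i <- 0; function() { i <<- i + 1; paste0("attached_data_", i) } })
attach_named <- function(what) {
  name <- new_attach_name()  # counter-based unique name
  attach(what, name = name)
  invisible(name)            # return it so the caller can detach later
}
detach_named <- function(name) detach(name, character.only = TRUE)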
What about a 'safer' attach?
attach <- function(x) {
  for (n in names(x)) assign(n, x[[n]], envir = globalenv())
  names(x)
}
'Safer' means you can see the attached vars with ls() and, above all, remove them with a single rm(list=ls()).
Inspired by mrip
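Usage then looks like this: the variables land in the global environment, so ls() shows them and one rm(list=ls()) removes them.
attach(read.csv(text='var1,var2,var3\nvalue1,value2,value3'))  # the redefined attach
ls()             # includes "var1" "var2" "var3"
rm(list = ls())  # cleans everything, attached vars included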

Empty R environment becomes large file when saved

I'm getting behaviour I don't understand when saving environments. The code below demonstrates the problem. I would have expected the two files (far-too-big.RData and right-size.RData) to be the same size, and also very small, because the environments they contain are empty.
In fact, far-too-big.RData ends up the same size as bigfile.RData.
I get the same results using R 2.14.1 and 2.15.2, both on Windows XP SP3. Can anyone explain why this is happening?
Both far-too-big.RData and right-size.RData, when loaded into a new R session, appear to contain nothing, i.e. they return character(0) in response to ls(). However, if I switch the saves to include ascii=TRUE and open the results in a text editor, I can see that far-too-big.RData contains the data from bigfile.RData.
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")
fn <- function() {
  load("bigfile.RData")
  test <- new.env()
  save(test, file="far-too-big.RData")
  test1 <- new.env(parent=globalenv())
  save(test1, file="right-size.RData")
}
fn()
This is not my area of expertise, but I believe environments work like this:
Any environment inherits everything in its parent environment.
All function calls create their own environment.
The result of the above in your case is:
When you run fn() it creates its own local environment, whose parent by default is globalenv().
When you create the environment test inside fn(), its parent defaults to fn()'s environment. test will therefore include the object a.
When you create the environment test1 and explicitly state that its parent is globalenv(), it is separated from fn()'s environment and does not inherit the object a.
So when saving test you also save a (somewhat hidden) copy of the object a. This does not happen when you save test1 as it does not include the object a.
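A quick way to check this explanation is to compare the file sizes after running fn() (exact numbers will vary):
file.info("bigfile.RData")$size       # the original matrix
file.info("far-too-big.RData")$size   # comparable in size: a rode along with test
file.info("right-size.RData")$size    # a few hundred bytes: test1 is truly empty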
Update
Apparently this is a more complicated topic than I used to believe. Although I might now just be quoting @Joris Meys' answer, I'd like to take a final go at it.
To me the most intuitive visualization of environments is a tree structure, where each node is an environment and each arrow points to its enclosing environment (which I would like to believe is the same as its parent, but that has to do with frames and is beyond my corner of the world). A given environment encloses all objects you can reach by moving down the tree, and it can access all objects you can reach by moving up the tree. When you save an environment, it appears you save all objects and environments that are both enclosed by it and accessible from it (with the exception of globalenv()).
However, the take home message is as Joris already stated: save your objects as lists and you don't need to worry.
If you want to know more, I can recommend Norman Matloff's excellent book The Art of R Programming. It is aimed at software development in R rather than primary data analysis and assumes you have a fair bit of programming experience. I must admit I haven't fully digested the environment part yet, but as the rest of the book is very well written and pedagogical, I assume that part is too.
Actually, it's the other way around from what @Backlin shows: the parent environment is the one that encloses the other ones. So in the case you define, the enclosing environment of test is the local environment of fn, and the enclosing environment of test1 is the global environment.
Environments behave differently from other objects in R, in the sense that they don't get copied when passed to functions or used in assignments. The environment object itself internally consists of pointers to:
a frame (which is a pairlist containing the values)
the enclosing environment (as explained above)
a hash table (which is either a list or NULL if the environment is not hashed)
The fact that an environment contains pointers makes all the difference. Environments are not all that easy to deal with; they're actually very tricky. Take a look at the code below:
> test <- new.env()
> test$a <- 1
> test2 <- test
> test2$a <- 2
> test$a
[1] 2
So the only thing you copied from test into test2 is the set of pointers. If you change a value in test2, you change it in test as well. (Actually, you change the value only once, but test and test2 both point to the same frame.)
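If you do want an independent copy, copy the frame explicitly, for instance via a list:
test3 <- list2env(as.list(test), envir = new.env())  # a genuine copy of the frame
test3$a <- 3
test$a  # still 2: test3 no longer shares test's frame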
When you try to save an environment, R has no choice but to get the values for the frame, the hash table AND the enclosing environment and save those. As the enclosing environment is an environment in itself, R will also save all enclosing environments until it reaches the global environment. As the global environment is treated in a special way in the internal code, that one is (luckily) not saved in the file.
Note the difference between an enclosing environment and a parent frame:
Say we define our functions a bit differently:
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")
fn <- function() {
  load("bigfile.RData")
  test <- new.env()
  save(test, file="far-too-big.RData")
  test1 <- new.env(parent=globalenv())
  save(test1, file="right-size.RData")
}
fn2 <- function() {
  z <- matrix(runif(1000000, 0, 1), ncol=1000)
  fn()
}
fn2()
Now we have the following situation: one would think that the file "far-too-big.RData" contains both matrix a and matrix z, but that's not the case. It contains only the matrix a. This is because the enclosing environment of fn is the global environment. The parent frame of fn is the environment of fn2, but the environment object created by fn contains a pointer to the global environment.
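You can verify this in a fresh R session (so that the global environment does not already contain a):
e <- new.env()
load("far-too-big.RData", envir = e)
ls(e$test)                   # character(0): test's own frame is empty
exists("a", envir = e$test)  # TRUE: a was saved along with test's enclosure
exists("z", envir = e$test)  # FALSE: fn2's frame was never on the chain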
On the other hand, if we do the following:
fn <- function() {
  load("bigfile.RData")
  test <- new.env()
  test$b <- a
  test2 <- new.env(parent=test)
  save(test2, file="far-too-big.RData")
}
test2 is now enclosed by two environments (test and the local environment of fn), and both environments are saved in the file as well.
Regardless of this, I personally avoid saving environments as environments, because there are more things that can go wrong. In my opinion, saving an environment as a list is the better choice in 99.9% of cases:
fn2 <- function() {
  load("bigfile.RData")
  test <- new.env()
  test$x <- "something"
  test$fn <- ls
  testlist <- as.list(test)
  save(testlist, file="right-size.RData")
}
fn2()
If you need it to be an environment, you can convert it back when loading.
load("right-size.RData")
test <- as.environment(testlist)

How do you stop source from abbreviating function names?

I am trying to source multiple functions that differ by a number in the name.
For example: func1, func2.
I tried using "func_1" and "func_2", as well as putting the number first, "1func" and "2func". No matter how I index the function names, source just reads in one function that it calls "func", which is not what I want.
I have tried using for-loops and sapply:
for-loop:
func.list <- list.files(path="/some_path", pattern="some pattern", full.names=TRUE)
for (i in 1:length(func.list)) {
  source(func.list[i])
}
sapply:
sapply(func.list,FUN=source)
I am going to be writing multiple versions of a data-correction function and would really like to be able to index them, because giving each one a concise but specific name would be difficult and would not let me selectively source just the function files from their directory.
In my code, func.list gives the output (I have replaced the actual directory because of privacy/contractual issues):
[1] "mypath/1resp.correction.R"
[2] "mypath/2resp.correction.R"
Then when I source func.list with either the for loop or the sapply code above, R loads only one function, named resp.correction, with the code body from "2resp.correction.R".
The argument to source is a file name, not a function name, so you cannot be fancy here: you need to provide the exact file names.
It sounds like both of your files define a function with the same name (resp.correction), so yes, as you source one file after the other, the function is overwritten in your global environment.
You could, inside your loop, reassign the function to a different name:
func.list <- list.files(path="/some_path", pattern="some pattern", full.names=TRUE)
for (i in 1:length(func.list)) {
  source(func.list[i], local = TRUE)  # defines resp.correction in the calling frame
  assign(paste0("resp.correction", i), resp.correction, envir = .GlobalEnv)  # keep it under an indexed name
}
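An alternative sketch, in case you would rather not create numbered copies: source each file into its own environment, indexed by file name (this assumes the same func.list as above; the final call is illustrative).
func.envs <- lapply(func.list, function(f) {
  env <- new.env()
  source(f, local = env)  # each file's resp.correction lives in its own environment
  env
})
names(func.envs) <- basename(func.list)
# e.g. call the version from "1resp.correction.R":
# func.envs[["1resp.correction.R"]]$resp.correction(my.data)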
