Introduction:
I have an RStudio project where I'm researching (fairly) big data sets. Though I try to keep the global environment clean, after some time it becomes filled with huge objects.
Problem:
RStudio always refreshes the Environment pane after debugging (it probably iterates over the global environment and calls summary() on each object), and on my global environment that takes tens of seconds. Although the refresh itself is asynchronous, the R session is busy, so you must wait for it to finish before you can continue working. That makes debugging very annoying. And I know of no way to disable the Environment pane in RStudio.
Question:
Can someone suggest a clean workaround for this? I see the following possibilities:
Customize the RStudio sources to add an option to disable the Environment pane.
Frequently clean the global environment (not convenient, because the raw data needs time-consuming preprocessing and I often change the preprocessing logic).
Maybe there are specific types of objects that cause the lag not because of their size but because of their structure?
I'm working on a reproducible example now, but it's not clear which objects are causing the issue.
I emailed RStudio support about this issue some time ago, but haven't received an answer yet.
While it's not yet available in a public release of RStudio, the v1.3 daily builds of RStudio allow you to disable automatic updates of the Environment pane:
Selecting Manual Refresh Only disables the automatic refresh of the Environment pane.
I can reproduce the problem with lots of small nested list variables.
# Populate global environment with lots of nested list variables
invisible(
  replicate(
    1000,
    assign(
      paste0(sample(letters, 10, replace = TRUE), collapse = ""),
      list(a = 1, b = list(ba = 2.1, bb = list(bba = 2.21, bbb = 2.22))),
      envir = globalenv()
    )
  )
)
f <- function() browser()
f() # hit ENTER in the console once you hit the browser
This suggests that the problem is RStudio running its equivalent of ls.str() on the global environment.
I suspect that the behaviour is implemented in one of the functions listed by ls("tools:rstudio", all.names = TRUE), but I'm not sure which. If you find it, you can override it.
Alternatively, your best bet is to rework your code so that you aren't assigning so many variables in the global environment. Wrap most of your code into functions (so most variables only exist for the lifetime of the function call). You can also define a new environment
e <- new.env(parent = globalenv())
Then assign all your results inside e. That way the refresh only takes a few microseconds.
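A minimal sketch of that pattern (the file name and preprocessing function here are made up):
e <- new.env(parent = globalenv())
# hypothetical example: keep the big objects inside e instead of the global environment
e$raw_data  <- read.csv("big_file.csv")    # placeholder file name
e$processed <- preprocess(e$raw_data)      # placeholder preprocessing function
# the Environment pane only has to summarise `e` itself, not everything inside it
ls(e)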
Related
I'm struggling to clearly explain this problem.
Essentially, something seems to have happened within the R environment: none of the code I write inside my functions is working, and no data is being saved. If I type a command directly into the console it works (e.g. Monkey <- 0), but if I put it inside a function, the value isn't stored when I run the function.
It could be that I'm missing a glaring error in the code, but I noticed the problem when I accidentally clicked on the debugger and tried to exit out of the Browse[1]> prompt that appeared.
Any ideas? This is driving me nuts.
corr <- function(directory, threshold = 0) {
  directory <- paste(getwd(), "/", directory, "/", sep = "")
  file.list <- list.files(directory)
  number <- 1:length(file.list)
  monkey <- c()
  for (i in number) {
    x <- paste(directory, file.list[i], sep = "")
    y <- read.csv(x)
    t <- sum(complete.cases(y))
    if (t >= threshold) {
      correl <- cor(y$sulfate, y$nitrate, use = 'pairwise.complete.obs')
      monkey <- append(monkey, correl)
    }
  }
  #correl <- cor(newdata$sulfate, newdata$nitrate, use='pairwise.complete.obs')
  #summary(correl)
}
corr('specdata', 150)
monkey
It's an environment (scoping) issue. Functions create their own environment, which isn't the global environment.
Using <- assigns in the local environment. To save an object to the global environment, use <<-.
Here's some information on R environments.
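A minimal illustration of the difference (the variable name is made up):
f_local <- function() {
  monkey <- 0    # assigned in the function's own environment; gone after the call
}
f_global <- function() {
  monkey <<- 0   # walks up to the enclosing (here: global) environment and assigns there
}
f_local();  exists("monkey")   # FALSE
f_global(); exists("monkey")   # TRUE
That said, relying on <<- is usually a last resort; returning the value (as discussed below) is cleaner.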
I suggest you have a look at a tutorial on using functions in R.
Briefly (and sorry for the rough explanation): objects that you define within a function exist ONLY within that function, unless you explicitly export them, for example (one of the possible approaches) with the return() function.
browser() is indeed used for debugging: it keeps you inside the function and lets you access the objects created inside it.
In addition, to increase the chance of useful answers, try to post a self-contained, working piece of code that quickly reproduces the issue. Here you are reading files we have no access to.
It seems to me you have to store the output yourself when you run your script:
corr_out <- corr('specdata', 150)
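Putting that advice together, a sketch of the reworked function (same logic as the original, just returning the vector instead of relying on side effects):
corr <- function(directory, threshold = 0) {
  directory <- file.path(getwd(), directory)
  file.list <- list.files(directory, full.names = TRUE)
  monkey <- c()
  for (f in file.list) {
    y <- read.csv(f)
    if (sum(complete.cases(y)) >= threshold) {
      monkey <- append(monkey, cor(y$sulfate, y$nitrate, use = "pairwise.complete.obs"))
    }
  }
  monkey  # the last evaluated expression is returned; equivalent to return(monkey)
}
corr_out <- corr("specdata", 150)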
Not sure if this is even possible.
I use RStudio and appreciate having an overview of the objects I'm working with in the Global Environment pane.
However, at the same time, I have some 15 or so simple wrapper functions specific to my project, e.g. around various reading and writing functions, so that they automate some file management tasks and follow my preferred folder structure; unfortunately, they also clutter that Global Environment view.
I guess I could put them all in a package but I'm quite sure I will not publish it and may not even need many of them beyond this one project.
Is there anything short of bundling them into a package for this kind of three-line function?
Thank you!
You could always put them into a list:
helper_functions <- list(f1 = function1,
                         f2 = function2)
Then you can call them with helper_functions$f2().
Example:
plus_one <- function(n) {
  return(n + 1)
}
plus_two <- function(n) {
  return(n + 2)
}
plus <- list(one = plus_one,
             two = plus_two)
plus$two(2)
# 4
I'm using
sapply(list.files('scripts/', full.names = TRUE), source)
to run 80 scripts at once from the folder "scripts/", and I don't know exactly how this works. There are "intermediate" objects with identical names across the scripts (they are iterative scripts over 80 different biological populations). Does each script only use its own objects? Is there any risk of one script picking up objects from a "previous" script that haven't yet been removed from memory, or does this work exactly as if the scripts were run manually, one by one, in sequence?
Many thanks in advance.
The quick answer is: each script runs independently, one after the other. Imagine running a for loop that iterates over all the script files instead of using sapply - the result should be the same.
To check this, I ran a quick experiment:
# This is foo.R
x <- mtcars
write.csv(x, "foo.csv")
# This is bar.R
x <- iris
write.csv(x, "bar.csv")
# Run them at once
sapply(list.files(), source)
Even though the default of the "local" argument in source() is FALSE (so everything is evaluated in the global environment), I end up with two different CSV files in my working directory: "foo.csv" containing the mtcars data frame and "bar.csv" containing the iris data frame.
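If you want to be certain that objects left behind by one script can never leak into the next (something the experiment above did not need to guard against), one option is to give each script its own scratch environment via source()'s local argument:
# run every script in "scripts/" inside its own temporary environment
sapply(
  list.files("scripts/", full.names = TRUE),
  function(f) source(f, local = new.env(parent = globalenv()))
)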
There are global variables, which you declare outside a function. As the name says, they are global and can be reassigned anywhere. If you declare a variable inside a function, it is a local variable: it only exists inside that particular function and does not exist outside of it.
Example:
globalVar <- "I am global"
foo <- function() {
  localVar <- "I don't exist outside of foo()"
}
If you declare globalVar in the first script and refer to it in a later one, it will be found. If you declare localVar inside a function in some script and refer to it in another script, outside the function, or in another function, you'll get an error (object 'localVar' not found).
Edit:
Also, if there are no dependencies between the scripts (you don't need one to finish before continuing with another), it doesn't matter whether you run them in parallel or sequentially; the behaviour will be the same.
You only have to be careful with global variables; local ones can't interfere with another script or function.
Question:
I'm using sys.source() to source a script into a new environment. However, that script itself source()'s some things as well.
When it source()'s functions, they (and their output) get loaded into R_GlobalEnv instead of into the environment specified in sys.source(). It seems the functions' enclosing and binding environments end up under R_GlobalEnv rather than the environment you pass to sys.source().
Is there a way, like sys.source(), to run a script and keep everything it creates in a separate environment? An ideal solution would not require modifying the scripts I'm sourcing and would still provide "chdir = TRUE"-style functionality.
Example:
Running this should show you what I mean:
# setup an external folder
other.folder = tempdir()
# make a functions script, it just adds "1" to the argument.
# Note: the strange-looking "assign(x=" bit is important
# to what I'm actually doing, so any solution needs to be
# robust to this.
functions = file.path(other.folder, "functions.R")
writeLines("myfunction = function(a){assign(x=c('function.output'), a+1, pos = 1)}", functions)
# make a parent script, which source()'s functions.R
# and invokes it on some data, and then modifies that data
parent = file.path(other.folder, "parent.R")
writeLines("source('functions.R')\n
original.data=1\n
myfunction(original.data)\n
resulting.data = function.output + 1", parent)
# make a separate environment
myenv = new.env()
# source parent.R into that new environment,
# using chdir=TRUE so parent.R can find functions.R
sys.source(parent, myenv, chdir = TRUE)
# You can see "myfunction" and "function.output"
# end up in R_GlobalEnv.
# Whereas "original.data" and "resulting.data" end up in the intended environment.
ls(myenv)
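To confirm where the stray objects actually land, a quick check (a sketch; the expected output follows from the behaviour described above):
exists("myfunction", envir = globalenv(), inherits = FALSE)       # TRUE
exists("function.output", envir = globalenv(), inherits = FALSE)  # TRUE
exists("original.data", envir = globalenv(), inherits = FALSE)    # FALSE; it lives in myenv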
More information (what I'm actually trying to do):
I have data from several similar experiments. I'm trying to keep everything in line with "reproducible research" ideals (for my own sanity if nothing else). So what I'm doing is keeping each experiment in its own folder. The folder contains the raw data, and all the metadata which describes each sample (treatment, genotype, etc.). The folder also contains the necessary R scripts to read the raw data, match it with metadata, process it, and output graphs and summary statistics. These are tied into a "mother script" which will do the whole process for each experiment.
This works really well, but if I want to do some meta-analysis or just compare results between experiments, there are some difficulties. Right now I think the best way would be to run each experiment's "mother script" in its own environment and then pull the data out of each environment to do my meta-analysis. An alternative approach might be to run each mother script in its own R instance, save the .RData files separately, and then re-load them into a new environment in a new instance. That seems kind of hacky, though, and I feel like there's a more elegant solution.
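For concreteness, a sketch of the first approach (the folder and script names here are hypothetical, and anything the mother scripts source() themselves would still leak as described above):
experiments <- c("experiment1", "experiment2")   # hypothetical folder names
results <- lapply(experiments, function(dir) {
  env <- new.env(parent = globalenv())
  sys.source(file.path(dir, "mother.R"), envir = env, chdir = TRUE)  # hypothetical script name
  env
})
names(results) <- experiments
# pull a result object out of each environment for the meta-analysis, e.g.:
# lapply(results, function(env) get("summary.stats", envir = env))  # hypothetical object name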
Question: How can I free all file handles / connections R is using? In Python, one can inspect which file objects are still alive. Is there anything comparable in R?
Within a function, I create a directory with some files. At the end of the function, it should be deleted again. I'm facing the problem that I'm unable to delete the files, presumably because a file handle is still open. The example uses the MetaSKAT package, but I'm interested in a general solution. The example data can be found here: https://groups.google.com/group/skat_slee/attach/28a76339619d8358/Datasets.zip?part=4&authuser=0
# Code author: Seunggeun (Shawn) Lee
setwd('./Datasets')
foo <- function(dir.name) {
  ###### Preparation stuff ################################################
  if (!require(MetaSKAT)) {install.packages('MetaSKAT'); require(MetaSKAT)}
  dir.create(file.path('.', dir.name), showWarnings = FALSE)
  dir.path <- paste("./", dir.name, sep = "")
  file.copy(c("01.fam", "01.bed", "01.bim", "01_3.SetID"), dir.path)
  setwd(dir.path)
  FAM <- read.table("01.fam", header = FALSE)
  y <- FAM[, 6]
  N.Sample <- length(y)
  x1 <- rnorm(N.Sample)
  x2 <- rbinom(N.Sample, 1, 0.5)
  obj <- SKAT_Null_Model(y ~ cbind(x1, x2))
  re <- Generate_Meta_Files(obj, "01.bed", "01.bim", "01_3.SetID", "01.MSSD", "01.MInfo", N.Sample)
  ###### Problem #######################################################
  # note: base R's file.remove() has no 'force' argument; extra arguments are treated as more file names
  print(file.remove(list.files(), force = T)) # problem: cannot delete
  # curiously, sometimes there is 1, sometimes 2 FALSE...
  ###### my different tries to solve it ################################
  rm(re)
  closeAllConnections()
  sink.number() # shows 0
  rm(list = ls())
  gc()
  ###### problem is still there ######################################
  print(file.remove(list.files()))
  setwd('..')
  # print(unlink(dir.path, recursive = TRUE)) # I finally want to delete the directory
}
debug(foo)
foo("temp2")
I am using RStudio. Even if I try to delete the files manually in Windows while R is still open, it tells me that the file is being used by another program. I can only delete it after I close R.
So how can I force R to free these files? I will try to solve the problem at the root and look at the source code of Generate_Meta_Files(), but I thought there must be a global function in R that forces it to release everything. (Note: I am well aware that it doesn't make sense to create the files and delete them right away; it's just an example.)
Edit: After a hint, I tried it under Linux. It turns out that although it reports a problem deleting one (of the 6) files, everything is in fact deleted properly, so I guess this is a Windows-specific problem. Any hints as to what causes this?
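For the general part of the question (seeing what R itself still has open), a sketch; note this only covers connections the R interpreter knows about, so handles opened by compiled code, as Generate_Meta_Files() presumably uses, will not appear here:
# list every connection R currently knows about, including stdin/stdout/stderr
showConnections(all = TRUE)
# close all user-created connections (connections 0-2 are the standard streams)
for (i in getAllConnections()) {
  if (i > 2) close(getConnection(i))
}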