I'm struggling to clearly explain this problem.
Essentially, something seems to have happened within the R environment: none of the code I write inside my functions works, and no data is being saved. If I type a command directly into the console it works (e.g. Monkey <- 0), but if I put the same line inside a function, nothing is stored when I run the function.
It could be that I'm missing a glaring error in the code, but I noticed the problem after I accidentally triggered the debugger and tried to exit out of the Browse[1]> prompt that appeared.
Any ideas? This is driving me nuts.
corr <- function(directory, threshold=0) {
  directory <- paste(getwd(), "/", directory, "/", sep="")
  file.list <- list.files(directory)
  number <- 1:length(file.list)
  monkey <- c()
  for (i in number) {
    x <- paste(directory, file.list[i], sep="")
    y <- read.csv(x)
    t <- sum(complete.cases(y))
    if (t >= threshold) {
      correl <- cor(y$sulfate, y$nitrate, use='pairwise.complete.obs')
      monkey <- append(monkey, correl)
    }
  }
  #correl <- cor(newdata$sulfate, newdata$nitrate, use='pairwise.complete.obs')
  #summary(correl)
}

corr('specdata', 150)
monkey
It's a scoping issue. Functions create their own environment, which is not the global environment.
Using <- inside a function assigns in that local environment. To save an object to the global environment, use <<-.
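For example, a minimal sketch of the difference (assuming no object called monkey already exists in your workspace):
f_local <- function() {
  monkey <- 0    # assigned in the function's own environment; discarded when f_local() returns
}
f_global <- function() {
  monkey <<- 0   # searches the enclosing environments and, finding no monkey, assigns in the global environment
}
f_local()
exists("monkey")   # FALSE
f_global()
exists("monkey")   # TRUE
That said, returning the value and assigning the result at the call site is usually cleaner than <<-.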
Here's some information on R environments.
I suggest you have a look at a tutorial on using functions in R.
Briefly (and forgive the rough explanation): objects that you define within a function exist ONLY within that function, unless you explicitly export them, for example (one of the possible approaches) with the return() function.
browser() is indeed used for debugging: it pauses execution inside the function and lets you access the objects created inside it.
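As a tiny illustrative sketch (not your code): at the Browse[1]> prompt you can inspect the function's objects, type n to step, c to continue, or Q to quit back to the console.
f <- function() {
  x <- 42
  browser()   # execution pauses here with a Browse[1]> prompt; x is visible because you are inside f()
  x * 2
}
f()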
In addition, to increase the chance of getting useful answers, try to post a self-contained, working piece of code that reproduces the issue quickly. Here you are reading files we have no access to.
It seems to me you have to store the output yourself when you run your script:
corr_out <- corr('specdata', 150)
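Note that corr_out will only contain the correlations if corr() actually returns monkey; a minimal sketch of that change (assuming the same specdata files with sulfate and nitrate columns):
corr <- function(directory, threshold = 0) {
  files <- list.files(file.path(getwd(), directory), full.names = TRUE)
  monkey <- c()
  for (f in files) {
    y <- read.csv(f)
    if (sum(complete.cases(y)) >= threshold) {
      monkey <- append(monkey, cor(y$sulfate, y$nitrate, use = "pairwise.complete.obs"))
    }
  }
  monkey   # the last evaluated expression is the function's return value
}

corr_out <- corr("specdata", 150)
head(corr_out)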
Not sure if this is even possible.
I use RStudio and appreciate having an overview of the objects I'm working with in the Global Environment pane.
However, I also have some 15 or so simple wrapper functions that are specific to my project, e.g. around various reading and writing functions, so that they automate some file management tasks and follow my preferred folder structure; unfortunately, they also clutter that Global Environment view.
I guess I could put them all in a package but I'm quite sure I will not publish it and may not even need many of them beyond this one project.
Is there anything short of bundling them into a package for this kind of three-line function?
Thank you!
You could always put them into a list:
helper_functions <- list(f1 = function1,
                         f2 = function2)
Then you can call them by helper_functions$f2().
Example:
plus_one <- function(n){
  return(n + 1)
}
plus_two <- function(n){
  return(n + 2)
}
plus <- list(one = plus_one,
             two = plus_two)
plus$two(2)
# 4
This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.
Essentially I have a function that pulls a large quantity of data from a DB, cleans and reduces its size, and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. These are compiled into a complex list structure that totals about 10 MB.
It's then supposed to be saved as an RDS file on AWS S3, where it's retrieved in a production environment to build predictions.
e.g.
db.connection <- db.connection.object

build_model_list <- function(db.connection) {

  clean_and_build_models <- function(db.connection, other.parameters) {
    get_db_data <- function(db.connection, some.parameters) {}  # Retrieves db data -- externally defined
    db.data <- get_db_data()
    build_models <- function(db.data, some.parameters) {}       # Externally defined
    clean_data <- function(db.data, some.parameters) {}         # Cleans and filters data based on parameters -- externally defined
    clean.data <- clean_data()
    lm_model <- function(clean.data) {}                         # Builds lm model based on clean.data -- externally defined
    lm.model <- lm_model()
    return(list(lm.model, other.parameters))
  }

  looped.model.object <- llply(some.parameters, clean_and_build_models)  # llply() from the plyr package
  return(looped.model.object)
}

model.list <- build_model_list()
saveRDS(model.list, "~/a_place/model_list.RDS")
The issue I'm getting is that the 'model.list' object, which is only about 10 MB in memory, inflates to many GB when I save it locally as RDS or try to upload it to AWS S3.
I should note that though the function processes very large quantities of data (~ 5 million rows), the data used in the outputs is no larger than a few hundred rows.
Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (as part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.
This however has some big disadvantages.
Firstly, it's trial and error and follows no clear logic; with frequent crashes and a couple of hours needed to build the list object, it makes for a very long debugging cycle.
Secondly, it means my main function will be many hundreds of lines long, which will make future alterations and debugging much trickier.
My question to you is:
Has anyone encountered this issue before?
Any hypotheses as to what's causing it?
Has anyone found a logical non-trial-and-error solution to this?
Thanks for your help.
It took a bit of digging but I did actually find a solution in the end.
It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:
https://blogs.oracle.com/R/entry/is_the_size_of_your
It turns out that the lm.object$terms component includes an environment that references the objects present in the global environment when the model was built. Under certain circumstances, when you call saveRDS(), R will pull those environment objects into the saved object.
As I had ~0.5 GB sitting in the global environment and a list of ~200 lm model objects, this caused the RDS object to inflate dramatically, since it was effectively trying to compress ~100 GB of data.
To test whether this is what's causing the problem, execute the following code:
as.matrix(lapply(lm.object, function(x) length(serialize(x,NULL))))
This will tell you if the $terms component is inflating.
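A quick way to see the discrepancy is to compare the in-memory size, which does not follow the environment captured in $terms, with the serialized size, which is what saveRDS() actually writes (lm.object here stands for one of your fitted models):
object.size(lm.object)              # in-memory size; does not count objects reachable only via environments
length(serialize(lm.object, NULL))  # bytes that would be written to disk, including the captured environment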
The following code will remove the environmental references from the $terms component:
rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment"))
Be warned, though: it will also remove all of the global environment objects it references.
For model objects you could also simply delete the reference to the environment.
For example, like this:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
attr(lm.D9$terms, ".Environment") <- NULL
saveRDS(lm.D9, file = "path_to_save.RDS")
This unfortunately breaks the model, but you can add an environment manually again after loading:
lm.D9 <- readRDS("path_to_save.RDS")
attr(lm.D9$terms, ".Environment") <- globalenv()
This helped me in my specific use case and looks a bit safer to me...
Neither of these two solutions worked for me.
Instead I have used:
downloaded_object <- storage_download(connection, "path")
read_RDS <- readRDS(downloaded_object)
The answer by mhwh mostly solved my problem, but with the additional step of creating an empty list and copying into it whatever is relevant from the model object. This might be due to additional (undocumented) environment references associated with the model class I used.
library(lfe)  # felm() comes from the lfe package
mm <- felm(formula=formula, data=data, keepX=TRUE, ...)
# Make an empty list and copy into it what we need:
mm_cp <- list()
mm_cp$coefficients <- mm$coefficients
# mm_cp$ <- something else from mm you might need ...
mm_cp$terms <- terms(mm)
attr(mm_cp$terms, ".Environment") <- NULL
saveRDS(mm_cp, file = "path_to_save.RDS")
Then when we need to use it:
mm_cp <- readRDS("path_to_save.RDS")
attr(mm_cp$terms, ".Environment") <- globalenv()
In my case the file went from 5.5G to 13K. Additionally, when reading in the file it used to allocate >32G of memory, more than 6 times the file-size. This also reduced execution time significantly (no need to recreate various environments?).
Environment references sound like an excellent contender for a new chapter in the R Inferno book.
First off, I'm an R beginner taking an R programming course at the moment. It is extremely lacking in teaching the fundamentals of R, so I'm trying to teach myself via you wonderful contributors on Stack Overflow. I'm trying to figure out how nested functions work, which means I also need to learn how lexical scoping works. Right now I've got a function that computes the complete cases in multiple CSV files and spits out a nice table.
Here's the CSV files:
https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
And here's my code. I realize it'd be cleaner if I used the apply functions, but it works as is:
complete <- function(directory, id = 1:332){
  data <- NULL
  for (i in 1:length(id)) {
    data[[i]] <- c(paste(directory, "/", formatC(id[i], width=3, flag=0),
                         ".csv", sep=""))
  }
  cases <- NULL
  for (d in 1:length(data)) {
    cases[[d]] <- c(read.csv(data[d]))
  }
  df <- NULL
  for (c in 1:length(cases)) {
    df[[c]] <- (data.frame(cases[c]))
  }
  dt <- do.call(rbind, df)
  ok <- (complete.cases(dt))
  finally <- as.data.frame(table(dt[ok, "ID"]), colnames=c("id", "nobs"))
  colnames(finally) <- c('id', 'nobs')
  return(finally)
}
I am now trying to access the variables in the data frame finally, which is the output of the above function, from within this new function:
corr <- function(directory, threshold = 0){
  complete(directory, id = 1:332)
  finally$nobs
}
corr('specdata')
Without finally$nobs this function spits out the data frame, as it should, but when I try to call the variable nobs in the object finally, it says the object finally is not found. I realize this problem is due to my lack of understanding of lexical scoping; my professor hasn't made lexical scoping very clear, so I'm not totally sure how to find the object within the nested function environment... any help would be great.
The object finally is only in scope within the function complete(). If you want to do something further with the object you are returning, you need to store it in a variable in the environment you are working in (in this instance, that environment is the body of corr(); if we weren't working inside any function, it would be the global environment). In other words, this code should work:
corr <- function(directory, threshold=0){
  this.finally <- complete(directory, id=1:332)
  this.finally$nobs
}
I am calling the object that is returned by complete() this.finally to help distinguish it from the object finally that is now out of scope. Of course, you can call it anything you like!
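And because this.finally$nobs is the last expression evaluated in corr(), it is also the function's return value, so in the calling environment you can capture it the same way:
nobs_values <- corr("specdata")
head(nobs_values)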
I have what I think is a common enough issue, on optimising workflow in R. Specifically, how can I avoid ending up with a folder full of output (plots, RData files, csv, etc.) and, after some time, not having a clue where it came from or how it was produced? In part, the answer surely involves being intelligent about folder structure. I have been looking around, but I'm unsure of what the best strategy is.
So far, I have tackled it in a rather unsophisticated (overkill) way: I created a function MetaInfo (see below) that writes a text file with metadata, under a given file name. The idea is that whenever a plot is produced, this command is issued to produce a text file with exactly the same file name as the plot (except, of course, the extension), with information on the system, session, packages loaded, R version, the function and file the metadata function was called from, etc. The questions are:
(i) How do people approach this general problem? Are there obvious ways to avoid the issue I mentioned?
(ii) If not, does anyone have any tips on improving this function? At the moment it's perhaps clunky and not ideal. In particular, getting the file name from which the plot is produced doesn't necessarily work (the solution I use is one provided by @hadley in 1). Any ideas would be welcome!
The function assumes git, so please ignore the probable warning produced. This is the main function, stored in a file metainfo.R:
MetaInfo <- function(message=NULL, filename)
{
  # message  - character string - Any message to be written into the information
  #            file (e.g., data used).
  # filename - character string - the name of the txt file (including relative
  #            path). Should be the same as the output file it describes (RData,
  #            csv, pdf).
  #
  if (is.null(filename))
  {
    stop('Provide an output filename - parameter filename.')
  }
  filename <- paste(filename, '.txt', sep='')
  # Try to get as close as possible to getting the file name from which the
  # function is called.
  source.file <- lapply(sys.frames(), function(x) x$ofile)
  source.file <- Filter(Negate(is.null), source.file)
  t.sf <- try(source.file <- basename(source.file[[length(source.file)]]),
              silent=TRUE)
  if (class(t.sf) == 'try-error')
  {
    source.file <- NULL
  }
  func <- deparse(sys.call(-1))
  # MetaInfo isn't always called from within another function, so func could
  # return as NULL or as general environment.
  if (any(grepl('eval', func, ignore.case=TRUE)))
  {
    func <- NULL
  }
  time <- strftime(Sys.time(), "%Y/%m/%d %H:%M:%S")
  git.h <- system('git log --pretty=format:"%h" -n 1', intern=TRUE)
  meta <- list(Message=message,
               Source=paste(source.file, ' on ', time, sep=''),
               Functions=func,
               System=Sys.info(),
               Session=sessionInfo(),
               Git.hash=git.h)
  sink(file=filename)
  print(meta)
  sink(file=NULL)
}
which can then be called in another function, stored in another file, e.g.:
source('metainfo.R')
RandomPlot <- function(x, y)
{
  fn <- 'random_plot'
  pdf(file=paste(fn, '.pdf', sep=''))
  plot(x, y)
  MetaInfo(message=NULL, filename=fn)
  dev.off()
}
x <- 1:10
y <- runif(10)
RandomPlot(x, y)
This way, a text file with the same file name as the plot is produced, with information that could hopefully help figure out how and where the plot was produced.
In terms of general R organization: I like to have a single script that recreates all work done for a project. Any project should be reproducible with a single click, including all plots or papers associated with that project.
So, to stay organized: keep a different directory for each project; each project has its own functions.R script to store non-package functions associated with that project, and each project has a master script that starts like
## myproject
source("functions.R")
source("read-data.R")
source("clean-data.R")
etc... all the way through. This should help keep everything organized, and if you get new data you just go to early scripts to fix up headers or whatever and rerun the entire project with a single click.
There is a package called ProjectTemplate that helps organize and automate the typical workflow with R scripts, data files, charts, etc. There are also a number of helpful documents, like this one: Workflow of statistical data analysis by Oliver Kirchkamp.
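A rough sketch of getting started with ProjectTemplate (the folder names come from the package's standard layout):
install.packages("ProjectTemplate")
library(ProjectTemplate)
create.project("myproject")   # creates the standard folder skeleton (data/, munge/, src/, lib/, ...)
setwd("myproject")
load.project()                # loads data/, runs the munging scripts, and sources helpers in lib/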
If you use Emacs and ESS for your analyses, learning Org-Mode is a must. I use it to organize all my work. Here is how it integrates with R: R Source Code Blocks in Org Mode.
There is also this new free tool called Drake which is advertised as "make for data".
I think my question betrays a certain level of confusion. Having looked around, as well as explored the suggestions provided so far, I have reached the conclusion that it is probably not important to know where and how a file was produced. You should in fact be able to wipe out any output and reproduce it by rerunning the code. So while I might still use the above function for extra information, it really is a question of being ruthless and cleaning up folders every now and then. These ideas are more eloquently explained here. This of course does not preclude the use of Make/Drake or ProjectTemplate, which I will try to pick up. Thanks again for the suggestions @noah and @alex!
There is also now an R package called drake (Data Frames in R for Make), independent from Factual's Drake. The R package is also a Make-like build system that links code/dependencies with output.
install.packages("drake") # It is on CRAN.
library(drake)
load_basic_example()
plot_graph(my_plan)
make(my_plan)
Like its predecessor remake, it has the added bonus that you do not have to keep track of a cumbersome pile of files. Objects generated in R are cached during make() and can be reloaded easily.
readd(summ_regression1_small) # Read objects from the cache.
loadd(small, large) # Load objects into your R session.
print(small)
But you can still work with files as single-quoted targets. (See 'report.Rmd' and 'report.md' in my_plan from the basic example.)
There is a package developed by RStudio called pins that might address this problem.
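A minimal sketch of the idea, assuming the pins (>= 1.0) API and a local board (the pin name here is made up):
library(pins)
board <- board_local()                       # or board_folder("shared/path") for a board colleagues can reach
pin_write(board, mtcars, name = "cars-raw")  # stores the object along with metadata (who, when, version)
pin_read(board, "cars-raw")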
I'm getting behaviour I don't understand when saving environments. The code below demonstrates the problem. I would have expected the two files (far-too-big.RData, and right-size.RData) to be the same size, and also very small because the environments they contain are empty.
In fact, far-too-big.RData ends up the same size as bigfile.RData.
I get the same results using 2.14.1 and 2.15.2, both on WinXP 5.1 SP3. Can anyone explain why this is happening?
Both far-too-big.RData and right-size.RData, when loaded into a new R session, appear to contain nothing, i.e. they return character(0) in response to ls(). However, if I switch the saves to include ascii=TRUE and open the result in a text editor, I can see that far-too-big.RData contains the data in bigfile.RData.
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")

fn <- function() {
  load("bigfile.RData")
  test <- new.env()
  save(test, file="far-too-big.RData")
  test1 <- new.env(parent=globalenv())
  save(test1, file="right-size.RData")
}

fn()
This is not my area of expertise, but I believe environments work like this:
Any environment inherits everything in its parent environment.
All function calls create their own environment.
The result of the above in your case is:
When you run fn() it creates its own local environment (green), whose parent by default is globalenv() (grey).
When you create the environment test (red) inside fn() its parent defaults to fn()'s environment (green). test will therefore include the object a.
When you create the environment test1 (blue) and explicitly state that its parent is globalenv(), it is separated from fn()'s environment and does not inherit the object a.
So when saving test you also save a (somewhat hidden) copy of the object a. This does not happen when you save test1 as it does not include the object a.
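A small sketch of that inheritance (the object names here are made up, and it assumes no a_local exists in your workspace):
fn <- function() {
  a_local <- 1:10
  e_child  <- new.env()                      # parent defaults to fn()'s evaluation environment
  e_orphan <- new.env(parent = globalenv())  # parent explicitly set to the global environment
  c(child  = exists("a_local", envir = e_child),   # TRUE: found by walking up to fn()'s environment
    orphan = exists("a_local", envir = e_orphan))  # FALSE: the global environment has no a_local
}
fn()
#  child orphan
#   TRUE  FALSE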
Update
Apparently this is a more complicated topic than I used to believe. Although I might just be quoting @joris-meys' answer now, I'd like to take a final go at it.
To me the most intuitive visualization of environments is a tree structure, where each node is an environment and the arrows point to its respective enclosing environment (which I would like to believe is the same as its parent, but that has to do with frames and is beyond my corner of the world). A given environment encloses all objects you can reach by moving down the tree, and it can access all objects you can reach by moving up the tree. When you save an environment, it appears you save all objects and environments that are both enclosed by it and accessible from it (with the exception of globalenv()).
However, the take home message is as Joris already stated: save your objects as lists and you don't need to worry.
If you want to know more, I can recommend Norman Matloff's excellent book The Art of R Programming. It is aimed at software development in R rather than primary data analysis and assumes you have a fair bit of programming experience. I must admit I haven't fully digested the environment part yet, but as the rest of the book is very well written and pedagogical, I assume that part is too.
Actually, it's the other way around from what @Backlin shows: the parent environment is the one that encloses the other ones. So in the case you define, the enclosing environment of test is the local environment of fn, and the enclosing environment of test1 is the global environment.
Environments behave differently from other objects in R, in the sense that they don't get copied when passed to functions or used in assignments. The environment object itself internally consists of pointers to:
a frame (which is a pairlist containing the values)
the enclosing environment (as explained above)
a hash table (which is either a list or NULL if the environment is not hashed)
The fact that an environment contains pointers makes all the difference. Environments are not all that easy to deal with; they're actually very tricky. Take a look at the code below:
> test <- new.env()
> test$a <- 1
> test2 <- test
> test2$a <- 2
> test$a
[1] 2
So the only thing you copied from test to test2 is the pointer. If you change a value in test2, you change it in test as well. (Actually, you change the value only once, but test and test2 both point to the same frame.)
When you try to save an environment, R has no choice but to get the values for the frame, the hash table AND the enclosing environment and save those. As the enclosing environment is an environment in itself, R will also save all enclosing environments until it reaches the global environment. As the global environment is treated in a special way in the internal code, that one is (luckily) not saved in the file.
Note the difference between an enclosing environment and a parent frame:
Say we define our functions a bit differently:
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")

fn <- function() {
  load("bigfile.RData")
  test <- new.env()
  save(test, file="far-too-big.RData")
  test1 <- new.env(parent=globalenv())
  save(test1, file="right-size.RData")
}

fn2 <- function(){
  z <- matrix(runif(1000000, 0, 1), ncol=1000)
  fn()
}

fn2()
Now we have the following situation: one would think that the file "far-too-big.RData" contains both matrix a and matrix z, but that's not the case. It contains only the matrix a. This is because the enclosing environment of fn is the global environment. The parent frame of fn is the environment of fn2, but the environment object created by fn contains a pointer to the global environment.
On the other hand, if we do the following:
fn <- function() {
  load("bigfile.RData")
  test <- new.env()
  test$b <- a
  test2 <- new.env(parent=test)
  save(test2, file="far-too-big.RData")
}
test2 is now enclosed in two environments (being test and the environment of fn), and both environments are saved in the file as well.
Regardless of this, I personally avoid saving environments as environments, because there are more things that can go wrong. In my opinion, saving an environment as a list is the better choice in 99.9% of cases:
fn2 <- function(){
  load("bigfile.RData")
  test <- new.env()
  test$x <- "something"
  test$fn <- ls
  testlist <- as.list(test)
  save(testlist, file="right-size.RData")
}

fn2()
If you need it to be an environment, you can convert it back when loading.
load("right-size.RData")
test <- as.environment(testlist)