Updating embedded data, for example sysdata.rda - r

My latest submission to CRAN got bounced back because I have assignments to the global environment which is now frowned upon.
I have an embedded data set (sysdata.rda) that contains configuration parameters based upon state (as in United States) the user resides. I have wanted this embedded data set to be updatable when a new user uses the program. I previously updated this data in the initial function the user uses and made it accessible to the user via global assignment.
I am struggling to figure out how to update this embedded data and make it the default data that the user uses for the remainder of their session.
Previously I housed the data in /data and recently switched it to /R/sysdata.rda as it seemed more suited for that locale. Now I'm not so sure.
Any help greatly appreciated

The key is to do the assignment in an environment other than the global environment. There are two basic techniques, using local() and <<- or explicitly creating a new environment:
Working with an explicit environment is straightforward: create the environment and then assign into it like a list:
my_opts <- new.env(parent = emptyenv())
set_state <- function(value) my_opts$state <- value
get_state <- function() my_opts$state
Using local() is a little more complex, and requires some tricks with <<-
set_state <- NULL
get_state <- NULL
local({
state <- NULL
get_state <<- function() state
set_state <<- function(value) state <<- value
})
For more info on how <<- works see https://github.com/hadley/devtools/wiki/environments, in the "Assignment: binding names to values" section.

Why not have an foo.R file in /data which loads the data and updates it when the user calls data(foo)? This is one of the allowed options for /data, but note the following from Writing R Extensions
Note that R code should be “self-sufficient” and not make use of extra functionality provided by the package, so that the data file can also be used without having to load the package.
If you can live with that restriction then data(foo) could load the data, update it, and ensure that it is in a specific named object, which you then refer to in your functions.

Related

How to load internally stored data before functions in an R package?

I am currently writing an R package containing project-specific data cleaning functions for my collaborators, using devtools and roxygen2 and following the RStudio suggested formats. Many of these functions essentially fix typos/common data entry errors using reference files (dataframes) that are currently stored in the package's sysdata.rda file under /R. Prior to the issue I present below, referencing the files and using them within functions was working fine. Lazy load is set to true.
I would like to make a function that allows users to add a row to these reference files if they come across a novel typo/error. From my research and from reading the very helpful information at https://r-pkgs.org/data.html, it seems the best way to do this is to list the reference files to a new environment, and then allow the user to edit those files within the session-specific environment. Ideally, these changes would persist across sessions but I cannot figure out how to make that work so am continuing down this path.
For brevity, I've only included one of these files, called column_standardized, that contains standard names for the columns as well as the potential alternatives we regularly come across. A function called "standardize.columns" coerces column names of input data frames to the standards and reorders them also to our agreed standard.
Here is a short reproducible of the column_standardized:
column_standardized <- data.frame(standard_name = c("date", "date", "time", "ID", "ID", "ID", "location", "location"), other_names = c("DATE", "day", "TIME", "id", "individual", "name", "LOCATION", "locale"))
To do this, I created a file in /R that contains the following code, based heavily on the example in https://r-pkgs.org/data.html (section 8.5, Internal State). The file is titled "aaaaa.R" so it comes before other functions in /R:
(The reason I set key, correct_col, and alt_col are so that the same function can act upon different reference files with the same code. It works fine and I am not necessarily seeking feedback on the data manipulation aspects of this function.)
the <- new.env(parent = emptyenv())
the$column_standardized <- column_standardized
#' Add alternative names to reference
#'
#' #param correct A character string of the standard/correct name
#' #param alt A character string of the alternative name
#' #param data_type A character string indicating which reference file to edit. Options are 'column' (and others in reality).
#'
#' #return NA; edits included reference files for current R session.
#' #export
add.alt <- function(correct, alt, data_type){
if(data_type == "column"){
key <- the$column_standardized
correct_col <- "standard_name"
alt_col <- "other_names"
}else{print("\nData type not found. Acceptable options are 'column', etc.")}
new_key <- data.frame(matrix(ncol = length(colnames(key)), nrow = 1))
colnames(new_key) <- colnames(key)
new_key[1,correct_col] <- correct
new_key[1, alt_col] <- alt
key <- rbind(key, new_key)
if(data_type == "column"){
the$column_standardized <- key
invisible(key)
}}
No errors/issues are flagged when I run document() or load_all(), but when I check the package it is unable to install because column_standardized does not exist. I assume this is because sysdata.rda is loaded after .R files in /R.
I have also tried to put column_standardized in the /data folder and call it using system.file but run into the same error.
The actual file is over 300 rows long, and there are multiple reference files, and so I don't think it makes sense to just recreate the data frame in the environment from scratch, although I've considered it.
Finally, to my specific questions:
Is there a way to load the system data first so that .R files in /R can reference the data included?
I am not wed to storing them internally although that would be ideal for privacy reasons, and could move the dataframes to /data or another location. This seemed initially simplest but I could be wrong.
Is there a modification that would allow each user to "permanently" modify these reference files? It wouldn't be too much of a headache for them to run add.alt() each session since the files already contain most common errors, and once user data is edited/standardized one usually does not need to restandardize in another session. If someone knows a solution, however, it would probably be the ideal.
I am potentially completely off-base here as this is my first time developing a package, so any tips are appreciated! Many thanks in advance, and happy to provide more documentation information if I've forgotten anything crucial.

R not remembering objects written within functions

I'm struggling to clearly explain this problem.
Essentially, something has seemed to have happened within the R environment and none of the code I write inside my functions are working and not data is being saved. If I type a command line directly into the console it works (i.e. Monkey <- 0), but if I type it within a function, it doesn't store it when I run the function.
It could be I'm missing a glaring error in the code, but I noticed the problem when I accidentally clicked on the debugger and tried to excite out of the browser[1] prompt which appeared.
Any ideas? This is driving me nuts.
corr <- function(directory, threshold=0) {
directory <- paste(getwd(),"/",directory,"/",sep="")
file.list <- list.files(directory)
number <- 1:length(file.list)
monkey <- c()
for (i in number) {
x <- paste(directory,file.list[i],sep="")
y <- read.csv(x)
t <- sum(complete.cases(y))
if (t >= threshold) {
correl <- cor(y$sulfate, y$nitrate, use='pairwise.complete.obs')
monkey <- append(monkey,correl)}
}
#correl <- cor(newdata$sulfate, newdata$nitrate, use='pairwise.complete.obs')
#summary(correl)
}
corr('specdata', 150)
monkey```
It's a namespace issue. Functions create their own 'environment', that isn't necessarily in the global environment.
Using <- will assign in the local environment. To save an object to the global environment, use <<-
Here's some information on R environments.
I suggest you give a look at some tutorial on using functions in R.
Briefly (and sorry for my horrible explanation) objects that you define within functions will ONLY be defined within functions, unless you explicitly export them using (one of the possible approaches) the return() function.
browser() is indeed used for debugging, keeps you inside the function, and allows you accessing objects created inside the function.
In addition, to increase the probability to have useful answers, I suggest that you try to post a self-contained, working piece of code allowing quickly reproducing the issue. Here you are reading some files we have no access to.
It seems to me you have to store the output yourself when you run your script:
corr_out <- corr('specdata', 150)

Using delayed assignment in a function: How do I send the promise back to the parent environment?

I would like to used delayedAssign to load a series of data from a set of files only when the data is needed. But since these files will always be in the same directory (which may be moved around), instead of hard coding the location of each file (which would be tedious to change later on if the directory was moved), I would like to simply make a function that accepts the filepath for the directory.
loadLayers <- function(filepath) {
delayedAssign("dataset1", readRDS(file.path(filepath, "experiment1.rds")))
delayedAssign("dataset2", readRDS(file.path(filepath, "experiment2.rds")))
delayedAssign("dataset3", readRDS(file.path(filepath,"experiment3.rds")))
return (list <- (setOne = dataset1, setTwo = dataset2, setThree = dataset3)
}
So instead of loading all the data sets at the start, I'd like to have each data set loaded only when needed (which speeds up the shiny app).
However, I'm having trouble when doing this in a function. It works when the delayedAssign is not in a function, but when I put them in a function, all the objects in the list simply return null, and the "promise" to evaluate them when needed doesn't seem to be fulfilled.
What would be the correct way to achieve this?
Thanks.
Your example code doesn't work in R, but even conceptually, you're using delayedAssign and then you immediately resolve it by referencing it in return() so you end up loading everything anyway. To make it clear, assignments are binding a symbol to a value in an enviroment. So in order for it to make any sense your function must return the environment, not a list. Or, you can simply use the global environment and the function doesn't need to return anything as you use it for its side-effect.
loadLayers <- function(filepath, where=.GlobalEnv) {
delayedAssign("dataset1", readRDS(file.path(filepath, "experiment1.rds")),
assign.env=where)
delayedAssign("dataset2", readRDS(file.path(filepath, "experiment2.rds")),
assign.env=where)
delayedAssign("dataset3", readRDS(file.path(filepath, "experiment3.rds")),
assign.env=where)
where
}

Update a data frame in shiny server.R without restarting the App

Any ideas on how to update a data frame that shiny is using without stopping and restarting the application?
I tried putting a load(file = "my_data_frame.RData", envir = .GlobalEnv) inside a reactive function but so far no luck. The data frame isn't updated until after the app is stopped.
If you just update regular variables (in the global environment, or otherwise) Shiny doesn't know to react to them. You need to use a reactiveValues object to store your variables instead. You create one using reactiveValues() and it works much like an environment or list--you can store objects by name in it. You can use either $foo or [['foo']] syntax for accessing values.
Once a reactive function reads a value from a reactiveValues object, if that value is overwritten by a different value in the future then the reactive function will know it needs to re-execute.
Here's an example (made more complicated by the fact that you are using load instead of something that returns a single value, like read.table):
values <- reactiveValues()
updateData <- function() {
vars <- load(file = "my_data_frame.RData", envir = .GlobalEnv)
for (var in vars)
values[[var]] <- get(var, .GlobalEnv)
}
updateData() # also call updateData() whenever you want to reload the data
output$foo <- reactivePlot(function() {
# Assuming the .RData file contains a variable named mydata
plot(values$mydata)
}
We should have better documentation on this stuff pretty soon. Thanks for bearing with us in the meantime.

Empty R environment becomes large file when saved

I'm getting behaviour I don't understand when saving environments. The code below demonstrates the problem. I would have expected the two files (far-too-big.RData, and right-size.RData) to be the same size, and also very small because the environments they contain are empty.
In fact, far-too-big.RData ends up the same size as bigfile.RData.
I get the same results using 2.14.1 and 2.15.2, both on WinXP 5.1 SP3. Can anyone explain why this is happening?
Both far-too-big.RData and right-size.RData, when loaded into a new R session, appear to contain nothing. ie they return character(0) in response to ls(). However, if I switch the saves to include ascii=TRUE, and open the result in a text editor, I can see that far-too-big.RData contains the data in bigfile.RData.
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")
fn <- function() {
load("bigfile.RData")
test <- new.env()
save(test, file="far-too-big.RData")
test1 <- new.env(parent=globalenv())
save(test1, file="right-size.RData")
}
fn()
This is not my area of expertise but I belive environments work like this.
Any environment inherits everything in its parent environment.
All function calls create their own environment.
The result of the above in your case is:
When you run fn() it creates its own local environment (green), whose parent by default is globalenv() (grey).
When you create the environment test (red) inside fn() its parent defaults to fn()'s environment (green). test will therefore include the object a.
When you create the environment test1 (blue) and explicitly states that its parent is globalenv() it is separated from fn()'s environment and does not inherit the object a.
So when saving test you also save a (somewhat hidden) copy of the object a. This does not happen when you save test1 as it does not include the object a.
Update
Apparently this is a more complicated topic than I used to believe. Although I might just be quoting #joris-mays answer now I'd like to take a final go at it.
To me the most intuitive visualization of environments would be a tree structure, see below, where each node is an environment and the arrows point to its respective enclosing environment (which I would like to believe is the same as its parent, but that has to do with frames and is beyond my corner of the world). A given environment encloses all objects you can reach by moving down the tree and it can access all objects you can reach by moving up the tree. When you save an environment it appears you save all objects and environments that are both enclosed by it and accessible from it (with the exception of globalenv()).
However, the take home message is as Joris already stated: save your objects as lists and you don't need to worry.
If you want to know more I can recommend Norman Matloff's excellent book the art of R programming. It is aimed at software development in R rather than primary data analysis and assumes you have a fair bit of programming experience. I must admit I haven't fully digested the environment part yet, but as the rest of the book is very well written and pedagogical I assume this one is too.
Actually, it's the other way around than #Backlin shows: the parent environment is the one that encloses the other ones. So in the case you define, the enclosing environment of test is the local environment of fn, and the enclosing environment of test1 is the global environment, like this:
Environments behave different from other objects in R, in the sense that they don't get copied when passed to functions or used in assignments. The environment object itself consists internally of pointers to :
a frame (which is a pairlist containing the values)
the enclosing environment (as explained above)
a hash table (which is either a list or NULL if the environment is not hashed)
The fact that an environment contains pointers, makes all the difference. Environments are not all that easy to deal with, they're actually very tricky. Take a look at the code below :
> test <- new.env()
> test$a <- 1
> test2 <- test
> test2$a <- 2
> test$a
[1] 2
So the only thing you copied from test in test2, is the pointers. If you change a value in test2, you change that in test as well. (Actually, you change that value only once, but test and test2 point both to the same frame).
When you try to save an environment, R has no choice but to get the values for the frame, the hash table AND the enclosing environment and save those. As the enclosing environment is an environment in itself, R will also save all enclosing environments until it reaches the global environment. As the global environment is treated in a special way in the internal code, that one is (luckily) not saved in the file.
Note the difference between an enclosing environment and a parent frame:
Say we define our functions a bit different :
a <- matrix(runif(1000000, 0, 1), ncol=1000)
save(a, file="bigfile.RData")
fn <- function() {
load("bigfile.RData")
test <- new.env()
save(test, file="far-too-big.RData")
test1 <- new.env(parent=globalenv())
save(test1, file="right-size.RData")
}
fn2 <- function(){
z <- matrix(runif(1000000,0,1),ncol=1000)
fn()
}
fn2()
Now we have the following situation :
One would think that the file "far-too-big.RData" contains both matrix a and matrix z, but that's not the case. It contains only the matrix a. This is because the enclosing environment of fn is the global environment. The parent frame of fn is the environment of fn2, but the environment object created by fn contains a pointer to the global environment.
On the other hand, if we do the following:
fn <- function() {
load("bigfile.RData")
test <- new.env()
test$b <- a
test2 <- new.env(parent=test)
save(test2, file="far-too-big.RData")
}
test2 is now enclosed in two environments (being test and the environment of fun), and both environments are saved in the file as well. So you get this situation :
Regardless of this, I personally avoid saving environments as environments, because there are more things that can go wrong. In my opinion, saving an environment as a list is in 99.9% of the cases the better choice :
fn2 <- function(){
load("bigfile.RData")
test <- new.env()
test$x <- "something"
test$fn <- ls
testlist <- as.list(test)
save(testlist, file="right-size.RData")
}
fn2()
If you need it to be an environment, you can convert it back when loading.
load("right-size.RData")
test <- as.environment(testlist)

Categories

Resources