I recently wrote a R extension. The functions use data contained in the package and must therefore load them. Subroutines also need to access the data.
This is the approach taken:
main<- function(...){
data(data)
sub <- function(...,data=data){...}
...
}
I'm unhappy with the fact that the data resides in .GlobalEnv so it still hangs around when the function had terminated (also undermining the downpassing via argument concept).
Please put me on the right track! How do you employ environments, when you have to handle package-data in package-functions?
It looks that you are looking for the LazyData directive in your namepace:
LazyData: yes
Othewise, data has the envir argument you can use to control in which environment you want to load your data, so for example if you wanted the data to be loaded inside main, you could use :
main<- function(...){
data(data, envir = environment() )
sub <- function(...,data=data){...}
...
}
If the data is needed for your functions, not for the user of the package, it should be saved in a file called sysdata.rda located in the R directory.
From R extensions:
Two exceptions are allowed: if the R subdirectory contains a file
sysdata.rda (a saved image of R objects: please use suitable
compression as suggested by tools::resaveRdaFiles) this will be
lazy-loaded into the namespace/package environment – this is intended
for system datasets that are not intended to be user-accessible via
data.
Related
I have a bunch of functions and I'm trying to keep my workspace clean by defining them in an environment and attaching the environment. Some of the functions are S3 generics, and they don't seem to play well with this approach.
A minimum example of what I'm experiencing requires 4 files:
testfun.R
ttt.xxx <- function(object) print("x")
ttt <- function(object) UseMethod("ttt")
ttt2 <- function() {
yyy <- structure(1, class="xxx")
ttt(yyy)
}
In testfun.R I define an S3 generic ttt and a method ttt.xxx, I also define a function ttt2 calling the generic.
testenv.R
test_env <- new.env(parent=globalenv())
source("testfun.R", local=test_env)
attach(test_env)
In testenv.R I source testfun.R to an environment, which I attach.
test1.R
source("testfun.R")
ttt2()
xxx <- structure(1, class="xxx")
ttt(xxx)
test1.R sources testfun.R to the global environment. Both ttt2 and a direct function call work.
test2.R
source("testenv.R")
ttt2()
xxx <- structure(1, class="xxx")
ttt(xxx)
test2.R uses the "attach" approach. ttt2 still works (and prints "x" to the console), but the direct function call fails:
Error in UseMethod("ttt") :
no applicable method for 'ttt' applied to an object of class "xxx"
however, calling ttt and ttt.xxx without arguments show that they are known, ls(pos=2) shows they are on the search path, and sloop::s3_dispatch(ttt(xxx)) tells me it should work.
This questions is related to Confusion about UseMethod search mechanism and the link therein https://blog.thatbuthow.com/how-r-searches-and-finds-stuff/, but I cannot get my head around what is going on: why is it not working and how can I get this to work.
I've tried both R Studio and R in the shell.
UPDATE:
Based on the answers below I changed my testenv.R to:
test_env <- new.env(parent=globalenv())
source("testfun.R", local=test_env)
attach(test_env)
if (is.null(.__S3MethodsTable__.))
.__S3MethodsTable__. <- new.env(parent = baseenv())
for (func in grep(".", ls(envir = test_env), fixed = TRUE, value = TRUE))
.__S3MethodsTable__.[[func]] <- test_env[[func]]
rm(test_env, func)
... and this works (I am only using "." as an S3 dispatching separator).
It’s a little-known fact that you must use .S3method() to define methods for S3 generics inside custom environments (outside of packages).1 The reason almost nobody knows this is because it is not necessary in the global environment; but it is necessary everywhere else since R version 3.6.
There’s virtually no documentation of this change, just a technical blog post by Kurt Hornik about some of the background. Note that the blog post says the change was made in R 3.5.0; however, the actual effect you are observing — that S3 methods are no longer searched in attached environments — only started happening with R 3.6.0; before that, it was somehow not active yet.
… except just using .S3method will not fix your code, since your calling environment is the global environment. I do not understand the precise reason why this doesn’t work, and I suspect it’s due to a subtle bug in R’s S3 method lookup. In fact, using getS3method('ttt', 'xxx') does work, even though that should have the same behaviour as actual S3 method lookup.
I have found that the only way to make this work is to add the following to testenv.R:
if (is.null(.__S3MethodsTable__.)) {
.__S3MethodsTable__. <- new.env(parent = baseenv())
}
.__S3MethodsTable__.$ttt.xxx <- ttt.xxx
… in other words: supply .GlobalEnv manually with an S3 methods lookup table. Unfortunately this relies on an undocumented S3 implementation detail that might theoretically change in the future.
Alternatively, it “just works” if you use ‘box’ modules instead of source. That is, you can replace the entirety of your testenv.R by the following:
box::use(./testfun[...])
This code treats testfun.R as a local module and loads it, attaching all exported names (via the attach declaration [...]).
1 (and inside packages you need to use the equivalent S3method namespace declaration, though if you’re using ‘roxygen2’ then that’s taken care of for you)
First of all, my advice would be: don't try to reinvent R packages. They solve all the problems you say you are trying to solve, and others as well.
Secondly, I'll try to explain what went wrong in test2.R. It calls ttt on an xxx object, and ttt.xxx is on the search list, but is not found.
The problem is how the search for ttt.xxx happens. The search doesn't look for ttt.xxx in the search list, it looks for it in the environment from which ttt was called, then in an object called .__S3MethodsTable__.. I think there are two reasons for this:
First, it's a lot faster. It only needs to look in one or two places, and the table can be updated whenever a package is attached or detached, a relatively rare operation.
Second, it's more reliable. Each package has its own methods table, because two packages can use the same name for generics that have nothing to do with each other, or can use the same class names that are unrelated. So package code needs to be able to count on finding its own definitions first.
Since your call to ttt() happens at the top level, that's where R looks first for ttt.xxx(), but it's not there. Then it looks in the global .__S3MethodsTable__. (which is actually in the base environment), and it's not there either. So it fails.
There is a workaround that will make your code work. If you run
.__S3MethodsTable__. <- list2env(list(ttt.xxx = ttt.xxx))
as the last line of testenv.R, then you'll create a methods table in the global environment. (Normally there isn't one there, because that's user space, and R doesn't like putting things there unless the user asks for it.)
R will find that methods table, and will find the ttt.xxx method that it defines. I wouldn't be surprised if this breaks some other aspect of S3 dispatch, so I don't recommend doing it, but give it a try if you insist on reinventing the package system.
We are building a package in R for our service (a robo-advisor here in Brazil) and we send requests all the time to our external API inside our functions.
As it is the first time we build a package we have some questions. :(
When we will use our package to run some scripts we will need some information as api_path, login, password.
How do we place this information inside our package?
Here is a real example:
get_asset_daily <- function(asset_id) {
api_path <- "https://api.verios.com.br"
url <- paste0(api_path, "/assets/", asset_id, "/dailies?asc=d")
data <- fromJSON(url)
data
}
Sometimes we use a staging version of the API and we have to constantly switch paths. How we should call it inside our function?
Should we set a global environment variable, a package environment variable, just define api_path in our scripts or a package config file?
How do we do that?
Thanks for your help in advance.
Ana
One approach would be to use R's options interface. Create a file zzz.r in the R directory (this is the customary name for this file) with the following:
.onLoad <- function(libname, pkgname) {
options(api_path='...', username='name', password='pwd')
}
This will set these options when the package is loaded into memory.
Is there a definitive way to save options or information pertaining to a certain package between sessions?
For example say somebody made a game and released it as an R package. If they wanted to save high scores and not have them reset each time R started a new session what would be the best way to do this? Currently I can only think of storing a file in the users home directory but I'm not sure if I like that approach.
This may be an approach. I created a dummy package with a dummy function (any function I create is bound to be a dummy function) and a data set I called scores that I set as follows:
scores <- NA
Then I created the package with the scores data set.
Then I used the following to change the data set from within R.
loc <- paste0(find.package("new"), "/Data")
unlink(paste0(loc, "/scores.rda"), recursive = TRUE, force = FALSE)
scores <- 10
save(scores, file=paste0(loc, "/scores.rda"))
Then when I unloaded the library and re loaded agin the data set now says:
> scores
[1] 10
Could this be modified to do what you want? You'd have to have it save in between somehow but am not sure on how to do this without messing with .Last function.
EDIT:
It appears this option is not viable in that when you compile as a package and use lazy load it saves the data sets as:
RData.rbd, RData.rbx, not as .rda files. That means the approach I use above is kinda worthless in that we want it to automatically be recognized.
EDIT2
This approach works and I tried it on a package I made. You can't do lazy load of the data and you have to either explicitly use data(scores) or use data(scores) inside of the function you're calling. I also assigned scores to .scores int he global.env the first time it was created and used exists inside the function to see if it exists. If `.scores. existed I assigned that to scores within the function. Once you unload the library and laod again you never have to worry about that again.
Maybe an alternative is to save this as a function somehow that can be altered using Josh's advice here: Permanently replacing a function
I guess there is no way to store settings without saving them to disk or a database, some way or another. It can be done silently though by putting the code below in your ~/.Rprofile. However, if you have packages that save settings in other ways than using options you need to add them manually.
I know this is exactly what you said you did not want, but it might spark some debate at least.
.Last <- function(){
my.options <- options()
save(my.options, file="~/.Roptions.Rdata")
}
.First <- function(){
tryCatch({
load("~/.Roptions.Rdata")
do.call(options, my.options)
rm(my.options)
}, error=function(...){})
}
To my suprise try(..., silent=TRUE) gives a warning on startup if ~/.Roptions.Rdata does not exist, which is why I used tryCatch instead.
The modern answer to this problem is well explained at https://blog.r-hub.io/2020/03/12/user-preferences/
I think I will be trying the hoardr package! Here is an example that worked for me :)
x <- hoardr::hoard()
x$cache_path_set("yourpackage", type = 'user_cache_dir')
x$mkdir()
scores<-data.frame(
user=c("one","two","three"),
score=c("500,200,1100")
)
save(scores,file = file.path(x$cache_path_get(), "scores.rdata"))
x$list()
x$details()
#new session
x <- hoardr::hoard()
x$cache_path_set("yourpackage", type = 'user_cache_dir')
x$list()
x$details()
load(file = file.path(x$cache_path_get(), "scores.rdata"))
PS - you can see a working example in the rnoaa package found on at github "opensci/rnoaa". Check their R/onload.r file! I can expand if needed.
I have a few convenience functions in my .Rprofile, such as this handy function for returning the size of objects in memory. Sometimes I like to clean out my workspace without restarting and I do this with rm(list=ls()) which deletes all my user created objects AND my custom functions. I'd really like to not blow up my custom functions.
One way around this seems to be creating a package with my custom functions so that my functions end up in their own namespace. That's not particularly hard, but is there an easier way to ensure custom functions don't get killed by rm()?
Combine attach and sys.source to source into an environment and attach that environment. Here I have two functions in file my_fun.R:
foo <- function(x) {
mean(x)
}
bar <- function(x) {
sd(x)
}
Before I load these functions, they are obviously not found:
> foo(1:10)
Error: could not find function "foo"
> bar(1:10)
Error: could not find function "bar"
Create an environment and source the file into it:
> myEnv <- new.env()
> sys.source("my_fun.R", envir = myEnv)
Still not visible as we haven't attached anything
> foo(1:10)
Error: could not find function "foo"
> bar(1:10)
Error: could not find function "bar"
and when we do so, they are visible, and because we have attached a copy of the environment to the search path the functions survive being rm()-ed:
> attach(myEnv)
> foo(1:10)
[1] 5.5
> bar(1:10)
[1] 3.027650
> rm(list = ls())
> foo(1:10)
[1] 5.5
I still think you would be better off with your own personal package, but the above might suffice in the meantime. Just remember the copy on the search path is just that, a copy. If the functions are fairly stable and you're not editing them then the above might be useful but it is probably more hassle than it is worth if you are developing the functions and modifying them.
A second option is to just name them all .foo rather than foo as ls() will not return objects named like that unless argument all = TRUE is set:
> .foo <- function(x) mean(x)
> ls()
character(0)
> ls(all = TRUE)
[1] ".foo" ".Random.seed"
Here are two ways:
1) Have each of your function names start with a dot., e.g. .f instead of f. ls will not list such functions unless you use ls(all.names = TRUE) therefore they won't be passed to your rm command.
or,
2) Put this in your .Rprofile
attach(list(
f = function(x) x,
g = function(x) x*x
), name = "MyFunctions")
The functions will appear as a component named "MyFunctions" on your search list rather than in your workspace and they will be accessible almost the same as if they were in your workspace. search() will display your search list and ls("MyFunctions") will list the names of the functions you attached. Since they are not in your workspace the rm command you normally use won't remove them. If you do wish to remove them use detach("MyFunctions") .
Gavin's answer is wonderful, and I just upvoted it. Merely for completeness, let me toss in another one:
R> q("no")
followed by
M-x R
to create a new session---which re-reads the .Rprofile. Easy, fast, and cheap.
Other than that, private packages are the way in my book.
Another alternative: keep the functions in a separate file which is sourced within .RProfile. You can re-source the contents directly from within R at your leisure.
I find that often my R environment gets cluttered with various objects when I'm creating or debugging a function. I wanted a way to efficiently keep the environment free of these objects while retaining personal functions.
The simple function below was my solution. It does 2 things:
1) deletes all non-function objects that do not begin with a capital letter and then
2) saves the environment as an RData file
(requires the R.oo package)
cleanup=function(filename="C:/mymainR.RData"){
library(R.oo)
# create a dataframe listing all personal objects
everything=ll(envir=1)
#get the objects that are not functions
nonfunction=as.vector(everything[everything$data.class!="function",1])
#nonfunction objects that do not begin with a capital letter should be deleted
trash=nonfunction[grep('[[:lower:]]{1}',nonfunction)]
remove(list=trash,pos=1)
#save the R environment
save.image(filename)
print(paste("New, CLEAN R environment saved in",filename))
}
In order to use this function 3 rules must always be kept:
1) Keep all data external to R.
2) Use names that begin with a capital letter for non-function objects that I want to keep permanently available.
3) Obsolete functions must be removed manually with rm.
Obviously this isn't a general solution for everyone...and potentially disastrous if you don't live by rules #1 and #2. But it does have numerous advantages: a) fear of my data getting nuked by cleanup() keeps me disciplined about using R exclusively as a processor and not a database, b) my main R environment is so small I can backup as an email attachment, c) new functions are automatically saved (I don't have to manually manage a list of personal functions) and d) all modifications to preexisting functions are retained. Of course the best advantage is the most obvious one...I don't have to spend time doing ls() and reviewing objects to decide whether they should be rm'd.
Even if you don't care for the specifics of my system, the "ll" function in R.oo is very useful for this kind of thing. It can be used to implement just about any set of clean up rules that fit your personal programming style.
Patrick Mohr
A nth, quick and dirty option, would be to use lsf.str() when using rm(), to get all the functions in the current workspace. ...and let you name the functions as you wish.
pattern <- paste0('*',lsf.str(), '$', collapse = "|")
rm(list = ls()[-grep(pattern, ls())])
I agree, it may not be the best practice, but it gets the job done! (and I have to selectively clean after myself anyway...)
Similar to Gavin's answer, the following loads a file of functions but without leaving an extra environment object around:
if('my_namespace' %in% search()) detach('my_namespace'); source('my_functions.R', attach(NULL, name='my_namespace'))
This removes the old version of the namespace if it was attached (useful for development), then attaches an empty new environment called my_namespace and sources my_functions.R into it. If you don't remove the old version you will build up multiple attached environments of the same name.
Should you wish to see which functions have been loaded, look at the output for
ls('my_namespace')
To unload, use
detach('my_namespace')
These attached functions, like a package, will not be deleted by rm(list=ls()).
My question is about avoiding namespace pollution when writing modules in R.
Right now, in my R project, I have functions1.R with doFoo() and doBar(), functions2.R with other functions, and main.R with the main program in it, which first does source('functions1.R'); source('functions2.R'), and then calls the other functions.
I've been starting the program from the R GUI in Mac OS X, with source('main.R'). This is fine the first time, but after that, the variables that were defined the first time through the program are defined for the second time functions*.R are sourced, and so the functions get a whole bunch of extra variables defined.
I don't want that! I want an "undefined variable" error when my function uses a variable it shouldn't! Twice this has given me very late nights of debugging!
So how do other people deal with this sort of problem? Is there something like source(), but that makes an independent namespace that doesn't fall through to the main one? Making a package seems like one solution, but it seems like a big pain in the butt compared to e.g. Python, where a source file is automatically a separate namespace.
Any tips? Thank you!
I would explore two possible solutions to this.
a) Think more in a more functional manner. Don't create any variables outside of a function. so, for example, main.R should contain one function main(), which sources in the other files, and does the work. when main returns, none of the clutter will remain.
b) Clean things up manually:
#main.R
prior_variables <- ls()
source('functions1.R')
source('functions2.R')
#stuff happens
rm(list = setdiff(ls(),prior_variables))`
The main function you want to use is sys.source(), which will load your functions/variables in a namespace ("environment" in R) other than the global one. One other thing you can do in R that is fantastic is to attach namespaces to your search() path so that you need not reference the namespace directly. That is, if "namespace1" is on your search path, a function within it, say "fun1", need not be called as namespace1.fun1() as in Python, but as fun1(). [Method resolution order:] If there are many functions with the same name, the one in the environment that appears first in the search() list will be called. To call a function in a particular namespace explicitly, one of many possible syntaxes - albeit a bit ugly - is get("fun1","namespace1")(...) where ... are the arguments to fun1(). This should also work with variables, using the syntax get("var1","namespace1"). I do this all the time (I usually load just functions, but the distinction between functions and variables in R is small) so I've written a few convenience functions that loads from my ~/.Rprofile.
name.to.env <- function(env.name)
## returns named environment on search() path
pos.to.env(grep(env.name,search()))
attach.env <- function(env.name)
## creates and attaches environment to search path if it doesn't already exist
if( all(regexpr(env.name,search())<0) ) attach(NULL,name=env.name,pos=2)
populate.env <- function(env.name,path,...) {
## populates environment with functions in file or directory
## creates and attaches named environment to search() path
## if it doesn't already exist
attach.env(env.name)
if( file.info(path[1])$isdir )
lapply(list.files(path,full.names=TRUE,...),
sys.source,name.to.env(env.name)) else
lapply(path,sys.source,name.to.env(env.name))
invisible()
}
Example usage:
populate.env("fun1","pathtofile/functions1.R")
populate.env("fun2","pathtofile/functions2.R")
and so on, which will create two separate namespaces: "fun1" and "fun2", which are attached to the search() path ("fun2" will be higher on the search() list in this case). This is akin to doing something like
attach(NULL,name="fun1")
sys.source("pathtofile/functions1.R",pos.to.env(2))
manually for each file ("2" is the default position on the search() path). The way that populate.env() is written, if a directory, say "functions/", contains many R files without conflicting function names, you can call it as
populate.env("myfunctions","functions/")
to load all functions (and variables) into a single namespace. With name.to.env(), you can also do something like
with(name.to.env("fun1"), doStuff(var1))
or
evalq(doStuff(var1), name.to.env("fun1"))
Of course, if your project grows big and you have lots and lots of functions (and variables), writing a package is the way to go.
If you switch to using packages, you get namespaces as a side-benefit (provided you use a NAMESPACE file). There are other advantages for using packages.
If you were really trying to avoid packages (which you shouldn't), then you could try assigning your variables in specific environments.
Well avoiding namespace pollution, as you put it, is just a matter of diligently partitioning the namespace and keeping your global namespace uncluttered.
Here are the essential functions for those two kinds of tasks:
Understanding/Navigating the Namespace Structure
At start-up, R creates a new environment to store all objects created during that session--this is the "global environment".
# to get the name of that environment:
globalenv()
But this isn't the root environment. The root is an environment called "the empty environment"--all environments chain back to it:
emptyenv()
returns: <environment: R_EmptyEnv>
# to view all of the chained parent environments (which includes '.GlobalEnv'):
search()
Creating New Environments:
workspace1 = new.env()
is.environment(workspace1)
returns: [1] TRUE
class(workspace1)
returns: [1] "environment"
# add an object to this new environment:
with(workspace1, attach(what="/Users/doug/Documents/test_obj.RData",
name=deparse(substitute(what)), warn.conflicts=T, pos=2))
# verify that it's there:
exists("test_obj", where=workspace1)
returns: [1] TRUE
# to locate the new environment (if it's not visible from your current environment)
parent.env(workspace1)
returns: <environment: R_GlobalEnv>
objects(".GlobalEnv")
returns: [1] "test_obj"
Coming from python, et al., this system (at first) seemed to me like a room full of carnival mirrors. The R Gurus on the other hand seem to be quite comfortable with it. I'm sure there are a number of reasons why, but my intuition is that they don't let environments persist. I notice that R beginners use 'attach', as in attach('this_dataframe'); I've noticed that experienced R users don't do that; they use 'with' instead eg,
with(this_dataframe, tapply(etc....))
(I suppose they would achieve the same thing if they used 'attach' then 'detach' but 'with' is faster and you don't have to remember the second step.) In other words, namespace collisions are avoided in part by limiting the objects visible from the global namespace.