How do we set constant variables while building R packages?

We are building an R package for our service (a robo-advisor here in Brazil), and our functions send requests to our external API all the time.
As this is the first package we have built, we have some questions. :(
When we use the package to run scripts, we will need information such as api_path, login, and password.
How do we place this information inside our package?
Here is a real example:
library(jsonlite)  # assuming fromJSON() comes from jsonlite

get_asset_daily <- function(asset_id) {
  api_path <- "https://api.verios.com.br"
  url <- paste0(api_path, "/assets/", asset_id, "/dailies?asc=d")
  data <- fromJSON(url)
  data
}
Sometimes we use a staging version of the API and have to switch paths constantly. How should we refer to the path inside our functions?
Should we set a global environment variable, a package environment variable, a package config file, or just define api_path in our scripts?
How do we do that?
Thanks for your help in advance.
Ana

One approach would be to use R's options() interface. Create a file zzz.R in the R directory (this is the customary name for this file) with the following:
.onLoad <- function(libname, pkgname) {
  options(api_path = "...", username = "name", password = "pwd")
}
This will set these options when the package is loaded into memory.
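Elsewhere in your package, you can read the options back with getOption(). A minimal sketch, reusing get_asset_daily() from the question (and assuming fromJSON() comes from jsonlite):
get_asset_daily <- function(asset_id) {
  # Fall back to the production URL if the option is unset
  api_path <- getOption("api_path", default = "https://api.verios.com.br")
  url <- paste0(api_path, "/assets/", asset_id, "/dailies?asc=d")
  jsonlite::fromJSON(url)
}

# Switching to a staging API is then a one-liner in a script
# (the staging URL here is a made-up placeholder):
options(api_path = "https://staging.example.com")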

Related

Loading libraries for use in a specific environment in R

I have written some functions to facilitate repeated tasks across my R projects. I am trying to use an environment to load them easily while also preventing them from appearing in ls() or being deleted by rm(list=ls()).
As a dummy example I have an environment loader function in a file that I can just source from my current project and an additional file for each specialized environment I want to have.
currentProject.R
environments/env_loader.R
environments/colors_env.R
env_loader.R
.environmentLoader <- function(env_file, env_name = 'my_env') {
  sys.source(env_file, envir = attach(NULL, name = env_name))
}

path <- dirname(sys.frame(1)$ofile) # this script's path

# Automatically load
.environmentLoader(paste(path, 'colors_env.R', sep = '/'), env_name = 'my_colors')
colors_env.R
library(RColorBrewer) # this doesn't work

# Return a list of colors
dummyColors <- function(n) {
  require(RColorBrewer) # this doesn't work
  return(brewer.pal(n, 'Blues'))
}
currentProject.R
source('./environments/env_loader.R')
# Get a list of 5 colors
dummyColors(5)
This works great except when my functions require me to load a library. In my example, I need to load the RColorBrewer library to use the brewer.pal function in colors_env.R, but as it stands I just get the error: Error in brewer.pal(n, "Blues") : could not find function "brewer.pal".
I tried just using library(RColorBrewer), using require inside my dummyColors function, and adding things like evalq(library("RColorBrewer"), envir=parent.env(environment())) to the colors_env.R file, but nothing works. Any suggestions?
If you are using similar functions across projects, I would recommend creating an R package. It's essentially what you're doing in many ways, but you don't have to reinvent a lot of the loading mechanisms, etc. Hadley Wickham's book R Packages is very good on this topic. It doesn't need to be a fully built-out, CRAN-ready sort of thing; you can just create a personal package with miscellaneous functions you frequently use.
That being said, the solution for your specific question would be to explicitly use the namespace to call the function.
dummyColors <- function(n) {
  require(RColorBrewer) # no longer strictly needed once the namespace is used explicitly
  return(RColorBrewer::brewer.pal(n, 'Blues'))
}
Create a package and then run it. Use pkgKitten::kitten() to build the boilerplate, copy your file into it, optionally build it if you want a .tar.gz file (or omit that step if you don't need one), and finally install it. Then test it out. We have assumed that colors_env.R, shown in the question, is in the current directory.
(Note that require should always be within an if so that if it does not load then the error is caught. If not within an if use library which will guarantee an error message in that case.)
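A minimal sketch of that require-within-if pattern (the stop() message is just an example):
dummyColors <- function(n) {
  if (require(RColorBrewer)) {
    # package loaded and attached successfully
    brewer.pal(n, 'Blues')
  } else {
    stop("RColorBrewer is not installed")
  }
}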
# create package
library(devtools)
library(pkgKitten)
kitten("colors")
file.copy("colors_env.R", "./colors/R")
build("colors") # optional = will create colors_1.0.tar.gz
install("colors")
# test
library(colors)
dummyColors(5)
## Loading required package: RColorBrewer
## [1] "#EFF3FF" "#BDD7E7" "#6BAED6" "#3182BD" "#08519C"

Altering internal data in R package

I'm working on a package (let's call it myPackage) that needs to refer to external data, which is too big to incorporate into the package itself (it's a lot of netCDF files).
Because of this I have an internal PATH variable (which I initialize in /data-raw/, and which is saved to sysdata.rda; I've turned off lazy loading). When asked to get data, any function can then use this PATH to find the data.
I want the user to be able to specify their path, so I wrote a function:
setpath <- function(path) {
  myPackage:::PATH = path
}
Doing setpath("C:/") doesn't work. I get the error: Error in setpath("C:/") : object 'myPackage' not found.
I've also tried the following alternative function:
setpath <- function(path) {
  PATH = path
}
This way, the variable myPackage:::PATH never changes.
How should I be doing this? Is internal data read-only?
You can use options(). Create an option with
options(myPackageRepositoryPath = "some/path")
and retrieve it with
path <- getOption("myPackageRepositoryPath")
The same way as you set an option, you also can overwrite an option:
setpath <- function(path) {
  options(myPackageRepositoryPath = path)
}

Package that downloads data from the internet during installation

Is anyone aware of a package that downloads a dataset from the internet during the installation process and then prepares and saves it so that it is available when loading the package using library(packageName)? Are there any drawbacks in this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?
EDIT: Some background. The data is three tab-separated files in a ZIP archive, owned by federal statistics and generally freely accessible. I have R code which downloads, extracts and prepares the data, in the end three data frames are created which could be saved in .RData format.
I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it.
I put this mockup together before you posted your edit. I presume it would work, but it is not tested. I've commented it so you can see what you would need to change. The idea is to check whether an expected object is available in the current working environment. If it is not, check whether the file containing the data is in the current working directory. If that is not found either, prompt the user to download the file, then proceed from there.
myFunction <- function(this, that, dataset = NULL) {
  # We're giving the user a chance to specify the dataset.
  # Maybe they have already downloaded it and saved it.
  # (dataset defaults to NULL so the function can be called without it.)
  if (is.null(dataset)) {
    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    # contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (isTRUE(list.files(
        pattern = "^DATAFILE.RData$") == "DATAFILE.RData")) {
        load("DATAFILE.RData")
        # If neither of those are successful, prompt the user
        # to download the dataset.
      } else {
        ans <- readline(
"DATAFILE.RData dataset not found in working directory.
OBJECTNAME object not found in workspace. \n
Download and load the dataset now? (y/n) ")
        if (ans != "y")
          return(invisible())
        # I usually use RCurl in case the URL is https
        require(RCurl)
        baseURL <- "http://some/base/url/"
        # Here, we actually download the data
        temp <- getBinaryURL(paste0(baseURL, "DATAFILE.RData"))
        # Here we load the data
        load(rawConnection(temp), envir = .GlobalEnv)
        message("OBJECTNAME data downloaded from \n",
                paste0(baseURL, "DATAFILE.RData \n"),
                "and added to your workspace\n\n")
        rm(temp, baseURL)
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}
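A hypothetical call, keeping the OBJECTNAME and DATAFILE placeholders from the mockup; since dataset defaults to NULL, omitting it triggers the lookup-or-download logic:
# Looks for OBJECTNAME in the workspace, then for DATAFILE.RData on disk,
# and finally offers to download the file before doing the real work.
result <- myFunction(this = 1, that = 2)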
Two packages, hosted at Github
Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:
Check to see if the data package is installed
Check to see if the version of the data package you have installed matches the version at Github, which we are going to assume is the most up to date version.
If either of the checks fails, it asks the user whether they want to update their installation of the package. In this case, for demonstration, I've linked to one of my packages in progress at Github. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.
CheckVersionFirst <- function() {
  # Check to see if the package is installed at all
  if (!"StataDCTutils" %in% installed.packages()[, 1]) {
    Checks <- "Failed"
  } else {
    # Compare the installed version number with the one at Github
    require(RCurl)
    temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    CurrentVersion <- gsub("^\\s|\\s$", "",
                           gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    if (packageVersion("StataDCTutils") == CurrentVersion) {
      Checks <- "Passed"
    }
    if (packageVersion("StataDCTutils") < CurrentVersion) {
      Checks <- "Failed"
    }
  }
  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans <- readline(
        "'StataDCTutils' is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      require(devtools)
      install_github("StataDCTutils", "mrdwab")
    })
  # Some cool things you want to do after you are sure the data is there
}
Try it out with CheckVersionFirst().
Note: This would succeed only if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to Github!
So, to clarify/recap/expand, the basic idea would be to:
Periodically push the updated version of your data package to Github, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
Integrate this CheckVersionFirst() function as an .onLoad event in your code package; a minimal sketch follows this list. (Obviously modify the function to match your account and package name.)
Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data....
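A minimal sketch combining steps 2 and 3 (this would live in your code package, e.g. in zzz.R; YOURDATAPACKAGE is a placeholder):
.onLoad <- function(libname, pkgname) {
  # Make sure the data package is present and current before anything else
  CheckVersionFirst()
  library(YOURDATAPACKAGE)  # placeholder: load the data package
}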
This method may not be efficient, but it is a good workaround. If you are making a package that needs regularly updated data, first make a package that contains that data. It does not need any functions, but I like the concept of a setter (which you might not need in this case) and a getter.
Then, when you make your main package, declare the 'data' package as a dependency. This way, whenever someone installs your package, they will always have the latest data.
On your part, you'll just have to swap out the data in your 'data' package and upload it to the repo you want.
If you don't know how to build a package, see ?package.skeleton and the R CMD build and R CMD check commands.
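A minimal sketch of creating the data-package boilerplate with package.skeleton() (the names are hypothetical):
# Wrap an existing data frame in a skeleton package
myData <- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))
package.skeleton(name = "myDataPackage", list = "myData")
# Then, from the shell:
#   R CMD build myDataPackage
#   R CMD check myDataPackage_1.0.tar.gz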

Employ environments to handle package-data in package-functions

I recently wrote an R extension. The functions use data contained in the package and must therefore load it. Subroutines also need to access the data.
This is the approach taken:
main <- function(...) {
  data(data)
  sub <- function(..., data = data) {...}
  ...
}
I'm unhappy with the fact that the data resides in .GlobalEnv, so it still hangs around after the function has terminated (which also undermines the idea of passing it down via an argument).
Please put me on the right track! How do you employ environments, when you have to handle package-data in package-functions?
It looks like you are looking for the LazyData directive in your DESCRIPTION file:
LazyData: yes
Otherwise, data() has an envir argument you can use to control the environment into which your data is loaded. For example, if you wanted the data to be loaded inside main, you could use:
main <- function(...) {
  data(data, envir = environment())
  sub <- function(..., data = data) {...}
  ...
}
If the data is needed for your functions, not for the user of the package, it should be saved in a file called sysdata.rda located in the R directory.
From R extensions:
Two exceptions are allowed: if the R subdirectory contains a file sysdata.rda (a saved image of R objects: please use suitable compression as suggested by tools::resaveRdaFiles) this will be lazy-loaded into the namespace/package environment – this is intended for system datasets that are not intended to be user-accessible via data.
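A minimal sketch of producing that file from a data-raw script, run from the package root (the PATH constant is just an example, echoing the earlier question):
# data-raw/make-sysdata.R
PATH <- "/path/to/netcdf/files"   # example internal constant
save(PATH, file = "R/sysdata.rda", compress = "xz")
# Package functions can then refer to PATH directly, without data().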

In R can I save loaded packages with the workspace?

R normally only saves objects in .GlobalEnv:
$ R
> library(rjson)
> fromJSON
function (...) ...
> q(save='yes')
$ R
> fromJSON
Error: object 'fromJSON' not found
Is there a way to have this information saved as well?
You are now able to save R session information to a file and load it in another session.
First install the "session" package:
install.packages('session')
Load your libraries and your data, then save the session state to a file:
library(session)
library(ggplot2) # plotting
test <- 100
save.session(file='test.Rda')
Then you can load the session state in another session:
library(session)
restore.session(file = 'test.Rda')

# ggplot2 (and the associated data) should have loaded with the session data
head(diamonds)
ggplot(data = diamonds, aes(x = carat)) +
  geom_histogram()
print(test)
Reference: https://www.rdocumentation.org/packages/session/versions/1.0.3/topics/save.session
To the best of my knowledge, no. The workspace is for objects like data and functions. Starting R with particular packages loaded is what your .Rprofile file is for, and you can have a different one in each directory.
You could, I suppose, save a function in the workspace that loads the packages you want, and then run that function when you first start R.
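For example, a minimal sketch of such a helper (the package names are just examples):
# Keep this function in the saved workspace and call it at the
# start of each session to attach your usual packages
loadMyPackages <- function() {
  library(rjson)
  library(ggplot2)
}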
joran is right, but I want to mention a technique that, while cumbersome, might be helpful.
You can use a checkpointing program such as DMTCP to save the entire R process and restart it later.
I'd recommend not saving anything between R sessions and instead recreating it all using code. This is much more likely to lead to reproducible results.
