How to manage internal utility functions shared by two R packages

I am developing an R package but I would like to break it down into two packages,
say A and B, where B depends on A.
In the course of development I have created a number of internal utility
functions, say .util1(), .util2(), etc. They are useful to keep
my code tidy and avoid repetitions, but I don't want to export them and make them
available to other users.
Rather than having a copy of these functions in both A and B, my idea was to put all of them in package A and then access them from B as A:::.util1(), etc. On the other hand, that doesn't look very neat, and I will have to document all these "hidden" dependencies somewhere (given that I will not explicitly export them from A). Are there other alternatives? Thanks!
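For concreteness, the layout being described is something like this (a sketch; .util1 and the package names are the question's placeholders):
# In package A, R/utils.R -- defined but deliberately not exported:
.util1 <- function(x) x + 1
# In package B, which depends on A -- reach into A's namespace:
.wrapper <- function(x) A:::.util1(x) * 2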

How about this, using the zoo package and its internal variable .packageName for illustration purposes? You can replace them with the names of your own package and internal variables/functions when testing.
library(zoo)                   # load the package
zoo:::.packageName             # access an internal variable via :::
.packageName                   # a test - fails without the namespace qualifier
pkg.env <- getNamespace("zoo") # store the namespace environment
attach(pkg.env)                # attach it to the search path
.packageName                   # now the direct call succeeds!
detach(pkg.env)                # detach it afterwards
Edit:
## To copy an internal object into the current environment (without attach)
assign(".packageName", get(".packageName", envir=pkg.env))
## Or use a loop if you have several internal objects to copy
for (obj_name in a_list_of_names) {
  assign(obj_name, get(obj_name, envir=pkg.env))
}
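If this pattern comes up often, the loop could be wrapped in a small helper (copy_internals is a hypothetical convenience function, not part of the answer above):
copy_internals <- function(pkg, obj_names, to = parent.frame()) {
  ns <- getNamespace(pkg)                        # the package's namespace
  for (nm in obj_names)
    assign(nm, get(nm, envir = ns), envir = to)  # copy each object over
}
copy_internals("zoo", ".packageName")            # same effect as the loop above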

Related

Loading libraries for use in a specific environment in R

I have written some functions to facilitate repeated tasks across my R projects. I am trying to use an environment to load them easily, but also to keep them from showing up in ls() or being deleted with rm(list=ls()).
As a dummy example, I have an environment loader function in a file that I can simply source from my current project, plus an additional file for each specialized environment I want to have.
currentProject.R
environments/env_loader.R
environments/colors_env.R
env_loader.R
.environmentLoader <- function(env_file, env_name='my_env') {
  sys.source(env_file, envir=attach(NULL, name=env_name))
}
path <- dirname(sys.frame(1)$ofile) # this script's path
#
# Automatically load
.environmentLoader(paste(path, 'colors_env.R', sep='/'), env_name='my_colors')
colors_env.R
library(RColorBrewer) # this doesn't work
# Return a list of colors
dummyColors <- function(n) {
  require(RColorBrewer) # This doesn't work
  return(brewer.pal(n, 'Blues'))
}
currentProject.R
source('./environments/env_loader.R')
# Get a list of 5 colors
dummyColors(5)
This works great except when my functions require me to load a library. In my example, I need to load the RColorBrewer library to use the brewer.pal function in colors_env.R, but the way it is now I just get the error Error in brewer.pal(n, "Blues") : could not find function "brewer.pal".
I tried just using library(RColorBrewer), using require inside my dummyColors function, and adding stuff like evalq(library("RColorBrewer"), envir=parent.env(environment())) to the colors_env.R file, but none of it works. Any suggestions?
If you are using similar functions across projects, I would recommend creating an R package. It's essentially what you're doing in many ways, but you don't have to reinvent a lot of the loading mechanisms, etc. Hadley Wickham's book R Packages is very good on this topic. It doesn't need to be a fully built-out, CRAN-ready sort of thing. You can just create a personal package with the misc. functions you frequently use.
That being said, the solution for your specific question would be to explicitly use the namespace to call the function.
dummyColors <- function(n) {
  require(RColorBrewer) # no longer strictly necessary once the call is qualified
  return(RColorBrewer::brewer.pal(n, 'Blues'))
}
Create a package and then use it. Use pkgKitten::kitten to build the boilerplate, copy your file into it, optionally build it if you want a .tar.gz file (or omit that step if you don't need one), and finally install it. Then test it out. We have assumed colors_env.R, shown in the question, is in the current directory.
(Note that require should always be used within an if, so that a failure to load is caught; if it is not within an if, use library, which is guaranteed to raise an error in that case.)
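For example, the guarded form of that call might look like:
if (!require(RColorBrewer)) {
  stop("the RColorBrewer package is needed but could not be loaded")
}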
# create package
library(devtools)
library(pkgKitten)
kitten("colors")
file.copy("colors_env.R", "./colors/R")
build("colors") # optional = will create colors_1.0.tar.gz
install("colors")
# test
library(colors)
dummyColors(5)
## Loading required package: RColorBrewer
## [1] "#EFF3FF" "#BDD7E7" "#6BAED6" "#3182BD" "#08519C"

R: How to check whether the libraries that I am loading are actually used in my code?

I have used several R packages for my study, and all the libraries are loaded together at the beginning of my code. And here is the problem: in the course of my work I tried out various functions from these packages, but the final code does not use all of the functions I tried. Therefore, I am loading libraries that I do not use.
Is there any way to check whether the loaded libraries are really necessary for my code?
Start by restarting R with a fresh environment, no libraries loaded. For this demonstration, I'm going to define two functions:
zoo1 <- function() na.locf(1:10)
zoo2 <- function() zoo::na.locf(1:10)
With no libraries loaded, let's try something:
codetools::checkUsage(zoo1)
# <anonymous>: no visible global function definition for 'na.locf'
codetools::checkUsage(zoo2)
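# (no output: zoo2 calls na.locf with an explicit namespace, so nothing is flagged)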
library(zoo)
# Attaching package: 'zoo'
# The following objects are masked from 'package:base':
# as.Date, as.Date.numeric
codetools::checkUsage(zoo1)
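# (no output this time: with zoo attached, na.locf is visible to zoo1)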
Okay, so we know we can check a single function to see if it is abusing scope and/or using non-base functions. Let's assume that you've loaded your script full of functions (but not the calls to require or library), so let's do this process for all of them. Let's first unload zoo, so that we'll see a complaint again about our zoo1 function:
detach("package:zoo", unload=TRUE)
Now let's iterate over all functions:
allfuncs <- Filter(function(a) is.function(get(a)), ls())
str(sapply(allfuncs, function(fn) capture.output(codetools::checkUsage(get(fn))), simplify=FALSE))
# List of 2
# $ zoo1: chr "<anonymous>: no visible global function definition for 'na.locf'"
# $ zoo2: chr(0)
Now you know to look in the function named zoo1 for a call to na.locf. It'll be up to you to find which not-yet-loaded package that function resides in, but that should be much more manageable, depending on the number of packages you are loading.
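As an aside (not part of the original workflow), one way to hunt for the providing package is to search the installed documentation:
help.search("na.locf")   # or, interactively: ??na.locf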
Some side-thoughts:
If you have a script file that does not have everything comfortably ensconced in functions, then just wrap all of the global R code into a single function, say bigfunctionfortest <- function() { as the first line and } as the last. Then source the file and run codetools::checkUsage(bigfunctionfortest).
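A minimal sketch of that wrapping:
bigfunctionfortest <- function() {
  # ... paste the original top-level script code here ...
  na.locf(1:10)   # e.g., a line that calls a non-base function
}
codetools::checkUsage(bigfunctionfortest)
# <anonymous>: no visible global function definition for 'na.locf'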
Package developers have to go through a process that uses this, so that the Imports: and Depends: fields of DESCRIPTION and the corresponding import directives in NAMESPACE (another ref: http://r-pkgs.had.co.nz/namespace.html) will be correct. One good trick that prevents "namespace pollution" is loading the namespace but not the package; though that may sound confusing, in practice it amounts to writing zoo::na.locf for every non-base function. This gets old quickly (especially if you are using dplyr and such, where most of your daily functions are non-base), which suggests that oft-used functions should be imported directly rather than always qualified. If you're familiar with Python, then:
# R
library(zoo)
na.locf(c(1,2,NA,3))
is analogous to
# fake-python
from zoo import *
na_locf([1,2,None,3])
(if that package/function exists). Then the non-polluting variant looks like:
# R
zoo::na.locf(c(1,2,NA,3))
# fake-python
import zoo
zoo.na_locf([1,2,None,3])
where the function's package (and/or subdir packaging) must be used explicitly. There is no ambiguity. It is explicit. This is by some/many considered "A Good Thing (tm)".
(Language-philes will likely say that library(zoo) and from zoo import * are not exactly the same ... a better way to describe what is happening is that they bring everything from zoo into the search path of functions, potentially causing masking as we saw in a console message earlier; while the :: functionality only loads the namespace but does not add it to the search path. Lots of things going on in the background.)
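The difference is easy to see at the console (a small demonstration, assuming a fresh session with zoo installed but not attached):
loadNamespace("zoo")            # load the namespace only
"package:zoo" %in% search()     # FALSE: zoo is not on the search path
exists("na.locf")               # FALSE: the bare name is not visible
zoo::na.locf(c(1, 2, NA, 3))    # works anyway, via explicit lookup
library(zoo)                    # now attach the package as well
exists("na.locf")               # TRUE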

How do I load data from another package from within my package

One of the functions in a package that I am developing uses a data set from the acs package (the fips.state object). I can load this data into my working environment via
data(fips.state, package = "acs")
but I do not know the proper way to load this data for my function. I have tried
#' @importFrom acs fips.state
but data sets are not exported. I do not want to copy the data and save it to my package, because this seems like poor development practice.
I have looked in http://r-pkgs.had.co.nz/namespace.html, http://kbroman.org/pkg_primer/pages/docs.html, and https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Data-in-packages, but they do not include any information on sharing data sets from one package to another.
Basically, how do I make a data set, that is required by functions in another package available to the functions in my package?
If you don't have control over the acs package, then acs::fips.state seems to be your best bet, as suggested by @paleolimbot.
If you are going to make frequent calls to fips.state, then I'd suggest making a local copy via fips.state <- acs::fips.state, as there is a small cost in looking up objects from other packages that you might do well to avoid incurring multiple times.
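In package code, that local copy might look like this (my_count is a hypothetical illustration; only acs::fips.state comes from the question):
my_count <- function() {
  fips <- acs::fips.state   # one cross-package lookup...
  nrow(fips)                # ...then reuse the local copy
}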
But if you are able to influence the acs package (even if you are not, I think this is a useful generalization), then mikefc suggests an alternative solution, which is to set the fips.state object up as package data and export it:
usethis::use_data(fips.state, other.data, internal = FALSE)
And then in NAMESPACE:
export(fips.state)
or if using roxygen2:
#' Fips state
#' @name fips.state
#' @export
"fips.state"
Then in your own package, you can simply use #' @importFrom acs fips.state.
You can always use package::object_name (e.g., dplyr::starwars) anywhere in your package code, without using an import statement.
is_starwars_character <- function(character) {
  character %in% dplyr::starwars$name
}
is_starwars_character("Luke Skywalker")
#> [1] TRUE
is_starwars_character("Indiana Jones")
#> [1] FALSE

Employ environments to handle package-data in package-functions

I recently wrote an R extension. The functions use data contained in the package and must therefore load it. Subroutines also need access to the data.
This is the approach taken:
main <- function(...) {
  data(data)
  sub <- function(..., data=data) {...}
  ...
}
I'm unhappy with the fact that the data resides in .GlobalEnv, so it still hangs around after the function has terminated (which also undermines the idea of passing the data down via arguments).
Please put me on the right track! How do you employ environments, when you have to handle package-data in package-functions?
It looks like you are looking for the LazyData directive in your DESCRIPTION file:
LazyData: yes
Otherwise, data has an envir argument you can use to control the environment into which your data is loaded. For example, if you wanted the data to be loaded inside main, you could use:
main <- function(...) {
  data(data, envir = environment())
  sub <- function(..., data=data) {...}
  ...
}
If the data is needed for your functions, not for the user of the package, it should be saved in a file called sysdata.rda located in the R directory.
From R extensions:
Two exceptions are allowed: if the R subdirectory contains a file
sysdata.rda (a saved image of R objects: please use suitable
compression as suggested by tools::resaveRdaFiles) this will be
lazy-loaded into the namespace/package environment – this is intended
for system datasets that are not intended to be user-accessible via
data.
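For completeness, one way to create that file (a sketch assuming the usethis package and a hypothetical data-raw/ source file):
# run once during development, from the package source directory:
data <- read.csv("data-raw/data.csv")      # hypothetical raw source
usethis::use_data(data, internal = TRUE)   # writes R/sysdata.rda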

R: disentangling scopes

My question is about avoiding namespace pollution when writing modules in R.
Right now, in my R project, I have functions1.R with doFoo() and doBar(), functions2.R with other functions, and main.R with the main program in it, which first does source('functions1.R'); source('functions2.R'), and then calls the other functions.
I've been starting the program from the R GUI in Mac OS X, with source('main.R'). This is fine the first time, but after that, the variables defined on the first run are still around the next time functions*.R are sourced, and so the functions end up with a whole bunch of extra variables defined.
I don't want that! I want an "undefined variable" error when my function uses a variable it shouldn't! Twice this has given me very late nights of debugging!
So how do other people deal with this sort of problem? Is there something like source(), but that makes an independent namespace that doesn't fall through to the main one? Making a package seems like one solution, but it seems like a big pain in the butt compared to e.g. Python, where a source file is automatically a separate namespace.
Any tips? Thank you!
I would explore two possible solutions to this.
a) Think in a more functional manner. Don't create any variables outside of a function. So, for example, main.R should contain one function, main(), which sources in the other files and does the work; when main() returns, none of the clutter will remain (see the sketch after option b).
b) Clean things up manually:
#main.R
prior_variables <- ls()
source('functions1.R')
source('functions2.R')
#stuff happens
rm(list = setdiff(ls(), prior_variables))
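A minimal sketch of option a), using source(..., local = TRUE) so that the sourced definitions live in main()'s environment rather than the global one:
main <- function() {
  source('functions1.R', local = TRUE)   # definitions land in main()'s frame
  source('functions2.R', local = TRUE)
  # ... do the actual work here ...
}
main()   # nothing is left behind in the global environment afterwards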
The main function you want to use is sys.source(), which will load your functions/variables into a namespace ("environment" in R) other than the global one. Another thing you can do in R that is fantastic is to attach namespaces to your search() path, so that you need not reference the namespace directly. That is, if "namespace1" is on your search path, a function within it, say "fun1", need not be called as namespace1.fun1() as in Python, but simply as fun1(). If several attached environments contain functions with the same name, the one in the environment that appears first in the search() list is the one that gets called.
To call a function in a particular namespace explicitly, one of many possible syntaxes (albeit a bit ugly) is get("fun1", "namespace1")(...), where ... are the arguments to fun1(). This should also work with variables, using the syntax get("var1", "namespace1").
I do this all the time (I usually load just functions, but the distinction between functions and variables in R is small), so I've written a few convenience functions that load from my ~/.Rprofile.
name.to.env <- function(env.name)
  ## returns the named environment on the search() path
  pos.to.env(grep(env.name, search()))

attach.env <- function(env.name)
  ## creates and attaches an environment to the search path
  ## if it doesn't already exist
  if (all(regexpr(env.name, search()) < 0)) attach(NULL, name = env.name, pos = 2)

populate.env <- function(env.name, path, ...) {
  ## populates the environment with the functions in a file or directory,
  ## creating and attaching the named environment to the search() path
  ## if it doesn't already exist
  attach.env(env.name)
  if (file.info(path[1])$isdir)
    lapply(list.files(path, full.names = TRUE, ...),
           sys.source, name.to.env(env.name))
  else
    lapply(path, sys.source, name.to.env(env.name))
  invisible()
}
Example usage:
populate.env("fun1","pathtofile/functions1.R")
populate.env("fun2","pathtofile/functions2.R")
and so on, which will create two separate namespaces: "fun1" and "fun2", which are attached to the search() path ("fun2" will be higher on the search() list in this case). This is akin to doing something like
attach(NULL,name="fun1")
sys.source("pathtofile/functions1.R",pos.to.env(2))
manually for each file ("2" is the default position on the search() path). The way that populate.env() is written, if a directory, say "functions/", contains many R files without conflicting function names, you can call it as
populate.env("myfunctions","functions/")
to load all functions (and variables) into a single namespace. With name.to.env(), you can also do something like
with(name.to.env("fun1"), doStuff(var1))
or
evalq(doStuff(var1), name.to.env("fun1"))
Of course, if your project grows big and you have lots and lots of functions (and variables), writing a package is the way to go.
If you switch to using packages, you get namespaces as a side-benefit (provided you use a NAMESPACE file). There are other advantages for using packages.
If you were really trying to avoid packages (which you shouldn't), then you could try assigning your variables in specific environments.
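For example, a minimal illustration:
my_env <- new.env()
assign("x", 42, envir = my_env)                      # x lives only in my_env
get("x", envir = my_env)                             # [1] 42
exists("x", envir = globalenv(), inherits = FALSE)   # FALSE: the global env stays clean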
Well, avoiding namespace pollution, as you put it, is just a matter of diligently partitioning the namespace and keeping your global namespace uncluttered.
Here are the essential functions for those two kinds of tasks:
Understanding/Navigating the Namespace Structure
At start-up, R creates a new environment to store all objects created during that session--this is the "global environment".
# to get the name of that environment:
globalenv()
But this isn't the root environment. The root is an environment called "the empty environment"; all environments chain back to it:
emptyenv()
## <environment: R_EmptyEnv>
# to view all of the chained parent environments (which includes '.GlobalEnv'):
search()
Creating New Environments:
workspace1 <- new.env()
is.environment(workspace1)
## [1] TRUE
class(workspace1)
## [1] "environment"
# add an object to this new environment (load() puts the saved objects there,
# rather than attaching the file to the search path):
load("/Users/doug/Documents/test_obj.RData", envir=workspace1)
# verify that it's there:
exists("test_obj", where=workspace1, inherits=FALSE)
## [1] TRUE
# the new environment chains back to the global environment:
parent.env(workspace1)
## <environment: R_GlobalEnv>
ls(workspace1)
## [1] "test_obj"
Coming from Python, et al., this system (at first) seemed to me like a room full of carnival mirrors. The R gurus, on the other hand, seem quite comfortable with it. I'm sure there are a number of reasons why, but my intuition is that they don't let environments persist. I notice that R beginners use attach(), as in attach(this_dataframe); experienced R users don't do that, they use with() instead, e.g.,
with(this_dataframe, tapply(etc....))
(I suppose they would achieve the same thing if they used attach() then detach(), but with() is faster and you don't have to remember the second step.) In other words, namespace collisions are avoided in part by limiting the objects visible from the global namespace.
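For example, the two styles side by side, using the built-in mtcars data frame:
# beginner style: attach, use, and (easy to forget) detach
attach(mtcars)
mean(mpg)
detach(mtcars)
# experienced style: scope the lookup to a single expression
with(mtcars, mean(mpg))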
