Should one load the dataset before or inside of the function?

Q. How does one write an R package function that references an R package dataset in a way that is simple/friendly/efficient for the user? For example, how does one handle an afunction() that calls adataset from a package?
What I think may not be simple/friendly/efficient:
The user is required to run data(adataset) before running afunction(...), or else they receive Error: ... object 'adataset' not found. I have noticed some packages have built-in datasets that can be called any time the package is loaded, for example iris, which one can call without bringing it into the Global Environment.
Possible options which I have entertained:
Write data(NamedDataSet) directly into the function. Is this a bad idea? I suspect it might be, given memory considerations and my limited understanding of function environments.
Code the structure of the dataset directly into the function. I think this works depending on the size of the data, but it makes me wonder how to go about proper documentation in the package.
Change nothing. Given a large enough dataset, maybe it does not make sense to implement a way different from reading it in before calling the function.
Any comments are appreciated.

You might find these resources about writing R data packages useful:
the "External Data" section of R Packages
Creating an R data package
Creating an R data package (overview)
In particular, take note of the DESCRIPTION file and usage of the line LazyData: true. This is how datasets are made available without having to use data(), as in the iris example that you mention.
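A minimal sketch of what this looks like, using the placeholder names from the question (adataset saved as data/adataset.rda):

# In DESCRIPTION:
#   LazyData: true

# A package function can then refer to the dataset directly; the user never
# has to call data(adataset) first:
afunction <- function(...) {
  summary(adataset)   # adataset is lazily loaded from the package's data/ directory
}

(R CMD check may still raise a "no visible binding for global variable" note for adataset; that is commonly silenced with utils::globalVariables("adataset").)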

Related

How do I rebuild data-raw using roxygen2?

I have a package that uses a pre-built dataset that may be modified over time by other parts of the package. Specifically, I have files that I add to the inst directory that will be indexed to make a data.frame. The indexing does not take a very long time (about 15-30 seconds), but it is longer than something I'd like to do on every package load.
Is there a way to automate the indexing so that it will occur with roxygen2::roxygenize()? What I'd really love to have happen is that an R function from the package would run any time I run devtools::document(). I think that this is possible by creating a custom roclet, but I don't understand quite how I would do this in practice.
After reading more documentation, I found the answer to the question. The @examples idea from @r2evans helped me find something more directly targeted at this. I used #' @eval my_function(), where my_function() is an exported function in the package that builds the data and generates the documentation for the data.
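A rough sketch of that pattern (the function, dataset, and helper names here are illustrative, not the actual package's):

# R/data.R -- the @eval tag runs my_function() every time roxygen2::roxygenize()
# or devtools::document() is called, and splices its return value into this block.
#' @eval my_function()
"file_index"

# R/build-data.R -- exported builder:
my_function <- function() {
  file_index <- index_inst_files()                 # hypothetical 15-30 s indexing step
  save(file_index, file = "data/file_index.rda")   # refresh the pre-built dataset
  c("@title Index of files shipped in inst/",      # roxygen tags for the dataset's help page
    "@format A data.frame with one row per indexed file.")
}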

Understanding Environments within Packages

I am writing a package for the first time and I am trying to understand how it works in principle.
I need a variable to be available to many functions within the package. The common suggestion is to use a new environment that you specifically create for this purpose from within the package. For example, here:
How to define "hidden global variables" inside R packages?
or in this snippet of code:
https://github.com/eddelbuettel/rcppgsl/blob/master/R/inline.R#L19-L34
Based on this suggestion, right now my code starts with:
.pkg_env <- new.env(parent=emptyenv())
Yet I cannot reconcile this with the principle that "the code in a package is run when the package is built", as mentioned here (https://r-pkgs.org/r.html#understand-when-code-is-executed). Perhaps I am missing something in the way environments are managed in R, but I would guess you need that line to be run at load time to create the environment.
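For reference, a minimal sketch of the package-local environment pattern those links describe (accessor names are illustrative). The environment is indeed created when the package is built, but it is then serialized with the namespace and restored when the package is loaded, which is why the top-level assignment is enough:

.pkg_env <- new.env(parent = emptyenv())

set_pkg_var <- function(name, value) {
  # store a package-wide variable without touching the user's workspace
  assign(name, value, envir = .pkg_env)
  invisible(value)
}

get_pkg_var <- function(name, default = NULL) {
  if (exists(name, envir = .pkg_env, inherits = FALSE)) {
    get(name, envir = .pkg_env, inherits = FALSE)
  } else {
    default
  }
}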

Why does calling `detach` cause R to "forget" functions?

Sometimes I use attach with some subset terms to work with odd dimensions of study data. To prevent "masking" variables in the environment (really, the warning message itself) I simply call detach() to remove whatever dataset I was working with from the R search path. When I get muddled in scripting, I may end up calling detach a few times. Interestingly, if I call it enough, R removes functions that are loaded at start-up as part of packages like utils, stats, and graphics. Why does detach remove these functions?
R removes base functions from the search path, like plot and ? and so on.
These functions that were removed are often called "base" functions, but they are not part of the actual ‹base› package. Rather, plot is from the package ‹graphics› and ? is from the package ‹utils›, both of which are among R's default packages and are therefore attached by default. Both packages are attached after package:base, and your repeated detach calls are accidentally detaching them (package:base itself cannot be detached; this is important because if it were, you couldn't reattach it: the functions necessary for that are inside package:base).
To expand on this, attach and detach are usually used in conjunction with package environments rather than data sets: to enable the use of functions from a package without explicitly typing the package name (e.g. graphics::plot), the library function attaches these packages. When R starts, some packages are attached by default. You can find more information about this in Hadley Wickham's Advanced R.
As you noticed, you can also attach and detach data sets. However, this is generally discouraged (quite strongly, in fact). Instead, you can use data transformation functions from the base package (e.g. with and transform, as noted by Moody_Mudskipper in a comment) or from a data manipulation package (‹dplyr› is state of the art; an alternative is ‹data.table›).
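A quick illustration of the behaviour described above (best tried in a throwaway session):

search()         # "package:stats", "package:graphics", "package:utils", ... are attached
attach(mtcars)   # "mtcars" now sits at position 2 of the search path
detach()         # with no arguments, detach() removes whatever is at position 2: mtcars
detach()         # ...so the next call starts stripping attached packages
                 # (package:stats, then package:graphics, ...), and plot(), ? etc.
                 # eventually stop being found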

In which environment should I store package data the user (and functions) should be able to modify?

background
I am the maintainer of two packages that I would like to add to CRAN. They were rejected because some functions assign variables to .GlobalEnv. Now I am trying to find a different but as convenient way to handle these variables.
the packages
Both packages belong to the Database of Odorant Responses, DoOR. We collect published odor-response data in the DoOR.data package; the algorithms in the DoOR.functions package merge these data into a single consensus dataset.
intended functionality
The data package contains a precomputed consensus dataset. The user is able to modify the underlying original datasets (e.g. add his own datasets, remove some...) and compute a new consensus dataset. Thus the functions must be able to access the modified datasets.
The easiest way was to load the complete package data into the .GlobalEnv (via a function) and then modify the data there. This was also straightforward for the user, who saw the relevant datasets in their "main" environment. The problem is that writing into the user's environment is bad practice, and CRAN wouldn't accept the package this way (understandably).
things I tried
assigning only modified datasets to .GlobalEnv, not explicitly but via parent.frame() - Hadley pointed out that this is still bad; in the end we are writing into the user's environment.
writing only modified datasets into a dedicated new environment door.env <- new.env().
door.env is not in the search path, thus data in it is ignored by the functions
putting it into the search path with attach(door.env), as I learned, creates a new environment in the search path, thus any further edits to door.env will again be ignored by the functions (illustrated in the sketch at the end of this question)
for the user it is complicated to see and edit the data in door.env; I'd rather have a solution where a user does not have to learn environment handling
So, bottom line: with all the solutions I tried I ended up with multiple copies of datasets in different environments, and I am afraid that this confuses the average user of our package (including me :)).
Hopefully someone has an idea of where to store the data so that it is easily accessible to both the user and the functions.
EDIT: the data is stored as *.RData files under /data in the DoOR.data package. I tried using LazyData: true to avoid loading all the data into the .GlobalEnv. This works well, but the problems with the manipulated/updated data remain.
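As a small illustration of the attach() pitfall mentioned above (the dataset contents are made up):

door.env <- new.env()
door.env$dataset <- data.frame(odorant = "A", response = 1)
attach(door.env)                  # attaches a *copy* of door.env's contents
door.env$dataset$response <- 2    # a later edit goes into door.env itself...
dataset$response                  # ...but the attached copy still returns 1
detach(door.env)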

How to retrieve R function script from a specific package?

I realize there are generic functions like plot and predict across a range of packages. I am wondering how I could get the R source of these generic functions for a specific package, like lme4::predict. I tried lme4::predict, but it comes back with an error:
> lme4::predict
Error: 'predict' is not an exported object from 'namespace:lme4'
Since you state that my suggestion above was helpful, I will tell you my process. I used my own co-authored package called pacman. This package was developed because we had a hard time remembering all the obscurely named functions for getting information on, and working with, add-on packages.
I used this to figure out what you wanted:
library(pacman)
p_funs(lme4, all=TRUE)
I set all = TRUE because what lme4 provides is a predict method for a specific class (predict, like print, summary and plot, is a generic). Generally, these methods are not exported, so p_funs won't return them unless you set all = TRUE. Then I scrolled down to the p section and found only a single predict method: predict.merMod
Next I realized it wasn't exported so :: won't show me the stuff and extra colon power is needed, hence: lme4:::predict.merMod
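For this particular case, either of these shows the method's source (::: loads the lme4 namespace itself; getAnywhere needs it already loaded):

lme4:::predict.merMod           # triple colon reaches the unexported method
getAnywhere("predict.merMod")   # or search all loaded namespaces for it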
As pointed out by David and rawr above, some functions can be stubborn little snippets to track down (methods etc.), thus methods and getAnywhere are helpful.
Here is an example of that:
library(tm)
dissimilarity #The good stuff is hid
methods(dissimilarity) #I want the good stuff
getAnywhere("dissimilarity.DocumentTermMatrix")
Small end note
And of course you don't need pacman to look at functions from packages; it's what I use and find helpful, but it just wraps base R tools. Use the source to figure out exactly what's going on.

Resources