In which environment should I store package data the user (and functions) should be able to modify?

background
I am the maintainer of two packages that I would like to add to CRAN. They were rejected because some functions assign variables to .GlobalEnv, so now I am trying to find a different but equally convenient way to handle these variables.
the packages
Both packages belong to the Database of Odorant Responses, DoOR. We collect published odor-response data in the DoOR.data package, and the algorithms in the DoOR.functions package merge these data into a single consensus dataset.
intended functionality
The data package contains a precomputed consensus dataset. The user is able to modify the underlying original datasets (e.g. add their own datasets, remove some...) and compute a new consensus dataset. Thus the functions must be able to access the modified datasets.
The easiest way was to load the complete package data into .GlobalEnv (via a function) and then modify the data there. This was also straightforward for the user, who saw the relevant datasets in their "main" environment. The problem is that writing into the user's environment is bad practice and CRAN wouldn't accept the package this way (understandably).
things I tried
assigning only modified datasets to .GlobalEnv, non-explicitly via parent.frame() - Hadley pointed out that this is still bad; in the end we are writing into the user's environment.
writing only modified datasets into a dedicated new environment door.env <- new.env().
door.env is not in the search path, thus data in it is ignored by the functions
putting it into the search path with attach(door.env), as I learned, creates a copy of the environment in the search path, thus any further edits in door.env will again be ignored by the functions (see the short illustration after this list)
it is complicated for the user to see and edit the data in door.env; I'd rather have a solution where a user would not have to learn environment handling
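A short illustration of the attach() copy issue mentioned above (the object names are throwaway examples, not from the original post):

door.env <- new.env()
door.env$foo <- 1
attach(door.env)      # attach() puts a *copy* of door.env's contents on the search path
door.env$foo <- 2     # later edits go to the original environment...
foo                   # ...but the attached copy still returns 1
detach(door.env)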
So bottom line, with all solutions I tried I ended up with multiple copies of datasets in different environments, and I am afraid that this confuses the average user of our package (including me :))
Hopefully someone has an idea of where to store data, easily accessible to user and functions.
EDIT: the data is stored as *.RData files under /data in the DoOR.data package. I tried using LazyData: true to avoid loading all the data into .GlobalEnv. This works well, but the problems with the manipulated/updated data remain.
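One commonly suggested pattern (a sketch, not from the original post; all object and dataset names are illustrative) is to keep the modifiable copies in an environment that lives inside the package namespace and expose accessor functions, so neither the user nor the package functions ever write to .GlobalEnv:

.door_data <- new.env(parent = emptyenv())            # created once, inside the package

reset_door_data <- function() {
  # copy the shipped, lazily loaded datasets into the package-internal environment
  .door_data$response.matrix <- DoOR.data::response.matrix   # hypothetical dataset name
  invisible(TRUE)
}

get_dataset <- function(name) get(name, envir = .door_data)

set_dataset <- function(name, value) {
  assign(name, value, envir = .door_data)
  invisible(value)
}

The package functions then read via get_dataset() instead of relying on objects in the search path, and the user modifies data only through set_dataset(), without having to handle environments directly.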

Related

R package building: what is the best solution to use large and external data that need to be regularly updated?

We are creating an R package whose main function is to geocode addresses in Belgium (i.e. transform a "number-street-postcode" string into X-Y spatial coordinates). To do this, we need several data files, namely all buildings in Belgium with their geographical coordinates, as well as municipal boundary data containing geometries.
We face two problems in creating this package:
The files take up space: about 300-400 MB in total. This size is a problem because we eventually want to put this package on CRAN. A solution we found on Stack Overflow is to create another package only for the data and to host this package elsewhere. But then a second problem arises (see next point).
Some of the files we use are produced by public authorities. These files are publicly available for download and are updated weekly. We have created a function that downloads the data if the data on the computer is more than one week old and transforms it for the geocoding function (we have created a specific structure to optimize the processing). We are new to package creation, but from what we understand it is not possible to update data every week if it is originally contained in the package (maybe we are wrong?). It would be possible to release a weekly update of the package, but this would be very tedious for us. Instead, we want the user to be able to update the data whenever they want, and for the data to persist.
So we are wondering what is the best solution regarding this data issue for our package. In summary, here is what we want:
Find a user-friendly way to download the data and use it with the package.
That the user can update the data whenever they want with a function of the package, and that this data persists on the computer.
We found an example that could work: the Rpostal package, which also relies on large external data (https://github.com/Datactuariat/Rpostal). The author's solution is to install the data outside the package and to specify the directory where it is located each time a function is used. It is then necessary to pass libpostal_path as an argument to the functions so that it works.
However, we wonder if there is a solution to store the files in a directory internal to R or to our package, so we don't have to define this directory in the functions. Would it be possible, for example, to download these files into the package directory, without the user having the choice, so that we can know their path in any case and without the user having to mention it?
Are we on the right track or do you think there is a better solution for us?
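A common pattern for this (a sketch under assumptions, not an answer from the thread; the package name, file name and URL below are placeholders) is to keep the downloaded files in a persistent per-user directory returned by tools::R_user_dir() (available since R 4.0), so the package always knows the path and the data survives reinstallation:

get_data_dir <- function() {
  dir <- tools::R_user_dir("begeocoder", which = "data")   # "begeocoder" is a hypothetical package name
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dir
}

update_data <- function(max_age_days = 7) {
  target <- file.path(get_data_dir(), "buildings.rds")     # placeholder file name
  outdated <- !file.exists(target) ||
    difftime(Sys.time(), file.mtime(target), units = "days") > max_age_days
  if (outdated) {
    download.file("https://example.org/buildings.rds",     # placeholder URL
                  destfile = target, mode = "wb")
  }
  invisible(target)
}

Every geocoding function can then call get_data_dir() internally, so the user never has to supply a path. Note that CRAN policy generally disallows writing into the package installation directory or the user's home area, but locations returned by tools::R_user_dir() (and tempdir()) are acceptable for user-initiated downloads like this.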

How to use data.table::setDTthreads() in my own package?

I'm developing a very small package for the first time (and, perhaps importantly in the context of my question, would like to publish it on CRAN). This package uses functions from data.table and base R. I would like to take advantage of the parallel computation controlled by the data.table::setDTthreads() function.
When the user loads the data.table package, this function is called immediately, but I'm not doing this when developing my package. What I have done so far is just: (1) in the DESCRIPTION file I have added data.table to the Imports field; (2) in the NAMESPACE I have included import(data.table).
As far as I know this is not the same as library(data.table), and I don't want that, because I don't want to attach data.table when the user loads my package. But I would still like to use the data.table::setDTthreads() function. Where should I include this in my script? Or is it perhaps already in use because I have included import(data.table) in the NAMESPACE?
My package contains only one .R file in the R/ directory, with a few functions, but only one will be exported to be visible to the user (the others are just helpers). Let's say it looks like this:
#' roxygen2 skeleton
main_function <- function(x) {
  x <- helper_function_1(x)
  x
}

helper_function_1 <- function(x) {
  x
}
What I really worry about is that if I use data.table::setDTthreads() in my package it will have an impact on the user's environment, i.e. I will enable parallel computation and set the number of threads not just for the functions in my package, but for the user's whole session.
Very good question.
Yes, it will affect all data.table calls (including those from other packages) in the user's environment, not just those from your package.
The general advice is not to set this value in your package but to let users know that they can set it themselves. If you want to set it in your package, you should document it really well.
Note that 50% vs. 100% of threads is often a very small difference (it can be less than 5%, or even a slowdown in shared environments), so I suggest you measure whether it is really worth messing with the user's environment if the benefits are small.
Check these timings, for example:
https://github.com/h2oai/db-benchmark/issues/202
You could also file a feature request for the possibility of setting the number of threads just for calls from a single package. It is technically possible by checking the top environment of a call.
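If you do decide to change the thread count inside your package, one way to limit the side effect (a sketch, not part of the answer above) is to save the user's current setting and restore it when your exported function exits:

main_function <- function(x) {
  old_threads <- data.table::getDTthreads()
  on.exit(data.table::setDTthreads(old_threads), add = TRUE)
  data.table::setDTthreads(0L)   # 0 means "use all logical CPUs"

  x <- helper_function_1(x)
  x
}

This way the user's setting is restored even if your function errors, so the change does not leak into the rest of the session.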

Why does calling `detach` cause R to "forget" functions?

Sometimes I use attach with some subset terms to work with odd dimensions of study data. To prevent "masking" variables in the environment (really the warning message itself) I simply call detach() to remove whatever dataset I was working with from the R search path. When I get muddled in scripting, I may end up calling detach a few times. Interestingly, if I call it enough, R removes functions that are loaded at start-up as part of packages such as utils, stats, and graphics. Why does "detach" remove these functions?
R removes base functions from the search path, like plot and ? and so on.
These functions that were removed are often called “base” functions but they are not part of the actual ‹base› package. Rather, plot is from the package ‹graphics›, and ? is from the package ‹utils›, both of which are part of the R default packages and are therefore attached by default. Both packages are attached after package:base, and you’re accidentally detaching these packages with your repeated detach calls (package:base itself cannot be detached; this is important because if it were detached, you couldn’t reattach it: the functions necessary for that are inside package:base).
To expand on this, attach and detach are usually used in conjunction with package environments rather than data sets: to enable the use of functions from a package without explicitly typing the package name (e.g. graphics::plot), the library function attaches these packages. When loading R, some packages are attached by default. You can find more information about this in Hadley Wickham’s Advanced R.
As you noticed, you can also attach and detach data sets. However, this is generally discouraged (quite strongly, in fact). Instead, you can use data transformation functions from the base package (e.g. with and transform, as noted by Moody_Mudskipper in a comment) or from a data manipulation package (‹dplyr› is state of the art; an alternative is ‹data.table›).
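A brief illustration of the mechanism (an assumed interactive session, not part of the original answer):

search()           # ".GlobalEnv" comes first, then the default packages such as
                   # package:stats, package:graphics and package:utils, with package:base last
detach()           # with no arguments this detaches whatever sits at position 2 of the search path
detach()           # repeat often enough and package:graphics is gone, taking plot() with it
library(graphics)  # reattaching the package brings plot() back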

Using `data()` for time series objects in R

I apologise if this question has been asked already (I haven't been able to find it). I was under the impression that I could access datasets in R using data(), for example, from the datasets package. However, this doesn't work for time series objects. Are there other examples where this is not the case? (And why?)
data("ldeaths") # no dice
ts("ldeaths") # works
(However, this works for data("austres"), which is also a time-series object).
The data function is designed to load package data sets and all their attributes, time series or otherwise.
I think the issue you're having is that there is no stand-alone data set called ldeaths in the datasets package. ldeaths does exist as one of three objects in the UKLungDeaths data set. The other two are fdeaths and mdeaths.
The following should lazily load all data sets.
data(UKLungDeaths)
Then, typing ldeaths in the console or using it as an argument in some function will load it.
str(ldeaths)
While it is uncommon for package authors to include multiple objects in one data set, it does happen. This line from the data function documentation gives a heads-up about this:
"For each given data set, the first two types (‘.R’ or ‘.r’, and ‘.RData’ or ‘.rda’ files) can create several variables in the load environment, which might all be named differently from the data set"
That is the case here: while there are three time series objects contained in the data set, none of them is named UKLungDeaths.
This situation arises when the package author uses the save function to write multiple R objects to an external file. In the wild, I've seen folks use the save function to bundle a description file with the data set, although this would not be the proper way to document something in a full-on package. If you're really curious, go read the documentation on the save function.
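For instance, a multi-object data file like UKLungDeaths.rda typically comes from a call along these lines (the values below are made up for illustration, not how the datasets package actually builds it):

fdeaths <- ts(rpois(72, 600), start = 1974, frequency = 12)      # made-up monthly counts
mdeaths <- ts(rpois(72, 1500), start = 1974, frequency = 12)
ldeaths <- fdeaths + mdeaths
save(ldeaths, fdeaths, mdeaths, file = "data/UKLungDeaths.rda")  # three objects, one .rda file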

Should one load the dataset before or inside of the function?

Q. How does one write an R package function that references an R package dataset in a way that is simple/friendly/efficient for an R user? For example, how does one handle afunction() that calls adataset from a package?
What I think may not be simple/friendly/efficient:
The user is required to run data(adataset) before running afunction(...), or else they receive an Error: ... object 'adataset' not found. I have noticed some packages have built-in datasets that can be used anytime the package is loaded, for example iris, which one can call without bringing it into the Global Environment.
Possible options which I have entertained:
Write data(NamedDataSet) directly into the function. Is this a bad idea? I thought perhaps it could be, considering memory and given my limited understanding of function environments.
Code the structure of the dataset directly into the function. I think this works depending on the size of the data but it makes me wonder about how to go about proper documentation in the package.
Change Nothing. Given a large enough dataset, maybe it does not make sense to implement a way different from reading it before calling the function.
Any comments are appreciated.
You might find these resources about writing R data packages useful:
the "External Data" section of R Packages
Creating an R data package
Creating an R data package (overview)
In particular, take note of the DESCRIPTION file and usage of the line LazyData: true. This is how datasets are made available without having to use data(), as in the iris example that you mention.
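To make that concrete, here is a minimal sketch using the hypothetical names from the question: adataset is stored as data/adataset.rda, DESCRIPTION contains the line LazyData: true, and the user has attached the package with library(). The function can then simply refer to the dataset:

afunction <- function(n = 5) {
  # adataset is lazily loaded on first use; the user never needs to call data(adataset)
  head(adataset, n)
}

R CMD check may report "no visible binding for global variable 'adataset'" for code like this; declaring utils::globalVariables("adataset") in the package is the usual way to silence that note.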
