R package building: what is the best solution for using large, external data that needs to be regularly updated?

We are creating an R package whose main function is to geocode addresses in Belgium (i.e. transform a "number-street-postcode" string into X-Y spatial coordinates). To do this, we need several data files: all buildings in Belgium with their geographical coordinates, as well as municipal boundary data containing geometries.
We face two problems in creating this package:
The files take up space: about 300-400 MB in total. This size is a problem because we eventually want to put this package on CRAN. A solution we found on Stack Overflow is to create a separate package containing only the data and to host that package elsewhere. But then a second problem arises (see next point).
Some of the files we use are produced by public authorities. They are publicly available for download and are updated weekly. We have written a function that downloads the data if the local copy is more than one week old and transforms it for the geocoding function (we have created a specific structure to optimize the processing). We are new to package creation, but from what we understand it is not possible to update data every week if it is shipped inside the package (maybe we are wrong?). It would be possible to release a weekly update of the package, but this would be very tedious for us. Instead, we want the user to be able to update the data whenever they want, and for the data to persist.
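To give an idea, here is a simplified sketch of that kind of staleness check (the URL and file name below are placeholders, not the real sources):

    # Sketch: re-download the source file only when the local copy is missing
    # or more than a week old (URL and file name are placeholders).
    update_if_stale <- function(dest,
                                url = "https://example.org/belgium_addresses.zip",
                                max_age_days = 7) {
      stale <- !file.exists(dest) ||
        difftime(Sys.time(), file.mtime(dest), units = "days") > max_age_days
      if (stale) {
        download.file(url, destfile = dest, mode = "wb")
        # ... then transform the raw file into the structure used for geocoding ...
      }
      invisible(dest)
    }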
So we are wondering what is the best solution regarding this data issue for our package. In summary, here is what we want:
Find a user-friendly way to download the data and use it with the package.
The user should be able to update the data whenever they want with a package function, and the data should persist on the computer.
We found an example that could work: the Rpostal package, which also relies on large external data (https://github.com/Datactuariat/Rpostal). The author's solution is to install the data outside the package and to specify the directory where it is located each time a function is used: libpostal_path must be passed as an argument to the functions for them to work.
However, we wonder whether there is a way to store the files in a directory internal to R or to our package, so that we do not have to pass this directory to the functions. Would it be possible, for example, to download these files into the package directory, without the user having a choice, so that we always know their path and the user never has to specify it?
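For instance, would something like tools::R_user_dir() (available since R 4.0) be an acceptable way to get a fixed, persistent location without asking the user for a path? A rough sketch of what we have in mind, with a placeholder package name:

    # Sketch: resolve a persistent, package-specific data directory once,
    # so no path ever has to be passed to the geocoding functions
    # ("ourgeocoder" is a placeholder for the real package name).
    geocoder_data_dir <- function() {
      dir <- tools::R_user_dir("ourgeocoder", which = "data")
      if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
      dir
    }

    # An exported update function would write into this directory, and every
    # geocoding function would read from it, e.g.:
    # update_geocoding_data()            # downloads/refreshes the files
    # geocode("Rue de la Loi 16, 1000")  # looks up geocoder_data_dir() internally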
Are we on the right track or do you think there is a better solution for us?

Related

Subset of features on external memory

I have a large file that I'm not able to load, so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The xgboost documentation says that the colset argument of slice is "currently not used", and there is no mention of this feature on the GitHub page. I haven't found any other clue about how to do column subsetting with external memory.
I want to compare models generated with different feature subsets. The only thing I could think of is to create a new file with just the features I want to use (see the sketch below), but it's taking a long time and will take a lot of memory... I can't help wondering if there is a better way.
PS: I tried the h2o package too, but h2o.importFile froze.
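For clarity, this is the workaround I mean (column names are made up); fread(select = ...) at least avoids loading the unwanted columns, but it still duplicates the data on disk:

    library(data.table)

    # The workaround: write a reduced copy containing only the features I want
    # (names are made up), so the smaller file can then be used to build the DMatrix.
    keep <- c("label", "feat_01", "feat_02", "feat_03")
    dt   <- fread("big_training_set.csv", select = keep)   # reads only these columns
    fwrite(dt, "reduced_training_set.csv")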

In which environment should I store package data the user (and functions) should be able to modify?

background
I am the maintainer of two packages that I would like to add to CRAN. They were rejected because some functions assign variables to .GlobalEnv. Now I am trying to find a different but equally convenient way to handle these variables.
the packages
Both packages belong to the Database of Odorant Responses, DoOR. We collect published odor-response data in the DoOR.data package, the algorithms in the DoOR.functions package merge these data into a single consensus dataset.
intended functionality
The data package contains a precomputed consensus dataset. The user is able to modify the underlying original datasets (e.g. add his own datasets, remove some...) and compute a new consensus dataset. Thus the functions must be able to access the modified datasets.
The easiest way was to load the complete package data into the .GlobalEnv (via a function) and then modify the data there. This was also straightforward for the user, who saw the relevant datasets in their "main" environment. The problem is that writing into the user's environment is bad practice and CRAN wouldn't accept the package this way (understandably).
things I tried
assigning only modified datasets to .GlobalEnv, not explicitly but via parent.frame() - Hadley pointed out that this is still bad; in the end we are writing into the user's environment.
writing only modified datasets into a dedicated new environment, door.env <- new.env() (a rough sketch of this idea follows at the end of this question)
door.env is not on the search path, so the data in it is ignored by the functions
putting it on the search path with attach(door.env), as I learned, creates a new environment on the search path, so any further edits in door.env will again be ignored by the functions
it is complicated for the user to see and edit the data in the new environment; I'd rather have a solution where the user does not have to learn environment handling
So, bottom line: with all the solutions I tried I ended up with multiple copies of datasets in different environments, and I am afraid that this confuses the average user of our package (including me :))
Hopefully someone has an idea of where to store data, easily accessible to user and functions.
EDIT: the data is stored as *.RData files under /data in the DoOR.data package. I tried using LazyData: true to avoid loading all the data into the .GlobalEnv. This works well, but the problems with the manipulated/updated data remain.
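For completeness, this is roughly the direction of the dedicated-environment attempt above, but with accessor functions instead of attach() (function names are illustrative, not the real DoOR API):

    # Illustrative sketch: a package-internal environment plus accessors,
    # instead of attach() (names are made up, not the actual DoOR functions).
    door_env <- new.env(parent = emptyenv())

    set_door_data <- function(name, value) {
      assign(name, value, envir = door_env)
    }

    get_door_data <- function(name) {
      if (exists(name, envir = door_env, inherits = FALSE)) {
        get(name, envir = door_env, inherits = FALSE)         # user-modified copy
      } else {
        e <- new.env()
        data(list = name, package = "DoOR.data", envir = e)   # fall back to shipped data
        get(name, envir = e)
      }
    }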

Where to store data for an R package hosted on GitHub

I'm working on building a package in R and have a couple of very large data sets that I would like to make available to package users without them having to re-run the code that extracted the data in the first place. My package (still a work in progress) is hosted on GitHub, and it's primarily for my own use as I work on a larger research project. Is there a way to include a .csv of a data set so that it stays stored on GitHub? Ideally it would work like the default data sets mtcars or diamonds. Is there a way to dput() the data set and then store it in my package function file?
Additional information: I've been using a combination of roxygen2 and devtools to build and launch. This question is related but is one step ahead of what I need.
I hope my question is clear!
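In case it helps, this is the kind of workflow I imagine, using usethis/roxygen2 (the dataset name is made up):

    # Sketch: save the prepared object into data/ so it ships with the package
    # and loads like mtcars does ("my_extract" is a made-up name).
    my_extract <- read.csv("path/to/extracted_data.csv")
    usethis::use_data(my_extract, overwrite = TRUE)   # creates data/my_extract.rda

    # Documented with roxygen2 in R/data.R, e.g.:
    #' Extracted data set for the research project.
    #'
    #' @format A data frame with one row per observation.
    "my_extract"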

R package, size of dataset vis-a-vis code

I am designing an R package (http://github.com/bquast/decompr) to run the Wang-Wei-Zhu export decomposition (http://www.nber.org/papers/w19677).
The complete package is only about 79 kilobytes.
I want to supply an example dataset, especially because the input objects are somewhat complex. A relevant real-world dataset is available from http://www.wiod.org; however, the total size of the .Rdata object would come to about 1 megabyte.
My question therefore is, would it be a good idea to include the relevant dataset that is so much larger than the package itself?
It is not usual for code to be significantly smaller than data. However, I will not be the only one to suggest the following (especially if you want to submit to CRAN):
Consult the R Extensions manual. In particular, make sure that the data file is in a compressed format and use LazyData when applicable.
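For example, something along these lines (run from the package source directory; the DESCRIPTION field is shown as a comment):

    # Re-save the shipped .rda files with the best available compression
    # and check the result (run in the package source directory).
    tools::resaveRdaFiles("data/", compress = "xz")
    tools::checkRdaFiles("data/")    # reports sizes and chosen compression

    # and in DESCRIPTION:
    #   LazyData: true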
The CRAN Repository Policies also have a thing or two to say about data files. As a general rule, neither data nor documentation should exceed 5MB. If the code is likely to change and the data are not, consider creating a separate data package.
PDF documentation can also be distributed, so it is possible to write a "vignette" that is not built by running code when the package is bundled, but instead illustrates usage with static code snippets that show how to download the data. Downloading in the vignette itself is prohibited, as the manual states that all files necessary to build it must be available on the local file system.
I also would have to ask if including a subset of the data is not sufficient to illustrate the use of the package.
Finally, if you don't intend to submit to a package repository, I can't imagine a megabyte download being a breach of etiquette.

Update the dataset in an installed package

Is it possible to update a dataset in a local, installed package?
A package that I maintain has a dataset based on periodically-updated data. I would like to update the local version of my dataset and save the changes back to the package such that next time I load the data, i.e. data(xxx), the updated version of the dataset will load.
In the medium and long term I will update the package and then upload a new version to CRAN, but I'm looking for a short term solution.
If it is possible, how would I do it?
You could do it:
by updating the source and re-installing - yes, preferably with a new, distinctive version number (see the sketch below);
by forcefully overwriting the installed data file - possibly, but that is not the proper way to do it.
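A minimal sketch of the first option (object and helper names are placeholders):

    # In the package source: refresh the object, overwrite data/xxx.rda,
    # bump the version in DESCRIPTION, then re-install the source package.
    xxx <- build_updated_dataset()        # however the data is refreshed (placeholder)
    save(xxx, file = "data/xxx.rda", compress = "xz")
    # or: usethis::use_data(xxx, overwrite = TRUE)

    devtools::install()   # re-install from the local source directory
    data(xxx)             # in a fresh session, loads the updated version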
What I would try to do is put a mechanism to acquire this data into the package, but separate the (changing?) data from the code.
Packages are not, first and foremost, a means of acquiring data, in particular not changing data sets. Most packages include fixed data to demonstrate or illustrate a method or implementation.
