The CRAN policy limits R package size to 5 Mb, which is little for graphical applications such as mapping. There are multiple ways of handling the package size limitations, all of which come with their drawbacks. The alternatives have been listed below.
My question is: how to make an R package download data files only once (i.e. they are saved to a place where R finds them after restarting)? The solution should work for all common CRAN platforms.
I have been developing a mapping package for R which is supposed to plot bathymetric maps anywhere around the globe in ggplot2. I list alternatives to handle large data files in CRAN packages I have come across. The alternatives are written map-making in mind but apply for any case where large, single files are required:
Moving large files to a data package and making the original package depend on the data package.
a) If the data package is <5 Mb, it can be uploaded to CRAN, and one can make the original depend or import the data package in the DESCRIPTION field. User can simply use the install.packages() function as they would with any other CRAN package. Things work CRANtastic and everyone is happy.
b) If the data package is >5 Mb, things get messy. One alternative, in theory, would be to make a separate data package for each file given that the data files are all <5 Mb. Then one could use the approach in 1a for each data package. This alternative is so hacky that I have not had the nerves to try it in practice. It would be interesting to hear in the comments if someone has.
c) Another and better alternative is to use the drat package to make a data package, for example, to GitHub. This alternative has the benefit that the user can write install.packages() to install the original package from CRAN but also has quite a few disadvantages for the developer. Setting up the data package to pass all CRAN checks can be slightly challenging as all the steps have not been correctly specified anywhere online at the moment: the original package has to ask for permission to install the data package; the data package has to be distributed as separate binaries for the current development version of R at least for Windows and Mac, but possibly also for Fedora in the drat repository; the data package should be listed as Suggests: with an URL under Additional_repositories: in the DESCRIPTION file; to mention some surprises I have encountered so far. All in all, this alternative is great for the user but requires maintenance from the developer.
Some mapping packages (such as marmap) download data to temporary files from external servers. This approach has the benefit that CRAN requirements are easy to fulfill, and the user does not have to store any more data than required for the application. The approach also allows specifying the resolution in the download function, which is great for "zooming" the maps. The disadvantages are that the process is bound to take more time than simply storing the map data locally. Another disadvantage is that the map data need to be distributed in raster format (or the server has to crop vectors). At the time of writing, vector data allow easier manipulation of colors and styles in R and ggplot2 than raster data. Vectors also make sharper figures as the elements are not bound to resolution. The third disadvantage is that the download method (to my knowledge) has to be targetted to temporary files (i.e. they get lost when R is restarted) when writing a CRAN package due to operating system differences. As far as I know, it is not allowed to add Rdata files to already downloaded and existing R packages, and finding a location to download data that works for all major CRAN operating systems can be difficult.
I keep on getting rejected by CRAN time after time because I have not managed to solve the data download problem. There is some help available online but I feel this issue has not been addressed sufficiently yet. The optimal solution would download sp vector shapefiles
as needed when making maps (the objects can be stored in .Rdata format). This would allow the addition of detailed maps for certain frequently needed regions. The shapefiles could be stored on GitHub, which would allow quick and flexible modification of these files during development.
Have you tried using xz compression to reduce the size of your sysdata? I believe the default is gzip, with the compression level set to 6. If you use either bzip2 or xz compression when saving your package data with save(), R will use these compression algorithms in conjunction with a compression level of 9. The upshot is that you get smaller package data objects.
The getNOAA.bathy() function from the marmap package has a keep argument which defaults to FALSE. If set to TRUE, the dataset downloaded from the ETOPO1 database on NOAA servers is stored locally, in the working directory of the current R session. The argument Path allows the user to specify where the dataset should be saved (version 1.0.5, available on GitHub but not on CRAN yet).
When the user calls getNOAA.bathy(), the function first checks if the requested data is available locally, either in the current working directory or in the user provided path. If it is (same bounding box and resolution), then the NOAA servers are not queried and the local data file is loaded instead. If not, the data is downloaded from NOAA servers. IMHO, this method has the following advantages:
if keep=FALSE: nothing is stored locally, which avoids adding too much clutter to the user's disk when loading many different test datasets.
if keep=TRUE: the data is stored locally. Loading the data will be much faster the next time (and it can be done offline) since everything happens locally.
In a script, the same getNOAA.bathy() function is used to first download data from NOAA servers and load local files when available. The user does not have to worry to manually save the data, nor to alter his\her script to load local data the next time, since the function automatically loads the data from the most appropriate source (web server or internal disk).
there's no need to pack any heavy data within the package.
As far as I can tell, the only drawback is that on Windows machines, paths are limited to 250 characters, which might cause some trouble when generating filenames to save the data. Indeed, depending on the bounding box and resolution of the data downloaded on NOAA servers, filenames can be pretty long due to floating point arithmetics. An easy fix is to round the coordinates of the bounding box (using either round(), ceiling() or floor()) to a few decimal places before generating the name of the file to save.
In general I wouldn't make it too hacky. I think there could be ways to trick the package to load additional data online during installation and add it to the package itself. Would be somehow nice - but I don't think it is popular with the CRAN maintainers.
What about the following ? :
CRAN package for the functions
Github package for your data
In the CRAN package you import devtools and with the .onLoad method you install the Github data package with devtools::install_github. (on load is called, when the package is loaded with library()/require()). You see this sometimes with package startup messages.
I could imagine the following advantages:
is not done during installation but at package load
is somehow more transparent to the user (especially if you put a message)
has only to be done once (afterwards on load can just check if the data package is there and loads it)
the data is actually in a package and not a user path
the data is there for offline use once loaded
if you check for data package version in .onLoad, you could also trigger/make an update for the data without updating the CRAN package
A implementation could look like this:
#' #import devtools
.onLoad <- function(libname, pkgname){
if (! "wordcloud" %in% utils::installed.packages()) {
message("installing data super dupa data package")
devtools::install_github("ifellows/wordcloud")
}
else {
require(wordcloud)
message("Everything fine, ready for usage!")
}
}
The .onLoad has just to be out in any of your .R files. For your concrete implementation you could also refine this further. I don't have anything to to with the wordcloud package - was just the first thing I quickly found on GitHub as an example to install with install_github.
If there is an error message saying something with staged install - you have to add StagedInstall: no to your DESCRIPTION file.
You could have a function to install the data at a chosen location, and have the path stored in an option defined in your .R Profile: option(yourpackage.datapath = your path). You might suggest that the user stores it in your package installation path.
The installing function prints first the code above and proposes you to copy and paste it in your .RProfile while the data is downloading :
if(is.null(getOption("yourpackage.datapath")))
stop('you have not defined the "yourpackage.datapath" option, please make sure the data is installed using `yourpackage::install_yourdata", then copy `option(yourpackage.datapath = yourpath)` to your R profile.')
You could also open it using edit() for instance. Or place it in your pastebin but you don't want extra dependencies and I think you'd need some to do this. I don't think CRAN will let you edit the .RProfile automatically but this is not too bad of a manual action. The installation function could check that the option is set before even downloading.
The data can be stored in a global variable of your namespace. You just need to define a environment object in your package and a function to modify it :
globals <- new.env()
load_data <- function(path) globals$data <- readRDS(path)
Then your functions will test if globals$data is NULL before either loading the data (after checking if path option was set properly) or moving on.
Once it's done, as long as the data or RProfile are not removed, it will work forever, and if they are removed the functions will catch it and give instructions as to how to fix the issue.
Another option here is to load the data in .onLoad, it means you'll have some logic in there to deal with the first time the package is loaded. As .onLoad knows the installation path through the libname argument you can even impose to download your data there, and load it right after you checked it's there (using a global variable as above) , so no need for options and RProfile.
As long as the user is prompted I think it will be fine with CRAN.
Two alternatives that might be of interest:
Create an additional install function that installs from Github the data package(s). The rnaturalearth package has a great example with the install_rnaturalearthhires function.
Use the pins package to register a board_url. The pins package works by downloading and storing the file on cache. Whenever it is called it looks to the original url to see if there were any changes. If there weren't, it uses the one it already has in memory. If it has no Internet connection, it also uses the one in memory. As an example we use the pins package in our covidmx package to update COVID-19 data from the Internet.
I'm experimenting with GitHub and I created a little package for my colleagues to use. They install it with the devtools package and install_github() function directly in R. I also have some example data and a R-Markdown file that shows the usage of all functions in the package and can be published via GitHub Pages.
I would like to know what would be the best practice to enable others to use this example data to learn the package.
I can think of two different options:
Host the data in a separate directory which is not part of the installation and tell people to download it manually or use something like the download.file() function from R at the beginning of the example script to download all data that could be packed into a .zip.
Make the data part of the package installation, however this would require the data to be fairly small which is difficult in my particular case (data is 10MB).
Ideally the examples in the R-documentation (.Rd files in the man folder) could also use the same examples as in the markdown file. also in this case, option (2) seems to be favorable.
Could anybody give me some advice what would be the best way to go, sort of the "industry standard" if there is any.
Packrat is often recommended as the virtual environment for R, but it doesn't fully meet my need of contributing to R open source. Packrat's "virtual environment" is stored directly in the project directory, requiring me to modify the .gitignore to ignore them when I make a pull request to the open source upstream.
In contrast, something like conda stores the virtual environment somewhere else, leaving no trace in the project codebase itself.
So how do R open source contributors deal manage dependencies during package development? Ideally the solution would work well with devtools and Rstudio.
There is nothing wrong in having Packrat in .gitignore.
You can use .git/info/exclude file thus avoiding touching the .gitignore.
I build transport models for various government agencies. My model is managed through GitHub, and it depends on R to perform certain calculations. I currently have my entire r installation folder in the repository. This can't be the right solution, but here are some of my constraints:
My clients are usually even less sophisticated programmers then I am. When they download/clone the model, it just needs to work.
This needs to be the case 10 years from now - regardless of what the current build of R and all the package dependencies are.
Placing my entire R folder in the repo solves these two problems, but creates some new ones:
The repository is much larger than it needs to be / longer download time.
If the transport model is updated to a new version (say v2.0), I'd want to update R and its packages to the latest versions. I'm afraid this would increase the size of the repo even further.
One solution I understand is submodules. I could place the full R folder in a separate repo and bring it in as a submodule. This, at the very least, cleans up the model repository.
What about zipping the R folder? Some early testing showed that git can diff the zip file, but I don't know if it is doing it as a flat file or reading the contents. Also, is GitHub going to complain about 100MB+ zip file? I'd like to avoid GitLFS if I can, but asking my clients to unzip that file wouldn't be a problem.
I also looked at packrat, but as far as I can tell, that only works for R projects.
Lastly, I don't entirely understand makefiles / recipes, but it would be nice if there was a script I could run that would download specific versions of R and it's libraries. One complicating thing is that some of the R packages are private GitHub repos.
Anyway, I'm happy to provide more info if needed. Thank you for your help!
some background: I'm a fairly beginning sysadmin maintaining the server for our department. The server houses several VM's, mostly Ubuntu SE 12.04, usually with a separate VM per project.
One of the tools we use is R and RStudio, also server edition. I've set this up so everyone can access it through their browser, but I'm still wondering what the best way would be to deal with package management. Ideally, I will have one folder/library with our "common" packages, which are common in many projects and use cases. I would admin this library, since I'm the only user in sudo. My colleagues should be able to add packages on a case-by-case basis in their "personal" R folders, that get checked as a backup in case a certain package is not available in our main folder.
My question has a few parts:
- Is this actually a viable way to set this up?
- How would I configure this?
- Is there a way to easily automate this library for use in other VM's?
I have a similar question pertaining to Python, but maybe I should make a new question for that..
R supports multiple libaries for packages by default. Libraries are basically just folders in which installed packages are placed.
You can use
.libPaths()
in R to view what paths are use as libraries on your system. On my Ubuntu 13.10 system, there are
a personal library at "~/R/x86_64-pc-linux-gnu-library/3.0" where packages installed by the user are placed,
"/usr/lib/R/library" where packages installed via apt-get are placed and
"/usr/lib/R/site-library" which is a system-wide library for packages that are shared by all users.
You can add additional libraries to R, but from how I understand your question, installing packages to /usr/lib/R/site-library might be what you are looking for. This can be archived relatively easily by running R as root and calling install.packages() and update.packages() from there as usual. However, running R as root is a security risk and not a good idea, so maybe it is better to create an separate user with write access to /usr/lib/R/site-library and to use that one instead of root.
If you mount /usr/lib/R/site-library on multiple VM, they should also share the packages installed there. Does this answer your question?
Having common library and personal library locations is completely feasible.
Each user should have two environment variables set. R_LIBS should point to the common library, and R_LIBS_USER should point to their personal location. See ?.Library for more information.
You can check a user's library paths using .libPaths(). You probably want users to install packages to their personal library, so some fiddling may be required to make sure that the personal library is the first element in of .libPaths().