Sourcing an R script whenever a package is loaded - r

I'm creating an R package for the handling of a specific dataset that is regularly updated in our organization, but not on a fixed schedule (making it unsuitable for something such as a cronjob). As a result, users must currently run a set of two scripts for data processing before they begin to analyze the data. In converting this set of functions into a package, I'm hoping to alleviate this by having the scripts be called whenever the package is first loaded to R (with analogous functions if people would like to manually check for an update in the middle of a multi-day session).
I've seen ways to deal with compiling external files upon package installation, but nothing on how to get R to run a script whenever the package is loaded (not just installed). Does anyone know if this is possible, and if so, how to do it?
Thanks!

These hooks are described in the Writing R Extensions guide (which, if you're writing a package, you should be reading carefully), specifically in section 1.5.3, Load Hooks.
You can define an .onLoad function that will be called when your package loads.
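A minimal sketch of such a hook, assuming it lives in one of the package's R files (conventionally R/zzz.R). The helper update_data_if_stale(), the wrapper check_for_update() and the .state environment are hypothetical names standing in for the two processing scripts from the question:

```r
# Package-private environment to remember when data was last checked.
.state <- new.env(parent = emptyenv())

# Hypothetical helper standing in for the two data-processing scripts.
update_data_if_stale <- function() {
  # ... compare the source data with the processed copy, reprocess if needed ...
  .state$last_checked <- Sys.time()
  invisible(TRUE)
}

# R calls this hook each time the package namespace is loaded.
.onLoad <- function(libname, pkgname) {
  update_data_if_stale()
  invisible(NULL)
}

# Exported wrapper so users can re-check in the middle of a long session.
check_for_update <- function() update_data_if_stale()
```

Keeping the real work in an ordinary function means the same code runs both at load time and on demand.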

Related

How to make a CRAN package to download data only once regardless of OS?

The CRAN policy limits R package size to 5 Mb, which is little for graphical applications such as mapping. There are multiple ways of handling the package size limitations, all of which come with their drawbacks. The alternatives have been listed below.
My question is: how to make an R package download data files only once (i.e. they are saved to a place where R finds them after restarting)? The solution should work for all common CRAN platforms.
I have been developing a mapping package for R which is supposed to plot bathymetric maps anywhere around the globe in ggplot2. I list alternatives to handle large data files in CRAN packages I have come across. The alternatives are written map-making in mind but apply for any case where large, single files are required:
1. Moving large files to a data package and making the original package depend on the data package.
a) If the data package is <5 Mb, it can be uploaded to CRAN, and one can make the original depend on or import the data package in the DESCRIPTION file. Users can simply use the install.packages() function as they would with any other CRAN package. Things work CRANtastic and everyone is happy.
b) If the data package is >5 Mb, things get messy. One alternative, in theory, would be to make a separate data package for each file, given that the data files are all <5 Mb. Then one could use the approach in 1a for each data package. This alternative is so hacky that I have not had the nerve to try it in practice. It would be interesting to hear in the comments if someone has.
c) Another and better alternative is to use the drat package to host a data package, for example on GitHub. This alternative has the benefit that the user can write install.packages() to install the original package from CRAN, but it also has quite a few disadvantages for the developer. Setting up the data package to pass all CRAN checks can be slightly challenging, as the steps are not fully documented anywhere online at the moment. To mention some surprises I have encountered so far: the original package has to ask for permission to install the data package; the data package has to be distributed as separate binaries for the current development version of R, at least for Windows and Mac but possibly also for Fedora, in the drat repository; and the data package should be listed under Suggests: with a URL under Additional_repositories: in the DESCRIPTION file. All in all, this alternative is great for the user but requires maintenance from the developer.
2. Some mapping packages (such as marmap) download data to temporary files from external servers. This approach has the benefit that CRAN requirements are easy to fulfill, and the user does not have to store any more data than required for the application. The approach also allows specifying the resolution in the download function, which is great for "zooming" the maps. The first disadvantage is that the process is bound to take more time than simply loading locally stored map data. The second is that the map data need to be distributed in raster format (or the server has to crop vectors). At the time of writing, vector data allow easier manipulation of colors and styles in R and ggplot2 than raster data. Vectors also make sharper figures, as the elements are not bound to a resolution. The third disadvantage is that the download method (to my knowledge) has to be targeted at temporary files (i.e. they get lost when R is restarted) when writing a CRAN package, due to operating system differences. As far as I know, it is not allowed to add Rdata files to already downloaded and existing R packages, and finding a location to download data that works for all major CRAN operating systems can be difficult.
I keep on getting rejected by CRAN time after time because I have not managed to solve the data download problem. There is some help available online but I feel this issue has not been addressed sufficiently yet. The optimal solution would download sp vector shapefiles as needed when making maps (the objects can be stored in .Rdata format). This would allow the addition of detailed maps for certain frequently needed regions. The shapefiles could be stored on GitHub, which would allow quick and flexible modification of these files during development.
Have you tried using xz compression to reduce the size of your sysdata? I believe the default is gzip, with the compression level set to 6. If you use either bzip2 or xz compression when saving your package data with save(), R will use these compression algorithms in conjunction with a compression level of 9. The upshot is that you get smaller package data objects.
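As a rough illustration of the compression suggestion, the sketch below saves the same object with the default gzip and with xz, then compares file sizes. The depth_grid object is made-up example data, not anything from the mapping package:

```r
# Made-up example object standing in for package data.
depth_grid <- data.frame(lon = rep(1:100, 100),
                         lat = rep(1:100, each = 100),
                         depth = 0)

gz_file <- tempfile(fileext = ".rda")
xz_file <- tempfile(fileext = ".rda")

save(depth_grid, file = gz_file)                   # default: gzip
save(depth_grid, file = xz_file, compress = "xz")  # xz compression

# Compare the resulting file sizes in bytes.
file.size(c(gz_file, xz_file))

# Files already in a package's data/ directory can be recompressed with
# tools::resaveRdaFiles("data", compress = "xz")
```

How much space xz saves depends on the data, so it is worth checking the sizes for the actual files.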
The getNOAA.bathy() function from the marmap package has a keep argument which defaults to FALSE. If set to TRUE, the dataset downloaded from the ETOPO1 database on NOAA servers is stored locally, in the working directory of the current R session. The argument Path allows the user to specify where the dataset should be saved (version 1.0.5, available on GitHub but not on CRAN yet).
When the user calls getNOAA.bathy(), the function first checks if the requested data is available locally, either in the current working directory or in the user provided path. If it is (same bounding box and resolution), then the NOAA servers are not queried and the local data file is loaded instead. If not, the data is downloaded from NOAA servers. IMHO, this method has the following advantages:
if keep=FALSE: nothing is stored locally, which avoids adding too much clutter to the user's disk when loading many different test datasets.
if keep=TRUE: the data is stored locally. Loading the data will be much faster the next time (and it can be done offline) since everything happens locally.
In a script, the same getNOAA.bathy() call is used both to download data from NOAA servers the first time and to load local files when available. The user does not have to worry about manually saving the data, nor alter their script to load local data the next time, since the function automatically loads the data from the most appropriate source (web server or local disk).
there's no need to pack any heavy data within the package.
As far as I can tell, the only drawback is that on Windows machines, paths are limited to about 260 characters, which might cause some trouble when generating filenames to save the data. Indeed, depending on the bounding box and resolution of the data downloaded from NOAA servers, filenames can be pretty long because of floating point arithmetic. An easy fix is to round the coordinates of the bounding box (using round(), ceiling() or floor()) to a few decimal places before generating the name of the file to save.
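The check-locally-then-download logic described above can be sketched as follows. This is not marmap's actual implementation: get_cached() and its fetch argument are hypothetical, with fetch standing in for whatever function performs the real download.

```r
# Sketch of a download-once cache keyed on a rounded bounding box.
# `fetch` is a hypothetical function(lon1, lon2, lat1, lat2, res) that
# performs the real download; it is not part of marmap.
get_cached <- function(lon1, lon2, lat1, lat2, res, fetch,
                       keep = FALSE, path = getwd()) {
  box <- round(c(lon1, lon2, lat1, lat2), 2)  # rounding keeps file names short
  file <- file.path(path, paste0(
    "bathy_", paste(box, collapse = "_"), "_res", res, ".rds"))
  if (file.exists(file)) {
    return(readRDS(file))        # local copy found: no download
  }
  data <- fetch(box[1], box[2], box[3], box[4], res)
  if (keep) saveRDS(data, file)  # store for later sessions
  data
}
```

Because the file name is built from the rounded box and the resolution, a second call with the same request finds the stored file and skips the download.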
In general I wouldn't make it too hacky. I think there could be ways to trick the package to load additional data online during installation and add it to the package itself. Would be somehow nice - but I don't think it is popular with the CRAN maintainers.
What about the following?
CRAN package for the functions
Github package for your data
In the CRAN package you import devtools, and in the .onLoad function you install the GitHub data package with devtools::install_github (.onLoad is called when the package is loaded with library()/require()). You see this sometimes with package startup messages.
I could imagine the following advantages:
is not done during installation but at package load
is somehow more transparent to the user (especially if you put a message)
has only to be done once (afterwards on load can just check if the data package is there and loads it)
the data is actually in a package and not a user path
the data is there for offline use once loaded
if you check for data package version in .onLoad, you could also trigger/make an update for the data without updating the CRAN package
An implementation could look like this:
#' @import devtools
.onLoad <- function(libname, pkgname) {
  # installed.packages() returns a matrix, so compare against its row names
  if (!"wordcloud" %in% rownames(utils::installed.packages())) {
    message("installing data super dupa data package")
    devtools::install_github("ifellows/wordcloud")
  } else {
    require(wordcloud)
    message("Everything fine, ready for usage!")
  }
}
The .onLoad function just has to be put in any of your .R files. For your concrete implementation you could also refine this further. I don't have anything to do with the wordcloud package; it was just the first thing I quickly found on GitHub as an example to install with install_github.
If there is an error message saying something with staged install - you have to add StagedInstall: no to your DESCRIPTION file.
You could have a function to install the data at a chosen location, and have the path stored in an option defined in your .Rprofile: options(yourpackage.datapath = "your path"). You might suggest that the user store it in your package installation path.
The installing function first prints the code above and suggests that you copy and paste it into your .Rprofile while the data is downloading:
if (is.null(getOption("yourpackage.datapath")))
  stop('You have not defined the "yourpackage.datapath" option. Please make sure the data is installed using `yourpackage::install_yourdata`, then copy `options(yourpackage.datapath = yourpath)` to your R profile.')
You could also open the file using edit(), for instance, or place the line on the clipboard, but you probably don't want extra dependencies and I think you'd need some to do this. I don't think CRAN will let you edit the .Rprofile automatically, but this is not too bad as a manual step. The installation function could check that the option is set before even downloading.
The data can be stored in a global variable of your namespace. You just need to define an environment object in your package and a function to modify it:
globals <- new.env()
load_data <- function(path) globals$data <- readRDS(path)
Then your functions will test if globals$data is NULL before either loading the data (after checking if path option was set properly) or moving on.
Once it's done, as long as the data or RProfile are not removed, it will work forever, and if they are removed the functions will catch it and give instructions as to how to fix the issue.
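Putting these pieces together, an accessor might look like the sketch below. The name get_data() is hypothetical, and the sketch assumes the option points directly at an .rds file rather than at a directory:

```r
# Package-level environment holding the loaded data.
globals <- new.env(parent = emptyenv())

load_data <- function(path) globals$data <- readRDS(path)

# Hypothetical accessor: return the data, reading it from disk on first use.
get_data <- function() {
  if (is.null(globals$data)) {
    path <- getOption("yourpackage.datapath")
    if (is.null(path) || !file.exists(path)) {
      stop("Data not found; set options(yourpackage.datapath = <file>) first.")
    }
    load_data(path)
  }
  globals$data
}
```

Other functions in the package would then call get_data() instead of touching globals directly, so the check for a missing option or file happens in one place.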
Another option is to load the data in .onLoad, which means you'll need some logic there to deal with the first time the package is loaded. As .onLoad knows the installation path through the libname argument, you can even have the data downloaded there and load it right after checking that it's there (using a global variable as above), so there is no need for options or the .Rprofile.
As long as the user is prompted I think it will be fine with CRAN.
Two alternatives that might be of interest:
Create an additional install function that installs from Github the data package(s). The rnaturalearth package has a great example with the install_rnaturalearthhires function.
Use the pins package to register a board_url. The pins package works by downloading and storing the file on cache. Whenever it is called it looks to the original url to see if there were any changes. If there weren't, it uses the one it already has in memory. If it has no Internet connection, it also uses the one in memory. As an example we use the pins package in our covidmx package to update COVID-19 data from the Internet.

How do I unit test functions in my R package that interact with the file system?

I'm working on an R package at work. My package has gotten large enough that I've decided I need some form of repeatable testing. I settled upon using testthat and mockery. I'm not a developer, so this is the first time I'm writing tests at this level.
I deal with a lot of data files and it's very convenient to have functions in my package to help locate files. These functions interact with the file system via calls to dir. For example,
Data from one event can be split over multiple files. If I have file datafile_2017.10.20_12.00.00, I have a function that can find the next file that is part of the same event, i.e. datafile_2017.10.20_12.05.00.
My question is this: what is the best way to test functions like this? My intuition is to avoid using actual files stored somewhere else in my repository because that can fail for a number of reasons, e.g. different paths, different repo states b/w systems. I searched around and it looks like different languages have mocking libraries that allow for mocking directory structures. I haven't found anything like that for R (except for testthatsomemore, but it was removed from CRAN sometime in 2016).
Is there an R package that allows for mocking directory structures? Or am I wrong to move away from storing small test files in my repo?
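One approach that avoids both stored fixtures and a mocking library, sketched here only as an assumption about what would suit this case, is to build a throwaway directory of empty files inside the test. The function find_next_file() is a hypothetical stand-in for the file-locating helper described above:

```r
# Hypothetical function under test: returns the file that follows `current`
# in `folder`, or NA if there is none. Relies on names sorting chronologically.
find_next_file <- function(current, folder) {
  files <- sort(list.files(folder, pattern = "^datafile_"))
  i <- match(current, files)
  if (is.na(i) || i == length(files)) NA_character_ else files[i + 1]
}

# Test setup: create a temporary directory with empty, well-named files.
test_dir <- tempfile("events_")
dir.create(test_dir)
file.create(file.path(test_dir, c(
  "datafile_2017.10.20_12.00.00",
  "datafile_2017.10.20_12.05.00")))

stopifnot(identical(
  find_next_file("datafile_2017.10.20_12.00.00", test_dir),
  "datafile_2017.10.20_12.05.00"))

unlink(test_dir, recursive = TRUE)  # clean up after the test
```

With testthat, the same setup and check would go inside a test_that() block, with the stopifnot() call replaced by an expectation.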

Big R project with several packages and developers: Best setup for easy version control based on packages

I have to restructure a big project written in R, which will later consist of several packages and involve several developers. Everything is set up on a git server.
The question is: How do I manage frequent changes inside packages without having to build them every time and developers updating them after they made a new pull? Is there any best practice or automation for that? I don't want source() with unbuilt packages and R.files but would like to stick with a package like structure as much as possible. We will work in a Windows environment.
Thanks.
So I fiddled around a while, tried different setups and came up with an arrangement which fits my needs.
It basically consists of two git repositories. The first one (let's call it the base-repo) contains most of the scripts on which all later packages are based. The second repo we will call the "package-repo".
Most development work should be done on the base-repo. The base-repo is under CI control via a build server and unit tests.
The package-repo contains folders for each package we want to build and the base-repo as a git-submodule.
Each package can now be constructed via a very simple bash/shell script (“build script”):
check out a commit/tag of the submodule base-repo on which the stable package build should be based
copy files which are necessary for the package into the specific package folder
checks and builds the package
script can also create a history file of package
script can either be invoked manually or by a build server
This approach can also be combined with packrat. Additional code which is very package specific can now also be added to the package-repo, where it is under version control while independent from the base-repo.
The approach could be further extended to trigger the build of packages from the package-repo based on pushes to the base-repo. Packages with a build script pointing to master as a commit will always be up to date and if under control of a build server it will ensure that changes to the base-repo will not break the package. Also it is possible to create several packages containing the same scripts from base-repo.
See also: git: symlink/reference to a file in an external repository

update of a R package which is lazy loaded

I have several unix servers using a R package which is installed on a shared R library folder. The packages are lazy loaded (that's the default) from this shared folder.
Now I want to update the package:
1) is it possible (and clean) to do that without closing all R instances?
2) More precisely, I am concerned about the following:
2)a) The warning I get from the user interface when I try to install a package that is already loaded:
2)b)
From https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Lazy-loading,
When a package/namespace which uses it is loaded, the package/namespace environment is populated with promises for all the named objects: when these promises are evaluated they load the actual code from a database.
Does that mean that the R instance will read again from the library folder when doing the actual evaluation of each object (in which case that means I need to either deactivate the lazy loading, or close all R instances before updating the package)
3) is there an alternative way to maintain R packages on a network of servers that are running scripts all the time, without having to take each server offline one by one?
Thanks for your input
You asked
1) is it possible (and clean) to do that without closing all R instances?
and I can assure you that yes, that is how it works and how it is done everywhere.
As for
2) More precisely, I am concerned about the following:
you are reading it wrong. An R restart is simply recommended to ensure the new package is loaded as you cannot insert it into a running session.
Further
3) is there an alternative way to maintain R packages on a network of servers that are running scripts all the time, without having to take each server offline one by one?
you never have to take a server off-line just to update a user-space package. E.g. we don't even take them off-line when we, say, upgrade the entire Ubuntu release twice a year.

Is there a persistent location that is always writable which can be used as data cache by a package?

Is there a predefined location where an R package could store cached data? The data should persist across sessions. I was thinking about creating a subdirectory of ${R_LIBS_USER}/package_name, but I'm not sure if this is portable and if this is "allowed" if my package is installed systemwide.
The idea is the following: Create an R script mydata.R in the data subdirectory of the package which would be executed by calling data(mydata) (according to the documentation of data()). This script would load the data from the internet and cache it, if it hasn't been cached before. (If the data has been cached already, the cache will be used.) In addition, a function will be provided to invalidate the cache and/or to check if a newer version of the data is available online.
This is from the documentation of data():
Currently, four formats of data files are supported:
files ending ‘.R’ or ‘.r’ are source()d in, with the R working directory changed temporarily to the directory containing the respective file. (data ensures that the utils package is attached, in case it had been run via utils::data.)
...
Indeed, creating a file fortytwo.R in the data subdirectory of a package with the following contents:
fortytwo = data.frame(answer=42)
and then executing data(fortytwo) creates a data frame variable fortytwo. Now the question is: Where would fortytwo.R cache the data if it were difficult to compute?
EDIT: I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it. The question concerns the "data" package: Where can it store files in a per-user storage so that it is persistent across R sessions and is accessible from different R projects?
Related: Package that downloads data from the internet during installation.
There is no absolutely defined location for package-specific persistent caching in R. However, the R.cache package provides an interface for creating and managing cached data. It looks like it could be useful for your scenario.
When users load R.cache (library(R.cache)), they get the following prompt:
The R.cache package needs to create a directory that will hold cache files.
It is convenient to use one in the user's home directory, because it remains
also after restarting R. Do you wish to create the '~/.Rcache/' directory? If
not, a temporary directory (/tmp/RtmpqdUcbP/.Rcache) that is specific to this
R session will be used. [Y/n]:
They can then choose to create the cache directory in their home directory, which is presumably persistent, or to create a session-specific directory. If you make your data package depend on R.cache, you could check for the existence of the cached object(s) in its .onLoad() hook function and download the data if it isn't there. Alternatively, you could do this in the way suggested in your own question.
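The check-then-download step can also be sketched in base R without R.cache. This swaps in a different technique from the one in the answer: since R 4.0.0, tools::R_user_dir() returns a per-user directory for a package. The names cached_file() and invalidate_cache(), the package name "mypkg" and the download argument are placeholders for illustration:

```r
# Sketch: download once into a per-user cache directory, reuse afterwards.
# "mypkg" and the function names are placeholders. The default cache_dir
# uses tools::R_user_dir(), which requires R >= 4.0.0.
cached_file <- function(url, name,
                        cache_dir = tools::R_user_dir("mypkg", which = "cache"),
                        download = utils::download.file) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, name)
  if (!file.exists(dest)) {
    download(url, dest, mode = "wb")  # only reached on a cache miss
  }
  dest
}

# Removing the file invalidates the cache; the next call downloads again.
invalidate_cache <- function(name,
                             cache_dir = tools::R_user_dir("mypkg", which = "cache")) {
  unlink(file.path(cache_dir, name))
}
```

A call such as cached_file(url, "mydata.rds") could then be made from .onLoad() or from the data script, and passing a different cache_dir makes the function easy to try out in a temporary directory.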
Have you looked at in-memory databases? H2 and Redis have bindings in R via RH2 and rredis. Both allow you to share the data across R sessions as long as the creating session is alive. In order to have it persist across non-concurrent sessions, you need to write your data to disk (assuming you can't re-create it on the fly, which would defeat the purpose of this question), and I believe the data package would be a good option. That way, you could add an update function that runs every time you load either package (i.e. if the code package has the right dependencies).
An example is RWeka & RWekaJars packages. Look them up on CRAN, and it should be fairly easy to understand how they work.
