Is there a persistent location that is always writable which can be used as a data cache by an R package?

Is there a predefined location where an R package could store cached data? The data should persist across sessions. I was thinking about creating a subdirectory of ${R_LIBS_USER}/package_name, but I'm not sure whether this is portable, or whether it is "allowed" when my package is installed system-wide.
The idea is the following: Create an R script mydata.R in the data subdirectory of the package which would be executed by calling data(mydata) (according to the documentation of data()). This script would load the data from the internet and cache it, if it hasn't been cached before. (If the data has been cached already, the cache will be used.) In addition, a function will be provided to invalidate the cache and/or to check if a newer version of the data is available online.
This is from the documentation of data():
Currently, four formats of data files are supported:
files ending ‘.R’ or ‘.r’ are source()d in, with the R working directory changed temporarily to the directory containing the respective file. (data ensures that the utils package is attached, in case it had been run via utils::data.)
...
Indeed, creating a file fortytwo.R in the data subdirectory of a package with the following contents:
fortytwo = data.frame(answer=42)
and then executing data(fortytwo) creates a data frame variable fortytwo. Now the question is: Where would fortytwo.R cache the data if it were difficult to compute?
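For concreteness, here is a sketch of the kind of data/mydata.R script I have in mind. The URL is made up, and tempdir() below merely stands in for the persistent cache location this question is asking about:
# data/mydata.R -- sourced via data(mydata); R temporarily sets the working
# directory to the package's data/ directory while this runs.
# NOTE: tempdir() is only a stand-in; a persistent, portable cache_dir is
# exactly what this question asks for.
cache_dir  <- file.path(tempdir(), "mydata_cache")
cache_file <- file.path(cache_dir, "mydata.rds")

if (file.exists(cache_file)) {
  mydata <- readRDS(cache_file)            # reuse the cached copy
} else {
  # hypothetical URL standing in for the real data source
  mydata <- read.csv("https://example.org/mydata.csv")
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  saveRDS(mydata, cache_file)              # cache for later sessions
}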
EDIT: I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it. The question concerns the "data" package: Where can it store files in a per-user storage so that it is persistent across R sessions and is accessible from different R projects?
Related: Package that downloads data from the internet during installation.

There is no absolutely defined location for package-specific persistent caching in R. However, the R.cache package provides an interface for creating and managing cached data. It looks like it could be useful for your scenario.
When users load R.cache (library(R.cache)), they get the following prompt:
The R.cache package needs to create a directory that will hold cache files.
It is convenient to use one in the user's home directory, because it remains
also after restarting R. Do you wish to create the '~/.Rcache/' directory? If
not, a temporary directory (/tmp/RtmpqdUcbP/.Rcache) that is specific to this
R session will be used. [Y/n]:
They can then choose to create the cache directory in their home directory, which is presumably persistent, or to create a session-specific directory. If you make your data package depend on R.cache, you could check for the existence of the cached object(s) in its .onLoad() hook function and download the data if it isn't there. Alternatively, you could do this in the way suggested in your own question.
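If you go the R.cache route, a rough sketch of what the caching function in your data package could look like (the key, URL and function name are made up; see loadCache() and saveCache() in R.cache for details):
library(R.cache)

get_mydata <- function(force = FALSE) {
  key <- list("mydata", "v1")              # hypothetical cache key
  data <- if (!force) loadCache(key) else NULL
  if (is.null(data)) {
    # hypothetical URL standing in for the real data source
    data <- read.csv("https://example.org/mydata.csv")
    saveCache(data, key = key)             # persisted under ~/.Rcache by default
  }
  data
}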

Have you looked at in-memory databases? H2 and Redis have R bindings via RH2 and rredis; both allow you to share data across R sessions, but only as long as the creating session is alive. To have the data persist across non-concurrent sessions, you need to write it to disk (assuming you can't re-create it on the fly, which would defeat the purpose of this question), and I believe a data package would be a good option. That way, you could add an update function that runs every time you load either package (i.e. if the code package has the right dependencies).
An example is the RWeka and RWekaJars packages. Look them up on CRAN, and it should be fairly easy to understand how they work.

Related

How to make a CRAN package to download data only once regardless of OS?

The CRAN policy limits R package size to 5 Mb, which is little for graphical applications such as mapping. There are multiple ways of handling the package size limitations, all of which come with their drawbacks. The alternatives have been listed below.
My question is: how to make an R package download data files only once (i.e. they are saved to a place where R finds them after restarting)? The solution should work for all common CRAN platforms.
I have been developing a mapping package for R which is supposed to plot bathymetric maps anywhere around the globe in ggplot2. Below I list the alternatives I have come across for handling large data files in CRAN packages. The alternatives are written with map-making in mind but apply to any case where large single files are required:
Moving large files to a data package and making the original package depend on the data package.
a) If the data package is <5 Mb, it can be uploaded to CRAN, and one can make the original package Depend on or Import the data package in its DESCRIPTION file. Users can simply use the install.packages() function as they would with any other CRAN package. Things work CRANtastic and everyone is happy.
b) If the data package is >5 Mb, things get messy. One alternative, in theory, would be to make a separate data package for each file, given that the data files are all <5 Mb. Then one could use the approach in 1a for each data package. This alternative is so hacky that I have not had the nerve to try it in practice. It would be interesting to hear in the comments if someone has.
c) Another, better alternative is to use the drat package to host a data package, for example on GitHub. This alternative has the benefit that the user can simply call install.packages() to install the original package from CRAN, but it also has quite a few disadvantages for the developer. Setting up the data package to pass all CRAN checks can be slightly challenging, as all the steps have not been correctly specified anywhere online at the moment: the original package has to ask for permission to install the data package; the data package has to be distributed as separate binaries for the current development version of R, at least for Windows and Mac, but possibly also for Fedora, in the drat repository; the data package should be listed as Suggests: with a URL under Additional_repositories: in the DESCRIPTION file; to mention some surprises I have encountered so far. All in all, this alternative is great for the user but requires maintenance from the developer.
Some mapping packages (such as marmap) download data to temporary files from external servers. This approach has the benefit that CRAN requirements are easy to fulfill, and the user does not have to store any more data than required for the application. The approach also allows specifying the resolution in the download function, which is great for "zooming" the maps. The disadvantages are that the process is bound to take more time than simply storing the map data locally. Another disadvantage is that the map data need to be distributed in raster format (or the server has to crop vectors). At the time of writing, vector data allow easier manipulation of colors and styles in R and ggplot2 than raster data. Vectors also make sharper figures as the elements are not bound to resolution. The third disadvantage is that the download method (to my knowledge) has to be targeted to temporary files (i.e. they get lost when R is restarted) when writing a CRAN package due to operating system differences. As far as I know, it is not allowed to add Rdata files to already downloaded and existing R packages, and finding a location to download data that works for all major CRAN operating systems can be difficult.
I keep getting rejected by CRAN time after time because I have not managed to solve the data download problem. There is some help available online, but I feel this issue has not been addressed sufficiently yet. The optimal solution would download sp vector shapefiles as needed when making maps (the objects can be stored in .Rdata format). This would allow the addition of detailed maps for certain frequently needed regions. The shapefiles could be stored on GitHub, which would allow quick and flexible modification of these files during development.
Have you tried using xz compression to reduce the size of your sysdata? I believe the default is gzip, with the compression level set to 6. If you use either bzip2 or xz compression when saving your package data with save(), R will use these compression algorithms in conjunction with a compression level of 9. The upshot is that you get smaller package data objects.
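For example (a sketch; the object name and paths are placeholders):
# Save a package data object with xz compression (object and path are placeholders).
save(bathymetry, file = "data/bathymetry.rda", compress = "xz")

# Or recompress already-existing .rda files, e.g. internal sysdata:
tools::resaveRdaFiles("R/sysdata.rda", compress = "xz")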
The getNOAA.bathy() function from the marmap package has a keep argument which defaults to FALSE. If set to TRUE, the dataset downloaded from the ETOPO1 database on NOAA servers is stored locally, in the working directory of the current R session. The argument Path allows the user to specify where the dataset should be saved (version 1.0.5, available on GitHub but not on CRAN yet).
When the user calls getNOAA.bathy(), the function first checks if the requested data is available locally, either in the current working directory or in the user provided path. If it is (same bounding box and resolution), then the NOAA servers are not queried and the local data file is loaded instead. If not, the data is downloaded from NOAA servers. IMHO, this method has the following advantages:
if keep=FALSE: nothing is stored locally, which avoids adding too much clutter to the user's disk when loading many different test datasets.
if keep=TRUE: the data is stored locally. Loading the data will be much faster the next time (and it can be done offline) since everything happens locally.
In a script, the same getNOAA.bathy() call is used both to download data from the NOAA servers the first time and to load the local files when they are available. The user does not have to worry about manually saving the data, nor about altering their script to load local data the next time, since the function automatically loads the data from the most appropriate source (web server or local disk).
there's no need to pack any heavy data within the package.
As far as I can tell, the only drawback is that on Windows machines, paths are limited to 250 characters, which might cause some trouble when generating filenames to save the data. Indeed, depending on the bounding box and resolution of the data downloaded from the NOAA servers, filenames can be pretty long due to floating point arithmetic. An easy fix is to round the coordinates of the bounding box (using either round(), ceiling() or floor()) to a few decimal places before generating the name of the file to save.
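For reference, a minimal usage sketch (the bounding box and resolution are arbitrary; with keep = TRUE the second identical call loads the local file instead of querying NOAA again):
library(marmap)

# First call: downloads from the NOAA servers and saves a local copy.
b <- getNOAA.bathy(lon1 = -10, lon2 = 0, lat1 = 50, lat2 = 60,
                   resolution = 10, keep = TRUE)

# Second identical call: finds the local file and skips the download.
b <- getNOAA.bathy(lon1 = -10, lon2 = 0, lat1 = 50, lat2 = 60,
                   resolution = 10, keep = TRUE)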
In general I wouldn't make it too hacky. There could be ways to trick the package into downloading additional data during installation and adding it to the package itself. That would be somewhat nice, but I don't think it would be popular with the CRAN maintainers.
What about the following?
CRAN package for the functions
Github package for your data
In the CRAN package you import devtools, and in the .onLoad hook you install the GitHub data package with devtools::install_github(). (.onLoad is called when the package is loaded with library()/require().) You sometimes see this used for package startup messages.
I could imagine the following advantages:
is not done during installation but at package load
is somewhat more transparent to the user (especially if you put a message)
only has to be done once (afterwards .onLoad can just check if the data package is there and load it)
the data is actually in a package and not a user path
the data is there for offline use once loaded
if you check for data package version in .onLoad, you could also trigger/make an update for the data without updating the CRAN package
An implementation could look like this:
#' @import devtools
.onLoad <- function(libname, pkgname) {
  if (!"wordcloud" %in% rownames(utils::installed.packages())) {
    message("installing data super dupa data package")
    devtools::install_github("ifellows/wordcloud")
  } else {
    require(wordcloud)
    message("Everything fine, ready for usage!")
  }
}
The .onLoad function just has to be placed in any of your .R files. For your concrete implementation you could also refine this further. I don't have anything to do with the wordcloud package; it was just the first thing I quickly found on GitHub as an example to install with install_github().
If there is an error message mentioning staged install, you have to add StagedInstall: no to your DESCRIPTION file.
You could have a function to install the data at a chosen location, and have the path stored in an option defined in your .Rprofile: options(yourpackage.datapath = "your/path"). You might suggest that the user stores it in your package installation path.
The installing function first prints that options() line and proposes that you copy and paste it into your .Rprofile while the data is downloading. Your package functions can then check that the option is set, for example:
if (is.null(getOption("yourpackage.datapath")))
  stop('You have not defined the "yourpackage.datapath" option. Please make sure the data is installed using `yourpackage::install_yourdata()`, then copy `options(yourpackage.datapath = "your/path")` to your .Rprofile.')
You could also open the .Rprofile for the user with edit(), for instance, or place the line in the clipboard, but you don't want extra dependencies and I think you would need some to do that. I don't think CRAN will let you edit the .Rprofile automatically, but this is not too bad as a manual action. The installation function could check that the option is set before even downloading.
The data can be stored in a global variable of your namespace. You just need to define an environment object in your package and a function to modify it:
globals <- new.env()
load_data <- function(path) globals$data <- readRDS(path)
Then your functions test whether globals$data is NULL before either loading the data (after checking that the path option is set properly) or moving on.
Once that is done, as long as the data and the .Rprofile line are not removed, it will work forever, and if they are removed the functions will catch it and give instructions on how to fix the issue.
Another option is to load the data in .onLoad; this means you'll have some logic in there to deal with the first time the package is loaded. Since .onLoad knows the installation path through its libname argument, you can even choose to download your data there, and load it right after checking that it's there (using a global variable as above), so there is no need for options and .Rprofile.
As long as the user is prompted I think it will be fine with CRAN.
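A rough sketch of that .onLoad variant (the URL and file name are hypothetical, and a real package should prompt the user before downloading):
globals <- new.env()

.onLoad <- function(libname, pkgname) {
  data_dir  <- file.path(libname, pkgname, "extdata")
  data_file <- file.path(data_dir, "mydata.rds")
  if (!file.exists(data_file)) {
    dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
    # hypothetical URL standing in for the real data source
    utils::download.file("https://example.org/mydata.rds", data_file, mode = "wb")
  }
  globals$data <- readRDS(data_file)   # cached in the package namespace for this session
}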
Two alternatives that might be of interest:
Create an additional install function that installs the data package(s) from GitHub. The rnaturalearth package has a great example with the install_rnaturalearthhires() function.
Use the pins package to register a board_url. The pins package works by downloading the file and storing it in a cache. Whenever it is called, it checks the original URL to see if there were any changes. If there weren't, it uses the copy it already has cached; if there is no Internet connection, it also falls back to the cached copy. As an example, we use the pins package in our covidmx package to update COVID-19 data from the Internet.
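A minimal sketch of the pins approach (the board name and URL are hypothetical):
library(pins)

# Register a URL board; the file is cached locally and only re-downloaded
# when the remote copy changes (hypothetical URL).
board <- board_url(c(mydata = "https://example.org/mydata.csv"))

# Returns the path to the cached local copy.
path <- pin_download(board, "mydata")
mydata <- read.csv(path)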

How do I unit test functions in my R package that interact with the file system?

I'm working on an R package at work. My package has gotten large enough that I've decided I need some form of repeatable testing. I settled upon using testthat and mockery. I'm not a developer, so this is the first time I'm writing tests at this level.
I deal with a lot of data files and it's very convenient to have functions in my package to help locate files. These functions interact with the file system via calls to dir(). For example,
Data from one event can be split over multiple files. If I have file datafile_2017.10.20_12.00.00, I have a function that can find the next file that is part of the same event, i.e. datafile_2017.10.20_12.05.00.
My question is this: what is the best way to test functions like this? My intuition is to avoid using actual files stored somewhere else in my repository because that can fail for a number of reasons, e.g. different paths or different repo states between systems. I searched around and it looks like different languages have mocking libraries that allow for mocking directory structures. I haven't found anything like that for R (except for testthatsomemore, but it was removed from CRAN sometime in 2016).
Is there an R package that allows for mocking directory structures? Or am I wrong to move away from storing small test files in my repo?
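You may not need to mock the directory structure at all: a throwaway directory built with tempdir()/tempfile() inside each test often suffices, provided the function under test accepts the directory to search as an argument. A sketch with testthat (find_next_file() is a hypothetical stand-in for your own function):
library(testthat)

test_that("the next file of an event is found", {
  tmp <- file.path(tempdir(), "fake_data")
  dir.create(tmp, showWarnings = FALSE)
  on.exit(unlink(tmp, recursive = TRUE), add = TRUE)

  # build a minimal fake directory tree
  file.create(file.path(tmp, c("datafile_2017.10.20_12.00.00",
                               "datafile_2017.10.20_12.05.00")))

  # find_next_file() is hypothetical; giving it a dir argument lets tests
  # point it at the fake tree instead of the real data location.
  expect_equal(find_next_file("datafile_2017.10.20_12.00.00", dir = tmp),
               "datafile_2017.10.20_12.05.00")
})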

R package with code that only runs once (per installation)

I'd like to create an R package that, upon installation, displays contact information for the maintainer and ask the user for permission to count them in our list of installations. It would also be acceptable to have the code run the first time the user calls one of our functions, instead of immediately on installation. Either way, this message should only appear once ever (unless the user reinstalls / updates the package).
What I've considered:
I know how to include a dataset for internal use, but I don't know how to change that data permanently.
We could set an environment variable / app setting, but I don't know if there's a way to make that persist after the end of the session.
Using an external service / server would be excessively heavyweight, and wouldn't allow users who don't want to be tracked to turn off the message.
Is there a good way to do this?
This can run more than once but only within a limited time window so perhaps it is good enough.
Add this code to your package and it will issue the message any time the package is loaded within 7 days of installation and thereafter it will not issue the message again until the package is updated.
It works by comparing the time the install files were created to the current time. It does not require write permissions to any directory, only read, so it should work generally.
.onLoad <- function(libname, pkgname) {
  ctime <- file.info(find.package(pkgname, libname))$ctime
  if (difftime(Sys.time(), ctime, units = "days") < 7)
    packageStartupMessage("This msg will go away one week after installing this package")
}
You may have to bite the bullet and store state information across sessions to show it once and only once.
Some packages which may help:
settings which retrieves user configuration settings
config which retrieves configuration information
httr which accesses config info
registry which offers a registry
pkgconfig offers private configuration.
but I am not sure which one reads and writes. Maybe the last one fits the bill.
Edit: Turns out that even pkgconfig does not persist values across sessions. I have solved this problem with company-local code when I had control over directories or databases to write. For public and portable code it is a little harder. I still think there is a package out there that stores user-level config on all major OSs but I cannot for now remember the name.
Edit 2: With a nod to Gabor Csardi for refreshing my memory: the rappdirs package solves the problem of portably supplying a per-user config location (with other tricks too; it is a port of a corresponding Python library). Combine this with a simple csv or rds file to store when (if at all) you last showed the message, and you can now show it once and exactly once. Not even again after a package upgrade.
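A sketch of that combination (the package name, message and flag file are placeholders):
.onLoad <- function(libname, pkgname) {
  cfg_dir   <- rappdirs::user_config_dir("yourpackage")   # portable per-user path
  flag_file <- file.path(cfg_dir, "message_shown.rds")
  if (!file.exists(flag_file)) {
    packageStartupMessage("Contact the maintainer at ...")  # placeholder, shown once ever
    dir.create(cfg_dir, recursive = TRUE, showWarnings = FALSE)
    saveRDS(Sys.time(), flag_file)                          # record that it was shown
  }
}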
The following code allows you to create a file in the package library:
activate_file = paste(system.file('extdata', package = 'your_package'), 'activated.txt', sep = '/')
file.exists(activate_file)
# FALSE
file.create(activate_file)
file.exists(activate_file)
# TRUE
Now you can check in .onLoad whether or not the activated.txt file exists. The first time you show the message, you then create activated.txt, and the next time the package is used the .onLoad function sees the file and can skip the message.
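A sketch of that .onLoad check (the message text is a placeholder):
.onLoad <- function(libname, pkgname) {
  activate_file <- file.path(system.file('extdata', package = pkgname),
                             'activated.txt')
  if (!file.exists(activate_file)) {
    packageStartupMessage("Welcome! This message is shown only once.")  # placeholder
    file.create(activate_file)   # marks the message as shown for this installation
  }
}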
Advantages:
Persistent over sessions.
Platform-independent, and since the user was able to install the package into that library, they should have write privileges to create the file there.
Disadvantages:
Reinstall/upgrade wipes the activated file, thus showing the message again.
If this is not acceptable, you could try to find a persistent location, e.g. in the home directory (e.g. ~/.your_package/activated.txt). Then the challenge is to make this platform independent. Maybe look at path.expand("~") to get the current user's home directory; I am not sure if this works on Windows.

update of a R package which is lazy loaded

I have several Unix servers using an R package which is installed in a shared R library folder. The packages are lazy-loaded (that's the default) from this shared folder.
Now I want to update the package:
1) is it possible (and clean) to do that without closing all R instances?
2) More precisely, I am concerned about the following:
2)a) The warning I get from the user interface when I try to install a package that is already loaded:
2)b)
From https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Lazy-loading,
When a package/namespace which uses it is loaded, the package/namespace environment is populated with promises for all the named objects: when these promises are evaluated they load the actual code from a database.
Does that mean that the R instance will read again from the library folder when doing the actual evaluation of each object? (In which case I would need either to deactivate lazy loading, or to close all R instances before updating the package.)
3) is there an alternative way to maintain R packages on a network of servers that are running scripts all the time, without having to put each server offline one by one?
Thanks for your input
You asked
1) is it possible (and clean) to do that without closing all R instances?
and I can assure you that yes, that is how it works and how it is done everywhere.
As for
2) More precisely, I am concerned about the following:
you are reading it wrong. An R restart is simply recommended to ensure the new package is loaded as you cannot insert it into a running session.
Further
3) is there an alternative way to maintain R packages on a network of servers, that are running scripts all the time, without having to put each server offline one by one)
you never have to take a server off-line just to update a user-space package. E.g. we don't even take them off-line when we, say, upgrade the entire Ubuntu release twice a year.

Sourcing an R script whenever a package is loaded

I'm creating an R package for handling a specific dataset that is regularly updated in our organization, but not on a fixed schedule (making it unsuitable for something such as a cron job). As a result, users must currently run a set of two scripts for data processing before they begin to analyze the data. In converting this set of functions into a package, I'm hoping to alleviate this by having the scripts be called whenever the package is first loaded into R (with analogous functions if people would like to manually check for an update in the middle of a multi-day session).
I've seen ways to deal with compiling external files upon package installation, but nothing on how to get R to run a script whenever the package is loaded (not just installed). Does anyone know if this is possible, and if so, how to do it?
Thanks!
These functions are outlined in the Writing R Extensions guide (which, if you're writing a package, you should be reading carefully), specifically section 1.5.3, Load hooks.
You can define an .onLoad function that will be called when your package loads.
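A minimal sketch, assuming your two processing scripts are wrapped in an exported function called update_data() (a hypothetical name):
.onLoad <- function(libname, pkgname) {
  # update_data() is a hypothetical wrapper around the two processing scripts;
  # exporting it also lets users re-run the check in the middle of a session.
  update_data()
}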
