R - Automatic Creation of Data Packages

I have data on a server in the form of SAS data sets that are updated daily. I would like these to be packaged auto-magically into R packages and dropped into a package repository on the server. This would let my co-workers and me easily work with the packaged data in R and keep up to date as it changes each day, simply by calling install.packages and update.packages.
What is a good way to implement this automatic creation of data packages?
I have written some code that pulls in the data set, converts it, and then uses package.skeleton() to dynamically create the package structure. I then have to overwrite the DESCRIPTION file to update the version, along with some other edits, and finally build and check the package and drop it in the repository. Is there a better way?

One thing you can do is create an R script under your data folder that loads the data:
data/
  sas_data.R
In sas_data.R you write the code that loads the data from the server, something like:
download.file(url, dest_file)
## process the downloaded file here
sas_data <- read.table(dest_file)
Then you load it by calling data():
data(sas_data)

I would recommend using a makefile to automate the conversion of the datasets. This is especially useful when there are multiple datasets and the conversion process is time consuming.
I am assuming that the SAS files are in a directory called sas; the makefile is shown below.
By typing make data, all the *.sas7bdat files are read from the sas directory using the sas7bdat package and saved as *.rda files of the same name in the data directory of the package. You can automate further by adding package installation to the makefile and using a continuous integration service like Travis CI, so that your R package is always up to date.
I have created a sample repo to illustrate the idea. This is an interesting question, and I think it makes sense to develop a simple, flexible and robust approach to data packaging.
SAS_FILES = $(wildcard sas/*.sas7bdat)
RDA_FILES = $(patsubst sas/%.sas7bdat, data/%.rda, $(SAS_FILES))

data: $(RDA_FILES)

data/%.rda: sas/%.sas7bdat
	Rscript -e "library(sas7bdat); library(tools); fname = file_path_sans_ext(basename('$<')); assign(fname, read.sas7bdat('$<')); save(list = fname, file = '$@')"
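To close the loop on the rest of the original question (bumping the version in DESCRIPTION, building, and dropping the result into the repository), that part can be scripted too. The following is only a minimal sketch: the datapkg/ directory, the date-based version scheme, and the repository path are assumptions, and it relies on devtools being installed.
## minimal sketch -- paths and version scheme are assumptions
desc_path <- file.path("datapkg", "DESCRIPTION")
desc <- read.dcf(desc_path)
desc[1, "Version"] <- format(Sys.Date(), "%Y.%m.%d")   # date-based version
write.dcf(desc, file = desc_path)

devtools::check("datapkg")                      # optional sanity check
tarball <- devtools::build("datapkg")           # returns the path to the built .tar.gz
repo <- "/srv/R/repo/src/contrib"               # assumed layout of the local repository
file.copy(tarball, repo, overwrite = TRUE)
tools::write_PACKAGES(repo, type = "source")    # refresh the repository index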

Related

Is there an R function to make a copy of all the source code used to generate an analysis?

I have a file run_experiment.rmd which performs an analysis on data using a bunch of .r scripts in another folder.
Every analysis is saved into its own timestamped folder. I save the outputs of the analysis, the inputs used, and if possible I would also like to save the code used to generate the analysis (including the contents of both the .rmd file and the .r files).
The reason is that if I change the way my analyses are run and then re-run an analysis with the updated files, I will get different results. I would like to keep legacy versions of the code so that I can always, if need be, re-run the original analysis.
Have you considered using a git repository to commit your code and output each time you update or run it? I think this is the best fit for what you are describing: each commit has a timestamp associated with it, so you can roll back to a previous version whenever needed.
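If you go that route, the commit step itself can be driven from R at the end of each run. A minimal sketch, assuming git is on the PATH and the timestamped run-folder layout described in the question; the folder names and commit-message format are assumptions:
## minimal sketch -- folder layout and commit message are assumptions
run_dir <- file.path("analyses", format(Sys.time(), "%Y%m%d_%H%M%S"))
dir.create(run_dir, recursive = TRUE)
## ... copy inputs, outputs, run_experiment.rmd and the .r scripts into run_dir ...
system2("git", c("add", "-A"))
system2("git", c("commit", "-m", shQuote(paste("analysis run", basename(run_dir)))))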
The best way to do this is to put all of those scripts into an R package, and in your Rmd file print sessionInfo() to record the package versions used.
You should change the version number of the package each time you make a non-trivial change to it (or better, with every change).
Then, when you want to reproduce the analysis, you use the sessionInfo() listing to work out which versions of R and of the packages to install, and you'll get the same environment.
There are packages to help with this (pak and renv, maybe others), but I haven't used them, so I can't give details or recommendations.
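For example, the record-keeping step might look roughly like this; the output path is an assumption, and the renv lines are illustrative only, since the answer above explicitly does not vouch for those packages:
## minimal sketch -- output path is an assumption
writeLines(capture.output(sessionInfo()), "output/sessionInfo.txt")

## illustrative only: renv can snapshot exact package versions per project
# renv::init()       # create a project-local library and renv.lock
# renv::snapshot()   # record the package versions currently in use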

How to source an R project that uses here::here() links from another R project that also uses here::here() links

I have an RStudio project with a main.R file (see sample code) that sources a few other scripts within the project using here::here(), and those scripts also use here() themselves. This first project produces a dataset that I would like to use in a second RStudio project, which also uses here() and has a similar structure with its own main.R script.
First project
library(here)
here::here()
#1. load packages
source(paste0(here::here(),"/R/load_packages.R"))
#2. load UDF functions
source(paste0(here::here(),"/R/functions.R"))
#3. Load BA data
source(paste0(here::here(),"/analysis/load_ba.R"))
#4. Load CDS data
source(paste0(here::here(),"/analysis/load_cds.R"))
#5. Calculate
source(paste0(here::here(),"/analysis/calculate.R"))
Second project
library(here)
here::here()
#load packages
source(base::paste0(here::here(),"/analysis/packages.R"))
#load and manipulate pop/ds data
source("first project full file path/main.R")
So my question is: what is the best way to source the first main.R file, which produces the dataset I want to use, from the second RStudio project without the here() links breaking?
One option is to write the output dataset to CSV and then read it in, but maybe there is a better way?
The Right Way to do this is to build any code you want to re-use as a package that exports functions (and maybe fixed datasets) for other code to use.
I can think of hacky ways to do what you want; most of them rely on passing a folder from the calling code in a global variable, or on changing the working directory to the called code with setwd() (or withr::with_dir()), but these are messy. Make packages and create functions instead; a sketch of that approach follows the caveat below.
You might be tempted by the whereami package and its thisfile() function. But read the help: even the authors don't want you to use it:
*CAVEAT*: Use this function only if your workflow does not permit other solution: if a script needs to know its location, it should be set outside the context of the script if possible.
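As a rough illustration of the package route: the first project's scripts could become exported functions, so the second project never has to reach into the first project's folders at all. Everything below is a sketch with hypothetical names (baDataPkg, build_ba_dataset() and the internal helpers do not appear in the question):
## inside the hypothetical package "baDataPkg" built from the first project
build_ba_dataset <- function() {
  ba  <- load_ba()     # former analysis/load_ba.R, now an internal function
  cds <- load_cds()    # former analysis/load_cds.R
  calculate(ba, cds)   # former analysis/calculate.R, returns the finished dataset
}

## in the second project: no paths into the first project are needed
# library(baDataPkg)
# dat <- build_ba_dataset()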

Do .Rout files preserve the R working environment?

I recently started looking into Makefiles to keep track of the scripts in my research project. To really understand what is going on, I would like to understand the contents of the .Rout files produced by R CMD BATCH a little better.
Christopher Gandrud uses a Makefile for his book Reproducible Research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, and the third merges both datasets. They are invoked by the following lines of the Makefile:
# Key variables to define
RDIR = .

# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
	R CMD BATCH $<
Neither of the first two files outputs data, nor does the merge script explicitly import data; it just uses the objects created by the first two scripts. So how is the data preserved between the scripts?
To me it seems as though the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other, or is it a property of the batch execution itself?
If the working environment really is preserved between the scripts, I see a lot of potential for issues: objects with the same names, or functions with the same names coming from different packages. Another issue with this setup seems to be that the Makefile cannot propagate changes in the first two files downstream, because there is no explicit input/prerequisite for the merge script.
I would appreciate learning whether my intuition is right and whether there are better ways to execute R files in a Makefile.
By default, R CMD BATCH saves your workspace to a hidden .RData file after running, unless you pass --no-save. That's why it's not really the recommended way to run R scripts. The recommended way is Rscript, which does not save by default; you must write code that explicitly saves whatever you want to keep. This is separate from the .Rout file, which only contains the output of the commands run in the script.
So in this case execution doesn't happen in the exact same environment: R is still called three times, but the workspace is serialized to .RData and reloaded between runs.
You are correct that saving and re-loading workspaces by default can cause a lot of problems, which is why most people recommend against it. In this case the author just figured it made their workflow easier, so they used it. In general, though, it would be better to be explicit about input and output files.
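To make the data flow explicit (and visible to make as prerequisites), each gather script could write its result out and the merge script could read those files back in. A minimal sketch; the object and file names are placeholders, not the ones used in the book's repository:
## at the end of the first gather script
saveRDS(data1, "data/data1.rds")

## at the end of the second gather script
saveRDS(data2, "data/data2.rds")

## at the start of the merge script
data1 <- readRDS("data/data1.rds")
data2 <- readRDS("data/data2.rds")
merged <- merge(data1, data2)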

Building R packages - using environment variables in DESCRIPTION file?

At our site we have a large amount of custom R code that is used to build a set of packages for internal use and distribution to our R users. We try to maintain the entire library under a versioning scheme so that the version numbers and the date are the same across packages. The problem is that the number of packages has become substantial enough that manually modifying the DESCRIPTION files and the package .Rd files is very time consuming, and it would be nice to automate these pieces.
We could write a pre-build script that goes through the full set of files and writes in the current date and version number. That could be done without a lot of pain, but it would modify our current build chain and we would have to adapt the various steps.
Is there a way to do this without a pre-build file-modification step? In other words, can the DESCRIPTION file and the .Rd files contain something akin to an environment variable that is substituted with the current information when R CMD build is called?
You cannot use environment variables: when running R CMD build ... or R CMD INSTALL ..., R sees the file as fixed.
But the saying that there is no problem that cannot be fixed by another layer of indirection remains true. Your R sources could simply be files within another layer in which you do text substitution according to some pattern. If you like autoconf, you could have a DESCRIPTION.in and a configure script that queries environment variables, a meta-config file, a database, or something else, and writes DESCRIPTION out. Similarly, you could have a sed, perl, python or R script do the textual substitution.
I used to let svn fill in the argument to Date: in DESCRIPTION, and I also encoded revision numbers in an included header file. It's all scriptable to your heart's content.
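As a concrete illustration of the DESCRIPTION.in idea, the substitution step can even be a few lines of R run by the build chain. The @VERSION@/@DATE@ placeholders and the PKG_VERSION environment variable are assumptions, not an established convention:
## minimal sketch -- placeholder names and the env var are assumptions
template <- readLines("DESCRIPTION.in")
out <- gsub("@VERSION@", Sys.getenv("PKG_VERSION", "0.0.0.9000"), template, fixed = TRUE)
out <- gsub("@DATE@", format(Sys.Date()), out, fixed = TRUE)
writeLines(out, "DESCRIPTION")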

What methods exist for distributing a semi-live dataset with an R package?

I am building a package for internal use using devtools. I would like the package to load in data from a file/connection that differs depending on the date the package is built. The data is large-ish, so paying a one-time cost of parsing and loading it during package building is preferable.
Currently I have a data.R file under R/ that assigns the data to package-level variables; the values are assigned during package installation (or at least that's what appears to be happening). This less-than-ideal setup mostly works, but in order to get all instances of the package to have the same data, I have to distribute the data file with the package (currently it is copied to inst/ by a helper script before building) instead of just having it all packaged together. There must be a better way.
Such as:
Generate .rda files during package building (but this requires not running the same code during package install)
I can do this with a Makefile but that seems like overkill
Can I have R code that is only run during package building and not during install?
Run R code in data/
But the data is munged using code in the package in question. I can fix that with Collate (I think) but then I have to maintain the order of all of the .R files (but with that added complexity I might as well use a Makefile?)
Build two packages, one with all of the code I want, one with the data.
Obvious, clever things I've not thought of.
tl;dr: What are some methods for adding a snapshot of dynamically changing data to an R package frozen for deployment?
As @BenBolker points out in the comments above, splitting the dataset out into a separate package has precedent in the community (most notably the core datasets package) and has additional benefits.
Separating functions from data also makes it easier to work on historic versions of the data with the up-to-date functions.
I currently have a tools-to-munge package and a things-to-munge package. Using a helper script I can build tools-to-munge and set up a Suggests (or Depends) in the DESCRIPTION of both packages pointing to the appropriate incrementing version of the other package. After the new tools-to-munge package has been built, I can build things-to-munge as needed using the functions in tools-to-munge.
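The helper script itself can stay very small. A rough sketch, assuming the two package sources sit side by side and devtools is available; the installed package name toolsToMunge and the function munge_latest_data() are hypothetical:
## minimal sketch -- directory layout, package name and munge function are assumptions
library(devtools)

install("tools-to-munge")              # build and install the functions package first

library(toolsToMunge)                  # hypothetical installed name of tools-to-munge
snapshot <- munge_latest_data()        # hypothetical function that reads today's feed
save(snapshot, file = "things-to-munge/data/snapshot.rda")

build("things-to-munge")               # freeze the snapshot into the data package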
