Building R packages - using environment variables in DESCRIPTION file? - r

At our site, we have a large amount of custom R code that is used to build a set of packages for internal use and distribution to our R users. We try to maintain the entire library in a versioning scheme so that the version numbers and the date are the same. The problem is that we've gotten to the point where the number of packages is substantial enough that manual modification of the DESCRIPTION file and the package .Rd file is very time consuming, and it would be nice to automate these pieces.
We could write a pre-script that goes through the full set of files and writes the current date and version number. This could be done without a lot of pain, but it would modify our current build chain, and we would have to adapt the various steps.
Is there a way that this can be done without having to do a pre-build file modification step? In other words, can the DESCRIPTION file and the .Rd file contain something akin to an environment variable that will be substituted with the current information when called upon by R CMD build ?

You cannot use environment variables: R, when running R CMD build ... or R CMD INSTALL ..., sees the file as fixed.
But the saying that there is no problem that cannot be fixed by another layer of indirection remains true. Your R sources could simply sit behind another layer in which you do text substitution according to some pattern. If you like autoconf, you could have a DESCRIPTION.in and a configure script that queries the environment variables, a meta-config file, a database, or something else, and writes out the final DESCRIPTION. Similarly, you could have a sed or perl or python or R or ... script doing the textual substitution.
I used to let svn fill in the argument to Date: in DESCRIPTION, and also encoded revision numbers in an included header file. It's all scriptable to your heart's content.
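As an illustration of that substitution approach, here is a minimal R sketch; the DESCRIPTION.in file, the @VERSION@ and @DATE@ placeholders, and the PKG_VERSION environment variable are all hypothetical names, not a standard:
# Hypothetical pre-build step: fill placeholders in DESCRIPTION.in with the
# current version (from an environment variable) and date, then write DESCRIPTION.
template <- readLines("DESCRIPTION.in")
version  <- Sys.getenv("PKG_VERSION", unset = "0.0.0.9000")
today    <- format(Sys.Date(), "%Y-%m-%d")
filled   <- gsub("@VERSION@", version, template, fixed = TRUE)
filled   <- gsub("@DATE@", today, filled, fixed = TRUE)
writeLines(filled, "DESCRIPTION")
The same script can loop over all packages in the library so one invocation updates every DESCRIPTION before R CMD build runs.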

Related

Is there an R function to make a copy of all the source code used to generate an analysis?

I have a file run_experiment.rmd which performs an analysis on data using a bunch of .r scripts in another folder.
Every analysis is saved into its own timestamped folder. I save the outputs of the analysis, the inputs used, and if possible I would also like to save the code used to generate the analysis (including the contents of both the .rmd file and the .r files).
The reason for this is because if I make changes to the way my analyses are run, then if I re-run the analysis using the new updated file, I will get different results. If possible, I would like to keep legacy versions of the code so that I can always, if need be, re-run the original analysis.
Have you considered using a git repository to commit your code and output each time you update/run it? I think this is the optimal solution for what you are describing. Each commit would have a timestamp associated with it so you can roll back to a previous version when needed.
The best way to do this is to put all of those scripts into an R package, and in your Rmd file, print sessionInfo() to record the package version used.
You should change the version number of the package each time you make a non-trivial change to it (or even better, with every change).
Then when you want to reproduce the analysis, you use the sessionInfo() listing to work out which version of R and of the packages to install, and you'll get the same environment.
There are packages to help with this (pak and renv, maybe others), but I haven't used them, so I can't give details or recommendations.
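As a sketch of the sessionInfo() idea, a final chunk in the .Rmd could record the session alongside the outputs; output_dir below is just a placeholder for whatever timestamped folder the analysis writes to:
# Record the R version and package versions with the analysis outputs;
# "output_dir" stands in for the timestamped results folder.
si <- sessionInfo()
print(si)
writeLines(capture.output(si), file.path(output_dir, "sessionInfo.txt"))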

Do .Rout files preserve the R working environment?

I recently started looking into Makefiles to keep track of the scripts inside my research project. To really understand what is going on, I would like to understand the contents of .Rout files produced by R CMD BATCH a little better.
Christopher Gandrud is using a Makefile for his book Reproducible research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, the third one merges both datasets. They are invoked by the following lines of the Makefile:
# Key variables to define
RDIR = .
# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
	R CMD BATCH $<
None of the first two files outputs data; nor does the merge script explicitly import data - it just uses the objects created in the first two scripts. So how is the data preserved between the scripts?
To me it seems like the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other or is it a property of the batch execution itself?
If the working environment is really preserved between the scripts, I see a lot of potential for issues if there are objects with the same names or functions with the same names from different packages. Another issue of this setup seems to be that the Makefile cannot propagate changes in the first two files downstream because there is no explicit input/prerequisite for the merge script.
I would appreciate learning whether my intuition is right and whether there are better ways to execute R files in a Makefile.
By default, R CMD BATCH will save your workspace to a hidden .RData file after running unless you choose --no-save. That's why it's not really the recommended way to run an R script. The recommended way is with Rscript, which will not save by default; you must write code explicitly to save if that's what you want. This is different from the .Rout file, which should only contain the output from the commands run in the script.
In this case, execution doesn't happen in the exact same environment. R is still called three times, but the environment is serialized and reloaded between each run.
You are correct that there may be a lot of problems with saving and re-loading workspaces by default. That's why most people recommend you do not do that. But in this case, the author just figured it made things easier for their workflow, so they used it. It would be better to be more explicit about input and output files in general, though.
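To make the hand-off explicit rather than relying on a saved workspace, each script can write its result to a file and the merge script can read those files back in; the object and file names below are only illustrative, not the book's:
# clean_data_1.R: write the cleaned object to an explicit output file
clean1 <- data.frame(id = 1:3, x = rnorm(3))   # stand-in for the real cleaning step
saveRDS(clean1, "data/clean1.rds")

# merge_data.R: read the explicit inputs instead of relying on a saved .RData
clean1 <- readRDS("data/clean1.rds")
clean2 <- readRDS("data/clean2.rds")
merged <- merge(clean1, clean2, by = "id")
saveRDS(merged, "data/merged.rds")
Written this way, the .rds files can also be listed as prerequisites in the Makefile, so changes in the first two scripts propagate to the merge step.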

Are there any good resources/best-practices to "industrialize" code in R for a data science project?

I need to "industrialize" the R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be really easy to follow even for people who have not worked on the project before, and they should be able to redo the whole workflow quite quickly. Therefore I am looking for tips, suggestions, resources and best practices on how to achieve this objective.
Thank you for your help in advance!
You can make an R package out of your project, because it has everything you need for a standalone project that you want to share with others:
Easy to share, download and install
R has a very efficient documentation system for your functions and objects when you work within RStudio. Combined with roxygen2, it enables you to document every function precisely, and it makes the code clearer since you can avoid inline comments (but please add them anyway if needed); a small example is sketched after this list.
You can specify quite easily which dependencies your package needs, so that everyone knows what to install for your project to work. You can also use packrat if you want to mimic python's virtualenv.
R also provides a long-format documentation system called vignettes, which are similar to a printed notebook: you can display code, text, code results, etc. This is where you will write guidelines and methods on how to use the functions, provide detailed instructions for a certain method, etc. Once the package is installed, they are automatically included and available to all users.
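As a small illustration of the roxygen2 style mentioned above (the function itself is just a made-up example):
#' Summarise a numeric vector
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be dropped? Defaults to TRUE.
#' @return A named numeric vector with the mean and standard deviation of x.
#' @export
summarise_vector <- function(x, na.rm = TRUE) {
  c(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm))
}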
The only downside is the following: since R is a functional programming language, a package consists mainly of functions, plus some other relevant objects (data, for instance), but not really scripts.
More details about the last point: if your project consists of a script that calls a set of functions to do something, that script cannot directly appear within the package. Two options here: a) you make a dispatcher function that runs a set of functions to do the job, so that users just have to call one function to run the whole method (not really good for maintenance); b) you make the whole script appear in a vignette (see above). With this method, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:
library(mydatascienceproject)
library(...)
...
dothis()
dothat()
finishwork()
That enables you to execute the whole work from a terminal or a remote machine with Rscript, with the following (using argparse to add arguments):
Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
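A minimal sketch of what myautomatedtask.R could look like with the argparse package; the argument names mirror the call above, and the package and function names come from the earlier hypothetical example:
# myautomatedtask.R: parse command-line arguments, then run the workflow
library(argparse)
library(mydatascienceproject)  # the hypothetical project package from above

parser <- ArgumentParser(description = "Run the full analysis")
parser$add_argument("--arg1", type = "character", help = "first argument")
parser$add_argument("--arg2", type = "character", help = "second argument")
args <- parser$parse_args()

dothis(args$arg1)
dothat(args$arg2)
finishwork()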
And finally, if you write a bash file calling Rscript, you can automate everything!
Feel free to read Hadley Wickham's book about R packages; it is super clear, full of best practices, and of great help in writing your packages.
One can get lost in the multiple files in the project's folder, so it should be structured properly: link
Naming conventions that I use: first, second.
Set the random seed so that the outputs are reproducible.
Documentation is important: you can use the Roxygen skeleton in RStudio (default Ctrl+Alt+Shift+R).
I usually separate the code into smaller, logically cohesive scripts, and use a main.R script, that uses the others.
If you use a special set of libraries, you can consider using packrat. Once you set it up, you can manage the installed project-specific libraries.
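For reference, the basic packrat workflow (run from the project root) looks roughly like this:
# One-time setup: create a project-private library and start tracking packages
packrat::init()

# After installing or updating packages, record the exact versions in use
packrat::snapshot()

# On another machine (or after cloning the project), rebuild that library
packrat::restore()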

Where to put R files that generate package data

I am currently developing an R package and want it to be as clean as possible, so I try to resolve all WARNINGs and NOTEs displayed by devtools::check().
One of these notes is related to some code I use for generating sample data to go with the package:
checking top-level files ... NOTE
Non-standard file/directory found at top level:
'generate_sample_data.R'
It's an R script currently placed in the package root directory and not meant to be distributed with the package (because it doesn't really seem useful to include).
So here's my question:
Where should I put such a file or how do I tell R to leave it be?
Is .Rbuildignore the right way to go?
Currently devtools::build() puts the R script in the final package, so I shouldn't just ignore the NOTE.
As suggested in http://r-pkgs.had.co.nz/data.html, it makes sense to use ./data-raw/ for scripts/functions that are necessary for creating/updating data but not something you need in the package itself. After adding ./data-raw/ to ./.Rbuildignore, the package build should ignore anything within that directory. (And, as you commented, there is a helper function, devtools::use_data_raw().)
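A typical ./data-raw/generate_sample_data.R then creates the object and stores it under ./data/; the dataset name below is just an example, and use_data() now lives in usethis (older devtools versions re-exported it):
# data-raw/generate_sample_data.R
# Build the example dataset and save it into data/ as an .rda file.
set.seed(1)
sample_data <- data.frame(id = 1:100, value = rnorm(100))
usethis::use_data(sample_data, overwrite = TRUE)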

How to use Makefiles with R CMD build

I am developing an R package. It is based on a project that only used a Makefile. Most of it translated easily to the R CMD build workflow. However, the PDFs I need to create are a bit complex and I don't get them right unless I tinker; so far I have only figured out how to do that with a Makefile.
In the R package documentations I find references to use Makefiles for sources and even for vignettes.
I don't grasp how these should be applied. From these documentations I had the impression that Makefiles would be called in the process of R CMD build, but when I put Makefiles in the described directories they are just ignored. However, R CMD check recognises them and reports passing tests.
I have also seen some Makefiles that call R CMD build inside, but I keep wondering how these would execute when I use install.packages. That doesn't seem right. I mean, why would R CMD check them if it didn't care about them? And there's also this page in R Packages about adding SystemRequirements: GNU make; why do this for a file you don't use?
So what is the best practice nowadays? And are there examples in the wild that I can look at?
Updates
As I was asked for an example
I want to build a vignette similar to what is described in "Writing package vignettes". There is a master LaTeX file which includes several Rnw files.
The concrete dilemmas are:
how do I build the pdf vignette?
how can I enforce dependencies? Obviously the Rnw files need to be rendered first.
The Rnw files need slowly calculated data that is neither intended to go into the package nor into the repo (it's some gigabytes), but it is reused several times during the build.
So far I do it with a Makefile; the general pattern is like this:
tmp/test.pdf: tmp/test.tex tmp/rnw1.tex tmp/rnw2.tex
	latexmk -outdir=$(@D) $<

tmp/%.tex: r/%.rnw
	Rscript -e "knitr::knit('$<', output='$@')"

tmp/rnw1.tex tmp/rnw2.tex: tmp/slowdata.Rdata

tmp/slowdata.Rdata: r/ireallytakeforever.R
	Rscript $<
Bdecaf,
Ok, answer version 2.0 - chuckle.
You mentioned that "The question is how Makefiles and the package build workflow are supposed to go together". In that context, my recommendation is you review a set of example R package makefiles:
Makefile for Yihui Xie's knitr package for R.
Makefile for my R/qtlcharts package.
The knitr package makefile (in my view) provides a good example of how to build vignettes. You need to review the makefile and the directory structure; that is the template I would recommend you follow.
I'd also recommend you look at maker, a Makefile for R package development. On top of this, I would start with Karl Broman's guides (this is what I used myself as a source reference a while back; they have since been eclipsed by Hadley's book on packages but are still useful, in my view):
Minimal make: A minimal tutorial on Make
R package Primer.
The other recommendation is to read Rob Hyndman's article I referenced previously,
Makefiles for R/LaTeX projects.
Between them, you should be able to do what you request. Above and beyond that, you have the base R package manual you referenced.
I hope the above helps.
T.
Referenced pages:
minimal make: A minimal tutorial on make (author: Karl Broman)
I would argue that the most important tool for reproducible research is not Sweave or knitr but GNU make.
Consider, for example, all of the files associated with a manuscript. In the simplest case, I would have an R script for each figure plus a LaTeX file for the main text. And then a BibTeX file for the references.
Compiling the final PDF is a bit of work:
Run each R script through R to produce the relevant figure.
Run latex and then bibtex and then latex a couple more times.
And the R scripts need to be run before latex is, and only if they’ve changed.
A simple example
GNU make makes this easy. In your directory for the manuscript, you create a text file called Makefile that looks something like the following (here using pdflatex).
mypaper.pdf: mypaper.bib mypaper.tex Figs/fig1.pdf Figs/fig2.pdf
	pdflatex mypaper
	bibtex mypaper
	pdflatex mypaper
	pdflatex mypaper

Figs/fig1.pdf: R/fig1.R
	cd R;R CMD BATCH fig1.R

Figs/fig2.pdf: R/fig2.R
	cd R;R CMD BATCH fig2.R
Each batch of lines indicates a file to be created (the target), the files it depends on (the prerequisites), and then a set of commands needed to construct the target from the dependent files. Note that the lines with the commands must start with a tab character (not spaces).
Another great feature: in the example above, you’d only build fig1.pdf when fig1.R changed. And note that the dependencies propagate. If you change fig1.R, then fig1.pdf will change, and so mypaper.pdf will be re-built.
One oddity: if you need to change directories to run a command, do the cd on the same line as the related command. The following would not work:
### this doesn't work ###
Figs/fig1.pdf: R/fig1.R
	cd R
	R CMD BATCH fig1.R
You can, however, use \ for a continuation line, like so:
### this works ###
Figs/fig1.pdf: R/fig1.R
	cd R;\
	R CMD BATCH fig1.R
Note that you still need to use the semicolon (;).
Using GNU make
You probably already have GNU make installed on your computer. Type make --version in a terminal/shell to see. (On Windows, go here to download make.)
To use make:
Go into the directory for your project.
Create the Makefile file.
Every time you want to build the project, type make.
In the example above, if you want to build fig1.pdf without building mypaper.pdf, just type make Figs/fig1.pdf.
Frills
You can go a long way with just simple make files as above, specifying the target files, their dependencies, and the commands to create them. But there are a lot of frills you can add, to save some typing.
Here are some of the options that I use. (See the make documentation for further details.)
Variables
If you’ll be repeating the same piece of code multiple times, you might want to define a variable.
For example, you might want to run R with the flag --vanilla. You could then define a variable R_OPTS:
R_OPTS=--vanilla
You refer to this variable as $(R_OPTS) (or ${R_OPTS}; either parentheses or curly braces is allowed), so in the R commands you would use something like
cd R;R CMD BATCH $(R_OPTS) fig1.R
An advantage of this is that you just need to type out the options you want once; if you change your mind about the R options you want to use, you just have to change them in the one place.
For example, I actually like to use the following:
R_OPTS=--no-save --no-restore --no-init-file --no-site-file
This is like --vanilla but without --no-environ (which I need because I use the .Renviron file to define R_LIBS, to say that I have R packages installed in an alternative directory).
Automatic variables
There are a bunch of automatic variables that you can use to save yourself a lot of typing. Here are the ones that I use most:
$@ the file name of the target
$< the name of the first prerequisite (i.e., dependency)
$^ the names of all prerequisites (i.e., dependencies)
$(@D) the directory part of the target
$(@F) the file part of the target
$(<D) the directory part of the first prerequisite (i.e., dependency)
$(<F) the file part of the first prerequisite (i.e., dependency)
For example, in our simple example, we could simplify the lines
Figs/fig1.pdf: R/fig1.R
	cd R;R CMD BATCH fig1.R
We could instead write
Figs/fig1.pdf: R/fig1.R
	cd $(<D);R CMD BATCH $(<F)
The automatic variable $(<D) will take the value of the directory of the first prerequisite, R in this case. $(<F) will take the value of the file part of the first prerequisite, fig1.R in this case.
Okay, that’s not really a simplification. There doesn’t seem to be much advantage to this, unless perhaps the directory were an obnoxiously long string and we wanted to avoid having to type it twice. The main advantage comes in the next section.
Pattern rules
If a number of files are to be built in the same way, you may want to use a pattern rule. The key idea is that you can use the symbol % as a wildcard, to be expanded to any string of text.
For example, our two figures are being built in basically the same way. We could simplify the example by including one set of lines covering both fig1.pdf and fig2.pdf:
Figs/%.pdf: R/%.R
	cd $(<D);R CMD BATCH $(<F)
This saves typing and makes the file easier to maintain and extend. If you want to add a third figure, you just add it as another dependency (i.e., prerequisite) for mypaper.pdf.
Our example, with the frills
Adding all of this together, here’s what our example Makefile will look like.
R_OPTS=--vanilla

mypaper.pdf: mypaper.bib mypaper.tex Figs/fig1.pdf Figs/fig2.pdf
	pdflatex mypaper
	bibtex mypaper
	pdflatex mypaper
	pdflatex mypaper

Figs/%.pdf: R/%.R
	cd $(<D);R CMD BATCH $(R_OPTS) $(<F)
The advantage of the added frills: less typing, and it’s easier to extend to include additional figures. The disadvantage: it’s harder for others who are less familiar with GNU Make to understand what it’s doing.
More complicated examples
There are complicated Makefiles all over the place. Poke around github and study them.
Here are some of my own examples:
Makefile for my AIL probabilities paper
Makefile for my phylo QTL paper
Makefile for my pre-CC probabilities paper
Makefile for a talk on interactive graphs.
Makefile for a talk on QTL mapping for function-valued traits.
Makefile for my R/qtlcharts package.
And here are some examples from Mike Bostock:
Makefile for us-rivers
Makefile for protovis
Makefile for topotree
Also look at the Makefile for Yihui Xie’s knitr package for R.
Also of interest is maker, a Makefile for R package development.
Resources
GNU make webpage
Official manual
O’Reilly Managing projects with GNU make book (part of the Open Books project)
Software carpentry’s make tutorial
Mike Bostock’s “Why Use Make”
GNU Make for reproducible data analysis by Zachary Jones
Makefiles for R/LaTeX projects by Rob Hyndman
R package primer: a minimal tutorial (author: Karl Broman)
A minimal tutorial on how to make an R package.
R packages are the best way to distribute R code and documentation, and, despite the impression that the official manual (Writing R Extensions) might give, they really are quite simple to create.
You should make an R package even for code that you don't plan to distribute. You'll find it is easier to keep track of your own personal R functions if they are in a package. And it's good to write documentation, even if it's just for your future self.
Hadley Wickham wrote a book about R packages (free online; also available in paper form from Amazon). You might just jump straight there.
Hilary Parker wrote a short and clear tutorial on writing R packages. If you want a crash course, you should start there. A lot of people have successfully built R packages from her instructions.
But there is value in having a diversity of resources, so I thought I'd go ahead and write my own minimal tutorial.
The following list of topics looks forbidding, but each is short and straightforward (and hopefully clear). If you're put off by the list of topics, and you've not already abandoned me in favor of Hadley's book, then why aren't you reading Hilary's tutorial?
If anyone's still with me, the following pages cover the essentials of making an R package.
Why write an R package?
The minimal R package
Building and installing an R package
Making it a proper package
Writing documentation with Roxygen2
Software licenses
Checking an R package
The following are important but not essential.
Putting it on GitHub
Getting it on CRAN
Writing vignettes
Writing tests
Including datasets
Connecting to other packages
The following contains links to other resources:
Further resources
If anything here is confusing (or wrong!), or if I've missed important details, please submit an issue, or (even better) fork the GitHub repository for this website, make modifications, and submit a pull request.
The source for this tutorial is on github.
Also see my tutorials on git/github, GNU make, knitr, making a web site with GitHub Pages, data organization, and reproducible research.
