I have bits of code that I want to show in a package's examples but that should neither run (when example(my_fun) is called) nor be tested (when R CMD check is run), because they're slow enough to annoy users who might run them unthinkingly, and definitely slow enough to annoy the CRAN maintainers.
Writing R Extensions says
You can use \dontrun{} for text that should only be shown, but not run ...
and
Finally, there is \donttest, used (at the beginning of a separate line) to mark code that should be run by example() but not by R CMD check.
Should I nest these, i.e.
\donttest{
\dontrun{first slow example ...}
\dontrun{second slow example ...}
}
That technically seems to go against the wording in WRE (i.e. it says that \donttest code should be run by example() ...)?
I could just include them in the examples in a commented-out form or using if (FALSE) { ... } if it came to it ... but that seems ugly.
\dontrun subsumes \donttest: code marked with the former will be run neither by example() nor by R CMD check. I know this because my packages for talking to Azure use \dontrun liberally, for examples that assume you have an Azure account.
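For concreteness, here is a minimal sketch of an Rd \examples section (my_fun and its inputs are placeholders), reflecting the behaviour described above; no nesting is needed:

\examples{
my_fun(small_input)     # run by both example() and R CMD check

\donttest{
my_fun(medium_input)    # run by example(), skipped by R CMD check
}

\dontrun{
my_fun(huge_input)      # shown in the help page, never executed
}
}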
Related
Is there a comprehensive list somewhere of what you can pass to R CMD check? I don't see anything in the manual, and it's brutal to read. It would be great if every line of this output could be run independently, or if I could at least skip some things, but I don't see a comprehensive list describing how to do this.
Even better would be something I could use like
check_dependencies()
check_executable_files()
check_hidden_files()
For context, I had a note about large file sizes that was hard to debug (it was plotly embedding its full JS library in my vignettes), and devtools::check() took 3 minutes each time I ran it.
As rawr points out in the comments, the "official" answer is to run R CMD check --help, which will give the most accurate results for the version of R you are running.
Still, it is surprisingly difficult to find this info online. It would be nice not to need the terminal to look this up... so I'm filing this answer with some pointers in the right direction for anyone else who manages to Google their way to this Q&A:
This permalink highlights the documentation as of this writing, but it will go stale as the function evolves.
tools/R/check.R is the correct file at HEAD; search for "Usage <- function", which is where the contents of R CMD check --help are maintained. (In principle this could also change, but this anchor has remained accurate for the entire 12-year history of the R implementation of R CMD check.)
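If you'd rather not leave R for the terminal, one option is to shell out from the console; a minimal sketch using only base R:

# print the full R CMD check usage text (the option list) from within R
system2(file.path(R.home("bin"), "R"), c("CMD", "check", "--help"))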
sessionInfo() includes very useful info that will improve the chances of someone being able to run your code on their machine, including
OS and version
R version
Attached packages
What other info can be provided with an R script to ensure someone else will be able to run it in their environment?
NB please include how to get that info (i.e. what command to run or where to look for it)
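As a starting point, everything listed above can be captured with base R alone; a minimal sketch:

# print OS and version, R version, and attached packages to the console
sessionInfo()

# or save the same output next to the script so it travels with the code
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")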
While this is not a complete answer, I tend to include this function with scripts I send along, as it will download a package if the computer does not have it. This is more of a suggestion for scripts; for packages, you can explicitly state which versions of other packages your package depends on.
package_load <- function(packages = NULL, quiet = TRUE,
                         verbose = FALSE, warn.conflicts = FALSE) {
  # install any required packages that aren't already present
  pkgsToDownload <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(pkgsToDownload) > 0)
    install.packages(pkgsToDownload, repos = "http://cran.us.r-project.org",
                     quiet = quiet, verbose = verbose)
  # then load them; seq_along() handles packages = NULL safely,
  # unlike 1:length(packages)
  for (i in seq_along(packages))
    require(packages[i], character.only = TRUE, quietly = quiet,
            warn.conflicts = warn.conflicts)
}

## Example of use
package_load(c('dplyr', 'rgdal'))
This is helpful for one-off scripts, as it gets over the hurdle of a different computer not having the appropriate packages. However, I generally suggest to folks that they also make sure their version of R is up to date.
Is this the best solution? Probably not, but it does help with minor scripts you send along to others. For a larger code base, it would probably be better to put together a package or a Docker image.
I think the criteria you listed are the "basics" of script reusability. The next level would be the possible interactions of your script with its environment (e.g. R Shiny scripts use web features, so giving the web browser and browser version used to produce the script is good practice). Another kind of useful information is comments specifying the expected inputs and outputs.
NB: I would specify "attached packages and their versions", just to be sure...
I need to "industrialize" R code for a data science project, because the project will be rerun several times in the future with fresh data. The new code should be really easy to follow, even for people who have not worked on the project before, and they should be able to redo the whole workflow quite quickly. I am therefore looking for tips, suggestions, resources and best practices on how to achieve this objective.
Thank you in advance for your help!
You can make an R package out of your project, because a package has everything you need for a standalone project that you want to share with others:
Easy to share, download and install
R has a very efficient documentation system for your functions and objects, especially when you work within RStudio. Combined with roxygen2, it lets you document every function precisely, and it makes the code clearer since you can avoid explanatory inline comments (but please still add them where needed).
You can quite easily specify which dependencies your package needs, so that everyone knows what to install for your project to work (see the sketch after this list). You can also use packrat if you want to mimic Python's virtualenv.
R also provides a long-format documentation system through vignettes, which are similar to a printed notebook: you can display code, text, code results, etc. This is where you write guidelines and methods on how to use the functions, provide detailed instructions for a certain method, and so on. Once the package is installed, vignettes are automatically included and available to all users.
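As a concrete starting point for the last two points, the usethis package (a suggestion beyond the original answer; the package and vignette names are placeholders) automates both steps:

library(usethis)

# record a dependency in the DESCRIPTION file
use_package("dplyr")

# create a skeleton vignette under vignettes/
use_vignette("getting-started")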
The only downside is the following: since R is a functional programming language, a package consists mainly of functions, plus some other relevant objects (data, for instance), but not really scripts.
More details about that last point: if your project consists of a script that calls a set of functions to do something, it cannot directly appear within the package. Two options here: (a) you write a dispatcher function that runs a set of functions to do the job, so that users only have to call one function to run the whole method (not great for maintenance); (b) you make the whole script appear in a vignette (see above). With this method, people just have to write a single R file (which can be copy-pasted from the vignette), which may look like this:
library(mydatascienceproject)
library(...)
...
dothis()
dothat()
finishwork()
That enables you to execute the whole workflow from a terminal or a remote machine with Rscript, as follows (using argparse to add arguments):
Rscript myautomatedtask.R --arg1 anargument --arg2 anotherargument
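For completeness, a minimal sketch of what the argparse side of myautomatedtask.R might look like (the argument names mirror the command line above; the task functions are the hypothetical ones from the script sketch):

library(argparse)
library(mydatascienceproject)

# declare and parse the command-line arguments
parser <- ArgumentParser()
parser$add_argument("--arg1", help = "first argument")
parser$add_argument("--arg2", help = "second argument")
args <- parser$parse_args()

# run the workflow with the parsed arguments
dothis(args$arg1)
dothat(args$arg2)
finishwork()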
And finally, if you write a bash file calling Rscript, you can automate everything!
Feel free to read Hadley Wickham's book about R packages; it is super clear, full of best practices, and a great help in writing your packages.
One can get lost in the multiple files in the project's folder, so it should be structured properly: link
Naming conventions that I use: first, second.
Set the random seed so the outputs are reproducible.
Documentation is important: you can use the roxygen skeleton in RStudio (by default Ctrl+Alt+Shift+R).
I usually separate the code into smaller, logically cohesive scripts and use a main.R script that calls the others (see the sketch after this list).
If you use a special set of libraries, consider using packrat. Once you set it up, it manages the installed project-specific libraries.
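A minimal sketch of such a main.R (the script names and seed value are placeholders):

# main.R: single entry point for the whole workflow
set.seed(42)                # fix the random seed for reproducibility

source("01_load_data.R")    # smaller, logically cohesive scripts
source("02_clean_data.R")
source("03_fit_model.R")
source("04_report.R")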
I wrote an R function that updates the version number of a package in another question. I work a lot with GitHub and RStudio, and it would save me quite some time (plus be much more precise) if this function were automatically run every time I opened a certain project (or, better yet, on every git commit/push, but I assume that is harder to do). But I don't know how to do this or whether it is even possible.
I could use .Rprofile to run R code every time I start R, so I could just update versions whenever I start R (or build in that it only updates the version if the date is not today, or something), but that seems like overkill.
You can make a separate .Rprofile for each project. You have to put it in the main directory of the project (http://www.rstudio.com/ide/docs/using/projects).
Well, I would use .Rprofile for that. There is something to be said for being independent of the tool chain around you: knitr works from RStudio as well as without it, ditto for Rcpp/RInside etc.
You can hook into commit hooks for svn, either explicitly via hooks in the back end, or simply at your end by adding wrapper scripts. I presume you can do likewise with git, but I simply know much less about it. So to abstract this away, I would write myself a 'commitThis' or 'pushThis' or ... function that does the version increment, test run, code push and what have you.
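A minimal sketch of such a wrapper, assuming a hypothetical update_version() function like the one from the question and shelling out to git with system2():

commitThis <- function(message) {
  update_version()                     # hypothetical version-bump function
  devtools::test()                     # run the test suite
  system2("git", c("add", "--all"))    # stage, commit and push
  system2("git", c("commit", "-m", shQuote(message)))
  system2("git", "push")
}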
If your code needs RStudio to be already running (e.g. because it relies on some rstudioapi:: function), putting it directly in .Rprofile won't work, since .Rprofile is executed before RStudio is available.
Instead, you could set a hook for "rstudio.sessionInit":
setHook(
  hookName = "rstudio.sessionInit",
  value = function(newSession) {
    if (newSession) {
      # your code goes here
    }
  },
  action = "append"
)