R: renv within R notebook-scoped (Rmd) workflows

I am looking for a way to make my R notebook-centric workflow more reproducible and, subsequently, more easily containerized with Docker. For my medium-sized data analysis projects, I work with a very simple structure: a folder with an associated .Rproj and an index.html (a landing page for GitHub Pages) that holds other folders containing the notebooks, data, scripts, etc. This simple "1 GitHub repo = 1 .Rproj" structure also works well for my nb.html files rendered by GitHub Pages.
.
└── notebooks_project
    ├── notebook_1
    │   ├── notebook_1.Rmd
    │   └── ...
    ├── notebook_2
    │   ├── notebook_2.Rmd
    │   └── ...
    ├── notebooks_project.Rproj
    ├── README.md
    ├── index.html
    └── .gitignore
I wish to keep this workflow, which uses R notebooks both as literate programming tools and as control documents (see RMarkdown Driven Development), as it seems decently suited to medium-sized reproducible analytic projects. Unfortunately, there is little documentation on Rmd-centric workflows using renv, although renv seems to integrate well with them.
First, Yihui Xie hinted here that the renv functions relevant to individual Rmd documents are renv::activate(), renv::use(), and renv::embed(). renv::activate() does only part of what renv::init() does: it loads the project and sources renv/activate.R. From my understanding, it does this if the project was already initialized, but it acts like renv::init() if the project was not initialized: it discovers dependencies, copies them to the renv global package cache, and writes several files (.Rprofile, renv/activate.R, renv/.gitignore, .Rbuildignore). renv::use() works well within standalone R scripts where the script's dependencies are declared directly inside the script and should be installed and loaded automatically when the script is run. renv::embed() just embeds a compact representation of renv.lock into a code chunk of the notebook: it modifies the .Rmd on render/save by adding a chunk with the dependencies and deletes the call to renv::embed(). As I understand it, using renv::embed() and renv::use() could be sufficient for a reproducible standalone notebook. Nevertheless, I don't mind keeping the lock file or the renv library in the directory, as long as everything stays in the same folder.
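For reference, a standalone script using renv::use() might look like this (a minimal sketch; the package names and the pinned version are just examples):

renv::use(
  "here",          # latest version from the active repositories
  "dplyr@1.1.0"    # a pinned version, installed into a script-local library
)
library(dplyr)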
Second, to prepare for subsequent Binder or Docker requirements, I want to use renv together with RStudio Package Manager (RSPM). Grant McDermott provides some useful code here (which may go in the .Rprofile or in the .Rmd itself, I think) along with the rationale for it:
The lockfile is referenced against RSPM as the default package repository (i.e. where to download packages from), rather than one of the usual CRAN mirrors. Among other things, this enables time-travelling across different package versions and fast installation of pre-compiled R package binaries on Linux.
Third, I'd like to use the here package to work with relative paths. It seems the way to go so that the notebooks can run when transferred or when running inside a Docker container. Unfortunately, here::here() looks for the .Rproj file and will find it in my top-level folder (i.e. notebooks_project). A .here file, which can be created with here::set_here(), overrides this behavior, making here::here() point to the notebook folder as intended (i.e. notebook_1). Unfortunately, the .here file takes effect only after restarting the R session or running unloadNamespace("here") (documented here).
Here is what I have experimented with until now:
---
title: "<br> R Notebook Template"
subtitle: "RMarkdown Report"
author: "<br> Claudiu Papasteri"
date: "`r format(Sys.time(), '%d %m %Y')`"
output:
  html_notebook:
    code_folding: hide
    toc: true
    toc_depth: 2
    number_sections: true
    theme: spacelab
    highlight: tango
    font-family: Arial
---
```{r setup, include = FALSE}
# Have renv activate the current project
renv::activate()

# Set the default package source by operating system, so that we automatically
# pull in pre-built binary snapshots rather than building from source.
# This can also be appended to .Rprofile
if (Sys.info()[["sysname"]] %in% c("Linux", "Windows")) {
  # For Linux and Windows, use RStudio Package Manager (RSPM)
  options(repos = c(RSPM = "https://packagemanager.rstudio.com/all/latest"))
} else {
  # For Mac users, default to installing from CRAN/MRAN instead,
  # since RSPM does not yet support Mac binaries
  options(repos = c(CRAN = "https://cran.rstudio.com/"))
  # options(renv.config.mran.enabled = TRUE)  # TRUE by default
}
options(renv.config.repos.override = getOption("repos"))

# Install (if necessary) & load packages
packages <- c(
  "tidyverse", "here"
)
renv::install(packages, prompt = FALSE)  # install packages that are not in the cache
renv::hydrate(update = FALSE)  # install any packages used in the notebook but not yet provided; do not update
renv::snapshot(prompt = FALSE)

# Point here to the notebook directory
here::set_here()
unloadNamespace("here")  # need a new R session, or to unload the namespace, for the .here file to take precedence over .Rproj
rrRn_name <- fs::path_file(here::here())

# Set knitr options, with root.dir pointing to the .here file in the notebook directory
knitr::opts_chunk$set(root.dir = here::here())

# ???
renv::use(lockfile = here::here("renv.lock"), attach = TRUE)  # automatically provision an R library when the notebook is run and load packages
# renv::embed(path = here::here(rrRn_name), lockfile = here::here("renv.lock"))  # if run, this embeds renv.lock inside the notebook
renv::status()$synchronized
```
I'd like my notebooks to be able to run without code changes both locally (where dependencies are already installed and cached, and where the project was initialized) and when transferred to other systems. Each notebook should have its own renv settings.
I have many questions:
What's wrong with my renv sequence? Is calling renv::activate() on every run (both for initialization and afterwards) the way to go? Should I use renv::use() instead of renv::install() and renv::hydrate()? Is renv::embed() better for a reproducible workflow, even though every notebook folder would then have its own renv.lock and library? renv on activation also creates an .Rproj file (e.g. notebook_1.Rproj), thus breaking my simple 1 repo = 1 .Rproj structure; should this concern me?
The renv-RSPM workflow seems great, but is there any advantage to storing that script in the .Rprofile as opposed to having it within the Rmd itself?
Is there a better way to use here? That unloadNamespace("here") call seems hacky, but it seems to be the only way to keep the .here files useful.

What's wrong with my renv sequence? Is calling renv::activate() on every run (both for initialization and afterwards) the way to go? Should I use renv::use() instead of renv::install() and renv::hydrate()? Is renv::embed() better for a reproducible workflow, even though every notebook folder would then have its own renv.lock and library?
If you already have a lockfile that you want to use + associate with your projects, then I would recommend just calling renv::restore(lockfile = "/path/to/lockfile"), rather than using renv::use() or renv::embed(). Those tools are specifically for the case where you don't want to use an external lockfile; that is, you'd rather embed your document's dependencies in the document itself.
The question about renv::restore() vs renv::install() comes down to whether you want the exact package versions as encoded in the lockfile, or whatever happens to be current / latest on the R package repositories visible to your session. I think the most typical workflow is something like:
1. Use renv::install(), renv::hydrate(), or other tools to install packages as you require them;
2. Confirm that your document is in a good, runnable state;
3. Call renv::snapshot() to "save" that state;
4. Use renv::restore() in future runs of your document to "load" that previously-saved state.
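To make that cycle concrete, a minimal sketch (the package name is just an example; prompt = FALSE suppresses the interactive confirmation):

renv::install("tidyverse")      # 1. install what the document needs
# ... 2. develop until the document runs cleanly ...
renv::snapshot(prompt = FALSE)  # 3. "save" that state into renv.lock
renv::restore(prompt = FALSE)   # 4. in future runs, "load" the saved state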
renv on activation also creates an .Rproj file (e.g. notebook_1.Rproj), thus breaking my simple 1 repo = 1 .Rproj structure; should this concern me?
If this is undesired behavior, you might want to file a bug report at https://github.com/rstudio/renv/issues, with a bit more context.
The renv-RSPM workflow seems great, but is there any advantage to storing that script in the .Rprofile as opposed to having it within the Rmd itself?
It just depends on how visible you want that configuration to be. Do you want it to be active for all R sessions launched in that project directory? If so, then it might belong in the .Rprofile. Do you only want it active for that particular R Markdown document? If so, it might be worth including there. (Bundling it in the R Markdown file also makes it easier to share, since you could then share just the R Markdown document without also needing to share the project / .Rprofile)
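For instance, a project-level .Rprofile combining both might look like this (a sketch; the repos block is the same one from the setup chunk above, and the final line is the activation call that renv normally writes into .Rprofile on init):

# Project .Rprofile: runs for every R session started in this directory
if (Sys.info()[["sysname"]] %in% c("Linux", "Windows")) {
  options(repos = c(RSPM = "https://packagemanager.rstudio.com/all/latest"))
} else {
  options(repos = c(CRAN = "https://cran.rstudio.com/"))
}
source("renv/activate.R")  # written by renv::init(); activates the project library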
Is there a better way to use here? That unloadNamespace("here") call seems hacky, but it seems to be the only way to keep the .here files useful.
If I understand correctly, you could just manually create a .here file yourself before loading the here package, e.g.
file.create("/path/to/.here")  # create an empty .here marker file in the target directory
library(here)                  # here::here() now resolves to that directory
since that's all set_here() really does.

Related

How to run R projects / use their relative paths from the terminal without setwd() resp. cd

I'm kinda lost on that one:
I have set up an R project, let's call it "Test Project.Rproj". The beauty of R projects is the possibility to use relative paths (relative to the .Rproj file). My project consists of a "main.R" script, which is saved on the same level as the .Rproj file.
Additionally I have a directory called 'Output', where I want my plots and exported data to be saved. My "main.R" file looks like the following:
my_df <- data.frame(A = 1:10, B = 11:20)

my_df |>
  writexl::write_xlsx(here::here("Output",
                                 paste0("my_df_",
                                        stringr::str_replace_all(as.character(Sys.time()), ":", ""),
                                        ".xlsx")))
My final goal is to automate the execution of the 'main.R' file using the Windows Task Scheduler. But in order to do so, I have to be able to run the script from the terminal. The problem here is the working directory. When opening an R project, all the paths are relative to the .Rproj file. But in the terminal the current working directory is <C:\Users\my_name>. Of course I could manually set the working directory via cd "path\to\my\project". But I would like to avoid that.
My current call for the execution of the main.R file in the terminal is the following:
"C:\Program Files\R\R-4.1.0\bin\Rscript" -e "source('C:/Users/my_name/path/to/my/project/main.R')"
My two ideas for a solution are the following, but I am happy for other suggestions as well.
In order to replicate the usual use of a project: is there a way to execute the .Rproj file from the terminal, in order to create a similar environment as in RStudio, where all the relative paths work when executing scripts from the project afterwards?
There are two packages addressing the problem of relative paths: rprojroot and here, where the former is the basis for the latter. I am pretty sure that here does not provide the needed functionality. I tried adding here::i_am("main.R") to my main.R file, but the project root directory is still not found when executing in the terminal from a working directory outside the project.
For rprojroot to work, I think it is also necessary to have your current working directory somewhere within the project. But this package offers a lot of functionality, so I am not sure whether I am overlooking something.
So I would be happy about any help. Maybe it is impossible and I have to change the working directory manually - then I would be glad to know that as well.
Some links I used in my research:
https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
https://malco.io/2018/11/05/why-should-i-use-the-here-package-when-i-m-already-using-projects/
http://jenrichmond.rbind.io/post/how-to-use-the-here-package/
Thanks a lot!
Edit: My current implementation is an additional R script, where I manually set the working directory via setwd() and source the main.R file. However it is always suggested to avoid setwd, which is why this whole question exists.
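For reference, that wrapper is essentially the following (a sketch; the file name wrapper.R is arbitrary and the path is the project directory from above):

# wrapper.R -- called from the terminal / Task Scheduler instead of main.R
setwd("C:/Users/my_name/path/to/my/project")  # make project-relative paths resolve
source("main.R")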

How do I use setwd in a relative way?

Our team uses R scripts in git repos that are shared between several people, across both Mac and Windows (and occasionally Linux) machines. This tends to lead to a bunch of really annoying lines at the top of scripts that look like this:
#path <- 'C:/data-work/project-a/data'
#path <- 'D:/my-stuff/project-a/data'
path = "~/projects/project-a/data"
#path = 'N:/work-projects/project-a/data'
#path <- "/work/project-a/data"
setwd(path)
To run the script, we have to comment/uncomment the correct path variable or the scripts won't run. This is annoying, untidy, and tends to be a bit of a mess in the commit history too.
In the past I've got round this by using shell scripts to set directories relative to the script's location and skipping setwd entirely (and then using ./run-scripts.sh instead of Rscript process.R), but as we've got Windows users here, that won't work. Is there a better way to simplify these messy setwd() boilerplates in R?
(side note: in Python, I solve this by using the path library to get the location of the script file itself, and then build relative paths from that. But R doesn't seem to have a way to get the location of the running script's file?)
The answer is to not use setwd() at all, ever. R does things a bit different than Python, for sure, but this is one thing they have in common.
Instead, any scripts you're executing should assume they're being run from a common, top-level, root folder. When you launch a new R process, its working directory (i.e., what getwd() gives) is set to the same folder as the process was spawned from.
As an example, if you had this layout:
.
├── data
│   └── mydata.csv
└── scripts
    └── analysis.R
You would run analysis.R from . and analysis.R would reference data/mydata.csv as "data/mydata.csv" (e.g., read.csv("data/mydata.csv", stringsAsFactors = FALSE)).
I would keep your shell scripts or Makefiles that run your R scripts and have the R scripts assume they're being run from the top level of the git repo.
This might look like:
cd . # Wherever `.` above is
Rscript scripts/analysis.R
Further reading:
https://www.tidyverse.org/articles/2017/12/workflow-vs-script/
https://github.com/jennybc/here_here
1) If you are looking for a way to find the path of the currently running script then see:
Rscript: Determine path of the executing script
2) Another approach is to require that users put an option with a prearranged name in their .Rprofile file. Then the script can setwd() to that. An attractive aspect of this system is that, over time, one can forget where various projects are located, and one can then just look at the .Rprofile file to remind oneself. For example, for projectA each person running the project would put this in their .Rprofile:
options(projectA = "...whatever...")
and then the script would start off with:
proj <- getOption("projectA")
if (!is.null(proj)) setwd(proj) else stop("Set option 'projectA' to its directory")
One variation of this is to assume the current directory if projectA is not defined. Although this may seem more flexible, I personally find the documenting feature of the above code to be a big advantage.
proj <- getOption("projectA")
if (!is.null(proj)) setwd(proj) else cat("Using", getwd(), "\n")
in Python, I solve this by using the path library to get the location of the script file itself, and then build relative paths from that. But R doesn't seem to have a way to get the location of the running script's file?
R itself unfortunately doesn't have a way to do this. But you can achieve the same result in either of two ways:
Use packages instead of scripts where you include code via source. Then you can use the solution outlined in amoeba’s answer. This works because the real issue is that R has no way of telling the source function where to look for scripts.
Use box::use instead of source. The ‘box’ package provides a module system that allows relative imports of code modules. A nice side-effect of this is that the package provides a function that tells you the path of the current script, just like in Python (and, just like in Python, you normally don’t need to use this function directly).
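A minimal sketch of the 'box' approach (the module name utils and the function my_function() are hypothetical; utils refers to a file utils.r next to the importing script):

box::use(./utils)    # import ./utils.r, found relative to this script's own location
utils$my_function()  # call a function exported by the module
box::file()          # path of the current script's location, akin to Python's __file__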

Ignore specific files in packrat search

I am creating reports using R, RStudio, knitr, and packrat. I have a project folder structure similar to below:
project_folder/
- packrat/
- .Rprofile
- analysis_folder/
  - library.R
  - child.rnw
- data_folder/
- knitr_rnw_location/
  - file.rnw
  - .Rprofile
I have set up the .Rprofile files with the appropriate lines in the main project_folder and in the subdirectory of the .rnw file, according to the recommendations given in RStudio's Limitations and Caveats page.
When I run packrat::init() at the project_folder level, the packrat folder is set up. Then, when I open file.rnw, the packrat library is all set up.
However, when I execute packrat::snapshot(), it gives errors like
Unable to tangle file knitr_rnw_location/file.rnw; cannot parse dependencies
and fails. Is there a way to tell packrat to ignore my .rnw files? All library() calls are made from separate .R scripts that are source()d from the .rnw files. Packrat also searches any variables declared in the knitr chunks and gives the error
Error in eval(x, envir = envir): object 'my_variable_name' not found
In the end, it does state
Snapshot written to "~/project_folder/packrat/packrat.lock"
So I can only assume that packrat::snapshot() was successful. Has anyone else run into the same issue when working with knitr and packrat?
Much appreciated,
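(A hedged note, not a confirmed fix: packrat has a project option, ignored.directories, intended to keep certain directories out of the dependency scan. Whether it is available depends on your packrat version, so check ?packrat::opts before relying on it:)

# Sketch: exclude the .rnw folder from packrat's dependency scan,
# assuming your packrat version provides the 'ignored.directories' option
packrat::opts$ignored.directories("knitr_rnw_location")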

Using packrat libraries with knitr and the RStudio Compile PDF button

As explained by Yihui Xie in this post, when one uses the Compile PDF button of the RStudio IDE to produce a PDF from a .Rnw file, knit() uses the globalenv() of a new R session. Is there a way that this new R session would use the packrat libraries of my project (even the version of knitr included in my packrat libraries) instead of my personal user libraries to ensure a maximum level of reproducibility? I guess that the new R session would have to be linked to the project itself, but I don't know how to do this efficiently.
I know I could directly use the knit() function instead of the Compile PDF button and, that way, knit() would use my current globalenv(), but I don't like this solution since it's less reproducible.
I think I solved the problem myself, but I want to share it with others who could confirm I'm right, and possibly help improve my solution.
My specific problem is that my .Rnw file is in a sub-directory of my whole project. When the Compile PDF button creates a new R session, that session starts in this sub-directory, thus not finding the .Rprofile file that would initialize packrat. I think the easiest solution would be to create a .Rprofile file in my subdirectory which contains
temp <- getwd()           # remember the .rnw subdirectory
setwd("..")               # move up to the project root
source("packrat/init.R")  # initialize packrat from the project root
setwd(temp)               # restore the original working directory
rm(temp)
I have to change the working directory to the project level before source("packrat/init.R") because the file itself refers to that directory...
Can anybody see a better solution?
P.,
I don't know if this solution works even for the knitr package, but I am 99% sure it works for all other packages, as it seems to for me.
(I believe) I have a very similar problem. I have my project folder, but my working directory has always been the sub folder where my .rnw file is located, in a subdirectory of my project folder.
The link to Yihui Xie's answer was very helpful.
Originally I wanted a project folder such as:
project-a/
  working/
    data/
      datas.csv
    analysis/
      library.R
      rscripts.R
    rnw/
      report.rnw
      child/
        preamble.rnw
    packrat/
But I'm not sure if that is possible with packrat when my library() calls are not in the working directory and packrat cannot parse the .rnw file (I call the library.R file from a chunk using source() in my .rnw file). A few notes:
I wanted to use a .Rproj file to open the project and have project-a/working as the working directory
If this was true then packrat can find the library.R script
But the .rnw file still defaults to its own working directory when compiling
I thought an .Rprofile with knitr::opts_knit$set(root.dir = "..") would work, but I don't think it works for LaTeX commands like \input; it defaults back to the directory containing the .rnw file
I thought this was insufficient because then you have two working directories, one for your R chunks and one for your LaTeX!
Since the .rnw file always sets the working directory, I put my library.R script in the same directory as my .rnw file, which creates the packrat folder in project-a/working/rnw. I am 99% sure this works because, when I created the packrat folder in the project-a/working/rnw folder WITHOUT relocating the library.R file, I received an error that no packages could be found and I could not compile the .rnw file.
project-a/
  working/
    data/
      datas.csv
    analysis/
      rscripts.R
    rnw/
      report.rnw
      library.R
      packrat/
      child/
        preamble.rnw
Again, unless I am overlooking something or misunderstanding which packages are being used, this seems to have worked for me. Disclaimer here that I am relatively new to packrat.

Load both local and global .Rprofile in an RStudio project

I am working with several projects in Rstudio that require different .Rprofile files. These files usually consist of two parts: global settings I'd like to have every time I run R and local project-specific settings that are loaded when I open the project.
It is natural to exclude the global part from local .Rprofiles to keep it flexible. However, the relevant topic in the documentation states the following (backed up by this question):
When starting RStudio in an alternate working directory the .Rprofile file located within that directory is sourced. If (and only if) there is not an .Rprofile file in the alternate directory then the global default profile (e.g. ~/.Rprofile) is sourced instead.
How do I force to load the global .Rprofile at all times?
Small example. I currently have 2 .Rprofiles:
cat("global\n"); cat("local_1\n) for project 1;
cat("global\n"); cat("local_2\n) for project 2;
The global .Rprofile does not exist.
I'd like to have 3 of them:
cat("local_1\n) for project 1;
cat("local_2\n) for project 2;
cat("global\n") at the home dir.
How should I tinker with these files and/or Rstudio options to get the same output on startup of both projects?
This is just how R works (i.e., the behaviour is not specific to RStudio) -- it sources only one of the available .Rprofiles. AFAIK this is not configurable, unfortunately -- see ?Startup for full details on what R does on startup.
If you want to load both, I think you'll have to explicitly source the global .Rprofile from any local .Rprofiles.
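For the small example above, that would mean something like this in each project's local .Rprofile (a sketch; the file.exists() guard is just defensive):

# Project 1's local .Rprofile: pull in the global settings first, then the local ones
if (file.exists("~/.Rprofile")) source("~/.Rprofile")  # prints "global"
cat("local_1\n")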
