Externalise config file and functions in R Markdown

I am having problems understanding the (practical) difference between the various ways to externalise code in R notebooks. Having referred to previous questions and to the documentation, the difference between sourcing external .R files and reading them with read_chunk() is still unclear to me. For practical purposes, consider the cases below:
I want to load libraries with an external config.R file. The most intuitive way, it seems to me, is to create config.R as
library(first_package)
library(second_package)
...
and, in the main R notebook (say, main.Rmd), call it like
```{r}
source('config.R')
```
```{r}
# use the libraries included above
```
However, this does not seem to recognise the included packages, so sourcing an external config file appears useless; the same happens when using read_chunk() instead. The question is therefore: how do I include libraries at the top so that they are recognised in the main markdown script?
Say I want to define global functions externally and then include them in the main notebook: along the same lines as above, one would define them in an external foo.R file and include them in the main one.
Again, it seems that read_chunk() does not do the job, whereas source('foo.R') does in this case. The documentation states that the former "only evaluates code, but does not execute it": when would one ever want to only evaluate code but not execute it? Put differently: for practical purposes, why would one ever use read_chunk() rather than source?

> This does not recognise the packages included
In your example, first_package and second_package are both available in the working environment for the second code chunk.
Try putting library(nycflights13) in the R file and head(airlines) in the second chunk of the Rmd file. Calling knit("main.Rmd") would fail if the nycflights13 package wasn't successfully loaded with source.
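A minimal sketch of that check (file names as in the question, and nycflights13 assumed to be installed):
config.R
library(nycflights13)
main.Rmd
```{r}
source('config.R')
```
```{r}
head(airlines)  # works: source() attached nycflights13 before this chunk ran
```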
read_chunk does in fact accomplish this (as does source), but the two go about it differently. With source, the global functions are available directly after the source() call (as you have found). With read_chunk, however, since it "only evaluates code, but does not execute it", you need to explicitly execute the chunk before the function becomes available. (See my example with third_config_chunk below: including the empty chunk third_config_chunk in the report allows the global some_function to be called in subsequent chunks.)
Regarding "only evaluates code, but does not execute it", this is an entire property of R programming known as lazy evaluation. The idea being that you may want to create a number of functions or template code which is read into your R environment but is not executed on-the-spot, allowing you to modify the environment/parameters prior to evaluation. This also allows you to execute the same code chunks multiple times whereas source will only run once with what is already provided.
Consider an example where you have an external R script containing a large amount of setup code that isn't needed in your report. This file can be split into many "chunks", which are loaded into the working environment with read_chunk but are not evaluated until explicitly told.
In order to externalise your config.R using read_chunk() you would write the R script as:
config.R
# ---- config_preamble
## setup code that is required for config.R
## to run but not for main.Rmd
# ---- first_config_chunk
library(nycflights13)
library(MASS)
# ---- second_config_chunk
y <- 1
# ---- third_config_chunk
some_function <- function(x) {
  x + y
}
# ---- fourth_config_chunk
some_function(10)
# ---- config_output
## code that is output during `source`
## and not wanted in main.Rmd
print(some_function(10))
To use this script with the externalisation methodology, you would set up main.Rmd as follows:
main.Rmd
```{r, include=FALSE}
knitr::read_chunk('config.R')
```
```{r first_config_chunk}
```
The packages are now loaded.
```{r third_config_chunk}
```
`some_function` is now available.
```{r new_chunk}
y <- 20
```
```{r fourth_config_chunk}
```
## [1] 30
```{r new_chunk_two}
y <- 100
lapply(seq(3), some_function)
```
## [[1]]
## [1] 101
##
## [[2]]
## [1] 102
##
## [[3]]
## [1] 103
```{r source_file_instead}
source("config.R")
```
## [1] 11
As you can see, if you were to source this file, there would be no way to modify the call to some_function prior to execution, and the call would print an output of "11". Now that the chunks are available in the environment, they can be re-called any number of times (after, for example, changing the value of y) or used in any other way in the current environment (e.g. new_chunk_two), which would not be possible with source unless you wanted the rest of the R script to execute as well.

Related

PDFs produced by R having inconsistent MD5 checksum

I'm testing an R package using testthat. Writing tests for an S3 method plot.foo is a huge headache because it simply returns NULL, so I decided to save the plot to a file and check whether it has changed since the last run.
pdf(file='plot_foo.pdf')
plot.foo(bar)
dev.off()
tools::md5sum('plot_foo.pdf')
The problem is each time I'm getting a different result with the same input. The output looks the same, though.
replicate(10, {
  pdf(file='plot.pdf')
  plot(1:10, 10:1)
  dev.off()
  Sys.sleep(1)
  tools::md5sum('plot.pdf')
})
Note that you need to wait a while between iterations; otherwise the files would be identical, which makes me suspect some time-based metadata is being changed.
plot.pdf plot.pdf
"5a0c096fe088342bc3c3d5960c5da1c9" "40d93c26b4901aef55a32b75473d05d2"
plot.pdf plot.pdf
"9815c6d9b2e94cda763a486fcd2ddf08" "a8e8db82d06b79f98416fa034b5aee46"
plot.pdf plot.pdf
"c2770250dbef3b60706559114c434851" "91c8cf124eb61ddebd3edbbb2d01677f"
plot.pdf plot.pdf
"d1594bd83b97fc890410a4c305366682" "f05197f165ec04df3dac4664494f4617"
plot.pdf plot.pdf
"64427124c6a6454e8f0e5944de20be95" "ff1abf2b31dfe688cf8f5994e409cc6d"
How do I force R to produce consistent PDFs? I'm temporarily switching to PostScript for testing purposes, but I'd prefer PDF, as it's better supported (Windows doesn't seem to have a built-in PostScript viewer) and can thus also serve as the document.
While I think it's a little rough on a few things, I think vdiffr is going to let you do what you need.
First, I'm going to create a package; fake for now, but necessary, since vdiffr only works in a tightly-controlled environment: a package using testthat.
usethis::create_package("~/StackOverflow/nalzok")
setwd("~/StackOverflow/nalzok")
usethis::use_testthat()
Create a test_something.R test file.
context("basic plot tests")
baseplot1 <- function() hist(1:10)
vdiffr::expect_doppelganger("base 1", baseplot1)
(I'm going to assume that hist(1:10) is something relevant and interesting. Base plots need to be a function, ggplot2 objects do not; see the docs for more.)
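For instance, a ggplot2 object can be passed to expect_doppelganger directly (a sketch, assuming ggplot2 is installed):
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
vdiffr::expect_doppelganger("ggplot 1", p)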
I had thought I could call vdiffr::expect_doppelganger directly (as many testthat::expect_* functions can be), but it needs to be "managed" (set up) first.
vdiffr::manage_cases(".")
Each of the images needs to be "verified" by a human, so this opens a shiny app that iterates through each of the expected doppelgangers.
After validation, each time you test the package, it will verify that the images have not changed:
devtools::test()
# Loading nalzok
# Testing nalzok
# v | OK F W S | Context
# v | 1 | basic plot tests
# == Results =====================================================================
# OK: 1
# Failed: 0
# Warnings: 0
# Skipped: 0
If something changes (perhaps changing the hist(1:10) to hist(2:11)), it'll fail the next test:
devtools::test()
# Loading nalzok
# Testing nalzok
# v | OK F W S | Context
# x | 0 1 | basic plot tests
# --------------------------------------------------------------------------------
# test_something.R:3: failure: (unknown)
# Figures don't match: base-1.svg
# --------------------------------------------------------------------------------
# == Results =====================================================================
# OK: 0
# Failed: 1
# Warnings: 0
# Skipped: 0
It does this by creating a ./tests/testthat/figs/ directory with a directory and an .svg file for each expectation; while you don't need to interact with it, it would make sense for .../figs/ to be version-controlled (you do version-control your package, right?).
Some caveats, I guess:
- it is saving to .svg files; if your S3 plot.foo function doesn't play well with SVG (does that happen? I don't know), then I don't know (yet) how to deal with that;
- since it's using the text-based SVG format, it will notice if a point or box or something shifts, but only within some basic tolerances; as an example, if even some meta-parameters (limits) are changed sufficiently, it will trigger a failure. This is generally good, since I believe the test should be resilient to minor changes (upstream library, etc.).
hist(1:10) # pass
hist(1:10, xlim=c(0,10)) # pass, that's the default x-limit given the data
hist(1:10, xlim=c(0,10+1e-5)) # pass, close enough?
hist(1:10, xlim=c(0,10+1e-4)) # FAIL
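If you would rather stay with plain md5 checksums, another angle on the original question is to hash the file with its timestamp metadata stripped out first. A rough sketch, assuming the PDF is written with compress = FALSE so that its /CreationDate and /ModDate entries sit in plain text:
pdf(file = 'plot_foo.pdf', compress = FALSE)
plot(1:10, 10:1)
dev.off()

md5_ignoring_dates <- function(path) {
  # drop the info-dictionary timestamp lines before hashing
  lines <- readLines(path, warn = FALSE)
  lines <- lines[!grepl("CreationDate|ModDate", lines)]
  tmp <- tempfile(fileext = ".pdf")
  writeLines(lines, tmp)
  unname(tools::md5sum(tmp))
}

md5_ignoring_dates('plot_foo.pdf')
This is only for comparing runs; the stripped copy is not itself a valid PDF.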

Erroneous code diagnostics report in RStudio when sourcing functions via source

I'm working in RStudio on a simple analysis where I source some files via the source command. For example, I have this file with some simple analysis:
analysis.R
# Settings ----------------------------------------------------------------
data("mtcars")
source("Generic Functions.R")
# Some work ---------------------------------------------------------------
# Makes no sense
mtcars$mpg <- CleanPostcode(mtcars$mpg)
The generic functions file contains some simple functions that I use to create graphs and do repetitive tasks. For example, the CleanPostcode function used above looks like this:
Generic Functions.R
#' The file provides a set of generic functions
# String manipulations ----------------------------------------------------
# Create a clean Postcode for matching
CleanPostcode <- function(MessyPostcode) {
  MessyPostcode <- as.character(MessyPostcode)
  MessyPostcode <- gsub("[[:space:]]", "", MessyPostcode)
  MessyPostcode <- gsub("[[:punct:]]", "", MessyPostcode)
  MessyPostcode <- toupper(MessyPostcode)
  cln_str <- MessyPostcode
  return(cln_str)
}
When I run the first file, the objects are available in the global environment. (There are some other functions in the file, but they are not relevant to the described problem.)
Nevertheless, RStudio sees the object as not available in scope, as indicated by a yellow warning triangle next to the code.
Question
Is there a way to make RStudio stop doing that? Maybe by changing something in the source command? I tried local = TRUE and got the same thing. The code works with no problems; I just find it annoying.
The report was generated on version 0.99.491 of RStudio.
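One workaround, assuming your RStudio version supports its special diagnostics comments, is to suppress the lint for the affected names near the top of the analysis script:
# tell the diagnostics engine about names defined in the sourced file
# !diagnostics suppress=CleanPostcode
source("Generic Functions.R")
mtcars$mpg <- CleanPostcode(mtcars$mpg)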

R code in package vignette cannot run on CRAN for security reasons. How to manage such vignette?

An R package communicates with a commercial database, using a private user name and password to establish the connection.
In the package_vignette.Rmd file there is a chunk of code:
```{r, eval = TRUE}
# set user_name and password from user's configuration file
set_connection(file = "/home/user001/connection.config")
# ask data base for all metrics it has
my_data <- get_all_metrics()
# display names of fetched metrics
head(my_data$name)
```
I do not have the rights to provide the actual user name and password to CRAN, so I cannot supply a genuine 'connection.config' file with the package. This code fragment therefore, of course, leads to an error during CRAN checks.
I know two ways to get around the CRAN check:
1. Use the knitr option eval = FALSE.
2. Make a static vignette with the help of the R.rsp package.
The first way is too time-consuming, because there are a lot of chunks and I rewrite/rebuild the vignette often.
The second way is better for me, but maybe there is a better pattern for supporting such a vignette? For example, in the package's tests I use testthat::skip_on_cran() to avoid CRAN checks.
The easiest way is just to include a dummy data set with your package, either in:
- the data directory, which allows users to easily access it; or
- inst/extdata, where users can still access the file, but it's a bit more hidden. You would find its location using system.file(package="my_pkg") (see the sketch below).
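For the inst/extdata route, the vignette (or a user) would locate and read the file along these lines (the file name example_data.csv is hypothetical):
path <- system.file("extdata", "example_data.csv", package = "my_pkg")
my_data <- read.csv(path)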
In the vignette you would then have something like:
```{r, echo=FALSE}
data(example_data, package="my_pkg")
my_data = example_data
```
```{r, eval = FALSE}
# set user_name and password from user's configuration file
set_connection(file = "/home/user001/connection.config")
# ask data base for all metrics it has
my_data <- get_all_metrics()
```
testthat::skip_on_cran just checks an environment variable:
> testthat::skip_on_cran
function ()
{
    if (identical(Sys.getenv("NOT_CRAN"), "true")) {
        return(invisible(TRUE))
    }
    skip("On CRAN")
}
<environment: namespace:testthat>
From what I gather, this is set by testthat or devtools. Thus, you could use
eval = identical(Sys.getenv("NOT_CRAN"), "true")
in the chunk options and load testthat or devtools in one of the first chunks. Otherwise, you can use a similar mechanism on your side: assign a similar environment variable and check whether it is "true", e.g. with Sys.setenv(IS_MY_COMP = "true"). Then put the Sys.setenv call in your .Rprofile file if you use RStudio, or in your R_HOME/Rprofile.site file. See help("Startup") for information on the latter option.
Alternatively, you can check if "/home/user001/connection.config" exists with
eval = file.exists("/home/user001/connection.config")
in the chunk option.
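Whichever condition you pick, you can avoid editing every chunk by setting it once globally in a setup chunk, since knitr accepts chunk-option defaults via opts_chunk$set:
```{r, include=FALSE}
NOT_CRAN <- identical(Sys.getenv("NOT_CRAN"), "true")
knitr::opts_chunk$set(eval = NOT_CRAN)
```
Individual chunks can still override this with their own eval option.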

retrieve original version of package function even if over-assigned

Suppose I replace a function of a package, for example knitr:::sub_ext.
(Note: I'm particularly interested in the case where it is an internal function, i.e. only accessible by ::: as opposed to ::, but the same answer may work for both.)
library(knitr)
my.sub_ext <- function (x, ext) {
  return("I'm in your package stealing your functions D:")
}
# replace knitr:::sub_ext with my.sub_ext
knitr <- asNamespace('knitr')
unlockBinding('sub_ext', knitr)
assign('sub_ext', my.sub_ext, knitr)
lockBinding('sub_ext', knitr)
Question: is there any way to retrieve the original knitr:::sub_ext after I've done this? Preferably without reloading the package?
(I know some people want to know why I would want to do this so here it is. Not required reading for the question). I've been patching some functions in packages like so (not actually the sub_ext function...):
original.sub_ext <- knitr:::sub_ext
new.sub_ext <- function (x, ext) {
  # some extra code that does something first, e.g.
  x <- do.something.with(x)
  # now call the original knitr:::sub_ext
  original.sub_ext(x, ext)
}
# now set knitr:::sub_ext to new.sub_ext like before.
I agree this is not in general a good idea (in most cases these are quick fixes until changes make their way into CRAN, or they are "feature requests" that would never be approved because they are somewhat case-specific).
The problem with the above is that if I accidentally execute it twice (e.g. it's at the top of a script that I run twice without restarting R in between), then the second time around original.sub_ext is actually the previous new.sub_ext rather than the real knitr:::sub_ext, so I get infinite recursion.
Since sub_ext is an internal function (I wouldn't call it directly, but functions from knitr like knit all call it internally), I can't hope to modify all the functions that call sub_ext to call new.sub_ext manually, hence the approach of replacing the definition in the package namespace.
When you do assign('sub_ext', my.sub_ext, knitr), you are irrevocably overwriting the value previously associated with sub_ext with the value of my.sub_ext. If you first stash the original value, though, it's not hard to reset it when you're done:
library(knitr)
knitr <- asNamespace("knitr")
## Store the original value of sub_ext
.sub_ext <- get("sub_ext", envir = knitr)
## Overwrite it with your own function
my.sub_ext <- function (x, ext) "I'm in your package stealing your functions D:"
assignInNamespace('sub_ext', my.sub_ext, knitr)
knitr:::sub_ext("eg.csv", "pdf")
# [1] "I'm in your package stealing your functions D:"
## Reset when you're done
assignInNamespace('sub_ext', .sub_ext, knitr)
knitr:::sub_ext("eg.csv", "pdf")
# [1] "eg.pdf"
Alternatively, as long as you are just adding lines of code to what's already there, you could add that code using trace(). What's nice about trace() is that, when you are done, you can use untrace() to revert the function's body to its original form:
trace(what = "mean.default",
tracer = quote({
a <- 1
b <- 2
x <- x*(a+b)
}),
at = 1)
mean(1:2)
# Tracing mean.default(1:2) step 1
# [1] 4.5
untrace("mean.default")
# Untracing function "mean.default" in package "base"
mean(1:2)
# [1] 1.5
Note that if the function you are tracing is in a namespace, you'll want to use trace()'s where argument, passing it the name of some other (exported) function that shares the to-be-traced function's namespace. So, to trace an unexported function in knitr's namespace, you could set where = knit.
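For example, to trace knitr's unexported sub_ext this way (the tracer here is just an illustrative message):
library(knitr)
trace(what = "sub_ext",
      tracer = quote(message("sub_ext was called")),
      where = knit)
knitr:::sub_ext("eg.csv", "pdf")  # emits the message, then returns "eg.pdf"
untrace("sub_ext", where = knit)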

plot.MCA() not included in Sweave

I don't seem to be having a problem with any other method of including a plot via Sweave. However, plot.MCA(), a method from the FactoMineR package, seems not to have its plot pulled through. It does create an Rplot.pdf file, but for whatever reason it's not renamed to "RnwFilename-00X.pdf" and not included in the resulting PDF when you compilePdf() it in RStudio.
Here's a trivial example, try it for yourself.
Note you may have to install.packages("FactoMineR") first.
\documentclass[a4paper]{article}
% PREAMBLE
\begin{document}
\begin{center}
<<echo=false,fig=true>>=
library(FactoMineR)
x <- data.frame(
  A=sample(letters[1:3],100,rep=T),
  B=sample(letters[1:4],100,rep=T),
  C=sample(letters[1:3],100,rep=T))
fit.mca <- MCA(x, graph=FALSE)
plot(fit.mca, invisible="ind")
@
\end{center}
\end{document}
Update - more details on the error message:
LaTeX errors:
!pdfTeX error: C:\Program Files (x86)\MiKTeX 2.9\miktex\bin\pdflatex.EXE (file
R:/.../RnwFilename-010.pdf): PDF inclusion: required page does not exist <0>
It works for me if I tell plot.MCA not to create a new device:
plot(fit.mca, invisible = "ind", new.plot = FALSE)
Editorializing a bit: this seems like suboptimal behavior for a plotting function, since most users (and other code, clearly) will expect to rely on R's default action of opening a new device automatically. A plot function should only open a new device if the user has explicitly told it to (either by calling png, pdf, etc. or by actually setting new.plot = TRUE). Opinions may differ on this, though.
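With that fix applied, the figure chunk from the example becomes:
<<echo=false,fig=true>>=
library(FactoMineR)
x <- data.frame(
  A=sample(letters[1:3],100,rep=T),
  B=sample(letters[1:4],100,rep=T),
  C=sample(letters[1:3],100,rep=T))
fit.mca <- MCA(x, graph=FALSE)
plot(fit.mca, invisible = "ind", new.plot = FALSE)
@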
