What does cache actually cache in R Markdown? Errors in attachment

I have just started testing R Markdown for use in creating a codebook for a dataset, and I am quite puzzled by its behaviour when using cache = TRUE. I'm running RStudio 1.1.463 with rmarkdown_1.11, knitr_1.21 and tidyverse_1.2.1.
Take the following sample, which includes some document and chunk options I'm interested in and attaches all the libraries I normally use:
---
title: "Test"
date: 2019-03-11
output:
  html_document
---
```{r header, echo= FALSE, include=FALSE, cache = TRUE, warning= FALSE}
attach(mtcars)
library(sf)
library(tidyverse)
library(knitr)
library(summarytools)
opts_chunk$set(echo = FALSE, error = TRUE)
```
# mtcars dataset heading
## map of car purchases
## cyl variable
```{r}
kable(descr(cyl))
```
When I hit the Knit button in RStudio for the first time (without an existing cache folder), the results are as expected. If I hit Knit again, the following happens:

- cyl is not found
- kable and descr both throw 'could not find function' errors

If the parent packages/data frames are referenced explicitly (e.g. mtcars$cyl, knitr::kable), these problems disappear. With cache = FALSE there are no issues.
Why would cache = TRUE trigger this behaviour? For this codebook, I thought of attaching the final dataset and then presenting some summaries for each variable. I would also like to generate a couple of sf maps from many of the variables. My plan was to process everything in such a header chunk and then call on various bits throughout the document. Should I think differently?
Incidentally, I don't quite understand why it is necessary to call library(knitr) explicitly in an R Markdown document, as I thought it was the key package that 'knits' the document. If I remove it, opts_chunk is not found.
Thanks for any help!

I believe cache = TRUE tries to cache the R objects created in a chunk. Your first chunk does a lot more than just create objects: the attach() and library() calls each have side effects (modifying the search path, loading packages, etc.). Those side effects aren't cached, but they are needed for your document to work. Since knitr sees no reason to run the chunk again, your document fails on the second run.
You normally use cache = TRUE when the chunk does a long slow computation to produce a dataset for later plotting or summarizing, because then subsequent runs can skip the slow part of the computation.
You ask why library(knitr) is needed. Strictly speaking, it's not: you could have used knitr::opts_chunk instead. But more to the point, the idea is that an R Markdown document is a description of a standalone R session. Yes, you need knitr to process it, but it should give the same results as if you just ran the code on its own in an empty session. (This isn't exactly true: knitr's chunk options and hooks modify the behaviour a bit, but it's a convenient mental model of what's going on.)

Loading libraries is not time-consuming.
I usually separate libraries and setup from data processing in two different chunks, and set cache = TRUE only on the latter.
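For instance, a minimal sketch of that layout (the chunk labels and the summary computed are illustrative, not from the original answer):

```{r setup, include=FALSE, cache=FALSE}
# fast: load packages and set options on every run; never cache these side effects
library(tidyverse)
library(knitr)
opts_chunk$set(echo = FALSE, error = TRUE)
```

```{r load-data, cache=TRUE}
# slow: the result is a plain R object, so it caches safely
cyl_summary <- mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))
```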

Related

R Markdown / Quarto - Avoid loading package every time I knit a document

Simple question but can't find a solution.
I have a Quarto document (but this applies to R Markdown as well) in which I use R to execute some code. Obviously, in the first chunk of the document, I load the packages needed, for example:
```{r setup}
library(tidyverse)
library(survival)
library(survminer)
```
Now, every time I knit the file to render the document, these packages are loaded, which can be pretty time-consuming, especially if you have a long list of packages to import. Using cache=TRUE doesn't seem to work properly. Is there any way to avoid loading the packages every time I knit the document, and only load them when they are not already loaded in the environment, or at least only on the first knit of the session?
The usual way to run Quarto or RMarkdown is in a clean session, not in the current session. So you normally only have a minimal set of packages already loaded.
If you run rmarkdown::render( ... ) in a session, that doesn't happen, and things will run in the current session. That will speed up library() calls a lot, because they do nothing if the package has already been attached to the search list.
I don't know if something similar is available for Quarto, but in any case, it's a risky strategy: what you hope for from an RMarkdown or Quarto document is something that is reproducible. If you run in the current session, you run the risk of getting results that depend on variables in the current session.
I'd advise you to identify which packages are slow to load, and to follow @Arthur's suggestion from a comment: precompute the objects that those packages produce, and just load them in the document. Then you may not need the package at all.
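A sketch of that precompute-and-load approach, using the packages from the question (object and file names are illustrative):

```r
# precompute.R: run once, outside the document, after the slow work
library(survival)
fit <- coxph(Surv(time, status) ~ age, data = lung)
coefs <- as.data.frame(summary(fit)$coefficients)  # a plain data frame
saveRDS(coefs, "cox_coefs.rds")
```

and in the document itself, load the finished object instead of the package:

```r
# no survival/survminer needed just to display the results
coefs <- readRDS("cox_coefs.rds")
knitr::kable(coefs)
```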

How to request an early exit when knitting an Rmd document?

Let's say you have an R markdown document that will not render cleanly.
I know you can set the knitr chunk option error to TRUE to request that evaluation continue, even in the presence of errors. You can do this for an individual chunk via error = TRUE or in a more global way via knitr::opts_chunk$set(error = TRUE).
But sometimes there are errors that are still fatal to the knitting process. Two examples I've recently encountered: trying to unlink() the current working directory (oops!) and calling rstudioapi::getVersion() from inline R code when RStudio is not available. Is there a general description of these sorts of errors, i.e. the ones beyond the reach of error = TRUE? Is there a way to tolerate errors in inline R code vs in chunks?
Also, are there more official ways to halt knitting early or to automate debugging in this situation?
To exit early from the knitting process, you may use the function knitr::knit_exit() anywhere in the source document (in a code chunk or inline expression). Once knit_exit() is called, knitr will ignore all the rest of the document and write out the results it has collected so far.
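For example, a minimal sketch: the chunk output up to the call is kept, and everything after it, including later chunks and text, is dropped from the output.

```{r}
summary(mtcars$mpg)   # rendered normally
knitr::knit_exit()    # knitting stops here
```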
There is no way to tolerate errors in inline R code at the moment. You need to make sure inline R code always runs without errors.[1] If errors do occur, you should see the range of lines that produced the error in the knitr log in the console, of the form Quitting from lines x1-x2 (filename.Rmd). Then you can go to the file filename.Rmd and see what is wrong with the lines from x1 to x2. The same thing applies to code chunks with the chunk option error = FALSE.
Beyond the types of errors mentioned above, it may be tricky to find the source of the problem. For example, when you unintentionally unlink() the current directory, it should not stop the knitting process, because unlink() succeeded anyway. You may run into problems after the knitting process, e.g., LaTeX/HTML cannot find the output figure files. In this case, you can try to apply knit_exit() to all code chunks in the document one by one. One way to achieve this is to set up a chunk hook to run knit_exit() after a certain chunk. Below is an example of using linear search (you can improve it by using bisection instead):
```r
#' Render an input document chunk by chunk until an error occurs
#'
#' @param input the input filename (an Rmd file in this example)
#' @param compile a function to compile the input file, e.g. knitr::knit or
#'   rmarkdown::render
knit_debug = function(input, compile = knitr::knit) {
  library(knitr)
  lines = readLines(input)
  chunk = grep(all_patterns$md$chunk.begin, lines)  # line numbers of chunk headers
  knit_hooks$set(debug = function(before) {
    if (!before) {
      chunk_current <<- chunk_current + 1
      if (chunk_current >= chunk_num) knit_exit()
    }
  })
  opts_chunk$set(debug = TRUE)
  # try to exit after the i-th chunk and see which chunk introduced the error
  for (chunk_num in seq_along(chunk)) {
    chunk_current = 0  # a chunk counter, incremented after each chunk
    res = try(compile(input))
    if (inherits(res, 'try-error')) {
      message('The first error came from line ', chunk[chunk_num])
      break
    }
  }
}
```
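Hypothetical usage, assuming the failing file is mydoc.Rmd:

```r
knit_debug("mydoc.Rmd")                               # for plain knitr documents
knit_debug("mydoc.Rmd", compile = rmarkdown::render)  # for rmarkdown documents
```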
[1] This is by design. I think it is a good idea to have error = TRUE for code chunks, since sometimes we want to show errors, for example, for teaching purposes. However, if I allowed errors for inline code as well, authors might fail to recognize fatal errors in the inline code. Inline code is normally used to embed values inline, and I don't think it makes much sense if an inline value is an error. Imagine a sentence in a report like "The P-value of my test is ERROR": if knitr didn't signal the error, it would require the authors to read the report output very carefully to spot this issue. I think it is a bad idea to have to rely on human eyes to find such mistakes.
IMHO, difficulty debugging an Rmd document is a warning that something is wrong. I have a rule of thumb: Do the heavy lifting outside the Rmd. Do rendering inside the Rmd, and only rendering. That keeps the Rmd code simple.
My large R programs look like this.
```r
data <- loadData()
analytics <- doAnalytics(data)
rmarkdown::render("theDoc.Rmd", envir = analytics)
```
(Here, doAnalytics returns a list or environment. That list or environment gets passed to the Rmd document via the envir parameter, making the results of the analytics computations available inside the document.)
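One caveat worth noting: render()'s envir argument expects an environment, so if doAnalytics() returns a plain list, one possibility (a sketch, not part of the original answer) is to convert it first:

```r
# wrap the list in a fresh environment before handing it to render()
rmarkdown::render("theDoc.Rmd",
  envir = list2env(analytics, envir = new.env(parent = globalenv()))
)
```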
The doAnalytics function does the complicated calculations. I can debug it using the regular tools, and I can easily check its output. By the time I call rmarkdown::render, I know the hard stuff is working correctly. The Rmd code is just "print this" and "format that", easy to debug.
This division of responsibility has served me well, and I can recommend it. Especially compared to the mind-bending task of debugging complicated calculations buried inside a dynamically rendered document.

How to use objects from global environment in Rstudio Markdown

I've seen similar questions on Stack Overflow but virtually no conclusive answers, and certainly no answer that worked for me.
What is the easiest way to access and use objects (regression fits, data frames, other objects) that live in the global R environment from an R Markdown (RStudio) script?
I find it surprising that there is no easy solution to this, given the tendency of the RStudio team to make things comfortable and effective.
Thanks in advance.
For better or worse, this omission is intentional. Relying on objects created outside the document makes your document less reproducible: if your document needs data in the global environment, you can't just give someone (or yourself in two years) the document and data files and let them recreate it themselves.
For this reason, and in order to perform the render in the background, RStudio actually creates a separate R session to render the document. That background R session cannot see any of the environments in the interactive R session you see in RStudio.
The best way around this problem is to take the code you used to create the contents of your global environment and move it inside your document (you can use echo = FALSE if you don't want it to show up in the document). This makes your document self-contained and reproducible.
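For instance, a minimal sketch of that approach, recreating the object inside the document instead of relying on the interactive session (the model is illustrative):

```{r, echo = FALSE}
# code that previously lived in the interactive session, now self-contained
fit <- lm(mpg ~ wt, data = mtcars)
```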
If you can't do that, there are a few approaches you can take to use the data in the global environment directly:

1. Instead of using the Knit HTML button, type rmarkdown::render("your_doc.Rmd") at the R console. This will knit in the current session instead of a background session.
2. Save your global environment to an .RData file prior to rendering (using R's save function), and load it in your document; see the sketch after this list.
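A sketch of the second approach (the object and file names are illustrative):

```r
# in the interactive session, before knitting:
save(fit, clean_df, file = "session_objects.RData")
```

and then inside the document:

```{r, include = FALSE}
load("session_objects.RData")
```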
Well, in my case I found the following solution:
(1) Save your global environment in an .RData file inside the same folder as your .Rmd file. (You just need to click the save (diskette) icon on the 'Environment' panel.)
(2) Write the following code in your R Markdown script:
load(file = "filename.RData") # loads the file that you saved before
and stop suffering.
Going to RStudio's 'Tools' > 'Global Options' and visiting the 'R Markdown' tab, you can make a selection under 'Evaluate chunks in directory': select the option 'Document' and the R Markdown knitting engine will access the global environment just as plain R code does. Hope this helps those searching for this info!
The thread is old but in case anyone's still looking for a solution (as I was):
You can pass an envir parameter to render() (or to knit()) so that the document can access objects from the environment it was called from.
```r
rmarkdown::render(
  input = input_rmd,
  output_file = output_file,
  envir = parent.frame()
)
```
I have the same problem myself. Some objects are pretty time-consuming to reproduce every time.
I think there could be another answer: what if you save your environment with the save.image() function to a different file than the standard .RData one, then bring it back with load()?
To be sure you are using the same data, check the file with md5sum() from the tools package.
Cheers, Cord
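A sketch of that idea (the file name and the recorded checksum are illustrative placeholders):

```r
# once, in the interactive session:
save.image(file = "analysis_env.RData")
tools::md5sum("analysis_env.RData")  # note down the checksum it prints

# in the document, verify the file is unchanged before loading it:
recorded_md5 <- "0123456789abcdef0123456789abcdef"  # placeholder: paste the real checksum
stopifnot(tools::md5sum("analysis_env.RData") == recorded_md5)
load("analysis_env.RData")
```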
I think I solved this problem by referring to the package explicitly in the code that is being knitted. Using the yarrr package, for example, I loaded the data frame pirates using data(pirates). This worked fine at the console and within an RStudio code chunk, but with knitr it failed, following the pattern in the question above. If, however, I loaded the data into memory by creating an object with pirates <- yarrr::pirates, the document then knitted cleanly to HTML.
You can load the script in the desired environment as follows:
```{r, include=FALSE}
source("your-script.R", local = knitr::knit_global())
# or sys.source("your-script.R", envir = knitr::knit_global())
```
Next in the R Markdown document, you can use objects created in these scripts (e.g., data objects or functions).
https://bookdown.org/yihui/rmarkdown-cookbook/source-script.html
One option that I have not yet seen mentioned is the use of parameters.
This chapter goes through a simple example of how to do this.
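As a minimal sketch of a parameterized document (the parameter name and value are illustrative): values declared under params in the YAML header become available as the params list inside the document.

---
title: "Report"
params:
  cutoff: 20
---

```{r}
# `params` is created automatically from the YAML header when knitting
subset(mtcars, mpg > params$cutoff)
```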

Making knitr run an R script: do I use read_chunk or source?

I am running R version 2.15.3 with RStudio version 0.97.312. I have one script that reads my data from various sources and creates several data.tables. I then have another r script which uses the data.tables created in the first script. I wanted to turn the second script into a R markdown script so that the results of analysis can be outputted as a report.
I do not know the purpose of read_chunk, as opposed to source. My read_chunk call is not working, but source works. In either case I do not get to see the objects in the workspace panel of RStudio.
Please explain the difference between read_chunk and source. Why would I use one or the other? And why will my .Rmd script not work?
Here is a ridiculously simplified sample. It does not work; I get the following message:

```
Error: object 'z' not found
```
Two simple files...
test of source to rmd.R
```r
x <- 1:10
y <- 3:4
z <- x*y
```
testing source.Rmd
Can I run another script from Rmd
========================================================
Testing if I can run "test of source to rmd.R"
```{r first part}
require(knitr)
read_chunk("test of source to rmd.R")
a <- z-1000
a
```
The above worked only if I replaced "read_chunk" with "source". I can use the vectors outside of the code chunk, as in inline usage. So here I will tell you that the first number is `r a[1]`. The most interesting thing is that I cannot see the variables in the RStudio workspace, but they must be there somewhere.
read_chunk() only reads the source code (for future reference); it does not evaluate code the way source() does. The purpose of read_chunk() is explained on this page as well as in the manual.
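To make the difference concrete, here is a minimal sketch of the intended read_chunk() workflow (the file and label names are illustrative). The external script marks chunks with `## ---- label ----` comments:

```r
# analysis.R
## ---- make-z ----
x <- 1:10
y <- 3:4
z <- x * y
```

The Rmd then reads the script once and runs the labelled code by giving an empty chunk that label:

```{r, include=FALSE}
knitr::read_chunk("analysis.R")
```

```{r make-z}
```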
There isn't an option to run a chunk interactively from within knitr AFAIK. However, this can be done easily enough with something like:
```r
#' Run a previously loaded chunk interactively
#'
#' Takes labeled code loaded with read_chunk and runs it in the global
#' environment (unless otherwise specified)
#'
#' @param chunkName the name of the chunk, as a character string
#' @param envir the environment in which the chunk is to be evaluated
run_chunk <- function(chunkName, envir = .GlobalEnv) {
  chunkName <- unlist(lapply(as.list(substitute(.(chunkName)))[-1], as.character))
  eval(parse(text = knitr:::knit_code$get(chunkName)), envir = envir)
}
```
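Hypothetical usage, assuming a chunk labelled first-part has already been loaded with read_chunk():

```r
run_chunk("first-part")                     # evaluate in the global environment
run_chunk("first-part", envir = new.env()) # or in a throwaway environment
```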
In case it helps anyone else, I've found using read_chunk() to read a script without evaluating can be useful in two ways. First, you might have a script with many chunks and want control over which ones run where (e.g., a plot or a table in a specific place). I use source when I want to run everything in a script (for example, at the start of a document to load a standard set of packages or custom functions). I've started using read_chunk early in the document to load scripts and then selectively run the chunks I want where I need them.
Second, if you are working with an R script directly or interactively, you might want a long preamble of code that loads packages, data, etc. Such a preamble, however, could be unnecessary and slow if, for example, prior code chunks in the main document already loaded data.

Strange output from fread when called from knitr

I'm using the recently introduced fread function from data.table to read data files.
When I wrap my code into a knitr (Rmd) document, I noticed some strange output, namely lines like:

```
##
0%
```
even though the verbose option of fread was set to FALSE. I've used sink to hide this output, but I'd like to report the exact problem to the package author(s). Here's a minimal example:
````r
library(knitr)
test = "```{r}
require(data.table)
fread('1 2 3\n')
```"
knit2html(text=test, output="test.html")
browseURL("test.html")
````
What is the 0% output?
It's a % progress counter. For me it prints 0%, 5%, 10%, ... 95%, 100% (for example) with a \r at the end to make it appear on one line just underneath the call to fread when typed at the prompt.
But when called from functions, batch jobs and knitr this is undesirable, so it has now been removed. From the NEWS for v1.8.9 (rev 851):
The % progress console meter has been removed. The output was inconvenient in batch mode, log files and reports which don't handle \r. It was too difficult to detect where fread was being called from; plus, removing it speeds up fread a little by saving code inside the C for loop (which is why it wasn't made optional instead). Use your operating system's system monitor to confirm fread is progressing. Thanks to Baptiste for highlighting:
Strange output from fread when called from knitr
Just a quick reminder for completeness. From the top of ?fread:

This function is still under development. For example, dates are read as character (they can be converted afterwards using the excellent fasttime package or standard base functions) and embedded quotes (`\"` and `""`) have problems. There are other known issues that haven't been fixed and features not yet implemented. But, you may find it works in many cases. Please report problems to datatable-help or Stack Overflow's data.table tag.

Not for production use yet. Not because it's unstable in the sense that it crashes or is buggy (your testing will show whether it is stable in your cases or not) but because fread's arguments and behaviour are likely to change in future; i.e., we expect to make (hopefully minor) non-backwards-compatible changes. Why has it been released to CRAN then? Because a maintenance release was asked for by CRAN maintainers to comply with new stricter tests in R-devel, and a few Bioconductor packages depend on data.table and Bioconductor requires packages to pass R-devel checks. It was quicker to leave fread in and write these paragraphs than take fread out.
It isn't a problem to be reported. As stated by Matthew Dowle, this is a progress counter from fread.
You can set results = 'hide' to avoid this output being included:
````r
library(knitr)
test = "```{r, results = 'hide'}
require(data.table)
fread('1 2 3\n')
```"
knit2html(text=test, output="test.html")
browseURL("test.html")
````
Look, no progress bar.
At a practical level, I think it would be sensible to use results = 'hide' or even include = FALSE for a step like this.
You will not want to repeat this kind of reading-in step: practically, you only ever want to read the data in once, then serialize it (using save, saveRDS or similar) so you can use that next time, which will be faster.
Edit in light of the comment
I would split the processing up into a number of smaller chunks. You could then exclude the reading-in chunk from the output, but include a dummy version of it that is not evaluated (so you can see the code, but not include the results):
```{r libraries}
require(data.table)
```
```{r loaddata, include = FALSE}
DT <- fread('yourfile')
```
```{r loaddummy, ref.label = 'loaddata', eval = FALSE, echo = TRUE}
```
```{r dostuff}
# doing other stuff
```
fread has a parameter called showProgress; if you set it to FALSE, you will not see the progress output. (This is useful when knitting R Markdown documents.)
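For example (a sketch; the file name is illustrative):

```r
library(data.table)
DT <- fread("yourfile.csv", showProgress = FALSE)  # no progress meter in the knit log
```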
