Rmd: re-run all chunks that call read_csv [r]

I have an Rmd document with several code blocks that call read_csv on many different CSVs. The Rmd makes various graphs and tables, and I use cache=TRUE to speed up the rendering.
Another program produces the CSVs and generates different results depending on the experiments and configurations, so I want the Rmd to reload the CSVs that have changed and use the cache for those that have not.
At the moment I put a chunk option, {r lastrun=100}, on each code block that contains a read_csv, and I search for lastrun and replace it (e.g. with 101) in the blocks I think should be reread, but I'd like this to be automated somehow.
So right now my Rmd looks like:
```{r lastrun=100}
a<-map_dfr(paste0(1:10, ".a.csv"), read_csv)
```
lots of text, including other code blocks:
```{r}
#whatever
print(f(a))
```
```{r lastrun=100}
b<-map_dfr(paste0(1:20, ".b.csv"), read_csv)
```
So after rendering that, if any of *.a.csv or *.b.csv later change, I have to search/replace lastrun or I'll just see the stale cached versions of a and b. I want a and b to update when the files change (I don't need to identify the exact file that changed and reread only that one; reloading the whole block would be fine).
How can this be done?
Thanks
-Neal

Here's a trick that might work:
---
title: 'nothing to report'
output: html_document
---
```{r setup}
files <- c("file1", "file2")
fileinfo <- file.info(files)
```
```{r test1, file=fileinfo['file1',-6], cache = TRUE}
Sys.sleep(5)
```
```{r test2, file=fileinfo['file2',-6], cache = TRUE}
Sys.sleep(5)
```
To test:
1. Render this; it should take 10 seconds.
2. Render it again; it should be instantaneous.
3. Create one or both of the files file1 and file2.
4. Render it again; it should take 5 seconds (if one file was created) or 10 seconds (if both were).
5. Touch one of the files (update its mtime or add/replace contents, either works).
6. Render it again; there should be a pause.
Notes:
- The files do not need to exist at first; a row of NAs in fileinfo changes when a file "appears", which invalidates the cache just the same.
- Files can be in the current directory, a subdirectory, or at any absolute path; the row names of fileinfo are the names as given in my files variable.
- With [,-6] I'm omitting atime, which records when something last "looked at" the file; it updates on every file.info call, so including it would defeat the purpose of the cache (invalidating on every render).
- Other than that, the chunk currently invalidates on a change in any of size, isdir, mode, mtime, ctime, and exe, i.e. all remaining fields of file.info; you can easily select specific fields instead of omitting one, as I did above.
- A chunk can depend on multiple files: either be creative with the file= expression so it grabs multiple rows, or have the setup block aggregate the summaries.
- On a network drive, fields like mtime are sometimes unreliable; in that case, consider something based on md5sum instead.
(You will likely want the code in the {r setup} block to be hidden.)
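Building on the last two notes, a sketch of one way to aggregate several files into a single invalidation key, and to sidestep unreliable mtimes on network drives, is to hash the files. The helper name `fingerprint` is my own invention, not part of knitr:

```r
# Sketch: hash a set of files into one value that changes whenever any of
# them changes. tools::md5sum() returns NA for files that do not exist
# yet, so a file "appearing" also busts the cache.
fingerprint <- function(paths) {
  unname(tools::md5sum(paths))
}
```

A data-loading chunk could then be written as `{r a, file=fingerprint(paste0(1:10, ".a.csv")), cache=TRUE}`: the option value, and hence the cache, changes whenever any of the hashed files does.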

Related

Abstracts in rmarkdown that include numbers generated from the rmd itself?

So basically I have been writing a paper in R Markdown. The paper includes an abstract containing numbers/results that are generated from code chunks within the markdown itself. Up to now, the workaround has been to place the abstract at the end of the paper, so that all the code chunks run and the results are generated before they are needed in the abstract.
Now that I am working on the final drafts, it would be ideal to have the abstract at the beginning. Is this even possible?
Thank you!
If your values won't change from run to run, one option is to use knitr::load_cache to load values from the cache of later chunks in your abstract section. The main downside is that this will only work on the second time knitting the document. The first time, load_cache will give NULL, then the later chunk will be run and the value cached. The second time, the cache will exist and will be used in the abstract.
```{r abstract}
y = knitr::load_cache('test-a', 'y')
print(y)
```
```{r test-a, cache=TRUE}
y = 2*pi
```
The first time you knit it, the abstract prints NULL, because the cache does not exist yet. Knit it again and the abstract prints the cached value, 6.283185.
This is kind of awkward, but it was the solution recommended by Yihui Xie, the creator of knitr. See this GitHub issue: https://github.com/yihui/knitr/issues/868#issuecomment-68129294
You have to be careful with cached chunks – make sure that there is nothing that would change between runs and clear the cache before doing your final (2-step) knitting.

Can I generate separate HTML file for each header using rmarkdown::render()?

I generate reports using the rmarkdown::render() function on a list of .Rmd files, and I get one HTML file for each of them.
That was fine until my dataset got bigger and my reports started containing >100 figures... The HTML files often end up being >100MB, and I now have some very big ones (~500MB).
The .Rmd is separated into several chunks, so one might think I should split the .Rmd into smaller files (say, one chunk per file).
This is not (easily) doable because the .Rmd defines a data-processing workflow (figures generated in chunk 3 require processing done in chunks 1 and 2).
I would like to know if it is possible to split the rendering into several HTML files automatically.
Ideally, I dream of a 'splitHeader' argument in render() that would generate a separate HTML file for each header of a specified level.
I guess an ugly solution would be to manually add conditional statements for every chunk/header I would like rendered (or not), and call render() several times with different arguments. But this is extremely inefficient (and ugly, I said that already)...
Would somebody have suggestions for achieving that?
I am not sure if this solves it (or at least helps): you can have multiple independent .Rmd files (children) dividing the content as you like. In a "mother" file, you add a child using:
```{r child = "yourChild.Rmd"}
```
The child .Rmd files should not contain any header information. That is, delete the first lines in your .Rmd that look something like:
---
title: "Your Title"
author: "Your name"
output: html_notebook
---
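If the goal is separate HTML files rather than one combined document, the same child files can also be rendered individually. A minimal sketch, under my own naming (`render_children` is a hypothetical helper; the `render` argument defaults to rmarkdown::render but is injectable, which also makes the loop easy to test):

```r
# Sketch: render each child .Rmd on its own, producing one HTML file per
# section. In real use, the default rmarkdown::render does the work.
render_children <- function(children, render = rmarkdown::render) {
  for (f in children) {
    render(f, output_format = "html_document")
  }
  # return the expected output file names
  sub("\\.Rmd$", ".html", children)
}
```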

Tangle knitr code blocks into not one but several files

I am a new knitr user. I know that knitr can "tangle out" (a term taken from the literate-programming community), i.e. extract, source code blocks into an R script file. As an org-mode user, I am used to being able to specify a file for each code block, potentially the same file for different blocks. When tangling or extracting source in org-mode, instead of one output code file, several code files are produced (this helps with modularity in large projects).
I wonder if something similar is possible in knitr? Can I specify the output file in knitr on a block by block basis?
There are at least two different readings of your question, each requiring slightly different workflows.
If each chunk is going to be written into a separate output document, then to assist modularity, you should split the reporting part down into multiple documents. Since knitr supports child documents, you can always recombine these into larger documents in any combinations that you like.
If you want conditional execution of some chunks, and there are a few different combinations of conditions that can be run, use an R Markdown YAML header, and include a params element.
---
params:
  report_type: "weekly" # should be "weekly" or "yearly"
---
You can set which chunks are run by setting the eval and include chunk options.
```{r, some_chunk, eval = params$report_type == "weekly", include = params$report_type == "weekly"}
# chunk contents
```
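Combined with a render loop, this produces one output file per configuration. A hedged sketch (the helper name and "report.Rmd" are my own illustrative choices; the `render` argument is injectable for testing and defaults to rmarkdown::render):

```r
# Sketch: render the same document once per report type, overriding the
# YAML params each time so only the matching chunks run.
render_reports <- function(types, render = rmarkdown::render) {
  outputs <- paste0("report-", types, ".html")
  for (i in seq_along(types)) {
    render("report.Rmd",
           params = list(report_type = types[i]),
           output_file = outputs[i])
  }
  outputs
}
```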

Put figure directly into Knitr document (without saving file of it in folder) Part 2

I am extending a question I recently posted here (Put figure directly into Knitr document (without saving file of it in folder)).
I am writing an R package that generates a .pdf file for users that outputs summarizations of data. I have a .Rnw script in the package (here, my MWE of it is called test.Rnw). The user can do:
1) knit("test.Rnw") to create a test.tex file
2) "pdflatex test.tex" to create the test.pdf summary output file.
The .Rnw file generates many images. Originally, these all got saved in the current directory. Having the images saved there (along with the .aux and .log files that pdflatex creates) does not seem as tidy as it could be, since users must remember to delete them. I also worry that this untidiness may cause issues when scripts are run multiple times.
So, in my previous post, we improved the .Rnw file by saving the images to a temporary folder. I have been told the files in the temporary folder get deleted each time a new R session is opened. However, I still worry about certain things:
1) I feel I may need to insert a line like the commented-out one in my MWE below:
system(sprintf("%s", paste0("rm -r ", temppath, "/*")))
to automatically delete the files in the temporary folder each time the .Rnw file is run (so that the images are deleted not only when R restarts). This will keep the current directory clean of the images, and the user will not have to remember to delete them manually. However, I do not know whether this "solution" will pass CRAN standards, since it deletes files on the user's system, which could cause problems if other programs are writing files to the temporary folder. I recall reading that CRAN does not allow packages to write to or delete files from the user's computer, for obvious reasons. How strict would CRAN be about such a practice? Is there a safe way to go about it?
2) If writing and then deleting the image files in a temporary folder will not work, what is another way to achieve the same effect (running the script without cumbersome image files accumulating in the folder)? Is it possible to have the images embedded directly in the output file, without saving them to any directory? I am fairly sure this is not possible. However, I have been told it is possible with .Rmd, and that I could convert my .Rnw to .Rmd. That may be difficult because the .Rnw file must follow certain formats (text and margins) for the correct output, and it is very long. Is it possible to use the .Rmd capability (of inserting images directly into the output) only for the chunks that generate images, without rewriting the entire .Rnw file?
Below is my MWE:
\documentclass[nohyper]{tufte-handout}
\usepackage{tabularx}
\usepackage{longtable}
\setcaptionfont{% changes caption font characteristics
\normalfont\footnotesize
\color{black}% <-- set color here
}
\begin{document}
<<setup, echo=FALSE>>=
library(knitr)
library(xtable)
library(ggplot2)
# Specify directory for figure output in a temporary directory
temppath <- tempdir()
# Erase all files in this temp directory first?
#system(sprintf("%s", paste0("rm -r ", temppath, "/*")))
opts_chunk$set(fig.path = temppath)
@
<<diamondData, echo=FALSE, fig.env = "marginfigure", out.width="0.95\\linewidth", fig.cap = "The diamond dataset has variables depth and price.", fig.lp="mar:">>=
print(qplot(depth,price,data=diamonds))
@
<<echo=FALSE,results='asis'>>=
myDF <- data.frame(a = rnorm(1:10), b = letters[1:10])
print(xtable(myDF, caption= 'This data frame shows ten random variables from the distribution and a corresponding letter', label='tab:dataFrame'), floating = FALSE, tabular.environment = "longtable", include.rownames=FALSE)
@
Figure \ref{mar:diamondData} shows the diamonds data set, with the
variables price and depth. Table \ref{tab:dataFrame} shows letters a through j,
each corresponding to a random draw from a normal distribution.
\end{document}
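As an aside on point 1 above: instead of shelling out to rm -r (which is not portable across platforms), base R's unlink() can clear a dedicated figure directory. A sketch, assuming a subdirectory of tempdir() reserved for these figures (the directory name is my own choice):

```r
# Sketch: keep figures in a private subdirectory of the per-session temp
# directory and clear only that subdirectory, never anything else.
figdir <- file.path(tempdir(), "test-figures")
unlink(figdir, recursive = TRUE)         # drop leftovers from a previous run
dir.create(figdir, showWarnings = FALSE)
# then, in the setup chunk: opts_chunk$set(fig.path = file.path(figdir, ""))
```

Because unlink() only touches a path the package itself chose inside the session-specific temp directory, it avoids the risk of deleting another program's files.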

How to cache knitr chunks across two (or more) files?

I want to use some R-Code in two different *.Rnw files and want to use caching across those files.
I read http://yihui.name/knitr/demo/externalization/
Caching within one file works fine, but when running the second file the whole code is executed again:
plain.R
## @knitr random1
a <- rnorm(10)
a
doc1.Rnw (and doc2.Rnw)
\documentclass{article}
<<set-options, echo=FALSE, cache=FALSE>>=
options(replace.assign=TRUE)
opts_chunk$set(external=TRUE, cache=TRUE, echo=FALSE, fig=TRUE)
read_chunk('plain.R')
@
\title{Doc 1}
\begin{document}
<<random1>>=
@
\end{document}
Is there a way to share the cache across several documents?
It is entirely possible to reuse the cache across multiple source documents. Please read the cache page carefully to understand when the cache will be rebuilt. In your case, the cache should not be rebuilt unless your two documents have different chunk options (condition 1) or a different getOption('width') (condition 3), since your code remains the same (condition 2).
You have to post a reproducible example, otherwise this is not considered a real question.
After completely resetting the example, it turned out that the cache is reused by both files. I'm not sure what caused the problem before...
But in a bigger project the chunks are not cached, so I'm still not sure what causes the problem – maybe just a different number of spaces...
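When the cache does get rebuilt unexpectedly, one thing that can help is making the shared location explicit: point both documents at the same cache directory via knitr's cache.path chunk option. A sketch of the set-options chunk (the directory name is my own choice):

```
<<set-options, echo=FALSE, cache=FALSE>>=
options(replace.assign=TRUE)
opts_chunk$set(external=TRUE, cache=TRUE, echo=FALSE, fig=TRUE,
               cache.path='shared-cache/')  # identical in doc1.Rnw and doc2.Rnw
read_chunk('plain.R')
@
```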
