Strange output from fread when called from knitr - r

I'm using the recently introduced fread function from data.table to read data files.
When I wrap my code into a knitr (Rmd) document, I noticed some strange output, namely lines like:
##
0%
even though the verbose option of fread was set to FALSE. I've used sink to hide this output, but I'd like to report the exact problem to the package author(s). Here's a minimal example:
library(knitr)
test = "```{r}
require(data.table)
fread('1 2 3\n')
```"
knit2html(text=test, output="test.html")
browseURL("test.html")
What is the 0% output?

It's a % progress counter. For me it prints 0%, 5%, 10%, ... 95%, 100% (for example) with a \r at the end to make it appear on one line just underneath the call to fread when typed at the prompt.
But when called from functions, batches and knitr this is undesirable. This has now been removed. From NEWS for v1.8.9 (rev 851) :
% progress console meter has been removed. The output was inconvenient in batch mode, log files and reports which don't handle \r. It was too difficult to detect where fread is being called from, plus, removing it speeds up fread a little by saving code inside the C for loop (which is why it wasn't made optional instead). Use your operating system's system monitor to confirm fread is progressing. Thanks to Baptiste for highlighting :
Strange output from fread when called from knitr
Just a quick reminder for completeness. From the top of ?fread :
This function is still under development. For example, dates are read
as character (they can be converted afterwards using the excellent
fasttime package or standard base functions) and embedded quotes ("\""
and """") have problems. There are other known issues that haven't
been fixed and features not yet implemented. But, you may find it
works in many cases. Please report problems to datatable-help or Stack
Overflow's data.table tag.
Not for production use yet. Not because it's unstable in the sense
that it crashes or is buggy (your testing will show whether it is
stable in your cases or not) but because fread's arguments and
behaviour is likely to change in future; i.e., we expect to make
(hopefully minor) non-backwards-compatible changes. Why has it been
released to CRAN then? Because a maintenance release was asked for by
CRAN maintainers to comply with new stricter tests in R-devel, and a
few Bioconductor packages depend on data.table and Bioconductor
requires packages to pass R-devel checks. It was quicker to leave
fread in and write these paragraphs, than take fread out.

It isn't a problem to be reported.
As stated by Matthew Dowle, this is a progress counter from fread.
You can set results = 'hide' to avoid these results being included:
library(knitr)
test = "```{r, results = 'hide'}
require(data.table)
fread('1 2 3\n')
```"
knit2html(text=test, output="test.html")
browseURL("test.html")
Look, no progress bar.
At a practical level, I think it would be sensible to have results = 'hide' or even include = FALSE for a step like this.
In practice, you will not want to repeat this kind of reading-in step. You only ever want to read the data in once and then serialize it (using save, saveRDS or similar), so that you can load the serialized copy next time, which would be faster.
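A minimal sketch of that read-once-then-serialize idea, assuming data.table is installed. The file paths and the helper name read_once are made up for illustration, and the small CSV is written only so the example is self-contained:

```r
library(data.table)

# Hypothetical paths, used only for this illustration.
csv_file   <- file.path(tempdir(), "example.csv")
cache_file <- file.path(tempdir(), "example.rds")

# Write a tiny CSV so the sketch can run on its own.
fwrite(data.table(id = 1:3, value = c(2.5, 3.5, 4.5)), csv_file)

# Read the CSV only if no serialized copy exists yet.
read_once <- function(csv_file, cache_file) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))   # fast path: reuse the serialized copy
  }
  DT <- fread(csv_file)           # slow path: parse the text file
  saveRDS(DT, cache_file)         # keep a copy for next time
  DT
}

DT1 <- read_once(csv_file, cache_file)  # parses the CSV, writes the .rds
DT2 <- read_once(csv_file, cache_file)  # loads the .rds instead
```

The second call never touches fread, so whatever fread prints cannot end up in the report on later runs.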
Edit in light of the comment
I would split the processing up into a number of smaller chunks. You could then exclude the reading-in chunk from the output, but include a dummy version that is not evaluated (so you can see the code, but not the results):
```{r libraries}
require(data.table)
```
```{r loaddata, include = FALSE}
DT <- fread('yourfile')
```
```{r loaddummy, ref.label = 'loaddata', eval = FALSE, echo = TRUE}
```
```{r dostuff}
# doing other stuff
```

There is a parameter called showProgress in fread. If you set it to FALSE, you will not see the progress output. (This is useful when writing R Markdown documents.)
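For instance, a small sketch assuming a data.table version recent enough to have the showProgress argument (the inline data is only there to keep the example self-contained):

```r
library(data.table)

# Input containing a newline is treated as literal data rather than a file name.
DT <- fread("a,b\n1,2\n3,4\n", showProgress = FALSE)

# The default is taken from an option, so it can also be switched off
# once for the whole document:
# options(datatable.showProgress = FALSE)
```

Setting the argument (or the option) in a setup chunk avoids having to hide the chunk's results altogether.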

Related

What does cache actually cache in RMarkdown? Errors in attachment

I have just started testing Rmarkdown for use in creating a codebook of a dataset, and I am quite puzzled by its behaviour when using cache = TRUE. I'm running it using RStudio 1.1.463, rmarkdown_1.11, knitr_1.21 and tidyverse_1.2.1.
Take the following sample code which includes some doc and chunk options I'm interested in, attaches all libraries I normally use (noting that I've added "|" in a couple of places for appropriate formatting on SO):
---
title: "Test"
date: 2019-03-11
output:
  html_document
---
```{r header, echo= FALSE, include=FALSE, cache = TRUE, warning= FALSE}
attach(mtcars)
library(sf)
library(tidyverse)
library(knitr)
library(summarytools)
opts_chunk$set(echo = FALSE, error = TRUE)
|```
# mtcars dataset heading
## map of car purchases
## cyl variable
```{r}
kable(descr(cyl))
|```
When I hit the Knit button on RStudio for the first time (without an existing cache folder), the results are as expected. If I hit Knit again, the following happens:
cyl is not found
kable, descr both throw 'could not find function' errors
If the parent packages/dataframes are called explicitly, these problems disappear. If cache = FALSE there are no issues.
Why would cache = TRUE trigger this behaviour? For this codebook, I thought of attaching the final dataset and then presenting some summaries for each variable. I would also like to generate a couple of sf maps with many of the variables. I thought of processing everything in such a header chunk, and then calling on various bits throughout the document. Should I think differently?
Incidentally, I don't quite understand why it is necessary to explicitly library(knitr) on an Rmarkdown document as I thought it was a key package to 'knit' the document... If I remove it, opts_chunk is not found.
Thanks for any help!
I believe cache = TRUE tries to cache the R objects created in a chunk. Your first chunk does a lot more than just create objects: the attach and library calls each have side effects, such as modifying the search list and loading packages. Those side effects aren't cached, but they are needed for your document to work. Since knitr sees no reason to run the chunk again, your document fails on the second run.
You normally use cache = TRUE when the chunk does a long slow computation to produce a dataset for later plotting or summarizing, because then subsequent runs can skip the slow part of the computation.
You ask why library(knitr) is needed. Strictly speaking, it's not needed: you could have used knitr::opts_chunk instead. But more to the point, the idea is that an R Markdown document is a description of a standalone R session. Yes, you need knitr to process it, but it should give the same results as if you just ran the code on its own in an empty session. (This isn't exactly true: knitr's chunk options and hooks modify the behaviour a bit, but it's a convenient mental model of what's going on.)
Loading libraries is not time-consuming.
I usually separate libraries and setup from the data by putting them in two different chunks, and set cache=TRUE only on the data chunk.

print the data.table package's .onAttach messages with knitr

I have a bookdown rmd looking like...
Further introductory materials are offered when the package is loaded:
```{r dt-startup, echo=-1, message=TRUE, verbose=TRUE, hide=FALSE}
if ("data.table" %in% .packages()) detach("package:data.table")
library(data.table)
```
My intention was to show the reader the package's startup messages. However, they don't print. Is there some other chunk option to use here?
As you can see, I just threw several maybe-relevant chunk options at it to no good result. I'm not terribly familiar with management of output streams, so that's as far as I knew to go. I also tried calling directly with data.table:::.onAttach(), but no dice.
I'm not sure what else would be relevant here, but ...
Currently the package has not been loaded before this chunk. I just added the first line for robustness in case I rearrange the document.
My before_chapter_script contains nothing but knitr::opts_chunk$set(comment="#").
My knit header value is bookdown::render_book and the output is bookdown::html_book.
Don't. Anything hacked in for this would be fragile and arguably not terribly useful.
Yihui Xie (knitr's author) makes a good case. My synopsis:
This is not useful. You're writing a tutorial, so why include dynamic content (that may change when the package changes)? Moreover, why not point to resources directly rather than to the list of resources printed there?
This is very hard. It is not just a matter of output streams. The messages don't print because they are walled behind an interactive() check. It's not obvious how this should be overridden and, supposing it could be done, what weird side effects that might introduce.

Make Sweave or knitr put graphics suffix in `\includegraphics{}`

I just ran into the (curious) problem that when submitting a (pdf)LaTeX manuscript to some Elsevier journal, the filenames of figures needed to be complete in order to be found by their PDF building and checking system, i.e.:
\includegraphics{picture.pdf}
Is there any easy and convenient way to tell Sweave or knitr to do that?
Edit:
I'm familiar with Sweave's include=FALSE option
I also feel quite capable of patching utils:::RweaveLatexRuncode
However, for the moment I'm hoping that there's something more convenient and elegant.
It's also about handing out the .Rnw files as supplementary material or vignettes. From a didactic point of view I don't like these tweaks that make the source code much more complicated for the new users of whom I hope they read it.
(Which is also why I really appreciate the recently introduced print=TRUE in Sweave)
You can modify the plot hook a little bit in knitr to add the file extension:
<<>>=
knit_hooks$set(plot = function(x, options) {
  x = paste(x, collapse = '.') # x is file.ext now instead of c(file, ext)
  paste0('\\end{kframe}', hook_plot_tex(x, options), '\\begin{kframe}')
})
@
See 033-file-extension.Rnw for a complete example. To understand what is going on behind the scenes, see the source code of the default LaTeX hooks in knitr.
A brute force solution is to explicitly create the files yourself in the R snippet. Set the option for graphics etc to false but have the code evaluated so that the file is created, and then have latex call them with the very \includegraphics{} call you show.
I used similar schemes for simple caching: if the target file exists, skip the code creating it.
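A rough sketch of that brute-force scheme in base R. The file name and the helper make_figure are made up for illustration; in a real document the file would sit next to the .Rnw rather than in tempdir():

```r
# Hypothetical target file; in practice this would be e.g. "picture.pdf"
# in the document directory.
fig_file <- file.path(tempdir(), "picture.pdf")

# Create the figure only if it does not exist yet (simple caching).
# Returns TRUE if the file was created, FALSE if it was already there.
make_figure <- function(path) {
  if (file.exists(path)) {
    return(invisible(FALSE))
  }
  pdf(path, width = 5, height = 4)  # open the PDF device
  on.exit(dev.off())                # always close it again
  plot(1:10, (1:10)^2, type = "b", xlab = "x", ylab = "x squared")
  invisible(TRUE)
}

created_first  <- make_figure(fig_file)  # draws and writes the file
created_second <- make_figure(fig_file)  # skipped, file already exists
```

With the chunk's own figure handling switched off, the LaTeX source can then call \includegraphics{picture.pdf} with the full file name.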

what does it mean that "cacheSweave doesn't cache side-effects"?

I'm using cacheSweave, but I don't think I get how everything works. I've tried to separate the code into simulation chunks and plotting chunks, but some of the code is very long and was written before I started the Sweave document, so I instead use something like
<<foo,cache=TRUE>>=
source("mainScript.R")
@
<<plot,fig=TRUE>>=
a<- print(str(F1))
plot(F1)
@
The thing is, mainScript.R is somewhat convoluted simulation code including plot functions and so on. I've read in the cacheSweave vignette that "cacheSweave doesn't cache side-effects" and that plots are not cached, so I was wondering if the plotting functions in mainScript.R affect how the expressions are evaluated?
This might be an obvious question. Let's say I have another chunk after the two above. All of the results of the expressions in both "foo" and "plot" can be used in this new chunk, right? For example,
<<post-chunk>>=
print(a)
print(str(F1))
@
See Wikipedia for a full explanation. Some common side-effects in R include printing objects, drawing plots, writing files and loading packages.
The cacheSweave package only enables you to skip computation, and you have to lose all side-effects. As Dason commented, the knitr package is much more natural in terms of caching -- what you see in an uncached chunk will be seen in the cached chunk. The caching of side-effects in knitr is explained in its manual and the cache page in the website.
BTW, knitr keeps compatibility with Sweave and cacheSweave, so hopefully you do not need to do anything for the transition; just call library(knitr); knit('file.Rnw').
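To make "skip computation, lose side-effects" concrete, here is a toy sketch in base R. It is not how cacheSweave is implemented; the helper cached_eval and the in-memory cache are made up purely to illustrate the idea:

```r
# A toy in-memory cache, keyed by name (illustration only).
cache <- new.env()

# Evaluate expr only if nothing is stored under key; otherwise reuse the value.
# expr is a lazily evaluated argument, so it is not run on a cache hit.
cached_eval <- function(key, expr) {
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, expr, envir = cache)
  }
  get(key, envir = cache)
}

# First run: the code executes, so its printed output (a side effect) appears.
first <- capture.output(
  v1 <- cached_eval("sim", { cat("running simulation\n"); sum(1:10) })
)

# Second run: the value comes from the cache and nothing is printed.
second <- capture.output(
  v2 <- cached_eval("sim", { cat("running simulation\n"); sum(1:10) })
)
```

The value is the same both times, but the printed line only shows up on the first run, which is the kind of side-effect that a value-only cache drops.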

Silencing a package load message in Sweave

I'm loading optmatch in a Sweave document as follows:
<<myCodeBlock, echo=FALSE>>=
library(optmatch, quietly=TRUE)
@
You're loading optmatch, by Ben Hansen, a package for flexible
and optimal matching. Important license information:
The optmatch package makes essential use of D. P. Bertsekas
and P. Tseng's RELAX-IV algorithm and code, as well as
Bertsekas' AUCTION algorithm and code.
Bertsekas and Tseng freely permit their software to be used for
research purposes, but non-research uses, including the use of it
to 'satisfy in any part commercial delivery requirements to
government or industry,' require a special agreement with them.
By extension, this requirement applies to any use of the
fullmatch() function. (If you are using another package that has
loaded optmatch, then you will probably be using fullmatch indirectly.)
For more information, enter relaxinfo() at the command line
As you can see, I've tried every way I can think of to silence the package load message, to no avail. I assume this is because they just used a straight-up cat() or something like that, but it's mighty annoying. Any thoughts on how to silence this so that those reading my final, beautiful, LaTeXified PDF don't have to read about RELAX-IV?
Other things that don't seem to work (taken from Andrie's pointer to a related thread):
suppressMessages(library(optmatch))
suppressPackageStartupMessages(require("optmatch"))
I should note this is pretty obviously an R problem not a Sweave problem, as the messages pop up in R also.
Try loading the package in a hide results chunk:
<<packages,results=hide>>=
require(optmatch)
@
If you use the knitr package, you need to quote hide:
<<packages,results='hide'>>=
require(optmatch)
@
Here is an R solution to your problem. The package author uses cat to print the messages to the console, rather than using standard message statements. You can intercept these messages by using sink to divert console output to a temporary file:
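<<myCodeBlock, echo=FALSE>>=
zz <- tempfile()
sink(file=zz)                    # divert console output to a temporary file
library(optmatch, quietly=TRUE)
sink()                           # restore normal console output
unlink(zz)                       # delete the temporary file
@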
PS. The solution by @XuWang uses only Sweave, so is clearly much more suitable in your case.
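The difference between cat() and message() can be checked without optmatch. The following is only a sketch in base R; chatty and quiet are made-up helpers standing in for a package that prints its startup text with cat():

```r
# A stand-in for a package that prints with cat() as well as message().
chatty <- function() {
  cat("printed with cat()\n")
  message("signalled with message()")
  invisible("done")
}

# suppressMessages() deals with message(), but cat() still writes to stdout.
leaked <- capture.output(res1 <- suppressMessages(chatty()))

# Diverting stdout with sink() while the code runs catches cat() output too.
quiet <- function(expr) {
  tmp <- tempfile()
  sink(tmp)                         # divert stdout to a temporary file
  on.exit({ sink(); unlink(tmp) })  # restore output and clean up
  invisible(expr)                   # expr is evaluated here, while diverted
}

silent <- capture.output(res2 <- suppressMessages(quiet(chatty())))
```

Here leaked still contains the cat() line while silent is empty, which matches what was observed with suppressMessages(library(optmatch)).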

Resources