I have a chunk of SQL in my R Markdown / Notebook Document:
```{sql output.var = "df"}
SELECT * FROM FakeData
WHERE Date >= '2017-01-01'
```
It takes literally 5 minutes to run. Is there an easy way to cache the result of the query without knitting the document or writing the result to a CSV?
I'd perhaps like the cache to live for a couple of hours, or maybe a day (is there a way to adjust this as well?)
If you put cache=TRUE in the chunk options and you are working in RStudio, you can select a section of code and run it directly using the green arrows in the upper right of the R Markdown/knitr console.
{sql output.var = "df", cache=TRUE}
SELECT * FROM FakeData
WHERE Date >= '2017-01-01'
Also, I tend to run a regular R script in another window with everything I am going to use in knitr. I find that I have fewer issues with package availability and caching if the data is stored in the global environment.
If you do it this way, and run with cache=TRUE you will definitely be able to save the data on the first run and skip the waiting next time around.
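If you also want control over how long the cached result stays valid (the "couple of hours, or maybe a day" part of the question), one option is to cache the query result to disk yourself from an R chunk. A minimal sketch, assuming a DBI connection called con; the helper name, cache file, and expiry value are all arbitrary choices:
```{r}
library(DBI)

# Re-run the query only if the cached copy is missing or older than
# max_age_hours; otherwise read the saved result back from disk.
cached_query <- function(con, sql, cache_file = "query_cache.rds", max_age_hours = 6) {
  if (file.exists(cache_file) &&
      difftime(Sys.time(), file.mtime(cache_file), units = "hours") < max_age_hours) {
    return(readRDS(cache_file))   # cache is fresh enough: skip the slow query
  }
  df <- dbGetQuery(con, sql)      # the slow query runs only when needed
  saveRDS(df, cache_file)
  df
}

df <- cached_query(con, "SELECT * FROM FakeData WHERE Date >= '2017-01-01'")
```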
I am new to R Markdown. Apologies if the question has an obvious answer that I missed.
Context:
I am working with a large dataset in R Markdown (roughly 90 million rows) to produce a short report. While working on the file formatting, I want to knit the final HTML document frequently (e.g., after making a change) to look at the formatting.
Problem:
The problem is that the dataset takes a long time to load, so each knit takes a long time to execute (roughly five to ten minutes). I do need all of the data, so loading a smaller file isn't a workable option. Of course, I am able to work on the individual chunks since the data are loaded into the global environment, but formatting is incredibly onerous since it is difficult to visualize the result of formatting changes without looking at the knitted product.
Attempts to solve the issue:
After some research, I found and tried to use cache = TRUE and cache.extra = file.mtime('my-precious.csv') (as per this section of Yihui's Bookdown). However, this option didn't work as it resulted in the following:
Error in lazyLoadDBinsertVariable(vars[i], from, datafile, ascii, compress, :
long vectors not supported yet: connections.c:6073
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> lazyLoadDBinsertVariable
To overcome this error, I added cache.lazy = FALSE into the chunk options (as mentioned here). Unfortunately, while the code worked, the time it took to knit the document did not go down.
My limited understanding of this process is that having cache = TRUE and cache.extra = file.mtime('my-precious.csv') will cause a code chunk's results to be cached so that the next time the file is knit, the results from the previous run are loaded. However, because my file is too large, cache = TRUE doesn't work on its own, so I have to use cache.lazy = FALSE to reverse what is done by cache = TRUE. In the end, this means the dataset is loaded into memory each time I knit the file, thereby lengthening the time it takes to knit the document.
Questions to which I seek answers from the R community:
Is there a way to cache the data-loading chunk in R Markdown when the file size is large (~90 million rows)?
Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
Is my understanding of the cache = TRUE method of circumventing the time-intensive data-loading process correct? And if it isn't, why didn't the cache = TRUE method work for me?
Any help is appreciated.
Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
Yes. Perform your computations outside of the Rmarkdown report.
Plots can be saved into files and included in the report via knitr::include_graphics(myfile) (see the sketch after the table example below).
Tables can be saved into smaller summary files, loaded via fread, and displayed via kable.
Note that if you need to print tables in a loop, you should specify the results='asis' chunk option.
```{r my_chunk_label, results='asis', echo=FALSE}
for (i in 1:length(data_full)) {
  print(kable(data_full[[i]]))
  cat('\n')
}
```
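For the plot side of the same approach, a minimal sketch (the chunk label and file path are hypothetical; it assumes the offline script already wrote the image):
```{r savedPlot, echo=FALSE}
# Include a figure produced earlier by the offline script,
# rather than recomputing it at knit time:
knitr::include_graphics("figures/summary_plot.png")
```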
Run your expensive computations once, save the results. Consume these results with a light Rmarkdown report that is easy to format.
If you still have large csv files to load, you should use data.table::fread which is much more efficient than base functions.
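As a rough sketch of the difference (the file name is hypothetical):
```{r}
library(data.table)
# df <- read.csv("big_file.csv")   # base reader: slow for very large files
df <- fread("big_file.csv")        # multi-threaded, typically much faster
```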
I actually posted a similar question not so long ago. You're not alone.
I am creating a document using knitr and I am finding it tedious to reload the data from disk every time I parse the document while I'm in development. I've subsetted that datafile for development to shorten the load time. I also have knitr cache set to on.
I tried assigning the data to the global environment using <<-, and using exists with where=globalenv(), but that did not work.
Anyone know how to use preloaded data from the environment in knitr or have other ideas to speed up development?
When a document is knitted, a new environment is created within R, and therefore any objects in the global environment will not be passed to the document. This is intentional: accidentally referencing an object in the global environment is an easy way to break a reproducible analysis, so knitting in a clean session each time means the R Markdown file runs on its own, regardless of the global environment.
If you do have a use case which justifies preloading the data, there are a few things you can do.
Example Data
Firstly I have created a minimal Rmd file as below called "RenderTest.Rmd":
---
title: "Render"
author: "Michael Harper"
date: "7 November 2017"
output: pdf_document
---
```{r cars}
summary(cars2)
```
In this example, cars2 is a dataset I am referencing from my global session. Run on its own using the "Knit" command in RStudio, this will return the following error:
Error in summary(cars2): object 'cars2' not found: ... withCallingHandlers -> withVisible -> eval -> eval -> summary
Execution halted
Option 1: Manually Call the render function
The render function from rmarkdown can be called from another R script. By default, this does not create a fresh environment for the script to run in, so you can use any objects already loaded. As an example:
# Build file
library(rmarkdown)
cars2 <- cars
render("RenderTest.Rmd")
I would, however, be careful doing this. Firstly, the benefit of using RMarkdown is that it makes reproducibility of the script incredibly easy. As soon as you start using external scripts, things become more complicated to replicate, as not all the settings are contained within the file.
Option 2: Save data to an R object
If you have some analysis which takes time to run, you can save the result of the analysis as an R object, and then you can reload the final version of the data into the session. Using my above example:
```{r dataProcess, cache = TRUE}
cars2 <- cars
save(cars2, file = "carsData.RData") # saves the 'cars2' dataset
```
and then we can just reload the data into the session:
```{r}
load("carsData.RData") # reloads the 'cars2' dataset
```
I prefer this technique. The dataProcess chunk is cached, so it only runs if changes are made to the code. The results are saved to a file, which is then loaded by the next chunk. The data still has to be loaded into the session, but you can save the finalised dataset if you need to do any data cleaning.
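A variation on the same pattern, if you prefer single-object files, is saveRDS()/readRDS(), which also lets you pick the variable name when reading the data back (a sketch, with an arbitrary file name):
```{r}
# Cached, slow chunk: run the expensive step and store the finished object.
cars2 <- cars
saveRDS(cars2, "carsData.rds")

# Later, fast chunk: read it back under whatever name you like.
cars2 <- readRDS("carsData.rds")
```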
Option 3: Build the file less frequently
With the updates made to RStudio over the past few years, there is less of a need to continuously rebuild the file. Chunks can be run directly within the file and the output viewed in the editor. This can save you a lot of time that would otherwise be spent optimising the script just to shave a couple of minutes off compiling (which normally makes a good time to get a hot drink anyway!).
I wonder whether I can use knitr markdown to just create a report on the fly with objects stemming from my current workspace. Reproducibility is not the issue here. I also read this very fine thread here.
But still I get an error message complaining that the particular object could not be found.
1) Suppose I open a fresh markdown document and save it.
2) Write a chunk that refers to some lm object in my workspace, calling summary(mylmobject).
3) Knit it.
Unfortunately the report is generated, but the regression output cannot be shown because the object could not be found. Note that it works in general if I just save the object to .RData and then load it directly from the markdown file.
Is there a way to use objects in R markdown that are in the current workspace?
This would be really nice to show non R people some output while still working.
RStudio opens a new R session to knit() your R Markdown file, so the objects in your current working space will not be available to that session (they are two separate sessions). Two solutions:
file a feature request to RStudio, asking them to support knitting in the current R session instead of forcibly starting a new session;
knit manually by yourself: library(knitr); knit('your_file.Rmd') (or knit2html() if you want HTML output in one step, or rmarkdown::render() if you are using R Markdown v2)
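As a concrete sketch of that second option (the file name is hypothetical):
```{r}
library(knitr)
knit("your_file.Rmd")        # knit() evaluates chunks in the calling environment,
                             # so objects in your current workspace are visible
knit2html("your_file.Rmd")   # same idea, producing HTML in one step

# R Markdown v2: render() also defaults to the calling environment
rmarkdown::render("your_file.Rmd")
```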
It might be easier to save your data from your other session using:
save.image("C:/Users/Desktop/example_candelete.RData")
and then load it into your markdown file:
load("C:/Users/Desktop/example_candelete.RData")
The Markdownreports package is exactly designed for parsing a markdown document on the fly.
As Julien Colomb commented, I've found the best thing to do in this situation is to save the large objects and then load them explicitly while I'm tailoring the markdown. This is a must if your data is coming through an ODBC and you don't want to run the entire queries repeatedly as you tinker with fonts and themes.
My question(s) might be less general than the title suggests. I am running R on Mac OS X with a MySQL database to store the data. I have been working with Komodo / Sciviews-R for some time. Recently I had the need for auto-generated reports and looked into Sweave. I guess StatET / Eclipse appears to be the "standard" solution for Sweavers.
1) Is it reasonable to switch from Komodo to StatET / Eclipse? I tried StatET before but chose Komodo over StatET because I liked Komodo's calltips / autosuggest and more convenient configuration so much.
2) What's a reasonable workflow to generate Sweave files? Usually I develop my R code first and then care about the report later. I just learned today that in Sweave there is a single file that contains R code and LaTeX code at once, and that from this file the .tex document is created. The example files look handy, but I can't really imagine how to put my 250+ lines of R code into such a file and mix them up with LaTeX.
Is it possible to just enter the qplot() and ggplot() statements into such a document and source the other functionality, like the database connection and intermediate results, from somewhere else?
Or is it just a matter of getting used to the mix of LaTeX and R code?
Thx for any suggestions, hints, links and back-to-the-roots-shout-outs…
You've asked several questions, so here are several answers:
Is StatET/Eclipse the right way to do Sweave?
Not necessarily (note: I'm an avid StatET/Eclipse user and use it for both pure R and Sweave/R and love it; I haven't used Komodo / Sciviews-R). You should be able to run the Sweave command from any R command line, which will generate a .tex file. You can then turn the .tex file into something readable (like a PDF) from any TeX environment.
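A minimal sketch of that workflow from the R console (it assumes a file called report.Rnw in the working directory):
```{r}
Sweave("report.Rnw")            # weaves the .Rnw into report.tex
tools::texi2pdf("report.tex")   # compiles report.tex into report.pdf
```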
What's a good Sweave workflow?
When I want to turn an R script into a Sweave report, I generally start with an empty Sweave template and copy/paste my entire R script into a Sweave R block just after the title, i.e.:
<<label=myEntireRScript, echo=FALSE, include=FALSE>>=
# Insert code here
myTable <- data.frame(...)
myPlot <- qplot(...)
@
Then I go through and find the parts I want to report. For instance, if I want to put a table into the report, I'll cut that part out of the R block and put an xtable block in, and the same for variables and plots.
<<label=myEntireRScript, echo=FALSE, include=FALSE>>=
# Insert code here
@
Put any text I want before my table here, maybe with an inline value via \Sexpr{variable}.
<<label=myTable, results=tex>>=
myTable <- data.frame(...)
print(xtable(myTable, ...), ...)
@
Any text I want before my figure
<<label=myPlot, fig=TRUE>>=
myPlot <- qplot(...)
print(myPlot)
@
You may want to look at these related SO posts. The rest of my post relates to your question 2.
When creating reports with Sweave, I usually keep most of the R code and the report text separate. If the R code is fast to run, then I will include something like the following at the start of the .Rnw file:
<<>>=
source('/path/to/script.r')
@
On the other hand, if the R code takes a long time, I will often include something like the following at the end of the R script:
Sweave('/path/to/report.Rnw'); system('pdflatex report.tex')
That way, I can re-generate the report quickly, without needing to run all the R code again. Then, the only work R has to do in the Sweave file is print tables, make graphs and maybe extract a few figures.
Like nullglob, I prefer to keep the R and Sweave files separate, but I prefer to save the workspace with save.image() rather than source() the file. This avoids re-running the R calculations each time the .Rnw file is compiled (and I always end up tinkering with the typesetting more than I'd like).
My general workflow is to do each paper/project in its own folder with its own R file(s). When the calculation side is "done", I save.image() to store all the workspace variables as-is.
Then, in the .Rnw file in the same directory, I set the working directory with setwd() and load all variables with load(".RData"). Of course, you can change the name you use for your workspace, but I do one workspace per folder and keep the default name. Oh, and if you tinker with the R file, be sure to save the workspace image and watch out for variables that linger in the workspace and the .Rnw file but are no longer part of the R file... this is where the save.image() approach can cause some headaches.
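A rough sketch of that layout (paths and file names are hypothetical):
```{r}
# analysis.R -- run once per project folder, after the heavy calculations
setwd("~/projects/my-paper")
# ... long-running calculations ...
save.image()            # writes every workspace object to .RData in this folder

# First chunk of the .Rnw file in the same folder:
# setwd("~/projects/my-paper")
# load(".RData")        # restores the saved objects for typesetting
```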
I am on a Mac and I suggest TextMate if you're mildly geeky and emacs/ess if you're really geeky. I use vim and command line R, but emacs/ess works best for most. If you're in this for the long haul, I doubt you'll regret learning emacs/ess for R, Sweave, and LaTeX.