R Markdown file slow to knit due to large dataset

I am new to R Markdown. Apologies if the question has an obvious answer that I missed.
Context:
I am working with a large dataset in R Markdown (roughly 90 million rows) to produce a short report. While working on the file formatting, I want to knit the final HTML document frequently (e.g., after making a change) to look at the formatting.
Problem:
The problem is that the dataset takes a long time to load, so each knit takes a long time to execute (roughly five to ten minutes). I do need all of the data, so loading a smaller file isn't a workable option. Of course, I am able to run the individual chunks since the data are loaded into the global environment, but formatting is incredibly onerous since it is difficult to visualize the result of formatting changes without looking at the knitted product.
Attempts to solve the issue:
After some research, I found and tried to use cache = TRUE and cache.extra = file.mtime('my-precious.csv') (as per this section of Yihui's Bookdown). However, this option didn't work as it resulted in the following:
Error in lazyLoadDBinsertVariable(vars[i], from, datafile, ascii, compress, :
long vectors not supported yet: connections.c:6073
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> lazyLoadDBinsertVariable
To overcome this error, I added cache.lazy = FALSE into the chunk options (as mentioned here). Unfortunately, while the code worked, the time it took to knit the document did not go down.
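For reference, a minimal sketch of the chunk I ended up with (the read call is just a placeholder for however the file is loaded):

```{r load-data, cache=TRUE, cache.extra=file.mtime('my-precious.csv'), cache.lazy=FALSE}
# placeholder read call; in reality this loads the full ~90-million-row file
my_data <- read.csv('my-precious.csv')
```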
My limited understanding of this process is that setting cache = TRUE and cache.extra = file.mtime('my-precious.csv') leads to a code chunk's results being cached, so that the next time the file is knit, the results from the previous run are loaded. However, because my file is too large, cache = TRUE doesn't work on its own, so I have to use cache.lazy = FALSE, which reverses part of what cache = TRUE does. In the end, this means that the dataset is being loaded into memory each time I knit the file, thereby lengthening the time it takes to knit the document.
Questions to which I seek answers from the R community:
Is there a way to cache the data-loading chunk in R Markdown when the file size is large (~90 million rows)?
Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
Is my understanding of the cache = TRUE method of circumventing the time-intensive data-loading process correct? And if it isn't, why didn't the cache = TRUE method work for me?
Any help is appreciated.

Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
Yes. Perform your computations outside of the Rmarkdown report.
Plots can be saved to files and included in the report via knitr::include_graphics(myfile).
Tables can be saved as smaller summary files, loaded via fread and displayed via kable.
Note that if you need to print tables in a loop, you should specify the results='asis' chunk option:
```{r my_chunk_label, results='asis', echo=FALSE}
for (i in seq_along(data_full)) {
  print(kable(data_full[[i]]))
  cat('\n')
}
```
Run your expensive computations once, save the results. Consume these results with a light Rmarkdown report that is easy to format.
If you still have large csv files to load, you should use data.table::fread, which is much more efficient than the base functions.
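A minimal sketch of this workflow, with illustrative file and column names ('group' stands in for whatever you summarise by):

```r
## heavy_computations.R -- run once, outside the report
library(data.table)
data_full <- fread('my-precious.csv')        # efficient reader for the big file
summary_tbl <- data_full[, .N, by = group]   # whatever summary the report needs
fwrite(summary_tbl, 'summary.csv')

png('my_plot.png')
plot(summary_tbl$N, type = 'h')              # placeholder plot
dev.off()
```

The light report then only consumes the small artefacts:

```{r show-results, echo=FALSE}
summary_tbl <- data.table::fread('summary.csv')
knitr::kable(summary_tbl)
knitr::include_graphics('my_plot.png')
```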
I actually posted a similar question not so long ago. You're not alone.

Related

How to knit Rmarkdown files without the need to run the codes

I am not sure how to express my question in a well-understood manner. Anyway, my problem is that when I knit the R Markdown file, R reruns everything in the file (importing data, running models, etc.), which takes a lot of time. Is there a way to take the output of the models, data frames, graphs, or tables, save them as objects, and then use those objects as they are during knitting, without re-running the process that generated them?
Thanks
I believe that your best option is to use the cache capabilities in RMarkdown: {r cache=TRUE}.
See more here: https://bookdown.org/yihui/rmarkdown-cookbook/cache.html
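For example, a minimal sketch where the model is a stand-in for your slow code:

```{r expensive-step, cache=TRUE}
fit <- lm(dist ~ speed, data = cars)  # re-run only when this chunk's code changes
summary(fit)
```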
I find it's effective to do the data preparation and model fitting in a separate .Rmd or .R file and save the resulting data frames and model objects with save.
The notebook I create with figures and tables simply loads the objects in the first chunk with load. That way I can easily iterate on the visualizations and tables without having to re-run the models every time.
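A minimal sketch of that split, with illustrative file and object names:

```r
## prepare.R -- data preparation and model fitting, run once
model_fit <- lm(dist ~ speed, data = cars)  # stand-in for the slow model fit
save(model_fit, file = "model_fit.RData")
```

and the first chunk of the notebook:

```{r load-objects}
load("model_fit.RData")  # restores 'model_fit' for the figures and tables below
```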
Take a look at R Notebooks:
https://bookdown.org/yihui/rmarkdown/notebook.html
Notebooks are just like markdown, but with exactly the feature you are looking for.

Is there a way to temporarily turn off computationally intense functions when knitting R Markdown files? [duplicate]

I am creating a document using knitr and I am finding it tedious to reload the data from disk every time I parse the document while I'm in development. I've subsetted that datafile for development to shorten the load time. I also have knitr cache set to on.
I tried assigning the data to the global environment using <<-, and using exists with where=globalenv(), but that did not work.
Anyone know how to use preloaded data from the environment in knitr or have other ideas to speed up development?
When a document is knitted, a new environment is created within R, so any objects in the global environment are not passed to the document. This is done intentionally: accidentally referencing an object in the global environment is an easy way to break a reproducible analysis, so starting from a clean session each time means the RMarkdown file runs on its own, regardless of the state of the global environment.
If you do have a use case which justifies preloading the data, there are a few things you can do.
Example Data
Firstly I have created a minimal Rmd file as below called "RenderTest.Rmd":
---
title: "Render"
author: "Michael Harper"
date: "7 November 2017"
output: pdf_document
---
```{r cars}
summary(cars2)
```
In this example, cars2 is a dataset I am referencing from my global session. Run on its own using the "Knit" command in RStudio, this will return the following error:
Error in summary(cars2): object 'cars2' not found: ... withCallingHandlers -> withVisible -> eval -> eval -> summary
Execution halted
Option 1: Manually Call the render function
The render function from rmarkdown can be called from another R script. By default, this does not create a fresh environment for the script to run in, so you can use any objects already loaded. As an example:
```r
# Build file
library(rmarkdown)
cars2 <- cars
render("RenderTest.Rmd")
```
I would, however, be careful doing this. The benefit of using RMarkdown is that it makes reproducibility of the script incredibly easy. As soon as you start relying on external scripts, things become more complicated to replicate, as not all the settings are contained within the file.
Option 2: Save data to an R object
If you have some analysis which takes time to run, you can save the result of the analysis as an R object, and then you can reload the final version of the data into the session. Using my above example:
```{r dataProcess, cache = TRUE}
cars2 <- cars
save(cars2, file = "carsData.RData") # saves the 'cars2' dataset
```
and then we can just reload the data into the session:
```{r}
load("carsData.RData") # reloads the 'cars2' dataset
```
I prefer this technique. The dataProcess chunk is cached, so it is only run if changes are made to the code. The results are saved to a file, which is then loaded by the next chunk. The data still has to be loaded into the session, but you can save the finalised dataset if you need to do any data cleaning.
Option 3: Build the file less frequently
With the updates made to RStudio over the past few years, there is less of a need to continuously rebuild the file. Chunks can be run directly within the file and their output viewed in the editor. This can save you from spending a lot of time optimising the script just to shave a couple of minutes off compiling (which normally makes a good time to get a hot drink anyway!).

R markdown file takes forever to knit to HTML

I have long complicated functions included in my code.
When I try to knit the Markdown file to HTML document, it takes a very long time and still nothing happens.
I tried using cache=TRUE and updating R/RStudio, but it still doesn't work.
Does anyone have any idea what else I could try? Thanks
I am familiar with the situation. I am using Markdown to show some graphs with notes. When compiling the Markdown document, all the code is executed, including the computationally expensive machine learning. To speed up my process, I save the outcomes of my models to data frames with the save function, using the .RData file type. In the Markdown document I use load to bring the data frames into the Markdown environment.
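A minimal sketch of that pattern (file and object names are illustrative):

```r
## train.R -- run the expensive machine-learning step once
results <- data.frame(obs  = cars$dist,
                      pred = predict(lm(dist ~ speed, data = cars)))  # stand-in model
save(results, file = "results.RData")
```

and in the Markdown document:

```{r}
load("results.RData")  # brings 'results' into the Markdown environment
plot(results$obs, results$pred)
```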

R markdown to use variables existing in the environment and not run code again

Perhaps I am not using R Markdown properly, but my first chunk of code loads a very large data set and then does the analysis. Every time I knit the pdf to see what it looks like, it runs all the code again, which takes quite a while. The data is already stored in the environment, so is there a way of getting R to not run all the code again but display the pdf with the alterations made?
In case loading your very large data set is the problem, try specialised packages for reading your data, like readr.
Alternatively, since you are working on the design and presentation of your PDF, you can work on a subset of your data, e.g. only the first 100,000 rows.
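A minimal sketch of that subsetting while you iterate on the layout (the file name is a placeholder):

```r
library(readr)
# read only the first 100,000 rows while working on the formatting
data_subset <- read_csv("very_large_file.csv", n_max = 100000)
```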
Otherwise, I use the following code in my first code chunk
```{r}
library(knitr)
# global settings to create this document
opts_chunk$set(cache = TRUE,
               echo = TRUE,     # set to FALSE to remove the code output
               warning = FALSE,
               message = FALSE)
```
so I don't need to set cache=TRUE in each chunk.
Hope this helps.
My set of tricks is evolving:
a parameter used in the chunk option eval=params$do.readdata (see the sketch after this list)
this type of construct:
```{r}
# 'my_data' and the file names are placeholders
if (file.exists('my_data.RData')) {
  load('my_data.RData', verbose = TRUE)  # or sfarrow::st_read_feather(), etc.
} else {
  my_data <- read.csv('raw_data.csv')    # read in the raw data
}
```
Until recently, I was using a cache option on the chunk, but I have since learned to put the import and processing of the data inside a function, so that the large object is removed from the global environment (if memory is an issue) and only the smaller subset or result needed later is returned. In that case, caching may not be an option.
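A minimal sketch of the params trick mentioned above (names are illustrative). Declare the parameter in the YAML header:

```
---
params:
  do.readdata: true
---
```

then gate the expensive chunk on it, so a knit with do.readdata set to false skips the read entirely:

```{r read-data, eval=params$do.readdata}
my_data <- data.table::fread("big_file.csv")  # placeholder read call
```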
