Avoid loading data every time in knitr - r

I am creating a document using knitr, and I am finding it tedious to reload the data from disk every time I re-knit the document during development. I have subsetted that datafile for development to shorten the load time. I also have the knitr cache turned on.
I tried assigning the data to the global environment using <<-, and using exists with where=globalenv(), but that did not work.
Does anyone know how to use preloaded data from the environment in knitr, or have other ideas to speed up development?

When a document is knitted, a new, empty environment is created within R, so objects in your interactive global environment are not visible to the document. This is intentional: accidentally referencing an object that only exists in the global environment is an easy way to break a reproducible analysis, so knitting in a clean session each time means the R Markdown file runs on its own, regardless of the global environment settings.
If you do have a use case which justifies preloading the data, there are a few things you can do.
Example Data
Firstly, I have created a minimal Rmd file called "RenderTest.Rmd":
---
title: "Render"
author: "Michael Harper"
date: "7 November 2017"
output: pdf_document
---
```{r cars}
summary(cars2)
```
In this example, cars2 is an object I am referencing from my global session. Run on its own using the "Knit" command in RStudio, this will return the following error:
Error in summary(cars2): object 'cars2' not found: ... withCallingHandlers -> withVisible -> eval -> eval -> summary
Execution halted
Option 1: Manually Call the render function
The render function from rmarkdown can be called from another R script. By default it does not create a fresh environment for the document to run in (chunks are evaluated in the calling environment), so you can use any objects already loaded in your session. As an example:
# Build file
library(rmarkdown)
cars2 <- cars
render("RenderTest.Rmd")
I would, however, be careful doing this. Firstly, a key benefit of using R Markdown is that it makes reproducing the analysis incredibly easy. As soon as the document depends on objects created by external scripts, it becomes harder to replicate, because not all of the setup is contained within the file itself.
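If you do take this route, it helps to make the evaluation environment explicit rather than relying on the default. A minimal sketch using render's envir argument (same file as above):
# Build script
library(rmarkdown)
cars2 <- cars  # object the Rmd file expects to find
# Evaluate the document's chunks in the global environment explicitly
render("RenderTest.Rmd", envir = globalenv())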
Option 2: Save data to an R object
If you have some analysis which takes time to run, you can save the result of the analysis as an R object, and then you can reload the final version of the data into the session. Using my above example:
```{r dataProcess, cache = TRUE}
cars2 <- cars
save(cars2, file = "carsData.RData") # saves the 'cars2' dataset
```
and then we can just reload the data into the session:
```{r}
load("carsData.RData") # reloads the 'cars2' dataset
```
I prefer this technique. The dataProcess chunk is cached, so it is only rerun if the code within it changes. The results are saved to file and then loaded by the next chunk. The data still has to be loaded into the session, but if you do any data cleaning you can save just the finalised dataset, which is typically much quicker to load.
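A closely related variant, not from the original answer but using base R's single-object serialisation, is saveRDS()/readRDS(), which avoids save()'s side effect of restoring objects under fixed names:
```{r dataProcessRDS, cache = TRUE}
saveRDS(cars2, "carsData.rds")  # write the cleaned object to disk
```
```{r}
cars2 <- readRDS("carsData.rds")  # read it back under any name you like
```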
Option 3: Build the file less frequently
With the updates made to RStudio over the past few years, there is less need to continuously rebuild the file. Chunks can be run directly within the editor and the output viewed inline. You can potentially lose a lot of time optimising the script only to save a couple of minutes on compiling (which normally makes a good time to get a hot drink anyway!).

Related

R Markdown file slow to knit due to large dataset

I am new to R Markdown. Apologies if the question has an obvious answer that I missed.
Context:
I am working with a large dataset in R Markdown (roughly 90 million rows) to produce a short report. While working on the file formatting, I want to knit the final HTML document frequently (e.g., after making a change) to look at the formatting.
Problem:
The problem is that the dataset takes a long time to load, so each knit takes a long time to execute (roughly five to ten minutes). I do need all of the data, so loading a smaller file isn't a workable option. Of course, I am able to run the individual chunks since the data are loaded into the global environment, but formatting is incredibly onerous because it is difficult to visualize the result of formatting changes without looking at the knitted product.
Attempts to solve the issue:
After some research, I found and tried to use cache = TRUE and cache.extra = file.mtime('my-precious.csv') (as per this section of Yihui's Bookdown). However, this option didn't work as it resulted in the following:
Error in lazyLoadDBinsertVariable(vars[i], from, datafile, ascii, compress, :
long vectors not supported yet: connections.c:6073
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> lazyLoadDBinsertVariable
To overcome this error, I added cache.lazy = FALSE into the chunk options (as mentioned here). Unfortunately, while the code worked, the time it took to knit the document did not go down.
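For reference, the combined chunk options looked something like this (a sketch; the file name is from the cache.extra example above, and the read function stands in for whatever you use to load the data):
```{r load-data, cache=TRUE, cache.lazy=FALSE, cache.extra=file.mtime('my-precious.csv')}
my_data <- read.csv("my-precious.csv")  # hypothetical load step
```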
My limited understanding of this process is that setting cache = TRUE and cache.extra = file.mtime('my-precious.csv') will cause a code chunk's results to be cached, so that the next time the file is knit, results from the previous run are loaded. However, because my file is too large, cache = TRUE doesn't work, so I have to use cache.lazy = FALSE, which reverses part of what cache = TRUE does. In the end, this means the dataset is loaded into memory each time I knit the file, thereby lengthening the time it takes to knit the document.
Questions to which I seek answers from the R community:
Is there a way to cache the data-loading chunk in R Markdown when the file size is large (~90 million rows)?
Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
Is my understanding of the cache = TRUE method of circumventing the time-intensive data-loading process correct? And if it isn't, why didn't the cache = TRUE method work for me?
Any help is appreciated.
Is there a (better) way to circumvent the time-intensive data-loading process every time I knit the R Markdown file?
Yes. Perform your computations outside of the Rmarkdown report.
Plots can be saved as image files and included in the report via knitr::include_graphics(myfile).
Tables can be saved into smaller summary files, loaded via fread, and displayed via kable.
Note that if you need to print tables in a loop, you should specify the results='asis' chunk option.
```{r my_chunk_label, results='asis', echo=FALSE}
library(knitr)  # for kable()
# kable() returns markdown text; print each table and add a blank
# line so pandoc renders them as separate tables.
for (i in seq_along(data_full)) {
  print(kable(data_full[[i]]))
  cat('\n')
}
```
Run your expensive computations once, save the results. Consume these results with a light Rmarkdown report that is easy to format.
If you still have large csv files to load, you should use data.table::fread which is much more efficient than base functions.
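A minimal sketch of that two-step workflow (file names, the group column, and the summary computation are placeholders). In a separate script, run once:
library(data.table)
big <- fread("huge-file.csv")                 # the slow load (~90M rows)
summary_tbl <- big[, .(n = .N), by = group]   # the expensive computation
fwrite(summary_tbl, "summary.csv")            # small file for the report
png("my-plot.png"); plot(summary_tbl$n); dev.off()  # pre-render the plot
Then the Rmarkdown report only consumes the small outputs:
```{r, echo=FALSE}
library(data.table)
library(knitr)
kable(fread("summary.csv"))       # fast to load and display
include_graphics("my-plot.png")   # pre-rendered figure
```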
I actually posted a similar question not so long ago. You're not alone.

When knitting a markdown file I have an error "object not found" [duplicate]

I have an uncleaned dataset, so I have imported it into RStudio. When I run nrow(adult) in the R Markdown file and press Ctrl+Enter it works, but when I press Knit, an "object not found" error appears.
When you knit something it gets executed in a new environment.
The object adult is in your environment at the moment, but not in the new one knit creates.
You probably did not include the code to read or load adult in the knit.
If you clear your workspace, as per @sebastian-c's comment, you will see that even Ctrl+Enter does not work.
You have to create the adult object inside your knit. For example, if your data is from a csv file, add
adult <- read.csv2('Path/to/file')
in the first chunk.
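To show the placement, the first chunk of the Rmd would look something like this (the path is a placeholder):
```{r load-adult, include=FALSE}
adult <- read.csv2('Path/to/file')  # recreate the object inside the knit
```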
Hope this is clear enough.
Another option, in the same vein as the previous one, but really useful in case you have a lot of different data:
Once you have all your data generated from your R scripts, write in your "normal code" (any of your R scripts):
save.image(file = "my_work_space.RData")
And then, in your R Markdown script, load the image of the data saved previously and the libraries you need.
```{r, include=FALSE}
load("my_work_space.RData")
library(tidyverse)
library(skimr)
library(incidence)
```
NOTE: Make sure to save your data after any modification and before running knitr.
Because I usually have a lot of code that prepares the data variables actually used in the knitr document, my workaround uses two steps:
In the global environment, I save all the objects to a file using
save()
In the knitr code, I load the objects from the file using load()
It is not elegant, but it is the only approach I have found.
I also tried to access the global environment variables using get(), but without success.
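As a concrete sketch of that two-step workaround (object and file names are hypothetical), in the global environment, after the data preparation code has run:
save(clean_data, model_fit, file = "prepared_objects.RData")
and in the first chunk of the knitr document:
```{r, include=FALSE}
load("prepared_objects.RData")  # restores clean_data and model_fit
```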
If you have added eval = FALSE to the chunk in which you created your object, that earlier code won't execute.
So when you use that object in a different chunk, it will fail with an "object not found" message.
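For example, this pair of chunks reproduces the problem (a hypothetical illustration):
```{r create-object, eval=FALSE}
adult <- read.csv("adult.csv")  # never runs during the knit
```
```{r use-object}
nrow(adult)  # fails: object 'adult' not found
```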
When knitting to PDF
```{r setup}
knitr::opts_chunk$set(cache = TRUE)
```
Worked fine.
But not when knitting to Word.
I am rendering to Word. Here's what finally got my data loaded from the default document directory. I put this on the first line of my first chunk.
load("~/filename.RData")

Is there a way to temporarily turn off computationally intense functions when knitting R Markdown files? [duplicate]


Knit error. Object not found


R markdown to use variables existing in the environment and not run code again

Perhaps I am not using R Markdown properly, but my first line of code loads a very large data set and then does analysis. Every time I knit the PDF to see what it looks like, it runs all the code again, which takes quite a while. The data is already stored in the environment, so is there a way of getting R to not run all the code again, but display the PDF with the alterations made?
In case loading your very large data set is the problem, try specialised packages for reading your data, like readr.
Alternatively, since you are working on the design or presentation of your PDF, you can work on a subset of your data, for example only the first 100,000 rows.
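A sketch combining both suggestions (the file name is a placeholder); readr's n_max argument reads only the first rows of the file:
```{r}
library(readr)
dat <- read_csv("big-file.csv", n_max = 100000)  # develop on a subset
```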
Otherwise, I use the following code in my first code chunk
library(knitr)
# global setting to create this document
opts_chunk$set(cache = TRUE,
               echo = TRUE,     # set to FALSE to remove the code output
               warning = FALSE,
               message = FALSE)
so I don't need to set cache=TRUE in each chunk.
Hope this helps.
My set of tricks is evolving:
a parameter to use in the chunk header: eval=params$do.readdata
a construct of this type (fill in your own object and file names):
if (exists('<name of data table>')) {
  load(<file>, verbose = TRUE)  # or st_read_feather(...)
} else {
  <read in data>
}
Until recently, I was using the cache option on the chunk, but I have since learned to put the import and processing of the data inside a function, in order to remove the intermediate objects from the global environment (if memory is an issue) and return only the smaller subset or result that is needed later. So caching may not be an option; a sketch of this combination follows below.
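A minimal sketch of the params and function-wrapping tricks combined (all names are hypothetical). In the YAML header:
params:
  do.readdata: true
and then:
```{r readdata, eval=params$do.readdata}
# Wrap the import/processing in a function so the raw data is freed
# when the function returns; keep only the smaller result.
prepare_data <- function(path) {
  raw <- read.csv(path)              # the expensive load
  raw[raw$keep == TRUE, ]            # return only what the report needs
}
dt <- prepare_data("big-file.csv")
```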
