Knitr compiling and running all at the same time in RStudio - r

For running an Rnw file in RStudio, one can compile or run all. Compiling does not see the variables in the current environment, and the current environment does not see the variables created while compiling. I would like to see how the output would look when I compile, and I debug the code using the environment. This requires me to compile and run, which performs the same calculations twice, which is very impractical for large projects. Is there a way to compile and have the output be seen in the environment?

When you knit a document, the work happens in a different R session, which is why you can't examine the results in the current session.
But you have a lot of choices besides run all. Take a look at the Run button: it allows you to run chunks one at a time, or run all previous chunks, etc.
If some of your chunks take too long to run, then you should consider organizing your work differently. Put the long computations into their own script, and save the results of that script using save(). Run it once, then spend time editing the display of those results in multiple runs in the main .Rnw document.
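A minimal sketch of that split, assuming a hypothetical object name `fit_results` and a temporary file standing in for a real project path. The `lm()` call is only a placeholder for the slow computation:

```r
# --- long_computation.R (hypothetical script, run once) ---
results_file <- file.path(tempdir(), "fit_results.RData")  # assumed location

# Placeholder for the expensive step
fit_results <- lm(mpg ~ wt, data = mtcars)

# Persist the named object so the .Rnw document can reuse it
save(fit_results, file = results_file)

# --- inside a chunk of the main .Rnw document ---
rm(fit_results)       # simulate a fresh knitting session
load(results_file)    # restores `fit_results` under its original name
print(coef(fit_results))
```

In a real project the two halves live in separate files and the path would be a fixed location both can reach, so only the cheap `load()` runs on every knit.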
Finally, if you really want to see variables at the end of a run of your vignette, you can add save.image(file = 'vignette.RData') at the end, and in your interactive session, use load('vignette.RData') to load the values for examination. This won't necessarily give you an accurate view of the state of things at the end of the run: it loads the values in addition to anything you've already got in your workspace, and it doesn't restore option settings or attach packages. But it might be enough for debugging.
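One way to soften the "in addition to anything you've already got" problem is to load the image into its own environment. This is a sketch run at top level, not part of the original answer; `knit_env` and the toy variables `x` and `y` are assumptions made for illustration:

```r
# Stand-in for the end of the knitted document: save the whole workspace
image_file <- file.path(tempdir(), "vignette.RData")
x <- 1:10
y <- mean(x)
save.image(file = image_file)

# In the interactive session: load into a fresh environment so the
# saved values do not overwrite objects already in the workspace
knit_env <- new.env()
loaded_names <- load(image_file, envir = knit_env)

print(loaded_names)   # names of the restored objects
print(knit_env$y)     # inspect one restored value
```

Option settings and attached packages are still not restored, as noted above.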

Related

Do .Rout files preserve the R working environment?

I recently started looking into Makefiles to keep track of the scripts inside my research project. To really understand what is going on, I would like to understand the contents of .Rout files produced by R CMD BATCH a little better.
Christopher Gandrud is using a Makefile for his book Reproducible research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, the third one merges both datasets. They are invoked by the following lines of the Makefile:
# Key variables to define
RDIR = .
# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
	R CMD BATCH $<
Neither of the first two files outputs data, nor does the merge script explicitly import data: it just uses the objects created in the first two scripts. So how is the data preserved between the scripts?
To me it seems like the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other or is it a property of the batch execution itself?
If the working environment is really preserved between the scripts, I see a lot of potential for issues if there are objects with the same names or functions with the same names from different packages. Another issue of this setup seems to be that the Makefile cannot propagate changes in the first two files downstream because there is no explicit input/prerequisite for the merge script.
I would appreciate learning whether my intuition is right and whether there are better ways to execute R files in a Makefile.
By default R CMD BATCH will save your workspace to a hidden .RData file after running unless you choose --no-save. That's why it's not really the recommended way to run R scripts. The recommended way is with Rscript, which will not save by default. You must write code explicitly to save if that's what you want. This is different from the .Rout file, which should only have the output from the commands run in the script.
In this case, execution doesn't happen in the exact same environment. R is still called three times, but that environment is serialized and reloaded between each run.
You are correct that there may be a lot of problems with saving and re-loading workspaces by default. That's why most people recommend you do not do that. But in this case, the author just figured it made things easier for their workflow, so they used it. It would be better to be more explicit about input and output files in general, though.
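A sketch of what "explicit about input and output files" could look like, with the three scripts shown back to back. The file names, the toy data frames, and the use of `saveRDS()`/`readRDS()` are assumptions for illustration, not the book's actual code:

```r
out_dir <- tempdir()  # assumed output directory for this sketch

# --- clean_a.R (hypothetical): writes its result explicitly ---
data_a <- data.frame(id = 1:3, height = c(170, 165, 180))
saveRDS(data_a, file.path(out_dir, "data_a.rds"))

# --- clean_b.R (hypothetical) ---
data_b <- data.frame(id = 1:3, weight = c(70, 60, 85))
saveRDS(data_b, file.path(out_dir, "data_b.rds"))

# --- merge.R (hypothetical): reads its inputs explicitly ---
a <- readRDS(file.path(out_dir, "data_a.rds"))
b <- readRDS(file.path(out_dir, "data_b.rds"))
merged <- merge(a, b, by = "id")
print(merged)
```

With this layout each script can be run with Rscript, and the .rds files can be named as targets and prerequisites in the Makefile, so a change in a cleaning script propagates to the merge step.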

Very simple question on Console vs Script in R

I have just started to learn to code on R, so I apologize for the very simple question. I understand it is best to type your code in as a Script so you can edit and save it. However, when I try to make an object in the script section, it does not work. If I make an object in the console, R saves the object and it appears in my environment. I am typing in a very simple code to try a quick exercise on rolling dice:
die <- 1:6
But it only works in the console and not when typed as a script. Any help/explanation appreciated!
Essentially, you interact with the R environment differently when running an .R script via Rscript.exe than when working at a console with R.exe or Rterm, or in GUI IDEs like RGui or RStudio. (This applies to any programming language with an interactive interpreter, not just R.)
The script does save the die object in the R environment, but only during the run or lifetime of that script (i.e., from the first to the last line of code). Your line of code is simply an assignment of an object. You do nothing with it. Apply some function, output results, or take other actions in that script to see it.
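For instance, a script that does something visible with the object might look like the sketch below. The file name and the choice of `sample()` for rolling are assumptions for illustration:

```r
# dice.R (hypothetical file name)
die <- 1:6            # assignment alone prints nothing

print(die)            # show the object
set.seed(1)           # fixed seed so the rolls are repeatable
rolls <- sample(die, size = 2, replace = TRUE)  # roll two dice
print(rolls)
print(sum(rolls))
```

Run non-interactively, the `print()` calls are what make the results appear in the output.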
On the console, the R environment persists interactively until you quit it with q(). So assigned objects remain for the lifetime of your console session. After assigning, you can apply functions, output results, or take other actions in line-by-line calls.
Ultimately, scripts gather all the line-by-line code in advance of a run for automated execution, without relying on the user to supply lines. Imagine running 1,000 lines of code with nested if/then or for/while loops and apply functions on the console! Therefore, have all your R coding needs handled in scripts.
It is always better to have the script: as you say, you can save, edit, and correct it without having to rewrite the code to change a variable or number.
I recommend using RStudio: it is very practical, will help you program more efficiently, and lets you see, among other things, the different objects that you have created.

Why is source speed different from RStudio console line code?

I have a script with self-written functions (no plots). When I copy-paste that script into the RStudio console, it takes ages to execute, but when I use source("Helperfunctions.R") it doesn't take more than a second.
Question: Where does the difference in speed come from?
I am aware of two differences between running code via the source() function vs. entering code at the R-Studio console:
From ?source:
Since expressions are not executed at the top level, auto-printing is not done.
The way I understand this: source() will not plot graphs (unless made explicit with e.g. print(plot)), while code run in the RStudio console will always plot graphs. I'm sure this will affect the speed of execution to a certain degree, but this seems irrelevant in my case, because there are barely any plot calls.
And:
(...) the complete file is parsed before any of it is run
I have been working with R for a while now, but I'm not sure whether this is relevant for the speed issue I'm having. Is it possible that completely parsing all code "before any of it is run" speeds up the execution of my helper functions script by a factor of a hundred?
Edit: I'm using R version 3.2.3.
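The two ?source behaviours quoted above can be checked with a small sketch. The throwaway scripts and variable names here are made up for illustration, and the sketch says nothing about timing:

```r
# Write a small throwaway script: one assignment and one bare expression
script <- tempfile(fileext = ".R")
writeLines(c("helper_value <- 1 + 1", "helper_value"), script)

# Default: the bare expression is evaluated but not auto-printed
quiet_out <- capture.output(source(script, local = new.env()))

# print.eval = TRUE asks source() to print visible results
loud_out <- capture.output(
  source(script, local = new.env(), print.eval = TRUE)
)

# A file with a syntax error on its last line: nothing runs at all,
# because the whole file is parsed before any expression is evaluated
broken <- tempfile(fileext = ".R")
writeLines(c("ran_first_line <- TRUE", "oops <- ("), broken)
env <- new.env()
result <- tryCatch(
  source(broken, local = env),
  error = function(e) "parse error"
)
first_line_ran <- exists("ran_first_line", envir = env, inherits = FALSE)

print(length(quiet_out))
print(loud_out)
print(first_line_ran)
```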
The issue is not source() vs. console line code. Instead, it is an issue of how RStudio sends code from the source pane to the console.
When I copy the content of Helperfunctions.R and run it in RGui (instead of RStudio), the code is executed with nearly the same speed as when I use source("Helperfunctions.R") in RStudio.
Apparently, lines of code always (?) require more execution time in RStudio than in RGui. Even though you may usually not notice the time difference when executing a couple of lines in the console, it seems to make a huge difference when, say, 3,000 lines of code are being executed in the RStudio console at once.
My understanding is that when source("Helperfunctions.R") is used, the code is not actually sent line by line to the RStudio console (which would have been slow), but is read and executed directly by R.

knitr: keep cache when I make small change in chunk

I understandably broke the cache when updating a chunk (the result should be the same, though; the changes were cosmetic). However, I do not want to run the chunk again because it takes 1 week to run. How can I change the cache so that the new code thinks the cache holds?
I think I just need to change the file names in the cache folder. But I don't know what to change them to without running the code because knitr only writes the files after successful completion of the chunk.
Another motivation is that the knitr cache can be invalidated when using different knitr versions. This happened to me between 1.5 and 1.5.33, the development versions. Also see here: "R knitr: is it possible to use cached results across different machines?" I think a solution to the above would help with this too.
Using the knitr cache to store the results of a week-long simulation sounds a bit crazy and susceptible to disaster.
My suggestion for a safer workflow is:
Run the simulation and store the results in a file (csv, rda, whatever is suitable).
Load that data inside a chunk (probably with echo = FALSE) near the start of your knitr report.
Now simulating and reporting are decoupled.
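A minimal sketch of that decoupling, assuming a hypothetical `run_simulation()` function as a stand-in for the week-long job, and an .rds file in a temporary directory instead of a real project path:

```r
results_path <- file.path(tempdir(), "simulation_results.rds")  # assumed path

# Hypothetical stand-in for the week-long simulation
run_simulation <- function() {
  set.seed(42)
  replicate(100, mean(rnorm(50)))
}

# Step 1 (separate script): run only if no stored results exist yet
if (!file.exists(results_path)) {
  saveRDS(run_simulation(), results_path)
}

# Step 2 (chunk near the start of the report, e.g. with echo = FALSE)
sim_results <- readRDS(results_path)
print(summary(sim_results))
```

Because the stored file does not depend on chunk contents or the knitr version, cosmetic edits to the report no longer trigger a rerun.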

How to continuously restart/loop R script

I want an R script to continuously run and check for files in a folder and do something with those files.
The code simply checks for a file, then moves the file to somewhere else and renames it, deleting the old file (in reality it's a bit more elaborate than this).
If I run the script it works fine; however, I want R to detect the files automatically. In other words, is there a way to have R run the script continuously so that I don't have to run the script myself whenever I put files in that folder?
In pure R you just need an infinite repeat loop...
repeat {
  print('Checking files')
  # Your code to do file manipulation
  Sys.sleep(time = 5)  # pause execution for 5 seconds
}
However there may be better tools suitable to do this kind of file manipulation depending on your OS.
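To make the placeholder in the loop above concrete, here is one possible sketch of the file-handling step. The function name `process_incoming`, the folder names, and the `done_` prefix are all assumptions. The demo uses a bounded `for` loop and a short pause so that it terminates:

```r
# Move every file from `in_dir` to `out_dir`, adding a prefix to the name.
# Returns the new paths of the files that were moved.
process_incoming <- function(in_dir, out_dir, prefix = "done_") {
  files <- list.files(in_dir, full.names = TRUE)
  files <- files[!dir.exists(files)]          # skip subdirectories
  if (length(files) == 0) return(character(0))
  targets <- file.path(out_dir, paste0(prefix, basename(files)))
  moved <- file.rename(files, targets)        # moving removes the original
  targets[moved]
}

# Demo with temporary folders
in_dir <- file.path(tempdir(), "incoming")
out_dir <- file.path(tempdir(), "processed")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines("example", file.path(in_dir, "a.txt"))

for (i in 1:2) {
  moved_files <- process_incoming(in_dir, out_dir)
  if (length(moved_files) > 0) print(basename(moved_files))
  Sys.sleep(0.1)  # short pause for the demo; the loop above waits 5 seconds
}
```

In a long-running script the `for` loop would be replaced by `repeat`, as in the answer above.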
You can use the function tclTaskSchedule from the tcltk2 package to schedule a function or expression to run on a regular interval. You can have multiple such tasks scheduled and still work in the R session (just be careful not to modify something that the scheduled task could also modify or you can get unpredictable results).
That said, an OS-based solution that runs a given R script may still be a better approach.
