Best way to utilize bash script (Ubuntu), Rscript, pdflatex, and API - r

I'm currently writing some code that:
1. connects to a server via an API and fetches a bunch of data,
2. organizes that data by case ID,
3. generates an individual case report,
4. creates one pdf (case overview) file per case, and finally
5. pushes these files back to the server.
I'm quite familiar with R and somewhat familiar with pdflatex. I've just found out about bash scripts, as I have started to work in an Ubuntu environment, and I am now starting to realize that it is not obvious which programs are best suited for each part of the job.
My current plan is to fetch the data using RCurl in R, organize the data in R, and generate a bunch of .tex files. After that I plan to use pdflatex to create the pdf files, and finally use R again to push the newly created pdf files back to the server. I've started writing a small bash script:
for f in *.Rnw
do
    # do something on ${f%%.*}
    Rscript -e 'source("fetch.data.and.generate.Rnw.R")' # 1 through 3
    Rscript -e "library(knitr); knit('${f%%.*}.Rnw')" # 4
    pdflatex "${f%%.*}.tex" # 4 continued
    rm "${f%%.*}.tex" "${f%%.*}.aux" "${f%%.*}.log" "${f%%.*}.out" # cleanup after 4
    Rscript -e 'source("push.pdf.R")' # 5
done
I hoped someone out there could advise me on which software is best suited for the individual parts of the job and what would give me the best performance.
The data is not that extensive: I will be working with about 500 to 2,000 cases and approximately 20 to 30 variables.

@flodel and @shellter make excellent points. I'll only add that, if you decide to keep using bash in your solution, you might find it easier to compute your filename stem once and then use that variable elsewhere:
for f in *.Rnw; do
    stem="${f%%.*}"
    Rscript -e 'source("fetch.data.and.generate.Rnw.R")'   # fetch data, generate .Rnw
    Rscript -e "library(knitr); knit('$stem.Rnw')"
    pdflatex "$stem.tex"
    Rscript -e 'source("push.pdf.R")'                       # push $stem.pdf
    rm "$stem.tex" "$stem.aux" "$stem.log" "$stem.out"
done
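If you would rather not call pdflatex yourself, knitr can also drive the LaTeX compilation for you; a minimal sketch, assuming a generated file such as case-overview.Rnw (a placeholder name):
# knits the .Rnw and then compiles the resulting .tex to pdf
# e.g. from the shell: Rscript -e 'knitr::knit2pdf("case-overview.Rnw")'
knitr::knit2pdf("case-overview.Rnw")
That would let you drop the separate pdflatex call, though you may still want to clean up the auxiliary files yourself.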

Related

Do .Rout files preserve the R working environment?

I recently started looking into Makefiles to keep track of the scripts inside my research project. To really understand what is going on, I would like to understand the contents of .Rout files produced by R CMD BATCH a little better.
Christopher Gandrud is using a Makefile for his book Reproducible Research with R and RStudio. The sample project (https://github.com/christophergandrud/rep-res-book-v3-examples/tree/master/data) has only three .R files: two of them download and clean data, the third one merges both datasets. They are invoked by the following lines of the Makefile:
# Key variables to define
RDIR = .
# Run the RSOURCE files
$(RDIR)/%.Rout: $(RDIR)/%.R
	R CMD BATCH $<
Neither of the first two scripts writes out any data, nor does the merge script explicitly import data - it just uses the objects created in the first two scripts. So how is the data preserved between the scripts?
To me it seems like the batch execution happens within the same R environment, preserving both objects and loaded packages. Is this really the case? And is it the .Rout file that transfers the objects from one script to the other or is it a property of the batch execution itself?
If the working environment is really preserved between the scripts, I see a lot of potential for issues if there are objects with the same names, or functions with the same names from different packages. Another issue with this setup seems to be that the Makefile cannot propagate changes in the first two files downstream, because there is no explicit input/prerequisite for the merge script.
I would appreciate learning whether my intuition is right and whether there are better ways to execute R files from a Makefile.
By default, R CMD BATCH will save your workspace to a hidden .RData file after running unless you pass --no-save. That's why it's not really the recommended way to run R scripts. The recommended way is with Rscript, which will not save by default; you must write code explicitly to save if that's what you want. This is different from the .Rout file, which should only contain the output from the commands run in the script.
In this case, execution doesn't happen in the exact same environment: R is still started three times, but the workspace is serialized to .RData after each run and reloaded at the start of the next.
You are correct that there may be a lot of problems with saving and re-loading workspaces by default. That's why most people recommend you do not do that. But in this case, the author just figured it made things easier for their workflow, so they used it. It would be better to be more explicit about input and output files in general, though.
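If you want the hand-off between scripts to be explicit instead of relying on the saved workspace, a minimal sketch (object and file names are placeholders, not taken from the book's repository):
## end of one of the download/clean scripts
cleaned_data <- data.frame(id = 1:3, value = rnorm(3))  # stand-in for the real cleaning step
saveRDS(cleaned_data, "cleaned_data.rds")               # explicit output file

## top of the merge script
cleaned_data <- readRDS("cleaned_data.rds")             # explicit input file
With explicit .rds files, the merge target in the Makefile can list them as prerequisites, so changes in the upstream scripts propagate downstream.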

Run multiple R scripts with exiting/restarting in between on Linux

I have a series of R scripts for doing the multiple steps of data analysis that I require. Some of these take a very long time and create really large objects. I've noticed that if I just source all of them in a row (via a main.R script), the processing for later steps takes much longer than if I source one script, save what I need, and restart R for the next step (loading the data I need).
I was wondering if there was a way, via Rscript or a bash script perhaps, that I could carry this out. There would need to be objects that persist for the first two scripts (which load my external data and create the objects that will be used for all further steps). I suppose I could also just save those and load them in further scripts.
(I would also like to pass a number of named arguments to this script; I think I can find how to do that in other SO posts and can use something like optparse.)
So, the script would look something like this, I think:
#! /bin/bash
Rscript 01_load.R # Objects would persist, ideally
Rscript 02_create_graphs.R # Objects would persist, ideally
Rscript 03_random_graphs.R # contains code to save objects
#exit R
Rscript 04_permutation_analysis.R # would have to contain code to load data
#exit
And so on. Is there a solution to this? I'm using R 3.2.2 on 64-bit CentOS 6. Thanks.
Chris,
it sounds like you should do some manual housekeeping between (or within) your steps by using gc() and maybe also rm(). For more details see help(gc) and help(rm).
So instead of exiting R and restarting it again, you could do:
rm(list = ls())
gc()
But please note: rm(list = ls()) would throw away all your objects. Better to create a suitable list of the objects you really want to throw away and pass that list to rm().
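For example (the object names are made up), you can build that list by keeping only what the later steps need:
# keep only the objects needed downstream and drop everything else
keep <- c("graph_list", "node_attributes")
rm(list = setdiff(ls(), keep))
gc()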

R parallel system call on files

I have to convert a large number of RAW images and am using the program DCRAW to do that. Since this program is only using one core I want to parallelize this in R. To call this function I use:
system("dcraw.exe -4 -T image.NEF")
This outputs a file called image.tiff in the same folder as the NEF file, which is totally fine. Now I have tried multiple R packages to parallelize this, but I only get nonsensical returns (probably caused by me). I want to run a large list (1000+ files, obtained by list.files()) through this system call in R.
I could only find info on parallel programming for variables within R but not for system calls. Anybody got any ideas? Thanks!
It doesn't matter whether you use variables or system calls. Assuming you're not on Windows (where mclapply cannot fork and so runs serially), on any decent system you can run
parallel::mclapply(Sys.glob("*.NEF"),
                   function(fn) system(paste("dcraw.exe -4 -T", shQuote(fn))),
                   mc.cores = 8, mc.preschedule = FALSE)
It will run 8 jobs in parallel. But then you may as well not use R and run
ls *.NEF | parallel -u -j8 'dcraw.exe -4 -T {}'
instead (using GNU parallel).
On Windows I use a modification of this solution (the top voted one) to run many commands with no more than, say, 4 or 8 simultaneously:
Parallel execution of shell processes
It's not an R solution, but I like it.
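If you need to stay inside R on Windows, a rough sketch (the worker count is arbitrary and the dcraw call mirrors the answer above) is to use a PSOCK cluster, which launches separate R processes instead of forking:
library(parallel)
cl <- makeCluster(4)   # 4 worker R processes
files <- Sys.glob("*.NEF")
parLapply(cl, files, function(fn) system(paste("dcraw.exe -4 -T", shQuote(fn))))
stopCluster(cl)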

How to drop into R shell after executing commands from file in R

In Python, running the interpreter with the -i flag first executes the script, then drops back into the interpreter
python -i hello.py
Hello world
>>> print("Python ftw")
Python ftw
>>>
which allows me to type commands and access the variables after execution.
With R, this seems to be surprisingly difficult. I have been searching online for some time and am surprised to see that there are not many results for the keywords "R run file shell interpreter".
With R, you can use
$ R -f myfile.R, which executes the file and then exits the interpreter, or
$ Rscript myfile.R, which does the same thing.
Even worse, no plots are shown when the file is run like this; it just exits without any sign that something has been plotted.
So, to repeat my question:
How do I make R drop into the R shell after running commands from a file, a.k.a. a script?
At the same time, how can I make R actually show the plots and not close them immediately?
I can do this with Python, MATLAB, Octave, Ruby and many others, and should be able to do it with R too.
I will answer your two questions separately:
How do I drop into a shell after my script has executed?
The function "browser" called with no arguments will allow to to drop into a shell on the line that it's called. Appending this to your script should do the trick.
How do I save graphics when not run in interactive mode?
First, check that there isn't a pdf file being created in your working directory. Depending on how you're running R, I believe it may be named "Rplots.pdf". Personally, however, I prefer to explicitly save graphics to a particular file, as such:
pdf("temp.pdf")
plot(rnorm(100))
dev.off()
which will save the plot in a new file called temp.pdf (and will overwrite any existing file by that name, so watch out).
Functions analogous to pdf() exist for other image formats if you would prefer those.

Using knitr with cluster computing

I have an R script which needs to be run repeatedly. (For concreteness, I am talking about 500-1000 independent, computationally intensive MCMC chains which I want to summarize in just a few key plots at the end.) My school has a server with a queuing system that makes these computations feasible. Right now I submit multiple jobs to the "short" queue since it is less overburdened than the "multicore" or "long" job queues. I have been running it by having the R script called multiple times, so I am submitting 50 jobs of 10 chains apiece and saving the results to a single output file by appending. This is my job submission code:
for ARRAYVAR in `seq 1 1 50`
do
    bsub -q short -u me#school.edu R CMD BATCH "CODE.R --args arg1 = $ARRAYVAR"
done
ARRAYVAR is used only for setting the random number seed. Once the jobs have all completed, the plotting is then done in a separate script.
For homework assignments and previous research, I have used knitr with RStudio to combine LaTeX notes with my R code. The end result is a single .Rnw file that generates a reproducible document containing all notes, code, and results. I liked that approach much better since I could always be sure the plots/results corresponded to the code version I saw in front of me. Is it possible to do something similar here so that there is one file that I could re-run to reproduce my findings? I am new to using the cluster and R without RStudio.
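One pattern worth considering (a sketch under assumed file and column names, not a tested setup for your cluster) is to keep the heavy MCMC runs as batch jobs that append to a results file, and to have a single .Rnw whose code chunks only read that file and draw the key plots, so re-knitting that one document reproduces your figures:
# inside a code chunk of summary.Rnw (hypothetical name)
results <- read.csv("results.csv")   # the file the 50 batch jobs appended to
plot(density(results$theta))         # one of the few key summary plots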
