R parallel system call on files

I have to convert a large number of RAW images and am using the program DCRAW to do that. Since this program only uses one core, I want to parallelize the calls from R. To call this program I use:
system("dcraw.exe -4 -T image.NEF")
This outputs a file called image.tiff into the same folder as the NEF file, which is totally fine. Now I have tried multiple R packages to parallelize this, but I only get nonsensical returns (probably caused by me). I want to run a large list (1000+ files, obtained via list.files()) through this system call in R.
I could only find info on parallel programming with variables within R, but not for system calls. Anybody got any ideas? Thanks!

It doesn't matter whether you use variables or system(). Assuming you're not on Windows (where fork-based parallelism is not supported), on any decent system you can run
parallel::mclapply(Sys.glob("*.NEF"),
                   function(fn) system(paste("dcraw.exe -4 -T", shQuote(fn))),
                   mc.cores = 8, mc.preschedule = FALSE)
It will run 8 jobs in parallel. But then you may as well skip R and run
ls *.NEF | parallel -u -j8 'dcraw.exe -4 -T {}'
(using GNU parallel).

On Windows I use a modification of this solution (the top voted one) to run many commands with no more than, say, 4 or 8 simultaneously:
Parallel execution of shell processes
It's not an R solution, but I like it.
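If you would rather stay in R on Windows, a socket cluster can stand in for the fork-based mclapply that Windows lacks; a minimal sketch, assuming dcraw.exe is on your PATH:

```r
library(parallel)

files <- Sys.glob("*.NEF")
# makeCluster() starts worker R processes over sockets, which works on Windows
cl <- makeCluster(4)
parLapply(cl, files, function(fn) {
  system(paste("dcraw.exe -4 -T", shQuote(fn)))
})
stopCluster(cl)
```

Each worker just shells out to dcraw, so the only data shipped to the workers is the file name.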

Related

Run multiple R scripts with exiting/restarting in between on Linux

I have a series of R scripts for the multiple steps of data analysis that I require. Some of these take a very long time and create really large objects. I've noticed that if I just source all of them in a row (via a main.R script), the processing for later steps takes much longer than if I source one script, save what I need, and restart R for the next step (loading the data I need).

I was wondering if there is a way, via Rscript or a Bash script perhaps, that I could carry this out. There would need to be objects that persist for the first 2 scripts (which load my external data and create the objects that will be used for all further steps). I suppose I could also just save those and load them in further scripts. (I would also like to pass a number of named arguments to this script, which I think I can find on other SO posts and can use something like optparse.)

So, the script would look something like this, I think:
#! /bin/bash
Rscript 01_load.R # Objects would persist, ideally
Rscript 02_create_graphs.R # Objects would persist, ideally
Rscript 03_random_graphs.R # contains code to save objects
#exit R
Rscript 04_permutation_analysis.R # would have to contain code to load data
#exit
And so on. Is there a solution to this? I'm using R 3.2.2 on 64-bit CentOS 6. Thanks.
Chris,
it sounds like you should do some manual housekeeping between (or within) your steps using gc() and maybe also rm(). For more details see help(gc) and help(rm).
So instead of exiting R and restarting it, you could do:
rm(list = ls())
gc()
But please note: rm(list = ls()) would throw away all your objects. Better to create a list of the objects you really want to throw away and pass that list to rm().
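If you do end up restarting R between steps, a lightweight way to hand objects from one script to the next is saveRDS()/readRDS(); a sketch, where the object and file names are just illustrative:

```r
# End of 01_load.R: persist whatever the later steps need
big_object <- list(data = mtcars, created = Sys.time())
saveRDS(big_object, file = "01_load_output.rds")

# Top of 02_create_graphs.R (a fresh R process): reload it
big_object <- readRDS("01_load_output.rds")
```

Unlike save()/load(), saveRDS() stores a single object without its name, so the loading script can bind it to whatever name it likes.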

Using knitr with cluster computing

I have a R script which needs to be run repeatedly. (For concreteness, I am talking about 500-1000 independent computationally-intensive MCMC chains which I want to summarize in just a few key plots at the end.) My school has a server available that uses a queuing system that makes these computations feasible. Right now I submit multiple jobs to the "short" queue since it is less overburdened than the "multicore" or "long" job queues. I have been running it by having the R script called multiple times, so I am submitting 50 jobs of 10 chains apiece and saving the results to a single output file by appending. This is my job submission code:
for ARRAYVAR in `seq 1 1 50`
do
bsub -q short -u me@school.edu R CMD BATCH "--args arg1=$ARRAYVAR" CODE.R
done
ARRAYVAR is used only for setting the random number seed. Once the jobs have all completed, the plotting is then done in a separate script.
For homework assignments and previous research, I have used knitr with Rstudio to combine LaTeX notes with my R code. The end result is a single .Rnw that generates a reproducible document containing all notes, code, and results. I liked that approach much better since I could always be sure the plots/results corresponded to the code version I saw in front of me. Is it possible to do something similar here so that there is one file that I could re-run to reproduce my findings? I am new to using the cluster and R without Rstudio.

How to re-use Rcpp compiled dll

Suppose I have compiled a code with two functions f1 and f2 using Rcpp::sourceCpp('myPath/myCode.cpp') and I located the sourceCpp_123123.dll that was created.
Now suppose I have two different batch files on Windows 7, both of which run RScript -e "source('myRCode1.r')" and RScript -e "source('myRCode2.r')" respectively. I want my two functions f1 and f2 to be available with each run of RScript.
I can certainly put in my code myRCode1.r and myRCode2.r to do Rcpp::sourceCpp('myPath/myCode.cpp') before rest of the code is run. Another alternative is to convert my two functions f1 and f2 into a package which is a little more involved process.
Is there any easy way to simply load the sourceCpp_123123.dll within myRCode1.r and myRCode2.r ?
I tried dyn.load("myDllPath\\sourceCpp_123123.dll") with various permutations and combinations of now=TRUE, local=TRUE, now=FALSE, local=FALSE, but none of the options made the two functions available.
However, when I tried getLoadedDLLs(), I saw that sourceCpp_123123.dll had been loaded!
As I commented earlier, this is where you want to use a package.
Or if you are one of those people opposed to packages on whichever grounds (none of which typically convince an experienced R user / programmer, who will almost always advocate use of a package), then you could cheat and just combine your two files into one.
Which would couple the two function sets. And you may already know that you can mix C++ and R code in one file as we do all the time over at the Rcpp Gallery...
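A possible middle ground between a full package and recompiling on every run is sourceCpp()'s cacheDir argument: pointing it at a persistent directory (instead of the default per-session temporary directory) should let later Rscript invocations reuse the cached build rather than recompiling, while still generating the R-side wrapper functions that a bare dyn.load() of the .dll does not give you. A sketch, with the paths being assumptions:

```r
library(Rcpp)

# Put this at the top of both myRCode1.r and myRCode2.r. The first run
# compiles myCode.cpp; later runs should find the unchanged source in the
# cache directory and skip compilation, but still define f1 and f2 in R.
sourceCpp("myPath/myCode.cpp", cacheDir = "myPath/rcpp_cache")
```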

Best way to utilize a bash script (Ubuntu), Rscript, pdflatex, and an API

I'm currently writing some code that
connect to a server via API and fetches a bunch of data,
organizes that data by case ID,
generates an individual case report,
creates one pdf (case overview) file per case, and finally
pushes these files back to the server.
I'm quite familiar with R and somewhat familiar with pdflatex. I've just found out about bash scripts (as I have started to work in an Ubuntu environment), and I am now starting to realize that it is not straightforward which programs are best suited for the job.
My current plan is to fetch the data using RCurl in R, organize the data in R, and generate a bunch of .tex files. Thereafter I plan to use pdflatex to create the pdf files, and finally use R again to push the newly created pdf files back to the server. I've started writing a small bash script,
for f in *.Rnw
do
# do something on ${f%%.*}
Rscript -e 'source("fetch.data.and.generate.Rnw.R")' # 1 through 3
Rscript -e "library(knitr); knit('${f%%.*}.Rnw')" # 4
pdflatex "${f%%.*}.tex" # 4 continued
rm "${f%%.*}.tex" "${f%%.*}.aux" "${f%%.*}.log" "${f%%.*}.out" # cleanup after 4
Rscript -e 'source("push.pdf.R")' # 5
done
I hoped someone out there could advise me on what software is best suited for the individual parts of the job and what would give me the best performance.
The data is not that extensive, I will be working with about 500 to 2000 cases and approximately 20 to 30 variables.
@flodel and @shellter make excellent points. I'll only add that, if you decide to keep using bash in your solution, you might find it easier to calculate your filename variable once and then use that elsewhere:
for f in *.Rnw; do
  stem="${f%%.*}"
  # Rscript commands with "$stem"
  # pdflatex command involving "$stem"
  # Rscript commands for pushing "$stem".pdf
  rm "$stem".*
done

Submit jobs to a slave node from within an R script?

Goal: to get myscript.R to run on a cluster slave node using a job scheduler (specifically, PBS).
Currently, I submit an R script to a slave node using the following command
qsub -S /bin/bash -p -1 -cwd -pe mpich 1 -j y -o output.log ./myscript.R
Are there functions in R that would allow me to run myscript.R on the head node and send individual tasks to the slave nodes? Something like:
foreach(i=c('file1.csv', 'file2.csv'), pbsoptions = list()) %do% read.csv(i)
Update: an alternative to the qsub command above is to remove #!/usr/bin/Rscript from the first line of myscript.R and call it directly, as pointed out by @Josh:
qsub -S /usr/bin/Rscript -p -1 -cwd -pe mpich 1 -j y -o output.log myscript.R
If you want to submit jobs from within an R script, I suggest that you look at the "BatchJobs" package. Here is a quote from the DESCRIPTION file:
Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine.
BatchJobs appears to be more sophisticated than previous, similar packages, such as Rsge and Rlsf. There are functions for registering, submitting, and retrieving the results of jobs. Here's a simple example:
library(BatchJobs)
reg <- makeRegistry(id='test')
batchMap(reg, sqrt, x=1:10)
submitJobs(reg)
waitForJobs(reg)  # block until all jobs have finished
y <- loadResults(reg)
You need to configure BatchJobs to use your batch queueing system. The submitJobs "resources" argument can be used to request appropriate resources for the jobs.
This approach is very useful if your cluster doesn't allow very long running jobs, or if it severely restricts the number of long running jobs. BatchJobs allows you to get around those restrictions by breaking up your work into multiple jobs while hiding most of the work associated with doing that manually.
Documentation and examples are available at the project website.
For most of our work we instead run multiple R sessions in parallel using qsub.
If it is for multiple files I normally do:
while read infile rest
do
qsub -v infile=$infile call_r.sh
done < list_of_infiles.txt
call_r.sh:
...
R --vanilla -f analyse_file.R --args "$infile"
...
analyse_file.R:
args <- commandArgs(trailingOnly = TRUE)
infile <- args[1]
outfile <- paste(infile, ".out", sep = "")
...
Then I combine all the output afterwards...
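That combining step can itself be a few lines of R; a sketch assuming each job wrote an "<infile>.out" table with the same columns:

```r
# Stack every per-file result into one data frame
out_files <- Sys.glob("*.out")
combined <- do.call(rbind, lapply(out_files, read.table, header = TRUE))
write.table(combined, file = "all_results.txt", row.names = FALSE)
```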
The R package Rsge allows job submission to SGE managed clusters. It basically saves the required environment to disk, builds job submission scripts, executes them via qsub and then collates the results and returns them to you.
Because it basically wraps calls to qsub, it should work with PBS too (although since I don't know PBS, I can't guarantee it). You can alter the qsub command and the options used by altering the Rsge associated global options (prefixed sge. in the options() output)
It is no longer on CRAN, but it is available from GitHub: https://github.com/bodepd/Rsge, although it doesn't look like it's maintained any more.
To use it, call one of the apply-type functions supplied with the package: sge.apply, sge.parRapply, sge.parCapply, sge.parLapply and sge.parSapply, which are parallel equivalents of apply, row-wise apply, column-wise apply, lapply and sapply respectively. In addition to the standard parameters passed to the non-parallel functions, a couple of other parameters are needed:
njobs: Number of parallel jobs to use.
global.savelist: Character vector giving the names of variables from the global environment that should be imported.
function.savelist: Character vector giving the variables to save from the local environment.
packages: List of library packages to be loaded by each worker process before computation is started.
The two savelist parameters and the packages parameter basically specify what variables, functions and packages should be loaded into the new instances of R running on the cluster machines before your code is executed. The different components of X (either list items or data.frame rows/columns) are divided between njobs different jobs and submitted as a job array to SGE. Each node starts an instance of R, loads the specified variables, functions and packages, executes the code, and saves the results to a temporary file. The controlling R instance checks when the jobs are complete, loads the data from the temporary files and joins the results back together to get the final result.
For example computing a statistic on a random sample of a gene list:
library(Rsge)
library(some.bioc.library)
gene.list <- read.delim("gene.list.tsv")
compute.sample <- function(gene.list) {
  gene.list.sample <- sample(gene.list, 1000)
  statistic <- some.slow.bioc.function(gene.list.sample)
  return(statistic)
}
results <- sge.parSapply(1:10000, function(x) compute.sample(gene.list),
                         njobs = 100,
                         global.savelist = c("gene.list"),
                         function.savelist = c("compute.sample"),
                         packages = c("some.bioc.library"))
If you'd like to send tasks to slave nodes as you go along with a script running on the head node, I believe your options are the following:
Pre-allocate all slave nodes and keep them on standby when they are not needed (as I suggested in my first answer).
Launch new jobs when the slave nodes are needed and have them save their results to disk. Put the main process on hold until the slaves have completed their tasks and then assemble their output files.
Option 2 is definitely possible but will take a lot longer to implement (I've actually done it myself several times). @pallevillesen's answer is pretty much spot on.
Original answer, with misinterpreted question
I have never worked with PBS myself, but it appears that you can use it to submit MPI jobs. You might need to load an MPI module before executing the R script, by sending a shell script along these lines to qsub:
#!/bin/bash
#PBS -N my_job
#PBS -l cput=10:00:00,ncpus=4,mem=2gb
module load openmpi
module load R
R -f myscript.R
You should then be able to use doSNOW to execute your foreach loop in parallel.
n.slaves <- 4
library(doSNOW)
cl <- makeMPIcluster(n.slaves)
registerDoSNOW(cl)
foreach(i=c('file1.csv', 'file2.csv')) %dopar% read.csv(i)
