Submit jobs to a slave node from within an R script?

I want myscript.R to run on a cluster slave node via a job scheduler (specifically, PBS).
Currently, I submit the R script to a slave node using the following command:
qsub -S /bin/bash -p -1 -cwd -pe mpich 1 -j y -o output.log ./myscript.R
Are there functions in R that would allow me to run myscript.R on the head node and send individual tasks to the slave nodes? Something like:
foreach(i=c('file1.csv', 'file2.csv'), pbsoptions = list()) %do% read.csv(i)
Update: an alternative to the qsub command is to remove #!/usr/bin/Rscript from the first line of myscript.R and call it directly, as pointed out by @Josh:
qsub -S /usr/bin/Rscript -p -1 -cwd -pe mpich 1 -j y -o output.log myscript.R

If you want to submit jobs from within an R script, I suggest that you look at the "BatchJobs" package. Here is a quote from the DESCRIPTION file:
Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine.
BatchJobs appears to be more sophisticated than previous, similar packages, such as Rsge and Rlsf. There are functions for registering, submitting, and retrieving the results of jobs. Here's a simple example:
library(BatchJobs)
reg <- makeRegistry(id='test')   # create a registry to track the jobs
batchMap(reg, sqrt, x=1:10)      # define one job per element of x
submitJobs(reg)                  # submit the jobs to the batch system
y <- loadResults(reg)            # collect the results once the jobs finish
You need to configure BatchJobs to use your batch queueing system. The submitJobs "resources" argument can be used to request appropriate resources for the jobs.
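For illustration, here is a hedged sketch of passing resources to submitJobs; the resource names and values below are placeholders, since what your cluster actually accepts depends on the template configured for your PBS system:
submitJobs(reg, resources = list(walltime = 3600, memory = 1024))  # placeholder resource names/values
waitForJobs(reg)  # block until all jobs have finished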
This approach is very useful if your cluster doesn't allow very long running jobs, or if it severely restricts the number of long running jobs. BatchJobs allows you to get around those restrictions by breaking up your work into multiple jobs while hiding most of the work associated with doing that manually.
Documentation and examples are available at the project website.

For most of our work, we run multiple R sessions in parallel using qsub instead.
If it is for multiple files, I normally do:
while read infile rest
do
qsub -v infile=$infile call_r.sh
done < list_of_infiles.txt
call_r.sh:
...
R --vanilla -f analyse_file.R $infile
...
analyse_file.R:
args <- commandArgs()
infile <- args[5]                        # value passed in via qsub -v infile=...
outfile <- paste(infile, ".out", sep="")
...
Then I combine all the output afterwards...
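For illustration, the combining step might be as simple as the following sketch, assuming each .out file written by analyse_file.R contains a tab-delimited table (the file pattern and format here are assumptions):
out.files <- list.files(pattern = "\\.out$")               # one result file per input file
combined <- do.call(rbind, lapply(out.files, read.delim))  # stack all result tables
write.table(combined, "combined_results.txt", sep = "\t", row.names = FALSE)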

The R package Rsge allows job submission to SGE managed clusters. It basically saves the required environment to disk, builds job submission scripts, executes them via qsub and then collates the results and returns them to you.
Because it basically wraps calls to qsub, it should work with PBS too (although since I don't know PBS, I can't guarantee it). You can alter the qsub command and the options used by changing Rsge's associated global options (prefixed with sge. in the options() output).
It is no longer on CRAN, but it is available from GitHub: https://github.com/bodepd/Rsge, although it doesn't look like it's maintained any more.
To use it, call one of the apply-type functions supplied with the package: sge.apply, sge.parRapply, sge.parCapply, sge.parLapply and sge.parSapply, which are parallel equivalents of apply, rapply, rapply(t(x),…), lapply and sapply respectively. In addition to the standard parameters passed to the non-parallel functions, a couple of other parameters are needed:
njobs: Number of parallel jobs to use
global.savelist: Character vector giving the names of variables from the global environment that should be imported.
function.savelist: Character vector giving the variables to save from the local environment.
packages: List of library packages to be loaded by each worker process before computation is started.
The two savelist parameters and the packages parameter basically specify which variables, functions and packages should be loaded into the new instances of R running on the cluster machines before your code is executed. The different components of X (either list items or data.frame rows/columns) are divided between njobs different jobs and submitted as a job array to SGE. Each node starts an instance of R, loads the specified variables, functions and packages, executes the code, and saves the results to a temporary file. The controlling R instance checks when the jobs are complete, loads the data from the temporary files and joins the results back together to get the final results.
For example, computing a statistic on a random sample of a gene list:
library(Rsge)
library(some.bioc.library)
gene.list <- read.delim("gene.list.tsv")
compute.sample <- function(gene.list) {
  gene.list.sample <- sample(gene.list, 1000)
  statistic <- some.slow.bioc.function(gene.list.sample)
  return(statistic)
}
results <- sge.parSapply(1:10000, function(x) compute.sample(gene.list),
                         njobs = 100,
                         global.savelist = c("gene.list"),
                         function.savelist = c("compute.sample"),
                         packages = c("some.bioc.library"))

If you would like to send tasks to slave nodes as you go along with a script on the head node, I believe your options are the following:
Pre-allocate all slave nodes and keep them on standby when they are not needed (as I suggested in my first answer).
Launch new jobs when the slave nodes are needed and have them save their results to disk. Put the main process on hold until the slaves have completed their tasks and then assemble their output files.
Option 2 is definitely possible but will take a lot longer to implement (I've actually done it myself several times). @pallevillesen's answer is pretty much spot on.
Original answer, based on a misinterpreted question
I have never worked with PBS myself, but it appears that you can use it to submit MPI jobs. You might need to load an MPI module before executing the R script, submitting a shell script along these lines to qsub.
#!/bin/bash
#PBS -N my_job
#PBS -l cput=10:00:00,ncpus=4,mem=2gb
module load openmpi
module load R
R -f myscript.R
You should then be able to use doSNOW to execute your foreach loop in parallel.
n.slaves <- 4
library(doSNOW)
cl <- makeMPIcluster(n.slaves)   # MPI cluster from snow; requires Rmpi
registerDoSNOW(cl)
foreach(i=c('file1.csv', 'file2.csv')) %dopar% read.csv(i)
stopCluster(cl)

Related

How do I silence this specific message written to terminal/file (when calling an 'Rscript' via a bash script)

I'm writing a bash script (to be called from the terminal on a Linux system) that creates a log file and then launches an Rscript using some simple user input. However, I'm running into problems controlling which messages are included in the log file (or sent to the terminal), and can't find any solution for excluding one specific R package-load message:
Package WGCNA 1.66 loaded.
In other words I need a way to (only) silence this specific message, which is printed when the WGCNA package is successfully loaded.
I will try to keep the code non-specific to hopefully make it easier to follow.
The block below is a skeleton (excluding some irrelevant code) and is followed by the different variants I've tried. Originally I tried controlling the output from the R script using sink() and suppressPackageStartupMessages(), which, I thought, should have been enough.
bash script:
#!/usr/bin/env bash
read RDS
DATE=`date +%F-%R`
LOG=~/path/log/$DATE.log
touch $LOG
export ALLOW_WGCNA_THREADS=4
Rscript ~/path/analysis.R $RDS $DATE $LOG
R script:
#!/usr/bin/env Rscript
# object set-up
rds.path <- "~/path/data/"
temp.path <- "~/path/temp/"
pp.data <- readRDS(paste0(rds.path, commandArgs(T)[1]))
file.date <- paste0(commandArgs(T)[2], "_")
# set up error logging
log.file <- file(commandArgs(T)[3], open="a")
sink(log.file, append=TRUE, type="message")
sink(log.file, append=TRUE, type="output")
# main pkg call
if(suppressPackageStartupMessages(!require(thePKG))){
stop("\nPlease follow the below link to install the requested package (thePKG) with relevant dependencies\n https://link.address")
}
# thePKG method call
cat("> Running "method"\n", append=TRUE)
module <- method(thePKG_input = pp.data, ppi_network = ppi_network)
# reset sink and close file connection
sink(type="message")
sink(type="output")
close(log.file)
This doesn't output anything to the terminal (which is good), but it includes the following in the log file:
Package WGCNA 1.66 loaded.
> Running "method"
Error: added error to verify that it's correctly printed to the file
I want to keep my log files as clean and on point as possible (and avoid clutter in the terminal), and therefore wish to remove the package-load message. I've tried the following...
i. Omitting the sink() call from the R script, and adding
&>> $LOG
to the bash Rscript call, resulting in the same file output as above.
ii. Same as i but substituting suppressPackageStartupMessages() with suppressMessages(), which results in the same file output as above.
iii. Same as i but added
... require(thePKG, quietly=TRUE)
to the R script's "# main pkg call" section, with the same results.
These were the potential solutions I came across and tried in different variations, with no positive results.
I also wondered whether the WGCNA package was loaded "outside" of the if(!require(thePKG)) block, since it isn't affected by suppressMessages() in that call. But introducing an intentional error (which terminated the process) before the if(!require(thePKG)) call removed the message, hinting that it originates inside the block.
I also tried calling WGCNA by itself at the start of the R script, with an added suppressMessages() to it, but that didn't work either.
The export line used in the bash script doesn't affect the outcome (to my knowledge), and removing it extends the load message to include the following (truncated to save space):
Package WGCNA 1.66 loaded.
Important note: It appears that your system supports multi-threading, but it is not enabled within WGCNA in R.
To allow multi-threading within WGCNA with all available cores, use allowWGCNAThreads() within R. Use disableWGCNAThreads() to disable threading if necessary.
(...)
I'm aware that I could send the output (only) to /dev/null, but there's other output printed to the file (e.g. > Running "method") that I still want.
Does anyone have a suggestion for how to remove this message? I'm very new to programming in R, and just started using Linux (Ubuntu LTS), so please keep that in mind when answering.
Thanks in advance.
I could gather the following execution flow from your question:
Bash_Script ----> R_Script ----> Output_to_terminal (and also a log file probably)
For the requirements stated, the following commands (in bash) could help you:
grep - This command helps you search for patterns
Pipe (|) - This is an operator in Linux which redirects the output of one command as input to another for further processing (for example, A | B)
tee - This command takes any input and forwards it to a file + standard output (terminal)
So, you could mix and match the above commands in your bash file to get the desired output:
You can extend
Rscript ~/path/analysis.R $RDS $DATE $LOG
with
Rscript ~/path/analysis.R "$RDS" "$DATE" "$LOG" | grep -v "Package WGCNA" | tee "output.log"
What does each command do?
a) | -> This redirects the output of the preceding command as input to the next; it's essentially a connector.
b) grep -> Normally used to search for patterns. In this example, we use it with the -v option to do an invert-match instead, i.e. keep everything except lines matching "Package WGCNA".
c) tee -> This writes output to the terminal and also redirects it to the log file whose name is passed as an argument. You could skip this command altogether if it's not needed.

Write to log file using doMPI

I am running doMPI on an HPC and I would like to log output from the workers. Using doParallel, I was able to use makeCluster(outfile='myfile.log'). With doMPI, there does not seem to be an outfile argument in any of the methods. I tried using sinkWorkerOutput(). This worked, but it only wrote the log for one of the workers. I suspect that each worker is overwriting the others. Is there an analog of outfile for doMPI?
A related question - inside of a worker, can I find the worker number?
EDIT: here is a link to an answer discussing how to use outfile: How can I print when using %dopar%
Thank you for your help,
Ben
To send worker output to a file in the doMPI package, set the startMPIcluster "verbose" option to TRUE:
cl <- startMPIcluster(verbose=TRUE)
This creates one file per worker with names of the form "MPI_1_steve_41747.log". The MPI rank, user name, and process ID are used to make the file names unique. You can also specify the log directory via the "logdir" option.
To get a worker number, you can simply call the mpi.comm.rank function.
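Putting those pieces together, a minimal sketch might look like this (the log directory name and the toy task are made up):
library(doMPI)
cl <- startMPIcluster(verbose = TRUE, logdir = "logs")  # one log file per worker
registerDoMPI(cl)
results <- foreach(i = 1:10) %dopar% {
  # mpi.comm.rank() (from Rmpi) gives this worker's rank, which also appears in its log file name
  cat("worker", mpi.comm.rank(), "handling task", i, "\n")
  sqrt(i)
}
closeCluster(cl)
mpi.quit()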

Run multiple R scripts with exiting/restarting in between on Linux

I have a series of R scripts for the multiple steps of data analysis that I require. Some of these take a very long time and create really large objects. I've noticed that if I just source all of them in a row (via a main.R script), the processing for later steps takes much longer than if I source one script, save what I need, and restart R for the next step (loading the data I need).
I was wondering if there was a way, via Rscript or a Bash script perhaps, that I could carry this out. There would need to be objects that persist for the first 2 scripts (which load my external data and create the objects that will be used for all further steps). I suppose I could also just save those and load them in further scripts.
(I would also like to pass a number of named arguments to this script, which I think I can work out from other SO posts, using something like optparse.)
So, the script would look something like this, I think:
#! /bin/bash
Rscript 01_load.R # Objects would persist, ideally
Rscript 02_create_graphs.R # Objects would persist, ideally
Rscript 03_random_graphs.R # contains code to save objects
#exit R
Rscript 04_permutation_analysis.R # would have to contain code to load data
#exit
And so on. Is there a solution to this? I'm using R 3.2.2 on 64-bit CentOS 6. Thanks.
Chris,
it sounds like you should do some manual housekeeping between (or within) your steps by using gc() and maybe also rm(). For more details see help(gc) and help(rm).
So instead of exiting R and restarting it again, you could do:
rm(list = ls())
gc()
But please note: rm(list = ls()) throws away all your objects. Better to create a list of only the objects you really want to throw away and pass that list to rm().
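Alternatively, if you keep separate Rscript invocations as in the question, objects do not persist between processes, so each step has to hand its results to the next explicitly, e.g. with saveRDS()/readRDS(). A minimal sketch (the object and file names are made up):
# At the end of 01_load.R: save whatever the later steps need.
saveRDS(my_data, file = "my_data.rds")   # my_data is a placeholder object
# At the start of 02_create_graphs.R: load it again.
my_data <- readRDS("my_data.rds")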

R parallel system call on files

I have to convert a large number of RAW images and am using the program DCRAW to do that. Since this program is only using one core I want to parallelize this in R. To call this function I use:
system("dcraw.exe -4 -T image.NEF")
This outputs a file called image.tiff in the same folder as the NEF file, which is totally fine. I have tried multiple R packages to parallelize this, but I only get nonsensical returns (probably caused by me). I want to run a large list (1000+ files, obtained by list.files()) through this system call in R.
I could only find info on parallel programming for variables within R but not for system calls. Anybody got any ideas? Thanks!
It doesn't matter whether you use variables or system(). Assuming you're not on Windows (which doesn't support fork-based parallelism), on any decent system you can run
parallel::mclapply(Sys.glob("*.NEF"),
function(fn) system(paste("dcraw.exe -4 -T", shQuote(fn))),
mc.cores=8, mc.preschedule=F)
It will run 8 jobs in parallel. But then you may as well not use R and instead run
ls *.NEF | parallel -u -j8 'dcraw.exe -4 -T {}'
(using GNU parallel).
On Windows I use a modification of this solution (the top voted one) to run many commands with no more than, say, 4 or 8 simultaneously:
Parallel execution of shell processes
It's not an R solution, but I like it.
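For completeness, a hedged R-only alternative that also works on Windows is a PSOCK cluster from the parallel package (this assumes dcraw.exe is on the PATH and that 8 workers are wanted):
library(parallel)
files <- Sys.glob("*.NEF")
cl <- makeCluster(8)   # PSOCK cluster; works on Windows as well
# Each worker converts one file at a time via the same system call.
res <- parLapply(cl, files, function(fn) {
  system(paste("dcraw.exe -4 -T", shQuote(fn)))
})
stopCluster(cl)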

Using knitr with cluster computing

I have an R script which needs to be run repeatedly. (For concreteness, I am talking about 500-1000 independent, computationally intensive MCMC chains which I want to summarize in just a few key plots at the end.) My school has a server available that uses a queuing system, which makes these computations feasible. Right now I submit multiple jobs to the "short" queue, since it is less overburdened than the "multicore" or "long" job queues. I have been running it by having the R script called multiple times, so I am submitting 50 jobs of 10 chains apiece and saving the results to a single output file by appending. This is my job submission code:
for ARRAYVAR in `seq 1 1 50`
do
bsub -q short -u me@school.edu R CMD BATCH "CODE.R --args arg1 = $ARRAYVAR"
done
ARRAYVAR is used only for setting the random number seed. Once the jobs have all completed, the plotting is then done in a separate script.
For homework assignments and previous research, I have used knitr with RStudio to combine LaTeX notes with my R code. The end result is a single .Rnw file that generates a reproducible document containing all notes, code, and results. I liked that approach much better since I could always be sure the plots/results corresponded to the code version I saw in front of me. Is it possible to do something similar here, so that there is one file that I could re-run to reproduce my findings? I am new to using the cluster and R without RStudio.
