I have an R script that needs to be run repeatedly. (For concreteness, I am talking about 500-1000 independent, computationally intensive MCMC chains that I want to summarize in just a few key plots at the end.) My school has a server with a queuing system that makes these computations feasible. Right now I submit multiple jobs to the "short" queue, since it is less overburdened than the "multicore" or "long" queues. I call the R script multiple times, submitting 50 jobs of 10 chains apiece and appending the results to a single output file. This is my job submission code:
for ARRAYVAR in `seq 1 1 50`
do
    bsub -q short -u me@school.edu R CMD BATCH "--args arg1=$ARRAYVAR" CODE.R
done
ARRAYVAR is used only for setting the random number seed. Once the jobs have all completed, the plotting is then done in a separate script.
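For reference, a minimal sketch of how CODE.R might pick up arg1 and set the seed; the question does not show the actual parsing code, so the details here are assumptions:

# Hypothetical sketch: read "arg1=<n>" passed via --args and use it as the seed.
args <- commandArgs(trailingOnly = TRUE)                       # e.g. "arg1=7"
arg1 <- as.integer(sub("^arg1=", "", args[grepl("^arg1=", args)]))
set.seed(arg1)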
For homework assignments and previous research, I have used knitr with RStudio to combine LaTeX notes with my R code. The end result is a single .Rnw file that generates a reproducible document containing all notes, code, and results. I liked that approach much better, since I could always be sure the plots and results corresponded to the code version in front of me. Is it possible to do something similar here, so that there is one file I could re-run to reproduce my findings? I am new to using the cluster and to R without RStudio.
I'm not an R expert and my English is not very good, so please keep that in mind when you respond to my question.
I'm trying to automate some R script execution. My ultimate goal is to automate the execution of a script that queries some data from the Binance exchange API, exports that data to an external .csv file (or updates the file with new observations if it already exists), and then imports the file into the RStudio Global Environment, so that I always have up-to-date data ready to be analyzed.
Searching the web, I learned that I can automate tasks using Windows Task Scheduler, so I downloaded the taskscheduleR package to speed up the process. It worked quite well, in fact; however, I only managed to automate two of the three tasks mentioned above:
Query from API
Export data into .csv
So, using the task scheduler I can periodically query the web and export data or update existing datasets. However, I'm struggling with the third task: I can't figure out how to automatically and periodically import the data into the RStudio Global Environment.
In order to simplify my question, I'll use a very basic line of code as an example. Imagine I want to automate a script named "rnorm.R":
x <- rnorm(10)
I want the output of rnorm() to be stored in x and then loaded into the Global Environment, and I want this line to run every 2 minutes, so that the values in x also change every 2 minutes while I'm working in RStudio. I have tried many times with different methods.
First I tried the taskscheduleR package, using the following code:
require(taskscheduleR)
require(lubridate)
taskscheduler_create(taskname = "rnorm",
                     rscript = "mydir/rnorm.R",
                     starttime = format(ceiling_date(Sys.time(), unit = "mins"), "%H:%M"),
                     schedule = "MINUTE",
                     modifier = 2)
Then I tried scheduling it manually with Windows Task Scheduler.
Lastly, I tried scheduling a batch file:
@echo off
Rscript.exe mydir/rnorm.R
None of the three methods worked. Or rather, the scheduling itself works: I see a command-line window appear each time Windows Task Scheduler executes the script. However, it does nothing; the x variable isn't loaded into the Global Environment. What am I doing wrong? I'm sure this problem has a very simple and silly answer, but I can't figure out what it might be. Thanks for your answers.
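A likely explanation is that a scheduled Rscript runs in its own R process, so anything it assigns cannot appear in the Global Environment of the interactive RStudio session. One common bridge, not part of the original post and with a placeholder path, is to have the scheduled script write its result to disk and to load that file from the interactive session:

# In the scheduled rnorm.R: save the result to a file instead of relying on
# it appearing in another process's Global Environment ("mydir/x.rds" is a placeholder).
x <- rnorm(10)
saveRDS(x, "mydir/x.rds")

# In the interactive RStudio session, reload it whenever fresh values are needed:
x <- readRDS("mydir/x.rds")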
I have a series of R scripts for doing the multiple steps of data analysis that I require. Some of these take a very long time and create really large objects. I've noticed that if I just source all of them in a row (via a main.R script), the processing for later steps takes much longer than if I source one script, save what I need, and restart R for the next step (loading the data I need).
I was wondering if there is a way, via Rscript or perhaps a Bash script, that I could carry this out. There would need to be objects that persist for the first two scripts (which load my external data and create the objects that will be used for all further steps). I suppose I could also just save those and load them in the later scripts.
(I would also like to pass a number of named arguments to this script; I think I can find how to do that in other SO posts and use something like optparse.)
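For what it's worth, a minimal optparse sketch might look like the following; the --ncores option is made up purely for illustration:

library(optparse)

# Parse a single hypothetical named argument, e.g. Rscript main.R --ncores 4
opt <- parse_args(OptionParser(option_list = list(
    make_option("--ncores", type = "integer", default = 1)
)))
opt$ncores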
So, the script would look something like this, I think:
#! /bin/bash
Rscript 01_load.R # Objects would persist, ideally
Rscript 02_create_graphs.R # Objects would persist, ideally
Rscript 03_random_graphs.R # contains code to save objects
#exit R
Rscript 04_permutation_analysis.R # would have to contain code to load data
#exit
And so on. Is there a solution to this? I'm using R 3.2.2 on 64-bit CentOS 6. Thanks.
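Since each Rscript call starts a fresh R process, objects cannot persist in memory between those calls; saving and loading, as the question already suggests, is the usual bridge. A minimal sketch, with the object and file names made up for illustration:

# At the end of 01_load.R: persist the expensive objects to disk.
saveRDS(raw_data, "raw_data.rds")    # "raw_data" and the file name are placeholders

# At the start of 02_create_graphs.R (and later steps): reload them.
raw_data <- readRDS("raw_data.rds")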
Chris,
it sounds like you should do some manual housekeeping between (or within) your steps using gc() and perhaps also rm(). For more details, see help(gc) and help(rm).
So instead of exiting R and restarting it, you could do:
rm(list = ls())
gc()
But please note: rm(list = ls()) throws away all of your objects. It is better to build a list of just the objects you really want to discard and pass that list to rm().
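For instance, applied to the main.R setup from the question, the housekeeping between steps might look like this (the names of the objects being removed are hypothetical):

# main.R: clean up the large intermediates that later steps no longer need.
source("01_load.R")
source("02_create_graphs.R")
rm(big_graph_object, temp_edge_list)   # hypothetical names of objects to discard
gc()                                   # run garbage collection so the memory can be reused
source("03_random_graphs.R")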
I have to convert a large number of RAW images and am using the program DCRAW to do that. Since this program only uses one core, I want to parallelize this in R. To call the program I use:
system("dcraw.exe -4 -T image.NEF")
This results in a file called image.tiff being written to the same folder as the NEF file, which is totally fine. Now, I have tried multiple R packages to parallelize this, but I only get nonsensical returns (probably caused by me). I want to run a large list (1000+ files, obtained by list.files()) through this system call in R.
I could only find information on parallel programming for variables within R, but not for system calls. Anybody got any ideas? Thanks!
It doesn't matter whether you use variables or system calls. Assuming you're not on Windows (where mclapply cannot run jobs in parallel), on any decent system you can run
parallel::mclapply(Sys.glob("*.NEF"),
                   function(fn) system(paste("dcraw.exe -4 -T", shQuote(fn))),
                   mc.cores = 8, mc.preschedule = FALSE)
It will run 8 jobs in parallel. But then you may as well not use R and instead run
ls *.NEF | parallel -u -j8 'dcraw.exe -4 -T {}'
(using GNU parallel).
On Windows I use a modification of this solution (the top voted one) to run many commands with no more than, say, 4 or 8 simultaneously:
Parallel execution of shell processes
It's not an R solution, but I like it.
I'm currently writing some code that
connects to a server via an API and fetches a bunch of data,
organizes that data by case ID,
generates an individual case report,
creates one pdf (case overview) file per case, and finally
pushes these files back to the server.
I'm quite familiar with R and somewhat familiar with pdflatex. I've just found out about bash scripts, as I have started working in an Ubuntu environment, and I am now starting to realize that it is not straightforward to decide which programs are best suited for the job.
My current plan is to fetch the data using RCurl in R, organize the data in R, and generate a bunch of .tex files. After that I plan to use pdflatex to create the pdf files, and finally use R again to push the newly created pdf files back to the server. I've started writing a small bash script:
for f in *.Rnw
do
    # do something on ${f%%.*}
    Rscript -e 'source("fetch.data.and.generate.Rnw.R")'               # 1 through 3
    Rscript -e "library(knitr); knit('${f%%.*}.Rnw')"                  # 4
    pdflatex "${f%%.*}.tex"                                            # 4 continued
    rm "${f%%.*}.tex" "${f%%.*}.aux" "${f%%.*}.log" "${f%%.*}.out"     # cleanup after 4
    Rscript -e 'source("push.pdf.R")'                                  # 5
done
I hoped someone out there could advise me on which software is best suited for the individual parts of the job, and what would give me the best performance.
The data is not that extensive: I will be working with about 500 to 2000 cases and approximately 20 to 30 variables.
@flodel and @shellter make excellent points. I'll only add that, if you decide to keep using bash in your solution, you might find it easier to calculate your filename stem once and then use that elsewhere:
for f in *.Rnw; do
    stem="${f%%.*}"
    # Rscript commands with "$stem"
    # pdflatex command involving "$stem"
    # Rscript commands for pushing "$stem.pdf"
    rm "$stem".tex "$stem".aux "$stem".log "$stem".out
done
My goal is to get myscript.R to run on a cluster slave node using a job scheduler (specifically, PBS).
Currently, I submit the R script to a slave node using the following command:
qsub -S /bin/bash -p -1 -cwd -pe mpich 1 -j y -o output.log ./myscript.R
Are there functions in R that would allow me to run myscript.R on the head node and send individual tasks to the slave nodes? Something like:
foreach(i = c('file1.csv', 'file2.csv'), pbsoptions = list()) %do% read.csv(i)
Update: an alternative to the qsub command above is to remove #!/usr/bin/Rscript from the first line of myscript.R and call Rscript directly via qsub, as pointed out by @Josh:
qsub -S /usr/bin/Rscript -p -1 -cwd -pe mpich 1 -j y -o output.log myscript.R
If you want to submit jobs from within an R script, I suggest that you look at the "BatchJobs" package. Here is a quote from the DESCRIPTION file:
Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine.
BatchJobs appears to be more sophisticated than previous, similar packages, such as Rsge and Rlsf. There are functions for registering, submitting, and retrieving the results of jobs. Here's a simple example:
library(BatchJobs)
reg <- makeRegistry(id='test')
batchMap(reg, sqrt, x=1:10)
submitJobs(reg)
y <- loadResults(reg)
You need to configure BatchJobs to use your batch queueing system. The "resources" argument of submitJobs can be used to request appropriate resources for the jobs.
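For example, something along these lines; the resource names are placeholders, since the ones actually available depend on your site's configuration and cluster-functions template:

submitJobs(reg, resources = list(walltime = 3600, memory = 2048))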
This approach is very useful if your cluster doesn't allow very long running jobs, or if it severely restricts the number of long running jobs. BatchJobs allows you to get around those restrictions by breaking up your work into multiple jobs while hiding most of the work associated with doing that manually.
Documentation and examples are available at the project website.
For most of our work, we run multiple R sessions in parallel using qsub (instead).
If it is for multiple files, I normally do:
while read infile rest
do
    qsub -v infile=$infile call_r.sh
done < list_of_infiles.txt
call_r.sh:
...
R --vanilla -f analyse_file.R $infile
...
analyse_file.R:
args <- commandArgs()
infile=args[5]
outfile=paste(infile,".out",sep="")...
Then I combine all the output afterwards...
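For example, the combining step might look roughly like this, assuming each *.out file is a table with a header row (the details depend on what analyse_file.R actually writes):

out.files <- list.files(pattern = "\\.out$")
results <- do.call(rbind, lapply(out.files, read.table, header = TRUE))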
The R package Rsge allows job submission to SGE managed clusters. It basically saves the required environment to disk, builds job submission scripts, executes them via qsub and then collates the results and returns them to you.
Because it basically wraps calls to qsub, it should work with PBS too (although, since I don't know PBS, I can't guarantee it). You can alter the qsub command and the options used by changing the Rsge-associated global options (prefixed with sge. in the options() output).
It is no longer on CRAN, but it is available from GitHub: https://github.com/bodepd/Rsge, although it doesn't look like it's maintained any more.
To use it, call one of the apply-type functions supplied with the package: sge.apply, sge.parRapply, sge.parCapply, sge.parLapply and sge.parSapply, which are parallel equivalents of apply, rapply, rapply(t(x), ...), lapply and sapply respectively. In addition to the standard parameters passed to the non-parallel functions, a couple of other parameters are needed:
njobs: Number of parallel jobs to use.
global.savelist: Character vector giving the names of variables from the global environment that should be imported.
function.savelist: Character vector giving the variables to save from the local environment.
packages: List of library packages to be loaded by each worker process before computation is started.
The two savelist parameters and the packages parameter basically specify what variables, functions and packages should be loaded into the new instances of R running on the cluster machines before your code is executed. The different components of X (either list items or data.frame rows/columns) are divided between njobs different jobs and submitted as a job array to SGE. Each node starts an instance of R, loads the specified variables, functions and packages, executes the code, and saves the results to a temporary file. The controlling R instance checks when the jobs are complete, loads the data from the temporary files and joins the results back together to obtain the final result.
For example computing a statistic on a random sample of a gene list:
library(Rsge)
library(some.bioc.library)

gene.list <- read.delim("gene.list.tsv")

compute.sample <- function(gene.list) {
    gene.list.sample <- sample(gene.list, 1000)
    statistic <- some.slow.bioc.function(gene.list.sample)
    return(statistic)
}

results <- sge.parSapply(1:10000, function(x) compute.sample(gene.list),
                         njobs = 100,
                         global.savelist = c("gene.list"),
                         function.savelist = c("compute.sample"),
                         packages = c("some.bioc.library"))
If you would like to send tasks to the slave nodes as you go along with a script running on the head node, I believe your options are the following:
1. Pre-allocate all the slave nodes and keep them on standby when they are not needed (as I suggested in my first answer).
2. Launch new jobs whenever the slave nodes are needed, and have them save their results to disk. Put the main process on hold until the slaves have completed their tasks, and then assemble their output files.
Option 2 is definitely possible, but it will take a lot longer to implement (I've actually done it myself several times). @pallevillesen's answer is pretty much spot on.
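As a rough illustration of option 2 (everything here, the job script name, the output-file naming, and the polling interval, is an assumption rather than code from the original setup), the head-node script could do something like:

# Submit one job per input file, then wait for the result files to appear.
infiles <- c("file1.csv", "file2.csv")
for (f in infiles) {
    system(paste("qsub -v infile=", f, " process_one_file.sh", sep = ""))
}

outfiles <- paste(infiles, ".out.rds", sep = "")   # assumes each job saves its result with saveRDS
while (!all(file.exists(outfiles))) Sys.sleep(60)  # put the main process on hold
results <- lapply(outfiles, readRDS)               # assemble the slaves' output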
Original answer, based on a misinterpretation of the question
I have never worked with PBS myself, but it appears that you can use it to submit MPI jobs. You might need to load an MPI module before executing the R script, by sending a shell script along these lines to qsub:
#!/bin/bash
#PBS -N my_job
#PBS -l cput=10:00:00,ncpus=4,mem=2gb
module load openmpi
module load R
R -f myscript.R
You should then be able to use doSNOW to execute your foreach loop in parallel:
library(doSNOW)

n.slaves <- 4
cl <- makeMPIcluster(n.slaves)
registerDoSNOW(cl)

foreach(i = c('file1.csv', 'file2.csv')) %dopar% read.csv(i)

stopCluster(cl)   # shut down the slave processes when finished