Parallelizing an Rscript using a job array in Slurm - r

I want to run an Rscript.R using an array job in Slurm, with tasks 1-10, where the task id from the job is passed to the Rscript so that it writes a file named "'task id'.out" containing the task id in its body. However, this has proven to be more challenging than I anticipated. I am trying the following:
~/bash_test.sh looks like:
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
R CMD BATCH --no-save --no-restore ~/Rscript_test.R $SLURM_ARRAY_TASK_ID
~/Rscript_test.R looks like:
#!/usr/bin/env Rscript
taskid = commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
taskid <- as.data.frame(taskid)
# print task number
print(paste0("the number processed was... ", taskid))
write.table(taskid, paste0("~/test/",taskid,".out"),quote=FALSE, row.names=FALSE, col.names=FALSE)
After I submit my job (sbatch bash_test.sh), it looks like R is not really seeing SLURM_ARRAY_TASK_ID. The script generates 10 files (named 1, 2, ..., 10 - just the numbers, probably corresponding to the task ids), but it is not writing the ".out" files: the script wrote a single empty "integer(0).out" file.
What I wanted, was to populate the folder ~/test/ with 10 files, 1.out, 2.out, ..., 10.out, and each file has to contain the task id inside (simply the number 1, 2, ..., or 10, respectively).
P.S.: Note that I tried playing with Sys.getenv() too, but I don't think I was able to set that up properly. That option generates 10 files, plus one 1.out file containing the number 10.
P.S.2: This is Slurm 19.05.5. I am running R within a conda environment.

You should avoid using "R CMD BATCH". It doesn't handle arguments the way most command-line tools do, and "Rscript" has been the recommended option for a while now. By calling "R CMD BATCH" you are basically ignoring the "#!/usr/bin/env Rscript" part of your script.
So change your script file to
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
Rscript ~/Rscript_test.R $SLURM_ARRAY_TASK_ID
And then be careful in your script that you aren't using the same variable as both a string and a data.frame. You can't easily paste a data.frame into a file path, for example. So
taskid <- commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID') # This should also work
print(paste0("the number processed was... ", taskid))
outdata <- as.data.frame(taskid)
outfile <- paste0("~/test/", taskid, ".out")
write.table(outdata, outfile, quote=FALSE, row.names=FALSE, col.names=FALSE)
The extra files with just the array number were created because the usage of R CMD BATCH is
R CMD BATCH [options] infile [outfile]
So the $SLURM_ARRAY_TASK_ID value you were passing at the command line was treated as the outfile name. That value would instead need to be passed within the options. But again, it's better to use Rscript, which follows more standard argument conventions.
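For completeness, if you did need to stay with R CMD BATCH, arguments have to be wrapped in a quoted --args option and read back with commandArgs(trailingOnly=TRUE); a minimal sketch (the .Rout log file name is just an illustrative choice):
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
# the quoted --args string is passed through to the R script;
# the final argument is the log file, so it no longer collides with the task id
R CMD BATCH --no-save --no-restore "--args $SLURM_ARRAY_TASK_ID" ~/Rscript_test.R ~/test/task_${SLURM_ARRAY_TASK_ID}.Rout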

Related

Running multiple R scripts sequentially from shell in the same R session

Is it possible to run multiple .R files from the shell or a bash script, in sequence, in the same R session (so without having to write intermediate results to disk)?
E.g. if file1.R contains a=1 and file2.R contains print(a+1),
then do something like
$ Rscript file1.R file2.R
[1] 2
(of course a workaround would be to stitch the scripts together or have a master script sourcing 1 and 2)
You could write a wrapper script that calls each script in turn:
source("file1.R")
source("file2.R")
Call this source_files.R and then run Rscript source_files.R. Of course, with something this simple you can also just pass the statements on the command line:
Rscript -e 'source("file1.R"); source("file2.R")'
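If the list of files changes often, the same idea can be generalised (just a sketch; it relies on Rscript exposing trailing arguments via commandArgs()) by sourcing whatever file names are passed on the command line:
# source each .R file given as an argument, in order, within one R session
Rscript -e 'for (f in commandArgs(trailingOnly=TRUE)) source(f)' file1.R file2.R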

R or bash command line length limit

I'm developing a bash program that executes an R one-liner to convert an RMarkdown template into an HTML document.
This R one-liner looks like:
R -e 'library(rmarkdown) ; rmarkdown::render( "template.Rmd", "html_document", output_file = "report.html", output_dir = "'${OUTDIR}'", params = list( param1 = "'${PARAM1}'", param2 = "'${PARAM2}'", ... ) )'
I have a long list of parameters, let's say 10 to explain the problem, and it seems that R or bash has a command-line length limit.
When I execute the R oneliner with 10 parameters I obtain a error message like this:
WARNING: '-e library(rmarkdown)~+~;~+~rmarkdown::render(~+~"template.Rmd",~+~"html_document",~+~output_file~+~=~+~"report.html",~+~output_dir~+~=~+~"output/",~+~params~+~=~+~list(~+~param1~+~=~+~"param2", ...
Fatal error: you must specify '--save', '--no-save' or '--vanilla'
When I execute the R oneliner with 9 parameters it's ok (I tried different combinations to verify that the problem was not the last parameter).
When I execute the R one-liner with 10 parameters but with all spaces removed, it's ok too, so I guess that R or bash imposes a command-line length limit.
R -e 'library(rmarkdown);rmarkdown::render("template.Rmd","html_document",output_file="report.html",output_dir="'${OUTDIR}'",params=list(param1="'${PARAM1}'",param2="'${PARAM2}'",...))'
Is it possible to increase this limit?
This will break in a number of ways, including when your arguments contain spaces or quotes.
Instead, try passing the values as arguments. Something like this should give you an idea how it works:
# create a script file
tee arguments.r << 'EOF'
argv <- commandArgs(trailingOnly=TRUE)
arg1 <- argv[1]
print(paste("Argument 1 was", arg1))
EOF
# set some values
param1="foo bar"
param2="baz"
# run the script with arguments
Rscript arguments.r "$param1" "$param2"
Expected output:
[1] "Argument 1 was foo bar"
Always quote your variables and always use lowercase variable names to avoid conflicts with system or application variables.
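Applied to the rmarkdown case above, the same pattern might look roughly like this (the parameter names and argument order are assumptions, and the lowercase shell variables are assumed to hold the same values as OUTDIR, PARAM1, PARAM2; adapt to your template):
# create a small render script that takes its values as positional arguments
tee render.r << 'EOF'
argv <- commandArgs(trailingOnly=TRUE)
rmarkdown::render("template.Rmd", "html_document",
                  output_file = "report.html",
                  output_dir  = argv[1],
                  params      = list(param1 = argv[2], param2 = argv[3]))
EOF
# call it with quoted shell variables instead of splicing them into the R code
Rscript render.r "$outdir" "$param1" "$param2"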

loop through different arguments in Rscript within Korn shell

I have an R script which I'm running in the terminal by firstly generating a .ksh file called mycode.ksh with the following information:
#!/bin/ksh
Rscript myscript.R 'Input1'
and then run the function with
./mycode.ksh
which sends the script to a node on the cluster in our department (the processes that we send to the cluster must be as a .ksh file).
'Input1' is an input argument that is used by the R script to do some analysis.
The issue that I now have is that I need to run this script a number of times with different input arguments to the function. One solution is to generate a few .ksh files, such as:
#!/bin/ksh
Rscript myscript.R 'Input2'
and
#!/bin/ksh
Rscript myscript.R 'Input3'
and then execute them separately, but I was hoping to find a better solution.
Note that I have to do this for 100 different input arguments, so it is not realistic to write 100 of these files. Is there a way of generating another file with the information needed to be supplied to the function, e.g. 'Input1' 'Input2' 'Input3', and then running mycode.ksh for each of these individually?
For example, I could have a variable defining the name of the input arguments and then have a loop which would pass it to myscript.ksh. Is that possible?
The reason for running these in this manner is that each iteration will hopefully be sent to a different node on the cluster, thus analysing the data at a much faster rate.
You need to do two things:
Create an array of all your input variables
Loop through the array and initiate all your calls
The following illustrates the concept:
#!/bin/ksh
#Create array of inputs - space separator
inputs=(Input1 Input2 Input3 Input4)
# Loop through all the array items {0 ... n-1}
for i in {0..3}
do
echo ${inputs[i]}
done
This will output all the values in the inputs array.
You just need to replace the contents of the do-loop with:
Rscript myscript.R ${inputs[i]}
Also, you may need to add an '&' at the end of the Rscript command line to spawn off each Rscript command as a separate process; otherwise, the shell will wait for each Rscript command to return before going on to the next.
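As a minimal sketch of that backgrounded version (the trailing wait keeps the shell from exiting before the jobs finish):
#!/bin/ksh
inputs=(Input1 Input2 Input3 Input4)
for i in {0..3}
do
# '&' runs each Rscript in the background so the loop doesn't block
Rscript myscript.R ${inputs[i]} &
done
wait   # block until all background Rscript processes have finished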
EDIT:
Based on your comments, you need to actually generate .ksh scripts to submit to qsub. For this you just need to expand the do loop.
For example:
#!/bin/ksh
#Create array of inputs - space separator
inputs=(Input1 Input2 Input3 Input4)
# Loop through all the array items {0 ... n-1}
for i in {0..3}
do
cat > submission.ksh << EOF
#!/bin/ksh
Rscript myscript.R ${inputs[i]}
EOF
chmod u+x submission.ksh
qsub submission.ksh
done
The text between cat > submission.ksh << EOF and the closing EOF is read by cat as its input (stdin), and the output (stdout) is written to submission.ksh.
Then submission.ksh is made executable with the chmod command.
And then the script is submitted via qsub. I'll let you fill in any other arguments you need for qsub.
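If you'd rather keep a copy of each generated script (or avoid reusing the same filename while jobs are still being submitted), a small variation of the same loop writes one file per input; this is just a sketch of that idea:
#!/bin/ksh
inputs=(Input1 Input2 Input3 Input4)
for i in {0..3}
do
# one submission script per input, e.g. submission_0.ksh, submission_1.ksh, ...
cat > submission_${i}.ksh << EOF
#!/bin/ksh
Rscript myscript.R ${inputs[i]}
EOF
chmod u+x submission_${i}.ksh
qsub submission_${i}.ksh
done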
When your script doesn't know all parameters when it starts, you can make a .ksh file called mycode.ksh with the following information:
#!/bin/ksh
if [ $# -ne 1 ]; then
echo "Usage: $0 input"
exit 1
fi
# Or start in the background with nohup .... & (see the other question)
Rscript myscript.R "$1"
and then run the function with
./mycode.ksh inputX
When your application knows all arguments, you can use a loop:
#!/bin/ksh
if [ $# -eq 0 ]; then
echo "Usage: $0 input(s)"
exit 1
fi
for input in "$@"; do
Rscript myscript.R "${input}"
done
and then run the function with
./mycode.ksh input1 input2 "input with space in double quotes" input4

How To Run Different, Multiple Rscripts on SGE Cluster

I am trying to run different Rscripts on a SGE cluster, each Rscript only changes by one variable (e.g. cancer <- "UVM" or "ACC", etc.).
I have attempted two ways: either run a Single Rscript that gets command line arguments for the 30 different cancer names
OR
run each Rscript (i.e. UVM.r, ACC.r, etc.)
Either way, I am having a lot of difficulty figuring out how to submit these jobs so that I can either run one Rscript 30 times with a different argument each time OR run multiple Rscripts with no command-line arguments.
You can use a while loop in bash for this.
Setup input file of arguments, e.g. args.txt:
UVM
ACC
Run qsub in a while loop to submit the script for each argument:
while read arg
do
echo "Rscript script.R ${arg}" | qsub <options>
done <args.txt
The above uses echo to pipe the command to run into qsub.
Alternatively, you can use an SGE array job. A job script like this:
#!/bin/bash
#$ -t 1-30
shift ${SGE_TASK_ID}
exec Rscript script.R $1
Submit it like this: qsub job_script dummy UVM ACC ... (each task shifts away ${SGE_TASK_ID} arguments, so the dummy placeholder means task 1 picks up the first real argument).
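A related variant (just a sketch, assuming the cancer names are listed one per line in a file called args.txt) is to have each array task pull its own line from the file instead of passing everything on the qsub command line:
#!/bin/bash
#$ -t 1-30
# pick the line of args.txt that matches this task's id
arg=$(sed -n "${SGE_TASK_ID}p" args.txt)
Rscript script.R "$arg"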

Converting this code from R to Shell script?

So I'm running a program that works, but the issue is that my computer is not powerful enough to handle the task. I have code written in R, but I have access to a supercomputer that runs a Unix system (as one would expect).
The program is designed to read a .csv file and find everything with the unit ft3(monthly total) in the "Units" column and select the value in the column before it. The files are charts that list things in multiple units.
The program in R:
getwd()
setwd("/Users/youruserName/Desktop")
myData= read.table("yourFileName.csv", header=T, sep=",")
funData= subset(myData, units="ft3(monthly total)", select=units:value)
write.csv(funData, file="funData.csv")
To convert it to a shell script, I tried:
pwd
cd /Users/yourusername/Desktop
touch RunThisProgram
nano RunThisProgram
(((In nano, I wrote)))
if
grep -r yourFileName.csv ft3(monthly total)
cat > funData.csv
else
cat > nofun.csv
fi
control+x (((used control x to close nano)))
chmod -x RunThisProgram
./RunThisProgram
(((It runs for a while)))
We get a funData.csv file output but that file is empty
What am I doing wrong?
It isn't actually running, because there are a couple of problems with your script:
grep needs the pattern first, and quoted; -r is for recursing into a directory...
if without a then
cat is called with no file argument, so it is actually reading from stdin.
You really only need one line:
grep -F "ft3(monthly total)" yourFileName.csv > funData.csv
