How To Run Different, Multiple Rscripts on SGE Cluster

I am trying to run different R scripts on an SGE cluster; each R script differs by only one variable (e.g. cancer <- "UVM" or "ACC", etc.).
I have attempted two approaches: either run a single R script that takes a command-line argument for each of the 30 different cancer names,
OR
run a separate R script per cancer (i.e. UVM.r, ACC.r, etc.).
Either way, I am having a lot of difficulty figuring out how to submit these jobs so that I can either run one R script 30 times with a different argument each time, OR run multiple R scripts with no command-line arguments.

You can use a while loop in bash for this.
Set up an input file of arguments, e.g. args.txt:
UVM
ACC
Run qsub in a while loop to submit the script once per argument:
while read -r arg
do
    echo "Rscript script.R ${arg}" | qsub <options>
done < args.txt
The echo pipes the command to run into qsub as the job script body.
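For completeness, here is the same loop with a few typical SGE options filled in; the job name, output/error file names, and -cwd are placeholders to adapt to your cluster:
while read -r cancer
do
    echo "Rscript script.R ${cancer}" | qsub -cwd -N "job_${cancer}" -o "${cancer}.out" -e "${cancer}.err"
done < args.txt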

A job script like this:
#!/bin/bash
#$ -t 1-30                    # array job: SGE_TASK_ID runs from 1 to 30
shift ${SGE_TASK_ID}          # drop the first SGE_TASK_ID command-line arguments
exec Rscript script.R "$1"    # $1 is now this task's cancer name
Submit it like this: qsub job_script dummy UVM ACC ... (the leading dummy argument offsets the 1-based task ID, so task 1 gets UVM, task 2 gets ACC, and so on).
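Another option along the same lines is to have the array job look its argument up from the args.txt file used in the first answer. A sketch, assuming one cancer name per line:
#!/bin/bash
#$ -t 1-30
# pull the line of args.txt that matches this task's ID
cancer=$(sed -n "${SGE_TASK_ID}p" args.txt)
exec Rscript script.R "$cancer"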

Related

Parallelize a bash script and wait for each loop to finish

I'm trying to write a script that we'll call pippo.R. The aim of pippo.R is to run another script (for.sh) in a for loop, parallelized using two values:
nPerm = total number of times the script has to be run
permAtTime = number of scripts that can run at the same time
A very important requirement is to wait for each loop iteration to finish; that's why I added a file in which all the PIDs are stored and then use wait on each of them. The main problem with this script is the following error:
./wait.sh: line 2: wait: pid 836844 is not a child of this shell
For reproducibility's sake you can put the following files in a folder:
pippo.R
nPerm=10
permAtTime=2
cycles=nPerm/permAtTime
for(i in 1:cycles){
  d=1
  system(paste("./for.sh ", i, " ", permAtTime, sep=""))
}
for.sh
#!/bin/bash
for X in $(seq $1)
do
    nohup ./script.sh $(( X + ($2 - 1) * $1 )) &
    echo $! >> ./save_pid.txt
done
./wait.sh
wait.sh
#!/bin/bash
while read p; do wait $p; done < ./save_pid.txt
Running Rscript pippo.R reproduces the error above. I know the parallel package could help me here, but for several reasons I cannot use it.
Thanks
You don't need to keep track of PIDs: if you call wait without any arguments, the script will wait for all of its child processes to finish.
#!/bin/bash
for X in $(seq $1)
do
    nohup ./script.sh $(( X + ($2 - 1) * $1 )) &
done
wait
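If you also need to respect permAtTime without the parallel package, a minimal sketch in plain bash is to launch the jobs in batches and wait between batches (the variable names follow the question; script.sh is the asker's script):
#!/bin/bash
nPerm=$1        # total number of runs
permAtTime=$2   # runs allowed at the same time
for X in $(seq "$nPerm")
do
    ./script.sh "$X" &
    # once permAtTime jobs have been launched, wait for the whole batch
    if (( X % permAtTime == 0 )); then
        wait
    fi
done
wait    # wait for any remaining jobs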

Parallelizing an Rscript using a job array in Slurm

I want to run an Rscript.R using an array job in Slurm, with tasks 1-10, whereby the task ID from the job is passed to the R script, which writes a file named "'task id'.out" containing the task ID in its body. However, this has proven to be more challenging than I anticipated. I am trying the following:
~/bash_test.sh looks like:
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
R CMD BATCH --no-save --no-restore ~/Rscript_test.R $SLURM_ARRAY_TASK_ID
~/Rscript_test.R looks like:
#!/usr/bin/env Rscript
taskid = commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID')
taskid <- as.data.frame(taskid)
# print task number
print(paste0("the number processed was... ", taskid))
write.table(taskid, paste0("~/test/",taskid,".out"),quote=FALSE, row.names=FALSE, col.names=FALSE)
After I submit my job (sbatch bash_test.sh), it looks like R is not really seeing SLURM_ARRAY_TASK_ID. The script generates 10 files (named just 1, 2, ..., 10, probably corresponding to the task IDs), but it does not write the ".out" files; instead it wrote a single empty "integer(0).out" file.
What I wanted was to populate the folder ~/test/ with 10 files, 1.out, 2.out, ..., 10.out, each containing its task ID (simply the number 1, 2, ..., or 10, respectively).
P.S.: Note that I tried playing with Sys.getenv() too, but I don't think I was able to set that up properly. That option generates the 10 numbered files plus one 1.out file containing the number 10.
P.S.2: This is Slurm 19.05.5. I am running R within a conda environment.
You should avoid using "R CMD BATCH". It doesn't handle arguments the way most command-line programs do; "Rscript" has been the recommended option for a while now. By calling "R CMD BATCH" you are also ignoring the "#!/usr/bin/env Rscript" line at the top of your script.
So change your script file to
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
Rscript ~/Rscript_test.R $SLURM_ARRAY_TASK_ID
Then be careful in your script that you aren't using the same variable as both a string and a data.frame; you can't easily paste a data.frame into a file path, for example. So:
taskid <- commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID') # This should also work
print(paste0("the number processed was... ", taskid))
outdata <- as.data.frame(taskid)
outfile <- paste0("~/test/", taskid, ".out")
write.table(outdata, outfile, quote=FALSE, row.names=FALSE, col.names=FALSE)
The extra files named with just the array number were created because the usage of R CMD BATCH is
R CMD BATCH [options] infile [outfile]
so the $SLURM_ARRAY_TASK_ID value you were passing on the command line was treated as the outfile name. That value would instead have to be passed among the options (via --args). But again, it's better to use Rscript, which follows the standard argument conventions.
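For comparison, a sketch of the two invocation styles (the .Rout path is just an example name):
# Rscript: trailing arguments end up in commandArgs(trailingOnly=TRUE)
Rscript ~/Rscript_test.R "$SLURM_ARRAY_TASK_ID"
# R CMD BATCH: arguments have to be wrapped in --args, and the last positional
# argument is the output file, which is why the bare task IDs showed up as
# files named 1, 2, ..., 10
R CMD BATCH --no-save --no-restore "--args $SLURM_ARRAY_TASK_ID" ~/Rscript_test.R ~/test/task_${SLURM_ARRAY_TASK_ID}.Rout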

Running multiple R scripts sequentially from shell in the same R session

Is it possible to run multiple .R files from the shell or a bash script, in sequence, in the same R session (so without having to write intermediate results to disk)?
E.g. if file1.R contains a=1 and file2.R contains print(a+1), then do something like
$ Rscript file1.R file2.R
[1] 2
(of course a workaround would be to stitch the scripts together or have a master script sourcing 1 and 2)
You could write a wrapper script that calls each script in turn:
source("file1.R")
source("file2.R")
Call this source_files.R and then run Rscript source_files.R. Of course, with something this simple you can also just pass the statements on the command line:
Rscript -e 'source("file1.R"); source("file2.R")'
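If you specifically want the Rscript file1.R file2.R calling convention from the question, one sketch is to make the wrapper source whatever files it is given on the command line; everything runs in one R session, so a defined in file1.R is visible to file2.R:
Rscript -e 'for (f in commandArgs(trailingOnly = TRUE)) source(f)' file1.R file2.R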

Pass output of R script to bash script

I would like to pass the output of my R script to a bash script.
The output of the R script is the title of a video: "Title of Video"
And my bash script is simply:
youtube-dl title_of_video.avi https://www.youtube.com/watch?v=w82a1FTere5o88
Ideally I would like the output file of the video to be "Title of Video".avi
I know I can use Rscript to launch an R script from a bash command, but I don't think Rscript alone can help me here.
In bash you can call a command and use its output further via the $(my_command) syntax.
Make sure your script only outputs the title.
E.g.
# in getTitle.R
cat('This is my title') # note, print would have resulted in an extra "[1]"
# in your bash script:
youtube-dl "$(Rscript getTitle.R)" http://blablabla   # quotes keep a multi-word title as one argument
If you want to pass arguments to your R script as well, do so inside the $() syntax; if you want to pass those arguments to your bash script and delegate them to the R script, you can use the special bash variables $1 (the first argument, etc.) or $* to denote all arguments passed to the bash script, e.g.
#!/bin/bash
youtube-dl "$(Rscript getTitle.R "$1")" http://blablablabla
(this assumes that getTitle.R does something to those arguments internally via commandArgs etc to produce the wanted title output)
You can call Rscript in your bash script and assign the output of the R script to a variable. After that you can execute
youtube-dl "$outputFromRScript" https://www.youtube.com/watch?v=w82a1FTere5o88
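Putting that together, a sketch of the variable-based approach (getTitle.R is from the answer above, the URL is from the question, and -o is youtube-dl's output-template option):
#!/bin/bash
# capture the title printed by the R script, then use it as the output file name
title="$(Rscript getTitle.R)"
youtube-dl -o "${title}.avi" "https://www.youtube.com/watch?v=w82a1FTere5o88"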

loop through different arguments in Rscript within Korn shell

I have an R script which I'm running on the terminal by first generating a .ksh file called mycode.ksh with the following contents:
#!/bin/ksh
Rscript myscript.R 'Input1'
and then running it with
./mycode.ksh
which sends the script to a node on the cluster in our department (the processes that we send to the cluster must be submitted as .ksh files).
'Input1' is an input argument that the R script uses for some analysis.
The issue that I now have is that I need to run this script a number of times with different input arguments to the function. One solution is to generate a few .ksh files, such as:
#!/bin/ksh
Rscript myscript.R 'Input2'
and
#!/bin/ksh
Rscript myscript.R 'Input3'
and then execute them separately, but I was hoping to find a better solution.
Note that I have to do this for 100 different input arguments, so it is not realistic to write 100 of these files. Is there a way of putting the information that needs to be supplied to the script, e.g. 'Input1' 'Input2' 'Input3', into another file and then running mycode.ksh for each of these individually?
For example, I could have a variable holding the names of the input arguments and a loop which passes each one to mycode.ksh. Is that possible?
The reason for running these in this manner is so that each iteration will hopefully be sent to a different node on the cluster, thus analysing the data at a much faster rate.
You need to do two things:
Create an array of all your input variables
Loop through the array and initiate all your calls
The following illustrates the concept:
#!/bin/ksh
# Create array of inputs - space separated
inputs=(Input1 Input2 Input3 Input4)
# Loop through all the array items {0 ... n-1}
for i in {0..3}
do
    echo ${inputs[i]}
done
This will output all the values in the inputs array.
You just need to replace the contents of the do-loop with:
Rscript myscript.R ${inputs[i]}
Also, you may need to add an `&` at the end of the Rscript command line to run each Rscript command as a separate background process; otherwise, the shell waits for each Rscript command to return before going on to the next.
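For example, a sketch of the backgrounded version; the final wait keeps the script from exiting while Rscript processes are still running:
#!/bin/ksh
inputs=(Input1 Input2 Input3 Input4)
for i in {0..3}
do
    Rscript myscript.R "${inputs[i]}" &   # & puts each Rscript in the background
done
wait   # block until all background Rscripts have finished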
EDIT:
Based on your comments, you need to actually generate .ksh scripts to submit to qsub. For this you just need to expand the do loop.
For example:
#!/bin/ksh
#Create array of inputs - space separator
inputs=(Input1 Input2 Input3 Input4)
# Loop through all the array items {0 ... n-1}
for i in {0..3}
do
cat > submission.ksh << EOF
#!/bin/ksh
Rscript myscript.R ${inputs[i]}
EOF
chmod u+x submission.ksh
qsub submission.ksh
done
The EOF markers delimit the here-document that cat reads on its standard input, and cat's output is redirected into submission.ksh.
Then submission.ksh is made executable with the chmod command.
And then the script is submitted via qsub. I'll let you fill in any other arguments you need for qsub.
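If you would rather keep the generated scripts around (for inspection or resubmission), a small variation writes one uniquely named file per input instead of overwriting submission.ksh on every pass:
#!/bin/ksh
inputs=(Input1 Input2 Input3 Input4)
for i in {0..3}
do
    cat > "submission_${inputs[i]}.ksh" << EOF
#!/bin/ksh
Rscript myscript.R ${inputs[i]}
EOF
    chmod u+x "submission_${inputs[i]}.ksh"
    qsub "submission_${inputs[i]}.ksh"
done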
When your script doesn't know all the parameters when it starts, you can make a .ksh file called mycode.ksh with the following contents:
#!/bin/ksh
if [ $# -ne 1 ]; then
echo "Usage: $0 input"
exit 1
fi
# Or start it in the background with nohup ... & (see the earlier question)
Rscript myscript.R "$1"
and then run the function with
./mycode.ksh inputX
When your application knows all arguments, you can use a loop:
#!/bin/ksh
if [ $# -eq 0 ]; then
echo "Usage: $0 input(s)"
exit 1
fi
for input in "$@"; do
Rscript myscript.R "${input}"
done
and then run the function with
./mycode.ksh input1 input2 "input with space in double quotes" input4
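If the 100 inputs live in a file (one per line), the same idea works without typing them all on the command line. A sketch, assuming the file is called inputs.txt and mycode.ksh is the single-argument version above:
#!/bin/ksh
# submit mycode.ksh once per line of inputs.txt
while read -r input; do
    ./mycode.ksh "$input"
done < inputs.txt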
