Run R code in parallel in a shell without having an R file

I've got the following .sh file which can be run on a cluster computer using sbatch:
Shell.sh
#!/bin/bash
#
#SBATCH -p smp # partition (queue)
#SBATCH -N 2 # number of nodes
#SBATCH -n 2 # number of cores
#SBATCH --mem 2000 # memory pool for all cores
#SBATCH -t 5-0:00 # time (D-HH:MM)
#SBATCH -o out.out # STDOUT
#SBATCH -e err.err # STDERR
module load R
srun -N1 -n1 R CMD BATCH ./MyFile.R &
srun -N1 -n1 R CMD BATCH ./MyFile2.R &
wait
My problem is that MyFile.R and MyFile2.R almost look the same:
MyFile.R
source("Experiment.R")
Experiment(args1) # some arguments
MyFile2.R
source("Experiment.R")
Experiment(args2) # some arguments
In fact, I need to do this for about 100 files. Since they all load some R file and then run the experiment with different arguments, I was wondering whether I could do this without creating a new file for each run. I want to run all processes in parallel, so I can't just create one single R file, I think.
My question is: is there some way to run the process directly from the shell, without having an R file for each run? So can I do something like
srun -N1 -n1 R CMD BATCH 'source("Experiment.R"); Experiment(args1)' &
srun -N1 -n1 R CMD BATCH 'source("Experiment.R"); Experiment(args2)' &
wait
instead of the last three lines in Shell.sh?

Your batch script should still include two srun lines to start two separate R processes, but you can pass the arguments on the command line and reuse the same R file:
module load R
srun -N1 -n1 Rscript ./MyFile.R args1_1 args1_2 &
srun -N1 -n1 Rscript ./MyFile.R args2_1 args2_2 &
wait
Then within your R file:
source("Experiment.R")
#Get aruments from the command line
argv <- commandArgs(TRUE)
# Check if the command line is not empty and convert values if needed
if (length(argv) > 0){
nSim <- as.numeric( argv[1] )
meanVal <- as.numeric( argv[2] )
} else {
nSim=100 # some default values
meanVal =5
}
Experiment(nSim, meanVal) # some arguments
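Since the question mentions roughly 100 runs, the batch script does not need one srun line per run either. A minimal sketch, assuming the argument pairs are listed one per line in a plain text file (args.txt is a made-up name here):
module load R
# args.txt (hypothetical): one line per run, e.g. "100 5"
while read -r nSim meanVal; do
    # --exclusive asks for dedicated CPUs per step, so extra steps wait instead of oversubscribing;
    # stdin is redirected so srun does not swallow the loop's input
    srun -N1 -n1 --exclusive Rscript ./MyFile.R "$nSim" "$meanVal" < /dev/null &
done < args.txt
wait
You would also need to raise -n in the #SBATCH header to however many runs you want active at the same time.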
If you prefer to use R command instead of Rscript, then your batch script should look like:
module load R
srun -N1 -n1 R -q --slave --vanilla --args args1_1 args1_2 < MyFile.R &
srun -N1 -n1 R -q --slave --vanilla --args args2_1 args2_2 < MyFile.R &
wait
You may or may not need quotes around the "R -q --slave ... < MyFile.R" part.
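If you would rather have no per-run R file at all, as in the original question, Rscript can also take the code inline via its -e flag. A sketch, where the numeric arguments are just placeholders for args1 and args2:
module load R
srun -N1 -n1 Rscript -e 'source("Experiment.R"); Experiment(100, 5)' &
srun -N1 -n1 Rscript -e 'source("Experiment.R"); Experiment(200, 5)' &
wait
The single quotes keep the shell from expanding anything inside the R expression.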

Related

Batch and Bash codes while submitting jobs

I was used to the following way of submitting jobs to be run in R sequentially under PBS/Torque.
Following is my R code, named simsim.R:
#########
set <- 1
#########
# Read i
#########
# the following two lines refer to the bash code
arg <- commandArgs()
arg
itration <- as.numeric(arg)[3]
itration
setwd("/home/habijabi")
save(arg,itration,
file = paste0('simsim_RESULT_',set,itration,'.RData'))
Now I write the following script:
#!/bin/bash
Chains=10
for cha in `seq 1 $Chains`
do
echo "Chains: " $cha
sleep 1
qsub -q long -l nodes=1:ppn=12,walltime=24:00:00 -v c=$cha ./diffv1.sh
done
In this 'diffv1.sh' I load the module and pass the variable 'c':
#!/bin/bash
## input values
c=$c
#configure software
module load R/4.1.2
#changed
cd /home/habijabi
R --no-save < simsim.R $c
In this way I used to send the '$c' value to my R code, and it would produce ten .RData result files with the corresponding names.
But then I had to change to SLURM. Following is the batch script I was using:
#!/bin/bash
#SBATCH --job-name=R-test
#IO files
#SBATCH --error=R-test.%J.err
#SBATCH --output=R-test.%J.out
#!/bin/bash
module load R/4.1.2
set -e -x
mkdir -p jobs
cd /home/habijabi
for cha in {1..10}
do
sbatch --time=24:00:00 \
--ntasks-per-node=12 \
--nodes=1 \
-p compute \
-o jobs/${cha}_srun.txt \
--wrap="R --no-save < /home/habijabi/simsim.R ${cha}"
done
But with this code only one or two of the jobs run, and I do not understand why, after submitting 150 jobs, it does not run all of them. The run file shows the following:
+ mkdir -p jobs
+ cd /home/habijabi
+ for cha in '{1..10}'
+ sbatch --time=24:00:00 --ntasks-per-node=12 --nodes=1 -p compute -o jobs/1_srun.txt '--wrap=R --no-save < /home/habijabi/simsim.R 1'
+ for cha in '{1..10}'
+ sbatch --time=24:00:00 --ntasks-per-node=12 --nodes=1 -p compute -o jobs/2_srun.txt '--wrap=R --no-save < /home/habijabi/simsim.R 2'
+ for cha in '{1..10}'
+ sbatch --time=24:00:00 --ntasks-per-node=12 --nodes=1 -p compute -o jobs/3_srun.txt '--wrap=R --no-save < /home/habijabi/simsim.R 3'
...so on...
and the .out file shows the following
Submitted batch job 146299
Submitted batch job 146300
Submitted batch job 146301
Submitted batch job 146302
Submitted batch job 146303
......
......
Both look fine. But only a few of the jobs run, and the majority of them give the following error:
/opt/ohpc/pub/libs/gnu8/R/4.1.2/lib64/R/bin/exec/R: error while loading shared libraries: libpcre2-8.so.0: cannot open shared object file: No such file or directory
I do not understand what I have done wrong; this does not produce anything. I am new to this type of coding, so any help is appreciated.

Slurm job array error: slurmstepd: error: execve(): Rscript: No such file or directory

I am trying to get a very basic job array script working using the Slurm job scheduler on an HPC. I am getting the error:
slurmstepd: error: execve(): Rscript: No such file or directory
This is similar to this question, but I am not using any export commands, so that isn't the cause here. Some sources say it could be because the scripts were created in Windows, so the line endings will not work for Unix. Could this be the issue? If so, how can I check for it?
My shell script:
#!/bin/bash
# Example of running R script with a job array
#SBATCH --nodes=1
#SBATCH --array=1-10 # how many tasks in the array
#SBATCH --ntasks-per-node=10
#SBATCH -o hello-%j-%a.txt
#SBATCH --mail-user=user#email.address
# Load software
module load R/4.0.0-foss-2020a
# Run R script with a command line argument
srun Rscript hello.R $SLURM_ARRAY_TASK_ID
The script hello.R is:
#!/usr/bin/env
# Rscript
# accept command line arguments and save them in a list called args
args = commandArgs(trailingOnly=TRUE)
# print task number
print(paste0('Hello! I am a task number: ', args[1]))
Thanks to some offline help I have got an answer to my question. It seems I didn't need to use srun in the shell script, and I needed to include Rscript in the shebang line in hello.R.
The shell script is:
#!/bin/bash
# Example of running R script with a job array
#SBATCH --nodes=1
#SBATCH --array=1-10 # how many tasks in the array
#SBATCH --ntasks-per-node=10
#SBATCH -o hello-%j-%a.txt
#SBATCH --mail-user=tclewis1#sheffield.ac.uk
# Load software
module load R/4.0.0-foss-2020a
# Run R script with a command line argument
Rscript hello.R $SLURM_ARRAY_TASK_ID
And hello.R is now:
#!/usr/bin/env Rscript
# accept command line arguments and save them in a list called args
args = commandArgs(trailingOnly=TRUE)
# print task number
print(paste0('Hello! I am a task number: ', args[1]))
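As for the Windows line-ending question raised above: a quick way to check and fix it, sketched here with submit.sh standing in for whatever the batch script is called; dos2unix may not be installed on every cluster, in which case the sed line does the same job:
# "with CRLF line terminators" in the output means Windows line endings
file submit.sh hello.R
# or look for ^M at the end of each line
cat -A submit.sh | head
# convert in place (either command works)
dos2unix submit.sh hello.R
sed -i 's/\r$//' submit.sh hello.R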

Passing SLURM batch command line arguments to R

I'm trying to run a SLURM sbatch command with various parameters that I can read in an R script. When using the PBS system, I used to write qsub -v param1=x,param2=y (plus other system parameters like the memory requirements etc. and the script name to be read by PBS) and then read it in the R script with x = Sys.getenv('param1').
Now I tried
sbatch run.sh --export=basePath='a'
With run.sh:
#!/bin/bash
cd $SLURM_SUBMIT_DIR
echo $PWD
module load R/common/3.3.3
R CMD BATCH --quiet --no-restore --no-save runDo.R output.txt
And runDo.R:
base.path = Sys.getenv('basePath')
print(base.path)
The script runs, but the argument value is not assigned to the base.path variable (it prints an empty string).
The --export parameter has to be passed to sbatch, not to the run.sh script.
It should be like this:
sbatch --export=basePath='a' run.sh
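One caveat worth knowing, as far as I understand sbatch's --export option: listing variables explicitly without ALL means only those variables, plus SLURM's own, reach the job, which can drop things like PATH that module settings rely on. If that happens, keep ALL in the list:
sbatch --export=ALL,basePath=a run.sh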

Submitting R scripts with command_line_arguments to PBS HPC clusters

Please could you advise me on the following: I've written an R script that reads 3 arguments from the command line, i.e.:
args <- commandArgs(TRUE)
TUMOR <- args[1]
GERMLINE <- args[2]
CHR <- args[3]
When I submit the R script to a PBS HPC scheduler, I do the following (below), but I am getting an error message.
(I am not posting the error message, because the R script I wrote works fine when it is run from a regular terminal.)
May I ask how you usually submit R scripts with command line arguments to PBS HPC schedulers?
qsub -d $PWD -l nodes=1:ppn=4 -l vmem=10gb -m bea -M tanasa#gmail.com \
-v TUMOR="tumor.bam",GERMLINE="germline.bam",CHR="chr22" \
-e script.efile.chr22 \
-o script.ofile.chr22 \
script.R
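A common pattern, sketched here rather than tested on this particular cluster: point qsub at a small shell wrapper, let the -v variables arrive as environment variables, and have the wrapper forward them to the R script as positional arguments so the commandArgs() code above keeps working. run_script.sh is a made-up name, and the module line assumes R is provided as a module:
#!/bin/bash
# run_script.sh (hypothetical), submitted with the same qsub line as above
# but with run_script.sh in place of script.R
module load R
Rscript script.R "$TUMOR" "$GERMLINE" "$CHR"   # -v variables forwarded as args[1..3]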

for loop of subjobs using qsub R

I am trying to run subjobs (one for each chromosome) using R --vanilla. Since each chromosome is independent, I want them to run in parallel on the system. I have written the following script:
#!/bin/bash
for K in {20..21};
do
qsub -V -cwd -b y -q short.q R --vanilla --args arg1$K arg2$K arg3$K < RareMETALS.R > loggroup$K.txt; done
But somehow R opens interactively instead of running the script non-interactively as expected. When I try the command itself,
R --vanilla --args arg1 arg2 arg3 < RareMETALS.R > loggroup.txt
it runs perfectly and calls the script.
Can somebody guide me, or point out what the problem might be?
My take on this would be to use echo instead of the --args option to pass parameters to the script. I find separating the script from the Grid Engine code to be more straightforward:
for K in {20..21};
do
echo "Rscript RareMETALS.R arg1$K arg2$K arg3$K > loggroup$K.txt" | qsub -V -cwd -q short.q
done
As others have commented, use Rscript.
The code seems cleaner to me, but there may be some limitations to using echo as opposed to --args that I am unaware of.
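If in doubt, the argument handling of the two invocations can be compared locally before anything is submitted. A quick sketch, where /tmp/argcheck.R is a throwaway file made up for the test:
echo 'print(commandArgs(trailingOnly = TRUE))' > /tmp/argcheck.R
Rscript /tmp/argcheck.R arg120 arg220 arg320
R --vanilla --args arg120 arg220 arg320 < /tmp/argcheck.R
Both should print the same three arguments, so a script that reads commandArgs(trailingOnly = TRUE) sees no difference between the two approaches.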
