For loop of subjobs using qsub and R

I am trying to run subjobs (one for each chromosome) using R --vanilla. Since each chromosome is independent, I want them to run in parallel on the system. I have written the following script:
#!/bin/bash
for K in {20..21};
do
qsub -V -cwd -b y -q short.q R --vanilla --args arg1$K arg2$K arg3$K < RareMETALS.R > loggroup$K.txt; done
But somehow R opens interactively instead of running in batch mode as it is supposed to. When I try the command itself,
R --vanilla --args arg1 arg2 arg3 < RareMETALS.R > loggroup.txt
it runs perfectly and calls the script.
Can somebody guide me, or point out what might be the problem?

My take on this would be to use echo instead of the --args option to pass parameters to the script. I find separating the script from the Grid Engine code to be more straightforward:
for K in {20..21};
do
echo "Rscript RareMETALS.R arg1$K arg2$K arg3$K > loggroup$K.txt" | qsub -V -cwd -q short.q
done
As others have commented, use Rscript.
The code seems cleaner to me, but there may be limitations to using echo as opposed to --args that I am unaware of.
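For completeness, here is a minimal sketch of how the R script might read those three positional arguments when called through Rscript. This assumes the script parses its own arguments with commandArgs; the actual RareMETALS.R may do this differently.
args-demo.R (hypothetical illustration):
args <- commandArgs(trailingOnly = TRUE)  # only the arguments after the script name
arg1 <- args[1]
arg2 <- args[2]
arg3 <- args[3]
cat("received:", arg1, arg2, arg3, "\n")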

Related

r + hpc + git question: submitting multiple jobs with different values for a parameter list [duplicate]

I am running R on a multiple node Linux cluster. I would like to run my analysis on R using scripts or batch mode without using parallel computing software such as MPI or snow.
I know this can be done by dividing the input data such that each node runs different parts of the data.
My question is how do I go about this exactly? I am not sure how I should code my scripts. An example would be very helpful!
I have been running my scripts so far using PBS, but it only seems to run on one node, as R is a single-threaded program. Hence, I need to figure out how to adjust my code so it distributes the work across all of the nodes.
Here is what I have been doing so far:
1) command line:
qsub myjobs.pbs
2) myjobs.pbs:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64

pbsdsh -v $PBS_O_WORKDIR/myscript.sh
3) myscript.sh:
#!/bin/sh
cd $PBS_O_WORKDIR
R CMD BATCH --no-save my_script.R
4) my_script.R:
library(survival)
...
write.table(test, "TESTER.csv", sep=",", row.names=F, quote=F)
Any suggestions will be appreciated! Thank you!
-CC
This is rather a PBS question; I usually make an R script (with the Rscript path after #!) and have it read a parameter (using the commandArgs function) that controls which "part of the job" the current instance should handle. Because I use multicore a lot, I usually need only 3-4 nodes, so I just submit a few jobs calling this R script, each with one of the possible control argument values.
On the other hand, your use of pbsdsh should do its job... The value of PBS_TASKNUM can then be used as the control parameter.
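As a rough illustration of the commandArgs approach described above (the file names and chunk count below are hypothetical), the worker script might look like this, with each submitted job passing a different chunk number, e.g. Rscript chunk_worker.R 3:
chunk_worker.R (sketch):
#!/usr/bin/env Rscript
args     <- commandArgs(trailingOnly = TRUE)
chunk    <- as.integer(args[1])   # which part of the job this instance handles
n_chunks <- 12                    # e.g. nodes * ppn from the PBS request
dat <- read.csv("inputdata.csv")  # hypothetical shared input file
idx <- which(seq_len(nrow(dat)) %% n_chunks == chunk %% n_chunks)
res <- dat[idx, , drop = FALSE]   # replace with the real analysis on this slice
write.csv(res, sprintf("result_%02d.csv", chunk), row.names = FALSE)
The pbsdsh variant is the same idea with Sys.getenv("PBS_TASKNUM") in place of the explicit argument.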
This was an answer to a related question, but it answers the comment above as well.
For most of our work we do run multiple R sessions in parallel using qsub (instead).
If it is for multiple files I normally do:
while read infile rest
do
qsub -v infile=$infile call_r.pbs
done < list_of_infiles.txt
call_r.pbs:
...
R --vanilla -f analyse_file.R $infile
...
analyse_file.R:
args <- commandArgs()
infile <- args[5]
outfile <- paste(infile, ".out", sep="")
...
Then I combine all the output afterwards...
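The combining step can be as simple as stacking the per-file results; a minimal sketch, assuming each .out file is a tab-delimited table with a header (names here are illustrative):
combine.R:
outfiles <- list.files(pattern = "\\.out$")
combined <- do.call(rbind, lapply(outfiles, read.table, header = TRUE, sep = "\t"))
write.table(combined, "all_results.txt", sep = "\t", row.names = FALSE, quote = FALSE)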
This problem seems very well suited to GNU parallel. GNU parallel has an excellent tutorial here. I'm not familiar with pbsdsh, and I'm new to HPC, but to me it looks like pbsdsh serves a similar purpose to GNU parallel. I'm also not familiar with launching R from the command line with arguments, but here is my guess at how your PBS file would look:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64
...
parallel -j2 --wd $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
    Rscript myscript.R {} :::: infilelist.txt
where infilelist.txt lists the data files you want to process, e.g.:
inputdata01.dat
inputdata02.dat
...
inputdata12.dat
Your myscript.R would access the command line argument to load and process the specified input file.
My main purpose with this answer is to point out the availability of GNU parallel, which came about after the original question was posted. Hopefully someone else can provide a more tangible example. Also, I am still wobbly with my usage of parallel; for example, I'm unsure of the -j2 option. (See my related question.)

Passing SLURM batch command line arguments to R

I'm trying to run a SLURM sbatch command with various parameters that I can read in an R script. When using the PBS system, I used to write qsub -v param1=x,param2=y (plus other system parameters like the memory requirements etc. and the script name to be read by PBS) and then read it in the R script with x = Sys.getenv('param1').
Now I tried
sbatch run.sh --export=basePath='a'
With run.sh:
#!/bin/bash
cd $SLURM_SUBMIT_DIR
echo $PWD
module load R/common/3.3.3
R CMD BATCH --quiet --no-restore --no-save runDo.R output.txt
And runDo.R:
base.path = Sys.getenv('basePath')
print(base.path)
The script is running but the argument value is not assigned to base.path variable (it prints an empty string).
The --export parameter has to be passed to sbatch, not to the run.sh script.
It should be like this:
sbatch --export=basePath='a' run.sh
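For illustration (the path here is just a placeholder): note that when --export lists specific variables, SLURM may propagate only those variables (plus its own) to the job, so the ALL form is sometimes needed to keep the rest of the environment, e.g. module settings.
# --export is an option to sbatch itself, so it goes before the script name
sbatch --export=basePath=/path/to/data run.sh
# keep the full submitting environment and add the extra variable
sbatch --export=ALL,basePath=/path/to/data run.sh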

LSF Sequential file name job submission

I am a beginner at submitting jobs on a cluster. I use R to code, and my objective is to run a set of sequentially named scripts, say for example main1.R, main2.R and so on up to about 100. All these R scripts are stand-alone scripts and do not take an input argument. Instead of submitting these as
bsub - W 24:00 -n 48 "R --vanilla --slave < main1.R"
and so on up to 100 files, is there any way to use a job array to specify the file name so that it does not have to be passed as an input argument?
I did look up at some documentation and the best I could look up was
bsub -W 24:00 -n 48 -J "myarray[1:1000]" "R --vanilla --slave < main%I.R"
Any tips or ideas would be of great help.
Thank you
Will the -i option of bsub help? Try this:
bsub -i main%I.R -J "array[1-100]" "R --vanilla --slave"
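If -i does not fit, another pattern (sketched here and untested) is to build the file name inside a small wrapper script from LSB_JOBINDEX, the array index LSF sets for each element:
run_main.sh (hypothetical wrapper):
#!/bin/bash
# LSB_JOBINDEX is set by LSF for each element of the job array
# submit with: bsub -W 24:00 -n 48 -J "myarray[1-100]" ./run_main.sh
R --vanilla --slave < main${LSB_JOBINDEX}.R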

Rscript not working in qsub cluster

I have two Rscripts named iHS.hist.R and Fst.hist.R. I know both scripts work. When I use the following commands in my directory in my Ubuntu terminal, I get a histogram plot for each script (two total if I do both scripts):
module load R
Rscript iHS.hist.R
or I could do Rscript Fst.hist.R
The point is I know they both work.
The problem is that each Rscript takes about 20 minutes to run because my data is pretty big, and unfortunately it's only going to get bigger. I have access to a cluster and I would like to make use of it. I have created two .sh scripts to send to the cluster with qsub, but I am running into issues. Here is my iHS.hist.sh script for my iHS.hist.R script:
#PBS -N iHS.plots
#PBS -S /bin/bash
#PBS -l walltime=2:00:00
#PBS -l nodes=1:ppn=8
#PBS -l mem=4gb
#PBS -o $HOME/${PBS_JOBNAME}.o${PBS_JOBID}.log
#PBS -e $HOME/${PBS_JOBNAME}.e${PBS_JOBID}.err
###############related commands
###edit it
#code in qsub
###############cut columns we don't need
###
cut -f1,2,3,4 /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/matched_snps_annotated.txt > /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/cut.matched_snps_annotated.txt
cut -f1,2 /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/input_snps_insufficient_matches.txt > /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/cut.input_snps_insufficient_matches.txt
###
###############only needed columns remain
cd /group/stranger-lab/ebeiter
module load R
Rscript iHS.hist.R
The cuts in the beginning are for setting up the data in the right format.
I have tried qsub iHS.hist.sh and it gives me a job. I check on it, and after about 10 minutes it finishes, so I'm assuming it's running my Rscript. I check the error file and it's empty. I check the log file and it does not give me the usual null device 1 that I get after my jpeg is completed in my Rscript. I don't get the output jpeg file from the Rscript when the cluster job is done. I do get the output jpeg file if I just run the Rscript on its own, as at the top of this post. Any idea what is going on?

Passing arguments from a call to a bash script to an Rscript

I have a bash script that does some things and then calls Rscript. Here a simple example to illustrate:
test.sh:
Rscript test.r
test.r:
args <- commandArgs()
print(args)
How can I make ./test.sh hello on the command line result in R printing hello?
You can have bash pass all the arguments to the R script using something like this for a bash script:
#!/bin/bash
Rscript /path/to/R/script --args "$*"
exit 0
You can then choose how many of the arguments from $* need to be discarded inside of R.
I found that the way to deal with this is:
test.sh:
Rscript test.r $1
test.r:
args <- commandArgs(TRUE)
print(args)
The $1 represents the first argument passed to the bash script.
When calling commandArgs() instead of commandArgs(TRUE), the argument from bash is not returned on its own; the arguments R uses internally (the executable path, its options, and the script name) are included as well.
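If more than one argument needs to be forwarded, the same pattern works with "$@" in place of $1 (a small variation, not part of the original answer):
test.sh:
#!/bin/bash
# forward all command line arguments to the R script
Rscript test.r "$@"
test.r:
args <- commandArgs(trailingOnly = TRUE)  # only the arguments after the script name
print(args)
Running ./test.sh hello world then prints [1] "hello" "world".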
Regarding asb's answer:
having "--args" in the line of bash script doesn't work, the "--args" was taken as the literal of real argument that I want to pass into my R script. Taking it out works, i.e. "Rscript /path/to/my/rfile.R arg1 arg2"
bash version: GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu)
Rscript version: R scripting front-end version 3.0.1 (2013-05-16)
