LSF Sequential file name job submission - r

I am a beginner with submitting jobs on a cluster. I use R to code, and my objective is to run a set of sequentially named scripts, say main1.R, main2.R and so on up to about 100. All these R scripts are standalone and do not take an input argument. Instead of submitting each one as
bsub -W 24:00 -n 48 "R --vanilla --slave < main1.R"
and so on for all 100 files, is there any way to use a job array to supply the file name, given that it is not passed as an input argument?
I did look at some documentation, and the closest I could find was
bsub -W 24:00 -n 48 -J "myarray[1-1000]" "R --vanilla --slave < main%I.R"
Any tips or ideas would be of great help.
Thank you

Will the -i option of bsub help? Try this:
bsub -i main%I.R -J "array[1-100]" "R --vanilla --slave"
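If the -i substitution does the trick, combining it with the resource options from the question might look like the following (the -o log naming with %I is an extra suggestion, not part of the original answer):
bsub -W 24:00 -n 48 -J "myarray[1-100]" -i "main%I.R" -o "main%I.Rout" "R --vanilla --slave"
LSF replaces %I with the array index in the -i, -o and -e file names, so element 1 reads main1.R, element 2 reads main2.R, and so on.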

Related

Use of sleep in bash within for loop launching sbatch

I want to submit an R script myjob.R that takes two arguments for which I have several scenarios (here only a few as an example).
I want to pass these arguments by looping through scens and sets.
In order to avoid overloading the squeue on the cluster, I don't want to submit the whole loop at once.
Instead I want to wait 1h between each individual job submission.
Therefore, I included a sleep 1h command after each iteration.
I used to launch the bash script via bash mybash.sh; however, this requires keeping the terminal open until all jobs have been submitted.
My solution was then to launch mybash.sh via sbatch mybash.sh, effectively nesting two sbatch commands. This seems to work very well.
My question is simply whether there is any reason against submitting nested sbatch commands like this.
Thanks!
Here is mybash.sh script:
#!/bin/bash
scens=('AAA' 'BBB')
sets=('set1' 'set2')
wd=/projects/workdir
for sc in "${!scens[@]}"; do
  for se in "${!sets[@]}"; do
    echo "SCENARIO: ${scens[sc]} --- SET: ${sets[se]}"
    sbatch -t 00:05:00 -J myjob --workdir=${wd} -e myjob.err -o myjob.out R --file=myjob.R --args "${scens[sc]}" "${sets[se]}"
    # My solution is to include the following line & run this bash script via sbatch
    sleep 1h
  done
done
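For the nested-sbatch variant, mybash.sh itself needs a batch header, and its time limit has to cover all of the sleep intervals. A minimal sketch of that header (the limits and file names here are assumptions to adapt):
#!/bin/bash
#SBATCH -J submit-loop
#SBATCH -t 05:00:00          # must exceed the total sleep time of the loop
#SBATCH -n 1
#SBATCH -o submit-loop.out
#SBATCH -e submit-loop.err

# ... the scens/sets loop from above ...
Note that this wrapper job occupies a core for the whole submission period while it mostly sleeps.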

r + hpc + git question: submitting multiple jobs with different values for a parameter list [duplicate]

I am running R on a multiple node Linux cluster. I would like to run my analysis on R using scripts or batch mode without using parallel computing software such as MPI or snow.
I know this can be done by dividing the input data such that each node runs different parts of the data.
My question is how do I go about this exactly? I am not sure how I should code my scripts. An example would be very helpful!
I have been running my scripts so far using PBS, but they only seem to run on one node, since R is a single-threaded program. Hence, I need to figure out how to adjust my code so that it distributes the work across all of the nodes.
Here is what I have been doing so far:
1) command line:
qsub myjobs.pbs
2) myjobs.pbs:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64

pbsdsh -v $PBS_O_WORKDIR/myscript.sh
3) myscript.sh:
#!/bin/sh
cd $PBS_O_WORKDIR
R CMD BATCH --no-save my_script.R
4) my_script.R:
library(survival)
...
write.table(test, "TESTER.csv", sep=",", row.names=F, quote=F)
Any suggestions will be appreciated! Thank you!
-CC
This is rather a PBS question; I usually make an R script (with the Rscript path after #!) and have it read a parameter (using the commandArgs function) that controls which "part of the job" the current instance should do. Because I use multicore a lot, I usually need only 3-4 nodes, so I just submit a few jobs calling this R script with each of the possible control argument values.
On the other hand, your use of pbsdsh should do its job... The value of PBS_TASKNUM can then be used as the control parameter.
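A rough sketch of how that could look, with the file names and chunking logic purely illustrative (pbsdsh runs myscript.sh once per task, the task number is forwarded to R, and the R script uses it to pick its slice of the data; note that task numbering may start at 0 or 1 depending on the PBS flavour):
myscript.sh:
#!/bin/sh
# run once per task by pbsdsh
cd $PBS_O_WORKDIR
Rscript my_script.R $PBS_TASKNUM
my_script.R:
task <- as.integer(commandArgs(trailingOnly = TRUE)[1])  # control parameter from PBS_TASKNUM
ntasks <- 12                                             # nodes=6:ppn=2 from the PBS request
dat <- read.csv("input.csv")                             # hypothetical input file
mine <- dat[seq(task, nrow(dat), by = ntasks), ]         # this task's slice of the rows
# ... run the survival analysis on 'mine' ...
write.csv(mine, sprintf("TESTER_%02d.csv", task), row.names = FALSE)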
This was an answer to a related question, but it answers the comment above as well.
For most of our work we do run multiple R sessions in parallel using qsub (instead).
If it is for multiple files I normally do:
while read infile rest
do
qsub -v infile=$infile call_r.pbs
done < list_of_infiles.txt
call_r.pbs:
...
R --vanilla -f analyse_file.R $infile
...
analyse_file.R:
args <- commandArgs()
infile <- args[5]
outfile <- paste(infile, ".out", sep="")
...
Then I combine all the output afterwards...
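For that combining step, a small R sketch, assuming every .out file is a delimited table with the same header:
files <- list.files(pattern = "\\.out$")
combined <- do.call(rbind, lapply(files, read.table, header = TRUE))
write.table(combined, "all_results.txt", sep = "\t", row.names = FALSE, quote = FALSE)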
This problem seems very well suited to GNU parallel, which has an excellent tutorial. I'm not familiar with pbsdsh, and I'm new to HPC, but to me it looks like pbsdsh serves a similar purpose to GNU parallel. I'm also not familiar with launching R from the command line with arguments, but here is my guess at how your PBS file would look:
#!/bin/sh
#PBS -l nodes=6:ppn=2
#PBS -l walltime=00:05:00
#PBS -l arch=x86_64
...
parallel -j2 --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
Rscript myscript.R {} :::: infilelist.txt
where infilelist.txt lists the data files you want to process, e.g.:
inputdata01.dat
inputdata02.dat
...
inputdata12.dat
Your myscript.R would access the command line argument to load and process the specified input file.
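A minimal sketch of such a myscript.R, with the input format and output naming assumed for illustration:
args <- commandArgs(trailingOnly = TRUE)
infile <- args[1]                          # e.g. inputdata01.dat, substituted for {} by parallel
dat <- read.table(infile, header = TRUE)
# ... run the analysis on dat ...
write.table(dat, paste0(infile, ".out"), sep = ",", row.names = FALSE, quote = FALSE)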
My main purpose with this answer is to point out the availability of GNU parallel, which came about after the original question was posted. Hopefully someone else can provide a more tangible example. Also, I am still wobbly with my usage of parallel; for example, I'm unsure of the -j2 option. (See my related question.)

Rscript not working in qsub cluster

I have two R scripts named iHS.hist.R and Fst.hist.R. I know both scripts work. When I use the following commands in my directory in my Ubuntu terminal, I get a histogram plot for each script (two in total if I run both):
module load R
Rscript iHS.hist.R
or I could do Rscript Fst.hist.R
The point is I know they both work.
The problem is that each R script takes about 20 minutes to run because my data is pretty big, and unfortunately it's only going to get bigger. I have access to a cluster and I would like to make use of that. I have created two .sh scripts to send to the cluster with qsub, but I am running into issues. Here is my iHS.hist.sh script for my iHS.hist.R script.
#PBS -N iHS.plots
#PBS -S /bin/bash
#PBS -l walltime=2:00:00
#PBS -l nodes=1:ppn=8
#PBS -l mem=4gb
#PBS -o $HOME/${PBS_JOBNAME}.o${PBS_JOBID}.log
#PBS -e $HOME/${PBS_JOBNAME}.e${PBS_JOBID}.err
###############related commands
###edit it
#code in qsub
###############cut columns we don't need
###
cut -f1,2,3,4 /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/matched_snps_annotated.txt > /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/cut.matched_snps_annotated.txt
cut -f1,2 /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/input_snps_insufficient_matches.txt > /group/stranger-lab/ebeiter/test/SNPsnap_mdd_5_100/cut.input_snps_insufficient_matches.txt
###
###############only needed columns remain
cd /group/stranger-lab/ebeiter
module load R
Rscript iHS.hist.R
The cuts in the beginning are for setting up the data in the right format.
I have tried qsub iHS.hist.sh and it gives me a job. I check on it, and after about 10 minutes it finishes, so I'm assuming it's running my R script. I check the error file and it's empty. I check the log file and it does not give me the usual null device 1 that I get after my jpeg is completed in my R script. I don't get the output jpeg file when the cluster job is done, yet I do get it if I just run the Rscript command on its own, as at the top of this post. Any idea what is going on?
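One thing worth checking, purely as a debugging suggestion: under PBS a job starts in your home directory rather than the submission directory, and this script then cd's to /group/stranger-lab/ebeiter, so the jpeg may simply be written somewhere other than where you are looking for it. Printing the working directory and the newest files at the end of the job script makes that visible:
echo "working directory at end of job: $(pwd)"
ls -lt | head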

for loop of subjobs using qsub R

I am trying to run subjobs (one for each chromosome) using R --vanilla. Since each chromosome is independent, I want them to run in parallel on the system. I have written the following script:
#!/bin/bash
for K in {20..21}; do
  qsub -V -cwd -b y -q short.q R --vanilla --args arg1$K arg2$K arg3$K < RareMETALS.R > loggroup$K.txt
done
But somehow R opens interactively instead of running the script in batch mode as intended... whereas the command on its own,
R --vanilla --args arg1 arg2 arg3 < RareMETALS.R > loggroup.txt
runs perfectly and executes the script.
Can someboby guide me, or point out which might be the problem.
My take on this would be to use echo instead of the --args option to pass parameters to the script. I find separating the script and the Grid Engine code more straightforward:
for K in {20..21}; do
  echo "Rscript RareMETALS.R arg1$K arg2$K arg3$K > loggroup$K.txt" | qsub -V -cwd -q short.q
done
As others have commented, use Rscript. The code seems cleaner to me, but there may be some limitations to using echo as opposed to --args that I am unaware of.
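One reason the original one-liner misbehaves is that the shell applies the < and > redirections to qsub itself rather than to R inside the job, so the job's R has no script on its stdin. With the echo/Rscript form above, the whole command, including the redirection, runs inside the job, and RareMETALS.R picks up the values with commandArgs; a sketch, with placeholder names:
args <- commandArgs(trailingOnly = TRUE)   # c("arg120", "arg220", "arg320") for K=20
arg1 <- args[1]; arg2 <- args[2]; arg3 <- args[3]
# ... per-chromosome analysis using these values ...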

SharpSsh - script runs twice in csh and ksh

I'm running a script from ASP.NET/C# using SharpSsh. When the script runs and I do a ps -ef | grep on the Unix side, I see the same script running twice: once under csh -c and once under ksh. The script has a ksh shebang, so I'm not sure why a copy of csh is also running. Also, if I run the same script directly on Unix, only one copy runs, under ksh. There is no other shell invoked from within the script.
Most Unix/Linux systems now have a command or option that shows process trees as an indented list; look for the -t or -T options to ps, or ptree, or similar:
USER PID PPID START TT TIME CMD
daemon 1 1 11-03-06 ? 0 init
myusr 221568 1 11-03-07 tty10 1.00s \_ -ksh
myusr 350976 221568 07:52:11 tty10 0 | \_ ps -efT
I bet you'll see that the csh process is the user's login shell invoked with your script as an argument (you may have to use different options to ps to see the full command line of the csh process), and that a ksh subprocess under it is actually executing your script, with further subprocesses under ksh for any external commands the script calls.
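For example, on Linux with procps and psmisc installed, either of these shows the tree with full command lines (option availability varies between Unix flavours):
ps -ef --forest
pstree -a -p $USER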
I hope this helps.
P.S. As you appear to be a new user: if you get an answer that helps you, please remember to mark it as accepted, or vote it up (or down) as a useful answer.
