Use of sleep in bash within for loop launching sbatch - r

I want to submit an R script myjob.R that takes two arguments for which I have several scenarios (here only a few as an example).
I want to pass these arguments by looping through scens and sets.
In order to avoid overloading the squeue on the cluster, I don't want to submit the whole loop at once.
Instead I want to wait 1h between each individual job submission.
Therefore, I included the sleep 1h command, after each iteration.
I used to launch the bash script via bash mybash.sh, however this command requires to keep the terminal open until all jobs have been submitted.
My solution was then to launch mybash.sh via sbatch mybash.sh. This is somehow nesting two sbatch commands. Seems to work very well.
My question is only if there is any reason against submitting nested sbatch commands.
Thanks!
Here is mybash.sh script:
#!/bin/bash
scens=('AAA' 'BBB')
sets=('set1' 'set2')
wd=/projects/workdir
for sc in "${!scens[#]}";do
for se in "${!sets[#]}" ;do
echo "SCENARIO: ${scens[sc]} --- SET: ${sets[se]}"
sbatch -t 00:05:00 -J myjob --workdir=${wd} -e myjob.err -o myjob.out R --file=myjob.R --args "${scens[sc]}" "${sets[se]}"
# My solution is to include the following line & run this bash script via sbatch
sleep 1h
done
done

Related

Create multiple similar jobs and submit them to cloud computing platform with the same bash script

I have a R script that takes in a string and compares it with other strings. And then I will submit a bash script to use the R script. But there are about 3000 strings that I want to take in. I don't want to manually submit each job. How can I automate the job submission? Basically my question is how can I submit multiple jobs that uses the same bash script?
I want to take in the first line of each file and use that string to do the comparison.
My R script looks similar to this:
sfile <- commandArgs(trailingOnly = TRUE)
print(sfile == another_string)
My bash script looks like this:
#!/bin/bash
#SBATCH -J BV1
#SBATCH --account=def-*****
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G # 4GiB of memery
#SBATCH -t 0-10:00 # Running time of 10 hr
module load r
Rscript --vanilla $HOME/projects/def-*****/h*****/****/mappabilityprofile/mappabilityprofile.R $1 > $HOME/projects/def-*****/h*****/****/Rout/testRunOutput.$1.Rout 2>&1
The codes that I tried to use to submit the automated jobs to the server in command line is this:
for ii in /path/to/files; do
> line=$(head -n 1 $f)
> sbatch mappabilityprofile.sh line
> done
This doesn't really work because it only submits one job when I want it to submit multiple jobs according to each file.
Is there any way that I could achieve what I want it to do?
I found out that I can use
while read first; do read second; sbatch mappabilityprofile.sh "$second"; done
thanks to this post!

Can Snakemake work if a rule's shell command is a cluster job?

In below example, if shell script shell_script.sh sends a job to cluster, is it possible to have snakemake aware of that cluster job's completion? That is, first, file a should be created by shell_script.sh which sends its own job to the cluster, and then once this cluster job is completed, file b should be created.
For simplicity, let's assume that snakemake is run locally meaning that the only cluster job originating is from shell_script.sh and not by snakemake .
localrules: that_job
rule all:
input:
"output_from_shell_script.txt",
"file_after_cluster_job.txt"
rule that_job:
output:
a = "output_from_shell_script.txt",
b = "file_after_cluster_job.txt"
shell:
"""
shell_script.sh {output.a}
touch {output.b}
"""
PS - At the moment, I am using sleep command to give it a waiting time before the job is "completed". But this is an awful workaround as this could give rise to several problems.
Snakemake can manage this for you with the --cluster argument on the command line.
You can supply a template for the jobs to be executed on the cluster.
As an example, here is how I use snakemake on a SGE managed cluster:
template which will encapsulate the jobs which I called sge.sh:
#$ -S /bin/bash
#$ -cwd
#$ -V
{exec_job}
then I use directly on the login node:
snakemake -rp --cluster "qsub -e ./logs/ -o ./logs/" -j 20 --jobscript sge.sh --latency-wait 30
--cluster will tell which queuing system to use
--jobscript is the template in which jobs will be encapsulated
--latency-wait is important if the file system takes a bit of time to write the files. You job might end and return before the output of the rules are actually visible to the filesystem which will cause an error
Note that you can specify rules not to be executed on the nodes in the Snakefile with the keyword localrules:
Otherwise, depending on your queuing system, some options exist to wait for job sent to cluster to finish:
SGE:
Wait for set of qsub jobs to complete
SLURM:
How to hold up a script until a slurm job (start with srun) is completely finished?
LSF:
https://superuser.com/questions/46312/wait-for-one-or-all-lsf-jobs-to-complete

Calling Rscript from linux shell script

Can anyone suggest how I might get this working....
I have an R script that takes several minutes to run and writes a few hundred lines of output. I want to write a shell script wrapper around this R script which will launch the R script in the background, pipe its output to a file and start following the bottom of that file. If the user then enters CTRL-C I want that to kill the shell script and tail command but not the R script. Sounds simple right?
I've produced a simplified example below, but I don't understand why this doesn't work. Whenever I kill the shell script the R script is also killed despite apparently running in the background. I've tried nohup, disown etc with no success.
example.R
for(i in 1:1000){
Sys.sleep(1)
print(i)
}
wrapper.sh
#!/bin/bash
Rscript example.R > logfile &
tail -f logfile
Thanks in advance!
The following seems to work on my Ubuntu machine:
#!/bin/bash
setsid Rscript example.R > logfile.txt &
tail -f logfile.txt
Here are some of the relevant processes before sending SIGINT to wrapper.sh:
5361 pts/10 00:00:00 bash
6994 ? 00:00:02 update-notifier
8519 pts/4 00:00:00 wrapper.sh
8520 ? 00:00:00 R
8521 pts/4 00:00:00 tail
and after Ctrl+C, you can see that R is still running, but wrapper.sh and tail have been killed:
5361 pts/10 00:00:00 bash
6994 ? 00:00:02 update-notifier
8520 ? 00:00:00 R
Although appending your Rscript [...] command with & will send it to the background, it is still part of the same process group, and therefore receives SIGINT as well.
I'm not sure if it was your intention, but since you are calling tail -f, if not interrupted with Ctrl+c, your shell that is running wrapper.sh will continue to hang even after the R process completes. If you want to avoid this, the following should work,
#!/bin/bash
setsid Rscript example.R > logfile.txt &
tail --pid="$!" -f logfile.txt
where "$!" is the process id of the last background process executed (the Rscript [...] call).

Kill all R processes that hang for longer than a minute

I use crontask to regularly run Rscript. Unfortunately, I need to do this on a small instance of aws and the process may hang, building more and more processes on top of each other until the whole system is lagging.
I would like to write a crontask to kill all R processes lasting longer than one minute. I found another answer on Stack Overflow that I've adapted that I think would solve the problem. I came up with;
if [[ "$(uname)" = "Linux" ]];then killall --older-than 1m "/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";fi
I copied the task directly from htop, but it does not work as I expect. I get the No such file or directory error but I've checked it a few times.
I need to kill all R processes that have lasted longer than a minute. How can I do this?
You may want to avoid killing processes from another user and try SIGKILL (kill -9) after SIGTERM (kill -15). Here is a script you could execute every minute with a CRON job:
#!/bin/bash
PROCESS="R"
MAXTIME=`date -d '00:01:00' +'%s'`
function killpids()
{
PIDS=`pgrep -u "${USER}" -x "${PROCESS}"`
# Loop over all matching PIDs
for pid in ${PIDS}; do
# Retrieve duration of the process
TIME=`ps -o time:1= -p "${pid}" |
egrep -o "[0-9]{0,2}:?[0-9]{0,2}:[0-9]{2}$"`
# Convert TIME to timestamp
TTIME=`date -d "${TIME}" +'%s'`
# Check if the process should be killed
if [ "${TTIME}" -gt "${MAXTIME}" ]; then
kill ${1} "${pid}"
fi
done
}
# Leave a chance to kill processes properly (SIGTERM)
killpids "-15"
sleep 5
# Now kill remaining processes (SIGKILL)
killpids "-9"
Why imply an additional process every minute with cron?
Would it not be easier to start R with timeout from coreutils, the processes will then be killed automatically after the time you chose.
timeout [option] duration command [arg]…
I think the best option is to do this with R itself. I am no expert, but it seems the future package will allow executing a function in a separate thread. You could run the actual task in a separate thread, and in the main thread sleep for 60 seconds and then stop().
Previous Update
user1747036's answer which recommends timeout is a better alternative.
My original answer
This question is more appropriate for superuser, but here are a few things wrong with
if [[ "$(uname)" = "Linux" ]];then
killall --older-than 1m \
"/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";
fi
The name argument is either the name of image or path to it. You have included parameters to it as well
If -s signal is not specified killall sends SIGTERM which your process may ignore. Are you able to kill a long running script with this on the command line? You may need SIGKILL / -9
More at http://linux.die.net/man/1/killall

Write to one output file from a few parallel LSF bsub jobs, avoiding writing at the same time

I have developed a code that composed of two files:
An 'envelop bash file', which does a few things and writes to a log-file, and then at some point runs into a for loop in which within it it executes one job at a time using bsub.
And 'an internal bash file', which gets as input the name of the log-file (in addition to other input values that necessary for its execution), and executes process X (using the input values it received from the 'envelop file'.
Once process X is finished, the 'internal script' writes to the log-file that process X (with its specific serial number) has been completed.
Since the for-loop of the envelop file loops 10 times, there are at least 10 parallel processes that being executed and run in parallel, and they all being executed with bsub given the SAME log-file name. The idea is that they would all report to the same log-file once they completed their execution of Process X.
The general procedure works well, and in each case process X is being executed, and the log-file accumulates as required all the notifications regarding the completion of process X. However, in some incidences we see that the writing to the log-file get disturbed and output lines of two parallel runs are running into each other.
I would like to lock the log-file in such manner that would allow it to receive text only from one parallel run at a time. The idea is to avoid cases where the text becomes mixed due to two processes that write by chance to the log-file exactly at the same time.
Here is the part of my envelop file which call to the bsub (I reduced the content to the minimum necessary):
for ((i=1;i<=$batchesnumber; i++));
do
bsub -J $SerialName -q normal "bash FetchFasta.bash $genome_fa ${SerialFileName}".bed" $logfile"
done
Here is the part of my internal file that echo to the log-file:
(
echo "~~~~~~~~~~~~~~~~~~"
echo "^^^^^^^^^^^^^^^^^^"
echo -n "Completed running "; bedtools -version
echo "bedtools getfasta -s -fi $genome_fasta -bed $mySerialFile -fo ${mySerialFile%.*}".fa" "
echo "Run's completion time is: $timedate"
echo -e "~~~~~~~~~~~~~~~~~~\n"
) >> $logfile
I would appreciate any useful solution!
There's a couple of ways I can think of going about this:
Have each job write its output to a different file (use $LSB_JOBID inside each job to name the file). Then use another "cleanup" job to concatenate all of the ouptut into a single file. You can use job dependencies (bsub -w) to make sure the cleanup job runs after all the other jobs are done.
Implement a lock inside your "internal" job to make sure only one of them writes to a file at a time. This is a lot simpler than it might sound, one way to do it is to have each job try to create the same directory with mkdir before writing to the file and then delete the directory after its done. If they fail to create the directory it's because another one of the jobs got to it first and is currently writing to the file.
Here's a snippet illustrating #2 in bash:
# Try to get the lock every second
while ! mkdir lock &> /dev/null ; do
sleep 1
done
# Got the lock, write to the logfile
echo blahblahblah >> $logfile
# Release the lock
rmdir lock
I should mention an important caveat here though: if one of your jobs dies while it's "holding the lock" (say someone sends it a kill signal at the wrong time) then it'll never remove the directory and all the other jobs won't be able to create it, so they'll just keep sleeping forever.

Resources