Parallelize a bash script and wait for each loop to finish

I'm trying to write a script, call it pippo.R, whose aim is to run another script (for.sh) in a for loop, parallelized using two values:
nPerm = total number of times the script has to be run
permAtTime = number of scripts that can run at the same time.
It is very important to wait for each loop iteration to finish, which is why I write all the PIDs to a file and then use wait on each of them. The main problem with this script is the following error:
./wait.sh: line 2: wait: pid 836844 is not a child of this shell
For reproducibility's sake, put the following files in a folder:
pippo.R
nPerm=10
permAtTime=2
cycles=nPerm/permAtTime
for(i in 1:cycles){
  d=1
  system(paste("./for.sh ", i, " ", permAtTime, sep=""))
}
for.sh
#!/bin/bash
for X in $(seq $1)
do
  nohup ./script.sh $(( $X + ($2 - 1) * $1 )) &
  echo $! >> ./save_pid.txt
done
./wait.sh
wait.sh
#!/bin/bash
while read p; do wait $p; done < ./save_pid.txt
Running Rscript pippo.R you will get the error shown above. I know the parallel package could help me with this, but for several reasons I cannot use it.
Thanks

You don't need to keep track of PIDs: if you call wait without any arguments, the script waits for all of its child processes to finish. That is also the cause of the error you see: wait.sh runs in a separate shell, and the processes started by for.sh are not children of that shell, so it cannot wait for them.
#!/bin/bash
for X in $(seq $1)
do
  nohup ./script.sh $(( $X + ($2 - 1) * $1 )) &
done
wait
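If you want both the looping and the throttling handled in bash alone, here is a minimal sketch of the batching idea (my own arrangement; it passes the run index straight to script.sh rather than computing it from two arguments as for.sh does):
#!/bin/bash
# run script.sh nPerm times, at most permAtTime at a time
nPerm=10
permAtTime=2
i=1
while [ $i -le $nPerm ]
do
  # launch one batch of background jobs
  for j in $(seq $permAtTime)
  do
    if [ $i -le $nPerm ]; then
      ./script.sh $i &
      i=$((i + 1))
    fi
  done
  wait   # block until the whole batch has finished
done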

Related

loop through different arguments in Rscript within Korn shell

I have an R script which I run in the terminal by first generating a .ksh file called mycode.ksh with the following contents:
#!/bin/ksh
Rscript myscript.R 'Input1'
and then run the function with
./mycode.ksh
which sends the script to a node on the cluster in our department (the jobs we send to the cluster must be .ksh files).
'Input1' is an input argument that the R script uses to do some analysis.
The issue that I now have is that I need to run this script a number of times with different input arguments to the function. One solution is to generate a few .ksh files, such as:
#!/bin/ksh
Rscript myscript.R 'Input2'
and
#!/bin/ksh
Rscript myscript.R 'Input3'
and then execute them separately, but I was hoping to find a better solution.
Note that I have to do this for 100 different input arguments, so it is not realistic to write 100 of these files. Is there a way of generating another file with the information that needs to be supplied to the function, e.g. 'Input1' 'Input2' 'Input3', and then running mycode.ksh for each of these individually?
For example, I could have a variable defining the names of the input arguments and then a loop which would pass them to mycode.ksh. Is that possible?
The reason for running these in this manner is that each iteration will hopefully be sent to a different node on the cluster, thus analysing the data at a much faster rate.
You need to do two things:
Create an array of all your input variables
Loop through the array and initiate all your calls
The following illustrates the concept:
#!/bin/ksh
# Create array of inputs - space separated
inputs=(Input1 Input2 Input3 Input4)
# Loop through all the array items {0 ... n-1}
for i in {0..3}
do
  echo ${inputs[i]}
done
This will output all the values in the inputs array.
You just need to replace the contents of the do-loop with:
Rscript myscript.R ${inputs[i]}
Also, you may need to add an `&` at the end of the Rscript command line to run each Rscript command as a separate background process -- otherwise, the shell waits for each Rscript command to return before going on to the next.
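Putting the pieces together, here is a sketch of the whole wrapper (the trailing wait is my addition, so the script does not exit while Rscript jobs are still running):
#!/bin/ksh
inputs=(Input1 Input2 Input3 Input4)
for i in {0..3}
do
  # launch each analysis in the background
  Rscript myscript.R ${inputs[i]} &
done
wait   # wait for all background Rscript jobs to finish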
EDIT:
Based on your comments, you need to actually generate .ksh scripts to submit to qsub. For this you just need to expand the do loop.
For example:
#!/bin/ksh
# Create array of inputs - space separated
inputs=(Input1 Input2 Input3 Input4)
# Loop through all the array items {0 ... n-1}
for i in {0..3}
do
  cat > submission.ksh << EOF
#!/bin/ksh
Rscript myscript.R ${inputs[i]}
EOF
  chmod u+x submission.ksh
  qsub submission.ksh
done
The text between << EOF and the closing EOF marker is taken as input (STDIN) by cat, and the output (STDOUT) is written to submission.ksh.
Then submission.ksh is made executable with the chmod command.
And then the script is submitted via qsub. I'll let you fill in any other arguments you need for qsub.
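Note that this reuses the same submission.ksh on every iteration; qsub normally copies the script at submission time, but if you would rather keep one generated script per input, a small variation (the submission_${i}.ksh naming is my own) would be:
#!/bin/ksh
inputs=(Input1 Input2 Input3 Input4)
for i in {0..3}
do
  cat > submission_${i}.ksh << EOF
#!/bin/ksh
Rscript myscript.R ${inputs[i]}
EOF
  chmod u+x submission_${i}.ksh
  qsub submission_${i}.ksh
done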
When your script doesn't know all the parameters when it starts, you can make a .ksh file called mycode.ksh with the following contents:
#!/bin/ksh
if [ $# -ne 1 ]; then
echo "Usage: $0 input"
exit 1
fi
# Or start at the background with nohup .... &, other question
Rscript myscript.R "$1"
and then run the function with
./mycode.ksh inputX
When your application knows all arguments, you can use a loop:
#!/bin/ksh
if [ $# -eq 0 ]; then
echo "Usage: $0 input(s)"
exit 1
fi
for input in $*; do
Rscript myscript.R "${input}"
done
and then run the function with
./mycode.ksh input1 input2 "input with space in double quotes" input4

How To Run Different, Multiple Rscripts on SGE Cluster

I am trying to run different Rscripts on an SGE cluster, where each Rscript differs by only one variable (e.g. cancer <- "UVM" or "ACC", etc.).
I have attempted two ways: either run a single Rscript that takes the 30 different cancer names as command line arguments,
OR
run a separate Rscript for each cancer (i.e. UVM.r, ACC.r, etc.).
Either way, I am having a lot of difficulty figuring out how to submit these jobs so that I can either run one Rscript 30 times with a different argument each time, or run multiple Rscripts with no command line arguments.
You can use a while loop in bash for this.
Set up an input file of arguments, e.g. args.txt:
UVM
ACC
Run qsub in a while loop to submit the script for each argument:
while read arg
do
  echo "Rscript script.R ${arg}" | qsub <options>
done < args.txt
The above uses echo to pipe the command to run into qsub on standard input.
Alternatively, you can use an SGE array job with a job script like this:
#!/bin/bash
#$ -t 1-30
# each task shifts away SGE_TASK_ID arguments, leaving its own argument in $1
shift ${SGE_TASK_ID}
exec Rscript script.R $1
Submit it like this: qsub job_script dummy UVM ACC ...
The dummy placeholder is needed because task IDs start at 1, so task 1 shifts one argument away before it reads $1.
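If you would rather keep the arguments in a file such as the args.txt above, a common variation is to let each task pick its own line (a sketch; sed -n "${SGE_TASK_ID}p" prints just the line whose number matches the task ID):
#!/bin/bash
#$ -t 1-30
# pick the argument on line number SGE_TASK_ID of args.txt
arg=$(sed -n "${SGE_TASK_ID}p" args.txt)
exec Rscript script.R "$arg"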

Sequentially Run Programs in Unix

I have several programs that need to be run in a certain order (p1 then p2 then p3 then p4).
Normally I would just make a simple script or type p1 && p2 && p3 && p4.
However, these programs do not exit correctly. I only know a run has finished when "Success" is printed. Currently, I send SIGINT once I see "Success" or "Fail" and then manually run the next program if it was "Success".
Is there a simpler way to sequentially execute p1, p2, p3, p4 with less human intervention?
Edit: Currently using ksh, but I wouldn't mind knowing the other ones too.
In bash, you can pipe the command to grep looking for 'Success', then rely on grep's exit code. The trick is wrapping the whole expression in curly braces so the pipeline can be used inline as a single command (the braces group it in the current shell rather than spawning a subshell). Like so:
$ cat foo.sh
#!/bin/bash
[ 0 -eq $(( $RANDOM %2 )) ] && echo 'Success' || echo 'Failure'
exit 0
$ { ./foo.sh | grep -q 'Success'; } && ls || df
The part inside the curly braces ({}) returns 0 if "Success" is in the output, otherwise 1, as if the foo.sh command had exited with that status itself.
I've not used ksh in a long while, but I suspect there is a similar construction.
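For what it's worth, the same construction does work in ksh; braces group commands there too, so a quick sketch would be:
$ { ./foo.sh | grep -q 'Success'; } && echo "run p2" || echo "stop"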
I'm also new to Linux programming, but I found something that might be helpful for you. Have you tried using the wait command?
From this answer on Stack Exchange:
sleep 1 &
PID1=$!
sleep 2 &
PID2=$!
wait $PID1
echo PID1 has ended.
wait
echo All background processes have exited.
I haven't tested it myself, but it looks like what you described in your question.
All the answers so far would work fine if your programs actually terminated.
Here are a couple of ideas you can use; look through the documentation for more details.
1st option: modify your programs so that they terminate after printing the result message, returning a success code.
2nd option: if that is not possible, use forks.
Write a main program in which you fork each time you want to execute a program.
In the child process, use dup2 to redirect the process' output to a file of your choice.
In the main process, keep checking the contents of said file until you get something, and compare it with either success or failure.
Empty the file.
Then you can fork again and execute the next program.
Repeat the same operation again.
Bear in mind that exec replaces the code of the process it is called in with the code of the file passed as a parameter, so make the dup2 call first.
When your program prints Success or Fail and then keeps running, you should kill it as soon as the string appears.
Make a function like
function startp {
  prog=$1
  ./${prog} | while read -r line; do
    case "${line}" in
    "Success")
      echo OK
      mykill $prog
      return 0
      ;;
    "Fail")
      echo NOK
      mykill $prog
      return 1
      ;;
    *) echo "${line}"
      ;;
    esac
  done
  return 2
}
You need to add a mykill function that looks up the pX program and kills it (xargs is nice for this).
Call the function like
startp p1 && startp p2 && startp p3
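A minimal mykill sketch along those lines (assuming the program name is unique enough to match; pkill -x does the lookup-and-kill in one step, instead of the ps | xargs route):
function mykill {
  # kill the named program; -x matches the process name exactly
  pkill -x "$1"
}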

While using a script in Unix, how can I get the bell to chime more than once

I am writing a shell script and need the bell to chime several times. Is there a command variation or argument to make this happen?
I have used \a and \007 and I get one chime. I can't seem to find out how to make it happen more than once.
Run your beep command once, wait a second with sleep, and run it again. For instance:
echo -n $'\a' ; sleep 1; echo -n $'\a'
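To chime an arbitrary number of times, the same idea works in a loop (three times here, just as an example):
for i in 1 2 3
do
  printf '\a'
  sleep 1
done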

R: Using wait=FALSE in system() with multiline commands

I have a long-running process (written in Java) that I wish to run asynchronously with system(..., wait=FALSE). In order to determine when the process has ended, I want to create a file afterwards, as per the suggestions given in How to determine when a process started with system(..., wait=FALSE) has ended. The problem is that the wait parameter seems to apply only to the last line of a multiline system command, and I can't find a way around that.
Example:
system('sleep 2') # waits 2 seconds before control is returned to the user
system('sleep 2', wait=FALSE) # control is returned immediately
system('sleep 2; ls', wait=FALSE) # waits 2 seconds before control is returned to the user
I'm running on a Mac, by the way...
I find it strange that R's system only waits for the first command (it should be calling the shell, which then waits for both commands), but using && should do it:
system('sleep 2 && ls', wait=FALSE)
If R is appending a & to the command line, it becomes sleep 2; ls & and there the & affects only the second command.
Another solution would be to put parentheses around the commands: ( sleep 2 ; ls ) & runs both commands sequentially in a subshell and puts the whole group in the background:
system('( sleep 2 ; ls )', wait=FALSE)
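Tying this back to the marker-file idea from the question: with the grouped form, the touch runs only after the job has ended (long_job and /tmp/job.done are hypothetical placeholders; in R the same string would be passed as system('( long_job ; touch /tmp/job.done )', wait=FALSE)):
# run the long job, then create a marker file once it has finished;
# the caller can poll for /tmp/job.done to detect completion
( long_job ; touch /tmp/job.done ) &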
