How to do parallel processing in Unix Shell script? - unix

I have a shell script that transfers a build.xml file to a remote unix machine (devrsp02) and executes the ANT task wldeploy on that machine (devrsp02). Now, this wldeploy task takes around 15 minutes to complete and while this is running, the last line at the unix console is -
"task {some digit} initialized".
Once this task is complete, we get a "task Completed" msg and the next task in the script is executed only after that.
But sometimes, there might be a problem with the weblogic domain and the deployment might be failing internally, with no effect on the status of the wldeploy task. The unix console will still be stuck at "task {some digit} initialized". The error of the deployment will be getting logged in a file called output.a
So, what I want now is -
Start a time counter before running wldeploy. If the wldeploy runs for more than 15 minutes, the following command should be run -
tail -f output.a ## without terminating the wldeploy
or
cat output.a ## after terminating the wldeploy forcefully
Point to be noted here is - I can't run the wldeploy task in background, as in that case the user won't get to know when the task is complete, which is crucial for this script.
Could you please suggest anything to achieve this?

Create this script (deploy.sh for example):
#!/bin/sh
sleep 900 && pkill -n wldeploy && cat output.a &
wldeploy
Then from the console
chmod +x deploy.sh
Then run
./deploy.sh
This script starts a 15-minute timer in the background and then runs wldeploy in the foreground. When the timer expires, pkill -n kills the most recently started wldeploy process if one is still running, and you then see the contents of output.a.
If wldeploy has already finished by then, pkill finds nothing to kill and returns a non-zero status, so output.a is not shown.
I would call this task monitoring rather than "parallel processing" :)

This will only kill the wldeploy process it started, tell you whether wldeploy returned success or failure, and run no more than 30 seconds after wldeploy finishes.
It should be sh-compatible, but the /bin/sh I've got access to now seems to have a broken wait command.
#!/bin/ksh
wldeploy &
while [ ${slept:-0} -le 900 ]; do
sleep 30 && slept=`expr ${slept:-0} + 30`
if [ $$ = "`ps -o ppid= -p $!`" ]; then
echo wldeploy still running
else
wait $! && echo "wldeploy succeeded" || echo "wldeploy failed"
break
fi
done
if [ $$ = "`ps -o ppid= -p $!`" ]; then
echo "wldeploy did not finish in $slept seconds, killing it"
kill $!
cat output.a
fi

The variant that does not terminate wldeploy is easy: just run this before starting wldeploy:
{ sleep 900; tail -f output.a; } &
The variant that kills it is more complex, as you have to determine the PID of the wldeploy process. pra's answer does exactly that, so I would just refer to it.
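The non-terminating variant can be sketched as a small wrapper that also stops the log follower once the command finishes, so the user still sees when the task completes. This is only a sketch under assumptions: run_with_log_watch is a hypothetical helper name, and the 900-second timeout and output.a log file are taken from the question.

```sh
# run_with_log_watch SECONDS LOGFILE COMMAND [ARG...]
# Runs COMMAND in the foreground. If it is still running after SECONDS,
# a background watcher starts following LOGFILE. The watcher is stopped
# as soon as COMMAND finishes, and COMMAND's exit status is returned.
run_with_log_watch() {
    timeout_s=$1
    logfile=$2
    shift 2
    (
        # Sleep in the background so a TERM signal can interrupt the wait.
        sleep "$timeout_s" >/dev/null 2>&1 &
        sleeper=$!
        trap 'kill "$sleeper" 2>/dev/null; exit 0' TERM
        # Only when the full timeout has elapsed, start following the log.
        wait "$sleeper" && exec tail -f "$logfile"
    ) &
    watcher=$!
    "$@"                          # the long-running foreground command
    status=$?
    kill "$watcher" 2>/dev/null   # stop the timer or the tail
    wait "$watcher" 2>/dev/null
    return "$status"
}

# Example (names from the question):
# run_with_log_watch 900 output.a ant wldeploy
```

Because the command itself stays in the foreground, the script continues with the next task exactly when it completes.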

Related

Simulate wait's command -n flag in zsh

I am creating two shell jobs as follows
sleep 5 &
completion_pid=$!
sleep 40 && exit 1 &
failure_pid=$!
In bash I am able to get the exit code of the first job to finish by using the -n flag of wait's command
# capture exit code of the first subprocess to exit
wait -n $completion_pid $failure_pid
It seems, however, that this flag is not available in my macOS Big Sur version of wait (probably because I am using zsh?)
▶ wait -n
wait: job not found: -n
Are there any alternative tools to do this that are also available on MacOS?
What perhaps is weird is that I am getting the same error when invoking a script containing wait -n as bash myscript.sh...
Since you are waiting on specific PIDs, you can simply do the following (note that this waits for both jobs to finish, not just the first one to exit):
wait $completion_pid $failure_pid
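If only the first job to exit matters, one possible workaround is to poll the PIDs and then collect the status of whichever one has gone. The following is a rough sketch, not a drop-in replacement for wait -n: wait_first is a hypothetical helper name, the one-second polling interval is arbitrary, and it assumes the shell still remembers the exit status of a background child that has already exited (bash does; recent zsh versions are assumed to).

```sh
# wait_first PID...
# Blocks until any one of the listed child PIDs has exited,
# then returns that child's exit status.
wait_first() {
    while :; do
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests whether the PID exists.
            if ! kill -0 "$pid" 2>/dev/null; then
                wait "$pid"
                return $?
            fi
        done
        sleep 1
    done
}

# Example with the jobs from the question:
# wait_first "$completion_pid" "$failure_pid"; echo "first exit status: $?"
```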

How to get supervisord to restart hung workers?

I have a number of Python workers managed by supervisord that should continuously print to stdout (after each completed task) if they are working properly. However, they tend to hang, and we've had difficulty finding the bug. Ideally supervisord would notice that they haven't printed in X minutes and restart them; the tasks are idempotent, so non-graceful restarts are fine. Is there any supervisord feature or addon that can do this? Or another supervisor-like program that has this out of the box?
We are already using http://superlance.readthedocs.io/en/latest/memmon.html to kill if memory usage skyrockets, which mitigates some of the hangs, but a hang that doesn't cause a memory leak can still cause the workers to reach a standstill.
One possible solution would be to wrap your python script in a bash script that'd monitor it and exit if there isn't output to stdout for a period of time.
For example:
kill-if-hung.sh
#!/usr/bin/env bash
set -e
TIMEOUT=60
LAST_CHANGED="$(date +%s)"
{
set -e
while true; do
sleep 1
kill -USR1 $$
done
} &
trap check_output USR1
check_output() {
CURRENT="$(date +%s)"
if [[ $((CURRENT - LAST_CHANGED)) -ge $TIMEOUT ]]; then
echo "Process STDOUT hasn't printed in $TIMEOUT seconds"
echo "Considering process hung and exiting"
exit 1
fi
}
STDOUT_PIPE=$(mktemp -u)
mkfifo $STDOUT_PIPE
trap cleanup EXIT
cleanup() {
kill -- -$$ # Send TERM to child processes
[[ -p $STDOUT_PIPE ]] && rm -f $STDOUT_PIPE
}
"$@" >$STDOUT_PIPE || exit 2 &
while true; do
if read tmp; then
echo "$tmp"
LAST_CHANGED="$(date +%s)"
fi
done <$STDOUT_PIPE
Then you would run a python script in supervisord like: kill-if-hung.sh python -u some-script.py (-u to disable output buffering, or set PYTHONUNBUFFERED).
I'm sure you could imagine a python script that'd do something similar.

Kill all R processes that hang for longer than a minute

I use a cron task to regularly run Rscript. Unfortunately, I need to do this on a small AWS instance, and the process may hang, with more and more processes piling up on top of each other until the whole system is lagging.
I would like to write a cron task to kill all R processes lasting longer than one minute. I found another answer on Stack Overflow that I've adapted and that I think would solve the problem. I came up with:
if [[ "$(uname)" = "Linux" ]];then killall --older-than 1m "/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";fi
I copied the task directly from htop, but it does not work as I expect. I get the No such file or directory error but I've checked it a few times.
I need to kill all R processes that have lasted longer than a minute. How can I do this?
You may want to avoid killing processes from another user and try SIGKILL (kill -9) after SIGTERM (kill -15). Here is a script you could execute every minute with a CRON job:
#!/bin/bash
PROCESS="R"
MAXTIME=`date -d '00:01:00' +'%s'`
function killpids()
{
PIDS=`pgrep -u "${USER}" -x "${PROCESS}"`
# Loop over all matching PIDs
for pid in ${PIDS}; do
# Retrieve duration of the process
TIME=`ps -o time:1= -p "${pid}" |
egrep -o "[0-9]{0,2}:?[0-9]{0,2}:[0-9]{2}$"`
# Convert TIME to timestamp
TTIME=`date -d "${TIME}" +'%s'`
# Check if the process should be killed
if [ "${TTIME}" -gt "${MAXTIME}" ]; then
kill ${1} "${pid}"
fi
done
}
# Leave a chance to kill processes properly (SIGTERM)
killpids "-15"
sleep 5
# Now kill remaining processes (SIGKILL)
killpids "-9"
Why involve an additional process every minute with cron?
It would be easier to start R with timeout from coreutils; the process is then killed automatically after the duration you choose.
timeout [option] duration command [arg]…
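As a hedged illustration of the timeout approach, assuming GNU coreutils timeout (the -k option and the exit status 124 on timeout are GNU behaviour): run_limited is a hypothetical helper name, and the script path in the usage comment is the one from the question.

```sh
# run_limited SECONDS COMMAND [ARG...]
# Runs COMMAND under a time limit and reports when the limit was hit.
run_limited() {
    limit=$1
    shift
    # Send SIGTERM after $limit seconds, then SIGKILL 5 seconds later
    # if the command is still alive.
    timeout -k 5 "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "command timed out after ${limit}s" >&2
    fi
    return "$rc"
}

# Example for the cron entry (path from the question):
# run_limited 60 Rscript /home/ubuntu/script.R
```

With this in place no separate cleanup job is needed, since each run is bounded at the point where it is started.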
I think the best option is to do this with R itself. I am no expert, but it seems the future package will allow executing a function in a separate thread. You could run the actual task in a separate thread, and in the main thread sleep for 60 seconds and then stop().
Previous Update
user1747036's answer which recommends timeout is a better alternative.
My original answer
This question is more appropriate for superuser, but here are a few things wrong with
if [[ "$(uname)" = "Linux" ]];then
killall --older-than 1m \
"/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";
fi
The name argument is either the name of the executable image or the path to it. You have included its command-line parameters as well.
If -s signal is not specified killall sends SIGTERM which your process may ignore. Are you able to kill a long running script with this on the command line? You may need SIGKILL / -9
More at http://linux.die.net/man/1/killall

How do you stop the current foreground process and re-execute it?

I often have to relaunch a server to see if my changes are fine. I keep this server opened in a shell, so I have a quick access to current logs. So here is what I type in my shell: ^C!!⏎. That is send SIGINT, and then relaunch last event in history.
So what I would like is to type, say ^R, and have the same result.
(Note: I use zsh)
I tried the following:
relaunch-function() {
kill -INT %% && !!
}
zle -N relaunch-widget relaunch-function
bindkey "^R" relaunch-widget
But it seems that while my server is running, ^R is passed not to the shell but to the server, which knows nothing about the shell binding. So I can't see a generic solution, although testing the return value and process name should be feasible.
As long as the job is running in the foreground, keys will not be passed to the shell. So setting a key binding for killing a foreground process and starting it again won't work.
But you could start your server in an endless loop, so that it restarts automatically. Assuming the name of the command is run_server, you can start it like this in the shell:
(TRAPINT(){};while sleep .5; do run_server; done)
The surrounding parentheses start a sub-shell, TRAPINT(){} disables SIGINT for this shell. The while loop will keep restarting run_server until sleep exits with an exit status that is not zero. That can be achieved by interrupting sleep with ^C. (Without setting TRAPINT, interrupting run_server could also interrupt the loop)
So if you want to restart your server, just press ^C and wait for 0.5 seconds. If you want to stop your server without restarting, press ^C twice in 0.5 seconds.
To save some typing you can create a function for that:
doloop() {(
TRAPINT(){}
while sleep .5
do
echo running \"$@\"
eval $@
done
)}
Then call it with doloop run_server. Note: You still need the additional surrounding () as functions do not open a sub-shell by themselves.
eval allows shell constructs to be used, for example doloop LANG=C locale. In some cases you may need to use (single) quotes so that the argument is expanded on each iteration rather than once when doloop is called:
$ doloop echo $RANDOM
running "echo 242"
242
running "echo 242"
242
running "echo 242"
242
^C
$ doloop 'echo $RANDOM'
running "echo $RANDOM"
10988
running "echo $RANDOM"
27551
running "echo $RANDOM"
8910
^C

How to get the proper exit code from nohup

From the nohup documentation in info coreutils 'nohup invocation' it states:
Exit status:
125 if `nohup' itself fails, and `POSIXLY_CORRECT' is not set
126 if COMMAND is found but cannot be invoked
127 if COMMAND cannot be found
the exit status of COMMAND otherwise
However, the only exit codes I've ever gotten from nohup have been 1 and 0. I have a nohup command that's failing from within a script, and I need to handle the failure appropriately. Based on this documentation I would assume that the nohup exit code should be 126. Instead, it is 0.
The command I'm running is: nohup perl myscript.pl &
Is this because perl is exiting successfully?
If your shell script runs the process with:
nohup perl myscript.pl &
you more or less forego the chance to collect the exit status from nohup. The command as a whole succeeds with 0 if the shell forked and fails with 1 if the shell fails to fork. In bash, you can wait for the background process to die and collect its status via wait:
nohup perl myscript.pl &
oldpid=$!
...do something else or this whole rigmarole is pointless...
wait $oldpid
echo $?
The echoed $? is usually the exit status of the specified PID (unless the specified PID had already died and been waited for).
If you run the process synchronously, you can detect the different exit statuses:
(
nohup perl myscript.pl
echo "PID $! exited with status $?" >&2
) &
And now you should be able to spot the different exit statuses from nohup (eg try different misspellings: nohup pearl myscript.pl, etc).
Note that the sub-shell as a whole is run in the background, but the nohup is run synchronously within the sub-shell.
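The wait-based approach from this answer can be wrapped in a small function to see the documented statuses in practice. This is a sketch only: nohup_status is a hypothetical helper name, and output is sent to /dev/null here so that nohup does not create nohup.out.

```sh
# nohup_status COMMAND [ARG...]
# Starts COMMAND under nohup in the background, then waits for it and
# returns the exit status that nohup reported.
nohup_status() {
    nohup "$@" >/dev/null 2>&1 &
    bg_pid=$!
    # ... other work could happen here while the command runs ...
    wait "$bg_pid"
}

# Example:
# nohup_status perl myscript.pl; echo "exit status: $?"
```

A misspelled command name should then surface as status 127, and a failing script as its own exit status.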
As I understand it, the question was how to get the command's status when it is run under nohup. In my experience there is very little chance of getting the COMMAND exit status, even when it fails right away. Most of the time you just get the exit status of 'nohup COMMAND &', unless you wait or synchronize as Jonathan mentioned. To check the COMMAND status right after nohup, I use:
# [C]OMMAND keeps awk from matching its own command line
pid=`ps -eo pid,cmd | awk '/[C]OMMAND/ {print $1}'`
if [ -z "$pid" ]; then
echo "the COMMAND failed"
else
echo "the COMMAND is running in nohup"
fi
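A variation on this check that avoids searching the process list is to test the PID that the shell itself recorded in $!. This is a minimal sketch under assumptions: launch_and_check is a hypothetical helper name, and the one-second grace period is an arbitrary choice that will miss failures occurring later.

```sh
# launch_and_check COMMAND [ARG...]
# Starts COMMAND under nohup, waits briefly, and reports whether it is
# still alive. Returns 1 if it has already exited.
launch_and_check() {
    nohup "$@" >/dev/null 2>&1 &
    launched_pid=$!
    sleep 1    # give the command a moment to fail
    # kill -0 sends no signal; it only tests whether the PID exists.
    if kill -0 "$launched_pid" 2>/dev/null; then
        echo "PID $launched_pid is running under nohup"
    else
        wait "$launched_pid"
        echo "command exited early with status $?"
        return 1
    fi
}
```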
