I have a number of Python workers managed by supervisord that should continuously print to stdout (after each completed task) if they are working properly. However, they tend to hang, and we've had difficulty finding the bug. Ideally supervisord would notice that they haven't printed in X minutes and restart them; the tasks are idempotent, so non-graceful restarts are fine. Is there any supervisord feature or addon that can do this? Or another supervisor-like program that has this out of the box?
We are already using http://superlance.readthedocs.io/en/latest/memmon.html to kill if memory usage skyrockets, which mitigates some of the hangs, but a hang that doesn't cause a memory leak can still cause the workers to reach a standstill.
One possible solution would be to wrap your python script in a bash script that'd monitor it and exit if there isn't output to stdout for a period of time.
For example:
kill-if-hung.sh
#!/usr/bin/env bash
set -e

TIMEOUT=60
LAST_CHANGED="$(date +%s)"

# Background ticker: signal this shell every second so it can check for staleness
{
  set -e
  while true; do
    sleep 1
    kill -USR1 $$
  done
} &

# Runs on every tick; exits if stdout has been quiet for longer than TIMEOUT
trap check_output USR1
check_output() {
  CURRENT="$(date +%s)"
  if [[ $((CURRENT - LAST_CHANGED)) -ge $TIMEOUT ]]; then
    echo "Process STDOUT hasn't printed in $TIMEOUT seconds"
    echo "Considering process hung and exiting"
    exit 1
  fi
}

# FIFO used to relay the wrapped command's stdout through this script
STDOUT_PIPE=$(mktemp -u)
mkfifo "$STDOUT_PIPE"

trap cleanup EXIT
cleanup() {
  kill -- -$$ # Send TERM to child processes
  [[ -p $STDOUT_PIPE ]] && rm -f "$STDOUT_PIPE"
}

# Run the wrapped command (passed as arguments) with stdout redirected into the pipe
"$@" >"$STDOUT_PIPE" || exit 2 &

# Relay the command's output and record when it last printed
while true; do
  if read -r tmp; then
    echo "$tmp"
    LAST_CHANGED="$(date +%s)"
  fi
done <"$STDOUT_PIPE"
Then you would run a python script in supervisord like: kill-if-hung.sh python -u some-script.py (-u to disable output buffering, or set PYTHONUNBUFFERED).
I'm sure you could imagine a python script that'd do something similar.
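For reference, a minimal supervisord program stanza for this setup might look like the following (a sketch; the program name and paths are placeholders, not taken from the question):
[program:worker]
command=/opt/app/kill-if-hung.sh python -u /opt/app/some-script.py
; let supervisord restart the worker whenever the wrapper exits (e.g. after the hang check fires)
autorestart=true
environment=PYTHONUNBUFFERED=1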
Related
I have a long list of remote hosts and I want to run a shell command on all of them. The command takes a very long time, so I want to run the command inside screen on the remote machine, disconnecting immediately from each, and I want the terminal output on the remote to be preserved after the command exits. There is a "tag" that should be supplied to each command as an argument. I tried to do this with parallel, something like this:
$ cat servers.txt
user1@server1.example.com/tag1
user2@server2.example.com/tag2
# ...
$ cat run.sh
grep -v '^#' servers.txt |
parallel ssh -tt '{//}' \
'tag={/}; exec screen slow_command --option1 --option2 $tag other args'
This doesn't work: all of the remote processes are launched, but they are not detached (so the ssh sessions remain live and I don't get my local shell back), and once each command finishes, its screen exits immediately and the output is lost.
How do I fix this script? Note: if this is easier to do with tmux and/or some other marshalling program besides parallel, I'm happy to hear answers that explain how to do it that way.
Something like this:
grep -v '^#' servers.txt |
parallel -q --colsep / ssh {1} "screen -d -m bash -c 'echo do stuff \"{2}\";sleep 1000000'"
The final sleep makes sure the screen does not die. You will have 1000000 seconds to attach to it and kill it.
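To attach later, log in to the host and reattach the detached screen, for example (the login comes from servers.txt above):
ssh user1@server1.example.com
screen -ls   # list detached sessions
screen -r    # attach; add the session name if more than one is listed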
There is an awful lot of quoting there - especially if do stuff is complex.
It may be easier to make a function that computes tag on the remote machine. You need GNU Parallel 20200522 for this:
env_parallel --session
f() {
  sshlogin="$1"
  # TODO given $sshlogin compute $tag (e.g. a table lookup)
  do_stuff() {
    echo "do stuff $tag"
    sleep 1000000
  }
  export -f do_stuff
  screen -d -m bash -c do_stuff "$@"
}
env_parallel --nonall --slf servers_without_tag f '$PARALLEL_SSHLOGIN'
env_parallel --endsession
I use a cron task to regularly run Rscript. Unfortunately, I need to do this on a small AWS instance, and the process may hang, building more and more processes on top of each other until the whole system is lagging.
I would like to write a cron task to kill all R processes that have been running for longer than one minute. I found another answer on Stack Overflow that I've adapted and that I think would solve the problem. I came up with:
if [[ "$(uname)" = "Linux" ]];then killall --older-than 1m "/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";fi
I copied the process name directly from htop, but it does not work as I expect: I get a "No such file or directory" error even though I've checked it a few times.
I need to kill all R processes that have lasted longer than a minute. How can I do this?
You may want to avoid killing processes from another user and try SIGKILL (kill -9) after SIGTERM (kill -15). Here is a script you could execute every minute with a CRON job:
#!/bin/bash
PROCESS="R"
MAXTIME=`date -d '00:01:00' +'%s'`
function killpids()
{
  PIDS=`pgrep -u "${USER}" -x "${PROCESS}"`
  # Loop over all matching PIDs
  for pid in ${PIDS}; do
    # Retrieve duration of the process
    TIME=`ps -o time:1= -p "${pid}" |
      egrep -o "[0-9]{0,2}:?[0-9]{0,2}:[0-9]{2}$"`
    # Convert TIME to timestamp
    TTIME=`date -d "${TIME}" +'%s'`
    # Check if the process should be killed
    if [ "${TTIME}" -gt "${MAXTIME}" ]; then
      kill ${1} "${pid}"
    fi
  done
}
# Leave a chance to kill processes properly (SIGTERM)
killpids "-15"
sleep 5
# Now kill remaining processes (SIGKILL)
killpids "-9"
Why spawn an additional process every minute with cron?
Would it not be easier to start R with timeout from coreutils? The process will then be killed automatically after the duration you choose.
timeout [option] duration command [arg]…
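For example, the cron entry could wrap the call in timeout directly. A sketch, assuming the job is an Rscript call (the script path comes from the question; the every-minute schedule is only an illustration):
# send TERM after 1 minute, and KILL 30 seconds later if the process ignores TERM
* * * * * timeout -k 30 1m Rscript /home/ubuntu/script.R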
I think the best option is to do this with R itself. I am no expert, but it seems the future package allows executing a function asynchronously in a separate R process. You could run the actual task that way, and in the main process sleep for 60 seconds and then stop().
Previous Update
user1747036's answer, which recommends timeout, is a better alternative.
My original answer
This question is more appropriate for superuser, but here are a few things wrong with
if [[ "$(uname)" = "Linux" ]];then
killall --older-than 1m \
"/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";
fi
The name argument is either the name of the image or the path to it; you have included the program's arguments as well.
If -s signal is not specified, killall sends SIGTERM, which your process may ignore. Are you able to kill a long-running script with this on the command line? You may need SIGKILL / -9.
More at http://linux.die.net/man/1/killall
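Putting those points together, something closer to what killall expects would be (a sketch; it assumes the interpreter shows up under the process name R and that SIGKILL is acceptable):
if [[ "$(uname)" = "Linux" ]]; then killall -9 --older-than 1m R; fi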
I want to keep polling for a file for up to 1 hour, until it arrives at the expected location.
My dir : /home/stage
File Name (which I am looking for): abc.txt
I want to keep polling the directory /home/stage for 1 hour. If the abc.txt file arrives within that hour, it should stop polling and display a message that the file arrived; otherwise, after 1 hour it should report that the file has not arrived.
Is there any way to achieve this in Unix?
Another bash method, not relying on trap handlers and signals, in case your larger scope already uses them for other things:
#!/bin/bash

interval=60
((end_time=${SECONDS}+3600))
directory=${HOME}
file=abc.txt

while ((${SECONDS} < ${end_time}))
do
  if [[ -r ${directory}/${file} ]]
  then
    echo "File has arrived."
    exit 0
  fi
  sleep ${interval}
done

echo "File did not arrive."
exit 1
The following script should work for you. It would poll for the file every minute for an hour.
#!/bin/bash

duration=3600
interval=60
pid=$$
file="/home/stage/abc.txt"

( sleep ${duration}; { ps -p $pid 1>/dev/null && kill -HUP $pid; } ) &

trap "echo \"file has not arrived\"; kill $pid" SIGHUP

while true; do
  [ -f ${file} ] && { echo "file arrived"; exit; }
  sleep ${interval}
done
Here's an inotify script to check for abc.txt:
#!/bin/sh
timeout 1h \
  inotifywait \
    --quiet \
    --event create \
    --format '%f' \
    --monitor /home/stage |
  while read FILE; do
    [ "$FILE" = 'abc.txt' ] && echo "File $FILE arrived." && kill $$
  done
exit 0
The timeout command quits the process after one hour. In case the file arrives, the process kills itself.
You can use inotify to monitor the directory for modifications and then check to see if the file is abc.txt. The inotifywait(1) command lets you do this directly from the command line in a shell script. Check the man page for details. This is notification based.
A poll based thing would be a loop that checks to see if the file exists and if not, sleep for a period of time before checking again. That's a trivial shell script too.
Here is an answer with a retry count:
# Example values; the directory and file name come from the question
s_dir=/home/stage
input_file=abc.txt
interval=60          # seconds between polls
maxpol_count=60      # maximum number of polls
end_time=$((SECONDS + 3600))

cur_poll_c=0
echo "current poll count= $cur_poll_c"
while (($cur_poll_c < $maxpol_count)) && (($SECONDS < $end_time))
do
  if [[ -f $s_dir/$input_file ]]
  then
    echo "File has arrived..."
    # do some operation...
    sleep 5
    exit 0
  fi
  sleep $interval
  echo "Retrying for the $cur_poll_c time..."
  cur_poll_c=$((cur_poll_c + 1))
done
The nohup documentation in info coreutils 'nohup invocation' states:
Exit status:
125 if `nohup' itself fails, and `POSIXLY_CORRECT' is not set
126 if COMMAND is found but cannot be invoked
127 if COMMAND cannot be found
the exit status of COMMAND otherwise
However, the only exit codes I've ever gotten from nohup have been 1 and 0. I have a nohup command that's failing from within a script, and I need to handle the failure appropriately... and based on this documentation I would assume that the nohup exit code should be 126. Instead, it is 0.
The command I'm running is: nohup perl myscript.pl &
Is this because perl is exiting successfully?
If your shell script runs the process with:
nohup perl myscript.pl &
you more or less forego the chance to collect the exit status from nohup. The command as a whole succeeds with 0 if the shell forked and fails with 1 if the shell fails to fork. In bash, you can wait for the background process to die and collect its status via wait:
nohup perl myscript.pl &
oldpid=$!
...do something else or this whole rigmarole is pointless...
wait $oldpid
echo $?
The echoed $? is usually the exit status of the specified PID (unless the specified PID had already died and been waited for).
If you run the process synchronously, you can detect the different exit statuses:
(
nohup perl myscript.pl
echo "PID $! exited with status $?" >&2
) &
And now you should be able to spot the different exit statuses from nohup (eg try different misspellings: nohup pearl myscript.pl, etc).
Note that the sub-shell as a whole is run in the background, but the nohup is run synchronously within the sub-shell.
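For example, running it synchronously and checking the status right away makes the 127 case visible (the misspelling is deliberate, as suggested above):
nohup pearl myscript.pl   # 'pearl' cannot be found, so nohup itself should exit with 127
echo "exit status: $?"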
As I understand it, the question was how to get the command's status when it is run under nohup. In my experience there is very little chance of getting the COMMAND exit status, even when it fails right away; most of the time you just get the exit status of 'nohup COMMAND &' itself, unless you wait or synchronize as Jonathan mentioned. To check the COMMAND status right after nohup, I use:
# [C]OMMAND keeps the awk process itself from matching its own command line
pid=`ps -eo pid,cmd | awk '/[C]OMMAND/ {print $1}'`
if [ -z "$pid" ]; then
  echo "the COMMAND failed"
else
  echo "the COMMAND is running in nohup"
fi
I have a shell script that transfers a build.xml file to a remote unix machine (devrsp02) and executes the ANT task wldeploy on that machine (devrsp02). Now, this wldeploy task takes around 15 minutes to complete and while this is running, the last line at the unix console is -
"task {some digit} initialized".
Once this task is complete, we get a "task Completed" msg and the next task in the script is executed only after that.
But sometimes, there might be a problem with the weblogic domain and the deployment might be failing internally, with no effect on the status of the wldeploy task. The unix console will still be stuck at "task {some digit} initialized". The error of the deployment will be getting logged in a file called output.a
So, what I want now is -
Start a time counter before running wldeploy. If the wldeploy runs for more than 15 minutes, the following command should be run -
tail -f output.a ## without terminating the wldeploy
or
cat output.a ## after terminating the wldeploy forcefully
Point to be noted here is - I can't run the wldeploy task in background, as in that case the user won't get to know when the task is complete, which is crucial for this script.
Could you please suggest anything to achieve this?
Create this script (deploy.sh for example):
#!/bin/sh
sleep 900 && pkill -n wldeploy && cat output.a &
wldeploy
Then from the console
chmod +x deploy.sh
Then run
./deploy.sh
This script will start a counter (15 minutes) that will forcibly kill the wldeploy process if it's running, and if the process was running you'll see the contents of output.a.
If the script has terminated then pkill will not return true and output.a will not be shown.
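If GNU coreutils is available, the same idea (run in the foreground, show the log only when the deployment overruns) can also be sketched with timeout; this assumes wldeploy is on the PATH and output.a is in the current directory:
timeout 900 wldeploy          # still runs in the foreground
if [ $? -eq 124 ]; then       # 124 is timeout's exit status for a command it had to kill
  cat output.a
fi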
I would call this task monitoring rather than "parallel processing" :)
This will only kill the wldeploy process it started, tell you whether wldeploy returned success or failure, and run no more than 30 seconds after wldeploy finishes.
It should be sh-compatible, but the /bin/sh I've got access to now seems to have a broken wait command.
#!/bin/ksh
wldeploy &
while [ ${slept:-0} -le 900 ]; do
  sleep 30 && slept=`expr ${slept:-0} + 30`
  if [ $$ = "`ps -o ppid= -p $!`" ]; then
    echo wldeploy still running
  else
    wait $! && echo "wldeploy succeeded" || echo "wldeploy failed"
    break
  fi
done

if [ $$ = "`ps -o ppid= -p $!`" ]; then
  echo "wldeploy did not finish in $slept seconds, killing it"
  kill $!
  cat output.a
fi
For the part without terminating the wldeploy it is easy: just execute this beforehand:
{ sleep 900; tail -f output.a; } &
For the part where you kill it, it is more complex, as you have to determine the PID of the wldeploy process. pra's answer does exactly that, so I would just refer to it.
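To tie the two parts together, here is a sketch of the foreground variant that also cancels the watcher if wldeploy finishes within the 15-minute window from the question:
{ sleep 900; tail -f output.a; } &   # the watcher from above
watcher=$!
wldeploy                             # still runs in the foreground, as required
kill "$watcher" 2>/dev/null          # if wldeploy finished in time, stop the watcher before it fires
# If wldeploy ran past the 15 minutes, a tail -f may still be running afterwards and
# can be stopped separately, e.g. with: pkill -f 'tail -f output.a'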