How to get the proper exit code from nohup - unix

From the nohup documentation in info coreutils 'nohup invocation' it states:
Exit status:
125 if `nohup' itself fails, and `POSIXLY_CORRECT' is not set
126 if COMMAND is found but cannot be invoked
127 if COMMAND cannot be found
the exit status of COMMAND otherwise
However, the only exit codes I've ever gotten from nohup have been 1 and 0. I have a nohup command that's failing from within a script, and I need to handle the failure appropriately. Based on this documentation, I would assume that the nohup exit code should be 126. Instead, it is 0.
The command I'm running is: nohup perl myscript.pl &
Is this because perl is exiting successfully?

If your shell script runs the process with:
nohup perl myscript.pl &
you more or less forgo the chance to collect the exit status from nohup. The command as a whole succeeds with 0 if the shell forked and fails with 1 if the shell fails to fork. In bash, you can wait for the background process to die and collect its status via wait:
nohup perl myscript.pl &
oldpid=$!
# ...do something else or this whole rigmarole is pointless...
wait $oldpid
echo $?
The echoed $? is usually the exit status of the specified PID (unless the specified PID had already died and been waited for).
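A minimal sketch of the wait pattern above. It assumes `sh -c 'sleep 1; exit 3'` as a stand-in for perl myscript.pl, so the result is predictable; the variable names are illustrative only.

```shell
#!/bin/sh
# Stand-in for `perl myscript.pl`: a command that exits with status 3.
nohup sh -c 'sleep 1; exit 3' >/dev/null 2>&1 &

# Right after `&`, $? only says whether the shell could start the job.
launch_status=$?
job_pid=$!

# ... other work would go here ...

# wait returns the exit status of the background job itself.
wait "$job_pid"
job_status=$?

echo "launch status: $launch_status"
echo "job status: $job_status"
```

The launch status is 0 even though the job fails; only the status collected by wait reflects what the command did.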
If you run the process synchronously, you can detect the different exit statuses:
(
nohup perl myscript.pl
echo "nohup exited with status $?" >&2
) &
And now you should be able to spot the different exit statuses from nohup (e.g., try different misspellings: nohup pearl myscript.pl, etc.).
Note that the sub-shell as a whole is run in the background, but the nohup is run synchronously within the sub-shell.
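If the status is needed later rather than just printed, one option (not part of the answer above, just a sketch) is to have the sub-shell write it to a file. The status file and the misspelled command name are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical status file; any writable path would do.
status_file=$(mktemp)

(
  # Deliberately nonexistent command, so nohup should report "not found".
  nohup no_such_command_xyz >/dev/null 2>&1
  echo "$?" > "$status_file"
) &

# The parent is free to do other work; here it just waits for the sub-shell.
wait $!
recorded_status=$(cat "$status_file")
rm -f "$status_file"

echo "nohup exited with status $recorded_status"
```

Because nohup runs synchronously inside the sub-shell, the recorded value is nohup's own exit status, which per the documentation quoted in the question is 127 when COMMAND cannot be found.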

As I understand it, the question is how to get the command's exit status when it runs under nohup. In my experience, there is very little chance of getting the COMMAND exit status, even when it fails right away. Most of the time you just get the exit status of 'nohup COMMAND &', unless you wait or synchronize as Jonathan mentioned. To check whether COMMAND is still running right after starting it with nohup, I use:
# [C]OMMAND keeps awk from matching its own entry in the ps output
pid=`ps -eo pid,cmd | awk '/[C]OMMAND/ {print $1}'`
if [ -z "$pid" ]; then
echo "the COMMAND failed"
else
echo "the COMMAND is running in nohup"
fi
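Matching on the command name in ps output can pick up unrelated processes. When the check runs in the same shell that started the job, a sketch along these lines uses the PID from `$!` instead. The helper names, the stand-in commands, and the one-second grace period are all assumptions.

```shell
#!/bin/sh
# Succeeds if a process with the given PID exists (signal 0 sends nothing).
is_running() {
  kill -0 "$1" 2>/dev/null
}

# Prints a short description of the job's state.
report() {
  if is_running "$1"; then
    echo "running"
  else
    echo "not running"
  fi
}

# Stand-in for a long-running COMMAND.
nohup sleep 5 >/dev/null 2>&1 &
good_pid=$!

# Stand-in for a COMMAND that fails right away.
nohup sh -c 'exit 1' >/dev/null 2>&1 &
bad_pid=$!

sleep 1   # arbitrary grace period so an immediate failure has time to show up

good_state=$(report "$good_pid")
bad_state=$(report "$bad_pid")
echo "long job: $good_state, failing job: $bad_state"

kill "$good_pid" 2>/dev/null   # clean up the stand-in job
```

For a job that is no longer running, `wait "$bad_pid"` in the same shell would then return its exit status.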

Related

Simulate wait's command -n flag in zsh

I am creating two shell jobs as follows
sleep 5 &
completion_pid=$!
sleep 40 && exit 1 &
failure_pid=$!
In bash I am able to get the exit code of the first job to finish by using the -n flag of the wait command
# capture exit code of the first subprocess to exit
wait -n $completion_pid $failure_pid
It seems, however, that this flag is not available in the wait command on my macOS Big Sur machine (probably because I am using zsh?)
▶ wait -n
wait: job not found: -n
Are there any alternative tools to do this that are also available on MacOS?
What is perhaps weird is that I get the same error when invoking a script containing wait -n as bash myscript.sh...
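Since you are waiting on specific PIDs, you can simply run the command below. Note that it waits for both jobs and returns the exit status of the last PID listed, so it does not tell you which job finished first. The failure under bash myscript.sh is expected on macOS: the bundled bash is version 3.2, and wait -n was only added in bash 4.3.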
wait $completion_pid $failure_pid
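To get closer to what wait -n does, one possibility is to poll the PIDs and collect the status of whichever exits first. The sketch below is an assumption-laden emulation, not a built-in feature: wait_any is a hypothetical helper, the sleep durations are shortened stand-ins for the jobs in the question, and it relies on the shell remembering the exit status of a finished background job (worth verifying in your zsh version).

```shell
#!/bin/sh
# wait_any PID...: poll until one of the PIDs has exited, then set
# finished_pid and return that job's exit status. Hypothetical helper.
wait_any() {
  while :; do
    for candidate in "$@"; do
      if ! kill -0 "$candidate" 2>/dev/null; then
        finished_pid=$candidate
        wait "$candidate"
        return $?
      fi
    done
    sleep 1   # polling interval; arbitrary
  done
}

sleep 1 &
completion_pid=$!
(sleep 3; exit 1) &
failure_pid=$!

wait_any "$completion_pid" "$failure_pid"
first_status=$?
echo "PID $finished_pid finished first with status $first_status"

wait   # let the remaining job finish before the script exits
```

The polling interval trades responsiveness for overhead; the helper reports a finished job up to one interval late.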

How to get an Rscript to return a status code in non-interactive bash mode

I am trying to get the status code out of an Rscript run in a non-interactive way, from a bash script. This step is part of a larger data processing cycle that involves db2 scripts, among other things.
So I have the following contents in a script sample.sh:
Rscript --verbose --no-restore --no-save /home/R/scripts/sample.r >> sample.rout
When sample.sh is run, it always returns a status code of 0, regardless of whether sample.r runs to completion or errors out at an intermediate step.
I tried the following, without luck:
1 - In sample.sh, I added an if/else condition on the return code, as shown below, but it still wrote 0 even though sample.r failed in one of its functions.
if Rscript --verbose --no-restore --no-save /home/R/scripts/sample.r >> sample.rout
then
echo -e "0"
else
echo -e "1"
fi
2 - I also tried a wrapper script, sample.wrapper.sh:
r=0
a=$(./sample.sh)
r=$?
echo -e "\n return code of the script is: $a\n"
echo -e "\n The process completed with status: $r"
Here, too, I did not get the expected '1' in either variable, a or r, when sample.r failed at an intermediate step. Ideally, I would like a way to capture the error (as '1') in a.
Could someone please advise how to get Rscript to write '0' only when the entire script completes without errors, and '1' in all other cases?
I greatly appreciate the input! Thank you!
I solved the problem by returning the status code in addition to echoing it. Below is the code snippet from the sample.sh script. In addition, in the sample.R code I added tryCatch to catch the errors and call quit(status = 1).
function fun {
if Rscript --verbose --no-restore --no-save /home/R/scripts/sample.r > sample.rout 2>&1
then
echo -e "0"
return 0
else
echo -e "1"
return 1
fi
}
fun
thanks everyone for your inputs.
The above code works for me. I modified it so that I could reuse the function and have it exit when there's an error
Rscript_with_status () {
rscript=$1
if Rscript --vanilla "$rscript"
then
return 0
else
exit 1
fi
}
Run R scripts with:
Rscript_with_status /path/to/script/sample.r
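For the wrapper case in the question (`a=$(./sample.sh)` followed by `r=$?`), a function that both echoes and returns the status lets the caller read either one. The sketch below uses a hypothetical run_and_report helper and `false` as a stand-in for a failing Rscript run, so it can be tried without R installed.

```shell
#!/bin/sh
# run_and_report CMD...: same shape as fun above, but takes the command as
# arguments (an illustrative change, not the original author's code).
run_and_report() {
  if "$@" >/dev/null 2>&1; then
    echo "0"
    return 0
  else
    echo "1"
    return 1
  fi
}

# `false` stands in for an Rscript run that ends with quit(status = 1).
a=$(run_and_report false)
r=$?
echo "captured output: $a, exit status: $r"

# `true` stands in for an Rscript run that completes without errors.
ok_output=$(run_and_report true)
ok_status=$?
echo "captured output: $ok_output, exit status: $ok_status"
```

When a command consists only of an assignment with a command substitution, `$?` afterwards is the exit status of the substituted command, which is why r matches the function's return value.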
Your remote script needs to provide a proper exit status.
As a first test, put e.g. "exit 1" at the end of the remote script and see that it makes a difference.
remote.sh:
#!/bin/sh
exit 1
From local machine:
ssh -l username remoteip /home/username/remote.sh
echo $?
1
Without an explicit exit, the remote script returns the exit status of its last executed command. Experiment further by modifying your remote script:
#!/bin/sh
#exit 1
/bin/false
The exit status of the remote command will now also be 1.

How to get supervisord to restart hung workers?

I have a number of Python workers managed by supervisord that should continuously print to stdout (after each completed task) if they are working properly. However, they tend to hang, and we've had difficulty finding the bug. Ideally supervisord would notice that they haven't printed in X minutes and restart them; the tasks are idempotent, so non-graceful restarts are fine. Is there any supervisord feature or addon that can do this? Or another supervisor-like program that has this out of the box?
We are already using http://superlance.readthedocs.io/en/latest/memmon.html to kill if memory usage skyrockets, which mitigates some of the hangs, but a hang that doesn't cause a memory leak can still cause the workers to reach a standstill.
One possible solution would be to wrap your python script in a bash script that'd monitor it and exit if there isn't output to stdout for a period of time.
For example:
kill-if-hung.sh
#!/usr/bin/env bash
set -e
TIMEOUT=60
LAST_CHANGED="$(date +%s)"
{
set -e
while true; do
sleep 1
kill -USR1 $$
done
} &
trap check_output USR1
check_output() {
CURRENT="$(date +%s)"
if [[ $((CURRENT - LAST_CHANGED)) -ge $TIMEOUT ]]; then
echo "Process STDOUT hasn't printed in $TIMEOUT seconds"
echo "Considering process hung and exiting"
exit 1
fi
}
STDOUT_PIPE=$(mktemp -u)
mkfifo $STDOUT_PIPE
trap cleanup EXIT
cleanup() {
kill -- -$$ # Send TERM to child processes
[[ -p $STDOUT_PIPE ]] && rm -f $STDOUT_PIPE
}
# Run the wrapped command ("$@" is the full argument list; $# is only the argument count)
"$@" >"$STDOUT_PIPE" || exit 2 &
while true; do
if read tmp; then
echo "$tmp"
LAST_CHANGED="$(date +%s)"
fi
done <$STDOUT_PIPE
Then you would run a python script in supervisord like: kill-if-hung.sh python -u some-script.py (-u to disable output buffering, or set PYTHONUNBUFFERED).
I'm sure you could imagine a python script that'd do something similar.

Unable to capture failure of rsh

I have the rsh code below as part of a script. It runs in a loop within the main script. If the rsh fails, I want to capture the exit code in a log, which is what the if block below is for. But it does not seem to work: $? is always 0, even when the remote server refuses connections.
I cannot use ssh as it is not configured.
rsh ${machine} -l ${osusernm} nohup ${ScrDir}/${LoadJobNm}.scr ${osusernm} ${machine} ${SIDFile} ${logon_id} ${calling_machine} &
if [ $? -ne 0 ]
then
echo "ERROR : Failed to execute ${LoadJobNm}.scr in ${machine} for file ${SIDFile}" >> ${LogDir}/${JobNm}.log
break
fi

How to do parallel processing in Unix Shell script?

I have a shell script that transfers a build.xml file to a remote unix machine (devrsp02) and executes the ANT task wldeploy on that machine (devrsp02). Now, this wldeploy task takes around 15 minutes to complete and while this is running, the last line at the unix console is -
"task {some digit} initialized".
Once this task is complete, we get a "task Completed" message, and the next task in the script is executed only after that.
But sometimes, there might be a problem with the weblogic domain and the deployment might be failing internally, with no effect on the status of the wldeploy task. The unix console will still be stuck at "task {some digit} initialized". The error of the deployment will be getting logged in a file called output.a
So, what I want now is -
Start a time counter before running wldeploy. If the wldeploy runs for more than 15 minutes, the following command should be run -
tail -f output.a ## without terminating the wldeploy
or
cat output.a ## after terminating the wldeploy forcefully
Note that I can't run the wldeploy task in the background, because then the user won't know when the task is complete, which is crucial for this script.
Could you please suggest anything to achieve this?
Create this script (deploy.sh for example):
#!/bin/sh
sleep 900 && pkill -n wldeploy && cat output.a &
wldeploy
Then from the console
chmod +x deploy.sh
Then run
./deploy.sh
This script will start a counter (15 minutes) that will forcibly kill the wldeploy process if it's running, and if the process was running you'll see the contents of output.a.
If wldeploy has already terminated, pkill finds nothing to kill and returns non-zero, so output.a will not be shown.
I would call this task monitoring rather than "parallel processing" :)
This will only kill the wldeploy process it started, tell you whether wldeploy returned success or failure, and run no more than 30 seconds after wldeploy finishes.
It should be sh-compatible, but the /bin/sh I've got access to now seems to have a broken wait command.
#!/bin/ksh
wldeploy &
while [ ${slept:-0} -le 900 ]; do
sleep 30 && slept=`expr ${slept:-0} + 30`
if [ $$ = "`ps -o ppid= -p $! | tr -d ' '`" ]; then   # strip the padding some ps versions add
echo wldeploy still running
else
wait $! && echo "wldeploy succeeded" || echo "wldeploy failed"
break
fi
done
if [ $$ = "`ps -o ppid= -p $! | tr -d ' '`" ]; then
echo "wldeploy did not finish in $slept seconds, killing it"
kill $!
cat output.a
fi
Showing the log without terminating wldeploy is easy; just run this before starting wldeploy:
{ sleep 900; tail -f output.a; } &
Killing it is more complex, as you have to determine the PID of the wldeploy process. pra's answer does exactly that, so I would just refer to it.
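If the script launches wldeploy itself, the shell already provides the PID in `$!`, so no lookup is needed. A rough sketch of a timeout wrapper built on that idea follows; run_with_timeout is a hypothetical helper, and the stand-in commands replace wldeploy so the example can run anywhere.

```shell
#!/bin/sh
# run_with_timeout SECONDS CMD...: hypothetical helper. Runs CMD, sends it
# SIGTERM if it is still alive after SECONDS, and returns CMD's exit status.
run_with_timeout() {
  limit=$1
  shift
  "$@" &
  cmd_pid=$!

  # Watchdog: detached from our output so a leftover sleep cannot hold it open.
  ( sleep "$limit"; kill "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
  watchdog_pid=$!

  wait "$cmd_pid"
  cmd_status=$?

  # Stop the watchdog if the command finished first.
  kill "$watchdog_pid" 2>/dev/null
  wait "$watchdog_pid" 2>/dev/null
  return "$cmd_status"
}

# Finishes well within the limit: its own exit status comes back.
run_with_timeout 5 sh -c 'exit 7'
quick_status=$?

# Exceeds the limit: the watchdog terminates it.
run_with_timeout 1 sleep 10
slow_status=$?

echo "quick: $quick_status, slow: $slow_status"
```

A status above 128 indicates the command was ended by a signal (128 plus the signal number), which is the point where the script could print output.a. The user still sees the script block until the command finishes, because wait runs in the foreground.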

Resources