Unix parallel processing with error handling

I need to run multiple scripts in parallel. The script needs to ensure that X jobs are always running: I trigger 6 jobs in parallel, and as soon as one finishes it triggers the next one, but if something fails it should stop processing any further jobs. I was able to get hold of a script which helped me with the first part, but I am not sure how to handle the error tracking. I do not have GNU parallel.
PROCS=6
SLEEP=20
while read LINE
do
    # Wait until fewer than $PROCS background jobs are running, then launch the next one
    while true
    do
        NUM=$(jobs | wc -l)
        echo "NUM=$NUM"
        if [ "$NUM" -lt "$PROCS" ]
        then
            stuff "$LINE" &
            break
        else
            sleep "$SLEEP"
        fi
    done
done < textfile
wait
So the above ensures 6 processes are always running. I want a way to find out whether one of the processes has failed, and in that case not trigger any more jobs.
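One possible sketch (not from the original thread, and assuming a reasonably recent bash where wait -n is available): reap finished jobs with wait -n, remember whether any of them exited non-zero, and stop launching new jobs as soon as that happens. "stuff" is the placeholder command from the script above.

PROCS=6
FAILED=0

while read -r LINE
do
    # While all slots are busy, reap whichever job finishes next
    while [ "$(jobs -rp | wc -l)" -ge "$PROCS" ] && [ "$FAILED" -eq 0 ]
    do
        wait -n || FAILED=1            # remember that a job failed
    done
    [ "$FAILED" -ne 0 ] && break       # do not launch anything further
    stuff "$LINE" &
done < textfile

# Reap the jobs that are still outstanding and catch late failures
while [ -n "$(jobs -p)" ]
do
    wait -n || FAILED=1
done
exit "$FAILED"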

Related

Kill all R processes that hang for longer than a minute

I use a cron task to regularly run an Rscript. Unfortunately, I need to do this on a small AWS instance, and the process may hang, piling up more and more processes on top of each other until the whole system is lagging.
I would like to write a cron task to kill all R processes lasting longer than one minute. I found another answer on Stack Overflow that I've adapted, and I think it would solve the problem. I came up with:
if [[ "$(uname)" = "Linux" ]];then killall --older-than 1m "/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";fi
I copied the command directly from htop, but it does not work as I expect: I get a "No such file or directory" error even though I've checked it a few times.
I need to kill all R processes that have lasted longer than a minute. How can I do this?
You may want to avoid killing processes from another user and try SIGKILL (kill -9) after SIGTERM (kill -15). Here is a script you could execute every minute with a CRON job:
#!/bin/bash
PROCESS="R"
MAXTIME=`date -d '00:01:00' +'%s'`

function killpids()
{
    PIDS=`pgrep -u "${USER}" -x "${PROCESS}"`

    # Loop over all matching PIDs
    for pid in ${PIDS}; do
        # Retrieve duration of the process
        TIME=`ps -o time:1= -p "${pid}" |
              egrep -o "[0-9]{0,2}:?[0-9]{0,2}:[0-9]{2}$"`
        # Convert TIME to timestamp
        TTIME=`date -d "${TIME}" +'%s'`
        # Check if the process should be killed
        if [ "${TTIME}" -gt "${MAXTIME}" ]; then
            kill ${1} "${pid}"
        fi
    done
}

# Leave a chance to kill processes properly (SIGTERM)
killpids "-15"
sleep 5
# Now kill remaining processes (SIGKILL)
killpids "-9"
Why spawn an additional process every minute with cron?
Would it not be easier to start R with timeout from coreutils? The process will then be killed automatically after the time you choose.
timeout [option] duration command [arg]…
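For example, the crontab entry that launches the script could wrap the call like this (the script path is taken from the question; the one-minute limit and the 10-second grace period before SIGKILL are assumptions):

timeout --kill-after=10s 1m Rscript /home/ubuntu/script.R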
I think the best option is to do this with R itself. I am no expert, but it seems the future package will allow executing a function in a separate thread. You could run the actual task in a separate thread, and in the main thread sleep for 60 seconds and then stop().
Previous Update
user1747036's answer which recommends timeout is a better alternative.
My original answer
This question is more appropriate for Super User, but here are a few things wrong with your command:
if [[ "$(uname)" = "Linux" ]];then
killall --older-than 1m \
"/usr/lib/R/bin/exec/R --slave --no-restore --file=/home/ubuntu/script.R";
fi
The name argument is either the name of the image or the path to it; you have included its command-line parameters as well.
If -s signal is not specified, killall sends SIGTERM, which your process may ignore. Are you able to kill a long-running script with this on the command line? You may need SIGKILL / -9.
More at http://linux.die.net/man/1/killall
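A corrected invocation along those lines might look like the following untested sketch: match the R binary by name only, and escalate to SIGKILL for anything that survives SIGTERM.

if [[ "$(uname)" = "Linux" ]]; then
    # Match the process name only, not the full command line with its arguments;
    # this targets every R process the invoking user is allowed to signal.
    killall --older-than 1m R
    sleep 5
    # Anything still alive after SIGTERM gets SIGKILL
    killall --older-than 1m -s KILL R
fi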

Write to one output file from a few parallel LSF bsub jobs, avoiding writing at the same time

I have developed a piece of code composed of two files:
An 'envelope bash file', which does a few things, writes to a log-file, and then at some point enters a for loop in which it executes one job at a time using bsub.
An 'internal bash file', which gets as input the name of the log-file (in addition to other input values necessary for its execution) and executes process X using the input values it received from the 'envelope file'.
Once process X is finished, the 'internal script' writes to the log-file that process X (with its specific serial number) has been completed.
Since the for loop of the envelope file iterates 10 times, at least 10 processes are being executed in parallel, and they are all launched with bsub given the SAME log-file name. The idea is that they would all report to the same log-file once they have completed their execution of process X.
The general procedure works well: in each case process X is executed, and the log-file accumulates all the notifications about the completion of process X as required. However, in some instances we see that writing to the log-file gets disturbed and output lines of two parallel runs run into each other.
I would like to lock the log-file in such a manner that it receives text from only one parallel run at a time. The idea is to avoid cases where the text becomes mixed because two processes happen to write to the log-file at exactly the same time.
Here is the part of my envelope file which calls bsub (I reduced the content to the minimum necessary):
for ((i=1;i<=$batchesnumber; i++));
do
bsub -J $SerialName -q normal "bash FetchFasta.bash $genome_fa ${SerialFileName}".bed" $logfile"
done
Here is the part of my internal file that echoes to the log-file:
(
echo "~~~~~~~~~~~~~~~~~~"
echo "^^^^^^^^^^^^^^^^^^"
echo -n "Completed running "; bedtools -version
echo "bedtools getfasta -s -fi $genome_fasta -bed $mySerialFile -fo ${mySerialFile%.*}".fa" "
echo "Run's completion time is: $timedate"
echo -e "~~~~~~~~~~~~~~~~~~\n"
) >> $logfile
I would appreciate any useful solution!
There are a couple of ways I can think of going about this:
Have each job write its output to a different file (use $LSB_JOBID inside each job to name the file). Then use another "cleanup" job to concatenate all of the output into a single file. You can use job dependencies (bsub -w) to make sure the cleanup job runs after all the other jobs are done.
Implement a lock inside your "internal" job to make sure only one of them writes to the file at a time. This is a lot simpler than it might sound; one way to do it is to have each job try to create the same directory with mkdir before writing to the file and then delete the directory after it's done. If a job fails to create the directory, it's because another one of the jobs got to it first and is currently writing to the file.
Here's a snippet illustrating #2 in bash:
# Try to get the lock every second
while ! mkdir lock &> /dev/null ; do
sleep 1
done
# Got the lock, write to the logfile
echo blahblahblah >> $logfile
# Release the lock
rmdir lock
I should mention an important caveat here, though: if one of your jobs dies while it's "holding the lock" (say someone sends it a kill signal at the wrong time), then it'll never remove the directory, and all the other jobs won't be able to create it, so they'll just keep sleeping forever.
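One way to soften that caveat (my own addition, not part of the original answer) is to release the lock from a trap, so a job that is terminated while holding the lock still cleans up after itself; a kill -9 / SIGKILL can still leave a stale lock behind.

# Release the lock on any exit path, not just the happy one
cleanup() { rmdir lock 2> /dev/null; }

# Try to get the lock every second
while ! mkdir lock &> /dev/null ; do
    sleep 1
done
trap 'cleanup' EXIT
trap 'cleanup; exit 1' INT TERM

# Got the lock, write to the logfile
echo blahblahblah >> "$logfile"
# The EXIT trap releases the lock when the job finishes normally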

Shell Script to Check for Status of an Informatica Workflow

We have two Informatica jobs that run in parallel.
One starts at 11.40 CET and it has around 300 Informatica workflows in it, one of which is fact_sales.
The other job runs at 3.40 CET and it has around 115 workflows in it, many of which are dependent on fact_sales in terms of data consistency.
The problem is that fact_sales should finish before certain workflows in process 2 start for the data to be accurate, but this generally doesn't happen.
What we are trying to do is split process 2 in such a way that the fact_sales-dependent workflows run only after fact_sales has finished.
Can you suggest a way to write a Unix shell script that checks the status of fact_sales, kicks off the other dependent workflows if it was successful, and sends a failure mail if it was not?
thanks
I don't see the need to write a custom shell script for this. Most of this is pretty standard/common functionality that can be implemented using Command Task and event waits.
**Process1 - runs at 11:50**
....workflow
...
fact_sales workflow  **Add a command task at the end that drops a flag, say, fact_sales_0430.done**
...
....workflow..500
All the dependent processes will then have an Event Wait that waits on this .done file. Since there are multiple dependent workflows, make sure none of them deletes the file right away; you can drop this .done file at the end of the day or when the load starts for the next day. (A sketch of the flag-drop command follows the second illustration below.)
workflow1
.....
dependantworkflow1 -- Event wait, waiting on fact_sales_0430.done (do not delete file).
dependantworkflow2 -- Event wait, waiting on fact_sales_0430.done (do not delete file).
someOtherWorkflow
dependantworkflow3 -- Event wait, waiting on fact_sales_0430.done (do not delete file).
....
......
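The command task and the end-of-day cleanup mentioned above could be as simple as the following; the flag directory is an assumption, so use whatever shared path your Integration Service can see.

# Command task at the end of the fact_sales workflow: drop the flag
touch /shared/infa_flags/fact_sales_0430.done

# Cleanup before the next day's load (or at end of day): remove the flag
rm -f /shared/infa_flags/fact_sales_0430.done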
A second approach can be as follows -
You must be running some kind of scheduler for launching these workflows, since Informatica can't schedule multiple workflows as a set; it can only handle worklets/sessions at that level of dependency management.
From the scheduler, create a dependency between the sales fact load workflow and the other dependent workflows.
I think the script mentioned below will work for you. Please update the parameters.
WAIT_LOOP=1
while [ ${WAIT_LOOP} -eq 1 ]
do
    # ${WORKFLOW_NAME} here is the fact_sales workflow
    WF_STATUS=`pmcmd getworkflowdetails -sv $INFA_INTEGRATION_SERVICE -d $INFA_DOMAIN -uv INFA_USER_NAME -pv INFA_PASSWORD -usd Client -f $FOLDER_NAME $WORKFLOW_NAME | grep "Workflow run status:" | cut -d'[' -f2 | cut -d']' -f1`
    echo ${WF_STATUS} | tee -a $LOG_FILE_NAME
    # Any terminal status ends the polling loop
    case "${WF_STATUS}" in
        Aborted|Disabled|Failed|Scheduled|Stopped|Succeeded|Suspended|Terminated|Unscheduled)
            WAIT_LOOP=0
            ;;
    esac
    if [ ${WAIT_LOOP} -eq 1 ]
    then
        sleep $WAIT_SECONDS
    fi
done

if [ "${WF_STATUS}" = "Succeeded" ]
then
    # ${WORKFLOW_NAME} here is the dependent workflow to kick off
    pmcmd startworkflow -sv $INFA_INTEGRATION_SERVICE -d $INFA_DOMAIN -uv INFA_USER_NAME -pv INFA_PASSWORD -usd Client -f $FOLDER_NAME -paramfile $PARAMETER_FILE $WORKFLOW_NAME | tee $LOG_FILE_NAME
else
    (echo "Please find attached Logs for Run" ; uuencode $LOG_FILE_NAME $LOG_FILE_NAME) | mailx -s "Execution logs" $EMAIL_LIST
    exit 1
fi
I can see your main challenge: keeping the dependencies between a large number of Informatica workflows.
You have two options:
You can use an automated scheduling tool to set up the dependencies and run the workflows one by one in the proper order. There are many free tools; which one to choose depends on your comfort, time, cost, etc. (link here).
Secondly, you can create your own custom job scheduler. I built a similar scheduler using a UNIX script and an Oracle table. Here are the steps for that:
Categorize all your workflows into groups: independent flows go into group 1, flows that depend on group 1 go into group 2, and so on.
Set up your process to pick workflows one by one from these groups and kick them off. If the kick-off queue is empty, it should wait. Call this loop 2.
Keep a polling loop that checks the status of the kicked-off flows. If one has failed or aborted, fail the process, mail the user, and mark all in-queue/dependent flows as failed. If it is still running, keep polling. If it succeeded, give control back to loop 2.
If the kick-off queue is empty, move to the next group only once every workflow in the current group has succeeded.
This is a bit of a tricky process, but it pays off once you set it up. You can add as many workflows as you want, and maintenance is much smoother compared to the Informatica scheduler or worklets.
You can fire a query against the repository database using views such as REP_SESS_LOG and check whether the fact_sales run has succeeded or not. Only then should you proceed with the second job.

KSH: Block two processes from running at the same time

I have two processes that run at random times, and I want to force them never to run at the same time because of a reader-writer problem. My thought is that whenever a process runs, it creates a LOCK file, and both processes have logic for checking whether a LOCK exists. If the LOCK exists, the process sleeps for a bit, wakes up, and checks again. Here is a small piece of it:
if [[ ! -f ${INPUT_DIR}/LOCK ]]
then
    # Create LOCK file
    cat /dev/null > ${INPUT_DIR}/LOCK
    retcode=${?}
    if [[ ${retcode} -ne 0 ]]
    then
        echo `date` "Error in creating LOCK file by processA.sh - Error code: " ${retcode} >> ${CORE_LOG}
        exit
    fi
    echo `date` "LOCK turns on by processA.sh" >> ${CORE_LOG}
    ...
    rm ${INPUT_DIR}/LOCK
fi
However, this does not QUITE stop the two processes from running at the same time. There are rare times when both processes get past the first IF that checks whether the LOCK exists (if both are invoked at the same time and no LOCK exists yet, it is very likely that both will pass that first IF statement); both then try to create a LOCK file, since cat /dev/null > ${INPUT_DIR}/LOCK will not generate an error even when the LOCK already exists. Is there a solution to this?
For the main versions of Unix, the preferred solution is to use a lock directory. I would assume this is true for Linux as well, but I haven't had to test it recently.
Creating a directory is an atomic operation, and only one of the processes will succeed, assuming you use a static name like /tmp/myProjWorkSpace/LOCK and plain /bin/mkdir (not mkdir -p, which succeeds even when the directory already exists). If you need to have information embedded in your lock, then you need a file, and you need separate subdirectories per process; possibly add the process ID (.$$) to the directory name.
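A minimal sketch of that approach, reusing ${INPUT_DIR} and ${CORE_LOG} from the question; the retry interval and the trap-based cleanup are my own assumptions.

# The lock is a directory, so creation is atomic and only one process wins
until mkdir "${INPUT_DIR}/LOCK" 2> /dev/null
do
    sleep 5          # the other process holds the lock; wait and retry
done
echo `date` "LOCK acquired by processA.sh" >> ${CORE_LOG}

# Release the lock even if the script exits early (SIGKILL excepted)
trap 'rmdir "${INPUT_DIR}/LOCK"' EXIT

# ... critical section: do the reading/writing here ...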
I hope this helps.

Shell script task status monitoring

I'm running an ANT task in the background and checking in 60-second intervals whether that task is complete or not. If it is not, every 60 seconds a message should be displayed on screen: "Deploy process is still running. $slept seconds since deploy started", where $slept is 60, 120, 180 and so on.
There's a limit of 1200 seconds, after which the script will show the log via the 'ant log' command and ask the user whether to continue. If the user chooses to continue, 300 seconds are added to the time limit and the process repeats.
The code that I am using for this task is:
ant deploy &
limit=1200

deploy_check()
{
    while [ ${slept:-0} -le $limit ]; do
        sleep 60 && slept=`expr ${slept:-0} + 60`
        if [ $$ = "`ps -o ppid= -p $!`" ]; then
            echo "Deploy process is still running. $slept seconds since deploy started."
        else
            wait $! && echo "Application ${New_App_Name} deployed successfully" || echo "Deployment of ${New_App_Name} failed"
            break
        fi
    done
}

deploy_check

if [ $$ = "`ps -o ppid= -p $!`" ]; then
    echo "Deploy process did not finish in $slept seconds. Here's the log."
    ant log
    echo "Do you want to kill the process? Press Ctrl+C to kill. Press Enter to continue."
    read log
    limit=`expr ${limit} + 300`
    deploy_check
fi
Now, the problem is that this code is not working. It looks like perfectly good code, and yet it is not working. Can anyone point out what is wrong with it, please?
If the user chooses to continue, deploy_check gets run again, but there's never another opportunity for the user to continue or cancel (although Ctrl-C could be pressed at any time). So you may want to wrap that in a while loop.
Also, pressing Ctrl-C is probably not going to kill the child process. You need to prompt for yes or no and if yes do a kill $!.
Edit:
Here is how your code flows:
Call deploy_check
Loop until slept exceeds limit (it will be 1260 at that point)
Break if deployed successfully
If it's still running, prompt the user
If the user presses Ctrl-C, the script exits but ant deploy is left running (unless you have a trap set elsewhere in your script that processes the Ctrl-C and does a kill on the ant deploy process)
If the user presses enter, the limit is raised to 1500
Call deploy_check testing slept==1260 against limit==1500
Loop until slept exceeds limit (it will be 1560 at that point)
Break if deployed successfully
The script (or this section of it) ends without prompting the user again
You would need a loop to cause the prompt to occur again.
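Putting the answer's two suggestions together, the tail end of the script could look something like this sketch (variable names reused from the question; the yes/no handling and the prompt wording are assumptions):

deploy_check

# Keep prompting until the deploy finishes or the user decides to kill it
while [ $$ = "`ps -o ppid= -p $!`" ]
do
    echo "Deploy process did not finish in $slept seconds. Here's the log."
    ant log
    echo "Kill the deploy? Type y to kill it, anything else to wait another 300 seconds."
    read answer
    if [ "$answer" = "y" ]; then
        kill $!                      # explicitly kill the background ant deploy
        echo "Deployment of ${New_App_Name} was cancelled."
        break
    fi
    limit=`expr ${limit} + 300`
    deploy_check
done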
