Oozie Subtraction Calculation in Workflow - oozie

I'm trying to calculate the result of a simple equation in oozie.
Inside config_default.xml:
<property>
<name>mintime</name>
<value>${time - period}</value>
</property>
I pass in both -D time=3600 and -D period=3600
I would like the variable mintime to be accessible inside the workflow with the value 0.
${time - period} doesn't compute this calculation.
time - the number of seconds into the current day at which the coordinator was run. However, as we may want to run this ad hoc without a coordinator, it is specified via -D.
period - the number of seconds between each application run; as we want to run this ad hoc without a coordinator, it is also specified via -D.
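Since ${time - period} is not being evaluated here, one workaround is to do the subtraction in whatever wrapper submits the job and pass the result in as its own property. A minimal sketch, assuming the job is submitted from a wrapper shell script (the wrapper itself and job.properties are assumptions; the property names are the ones from above):
# hypothetical submit wrapper: do the arithmetic in the shell and hand the
# ready-made value to Oozie instead of relying on EL to subtract
time=3600
period=3600
mintime=$(( time - period ))    # 3600 - 3600 = 0

oozie job -run -config job.properties \
    -D time=${time} -D period=${period} -D mintime=${mintime}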

Related

Can Snakemake work if a rule's shell command is a cluster job?

In the example below, if the shell script shell_script.sh sends a job to the cluster, is it possible to have Snakemake aware of that cluster job's completion? That is, first, file a should be created by shell_script.sh, which sends its own job to the cluster, and then, once that cluster job is completed, file b should be created.
For simplicity, let's assume that Snakemake is run locally, meaning that the only cluster job originates from shell_script.sh and not from Snakemake itself.
localrules: that_job

rule all:
    input:
        "output_from_shell_script.txt",
        "file_after_cluster_job.txt"

rule that_job:
    output:
        a = "output_from_shell_script.txt",
        b = "file_after_cluster_job.txt"
    shell:
        """
        shell_script.sh {output.a}
        touch {output.b}
        """
PS - At the moment, I am using a sleep command to give it some waiting time before the job is "completed". But this is an awful workaround, as it could give rise to several problems.
Snakemake can manage this for you with the --cluster argument on the command line.
You can supply a template for the jobs to be executed on the cluster.
As an example, here is how I use Snakemake on an SGE-managed cluster.
The template that will wrap the jobs, which I called sge.sh:
#$ -S /bin/bash
#$ -cwd
#$ -V
{exec_job}
Then, directly on the login node, I run:
snakemake -rp --cluster "qsub -e ./logs/ -o ./logs/" -j 20 --jobscript sge.sh --latency-wait 30
--cluster tells Snakemake which command to use to submit jobs to the queuing system
--jobscript is the template in which the jobs will be wrapped
--latency-wait is important if the file system takes a bit of time to write the files. Your job might end and return before the outputs of the rule are actually visible on the file system, which would cause an error
Note that you can specify rules not to be executed on the nodes in the Snakefile with the keyword localrules:
Otherwise, depending on your queuing system, there are options to wait for jobs sent to the cluster to finish:
SGE:
Wait for set of qsub jobs to complete
SLURM:
How to hold up a script until a slurm job (start with srun) is completely finished?
LSF:
https://superuser.com/questions/46312/wait-for-one-or-all-lsf-jobs-to-complete
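Alternatively, for the SGE case in the question, shell_script.sh itself can be made to block until its cluster job finishes, so the rule's shell command only returns afterwards. A minimal sketch, assuming SGE and a hypothetical worker script do_work.sh:
#!/bin/bash
# shell_script.sh (sketch): submit the real work to SGE and wait for it.
# qsub -sync y blocks until the submitted job has finished, so the Snakemake
# rule calling this script only completes once the cluster job is done.
output_file="$1"
qsub -sync y -cwd -V do_work.sh "$output_file"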

How to set timers for console commands in symfony 2.x?

In my app I've got one task that needs to be done every 48 hours on the server side. I've created a console command in order to automate the job. However, I don't know how I can set a timer to keep invoking that command. Can you point me to a way to do that?
You should look at cron.
Cron will run your command at whatever frequency you specify.
To create a cron entry (on Unix), use: crontab -e
For example
0 0 */2 * * bin/console app:command >/dev/null 2>&1
will run bin/console app:command at midnight on every odd day of the month.
To help you generate a cron expression:
https://crontab-generator.org/
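Keep in mind that cron runs with a very minimal environment, so it is usually safer to use absolute paths and keep the output; a hedged variant of the entry above (the PHP binary, project path and log file locations are assumptions to adapt):
# crontab -e entry: midnight on every odd day of the month,
# with absolute paths and output appended to a log instead of being discarded
0 0 */2 * * /usr/bin/php /var/www/myapp/bin/console app:command >> /var/log/myapp_command.log 2>&1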

Difference between Cron and Crontab?

I am not able to understand the answer for this question: "What's the difference between cron and crontab." Are they both schedulers with one executing the files once and the other executing the files on a regular interval OR does cron schedule a job and crontab stores them in a table or file for execution?
The Wikipedia page for cron mentions:
Cron is driven by a crontab (cron table) file, a configuration file
that specifies shell commands to run periodically on a given schedule.
But wiki.dreamhost for crontab mentions:
The crontab command, found in Unix and Unix-like operating systems, is
used to schedule commands to be executed periodically. It reads a
series of commands from standard input and collects them into a file
known as a "crontab" which is later read and whose instructions are
carried out.
Specifically, when I schedule a job to be repeated (quoting from the wiki):
1 0 * * * printf > /var/log/apache/error_log
or executing a job only once
at -f myScripts/call_show_fn.sh 1:55 2014-10-14
Am I using cron in both commands, with the job being pushed into the crontab, or is the first one a crontab entry and the second a cron function?
cron is the general name for the service that runs scheduled actions. crond is the name of the daemon that runs in the background and reads crontab files. A crontab is a file containing jobs in the format
minute hour day-of-month month day-of-week command
crontabs are normally stored by the system under /var/spool/cron/ (the exact per-user path varies by distribution). These files are not meant to be edited directly. You can use the crontab command to invoke a text editor (whatever you have defined in the EDITOR environment variable) to modify a crontab file.
There are various implementations of cron. Commonly there will be per-user crontab files (accessed with the command crontab -e) as well as system crontabs in /etc/cron.daily, /etc/cron.hourly, etc.
In your first example you are scheduling a job via a crontab. In your second example you're using the at command to queue a job for later execution.
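Putting the two side by side (the at invocation is the one from the question):
# recurring job: managed by cron, stored in your personal crontab
crontab -e      # opens $EDITOR so you can add a line such as: 1 0 * * * printf "" > /var/log/apache/error_log
crontab -l      # list the entries currently in your crontab

# one-off job: queued via at, executed once at the given time
at -f myScripts/call_show_fn.sh 1:55 2014-10-14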

Getting pid of a particular instance of the application in unix

I have one application, and multiple instances of it are running on the system. Each instance is invoked with a different argument.
I want to get the pid of the particular process that was invoked with a given argument, i.e. the pid of a particular instance of the application according to the argument passed.
Is there any way to get it?
I would probably check the output of ps -eo pid,args, grep for the parameter I need, and then cut the pid from the beginning of the output:
ps -eo pid,args | grep -e "--parameter=x" | grep -v grep | cut -c 1-5
Check the man page of ps. There are a lot of (somewhat confusing) options that will allow you to shape the output of the command. In the above example, -e selects all processes to be shown and -o lets you choose which columns to output.
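If pgrep is available, the same lookup can be done in one step; its -f flag matches against the full command line rather than just the process name (the parameter string is just the example from above):
# print the PIDs of all processes whose full command line contains parameter=x
pgrep -f "parameter=x"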

Shell Script to Check for Status of an Informatica workflow

We have two Informatica jobs that run in parallel.
One starts at 11.40 CET and has around 300 Informatica workflows in it, one of which is fact_sales.
The other job runs at 3.40 CET and has around 115 workflows in it, many of which depend on fact_sales in terms of data consistency.
The problem is that fact_sales should finish before certain workflows in process 2 start for the data to be accurate, but this generally doesn't happen.
What we are trying to do is split process 2 in such a way that the fact_sales-dependent workflows run only after fact_sales has finished.
Can you suggest a way to write a Unix shell script that checks the status of fact_sales and, if it is successful, kicks off the other dependent workflows, and if not, sends a failure mail?
thanks
I don't see the need to write a custom shell script for this. Most of this is pretty standard/common functionality that can be implemented using Command Task and event waits.
Process 1 - runs at 11:50
...workflow
...
fact_sales workflow  <-- add a Command task at the end that drops a flag, say, fact_sales_0430.done
...
...workflow 500
All the dependent processes will have an Event Wait that waits on this .done file. Since there are multiple dependent workflows, make sure none of them deletes the file right away. You can delete the .done file at the end of the day or when the load for the next day starts (a small sketch of these commands follows the layout below).
workflow1
.....
dependantworkflow1 -- Event wait, waiting on fact_sales_0430.done (do not delete file).
dependantworkflow2 -- Event wait, waiting on fact_sales_0430.done (do not delete file).
someOtherWorkflow
dependantworkflow3 -- Event wait, waiting on fact_sales_0430.done (do not delete file).
....
......
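A minimal sketch of the commands behind this (the shared flag directory is an assumption; it just has to be a location that both processes can see):
# Command task at the end of the fact_sales workflow: drop the flag file
touch /shared/flags/fact_sales_0430.done

# end-of-day cleanup (or before the next day's load starts): remove the flag
rm -f /shared/flags/fact_sales_0430.done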
A second approach is as follows:
You must be running some kind of scheduler to launch these workflows, since Informatica can't schedule multiple workflows as a set; it can only handle worklets/sessions at that level of dependency management.
From the scheduler, create a dependency between the sales fact load workflow and the other dependent workflows.
I think the script below will work for you. Please update the parameters.
# Assumed to be set beforehand: INFA_INTEGRATION_SERVICE, INFA_DOMAIN, FOLDER_NAME,
# WORKFLOW_NAME (the fact_sales workflow), DEPENDENT_WORKFLOW_NAME, PARAMETER_FILE,
# WAIT_SECONDS, LOG_FILE_NAME and EMAIL_LIST. Note that pmcmd's -uv/-pv options take
# the *names* of the environment variables holding the user name and password, which
# is why INFA_USER_NAME and INFA_PASSWORD appear below without a leading $.
WAIT_LOOP=1
while [ ${WAIT_LOOP} -eq 1 ]
do
    # Poll the current run status of the fact_sales workflow
    WF_STATUS=`pmcmd getworkflowdetails -sv $INFA_INTEGRATION_SERVICE -d $INFA_DOMAIN -uv INFA_USER_NAME -pv INFA_PASSWORD -usd Client -f $FOLDER_NAME $WORKFLOW_NAME | grep "Workflow run status:" | cut -d'[' -f2 | cut -d']' -f1`
    echo ${WF_STATUS} | tee -a $LOG_FILE_NAME
    # Any terminal status ends the wait loop; otherwise the workflow is still running
    case "${WF_STATUS}" in
        Aborted|Disabled|Failed|Scheduled|Stopped|Succeeded|Suspended|Terminated|Unscheduled)
            WAIT_LOOP=0
            ;;
    esac
    if [ ${WAIT_LOOP} -eq 1 ]
    then
        sleep $WAIT_SECONDS
    fi
done

if [ "${WF_STATUS}" = "Succeeded" ]
then
    # fact_sales finished successfully: kick off the dependent workflow
    pmcmd startworkflow -sv $INFA_INTEGRATION_SERVICE -d $INFA_DOMAIN -uv INFA_USER_NAME -pv INFA_PASSWORD -usd Client -f $FOLDER_NAME -paramfile $PARAMETER_FILE $DEPENDENT_WORKFLOW_NAME | tee -a $LOG_FILE_NAME
else
    # fact_sales did not succeed: mail the log and exit with an error
    (echo "Please find attached logs for the run" ; uuencode $LOG_FILE_NAME $LOG_FILE_NAME) | mailx -s "Execution logs" $EMAIL_LIST
    exit 1
fi
I can see your main challenge - keeping dependencies between a large number of Informatica workflows.
You have two options:
You can use an automated scheduling tool to set the dependencies and run the workflows one by one in the right order. There are many free tools; choose based on your comfort, time, cost, etc.
Alternatively, you can create your own custom job scheduler. I built a similar scheduler using a Unix script and an Oracle table. Here are the steps:
Categorize all your workflows into groups: independent workflows go into group 1, workflows that depend on group 1 go into group 2, and so on.
Set up your process to pick workflows up one by one from these groups and kick them off. If the kick-off queue is empty, it should wait. Call this loop 2.
Keep a polling loop that checks the status of the kicked-off workflows. If one has failed, aborted, etc., fail the process, mail the user, and mark all queued/dependent workflows as failed. If it is still running, keep polling. If it succeeded, give control back to loop 2.
If the kick-off queue is empty, move to the next group only once all workflows in the current group have succeeded.
This is a bit of a tricky process, but it pays off once you set it up. You can add as many workflows as you want, and maintenance will be much smoother compared to the Informatica scheduler or worklets.
You can fire a query against the repository database using views such as REP_SESS_LOG to check whether fact_sales has succeeded or not. Only then should you proceed with the second job.
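A heavily hedged sketch of that check from a shell script, assuming an Oracle repository reachable via sqlplus; the view and column names (REP_SESS_LOG, RUN_STATUS_CODE), the meaning of status code 1, and the REPO_* connection variables are all assumptions to verify against the MX view documentation for your PowerCenter version:
# sketch only - REPO_USER/REPO_PASS/REPO_DB are placeholders for your repository
# connection; verify view/column names against your repository's MX views, and
# note this simplification uses MAX() rather than strictly the latest run
WF_STATE=$(sqlplus -s "${REPO_USER}/${REPO_PASS}@${REPO_DB}" <<EOF | tr -d '[:space:]'
set heading off feedback off pagesize 0
SELECT MAX(run_status_code) FROM rep_sess_log WHERE workflow_name = 'fact_sales';
exit;
EOF
)

# assumption: run_status_code 1 means the run succeeded
if [ "${WF_STATE}" = "1" ]
then
    echo "fact_sales succeeded - safe to start the dependent workflows"
else
    echo "fact_sales has not succeeded yet" >&2
    exit 1
fi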
