I have a Coordinator with a data dependency on a Parquet directory partitioned by date. It runs every morning. If the data for that day isn't available, the workflow goes into "WAITING" status.
Now I want to use the Oozie SLA feature to alert on this condition. Unfortunately, I have not been able to set this up by following the standard Oozie SLA documentation. The feature works if my job is in "RUNNING" status but takes too long to complete, but not if the workflow is in "WAITING" status. The Oozie documentation doesn't mention this case, so I'd appreciate any advice on how to set it up.
https://oozie.apache.org/docs/4.1.0/DG_SLAMonitoring.html
My Airflow scheduler went down for some reason, and when I restarted it, all the DAGs triggered simultaneously, as if it were catching up on the missed jobs. It also seems that a workflow triggers whenever I modify its DAG. These unexpected triggers corrupt my data and erode trust in the system.
Is there a way to prevent a DAG running unexpectedly unless it is the exact time (no catch-up) or unless it is manually triggered?
The airflow scheduler will, at a minimum, attempt to run the current schedule interval when it is online to do so. This means that if the scheduler process is offline for a period of time, when it comes back online it will reconcile which jobs should have run and attempt to start those jobs.
There is some control through the catchup setting. With catchup turned off, the scheduler runs only the latest schedule interval, and the earlier intervals that were missed are not run.
Some info on catchup here: https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#catchup
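To make the difference concrete, here is a toy model of the reconciliation described above. It is not Airflow's scheduler code: the function runs_to_create and the dates are made up for illustration, and it assumes an interval becomes runnable once it has fully elapsed.

```python
from datetime import datetime, timedelta


def runs_to_create(last_run, now, interval, catchup):
    """Return the start of each schedule interval a scheduler would launch.

    Toy model only: an interval [start, start + interval) becomes runnable
    once it has fully elapsed, i.e. when start + interval <= now.
    """
    missed = []
    start = last_run + interval
    while start + interval <= now:
        missed.append(start)
        start += interval
    if catchup:
        return missed       # backfill every missed interval
    return missed[-1:]      # only the most recent completed interval


last_run = datetime(2021, 3, 1)     # start of the last interval run before the outage
now = datetime(2021, 3, 5, 6, 0)    # moment the scheduler comes back online
daily = timedelta(days=1)

with_catchup = runs_to_create(last_run, now, daily, catchup=True)
without_catchup = runs_to_create(last_run, now, daily, catchup=False)
print(len(with_catchup), len(without_catchup))  # prints: 3 1
```

In this model the scheduler still launches one run on restart even with catchup off, which matches the point above that the current interval is always attempted.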
Is there a way to prevent a DAG running unexpectedly unless it is the exact time (no catch-up) or unless it is manually triggered?
There is no way to tell Airflow to only attempt to schedule the job at the exact time the job is supposed to run (and never attempt again after the fact). You can set the schedule interval to None and the job will never be scheduled, however. You can manually trigger the job through the UI or via the Airflow API in this case.
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#cron-presets
preset | meaning
-------+----------------------------------------------------------------
None | Don’t schedule, use for exclusively “externally triggered” DAGs
An Oozie coordinator we own was killed for operational reasons about a week ago. The cluster is now back up and running and ready for business. Can we revive the coordinator somehow so that it keeps its run history and backfills all missing runs, or do we have to schedule a brand new one?
oozie job -resume xxxxxxx-xxxxxxxxxxxxxxx-oozie-oozi-C doesn't error out, but it also doesn't change the status of the coordinator back to RUNNING.
Have you tried the killed -> ignored -> running transition? Based on the docs it should be possible.
It's a two-step process: the first step is -ignore, the second is -change.
I've never tried to do this though :)
We have source files that arrive in HDFS every day except holidays.
Our Oozie coordinator watches for these files and starts every day. I do not want Oozie to run on the defined holidays. How can I do that? The coordinator should not time out if it is a holiday.
One possible solution is to run the job regularly and skip its actions on holidays by using a decision node. Start with a Java action that checks whether today is a holiday, propagate that value to the decision node, and let it decide whether the required actions run (Oozie supports propagating values from one action to another within a workflow). For each of the two scenarios, emit a different message for confirmation, for example 'Today is a holiday, required actions skipped' or 'Not a holiday, job succeeded'.
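The holiday check itself can be sketched as follows. This is written in Python rather than Java purely for illustration; the HOLIDAYS set, the isHoliday key, and both function names are hypothetical. It assumes the check action emits a key=value line that the decision node then reads, for instance through Oozie's wf:actionData EL function.

```python
from datetime import date

# Hypothetical holiday calendar; in practice this might be loaded from a file.
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}


def holiday_flag(day, holidays=HOLIDAYS):
    """Return the key=value line the check action would emit for the decision node."""
    return "isHoliday=" + ("true" if day in holidays else "false")


def confirmation_message(day, holidays=HOLIDAYS):
    """Return the message for whichever branch the decision node would take."""
    if day in holidays:
        return "Today is a holiday, required actions skipped"
    return "Not a holiday, job succeeded"


if __name__ == "__main__":
    # What the check action would print when the workflow runs it.
    print(holiday_flag(date.today()))
```

The decision node would then route to the end node when the flag is true and to the real actions otherwise.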
When I changed the start time of a coordinator job in job.properties in Oozie, the job did not pick up the new time; it keeps running at the old scheduled time.
Old job.properties:
startMinute=08
startTime=${startDate}T${startHour}:${startMinute}Z
New job.properties:
startMinute=07
startTime=${startDate}T${startHour}:${startMinute}Z
The job is not running at the changed time (the 07th minute); it still runs at the 08th minute of every hour.
Could you let me know how I can make the job pick up the updated properties (the changed timing) without restarting or killing it?
You can't really change the timing of the coordinator through any method Oozie (v3.3.2) provides. When you submit a job, the contents of the properties file are stored in the database, whereas the actual workflow lives in HDFS.
Every time the coordinator executes, the workflow must be present at the path specified in the properties at submission, but the properties file itself is not needed. In other words, the properties file no longer comes into the picture once the job has been submitted.
One hack is to update the time directly in the database with a SQL query, but I am not sure about the implications. The property might become inconsistent across the database.
You have to kill the job and resubmit a new one.
Note: Oozie does provide a way to change the concurrency, end time, and pause time of a running coordinator, as specified in the official docs.
I have a coordinator on oozie that runs a series of tasks, each of which depends on the output of the last.
Each task outputs a dated folder and looks for the output of its predecessor using
${coord:latest(0)}
This all worked fine on my dev cluster when nothing else was running; every 5 minutes oozie would queue up another job, and in that 5 minutes the previous job had run so when the new job was set up it would see the directory it needed.
I run into problems on the production cluster. The jobs get submitted but sit in a queue and don't run for a while, yet Oozie still queues up another one every 5 minutes. During initialization each new job is assigned its 'previous' folder, but that folder hasn't been created yet because the predecessor hasn't run, so the 'latest' function gives it the same input as the previous job. I then end up with 10 jobs all taking the same input...
What I need is a way of strictly preventing the next job in a coordinator sequence from even being created until its predecessor has finished running.
Is there a way this can be done?
Thanks for reading
This is exactly the use case Oozie was designed to solve. Oozie waits for all data dependencies to be satisfied before launching a workflow.
Take a look at the following configuration in your coordinator.xml:
<datasets>
<dataset name="my_data" frequency="${coord:days(1)}" initial-instance="2013-01-27T00:00Z">
<uri-template>YOUR_DATA/${YEAR}${MONTH}${DAY}</uri-template>
</dataset>
...
</datasets>
<input-events>
<data-in name="my_data" dataset="my_data">
<instance>${coord:current(-1)}</instance>
</data-in>
</input-events>
<output-events>
<data-out name="my_data" dataset="my_data">
<instance>${coord:current(0)}</instance>
</data-out>
</output-events>
The "coord:current(-1)" in input-events refers to the previous output. Oozie resolves the dataset URI template to "yesterday" and checks whether the data exists in HDFS by looking for a success flag, which by default is an empty file named "_SUCCESS" directly under the output directory. Oozie keeps waiting for this flag before launching the current workflow.
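As a worked example of that resolution, the sketch below models how an instance offset maps onto the URI template from the configuration above. It is a simplified illustration, not Oozie's implementation: current_instance and resolve_uri are hypothetical helper names, the nominal time is made up, and instances before initial-instance are not handled.

```python
from datetime import datetime, timedelta


def current_instance(nominal, initial, frequency, n):
    """Toy model of coord:current(n): the n-th dataset instance counted from
    the latest instance at or before the nominal time."""
    elapsed = (nominal - initial) // frequency  # whole periods since initial-instance
    return initial + (elapsed + n) * frequency


def resolve_uri(template, instance):
    """Fill the ${YEAR}, ${MONTH} and ${DAY} placeholders of a uri-template."""
    return (template.replace("${YEAR}", f"{instance:%Y}")
                    .replace("${MONTH}", f"{instance:%m}")
                    .replace("${DAY}", f"{instance:%d}"))


template = "YOUR_DATA/${YEAR}${MONTH}${DAY}"
initial = datetime(2013, 1, 27)   # initial-instance of the dataset
daily = timedelta(days=1)         # frequency="${coord:days(1)}"
nominal = datetime(2013, 2, 1)    # assumed nominal time of one coordinator action

input_dir = resolve_uri(template, current_instance(nominal, initial, daily, -1))
output_dir = resolve_uri(template, current_instance(nominal, initial, daily, 0))
done_flag = input_dir + "/_SUCCESS"  # file the action waits for before launching

print(input_dir, output_dir, done_flag)
```

Because each action's input is pinned to a fixed instance rather than to whatever is latest at creation time, a queued action cannot pick up the same input as its predecessor.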
btw, you can also set
<coordinator-app name="my_coordinator" frequency="${coord:days(1)}" start="${start_time}" end="${end_time}" ...>
to define the start and end times of the coordinator job, so you can catch up on backlog data.