Airflow - are UTC time and DAG time the same? - airflow

I am quite confused about the concept of time in Airflow. As far as I know, there are at least 5 different time concepts in Airflow:
1. UTC time in the web UI
2. Execution time, e.g. start_date, in the DAG file
3. Run time
4. Time saved in the database
5. System time
Could anyone explain which are the same and which are different?

The UTC time in the web UI (1) is the system time (5) being presented.
The run time (3) shows when a task was actually run, in UTC, and this is the time saved in the database (4), as it corresponds to when a certain task was actually run in a DAG.
The start_date should be a static, non-changing date, as it corresponds to when the DAG was first started, and the DAG should be backfilled from it if certain parameters like catchup or depends_on_past are set to true.
So 1 and 5 are the same, 3 and 4 are the same, and 2 is its own entity of time.
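To see how a UTC timestamp, such as those shown in the UI and saved in the database, relates to a local wall clock, here is a minimal sketch using only the standard library. The UTC+8 offset and the helper name to_utc are assumptions made for illustration; they are not part of Airflow.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone for illustration: a fixed UTC+8 offset.
LOCAL = timezone(timedelta(hours=8))


def to_utc(local_dt: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    if local_dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return local_dt.astimezone(timezone.utc)


wall_clock = datetime(2018, 2, 7, 16, 30, tzinfo=LOCAL)
print(to_utc(wall_clock).isoformat())  # 2018-02-07T08:30:00+00:00
```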

Related

Is there a way to make an Airflow process start at the exact same time every week?

I have an Airflow process that runs every Sunday at 12:00am. Is there a way to trigger this process at exactly the same time (absolute time) every week, regardless of previous run duration or outcome? I see that the start time of the process keeps creeping forward, to the point that after a couple of weeks it now gets triggered a full 16 hours later than the scheduled time. How do I make it start at exactly the same time regardless of the previous run's outcome or whether it was previously triggered manually (cron-like behaviour)?
Add the depends_on_past argument to your DAG's default_args. A False value makes sure new DAG runs are created every interval without depending on the previous DAG run's status:
'depends_on_past': False
It might not be necessary, but I recommend restarting your scheduler after making this change.

How to run Airflow DAG for specific number of times?

How to run airflow dag for specified number of times?
I tried using TriggerDagRunOperator, and this operator works for me.
In the callable function we can check states and decide whether to continue or not.
However, the current count and states need to be maintained.
Using the above approach I am able to repeat the DAG run.
I need an expert opinion: is there any other, better way to run an Airflow DAG X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times and a run takes 2 hours and you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, or why it must be 10 runs, nor why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically), and the end_date to today (also fixed) and then add a daily schedule_interval and a max_active_runs of 1 and you'll get exactly 10 runs and it'll run them back to back without overlapping while changing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None scheduled DAG and a range of execution datetimes.
Do you mean that you want it to run every 2 hours continuously, but sometimes it will be running longer and you don't want it to overlap runs? Well, you definitely can schedule it to run every 2 hours (0 0/2 * * *) and set the max_active_runs to 1, so that if the prior run hasn't finished the next run will wait then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. Also you could create manual dag_runs for an unscheduled DAG; creating 10 at a time when you feel like (using the UI or CLI instead of the API, but the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow.
But by exploiting the meta-db, we can cook up this functionality ourselves:
We can write a custom operator / Python operator.
Before running the actual computation, check whether 'n' runs for the task (TaskInstance table) already exist in the meta-db. (Refer to task_command.py for help.)
If they do, just skip the task (raise AirflowSkipException).
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
Note
The downside of this approach is that it assumes historical runs of the task (TaskInstances) will be preserved forever (and correctly).
In practice, though, I've often found task_instances to be missing (we have catchup set to False).
Furthermore, on large Airflow deployments, one might need to set up routine cleanup of the meta-db, which would make this approach impossible.
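The skip logic described above might look roughly like the sketch below. The function run_at_most_n_times and its parameters are hypothetical names, and previous_runs is assumed to come from a count of TaskInstance rows in the meta-db (that query is not shown). The fallback class only exists so the sketch runs without Airflow installed.

```python
try:
    # Real Airflow exception, used when Airflow is installed.
    from airflow.exceptions import AirflowSkipException
except ImportError:
    # Stand-in so the sketch runs without Airflow installed.
    class AirflowSkipException(Exception):
        pass


def run_at_most_n_times(previous_runs: int, max_runs: int, work):
    """Run work() unless max_runs earlier runs already exist.

    previous_runs is assumed to be a count of earlier TaskInstance rows
    for this task, fetched from the meta-db by the caller.
    """
    if previous_runs >= max_runs:
        raise AirflowSkipException(
            f"already ran {previous_runs} times (limit {max_runs})"
        )
    return work()
```

Inside a python callable, this would wrap the actual computation, so the task is marked as skipped once the limit is reached.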

Airflow: Why is there a start_date for operators?

I don't understand why we need a 'start_date' for the operators (task instances). Shouldn't the one that we pass to the DAG suffice?
Also, if the current time is 7th Feb 2018 8.30 am UTC, and now I set the start_date of the dag to 7th Feb 2018 0.00 am with my cron expression for schedule interval being 30 9 * * * (daily at 9.30 am, i.e expecting to run in next 1 hour). Will my DAG run today at 9.30 am or tomorrow (8th Feb at 9.30 am )?
Regarding start_date on task instance, personally I have never used this, I always just have a single DAG start_date.
However from what I can see this would allow you to specify certain tasks to start at a different time from the main DAG. It appears this is a legacy feature and from reading the FAQ they recommend using time sensors for that type of thing instead and just having one start_date for all tasks passed through the DAG.
Your second question:
The execution date for a run is always the previous period based on your schedule.
From the docs (Airflow Docs)
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
To clarify:
If set on a daily schedule, on the 8th it will execute the 7th.
If set to a weekly schedule to run on a Sunday, the execution date for this Sunday would be last Sunday.
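Applied to the question's numbers, the rule can be sketched with plain datetime arithmetic. This assumes the interval-based scheduling described above, where the first schedule point at or after start_date becomes the execution date and the run is triggered one period later. The helper first_run is a hypothetical name and does not call Airflow.

```python
from datetime import datetime, timedelta


def first_run(start_date: datetime, hour: int, minute: int):
    """Return (execution_date, earliest_trigger) for a daily schedule.

    The first schedule point at or after start_date is the execution date;
    the run is triggered once that one-day period has ended.
    """
    point = start_date.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if point < start_date:
        point += timedelta(days=1)
    return point, point + timedelta(days=1)


# start_date of 7th Feb 2018 00:00 with the cron expression 30 9 * * *
execution_date, trigger_time = first_run(datetime(2018, 2, 7, 0, 0), 9, 30)
print(execution_date)  # 2018-02-07 09:30:00
print(trigger_time)    # 2018-02-08 09:30:00
```

Under these assumptions the first run is stamped 7th Feb 09:30 but only starts on 8th Feb at 09:30.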
Some complex requirements may need specific timings at the task level. For example, I may want my DAG to run each day for a full week before some aggregation logging task starts running, so to achieve this I could set different start dates at the task level.
A bit more useful info... looking through the Airflow DAG class source, it appears that setting the start_date at the DAG level simply means it is passed through to the task when no default value for the task start_date was passed in to the DAG via the default_args dict, and when no specific start_date is defined on a per-task level. So for any case where you want all tasks in a DAG to kick off at the same time (dependencies aside), setting start_date at the DAG level is sufficient.
Just to add to what is already here: a task that depends on other tasks must have a start date >= the start date of its dependencies.
For example:
if task_a depends on task_b
you cannot have
task_a start_date = 1/1/2019
task_b start_date = 1/2/2019
Otherwise, task_a will not be runnable for 1/1/2019 as task_b will not run for that date and you cannot mark it as complete either
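A quick way to check this constraint before deploying is sketched below. The dictionaries and the function find_start_date_conflicts are hypothetical and operate on plain dates rather than real Airflow task objects; the two dates are read here as 1 and 2 January 2019.

```python
from datetime import date


def find_start_date_conflicts(start_dates, upstream):
    """Return (task, dependency) pairs where the task starts before its dependency."""
    conflicts = []
    for task, deps in upstream.items():
        for dep in deps:
            if start_dates[task] < start_dates[dep]:
                conflicts.append((task, dep))
    return conflicts


start_dates = {"task_a": date(2019, 1, 1), "task_b": date(2019, 1, 2)}
upstream = {"task_a": ["task_b"]}  # task_a depends on task_b
print(find_start_date_conflicts(start_dates, upstream))  # [('task_a', 'task_b')]
```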
Why would you want this?
I would have liked this logic for a task, which was an external task sensor waiting for the completion of another dag. But the other dag had a start date after the current dag. Therefore, I didn't want the dependency in place for days when the other dag didn't exist
It's likely that you did not set the dag parameter of your tasks, as stated in:
https://stackoverflow.com/a/61749549/1743724

Oozie coordinator-app: execute job every Nth minute divisible by M

I have a Hive script that I am executing using an Oozie coordinator every 10 minutes. When I launch my Oozie coordinator-app, suppose I start it at 08:03: the first workflow starts at that time, the next at 08:13, then 08:23, and so on.
What I want is to execute the workflow every clock time hh:mm, where mm is divisible by 10. Assuming the same scenario above, what I want to happen is this: the first workflow will execute at 08:10, and then 08:20, and so on.
How do I do this in Oozie? How about when every 5 minutes (The last m of the minute is either 5 or 0)? Thanks for your input.
In order to run a coordinator job at a frequency, you can use the following directives
<coordinator-app name="app" frequency="10" start="2015-07-10T12:00Z" end="2016-01-01T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
This would run every 10 minutes, starting exactly at 12:00 UTC today. The same goes for running every 5 minutes: just replace frequency="10" with frequency="5". To have it run every Nth minute divisible by M, you will have to ensure that your start parameter is set correctly, that is, its minute must itself be divisible by M (08:10 rather than 08:03 in the example above).
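If the start value is generated by a launch script, rounding the current time up to the next aligned minute could look like this sketch. The function next_aligned_start is a hypothetical helper, not part of Oozie, and it assumes the interval divides 60 and that the given time is already in UTC.

```python
from datetime import datetime, timedelta


def next_aligned_start(now: datetime, every_minutes: int) -> str:
    """Round now up to the next minute divisible by every_minutes.

    Returns a timestamp in the same form as the start attribute above,
    e.g. 2015-07-10T12:00Z. Assumes every_minutes divides 60.
    """
    floored = now.replace(second=0, microsecond=0)
    remainder = floored.minute % every_minutes
    if remainder or floored != now:
        floored += timedelta(minutes=every_minutes - remainder)
    return floored.strftime("%Y-%m-%dT%H:%MZ")


print(next_aligned_start(datetime(2015, 7, 10, 8, 3), 10))  # 2015-07-10T08:10Z
print(next_aligned_start(datetime(2015, 7, 10, 8, 3), 5))   # 2015-07-10T08:05Z
```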
Another option if you are using a more recent version of Oozie (4.1.0) would be to use the cron like scheduler. This would allow you to schedule Oozie coordinators in a cron-like fashion if you're familiar. See http://blog.cloudera.com/blog/2014/04/how-to-use-cron-like-scheduling-in-apache-oozie/ and https://issues.apache.org/jira/browse/OOZIE-1306

Running a Job from Autosys every X minutes

I am looking to run a console application triggered from Autosys every X minutes.
The following commands do not seem to provide this capability
start_times: Exact time each day a job will run [cannot be used with start_mins]
start_mins: Minutes after each hour a job will execute [cannot be used with start_times]
The solution that I can see at the moment is to set start_mins : 0,5,10,15,20,25,30,35,40,45,50,55
This is ok if the time interval is 5 minutes, but becomes a little cumbersome if the interval is 1 or 2 minutes.
Is there any way to configure Autosys to easily repeat a job every x minutes ?
There is only one way for Autosys to start a job every minute:
start_mins: 0,1,2..59
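Since the list has to be spelled out, it can be generated rather than typed by hand. The helper below is a hypothetical sketch that only builds the text of the start_mins line; it assumes the interval divides 60 evenly, because otherwise the spacing would break at the top of each hour.

```python
def start_mins(interval: int) -> str:
    """Build a start_mins line for a job that runs every interval minutes."""
    if not 1 <= interval <= 60 or 60 % interval:
        raise ValueError("interval must divide 60 evenly")
    return "start_mins: " + ",".join(str(m) for m in range(0, 60, interval))


print(start_mins(5))  # start_mins: 0,5,10,15,20,25,30,35,40,45,50,55
```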
