I don't understand why we need a 'start_date' for the operators (task instances). Shouldn't the one that we pass to the DAG suffice?
Also, suppose the current time is 7th Feb 2018 8:30 am UTC, and I now set the start_date of the DAG to 7th Feb 2018 0:00 am, with the cron expression 30 9 * * * as the schedule interval (daily at 9:30 am, i.e. I expect it to run within the next hour). Will my DAG run today at 9:30 am or tomorrow (8th Feb at 9:30 am)?
Regarding start_date on task instances: personally I have never used this; I always just have a single DAG start_date.
However, from what I can see, this would allow you to specify certain tasks to start at a different time from the main DAG. It appears this is a legacy feature. From reading the FAQ, they recommend using time sensors for that type of thing instead, and just having one start_date for all tasks, passed through the DAG.
Your second question:
The execution date for a run is always the previous period based on your schedule.
From the docs (Airflow Docs)
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be trigger soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
To clarify:
If set on a daily schedule, on the 8th it will execute the 7th.
If set to a weekly schedule to run on a Sunday, the execution date for this Sunday would be last Sunday.
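Applied to the question's numbers, here is a minimal sketch of that rule in plain Python. It does not call Airflow; `first_daily_run` is a hypothetical helper, and it assumes the scheduler behaves as the quoted docs describe (a run fires once the period it covers has ended).

```python
from datetime import datetime, time, timedelta


def first_daily_run(start_date: datetime, run_at: time):
    """Return (execution_date, trigger_time) for the first run of a daily schedule."""
    # First schedule point at or after start_date.
    execution_date = datetime.combine(start_date.date(), run_at)
    if execution_date < start_date:
        execution_date += timedelta(days=1)
    # The run fires once the one-day period it covers has ended.
    trigger_time = execution_date + timedelta(days=1)
    return execution_date, trigger_time


execution_date, trigger_time = first_daily_run(datetime(2018, 2, 7, 0, 0), time(9, 30))
print(execution_date)  # 2018-02-07 09:30:00
print(trigger_time)    # 2018-02-08 09:30:00
```

Under those assumptions, the first run is stamped 7th Feb 9:30 am but is actually triggered on 8th Feb at 9:30 am, i.e. tomorrow rather than today.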
Some complex requirements may need specific timings at the task level. For example, I may want my DAG to run each day for a full week before some aggregation logging task starts running, so to achieve this I could set different start dates at the task level.
A bit more useful info: looking through the Airflow DAG class source, it appears that setting the start_date at the DAG level simply means it is passed through to the task when no default start_date was passed in to the DAG via the default_args dict, and no specific start_date is defined at the per-task level. So for any case where you want all tasks in a DAG to kick off at the same time (dependencies aside), setting start_date at the DAG level is sufficient.
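The fallback order described above could be modelled roughly like this. This is a simplified sketch of the answer's description, not Airflow's actual code; `resolve_start_date` is a hypothetical name.

```python
from datetime import datetime
from typing import Optional


def resolve_start_date(
    task_start: Optional[datetime],
    default_args: dict,
    dag_start: Optional[datetime],
) -> Optional[datetime]:
    """Pick the start_date a task ends up with (simplified model)."""
    # 1. A start_date set explicitly on the task wins.
    if task_start is not None:
        return task_start
    # 2. Otherwise use the default passed in via default_args.
    if "start_date" in default_args:
        return default_args["start_date"]
    # 3. Otherwise fall back to the DAG-level start_date.
    return dag_start


dag_start = datetime(2018, 2, 7)
print(resolve_start_date(None, {}, dag_start))                   # 2018-02-07 00:00:00
print(resolve_start_date(datetime(2018, 2, 14), {}, dag_start))  # 2018-02-14 00:00:00
```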
Just to add to what is already here: a task that depends on other tasks must have a start date greater than or equal to the start dates of its dependencies.
For example:
if task_a depends on task_b
you cannot have
task_a start_date = 1/1/2019
task_b start_date = 1/2/2019
Otherwise, task_a will not be runnable for 1/1/2019, because task_b will not run for that date and you cannot mark it as complete either.
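If you want to catch this before deploying, a small check along these lines might help. It is a plain-Python sketch, not an Airflow feature; `find_start_date_conflicts` and the two dictionaries are hypothetical, and the dates are read here as 1 Jan and 2 Jan 2019 (task_b is later under either reading).

```python
from datetime import date


def find_start_date_conflicts(start_dates, upstream_of):
    """Return (task, upstream) pairs where the task starts before its upstream."""
    conflicts = []
    for task, upstreams in upstream_of.items():
        for upstream in upstreams:
            if start_dates[task] < start_dates[upstream]:
                conflicts.append((task, upstream))
    return conflicts


start_dates = {"task_a": date(2019, 1, 1), "task_b": date(2019, 1, 2)}
upstream_of = {"task_a": ["task_b"]}  # task_a depends on task_b
print(find_start_date_conflicts(start_dates, upstream_of))  # [('task_a', 'task_b')]
```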
Why would you want this?
I would have liked this logic for a task that was an external task sensor waiting for the completion of another DAG. The other DAG had a start date after the current DAG's, so I didn't want the dependency in place for days when the other DAG didn't exist.
A likely cause is not setting the dag parameter on your tasks, as explained in:
https://stackoverflow.com/a/61749549/1743724
Related
For example, I created a new DAG at 11:30. The DAG is supposed to run on the first minute of every hour (cron: 0 * * * *). I expected the DAG to start at 12:00. However, the DAG first started at 13:00. The second run started at 14:00, while "Next run" in the UI showed 13:00.
I know this is mentioned in the docs:
**Let’s Repeat That**, the scheduler runs your job one schedule AFTER the start date, at the END of the interval.
But I don't understand the purpose of this. Why not run at the start of the day, like a cron job?
Does anyone have any documentation explaining this issue?
thanks
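The timings in this question are consistent with the quoted rule. A rough plain-Python sketch, assuming the DAG's start_date is effectively the 11:30 creation time (the calendar date is made up) and that the UI's "Next run" shows the run's execution date rather than its wall-clock trigger time; `hourly_runs` is a hypothetical helper:

```python
from datetime import datetime, timedelta


def hourly_runs(created_at, count):
    """Yield (execution_date, trigger_time) pairs for an hourly '0 * * * *' schedule."""
    # First top-of-hour at or after the DAG's start.
    first = created_at.replace(minute=0, second=0, microsecond=0)
    if first < created_at:
        first += timedelta(hours=1)
    for i in range(count):
        execution_date = first + timedelta(hours=i)
        # Each run fires at the end of the hour it covers.
        yield execution_date, execution_date + timedelta(hours=1)


for execution_date, trigger_time in hourly_runs(datetime(2021, 1, 1, 11, 30), 2):
    print(execution_date.time(), "->", trigger_time.time())
# 12:00:00 -> 13:00:00
# 13:00:00 -> 14:00:00
```

So the run stamped 12:00 starts at 13:00, and the run stamped 13:00 (the value that would appear as "Next run" under the assumption above) starts at 14:00.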
Our DAGs are scheduled at 1-minute intervals, but we need to run them at 30-second intervals. Is there any way to achieve this in Airflow?
If you check the schedule_interval documentation, you have two choices:
preset: The minimum period is hourly
cron expression: The standard cron expression doesn't handle seconds
So achieving this with the Airflow schedule interval alone is not possible, but you can handle it in your code. For example, you can create 2 DAGs with the same schedule interval and have one of them wait 30 seconds before running its code.
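A minimal sketch of the "wait 30 seconds" half of that workaround, in plain Python. `run_after_delay` and `job` are hypothetical names; in a real setup the second DAG's task would call something like this as its callable.

```python
import time
from typing import Callable


def run_after_delay(
    job: Callable[[], object],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
):
    """Wait delay_seconds, then run job and return its result."""
    sleep(delay_seconds)
    return job()


def job():
    # Placeholder for the real work both DAGs should perform.
    return "done"


# DAG 1 (hypothetical) would call job() directly at the start of each minute.
# DAG 2 (hypothetical), on the same one-minute schedule, would call:
#     run_after_delay(job, 30)
```

The `sleep` parameter is only there so the delay can be swapped out when testing; the offset is approximate, since scheduler and worker latency add to it.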
I don't understand what the use is for specifying a DAG start_date in the past. I've read about catchup and backfill but I still don't get it. In what context would I want to specify a start_date in the past?
For a scheduled run, the Airflow scheduler waits for the interval period to complete before running your DAG.
For instance, say you want to run your DAG on a monthly basis and schedule it as 0 3 11 * *, which means it should run at 3 AM on the 11th day of the month.
Now, say you deployed your DAG on the 10th of January, 2021. You would expect it to run the next day, but in reality Airflow won't trigger your DAG until the next month, i.e. 11th Feb, 2021. So Airflow will wait about one month before actually triggering the run that was supposed to happen on 11th Jan, 2021.
In this scenario, when you deploy your DAG you can set your start_date to 10th Dec, 2020, so that when the actual day (11th Jan, 2021) comes, the scheduler will consider the interval period complete and trigger your DAG.
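The two cases can be compared with a small plain-Python sketch. It hard-codes the 0 3 11 * * schedule instead of parsing cron; `next_monthly_point` and `first_interval` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def next_monthly_point(after: datetime, day: int = 11, hour: int = 3) -> datetime:
    """First 'day-of-month at hour:00' schedule point at or after `after`."""
    candidate = datetime(after.year, after.month, day, hour)
    if candidate < after:
        # Roll over to the same day and hour in the following month.
        if after.month == 12:
            candidate = datetime(after.year + 1, 1, day, hour)
        else:
            candidate = datetime(after.year, after.month + 1, day, hour)
    return candidate


def first_interval(start_date: datetime):
    """Return (interval_start, trigger_time) for the first scheduled run."""
    interval_start = next_monthly_point(start_date)
    # The run is triggered at the next schedule point after the interval starts.
    trigger_time = next_monthly_point(interval_start + timedelta(seconds=1))
    return interval_start, trigger_time


print(first_interval(datetime(2021, 1, 10))[1])   # 2021-02-11 03:00:00
print(first_interval(datetime(2020, 12, 10))[1])  # 2021-01-11 03:00:00
```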
For more details, see: https://www.astronomer.io/guides/scheduling-tasks
I have an Airflow process that runs every Sunday at 12:00 am. Is there a way to trigger this process at exactly the same (absolute) time every week, regardless of the previous run's duration or outcome? I see that the start time of the process keeps creeping forward, to the point that after a couple of weeks it gets triggered a full 16 hours later than the scheduled time. How do I make it start at exactly the same time, regardless of the previous run's outcome or whether it was previously triggered manually (cron-like behaviour)?
Add the depends_on_past argument to your DAG's default_args. Setting it to False ensures that new DAG runs are created every interval without depending on the status of the previous DAG run:
'depends_on_past': False
It might not be necessary, but I recommend restarting your scheduler after making this change.
I am quite confused about the concept of time in Airflow. As far as I know, there are at least 5 different time concepts in Airflow:
1. UTC time in the web UI
2. Execution time, e.g. start_date, in the DAG file
3. Run time
4. Time saved in the database
5. System time
Could anyone explain which are the same and which are different?
The UTC time in the web UI (1) is the system time (5) being presented.
The run time (3) shows when a task was actually run, in UTC, and this is the time saved in the database (4), as it corresponds to when a certain task was actually run in a DAG.
The start_date (2) should be a static, non-changing date, as it corresponds to when the DAG was first started and should be backfilled from if certain parameters like catchup or depends_on_past are set to true.
So 1 and 5 are the same, 3 and 4 are the same, and 2 is its own entity of time.
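To illustrate why a static start_date matters, here is a rough plain-Python sketch. It assumes, as in the scheduling rule discussed earlier in this thread, that a run only fires once its interval has fully elapsed; `first_interval_closed` and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def first_interval_closed(start_date: datetime, interval: timedelta, now: datetime) -> bool:
    """True once the first schedule interval after start_date has fully elapsed."""
    return start_date + interval <= now


interval = timedelta(days=1)
now = datetime(2018, 2, 9, 12, 0)  # hypothetical "current" time

# Static start_date: the first interval closes a day later and stays closed.
print(first_interval_closed(datetime(2018, 2, 7), interval, now))  # True

# A start_date recomputed as "now" every time the DAG file is parsed keeps
# moving forward, so start_date + interval is always in the future.
print(first_interval_closed(now, interval, now))  # False
```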