Issue with Airflow version 1.10.1 - airflow

I recently upgraded Airflow to version 1.10.1 and turned on some DAGs that had previously been switched OFF.
I always use today's date as the DAG's start_date.
After I turned the DAGs on, the following issue appeared.
The scheduler starts those DAGs, but it does not pick up their tasks. The Task Instance Details page shows "The execution date is 2018-12-04T13:00:00+00:00 but this is before the task's start date 2019-02-04T00:00:00+00:00." The tasks run only after I trigger them manually.
Is there any way to fix this other than changing the DAG's start_date, e.g. a config setting or some other option that lets me bypass the check of the execution date against the task's start date?
My main goal is to run the DAG's old schedule without manual intervention.

You should not use a dynamic start date, especially not today's date or datetime.now(). Have a read of the official docs at https://airflow.readthedocs.io/en/stable/faq.html#what-s-the-deal-with-start-date for more details.
I know you asked for a suggestion other than changing the start date, but your start date has to be before the task's execution date. Hence, I would strongly recommend changing your start_date to a fixed date such as datetime(2018, 1, 1).
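To make the error message concrete, here is a minimal sketch of the comparison behind it, using the two timestamps from the question. The function name passes_start_date_check is hypothetical and this is not Airflow's own code; it only illustrates why a start_date of "today" rejects older execution dates while a fixed past date accepts them.

```python
from datetime import datetime, timezone


def passes_start_date_check(execution_date: datetime, start_date: datetime) -> bool:
    """Toy version of the check: a run is eligible only if it is not before start_date."""
    return execution_date >= start_date


# Timestamps taken from the error message in the question.
execution_date = datetime(2018, 12, 4, 13, 0, tzinfo=timezone.utc)
dynamic_start = datetime(2019, 2, 4, tzinfo=timezone.utc)  # "today" when the DAG file was parsed
static_start = datetime(2018, 1, 1, tzinfo=timezone.utc)   # fixed date suggested in the answer

print(passes_start_date_check(execution_date, dynamic_start))  # False
print(passes_start_date_check(execution_date, static_start))   # True
```

With a dynamic start date, the boundary moves forward every time the DAG file is parsed, so older scheduled runs can never satisfy the check.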

Related

Run a new task under the Airflow DAG which is already succeeded

I've added a new task to an existing DAG, and since its deployment (say, 2022-03-08) it has been running fine.
However, I also want to run this task for the days before it was deployed, say from 2022-03-01 to 2022-03-07, because I have to load earlier data. The existing tasks already finished successfully for those dates.
How can I achieve that without manually running the newly added task for each of those dates?
The new task runs fine from its deployment date, but how would I trigger it for the previous dates, or at least for the dates for which I have data?
The solution might not be perfect, but at least it works:
Mark the step as failed.
Then clear it.
Note: the second step can easily be done in bulk on the HOST/taskinstance/list page.
My solution: I created a separate DAG containing only the newly added tasks and passed start_date and end_date when initializing the DAG. This separate DAG loaded the missing data.
I did this on Airflow version 1.10.10.
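As a rough sketch of the separate-DAG approach, the arguments for such a bounded DAG could look like the dictionary below, which would be passed to the DAG constructor as DAG(**backfill_dag_kwargs). The dag_id is a hypothetical name, and the daily schedule with catchup enabled is an assumption about the original DAG. The helper only lists the execution dates the bounded DAG would be expected to cover; it does not call Airflow.

```python
from datetime import datetime, timedelta

# Hypothetical keyword arguments for the separate, bounded DAG.
backfill_dag_kwargs = {
    "dag_id": "new_task_backfill",       # hypothetical name
    "start_date": datetime(2022, 3, 1),  # first date to load
    "end_date": datetime(2022, 3, 7),    # last date to load
    "schedule_interval": "@daily",       # assumption: the original DAG runs daily
    "catchup": True,                     # let the scheduler create the past runs
}


def daily_execution_dates(start_date, end_date):
    """List the daily execution dates from start_date to end_date, inclusive."""
    days = (end_date - start_date).days
    return [start_date + timedelta(days=i) for i in range(days + 1)]


dates = daily_execution_dates(
    backfill_dag_kwargs["start_date"], backfill_dag_kwargs["end_date"]
)
for d in dates:
    print(d.date())
```

Because the DAG has an end_date, it stops creating runs once the historical range is covered and does not compete with the original DAG afterwards.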
Here is a solution:
If you need to backfill, choose the date in the Base date field.
Click the Go button.
Click on the newly added task.
Clear the task with the Downstream and Future options selected.
Note: this will clear all future downstream tasks.

Airflow - Scheduler is always one week in delay

I'm trying to make my DAGs run every Monday at 08:00 AM. For this purpose, I have defined the corresponding schedule interval, schedule_interval='0 8 * * 1'.
However, two problems arise, which are likely due to the same issue:
My DAGs never seem to trigger.
When I force the DAGs to run, they always run for the previous Monday, e.g. if I force the start today (21-10-2021) it will actually trigger a run on the previous week's Monday, 11-09-2021.
Why does this occur and how can I fix it?
It's not delayed.
Airflow schedules tasks at the END of the interval. You can check this answer for more details.
This behavior makes sense in the ETL domain, as you normally run ETL at the end of a specific time interval. For example: today you are parsing yesterday's data.
That said, in Airflow >= 2.2.0 a new concept, Timetables, was introduced with the completion of AIP-39 (Richer schedule_interval); see the release notes. In simple words, Airflow decoupled when to run (Timetable) from what interval of time to process (Data Interval), which resolves the issue you are experiencing at the root. You can read the documentation about it here.
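The end-of-interval rule can be illustrated with plain date arithmetic. The sketch below is not Airflow code: the function name is hypothetical, it hard-codes the Monday 08:00 weekly schedule from the question, and it uses naive datetimes, ignoring time zones. It returns the execution date of the most recent run that would already have started at a given moment.

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)


def latest_started_execution_date(now: datetime) -> datetime:
    """For a Monday 08:00 weekly schedule, return the execution_date of the
    most recent run that has already started at `now`."""
    # Most recent Monday 08:00 at or before `now`: the moment the latest run started.
    monday = (now - timedelta(days=now.weekday())).replace(
        hour=8, minute=0, second=0, microsecond=0
    )
    if monday > now:
        monday -= WEEK
    # That run covers the week that just ended, so its execution_date is one week earlier.
    return monday - WEEK


# Thursday 21 October 2021, the day mentioned in the question.
print(latest_started_execution_date(datetime(2021, 10, 21, 12, 0)))  # 2021-10-11 08:00:00
```

Under this rule the latest run on that Thursday is stamped with the Monday of the previous week, even though it started on the Monday of the current week.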

Why does Airflow skip a day for a daily DAG?

I have a newly created daily DAG that I set up yesterday (Jan. 25th). Once Airflow loaded it, I could see that it ran once (scheduled_2021-01-24T00:00:00+00:00), and then I manually triggered it once just to see if it works, and it did (manual_2021-01-25).
The time is now 08:24 UTC on Jan. 26th, but I do not see any run for 01-25. I used airflow dags next-execution and found that Airflow is planning to execute the DAG for 01-26 directly, possibly at 01-27 00:00 UTC. So it will skip 01-25 entirely.
Why does it behave this way? Is there a reason behind it?
This is THE most difficult concept to grasp in Airflow. After you get this, the rest of the system is fairly straightforward. But this one design spec is brutal; I have seen it bring seasoned engineers to their knees, sobbing in fits of rage.
As the other poster mentioned, citing the Airflow docs, Airflow runs your job at the end of the period. This is easiest for me to visualize for a DAG that has a daily schedule. The DAG run dated 01/01/2021, with a start time of 00:01 AM, will not execute until 01/02/2021 00:01 AM.
The confusing part of this is WHY!? When you stop to think about why Airflow was written it begins to make sense. This execution pattern ensures that the data for the run date 01/01/2021 is complete and ready when your orchestration pipeline runs to act on this data. Think about it as a business process. If you are a business analyst and come into work on 01/02/2021 you will be looking at data from the day before, not data from today. The data from today has not yet been collected.
The same pattern is true for weekly or monthly intervals as well. The data for that week or month is not going to be ready to act on until the end of the period.
This also makes more sense when you start using the macros and Jinja templating.
Hopefully this is now clear as mud.
This is actually a bug in Airflow 2.0.0 release which was fixed in 2.0.1: https://github.com/apache/airflow/issues/13434
This is a feature of Airflow that confused me too, in the beginning. From the Airflow docs:
If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59.
Let’s Repeat That, the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
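The two examples quoted in these answers follow from the same one-line rule, sketched here in plain Python. The function name earliest_trigger_time is hypothetical, and the sketch assumes a fixed-length interval; it is an illustration of the arithmetic, not of the scheduler itself.

```python
from datetime import datetime, timedelta


def earliest_trigger_time(execution_date: datetime, schedule_interval: timedelta) -> datetime:
    """A run stamped with execution_date starts once its interval has ended."""
    return execution_date + schedule_interval


one_day = timedelta(days=1)

# Run dated 01/01/2021 00:01 starts on 01/02/2021 00:01.
print(earliest_trigger_time(datetime(2021, 1, 1, 0, 1), one_day))  # 2021-01-02 00:01:00

# Run with execution_date 2019-11-21 starts soon after 2019-11-21T23:59.
print(earliest_trigger_time(datetime(2019, 11, 21), one_day))      # 2019-11-22 00:00:00
```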

Airflow - How to skip next day's instance?

I have a DAG 'abc' scheduled to run every day at 7 AM CST. For some reason, I do not want to run tomorrow's instance. How can I skip that particular instance? Is there any way to do that from the command line? I'd appreciate any help on this.
I believe you can preemptively create a DAG Run for the future date in the UI under Browse -> DAG Run -> Create, initializing it in the success (or failed) state, which should prevent the scheduler from creating a new run when the time comes. I think you can do this on the CLI with trigger_dag as well, but you'll need to update its state separately because it will default to running.
I think you can set the start_date to the day after tomorrow, or to whatever date you want your DAG to run, as long as it is in the future. The schedule interval will stay the same, every day at 7 AM. You can set the start date in default_args.

Why does Airflow reschedule tasks that did not exist at the time when clearing other tasks

When clearing a task of a DAG for January and February 2019, I noticed that all tasks of this DAG that did not exist at the time were triggered.
I'm wondering why this happens. I suppose the scheduler is kind of "forced" to look at the DAG runs of January and February, and because the tasks that did not exist at the time never ran for these execution dates, they get triggered. But I'd like to put concrete words on this vague understanding of the situation.
Can I avoid this? This creates unexpected behavior and has me doubting before launching a big replay of a month that is long past :)
We have also encountered this problem, and I think it makes sense. As the Airflow documentation states:
Once you clear a DAG, it will be cleared as if it never runs.
So, in my understanding, it will check all DAG and task instances all over again and run all the tasks until it reaches the schedule time.
Can I avoid this? I'm no Airflow expert, but I think that as of now we can't. What we normally do is duplicate the DAG we want to rerun and set start_date and end_date on the copy, so it will not interfere with the current DAG that is running normally.
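One way to picture the behavior described in the question is the toy model below. It is only a sketch of the asker's and answerer's understanding, not Airflow's scheduler logic: the task names are hypothetical, and the assumption is that a reactivated DAG run schedules every task of the current DAG definition that has no recorded state for that run.

```python
def tasks_to_run(dag_task_ids, task_instance_states):
    """Return the tasks with no recorded state for a DAG run, in DAG order."""
    return [t for t in dag_task_ids if task_instance_states.get(t) is None]


# Hypothetical DAG: "load_new_table" was added after the January runs finished.
dag_task_ids = ["extract", "transform", "load", "load_new_table"]
january_run_states = {"extract": "success", "transform": "success", "load": "success"}

# Clearing "transform" removes its state and makes the old run active again.
january_run_states["transform"] = None

print(tasks_to_run(dag_task_ids, january_run_states))  # ['transform', 'load_new_table']
```

In this model the newly added task is picked up together with the cleared one, because from the run's point of view neither has ever run.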
