For some reason, Airflow doesn't seem to trigger the latest run for a DAG with a weekly schedule interval.
Current Date:
$ date
Tue Aug 9 17:09:55 UTC 2016
DAG:
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
dag = DAG(
    dag_id='superdag',
    start_date=datetime(2016, 7, 18),
    schedule_interval=timedelta(days=7),
    default_args={
        'owner': 'Jon Doe',
        'depends_on_past': False
    }
)

BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)
Run scheduler
$ airflow scheduler -d superdag
You'd expect a total of four DAG Runs as the scheduler should backfill for 7/18, 7/25, 8/1, and 8/8.
However, the last run is not scheduled.
EDIT 1:
I understand that, Vineet, although it doesn't seem to explain my issue.
In my example above, the DAG’s start date is July 18.
First DAG Run: July 18
Second DAG Run: July 25
Third DAG Run: Aug 1
Fourth DAG Run: Aug 8 (not run)
Where each DAG Run processes data from the previous week.
Today being Aug 9, I would expect the fourth DAG Run to have executed with an execution date of Aug 8, processing data for the last week (Aug 1 until Aug 8), but it doesn't.
Airflow always schedules for the previous period. So if you have a DAG that is scheduled to run daily, on Aug 9th it will schedule a run with execution_date Aug 8th. Similarly, if the schedule interval is weekly, then on Aug 9th it will schedule for one week back, i.e. Aug 2nd, though this run happens on Aug 9th itself. This is just Airflow bookkeeping. You can find this in the Airflow wiki (https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls):
Understanding the execution date
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
This date is available to you in both Jinja and a Python callable's context in many forms, as documented here. As a note, ds refers to date string, not date start, which may confuse some.
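The bookkeeping described above can be sketched in plain Python, using the dates from the question. The loop is only an illustration of the rule, not Airflow's actual scheduler code:

```python
from datetime import datetime, timedelta

start = datetime(2016, 7, 18)      # the DAG's start_date
interval = timedelta(days=7)       # the weekly schedule_interval
now = datetime(2016, 8, 9)         # "today" in the question

# A run with a given execution_date is only triggered once
# execution_date + interval has passed.
triggered = []
execution_date = start
while execution_date + interval <= now:
    triggered.append(execution_date.strftime("%Y-%m-%d"))
    execution_date += interval

print(triggered)  # ['2016-07-18', '2016-07-25', '2016-08-01']
# The 2016-08-08 run will only fire on or after 2016-08-15.
```

This matches the observed behavior: three runs exist on Aug 9, and the Aug 8 run is not yet due.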
A similar issue happened to me as well.
I solved it by manually running
airflow backfill -s start_date -e end_date DAG_NAME
where start_date and end_date cover the missing execution_date, in your case 2016-08-08.
For example,
airflow backfill -s 2016-08-07 -e 2016-08-09 DAG_NAME
I have also encountered a similar problem these days while learning Apache Airflow.
I think, as Vineet has explained, given the way Airflow works, you should probably use the execution date as the beginning of the DAG execution, and not as the end of the DAG execution as you said below.
I understand that, Vineet, although it doesn't seem to explain my issue.
In my example above, the DAG’s start date is July 18.
First DAG Run: July 18
Second DAG Run: July 25
Third DAG Run: Aug 1
Fourth DAG Run: Aug 8 (not run)
Where each DAG Run processes data from the previous week.
To make it work, you should probably use, for instance, July 18 as the start of the DAG execution covering the week July 18 to July 22, instead of as the end of the DAG execution covering the week July 11 to July 15.
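In other words, treat the execution_date as the start of the data period rather than its end. A minimal sketch of that convention (the helper name is just for illustration, not an Airflow API):

```python
from datetime import datetime, timedelta

# Hypothetical helper: interpret execution_date as the *start*
# of the week being processed.
def week_processed(execution_date):
    return execution_date, execution_date + timedelta(days=7)

start, end = week_processed(datetime(2016, 7, 18))
print(start.date(), end.date())  # 2016-07-18 2016-07-25
```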
Related
I have an Airflow DAG set up to run monthly (with the @monthly schedule_interval). The next DAG runs seem to be scheduled, but they don't appear as "queued" in the Airflow UI. I don't understand, because everything seems fine otherwise. Here is how my DAG is configured:
with DAG(
    "dag_name",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
Do you get no runs at all when you unpause your DAG, or one backfilled run that says Last run 2023-01-01, 00:00:00?
In the latter case Airflow is behaving as intended: the run that just happened is the one that would actually have been queued and run at midnight on 2023-02-01. :)
I used your configuration on a new simple DAG and it gave me one backfilled successful run with the run ID scheduled__2023-01-01T00:00:00+00:00, i.e. running for the data interval 2023-01-01 (logical_date) to 2023-02-01: the run that would actually have been queued at midnight on 2023-02-01.
The next run is scheduled for the logical date 2023-02-01, which means it covers the data from 2023-02-01 to 2023-03-01. This run will only actually be queued at midnight on 2023-03-01, as the Run After date shows.
This guide might help with terminology Airflow uses around schedules.
I'm assuming you wanted the DAG to backfill two runs, one that would have happened on 2023-01-01 and one that would have happened on 2023-02-01. This DAG should do that:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.empty import EmptyOperator

with DAG(
    "dag_name_3",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
    t1 = EmptyOperator(task_id="t1")
I wanted my DAG to have its first run at 2:00 AM on the 25th, and from then on run daily Tuesday to Saturday at 2:00 AM.
The following is what my scheduling looks like:
with DAG(
    dag_id='in__xxx__agnt_brk_com',
    schedule_interval='0 2 * * 2-6',
    start_date=datetime(2022, 10, 24),
    catchup=False,
) as dag:
And the Airflow UI also shows that my first run should be on the 25th at 2:00 AM. But unfortunately, the DAG didn't execute on time.
What am I missing here?
Airflow is scheduling at 2am in your local time, that is 6am in UTC.
Take a look at this link on how to specify the timezone in your dag: https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
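To see the offset concretely, here is a stdlib sketch. The timezone America/New_York (UTC-4 in late October) is only an assumption for illustration; substitute your own local zone:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Airflow treats a naive start_date as UTC. If you meant 2:00 AM in,
# say, America/New_York (an assumption for illustration), that local
# time is actually 6:00 AM UTC:
local = datetime(2022, 10, 25, 2, 0, tzinfo=ZoneInfo("America/New_York"))
print(local.astimezone(timezone.utc))  # 2022-10-25 06:00:00+00:00
```

In the DAG itself, the usual fix described in the linked docs is a timezone-aware start_date, e.g. pendulum.datetime(2022, 10, 24, tz="America/New_York").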
You should consult the documentation on dag intervals.
Your dag did not run on the 25th:
Your start date is 2022-10-24
This creates a dag interval of 2022-10-24 - 2022-10-25.
You set catchup=False
You created the dag after midnight on the 25th. The dag interval for the 24th has passed and you've denied catchup.
The next dag run is scheduled for 2022-10-25
This creates a dag interval of 2022-10-25 - 2022-10-26
Your dag will run at 2am UTC on the 26th.
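The interval arithmetic in that list can be sketched with plain datetime math (dates from the question; this is an illustration, not Airflow internals):

```python
from datetime import datetime, timedelta

start_date = datetime(2022, 10, 24)
interval = timedelta(days=1)

# The first dag interval is [start_date, start_date + interval).
first_interval = (start_date, start_date + interval)
print(first_interval[1])  # 2022-10-25 00:00:00

# With catchup=False that already-elapsed interval is skipped, so the
# first real run covers 2022-10-25 -> 2022-10-26 and fires at the end
# of that interval, at 2 am UTC on the 26th per the cron expression.
```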
I have the following dag config:
with DAG(
    dag_id='dag_example',
    catchup=False,
    start_date=datetime.datetime(2022, 5, 26),
    schedule_interval='0 6,7,9,11,15,19,23 * * *',
    max_active_runs=1,
    default_args=default_args
)
I would like to know why my DAG that is scheduled to run at 7 AM is running at 9 AM (the next scheduled date...). I'm using Airflow 2.1.2. When I was using Airflow v1 the DAG ran correctly.
This is how Airflow works.
DAGs are scheduled at the end of the interval.
So in your case, the run_id of 2022-05-27 10:00 will start running on 2022-05-27 12:00, because the interval you set is 2 hours and Airflow schedules at the end of the interval.
Note: This is consistent with batch processing practices.
If you run a daily job, then today you are processing yesterday's data.
If you run an hourly job, then at 10:00 you are processing the interval between 09:00 and 10:00; in other words, the run_id of 09:00 will actually run at the end of the hourly interval, which is 10:00.
You can read Problem with start date and scheduled date in Apache Airflow for more information
Should you want to reference a specific interval from your DAG, it is just a question of which macro to use. See Templates reference
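The end-of-interval rule reduces to one line of date arithmetic (using the hourly example from above):

```python
from datetime import datetime, timedelta

interval = timedelta(hours=1)
logical_date = datetime(2022, 5, 27, 9, 0)  # the run_id / execution_date

# A run is started once its interval has fully elapsed:
actual_start = logical_date + interval
print(actual_start)  # 2022-05-27 10:00:00
```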
I have the following airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
# Runs every minute
dag = DAG(dag_id='example_dag', start_date=datetime(2020, 1, 1), schedule_interval='*/1 * * * *')
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
The problem here is that Airflow is scheduling and executing tasks from past dates: the first minute of 2020, the second minute of 2020, the third minute of 2020, and so on.
I want Airflow to schedule and execute only the tasks that occur after the DAG is deployed (i.e. if I deploy today, I want the first task to be executed within the next minute), not to execute expired tasks.
Any advice? Thanks!
I found the answer here. Read the "Catchup and Idempotent DAG" section.
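Concretely, the usual fix described there is catchup=False, which tells the scheduler not to create runs for past intervals. A sketch based on the DAG from the question (imports kept in the question's Airflow 1.x style):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# catchup=False: skip the missed intervals since 2020 and only
# schedule runs from the most recent interval onwards.
dag = DAG(
    dag_id='example_dag',
    start_date=datetime(2020, 1, 1),
    schedule_interval='*/1 * * * *',
    catchup=False,
)
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
```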
Situation:
Airflow 1.10.6
it's November 18th, 8 p.m.
airflow.cfg.default_timezone = system (i.e. Europe/Berlin)
I want to run my new "sample_job" every day at 8.05 p.m.
My configuration:
default_args = {
    'owner': 'Airflow',
    'start_date': datetime.datetime(year=2019, month=11, day=18, hour=20, minute=0),
    'execution_timeout': timedelta(hours=13)
}

dag = DAG(
    'sample_job',
    default_args=default_args,
    catchup=False,
    max_active_runs=1,
    schedule_interval='05 20 * * *')
Now when I activate the job at 8.03 pm I realize that the job is executed immediately with yesterday's date as last_run date.
How do I have to change my settings so that the job is not executed before 8.05 pm?
The very first DAG run is triggered soon after start_date + schedule_interval [1]. Your schedule interval is one day and you want the first DAG run to start after 2019-11-18 20:05, so your start_date should be 2019-11-17 20:05.
As to why a DAG run is started as soon as you turn the DAG on, I suspect the reason is that you scheduled this DAG with a different start_date or schedule_interval before. If start_date or schedule_interval is changed, it is recommended to change the dag_id too [2], because then a fresh set of metadata (and a new schedule) is created for the renamed DAG.
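The first rule above can be checked with simple date arithmetic:

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)
desired_first_run = datetime(2019, 11, 18, 20, 5)

# The first run fires at start_date + schedule_interval,
# so set start_date one interval before the desired first run:
start_date = desired_first_run - schedule_interval
print(start_date)  # 2019-11-17 20:05:00
```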