Airflow v2.4.2 - New monthly DAG not running when scheduled

I have the following in the dag.py file. This is a newly pushed-to-prod DAG; it should have run at 14:00 UTC (9:00 EST) a few hours ago, but it still hasn't run, even though the UI is still saying it will run at 14:00 UTC.
DAG_NAME = "revenue_monthly"
START_DATE = datetime(2023, 1, 12)
SCHEDULE_INTERVAL = "0 14 3 * *"
default_args = {
'owner': 'airflow',
'start_date': START_DATE,
'depends_on_past': False
}
dag = DAG(DAG_NAME,
default_args=default_args,
schedule_interval=SCHEDULE_INTERVAL,
doc_md=doc_md,
max_active_runs=1,
catchup=False,
)
(Screenshot of the UI omitted.)

The date and time you are seeing as Next Run is the logical_date, which is the start of the data interval. With the current configuration the first DAG run will cover data from 2023-02-03 to 2023-03-03, so the DAG will only actually run on 2023-03-03 (the Run After date; you can see it when viewing the DAG by hovering over the schedule in the upper right corner).
Assuming you want the DAG to do the run it would have done on 2023-02-03 (today), you can achieve that by backfilling one run: either manually, or by using catchup=True with a start_date before 2023-01-03:
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator

DAG_NAME = "revenue_monthly_1"
START_DATE = datetime(2023, 1, 1)
SCHEDULE_INTERVAL = "0 14 3 * *"
doc_md = "documentation"

default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False,
}

with DAG(
    DAG_NAME,
    default_args=default_args,
    schedule_interval=SCHEDULE_INTERVAL,
    doc_md=doc_md,
    max_active_runs=1,
    catchup=True,
) as dag:
    t1 = EmptyOperator(task_id="t1")
This gave me one run with the run id scheduled__2023-01-03T14:00:00+00:00, and the next run has the data interval 2023-02-03 to 2023-03-03, which will Run After 2023-03-03.
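To make the interval arithmetic concrete, here is a small sketch of my own (not from the thread) showing how the run id and the data interval line up for this cron schedule:

from pendulum import datetime

# For SCHEDULE_INTERVAL = "0 14 3 * *" and START_DATE = datetime(2023, 1, 1):
data_interval_start = datetime(2023, 1, 3, 14)  # the logical_date, shown as "Next Run" in the UI
data_interval_end = datetime(2023, 2, 3, 14)    # the run is only created once this instant has passed
run_id = f"scheduled__{data_interval_start.isoformat()}"
print(run_id)  # scheduled__2023-01-03T14:00:00+00:00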
This guide might help with the terminology Airflow uses around schedules.

Related

Airflow job not running as scheduled

I have a job that I had set to run at 9:00 UTC on Wednesday. It didn't run as planned by the end of the delay interval, which I thought was curious because I believe I have everything defined properly.
import airflow.utils.dates
from datetime import timedelta
from airflow import DAG

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'noncomp_trial',
    default_args=default_args,
    description='test of dag',
    schedule_interval='0 9 * * 3',
    dagrun_timeout=timedelta(minutes=20))
If anyone has any advice here that would be greatly appreciated!
The Airflow Scheduler runs a task once start_date + one schedule_interval has passed. In your example, the DAG won't run until 9:00 AM on Wednesday of the following week.
See more information about the relationship between the start_date and schedule_interval here.
You could try setting your start_date to a static date a week or two in the past to see if that works. To make sure the Scheduler doesn't try to execute every start_date + schedule_interval occurrence between that new start_date and now, you can also set catchup=False on the DAG. For example:
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'noncomp_trial',
    default_args={
        'start_date': datetime(2021, 7, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='test of dag',
    schedule_interval='0 9 * * 3',
    dagrun_timeout=timedelta(minutes=20),
    catchup=False,
)
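If you want to sanity-check when that first run will actually be created, here is a sketch using the croniter package (a third-party library that Airflow itself depends on; treating it as importable in your environment is an assumption):

from datetime import datetime
from croniter import croniter  # assumed available, since Airflow depends on it

start_date = datetime(2021, 7, 1)         # a Thursday
cron = croniter('0 9 * * 3', start_date)  # Wednesdays at 09:00
interval_start = cron.get_next(datetime)  # 2021-07-07 09:00, the first logical date
interval_end = cron.get_next(datetime)    # 2021-07-14 09:00, when the first run is actually created
print(interval_start, interval_end)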

Apache Airflow scheduler not triggering once-a-month task as expected

Here is the DAG, which I want to execute on a fixed date every month; for now I have set it to the 18th. But the task is getting triggered daily by the scheduler, even though catchup_by_default = False is set in the airflow.cfg file.
from datetime import datetime

default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    'schedule_interval': '0 0 18 * *',
}
You have mentioned schedule_interval in your default_args, which is not the way to schedule a DAG. default_args are actually applied to the tasks, since they are passed to the Operators, not to the DAG itself.
You can modify your code by removing schedule_interval from default_args and passing it to the DAG instance instead, and you can set the catchup flag to False to avoid any backfills:
# assuming this is how you initialize your DAG
dag = DAG('your DAG UI name', default_args=default_args,
          schedule_interval='0 0 18 * *', catchup=False)
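Put together, a minimal corrected sketch might look like the following (the dag_id and the DummyOperator task are placeholders I made up, not from the original post):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator  # placeholder task

default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    # note: no schedule_interval in here -- it belongs on the DAG
}

dag = DAG(
    'monthly_job',                   # hypothetical dag_id
    default_args=default_args,
    schedule_interval='0 0 18 * *',  # 00:00 on the 18th of every month
    catchup=False,
)

task = DummyOperator(task_id='run_monthly', dag=dag)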

How to avoid a task run when it is already running

I have an Airflow task which is scheduled to run every 3 minutes.
Sometimes the task takes longer than 3 minutes, and the next scheduled run starts (or is queued) even though the previous one is still running.
Is there a way to define the DAG so that it does NOT even queue the task if it is already running?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators import MsSqlOperator

# other packages
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5),
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False)

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)

job2 = MsSqlOperator(
    task_id='refresh_tabular',
    mssql_conn_id='mssql_globrands',
    sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
    dag=dag)

job1 >> job2
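One relevant knob here (used by other DAGs on this page) is max_active_runs; below is a hedged sketch of applying it to the DAG above. Note that it caps the DAG at one concurrent run rather than skipping the tick: later runs wait as queued until the current one finishes.

from datetime import datetime
from airflow import DAG

# Sketch only: with max_active_runs=1, a new 3-minute tick will not start
# while the previous run is still active (it waits in the queue instead).
dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    start_date=datetime(2020, 7, 22, 15, 0, 0),
    max_active_runs=1,
    catchup=False,
)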

Airflow Scheduling: how to run initial setup task only once?

If my DAG is this
[setup] -> [processing-task] -> [end].
How can I schedule this DAG to run periodically, while running the [setup] task only once (on the first scheduled run) and skipping it for all later runs?
Check out this post on Medium, which describes how to implement a "run once" operator. I have successfully used this several times.
Here is a way to do it without needing to create a new class. I found this simpler than the accepted answer, and it worked well for my use case.
Might be useful for others!
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

with DAG(
    dag_id='your_dag_id',
    default_args={
        'depends_on_past': False,
        'email': ['you@email.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='Dag with initial setup task that only runs on start_date',
    start_date=datetime(2000, 1, 1),
    # Runs daily at 1 am
    schedule_interval='0 1 * * *',
    # catchup must be true if start_date is before datetime.now()
    catchup=True,
    max_active_runs=1,
) as dag:

    def branch_fn(**kwargs):
        # Have to make sure start_date will equal data_interval_start on first run
        # This dag is daily, but since the schedule_interval is set to 1 am,
        # data_interval_start would be 2000-01-01 01:00:00
        # when it needs to be 2000-01-01 00:00:00
        date = kwargs['data_interval_start'].replace(hour=0, minute=0, second=0, microsecond=0)
        if date == dag.start_date:
            return 'initial_task'
        else:
            return 'skip_initial_task'

    branch_task = BranchPythonOperator(
        task_id='branch_task',
        python_callable=branch_fn,
        provide_context=True,
    )

    initial_task = DummyOperator(
        task_id="initial_task"
    )

    skip_initial_task = DummyOperator(
        task_id="skip_initial_task"
    )

    next_task = DummyOperator(
        task_id="next_task",
        # This is important otherwise next_task would be skipped
        trigger_rule="one_success",
    )

    branch_task >> [initial_task, skip_initial_task] >> next_task
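For what it's worth, on newer Airflow versions the same branch can be written with the @task.branch decorator (available from roughly Airflow 2.3 onward; treat the exact version as an assumption). A sketch that would replace the BranchPythonOperator above, with the rest of the DAG unchanged:

from airflow.decorators import task

@task.branch(task_id='branch_task')
def branch_fn(data_interval_start=None, dag=None):
    # Context variables are injected by name into decorated tasks.
    # Same comparison as above: truncate to midnight so the first run matches start_date.
    date = data_interval_start.replace(hour=0, minute=0, second=0, microsecond=0)
    return 'initial_task' if date == dag.start_date else 'skip_initial_task'

branch_task = branch_fn()  # then wire branch_task exactly as above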

Apache Airflow: DAG executed twice before start_date

Hi everyone,
From the Airflow UI, we are trying to understand how to start a DAG run in the future at a specific time, but we always get 2 additional runs in catch-up mode (even though catch-up is disabled).
Example
Create a DAG run with the below parameters
start_date: 10:30
execution_date: not defined
interval = 3 minutes (from the .py file)
catchup_by_default = False
Turn the ON switch on at current time 10:28. What we get is Airflow triggering 2 DAG runs with execution_date at:
10:24
10:27
and these 2 DAG runs are run in catch-up mode one after the other, and that's not what we want :-(
What are we doing wrong?
We can sort of understand the 10:27 run (the ETL interval concept), but we do not get the 10:24 one :-(
Thank you for the help :-)
DETAILS:
OS: RedHat 7
Python: 2.7
Airflow: v1.8.0
DAG python file:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2017, 9, 7, 10, 30),
    'run_as_user': 'aa',
}

dag = DAG(
    'dag3', default_args=default_args, schedule_interval=timedelta(minutes=3))
dag.catchup = False

create_command = "/script.sh "

t1 = BashOperator(
    task_id='task',
    bash_command='date',
    dag=dag)
I tried with Airflow v1.8.0, Python v3.5, and a SQLite DB. The following DAG, unpaused at 10:28, is quite similar to yours, and works as it should (only one run, at 10:33, for 10:30).
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def print_hello_3min():
    return ('Hello world! %s' % datetime.now())

dag = DAG('hello_world_3min', description='Simple tutorial DAG 3min',
          schedule_interval='*/3 * * * *',
          start_date=datetime(2017, 9, 18, 10, 30),
          catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task_3min', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task_3min',
                                python_callable=print_hello_3min, dag=dag)

dummy_operator >> hello_operator
I'm not sure whether my solution is good enough, but I'd like to present my understanding.
There are 2 things to consider together:
1. The schedule_interval mode, such as 'hourly', 'daily', 'weekly', 'annually'. For example:
hourly = (* 1 * * *) = "At every minute past hour 1."
daily = (0 1 * * *) = "At 01:00."
monthly = (0 1 1 * *) = "At 01:00 on day-of-month 1."
2. The start_date. For example:
hourly = datetime(2019, 4, 5, 1, 30)
daily = datetime(2019, 4, 5)
monthly = datetime(2019, 4, 1)
My strategy is to set start_date by subtracting one unit of your interval mode from the expected start date and time.
Example:
To start the first job at 2019-4-5 01:00 with an hourly interval:
schedule_interval mode = hourly
expected start datetime = 2019-4-5 01:00
so, start_date = 2019-4-5 00:00 (minus 1 hour)
CRON = (* 1 * * *), which means "At every minute past hour 1."

from datetime import datetime
from airflow import DAG

default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 5, 0, 0),
    'run_as_user': 'aa',
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='* 1 * * *')
To start the first job at 2019-4-5 01:00 with a daily interval:
schedule_interval mode = daily
expected start datetime = 2019-4-5 01:00
so, start_date = 2019-4-4 (minus 1 day)
CRON = (0 1 * * *), which means "At 01:00."

default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 4),
    'run_as_user': 'aa',
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 * * *')
To start the first job at 2019-5-1 01:00 with a monthly interval:
schedule_interval mode = monthly
expected start datetime = 2019-5-1 01:00
so, start_date = 2019-4-1 (minus 1 month)
CRON = (0 1 1 * *), which means "At 01:00 on day-of-month 1."

default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 1),
    'run_as_user': 'aa',
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 1 * *')
So far, the strategy has been useful for me, but if anyone has a better one, please kindly share.
PS. I'm using https://crontab.guru to generate a perfect cron schedule.
This appears to happen exclusively when providing a timedelta as a schedule. Switch your schedule interval to a cron-formatted string and it won't run twice anymore.
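A minimal sketch of that switch, keeping the question's DAG otherwise unchanged:

from datetime import datetime
from airflow import DAG

dag = DAG(
    'dag3',
    start_date=datetime(2017, 9, 7, 10, 30),
    schedule_interval='*/3 * * * *',  # cron-formatted equivalent of timedelta(minutes=3)
    catchup=False,
)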
