Airflow Scheduling: how to run initial setup task only once?

If my DAG is
[setup] -> [processing-task] -> [end],
how can I schedule this DAG to run periodically, while running the [setup] task only once (on the first scheduled run) and skipping it for all later runs?

Check out this post on Medium, which describes how to implement a "run once" operator. I have successfully used this several times.
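The gist of such an operator is to skip itself whenever a previous successful instance of the task exists. A minimal sketch of that pattern (the class name and the metadata-DB query here are illustrative, not the post's exact code):

from airflow.exceptions import AirflowSkipException
from airflow.models import BaseOperator, TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State

class RunOnceOperator(BaseOperator):
    """Illustrative 'run once' operator: skips itself after its first success."""

    @provide_session
    def execute(self, context, session=None):
        # Look for any prior successful instance of this task in the metadata DB.
        prior_success = (
            session.query(TaskInstance)
            .filter(
                TaskInstance.dag_id == self.dag_id,
                TaskInstance.task_id == self.task_id,
                TaskInstance.state == State.SUCCESS,
            )
            .count()
        )
        if prior_success:
            raise AirflowSkipException("Setup already ran once; skipping.")
        # ...one-time setup work goes here...

Note that tasks downstream of such an operator need a trigger rule like 'none_failed', since a skip cascades through the default all_success rule.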

Here is a way to do it without needing to create a new class. I found this simpler than the accepted answer and it worked well for my use case.
Might be useful for others!
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

with DAG(
    dag_id='your_dag_id',
    default_args={
        'depends_on_past': False,
        'email': ['you@email.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='Dag with initial setup task that only runs on start_date',
    start_date=datetime(2000, 1, 1),
    # Runs daily at 1 am
    schedule_interval='0 1 * * *',
    # catchup must be True if start_date is before datetime.now()
    catchup=True,
    max_active_runs=1,
) as dag:

    def branch_fn(**kwargs):
        # Have to make sure start_date equals data_interval_start on the first run.
        # This DAG is daily, but since schedule_interval fires at 1 am,
        # data_interval_start would be 2000-01-01 01:00:00 when it needs to be
        # 2000-01-01 00:00:00.
        date = kwargs['data_interval_start'].replace(hour=0, minute=0, second=0, microsecond=0)
        if date == dag.start_date:
            return 'initial_task'
        else:
            return 'skip_initial_task'

    branch_task = BranchPythonOperator(
        task_id='branch_task',
        python_callable=branch_fn,
        # provide_context is no longer needed in Airflow 2; context is passed automatically
    )

    initial_task = DummyOperator(
        task_id="initial_task"
    )

    skip_initial_task = DummyOperator(
        task_id="skip_initial_task"
    )

    next_task = DummyOperator(
        task_id="next_task",
        # Important: without this trigger rule next_task would be skipped,
        # since one of its two parents is always skipped by the branch
        trigger_rule="one_success"
    )

    branch_task >> [initial_task, skip_initial_task] >> next_task
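A variation, if you'd rather not reason about the data-interval offset at all: have the branch check whether the DAG run has a predecessor. A sketch relying on DagRun.get_previous_dagrun(), which returns None on the first run:

def branch_fn(dag_run, **kwargs):
    # The first scheduled run has no previous DagRun.
    if dag_run.get_previous_dagrun() is None:
        return 'initial_task'
    return 'skip_initial_task'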

Related

Airflow v2.4.2 - New monthly DAG not running when scheduled

I have the following in the dag.py file for a newly pushed-to-prod DAG. It should have run at 14 UTC (9 EST) a few hours ago, but it still hasn't run, even though the UI still says it will run at 14 UTC.
DAG_NAME = "revenue_monthly"
START_DATE = datetime(2023, 1, 12)
SCHEDULE_INTERVAL = "0 14 3 * *"
default_args = {
'owner': 'airflow',
'start_date': START_DATE,
'depends_on_past': False
}
dag = DAG(DAG_NAME,
default_args=default_args,
schedule_interval=SCHEDULE_INTERVAL,
doc_md=doc_md,
max_active_runs=1,
catchup=False,
)
[Screenshot of the UI omitted]
The date and time you are seeing as Next Run is the logical_date, which is the start of the data interval. With the current configuration the first DAG run will be on data from 2023-02-03 to 2023-03-03, so the DAG will only actually run on 2023-03-03 (the Run After date; you can see it when viewing the DAG by hovering over the schedule in the upper right corner).
Assuming you want the DAG to do the run it would have done on 2023-02-03 (today), you can achieve that by backfilling one run: either manually, or by using catchup=True with a start_date before 2023-01-03:
from airflow import DAG
from pendulum import datetime
from airflow.operators.empty import EmptyOperator

DAG_NAME = "revenue_monthly_1"
START_DATE = datetime(2023, 1, 1)
SCHEDULE_INTERVAL = "0 14 3 * *"
doc_md = "documentation"

default_args = {
    'owner': 'airflow',
    'start_date': START_DATE,
    'depends_on_past': False
}

with DAG(
    DAG_NAME,
    default_args=default_args,
    schedule_interval=SCHEDULE_INTERVAL,
    doc_md=doc_md,
    max_active_runs=1,
    catchup=True,
) as dag:
    t1 = EmptyOperator(task_id="t1")
This gave me one run with the run id scheduled__2023-01-03T14:00:00+00:00, and a next run for the data interval 2023-02-03 to 2023-03-03, which will run after 2023-03-03.
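If you'd rather backfill manually, a one-off CLI call along these lines should do it (dates taken from above; double-check the dag_id against your deployment):

airflow dags backfill revenue_monthly -s 2023-01-03 -e 2023-01-04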
This guide might help with the terminology Airflow uses around schedules.

Apache airflow scheduler not triggering once in a month task as expected

Here is the DAG, which I want to execute on a fixed date every month; for now I have set it to the 18th of every month.
But the task is getting triggered daily by the scheduler, even though catchup_by_default = False is set in the airflow.cfg file.
default_args = {
    'owner': 'muddassir',
    'depends_on_past': True,
    'start_date': datetime(2021, 3, 18),
    'retries': 1,
    'schedule_interval': '0 0 18 * *',
}
You have mentioned schedule_interval in your default_args, which is not the way to schedule a DAG. default_args are actually applied to the tasks, as they are passed to the Operators and not to the DAG itself.
You can modify your code as follows, simply by removing schedule_interval from default_args and passing it to the DAG instance instead; you can also set the catchup flag to False to avoid any backfills:
# assuming this is how you initialize your DAG
dag = DAG('your DAG UI name', default_args=default_args,
          schedule_interval='0 0 18 * *', catchup=False)
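One follow-up worth knowing: because scheduled runs execute at the end of their data interval, the run covering the period that starts on the 18th of a month is only triggered on the 18th of the following month, so the first run may land later than the start_date suggests.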

How to avoid run of task when already running

I have an Airflow task scheduled to run every 3 minutes.
Sometimes the task takes longer than 3 minutes, and the next scheduled run starts (or is queued) even though the previous one is still running.
Is there a way to define the DAG so it does NOT even queue the task if it is already running?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.mssql_operator import MsSqlOperator
# other packages
from datetime import datetime
from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False)

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)

job2 = MsSqlOperator(
    task_id='refresh_tabular',
    mssql_conn_id='mssql_globrands',
    sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
    dag=dag)

job1 >> job2
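The usual fix here is to cap the DAG to a single active run, so the scheduler won't start the next run while one is still in flight. A sketch against the DAG above (only the max_active_runs line is new):

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False,
    # Only one DagRun may be active at a time; later runs wait instead of overlapping.
    max_active_runs=1)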

How to retry an upstream task?

task a > task b > task c
If C fails I want to retry A. Is this possible? There are a few other tickets which involve subdags, but I would like to just be able to clear A.
I'm hoping to use on_retry_callback in task C but I don't know how to call task A.
There is another question which does this in a subdag, but I am not using subdags.
I'm trying to do this, but it doesn't seem to work:
def callback_for_failures(context):
    print("*** retrying ***")
    if context['task'].upstream_list:
        context['task'].upstream_list[0].clear()
As other comments mentioned, use caution to make sure you aren't getting into an endless loop of clearing/retries. But you can call a bash command as part of your on_failure_callback and then specify which tasks you want to clear, and whether you want downstream/upstream tasks cleared, etc.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(
        task_id='t0'
    )

    t1 = DummyOperator(
        task_id='t1'
    )

    t2 = DummyOperator(
        task_id='t2'
    )

    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        on_failure_callback=clear_upstream_task
    )

    t0 >> t1 >> t2 >> t3
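For reference, my reading of the clear command's flags: -s sets the start date, -t is a regex matched against task ids, -d also clears downstream tasks of the matched ones, -y skips the confirmation prompt, and the final argument is the DAG id.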

Apache Airflow: DAG executed twice before start_date

Hi Everyone,
From the Airflow UI, we are trying to understand how to start a DAG run in the future at a specific time, but we always get 2 additional runs in catch-up mode (even though catch-up is disabled)
Example
Create a DAG run with the below parameters
start_date: 10:30
execution_date: not defined
interval = 3 minutes (from the .py file)
catchup_by_default = False
Turn the ON switch at current time 10:28. What we get is that Airflow triggers 2 DAG runs, with execution_date at:
10:24
10:27
and these 2 DAG runs are run in catch-up mode one after the other, and that's not what we want :-(
What are we doing wrong?
We can maybe understand the 10:27 run (the ETL concept), but we do not get the 10:24 one :-(
Thank you for the help :-)
DETAILS:
OS: RedHat 7
Python: 2.7
Airflow: v1.8.0
DAG python file:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2017, 9, 7, 10, 30),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, schedule_interval=timedelta(minutes=3))
dag.catchup = False

create_command = "/script.sh "

t1 = BashOperator(
    task_id='task',
    bash_command='date',
    dag=dag)
I tried with Airflow v1.8.0, Python v3.5, and a SQLite DB. The following DAG, unpaused at 10:28, is quite similar to yours, and works as it should (only one run, at 10:33, for 10:30).
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def print_hello_3min():
    return ('Hello world! %s' % datetime.now())

dag = DAG('hello_world_3min', description='Simple tutorial DAG 3min',
          schedule_interval='*/3 * * * *',
          start_date=datetime(2017, 9, 18, 10, 30),
          catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task_3min', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task_3min',
                                python_callable=print_hello_3min, dag=dag)

dummy_operator >> hello_operator
I'm not sure whether my solution is good enough, but I'd like to present my understanding.
There are 2 things to consider together:
The schedule_interval mode, such as 'hourly', 'daily', 'weekly', 'annually':
hourly = (* 1 * * *) = “At every minute past hour 1.”
daily = (0 1 * * *) = “At 01:00.”
monthly = (0 1 1 * *) = “At 01:00 on day-of-month 1.”
The start_date:
hourly = datetime(2019, 4, 5, 1, 30)
daily = datetime(2019, 4, 5)
monthly = datetime(2019, 4, 1)
My strategy is to set start_date by subtracting one unit of your interval mode from the expected start date & time.
Example:
To start the first job at 2019-4-5 01:00 with an hourly interval:
schedule_interval mode = hourly
expected start datetime = 2019-4-5 01:00
so start_date = 2019-4-5 00:00 (minus 1 hour)
CRON = (* 1 * * *), which means “At every minute past hour 1.”
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 5, 0, 0),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='* 1 * * *')
To start the first job at 2019-4-5 01:00 with a daily interval:
schedule_interval mode = daily
expected start datetime = 2019-4-5 01:00
so start_date = 2019-4-4 (minus 1 day)
CRON = (0 1 * * *), which means “At 01:00.”
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 4),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 * * *')
To start the first job at 2019-4-5 01:00 with a monthly interval:
schedule_interval mode = monthly
expected start datetime = 2019-4-5 01:00
so start_date = 2019-4-4 (minus 1 day)
CRON = (0 1 1 * *), which means “At 01:00 on day-of-month 1.”
default_args = {
    'owner': 'aa',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 4),
    'run_as_user': 'aa'
}

dag = DAG(
    'dag3', default_args=default_args, catchup=False, schedule_interval='0 1 1 * *')
So far, the strategy has been useful for me, but if anyone has a better one, please kindly share.
PS. I'm using https://crontab.guru to generate the cron schedules.
This appears to happen exclusively when providing a timedelta as a schedule. Switch your schedule interval to cron format and it won't run twice anymore.
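A minimal sketch of that switch, using the 3-minute schedule from the question ('*/3 * * * *' should be the cron equivalent of timedelta(minutes=3)):

# Instead of a timedelta schedule:
#   dag = DAG('dag3', default_args=default_args, schedule_interval=timedelta(minutes=3))
# use the equivalent cron expression:
dag = DAG('dag3', default_args=default_args, catchup=False,
          schedule_interval='*/3 * * * *')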
