How to prevent catch up of a DAG? - airflow

I have the following DAG:
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2018, 07, 19, 11,0,0),
'email': ['me#me.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 5,
'retry_delay': timedelta(minutes=2),
'catchup' : False,
'depends_on_past' : False,
}
with DAG('some_dag', schedule_interval=timedelta(minutes=30), max_active_runs=1, default_args=default_args) as dag:
This dag runs every 30 minutes. It rewrite data in the table (delete all and write). So if Airflow was down for 2 days there is no point in running all the missing dag runs during that time.
However the above definition does not work. After 2 days that airflow was down it still try to run all the missing tasks.
How can i solve this?

OK. I solved it.
Apparently there is no meaning for 'catchup' : False on the default_args . It does nothing.
I changed the code to
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2018, 07, 19, 11,0,0),
'email': ['me#me.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 5,
'retry_delay': timedelta(minutes=2),
'depends_on_past' : False,
}
with DAG('some_dag', catchup=False, schedule_interval=timedelta(minutes=30), max_active_runs=1, default_args=default_args) as dag:
now it works.

according to the docs: https://airflow.apache.org/scheduler.html#backfill-and-catchup
Adding dag.catchup = False to the DAG args.
Adding catchup_by_default = False to the airflow.cfg
Depends on your use case, a good practice could be set catchup_by_default = False and then only use dag.catchup = True if a given DAG requires the catchup.

Related

Different way to set Airflow DAG fields

I would like to create an Airflow DAG and want to learn which parameters should be set in field_1 vs default_args vs args?
my_dag = DAG(
"my_dag",
"field_1"="xxx",
default_agrs=default_args,
**args
)
I checked with documentation, I understand that some parameters such as "owner" have to be set through the default_args and can't be in field_1. But looks like there's no difference for the most of parameters. I tested some fields such as "catchup" and "on_failure_callback", and they all work in all these three places.
So I wonder what's the best practice of setting parameters when create a dag?
The best practice is something like the Airflow tutorial
with DAG(
'tutorial',
# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args={
'depends_on_past': False,
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
# 'queue': 'bash_queue',
# 'pool': 'backfill',
# 'priority_weight': 10,
# 'end_date': datetime(2016, 1, 1),
# 'wait_for_downstream': False,
# 'sla': timedelta(hours=2),
# 'execution_timeout': timedelta(seconds=300),
# 'on_failure_callback': some_function,
# 'on_success_callback': some_other_function,
# 'on_retry_callback': another_function,
# 'sla_miss_callback': yet_another_function,
# 'trigger_rule': 'all_success'
},
description='A simple tutorial DAG',
schedule_interval=timedelta(days=1),
start_date=datetime(2021, 1, 1),
catchup=False,
tags=['example'],
) as dag:
...
Reference: https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#example-pipeline-definition
But I use something like that and is enough:
import pendulum
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'retries': 1,
'retry_delay': timedelta(minutes=1)
}
with DAG(
default_args=default_args,
dag_id='dag_etl',
catchup=False,
start_date=pendulum.datetime(year=2022, month=1, day=1, tz='America/Chicago'),
schedule_interval='0 8 * * *', # https://crontab.guru/#0_8_*_*_*
description='DAG Extract Transform Load'
) as dag:
...

Airflow The requested task could not be added to the DAG because a task with task_id ... is already in the DAG

I've seen a few responses to this before but they haven't worked for me.
I'm running the Bridge 1.10.15 airflow version so we can migrate to Airflow 2, and I ran airflow upgrade_check and I'm seeing the below error:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
same error is happening but with task_id: snp_bl_global_article_reporting and snp_bl_global_video_reporting
I've also seen someone recommend setting load_examples = False in the airflow.cfg file which I already have.
Here is my code:
DAG_NAME = 'snapchat_snowflake_daily'
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 6, 12),
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'provide_context': True,
'on_failure_callback': task_fail_slack_alert,
'sla': timedelta(hours=24),
}
dag = DAG(
DAG_NAME,
default_args=default_args,
catchup=False,
schedule_interval='0 3 * * *')
with dag:
s3_to_snowflake = SnowflakeLoadOperator(
task_id=f'load_into_snowflake_for_{region}',
pool='airflow_load',
retries=0, )
snp_il_global = SnowflakeQueryOperator(
task_id='snp_il_global',
sql='queries/snp_il_gl.sql',
retries=0)
snp_bl_global_video_reporting = SnowflakeQueryOperator(
task_id='snp_bl_global_video_reporting',
sql='snp_bl_gl_reporting.sql',
retries=0)
snp_bl_global_content_reporting = SnowflakeQueryOperator(
task_id='snp_bl_global_content_reporting',
sql='snp_bl_global_c.sql')
snp_bl_global_article_reporting = SnowflakeQueryOperator(
task_id='snp_bl_global_article_reporting',
sql='snp_bl_global_a.sql',
retries=0)
s3_to_snowflake >> snp_il_global >> [
snp_bl_global_video_reporting,
snp_bl_global_content_reporting,
snp_bl_global_article_reporting
]

Problem with start date and scheduled date in Apache Airflow

I am working with Apache Airflow and I have a problem with the scheduled day and the starting day.
I want a DAG to run every day at 8:00 AM UTC. So, I did:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 12, 7, 10, 0,0),
'email': ['example#emaiil.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='id', default_args=default_args, schedule_interval='0 8 * * *',catchup=True)
The day I upload the DAG was 2020-12-07 and I wanted to run it on 2020-12-08 at 08:00:00.
I set the start_date at 2020-12-07 at 10:00:00 to avoid running it at 2020-12-07 at 08:00:00 and only trigger it the next day, but it didn't work.
Then I modified the starting day:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 12, 7, 7, 59,0),
'email': ['example#emaiil.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(hours=5)
}
# Never run
dag = DAG(dag_id='etl-ca-cpke-spark_dev_databricks', default_args=default_args, schedule_interval='0 8 * * *',catchup=True)
Now the start date is 1 minute before the DAG should run, and indeed, because the catchup is set to True, the DAG has been triggered for 2020-12-07 at 08:00:00, but it has not being triggered for 2020-12-08 at 08:00:00.
Why?
Airflow schedules tasks at the end of the interval (See documentation reference)
Meaning that when you do:
start_date: datetime(2020, 12, 7, 8, 0,0)
schedule_interval: '0 8 * * *'
The first run will kick in at 2020-12-08 at 08:00+- (depends on resources)
This run's execution_date will be: 2020-12-07 08:00
The next run will kick in at 2020-12-09 at 08:00
This run's execution_date of 2020-12-08 08:00.
Since today is 2020-12-08 the next run didn't kick in because it's not the end of the interval yet.

How to avoid run of task when already running

I have an airflow task which scheduled to run every 3 minutes.
Sometimes the duration of the task is longer than 3 minutes, and the next schedule start (or queued), despite it is already running.
Is there a way to define the dag to NOT even queue the task if it is already in run?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators import MsSqlOperator
# other packages
from datetime import datetime
from datetime import timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2020, 7, 22, 15,00,00),
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(seconds=5)
}
dag = DAG(
dag_id='sales',
description='Run sales',
schedule_interval = '*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
default_args=default_args,
catchup = False)
job1 = BashOperator(
task_id='sales',
bash_command='python2 /home/manager/ETL/sales.py',
dag = dag)
job2 = MsSqlOperator(
task_id='refresh_tabular',
mssql_conn_id='mssql_globrands',
sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
dag = dag)
job1>>job2

Airflow dag not running when scheduled (while others scheduled at same time, do)?

I have 2 dags in airflow, both of which are scheduled to run at 22 UTC time (12PM my time (HST)). I find that only one of these dags is running at this time and am not sure why this would be happening. I can manually start the other dag while the one that works is running, but it just does not start on it's own.
Here is the dag configs for the dag that is running on schedule
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 10, 13),
'email': [
'me#co.org'
],
'email_on_failure': True,
'retries': 0,
'retry_delay': timedelta(minutes=5),
'max_active_runs': 1,
}
dag = DAG('my_dag_1', default_args=default_args, catchup=False, schedule_interval="0 22 * * *")
Here is the dag configs for the dag that is failing to run
default_args = {
'owner': 'me',
'depends_on_past': False,
'start_date': datetime(2019, 10, 13),
'email': [
'me#co.org',
],
'email_on_failure': True,
'retries': 0,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('my_dag_2', default_args=default_args,
max_active_runs=1,
catchup=False, schedule_interval=f"0 19,22,1 * * *")
# run setup dag and trigger at 9AM, 12PM,and 3PM (need to convert from UTC time (-2 HST))
From the airflow.cfg file, some of the settings that I think are relevant are set as...
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
#parallelism = 32
parallelism = 8
# The number of task instances allowed to run concurrently by the scheduler
#dag_concurrency = 16
dag_concurrency = 3
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# The maximum number of active DAG runs per DAG
#max_active_runs_per_dag = 16
max_active_runs_per_dag = 1
Not sure what could be going on here. Is there some setting that I am mistakenly switching on that stops multiple different dags from running at the same time? Any more debugging info to get to add to this question?

Resources