Not receiving email on Airflow DAG timeout

I am using Airflow 1.8. I want Airflow to send an email when a DAG times out. Currently it sends emails when a task times out or when there is a retry, but not when the DAG itself times out. The DAG is intentionally set to run every minute, the task sleeps for 10 seconds, and the DAG timeout is 5 seconds. The DAG run fails, but no email is sent.
Here is my code for the DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['email@email.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'execution_timeout': timedelta(seconds=60)
}

schedule = '* * * * *'

dag = DAG('leader_dag',
          default_args=default_args,
          catchup=False,
          dagrun_timeout=timedelta(seconds=5),
          schedule_interval=schedule)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='sleep 10',
    dag=dag)
Here is the SMTP part of my airflow.cfg:
email_backend = airflow.utils.email.send_email_smtp

[smtp]
# If you want airflow to send emails on retries, failure, and you want to use
# the airflow.utils.send_email function, you have to configure an smtp
# server here
smtp_host = ***********.amazonaws.com
smtp_starttls = True
smtp_ssl = False
smtp_user = user
smtp_port = 25
smtp_password = password
smtp_mail_from = no-reply@example.com
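For reference, a minimal sketch of one way to get a notification when the run times out, assuming a newer Airflow version that supports DAG-level callbacks (1.10+; I don't believe 1.8 has them): dagrun_timeout fails the DAG run without failing a task, so the task-level email_on_failure never fires, but a DAG-level on_failure_callback can send the mail itself through airflow.utils.email.send_email. The recipient address is a placeholder.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.utils.email import send_email

def notify_dag_failure(context):
    # Runs when the DAG run fails, which includes exceeding dagrun_timeout.
    send_email(
        to=['email@email.com'],  # placeholder recipient
        subject='DAG {} failed or timed out'.format(context['dag'].dag_id),
        html_content='Run {} did not finish within dagrun_timeout.'.format(
            context['execution_date']),
    )

dag = DAG(
    'leader_dag',
    start_date=datetime(2015, 6, 1),
    schedule_interval='* * * * *',
    catchup=False,
    dagrun_timeout=timedelta(seconds=5),
    on_failure_callback=notify_dag_failure,  # DAG-level, not per-task
)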

Airflow: The requested task could not be added to the DAG because a task with task_id ... is already in the DAG

I've seen a few responses to this before, but they haven't worked for me.
I'm running the bridge release, Airflow 1.10.15, so we can migrate to Airflow 2. I ran airflow upgrade_check and I'm seeing the error below:
/usr/local/lib/python3.7/site-packages/airflow/models/dag.py:1342:
PendingDeprecationWarning: The requested task could not be added to
the DAG with dag_id snapchat_snowflake_daily because a task with
task_id snp_bl_global_content_reporting is already in the DAG.
Starting in Airflow 2.0, trying to overwrite a task will raise an
exception.
The same error is happening, but with task_id snp_bl_global_article_reporting and snp_bl_global_video_reporting.
I've also seen someone recommend setting load_examples = False in the airflow.cfg file, which I already have.
Here is my code:
DAG_NAME = 'snapchat_snowflake_daily'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 6, 12),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'provide_context': True,
    'on_failure_callback': task_fail_slack_alert,
    'sla': timedelta(hours=24),
}

dag = DAG(
    DAG_NAME,
    default_args=default_args,
    catchup=False,
    schedule_interval='0 3 * * *')

with dag:
    s3_to_snowflake = SnowflakeLoadOperator(
        task_id=f'load_into_snowflake_for_{region}',
        pool='airflow_load',
        retries=0, )

    snp_il_global = SnowflakeQueryOperator(
        task_id='snp_il_global',
        sql='queries/snp_il_gl.sql',
        retries=0)

    snp_bl_global_video_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_video_reporting',
        sql='snp_bl_gl_reporting.sql',
        retries=0)

    snp_bl_global_content_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_content_reporting',
        sql='snp_bl_global_c.sql')

    snp_bl_global_article_reporting = SnowflakeQueryOperator(
        task_id='snp_bl_global_article_reporting',
        sql='snp_bl_global_a.sql',
        retries=0)

    s3_to_snowflake >> snp_il_global >> [
        snp_bl_global_video_reporting,
        snp_bl_global_content_reporting,
        snp_bl_global_article_reporting
    ]
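For reference, a minimal sketch (hypothetical, not the DAG above) of how this warning is usually produced and avoided: it appears whenever two operators are created with the same task_id in one DAG, typically inside a loop, so making the generated task_id unique per iteration is the usual fix.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('duplicate_task_id_demo',
         start_date=datetime(2020, 6, 12),
         schedule_interval=None,
         catchup=False) as dag:
    for region in ['us', 'eu', 'apac']:
        # Using a constant task_id here would reproduce the warning;
        # including the loop variable keeps every task_id unique.
        DummyOperator(task_id=f'load_into_snowflake_for_{region}')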

How to avoid running a task when it is already running

I have an Airflow task scheduled to run every 3 minutes.
Sometimes the task takes longer than 3 minutes, and the next scheduled run starts (or is queued) even though the previous one is still running.
Is there a way to define the DAG so that the task is not even queued if it is already running?
# airflow related
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators import MsSqlOperator
# other packages
from datetime import datetime
from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4,5,6,7,8,9,10,11,12,13,14,15,16,17 * * 0-5',
    default_args=default_args,
    catchup=False)

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)

job2 = MsSqlOperator(
    task_id='refresh_tabular',
    mssql_conn_id='mssql_globrands',
    sql="USE msdb ; EXEC dbo.sp_start_job N'refresh Management-sales' ; ",
    dag=dag)

job1 >> job2
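For reference, a hedged sketch of the most common approach: max_active_runs=1 on the DAG caps it at one active run at a time, so a new scheduled run waits (queued) until the running one finishes instead of executing in parallel. Names below mirror the DAG above; the cron expression is an equivalent shorthand.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 22, 15, 0, 0),
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

dag = DAG(
    dag_id='sales',
    description='Run sales',
    schedule_interval='*/3 4-17 * * 0-5',  # same hours as 4,5,...,17 above
    default_args=default_args,
    catchup=False,
    max_active_runs=1)  # at most one active run of this DAG at any time

job1 = BashOperator(
    task_id='sales',
    bash_command='python2 /home/manager/ETL/sales.py',
    dag=dag)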

Airflow DAG not running when scheduled (while others scheduled at the same time do)?

I have 2 DAGs in Airflow, both of which are scheduled to run at 22 UTC (12 PM my time, HST). I find that only one of these DAGs runs at this time and am not sure why. I can manually start the other DAG while the one that works is running, but it just does not start on its own.
Here is the DAG config for the DAG that runs on schedule:
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 10, 13),
    'email': [
        'me@co.org'
    ],
    'email_on_failure': True,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

dag = DAG('my_dag_1', default_args=default_args, catchup=False, schedule_interval="0 22 * * *")
Here is the DAG config for the DAG that fails to run on schedule:
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 10, 13),
    'email': [
        'me@co.org',
    ],
    'email_on_failure': True,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('my_dag_2', default_args=default_args,
          max_active_runs=1,
          catchup=False, schedule_interval="0 19,22,1 * * *")
# run setup dag and trigger at 9 AM, 12 PM, and 3 PM (need to convert from UTC time (-2 HST))
From the airflow.cfg file, some of the settings that I think are relevant are set as...
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
#parallelism = 32
parallelism = 8
# The number of task instances allowed to run concurrently by the scheduler
#dag_concurrency = 16
dag_concurrency = 3
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# The maximum number of active DAG runs per DAG
#max_active_runs_per_dag = 16
max_active_runs_per_dag = 1
Not sure what could be going on here. Is there some setting that I am mistakenly switching on that stops multiple different DAGs from running at the same time? Any more debugging info I should add to this question?
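One hedged observation that may or may not be related: max_active_runs is a DAG-level argument, so placing it inside default_args (as my_dag_1 does) has no effect, because default_args is only passed through to tasks. Only my_dag_2 actually applies it via the DAG constructor. A sketch of the intended placement:
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 10, 13),
    'email': ['me@co.org'],
    'email_on_failure': True,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
    # note: no 'max_active_runs' here; default_args only feeds task arguments
}

dag = DAG('my_dag_1',
          default_args=default_args,
          max_active_runs=1,  # DAG-level settings go on the constructor
          catchup=False,
          schedule_interval='0 22 * * *')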

How to retry an upstream task?

task a > task b > task c
If C fails I want to retry A. Is this possible? There are a few other tickets which involve subdags, but I would like to just be able to clear A.
I'm hoping to use on_retry_callback in task C but I don't know how to call task A.
There is another question which does this in a subdag, but I am not using subdags.
I'm trying to do this, but it doesn't seem to work:
def callback_for_failures(context):
    print("*** retrying ***")
    if context['task'].upstream_list:
        context['task'].upstream_list[0].clear()
As other comments mentioned, I would use caution to make sure you aren't getting into an endless loop of clearing/retries. But you can call a bash command as part of your on_failure_callback and specify which tasks you want to clear, and whether you want downstream/upstream tasks cleared, etc.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

def clear_upstream_task(context):
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=5)
}

with DAG('clear_upstream_task',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval=timedelta(minutes=5),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(
        task_id='t0'
    )

    t1 = DummyOperator(
        task_id='t1'
    )

    t2 = DummyOperator(
        task_id='t2'
    )

    t3 = BashOperator(
        task_id='t3',
        bash_command='exit 123',
        on_failure_callback=clear_upstream_task
    )

    t0 >> t1 >> t2 >> t3
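As a hedged follow-up on the endless-loop caveat: try_number on the failing task keeps growing across clears, so it can serve as a crude cap on how many times the upstream task gets re-cleared (the limit of 3 is arbitrary, and the snippet re-uses the imports from the DAG above).
def clear_upstream_task_guarded(context):
    # Stop re-clearing after a few attempts instead of looping forever.
    if context['ti'].try_number > 3:
        return None
    execution_date = context.get("execution_date")
    clear_tasks = BashOperator(
        task_id='clear_tasks',
        bash_command=f'airflow tasks clear -s {execution_date} -t t1 -d -y clear_upstream_task'
    )
    return clear_tasks.execute(context=context)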

Airflow Scheduling: how to run initial setup task only once?

If my DAG is this:
[setup] -> [processing-task] -> [end]
How can I schedule this DAG to run periodically, while running the [setup] task only once (on the first scheduled run) and skipping it for all later runs?
Check out this post on Medium, which describes how to implement a "run once" operator. I have successfully used it several times.
Here is a way to do it without needing to create a new class. I found this simpler than the accepted answer and it worked well for my use case.
Might be useful for others!
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

with DAG(
    dag_id='your_dag_id',
    default_args={
        'depends_on_past': False,
        'email': ['you@email.com'],
        'email_on_failure': True,
        'email_on_retry': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='Dag with initial setup task that only runs on start_date',
    start_date=datetime(2000, 1, 1),
    # Runs daily at 1 am
    schedule_interval='0 1 * * *',
    # catchup must be true if start_date is before datetime.now()
    catchup=True,
    max_active_runs=1,
) as dag:

    def branch_fn(**kwargs):
        # Have to make sure start_date will equal data_interval_start on first run
        # This dag is daily, but since the schedule_interval is set to 1 am,
        # data_interval_start would be 2000-01-01 01:00:00 when it needs to be
        # 2000-01-01 00:00:00
        date = kwargs['data_interval_start'].replace(hour=0, minute=0, second=0, microsecond=0)
        if date == dag.start_date:
            return 'initial_task'
        else:
            return 'skip_initial_task'

    branch_task = BranchPythonOperator(
        task_id='branch_task',
        python_callable=branch_fn,
        provide_context=True
    )

    initial_task = DummyOperator(
        task_id="initial_task"
    )

    skip_initial_task = DummyOperator(
        task_id="skip_initial_task"
    )

    next_task = DummyOperator(
        task_id="next_task",
        # This is important otherwise next_task would be skipped
        trigger_rule="one_success"
    )

    branch_task >> [initial_task, skip_initial_task] >> next_task
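A hedged alternative, in case relying on start_date matching data_interval_start feels brittle: drive the same branch off an Airflow Variable flag instead (the Variable name below is made up for illustration), which also works with catchup=False. The trade-off is that the flag is set as soon as the branch runs, so a failed setup would not be retried on the next schedule unless the Variable is deleted manually.
from airflow.models import Variable

def branch_on_setup_flag(**kwargs):
    # First run: the Variable is missing, so take the setup branch and set the flag.
    if Variable.get('your_dag_id_setup_done', default_var='false') != 'true':
        Variable.set('your_dag_id_setup_done', 'true')
        return 'initial_task'
    return 'skip_initial_task'

# Drop-in replacement for branch_fn in the DAG above:
# branch_task = BranchPythonOperator(task_id='branch_task',
#                                    python_callable=branch_on_setup_flag)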
