Airflow on_failure_callback - airflow

I have an Airflow DAG with two tasks:
read_csv
process_file
They work fine on their own. I purposely created a typo in a pandas Dataframe to learn how on_failure_callback works and to see if it is being triggered. It seems likes from the log that it doesn't:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1197, in handle_failure
task.on_failure_callback(context)
TypeError: on_failure_callback() takes 0 positional arguments but 1 was given
Why isn't on_failure_callback working?
Here is a visual representation of the DAG:
Here is the code:
try:
from datetime import timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
import pandas as pd
# Setting up Triggers
from airflow.utils.trigger_rule import TriggerRule
# for Getting Variables from airlfow
from airflow.models import Variable
print("All Dag modules are ok ......")
except Exception as e:
print("Error {} ".format(e))
def read_csv(**context):
data = [{"name":"Soumil","title":"Full Stack Software Engineer"}, { "name":"Nitin","title":"Full Stack Software Engineer"},]
df = pd.DataFramee(data=data)
dag_config = Variable.get("VAR1")
print("VAR 1 is : {} ".format(dag_config))
context['ti'].xcom_push(key='mykey', value=df)
def process_file(**context):
instance = context.get("ti").xcom_pull(key='mykey')
print(instance.head(2))
return "Process complete "
def on_failure_callback(**context):
print("Fail works ! ")
with DAG(dag_id="invoices_dag",
schedule_interval="#once",
default_args={
"owner": "airflow",
"start_date": datetime(2020, 11, 1),
"retries": 1,
"retry_delay": timedelta(minutes=1),
'on_failure_callback': on_failure_callback,
},
catchup=False) as dag:
read_csv = PythonOperator(
task_id="read_csv",
python_callable=read_csv,
op_kwargs={'filename': "Soumil.csv"},
provide_context=True
)
process_file = PythonOperator(
task_id="process_file",
python_callable=process_file,
provide_context=True
)
read_csv >> process_file
# ====================================Notes====================================
# all_success -> triggers when all tasks arecomplete
# one_success -> trigger when one task is complete
# all_done -> Trigger when all Tasks are Done
# all_failed -> Trigger when all task Failed
# one_failed -> one task is failed
# none_failed -> No Task Failed
# ==============================================================================
# ============================== Executor====================================
# There are Three main types of executor
# -> Sequential Executor run single task in linear fashion wih no parllelism default Dev
# -> Local Exector run each task in seperate process
# -> Celery Executor Run each worker node within multi node architecture Most scalable
# ===========================================================================

You need to specify one argument to your function that can receive the context this is due to how Airflow triggers on_failure_callback
def on_failure_callback(context):
print("Fail works ! ")
Note that with your implementation you can't tell from the message which task has failed so you might want to add to your error message the task details like:
def on_failure_callback(context):
ti = context['task_instance']
print(f"task {ti.task_id } failed in dag { ti.dag_id } ")

Related

Airflow XCOMs communication from BashOperator to PythonOperator

I'm new to Apache Airflow and trying to write my first Dag which has a task based on another task (using ti.xcom_pull)
PS : I run Airflow in WSL Ubuntu 20.04 using VScode.
I created a task 1 (task_id = "get_datetime") that runs the "date" bash command (and it works)
then I created another task (task_id='process_datetime') which takes the datetime of the first task and processes it, and I set the python_callable and everything is fine..
the issue is that dt = ti.xcom_pull gives a NoneType when I run "airflow tasks test first_ariflow_dag process_datetime 2022-11-1" in the terminal, but when I see the log in the Airflow UI, I find that it works normally.
could someone give me a solution please?
`
from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
def process_datetime(ti):
dt = ti.xcom_pull(task_ids=['get_datetime'])
if not dt :
raise Exception('No datetime value')
dt = str(dt[0]).split()
return{
'year':int(dt[-1]),
'month':dt[1],
'day':int(dt[2]),
'time':dt[3],
'day_of_week':dt[0]
}
with DAG(
dag_id='first_ariflow_dag',
schedule_interval='* * * * *',
start_date=datetime(year=2022, month=11, day=1),
catchup=False
) as dag:
# 1. Get the current datetime
task_get_datetime= BashOperator(
task_id = 'get_datetime',
bash_command='date'
)
# 2. Process the datetime
task_process_datetime= PythonOperator(
task_id = 'process_datetime',
python_callable=process_datetime
)
`
I get this error :
[2022-11-02 00:51:45,420] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 175, in execute
return_value = self.execute_callable()
File "/mnt/c/Users/Salim/Desktop/A-Learning/Airflow_Conda/airflow_env/lib/python3.8/site-packages/airflow/operators/python.py", line 193, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/salim/airflow/dags/first_dag.py", line 12, in process_datetime
raise Exception('No datetime value')
Exception: No datetime value
According to the documentation, to upload data to xcom you need to set the variable do_xcom_push (Airflow 2) or xcom_push (Airflow 1).
If BaseOperator.do_xcom_push is True, the last line written to stdout
will also be pushed to an XCom when the bash command completes
BashOperator should look like this:
task_get_datetime= BashOperator(
task_id = 'get_datetime',
bash_command='date',
do_xcom_push=True
)

Airflow triggering the "on_failure_callback" when the "dagrun_timeout" is exceeded

Currently working on setting up alerts for long running tasks in Airflow. To cancel/fail the airflow dag I've put "dagrun_timeout" in the default_args, and it does what I need, fails/errors the dag when its been running for too long (usually stuck). The only problem is that the function in "on_failure_callback" doesn't get called when the dagrun_timeout is exceeded, because the "on_failure_callback" is on the task level (I think) while the dagrun_timeout is on the dag level.
How can I execute the "on_failure_callback" when the dagrun_timeout is exceeded, or how can I specify a function to be called when a dag fails? Or should I re-think my approach?
Try setting on_failure_callback during DAG declaration:
with DAG(
dag_id="failure_callback_example",
on_failure_callback=_on_dag_run_fail,
...
) as dag:
...
The explanation is that on_failure_callback defined in default_args will get passed only to the Tasks being created and not to the DAG object.
Here is an example to try this behaviour:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators.bash import BashOperator
def _on_dag_run_fail(context):
print("***DAG failed!! do something***")
print(f"The DAG failed because: {context['reason']}")
print(context)
def _alarm(context):
print("** Alarm Alarm!! **")
task_instance: TaskInstance = context.get("task_instance")
print(f"Task Instance: {task_instance} failed!")
default_args = {
"owner": "mi_empresa",
"email_on_failure": False,
"on_failure_callback": _alarm,
}
with DAG(
dag_id="failure_callback_example",
start_date=datetime(2021, 9, 7),
schedule_interval=None,
default_args=default_args,
catchup=False,
on_failure_callback=_on_dag_run_fail,
dagrun_timeout=timedelta(seconds=45),
) as dag:
delayed = BashOperator(
task_id="delayed",
bash_command='echo "waiting..";sleep 60; echo "Done!!"',
)
will_fail = BashOperator(
task_id="will_fail",
bash_command="exit 1",
# on_failure_callback=_alarm,
)
delayed >> will_fail
You can find the logs of the callbacks execution in the Scheduler logs AIRFLOW_HOME/logs/scheduler/date/failure_callback_example :
[2021-09-24 13:12:34,285] {logging_mixin.py:104} INFO - [2021-09-24 13:12:34,285] {dag.py:862} INFO - Executing dag callback function: <function _on_dag_run_fail at 0x7f83102e8670>
[2021-09-24 13:12:34,336] {logging_mixin.py:104} INFO - ***DAG failed!! do something***
[2021-09-24 13:12:34,345] {logging_mixin.py:104} INFO - The DAG failed because: timed_out
Edit:
Within the context dict the key reason is passed in order to specify the cause of the DAG run failure. Some values are: 'reason': 'timed_out' or 'reason': 'task_failure' . This could be use to perfom specific behaviour in the callback based on the reason of the DAG Run failure.

Assign airflow task to several DAGs

I am trying to reuse an existing airflow task by assigning it to different dags.
def create_new_task_for_dag(task: BaseOperator,
dag: models.DAG) -> BaseOperator:
"""Create a deep copy of given task and associate it with given dag
"""
new_task = copy.deepcopy(task)
new_task.dag = dag
return new_task
print_datetime_task = python_operator.PythonOperator(
task_id='print_datetime', python_callable=_print_datetime)
# define a new dag ...
# add to the new dag
create_new_task_for_dag(print_datetime_task, new_dag)
Then it gives the error Task is missing the start_date parameter.
If I define the dag when creating the operator, print_datetime_task = PythonOperator(task_id='print_datetime', python_callable=_print_datetime, dag=new_dag), then it is OK.
I have searched around, and this seems to be the root cause: https://github.com/apache/airflow/pull/5598, but PR has been marked as stale.
I wonder if there is any other approach to reuse an existing airflow task assign to a different dag.
I am using apache-airflow[docker,kubernetes]==1.10.10
While I don't know the solution to your problem with current design (code-layout), it can be made to work by tweaking the design slightly (note that the following code-snippets have NOT been tested)
Instead of copying a task from a DAG,
def create_new_task_for_dag(task: BaseOperator,
dag: models.DAG) -> BaseOperator:
"""Create a deep copy of given task and associate it with given dag
"""
new_task = copy.deepcopy(task)
new_task.dag = dag
return new_task
you can move the instantiation of task (as well as it's assignment to the DAG) to a separate utility function.
from datetime import datetime
from typing import Dict, Any
from airflow.models.dag import DAG
from airflow.operators.python_operator import PythonOperator
def add_new_print_datetime_task(my_dag: DAG,
kwargs: Dict[str, Any]) -> PythonOperator:
"""
Creates and adds a new 'print_datetime' (PythonOperator) task in 'my_dag'
and returns it's reference
:param my_dag: reference to DAG object in which to add the task
:type my_dag: DAG
:param kwargs: dictionary of args for PythonOperator / BaseOperator
'task_id' is mandatory
:type kwargs: Dict[str, Any]
:return: PythonOperator
"""
def my_callable() -> None:
print(datetime.now())
return PythonOperator(dag=my_dag, python_callable=my_callable, **kwargs)
Thereafter you can call that function everytime you want to instantiate that same task (and assign to any DAG)
with DAG(dag_id="my_dag_id", start_date=datetime(year=2020, month=8, day=22, hour=16, minute=30)) as my_dag:
print_datetime_task_kwargs: Dict[str, Any] = {
"task_id": "my_task_id",
"depends_on_past": True
}
print_datetime_task: PythonOperator = add_new_print_datetime_task(my_dag=my_dag, kwargs=print_datetime_task_kwargs)
# ... other tasks and their wiring
References / good reads
Astronomer.io: Dynamically Generating DAGs in Airflow
Apache Airflow | With Statement and DAG

airflow trigger_rule using ONE_FAILED cause dag failure

what i wanted to achieve is to create a task where will send notification if any-one of the task under the dag is failed. I am applying trigger rule to the task where:
batch11 = BashOperator(
task_id='Error_Buzz',
trigger_rule=TriggerRule.ONE_FAILED,
bash_command='python /home/admin/pythonwork/home/codes/notifications/dagLevel_Notification.py') ,
dag=dag,
catchup = False
)
batch>>batch11
batch1>>batch11
The problem for now is when there no other task failed, the batch11 task will not execute due to trigger_rule, which is what i wanted, but it will result the dag failure since the default trigger_rule for dag is ALL_SUCCESS. Is there a way to end the loop hole to make the dag runs successfully ?
screenshot of outcome :
We do something similar in our Airflow Deployment. The idea is to notify slack when a task in a dag fails. You can set a dag level configuration on_failure_callback as documented https://airflow.apache.org/code.html#airflow.models.BaseOperator
on_failure_callback (callable) – a function to be called when a task
instance of this task fails. a context dictionary is passed as a
single parameter to this function. Context contains references to
related objects to the task instance and is documented under the
macros section of the API.
Here is an example of how I use it. if any of the task fails or succeeds airflow calls notify function and I can get notification wherever I want.
import sys
import os
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago
from util.airflow_utils import AirflowUtils
schedule = timedelta(minutes=5)
args = {
'owner': 'user',
'start_date': days_ago(1),
'depends_on_past': False,
'on_failure_callback': AirflowUtils.notify_job_failure,
'on_success_callback': AirflowUtils.notify_job_success
}
dag = DAG(
dag_id='demo_dag',
schedule_interval=schedule, default_args=args)
def task1():
return 'Whatever you return gets printed in the logs!'
def task2():
return 'cont'
task1 = PythonOperator(task_id='task1',
python_callable=task1,
dag=dag)
task2 = PythonOperator(task_id='task2',
python_callable=task1,
dag=dag)
task1 >> task2

apache airflow - Cannot load the dag bag to handle failure

I have created a on_failure_callback function(refering Airflow default on_failure_callback) to handle task's failure.
It works well when there is only one task in a DAG, however, if there are 2 more tasks, a task is randomly failed since the operator is null, it can resume later by manully . In airflow-scheduler.out the log is:
[2018-05-08 14:24:21,237] {models.py:1595} ERROR - Executor reports
task instance %s finished (%s) although the task says its %s. Was the
task killed externally? NoneType [2018-05-08 14:24:21,238]
{jobs.py:1435} ERROR - Cannot load the dag bag to handle failure for
. Setting task to FAILED without
callbacks or retries. Do you have enough resources?
The DAG code is:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
import airflow
from devops.util import WechatUtil
from devops.util import JiraUtil
def on_failure_callback(context):
ti = context['task_instance']
log_url = ti.log_url
owner = ti.task.owner
ti_str = str(context['task_instance'])
wechat_msg = "%s - Owner:%s"%(ti_str,owner)
WeChatUtil.notify_team(wechat_msg)
jira_desc = "Please check log from url %s"%(log_url)
JiraUtil.create_incident("DW",ti_str,jira_desc,owner)
args = {
'queue': 'default',
'start_date': airflow.utils.dates.days_ago(1),
'retry_delay': timedelta(minutes=1),
'on_failure_callback': on_failure_callback,
'owner': 'user1',
}
dag = DAG(dag_id='test_dependence1',default_args=args,schedule_interval='10 16 * * *')
load_crm_goods = BashOperator(
task_id='crm_goods_job',
bash_command='date',
dag=dag)
load_crm_memeber = BashOperator(
task_id='crm_member_job',
bash_command='date',
dag=dag)
load_crm_order = BashOperator(
task_id='crm_order_job',
bash_command='date',
dag=dag)
load_crm_eur_invt = BashOperator(
task_id='crm_eur_invt_job',
bash_command='date',
dag=dag)
crm_member_cohort_analysis = BashOperator(
task_id='crm_member_cohort_analysis_job',
bash_command='date',
dag=dag)
crm_member_cohort_analysis.set_upstream(load_crm_goods)
crm_member_cohort_analysis.set_upstream(load_crm_memeber)
crm_member_cohort_analysis.set_upstream(load_crm_order)
crm_member_cohort_analysis.set_upstream(load_crm_eur_invt)
crm_member_kpi_daily = BashOperator(
task_id='crm_member_kpi_daily_job',
bash_command='date',
dag=dag)
crm_member_kpi_daily.set_upstream(crm_member_cohort_analysis)
I had tried to update the airflow.cfg by adding the default memory from 512 to even 4096, but no luck. Would anyone have any advice ?
Ialso try to updated my JiraUtil and WechatUtil as following, encoutering the same error
WechatUtil:
import requests
class WechatUtil:
#staticmethod
def notify_trendy_user(user_ldap_id, message):
return None
#staticmethod
def notify_bigdata_team(message):
return None
JiraUtil:
import json
import requests
class JiraUtil:
#staticmethod
def execute_jql(jql):
return None
#staticmethod
def create_incident(projectKey, summary, desc, assignee=None):
return None
(I'm shooting tracer bullets a bit here, so bear with me if this answer doesn't get it right on the first try.)
The null operator issue with multiple task instances is weird... it would help approaching troubleshooting this if you could boil the current code down to a MCVE e.g., 1–2 operators and excluding the JiraUtil and WechatUtil parts if they're not related to the callback failure.
Here are 2 ideas:
1. Can you try changing the line that fetches the task instance out of the context to see if this makes a difference?
Before:
def on_failure_callback(context):
ti = context['task_instance']
...
After:
def on_failure_callback(context):
ti = context['ti']
...
I saw this usage in the Airflow repo (https://github.com/apache/incubator-airflow/blob/c1d583f91a0b4185f760a64acbeae86739479cdb/airflow/contrib/hooks/qubole_check_hook.py#L88). It's possible it can be accessed both ways.
2. Can you try adding provide_context=True on the operators either as a kwarg or in default_args?

Resources