I have an Airflow DAG which runs a series of Databricks notebooks.
What I want is: if any one of the notebooks fails, how do I send a mail to the user saying that this notebook failed, with a few details such as the execution date?
And is there any error-handling mechanism for this?
The BaseOperator has parameters that let you configure sending emails on failure, and since every operator inherits from it, they are available to all operators:
DatabricksSubmitRunOperator(...,
                            email_on_failure=True,
                            email='your.email@where.com')
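If you want this for every notebook task in the DAG, the same settings can also go into the DAG's default_args, since default_args are forwarded to every operator. A minimal sketch (the cluster id and notebook path are placeholders):

from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'me',
    'start_date': days_ago(1),
    # applied to every task in this DAG
    'email': ['your.email@where.com'],
    'email_on_failure': True,
}

with DAG('databricks_notebooks', default_args=default_args,
         schedule_interval='@daily') as dag:
    notebook_1 = DatabricksSubmitRunOperator(
        task_id='notebook_1',
        json={
            'existing_cluster_id': 'CLUSTER_ID',                   # placeholder
            'notebook_task': {'notebook_path': '/Users/me/nb1'},   # placeholder
        },
    )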
Step 1: Set email_on_failure to False and use the operator's on_failure_callback. Pass on_failure_callback the function described below.
from airflow.utils.email import send_email

def notify_email(contextDict, **kwargs):
    """Send custom email alerts."""
    # the failed task's name comes from the task instance in the context
    task_name = contextDict['task_instance'].task_id

    # email title
    title = "Airflow alert: {} failed".format(task_name)

    # email contents
    body = """
    Hi Everyone, <br>
    <br>
    There's been an error in the {} job.<br>
    <br>
    Forever yours,<br>
    Airflow bot <br>
    """.format(task_name)

    send_email('your_email@address.com', title, body)
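Since the question also asks for details like the execution date, you can pull that from the same context dictionary; a small sketch using the standard context keys task_instance and execution_date:

def notify_email_with_date(context, **kwargs):
    """Failure alert that also reports the execution date."""
    task_id = context['task_instance'].task_id
    execution_date = context['execution_date']
    title = "Airflow alert: {} failed".format(task_id)
    body = "Task {} failed for execution date {}.".format(task_id, execution_date)
    send_email('your_email@address.com', title, body)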
Step 2: A short example Airflow DAG is below.
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

args = {
    'owner': 'me',
    'description': 'my_example',
    'start_date': days_ago(1)
}

# run every day at 05:00 UTC
dag = DAG(dag_id='example_dag', default_args=args, schedule_interval='0 5 * * *')

def print_hello():
    return 'hello!'

py_task = PythonOperator(task_id='example',
                         python_callable=print_hello,
                         on_failure_callback=notify_email,
                         dag=dag)

py_task
Note how on_failure_callback is set to notify_email in the PythonOperator.
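If the goal is to alert on any of the Databricks notebook tasks rather than a single PythonOperator, the callback can also be placed in default_args, which are forwarded to every operator in the DAG; a brief sketch:

default_args = {
    'owner': 'me',
    'start_date': days_ago(1),
    'email_on_failure': False,            # use the custom callback instead
    'on_failure_callback': notify_email,
}

dag = DAG(dag_id='databricks_notebooks', default_args=default_args,
          schedule_interval='0 5 * * *')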
I'm struggling to understand the difference between a task and a DAG and when to use one over the other. I know a task is more granular and is called within a DAG, but so much of the Airflow documentation mentions creating DAGs on the go or calling other DAGs instead of tasks. Is there any significant difference between these two options?
A DAG is a collection of tasks together with schedule information. Each task can perform different work based on the requirement. Consider the DAG code below as an example: it prints the current time and then sends an e-mail notification.
# importing operators and modules
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  ## to call a python object
from airflow.operators.email_operator import EmailOperator    ## to send email
from datetime import datetime, timedelta, timezone            ## to play with date and time
import dateutil

# setting default arguments
default_args = {
    'owner': 'test dag',
    'depends_on_past': False,
    'start_date': datetime(2021, 1, 1),
    'email': ['myemailid@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0
}

def print_time(**context):
    now_utc = datetime.now(timezone.utc)
    print("current time", now_utc)

with DAG('example_dag', schedule_interval='0 12 * * *', max_active_runs=1,
         catchup=False, default_args=default_args) as dag:  ## dag name is 'example_dag'

    current_time = PythonOperator(task_id='current_time', python_callable=print_time,
                                  provide_context=True,
                                  dag=dag)  ## task to call the print_time definition

    send_email = EmailOperator(task_id='send_email', to='myemailid@example.com',
                               subject='DAG completed successfully',
                               html_content="<p>Hi,<br><br>example DAG completed successfully<br>",
                               dag=dag)  ## task to send email

    current_time >> send_email  ## defining task dependency
Here current_time and send_email are two different tasks performing different work. We have a dependency: the email must be sent once the current time is printed, so we establish that task dependency at the end. We have also given a schedule_interval to run the DAG every day at 12 PM. Together this forms a DAG.
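As for "calling other DAGs instead of tasks": a DAG can only contain tasks, but one of those tasks can trigger another DAG, for example with the TriggerDagRunOperator. A brief sketch, assuming a separate DAG with dag_id 'other_dag' already exists:

from airflow.operators.dagrun_operator import TriggerDagRunOperator

def always_trigger(context, dag_run_obj):
    return dag_run_obj   # returning the object is what actually triggers the run

trigger_other = TriggerDagRunOperator(
    task_id='trigger_other_dag',
    trigger_dag_id='other_dag',      # assumed to exist
    python_callable=always_trigger,
    dag=dag,                         # attach to the calling DAG
)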
While executing the following Python script using Cloud Composer, I get *** Task instance did not exist in the DB under the gcs2bq task log in Airflow.
Code:
import datetime
import os
import csv
import pandas as pd
import pip

from airflow import models
#from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator

print('''/-------/--------/------/
-------/--------/------/''')

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    # Setting start date as yesterday starts the DAG immediately when it is
    # detected in the Cloud Storage bucket.
    'start_date': yesterday,
    # To email on failure or retry set 'email' arg to your email and enable
    # emailing here.
    'email_on_failure': False,
    'email_on_retry': False,
    # If a task fails, retry it once after waiting at least 5 minutes
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': 'data-rubrics'
    #models.Variable.get('gcp_project')
}

try:
    # [START composer_quickstart_schedule]
    with models.DAG(
            'composer_agg_quickstart',
            # Continue to run DAG once per day
            schedule_interval=datetime.timedelta(days=1),
            default_args=default_dag_args) as dag:
        # [END composer_quickstart_schedule]

        op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')

        #op_readwrite = PythonOperator(task_id = 'ReadAggWriteFile', python_callable=read_data)

        op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
            task_id='gcs2bq',
            bucket='dr-mockup-data',
            source_objects=['sample.csv'],
            destination_project_dataset_table='data-rubrics.sample_bqtable',
            schema_fields=[{'name': 'a', 'type': 'STRING', 'mode': 'NULLABLE'},
                           {'name': 'b', 'type': 'FLOAT', 'mode': 'NULLABLE'}],
            write_disposition='WRITE_TRUNCATE',
            dag=dag)

        #op_write = PythonOperator(task_id = 'AggregateAndWriteFile', python_callable=write_data)

        op_start >> op_load
UPDATE:
Can you remove dag=dag from the gcs2bq task, since you are already using with models.DAG(...), and run your DAG again?
It might be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
Make your start_date static or use Airflow utils/macros:
import airflow

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
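A fixed date in the past works just as well, for example:

from datetime import datetime

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 12, 1),   # static, never changes between DAG parses
}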
Okay, this was a silly mistake on my part, and apologies to everyone who spent time here. I had another DAG running, due to which the one I was firing off was always stuck in the queue. Also, I had not written the correct value in destination_project_dataset_table. Thanks, and apologies to all who spent time on this.
What I wanted to achieve is a task that sends a notification if any one of the tasks under the DAG fails. I am applying a trigger rule to that task:
batch11 = BashOperator(
    task_id='Error_Buzz',
    trigger_rule=TriggerRule.ONE_FAILED,
    bash_command='python /home/admin/pythonwork/home/codes/notifications/dagLevel_Notification.py',
    dag=dag,
)

batch >> batch11
batch1 >> batch11
The problem now is that when no other task fails, the batch11 task does not execute because of the trigger_rule, which is what I want, but it results in a DAG failure since the default trigger_rule for the DAG is ALL_SUCCESS. Is there a way to close this loophole so that the DAG run succeeds?
Screenshot of the outcome: (image not included)
We do something similar in our Airflow deployment. The idea is to notify Slack when a task in a DAG fails. You can set a DAG-level configuration on_failure_callback, as documented at https://airflow.apache.org/code.html#airflow.models.BaseOperator:
on_failure_callback (callable) – a function to be called when a task
instance of this task fails. a context dictionary is passed as a
single parameter to this function. Context contains references to
related objects to the task instance and is documented under the
macros section of the API.
Here is an example of how I use it. If any of the tasks fails or succeeds, Airflow calls the notify function and I can get a notification wherever I want.
import sys
import os
from datetime import datetime, timedelta

from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago

from util.airflow_utils import AirflowUtils

schedule = timedelta(minutes=5)

args = {
    'owner': 'user',
    'start_date': days_ago(1),
    'depends_on_past': False,
    'on_failure_callback': AirflowUtils.notify_job_failure,
    'on_success_callback': AirflowUtils.notify_job_success
}

dag = DAG(
    dag_id='demo_dag',
    schedule_interval=schedule, default_args=args)

def task1():
    return 'Whatever you return gets printed in the logs!'

def task2():
    return 'cont'

task1 = PythonOperator(task_id='task1',
                       python_callable=task1,
                       dag=dag)

task2 = PythonOperator(task_id='task2',
                       python_callable=task2,
                       dag=dag)

task1 >> task2
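The AirflowUtils helpers above are internal to that deployment and not part of Airflow. A minimal sketch of what such a callback could look like, assuming a hypothetical Slack incoming-webhook URL stored in an Airflow Variable named slack_webhook_url:

import json
import urllib.request

from airflow.models import Variable

def notify_job_failure(context):
    """Post a short failure message to Slack (hypothetical webhook)."""
    message = "Task {} failed in DAG {} (execution date: {}).".format(
        context['task_instance'].task_id,
        context['task_instance'].dag_id,
        context['execution_date'],
    )
    req = urllib.request.Request(
        Variable.get('slack_webhook_url'),            # assumed to exist
        data=json.dumps({'text': message}).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    urllib.request.urlopen(req)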
I need to have a pipeline that can be executed either manually or programmatically. Is that possible with Airflow? It looks like right now every workflow MUST be tied to a schedule.
Just set the schedule_interval to None when you create the DAG:
dag = DAG('workflow_name',
          template_searchpath='path',
          schedule_interval=None,
          default_args=default_args)
From the Airflow Manual:
Each DAG may or may not have a schedule, which informs how DAG Runs
are created. schedule_interval is defined as a DAG argument, and
receives preferably a cron expression as a str, or a
datetime.timedelta object.
The manual then goes on to list some cron 'presets' one of which is None.
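With no schedule set, the DAG only runs when you trigger it yourself, either from the web UI or from the command line, e.g. airflow trigger_dag workflow_name (Airflow 1.x CLI syntax; newer versions use airflow dags trigger).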
Yes, this can be achieved by setting schedule_interval to None on the DAG.
Check the documentation on DAG Runs.
For example:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('my_dag', default_args=default_args,
          schedule_interval=None)  # Check this line
In Airflow, every DAG is required to have a start date and a schedule interval*, for example hourly:
import datetime

dag = DAG(
    dag_id='my_dag',
    schedule_interval=datetime.timedelta(hours=1),
    start_date=datetime.datetime(2018, 5, 23),
)
(Without a schedule how would it know when to run?)
As an alternative to a cron schedule, you can set the schedule to @once to run the DAG only once.
*One exception: You can omit the schedule for externally triggered DAGs because Airflow will not schedule them itself.
That said, if you omit the schedule, you need to trigger the DAG externally somehow. If you want to be able to call a DAG programmatically, for instance as a result of a separate condition occurring in another DAG, you can do that with the TriggerDagRunOperator. You might also hear this idea called externally triggered DAGs.
Here's a usage example from the Airflow Example DAGs:
File 1 - example_trigger_controller_dag.py:
"""This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - DAG being triggered (in example_trigger_target_dag.py)
This example illustrates the following features :
1. A TriggerDagRunOperator that takes:
a. A python callable that decides whether or not to trigger the Target DAG
b. An optional params dict passed to the python callable to help in
evaluating whether or not to trigger the Target DAG
c. The id (name) of the Target DAG
d. The python callable can add contextual info to the DagRun created by
way of adding a Pickleable payload (e.g. dictionary of primitives). This
state is then made available to the TargetDag
2. A Target DAG : c.f. example_trigger_target_dag.py
"""
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime
import pprint

pp = pprint.PrettyPrinter(indent=4)

def conditionally_trigger(context, dag_run_obj):
    """This function decides whether or not to Trigger the remote DAG"""
    c_p = context['params']['condition_param']
    print("Controller DAG : conditionally_trigger = {}".format(c_p))
    if context['params']['condition_param']:
        dag_run_obj.payload = {'message': context['params']['message']}
        pp.pprint(dag_run_obj.payload)
        return dag_run_obj

# Define the DAG
dag = DAG(dag_id='example_trigger_controller_dag',
          default_args={"owner": "airflow",
                        "start_date": datetime.utcnow()},
          schedule_interval='@once')

# Define the single task in this controller example DAG
trigger = TriggerDagRunOperator(task_id='test_trigger_dagrun',
                                trigger_dag_id="example_trigger_target_dag",
                                python_callable=conditionally_trigger,
                                params={'condition_param': True,
                                        'message': 'Hello World'},
                                dag=dag)
File 2 - example_trigger_target_dag.py:
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime
import pprint

pp = pprint.PrettyPrinter(indent=4)

# This example illustrates the use of the TriggerDagRunOperator. There are 2
# entities at work in this scenario:
# 1. The Controller DAG - the DAG that conditionally executes the trigger
#    (in example_trigger_controller.py)
# 2. The Target DAG - DAG being triggered
#
# This example illustrates the following features :
# 1. A TriggerDagRunOperator that takes:
#    a. A python callable that decides whether or not to trigger the Target DAG
#    b. An optional params dict passed to the python callable to help in
#       evaluating whether or not to trigger the Target DAG
#    c. The id (name) of the Target DAG
#    d. The python callable can add contextual info to the DagRun created by
#       way of adding a Pickleable payload (e.g. dictionary of primitives). This
#       state is then made available to the TargetDag
# 2. A Target DAG : c.f. example_trigger_target_dag.py

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_trigger_target_dag',
    default_args=args,
    schedule_interval=None)

def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".
          format(kwargs['dag_run'].conf['message']))

run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}" ',
    dag=dag)
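For completeness, the target DAG can also be triggered by hand with the same kind of payload, e.g. airflow trigger_dag example_trigger_target_dag --conf '{"message": "Hello World"}' (Airflow 1.x CLI syntax), and run_this_func reads the message from dag_run.conf exactly as it does when the controller DAG triggers it.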
EDITED
I have edited this question considering the inputs from @tobi6.
I copied the SubDagOperator from the Airflow source code.
Source code: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/subdag_operator.py
I modified a few things in the execute method. The changes were made to trigger the SubDAG and wait until it completes execution. The trigger is working, but the tasks are not being executed (the DAG is in the running/green state while its tasks stay in the null/white state).
Please refer below for the changes I made:
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator, Pool
from airflow.utils.decorators import apply_defaults
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.executors import GetDefaultExecutor
from time import sleep
import logging
from datetime import datetime


class SubDagOperator(BaseOperator):

    template_fields = tuple()
    ui_color = '#555'
    ui_fgcolor = '#fff'

    @provide_session
    @apply_defaults
    def __init__(
            self,
            subdag,
            executor=GetDefaultExecutor(),
            *args, **kwargs):
        """
        Yo dawg. This runs a sub dag. By convention, a sub dag's dag_id
        should be prefixed by its parent and a dot. As in `parent.child`.

        :param subdag: the DAG object to run as a subdag of the current DAG.
        :type subdag: airflow.DAG
        :param dag: the parent DAG
        :type subdag: airflow.DAG
        """
        import airflow.models
        dag = kwargs.get('dag') or airflow.models._CONTEXT_MANAGER_DAG
        if not dag:
            raise AirflowException('Please pass in the `dag` param or call '
                                   'within a DAG context manager')
        session = kwargs.pop('session')
        super(SubDagOperator, self).__init__(*args, **kwargs)

        # validate subdag name
        if dag.dag_id + '.' + kwargs['task_id'] != subdag.dag_id:
            raise AirflowException(
                "The subdag's dag_id should have the form "
                "'{{parent_dag_id}}.{{this_task_id}}'. Expected "
                "'{d}.{t}'; received '{rcvd}'.".format(
                    d=dag.dag_id, t=kwargs['task_id'], rcvd=subdag.dag_id))

        # validate that subdag operator and subdag tasks don't have a
        # pool conflict
        if self.pool:
            conflicts = [t for t in subdag.tasks if t.pool == self.pool]
            if conflicts:
                # only query for pool conflicts if one may exist
                pool = (
                    session
                    .query(Pool)
                    .filter(Pool.slots == 1)
                    .filter(Pool.pool == self.pool)
                    .first()
                )
                if pool and any(t.pool == self.pool for t in subdag.tasks):
                    raise AirflowException(
                        'SubDagOperator {sd} and subdag task{plural} {t} both '
                        'use pool {p}, but the pool only has 1 slot. The '
                        'subdag tasks will never run.'.format(
                            sd=self.task_id,
                            plural=len(conflicts) > 1,
                            t=', '.join(t.task_id for t in conflicts),
                            p=self.pool
                        )
                    )

        self.subdag = subdag
        self.executor = executor

    def execute(self, context):
        dag_run = self.subdag.create_dagrun(
            conf=context['dag_run'].conf,
            state=State.RUNNING,
            execution_date=context['execution_date'],
            run_id='trig__' + str(datetime.utcnow()),
            external_trigger=True
        )

        while True:
            if dag_run.get_state() == State.FAILED or dag_run.get_state() == State.SUCCESS:
                break
            else:
                sleep(10)
                continue
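For comparison, if memory serves, the stock SubDagOperator of that Airflow 1.x era does not just create a DagRun and poll it: its execute method runs the sub-DAG via DAG.run(...) with the configured executor, which is what actually gets the sub-DAG's tasks scheduled. A rough sketch of that approach (treat the exact signature as an assumption):

    def execute(self, context):
        ed = context['execution_date']
        # run the sub-DAG for exactly this execution date with the executor,
        # instead of only creating an externally-triggered DagRun and waiting
        self.subdag.run(
            start_date=ed, end_date=ed, donot_pickle=True,
            executor=self.executor)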
Below is the code that shows how I am using it:
from airflow import DAG
from operators.sd_operator import SubDagOperator  # My SubDag Operator
from airflow.operators.python_operator import PythonOperator
import logging
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2017, 7, 17),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
}


def print_dag_details(**kwargs):
    logging.info(str(kwargs['dag_run'].conf))


with DAG('example_dag', schedule_interval=None, catchup=False, default_args=default_args) as dag:
    task_1 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_1'),
        task_id='sub_dag_1',
    )

    task_2 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_2'),
        task_id='sub_dag_2',
    )

    print_kwargs = PythonOperator(
        task_id='print_kwargs',
        python_callable=print_dag_details,
        provide_context=True,
    )

    print_kwargs >> task_1 >> task_2
Any information you provide would be helpful. Thanks in advance.
It is a bit hard to understand your question without context.
"I copied the subdag operator and modified a few things in the execute method."
From where was this copied?
"The trigger is working great ..."
What does this look like?
There are a few things I saw in the code:
It might be helpful to pass keyword arguments in the call to sub_dag_func, e.g. sub_dag_func(subdag='parent_dag', ...).
In the binary-shift definitions used to set upstream/downstream, there are tasks I cannot find in the DAG (df_job_1, df_job_2). This might be connected to SubDAGs (I haven't looked into them yet).
The name of the sub DAG seems inconsistent with the comment in the code saying "By convention, a sub dag's dag_id should be prefixed by its parent and a dot", but it is sub_dag_1, sub_dag_2.
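On that last point: sub_dag_func is not shown in the question, but for the naming convention mentioned above to hold, it would have to return a DAG whose dag_id is parent_dag_id + '.' + task_id. A hypothetical sketch:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

def sub_dag_func(parent_dag_id, task_id):
    """Hypothetical helper: build a child DAG named '<parent>.<task_id>'."""
    subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_id, task_id),
        start_date=datetime(2017, 7, 17),
        schedule_interval=None,
    )
    DummyOperator(task_id='placeholder_task', dag=subdag)
    return subdag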