Get DAG's URL into the alert message in Slack - Airflow

I'm using Airflow 2.2.1.
I have a DAG that sends an alert message to Slack when it fails.
dag:
from alert.alert_slack import alert_slack
...
default_args = {
    'owner': 'name',
    'start_date': dt.datetime(2022, 11, 18),
    'retries': 2,
    'retry_delay': dt.timedelta(seconds=10),
    'on_failure_callback': alert_slack('myslackid')
}
alert_slack:
from airflow.providers.slack.operators.slack import SlackAPIPostOperator

def alert_slack(channel: str):
    def failure(context):
        last_task = context.get('task_instance')
        task_name = last_task.task_id
        dag_name = last_task.dag_id
        owner = context.get('dag').owner  # comma-separated owners of the DAG's tasks
        log_link = f"<{last_task.log_url}|{task_name}>"
        error_message = context.get('exception') or context.get('reason')
        execution_date = context.get('execution_date')
        title = ':red_circle: DAG Failed.'
        msg_parts = {
            '*Dag*': dag_name,
            '*Owner*': owner,
            '*Task*': task_name,
            '*Log*': log_link,
            '*Error*': error_message,
            '*Execution date*': execution_date
        }
        msg = "\n".join([
            title,
            *[f"{key}: {value}" for key, value in msg_parts.items()]
        ]).strip()
        SlackAPIPostOperator(
            task_id="alert",
            slack_conn_id="slack_alert",
            text=msg,
            channel=channel,
        ).execute(context=None)
    return failure
I need to add the URL of the failed DAG on the Airflow site (like https://myairflow.dev/code?dag_id=my_test_dag) to the alert message.
I tried to add the following code:
dagcode = context.get('dag_code')
dag_url = dagcode.source_code
but dagcode returns None.
I think this approach is completely wrong, and I don't know where to look for this URL.
Can anyone please help me find where the DAG's URL is and how to pass it to the alert message?

The log link that you are already including in the alert takes you to the Airflow UI: the page with the log for the failed task in that DAG run.
If you want a specific page in the UI for that DAG, say the 'Grid' view of a particular DAG like the screenshot attached below, you can simply hardcode the URL and include it in your Slack message.
For example:
last_task = context.get('task_instance')
dag_id = last_task.dag_id

# airflow_server_id is whatever the address
# of your Airflow webserver is (e.g. myairflow.dev).
base_url = f'https://{airflow_server_id}/dags/{dag_id}/grid'

msg_parts = {
    '*Dag*': dag_id,
    '*Link to Dag grid page*': base_url
}
...
Or, say you want the Code UI page for this DAG:
last_task = context.get('task_instance')
dag_id = last_task.dag_id

# airflow_server_id is whatever the address
# of your Airflow webserver is (e.g. myairflow.dev).
base_url = f'https://{airflow_server_id}/dags/{dag_id}/code'

msg_parts = {
    '*Dag*': dag_id,
    '*Link to Dag code page*': base_url
}
...
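If you would rather not hardcode the webserver address, one option is to read it from the Airflow configuration. This is a minimal sketch, assuming base_url is set correctly under [webserver] in airflow.cfg and that your Airflow version serves the Grid view at /dags/<dag_id>/grid:
from airflow.configuration import conf

last_task = context.get('task_instance')
dag_id = last_task.dag_id

# base_url comes from the [webserver] section of airflow.cfg, e.g. https://myairflow.dev
webserver_base = conf.get('webserver', 'base_url')
dag_grid_url = f"{webserver_base}/dags/{dag_id}/grid"

msg_parts = {
    '*Dag*': dag_id,
    '*Link to Dag grid page*': f"<{dag_grid_url}|{dag_id}>"
}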
I will say, it is odd to want to include a link to a DAG page instead of the page with the logs of the failed DAG run. Usually, the person responding to a Slack alert wants to see more information about a DAG run, not about the DAG itself. So if it were me, including the log URL of the failed DAG run like you're already doing should be enough.
(Screenshot: Grid view of a DAG in the Airflow UI - https://i.stack.imgur.com/1IMII.png)

Related

Send email notifications when an Airflow DAG times out

I am using Airflow v2.2.5.
I want to send an email notification when a DAG times out.
So far I am able to send emails for task-level failures.
Please help.
The code you posted should already satisfy your request.
When the dagrun_timeout is reached, the DAG is marked as failed, hence the on_failure_callback is called.
In the callback you can check the context['reason'] field to see whether the failure is due to the timeout or to another reason:
dag_timed_out = context['reason'] == 'timed_out'
Here is a full example:
from time import sleep
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def printx(v):
    print(v)
    with open("/tmp/SO_74153563.log", "a") as f:
        f.write(v + "\n")

def dag_callback(ctx):
    printx("DAG Failure.\nReason: " + ctx['reason'])
    timed_out = ctx['reason'] == 'timed_out'
    printx("Timed out: " + str(timed_out))

def long_running_job():
    printx("Sleeping...")
    sleep(40)
    printx("Slept")

with DAG(
    "SO_74153563",
    start_date=datetime.now() - timedelta(days=2),
    schedule_interval=None,
    dagrun_timeout=timedelta(seconds=15),
    on_failure_callback=dag_callback
) as dag:
    task_1 = PythonOperator(
        task_id="task_1",
        python_callable=long_running_job
    )
The task sleeps for 40 seconds while the DAG has a timeout of 15 seconds, so it will fail. The output will be:
DAG Failure.
Reason: timed_out
Timed out: True
The only difference from your callback is that now it is defined directly on the DAG.
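If you want the callback to actually send an email when the timeout is the reason, one option is Airflow's built-in email helper. This is a minimal sketch, assuming SMTP is configured in airflow.cfg and that alerts@example.com is a placeholder recipient:
from airflow.utils.email import send_email

def dag_callback(ctx):
    # Only notify when the DAG run failed because of dagrun_timeout.
    if ctx['reason'] == 'timed_out':
        send_email(
            to="alerts@example.com",  # placeholder address
            subject=f"DAG {ctx['dag'].dag_id} timed out",
            html_content=f"Run {ctx['run_id']} failed with reason: {ctx['reason']}",
        )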

How to pass default values for a runtime input variable in Airflow for scheduled execution

I came across an issue while running a DAG in Airflow. My code works in two scenarios but fails in one.
Below are my scenarios:
Manual trigger with input - running fine
Manual trigger without input - running fine
Scheduled run - failing
Below is my code:
def decide_the_flow(**kwargs):
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
    print("IP is :", cleanup)
    return cleanup
I am getting the error below:
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default variables like:
default_dag_args = {
    'start_date': days_ago(0),
    'params': {
        "cleanup": "N"
    },
    'retries': 0
}
but it doesn't work.
I am using BranchPythonOperator to call this function.
Can anyone please guide me here? What am I missing?
As a workaround I am using the code below:
try:
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
except:
    cleanup = "N"
You can access the parameters from the params dict in the context, because Airflow fills in the default values on this dict after copying dag_run.conf and checking whether anything is missing:
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def decide_the_flow(**kwargs):
    cleanup = kwargs['params']["cleanup"]
    print(f"IP is : {cleanup}")
    return cleanup

with DAG(
    dag_id='airflow_params',
    start_date=datetime(2022, 8, 25),
    schedule_interval="* * * * *",
    params={
        "cleanup": "N",
    },
    catchup=False
) as dag:
    branch_task = BranchPythonOperator(
        task_id='test_param',
        python_callable=decide_the_flow
    )

    task_n = EmptyOperator(task_id="N")
    task_m = EmptyOperator(task_id="M")

    branch_task >> [task_n, task_m]
I just tested it with scheduled and manual (with and without conf) runs; it works fine.
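As a usage note, a manual run can still override the default at trigger time by passing conf; a quick sketch with the DAG above (the value "M" is just an example that routes to the other branch):
airflow dags trigger airflow_params --conf '{"cleanup": "M"}'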

Cannot Create Extra Operator Link on DatabricksRunNowOperator in Airflow

I'm currently trying to build an extra link on the DatabricksRunNowOperator in Airflow so I can quickly access the Databricks run without having to rummage through the logs. As a starting point I'm simply trying to add a link to Google in the task instance menu. I've followed the procedure shown in this tutorial, creating the following code placed within my Airflow home plugins folder:
from airflow.plugins_manager import AirflowPlugin
from airflow.models.baseoperator import BaseOperatorLink
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator

class DBLogLink(BaseOperatorLink):
    name = 'run_link'
    operators = [DatabricksRunNowOperator]

    def get_link(self, operator, dttm):
        return "https://www.google.com"

class AirflowExtraLinkPlugin(AirflowPlugin):
    name = "extra_link_plugin"
    operator_extra_links = [DBLogLink(), ]
However, the extra link does not show up, even after restarting the webserver etc.
Here's the code I'm using to create the DAG:
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
from datetime import datetime, timedelta

DATABRICKS_CONN_ID = '____'

args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 2, 13),
    'retries': 0
}

dag = DAG(
    dag_id='testing_notebook',
    default_args=args,
    schedule_interval=timedelta(days=1)
)

DatabricksRunNowOperator(
    task_id='mail_reader',
    dag=dag,
    databricks_conn_id=DATABRICKS_CONN_ID,
    polling_period_seconds=1,
    job_id=____,
    notebook_params={____}
)
I feel like I'm missing something really basic, but I just can't figure it out.
Additional info:
Airflow version 1.10.9
Running on Ubuntu 18.04.3
I've worked it out. You need to have your webserver running with the RBAC UI. This means setting up Airflow with authentication and adding users. RBAC can be turned on by setting rbac = True in your airflow.cfg file.
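For reference, on Airflow 1.10 that roughly means the following; this is a sketch, and the user details are placeholders:
# airflow.cfg (enable the RBAC UI)
[webserver]
rbac = True
After that, create at least one user (e.g. airflow create_user -r Admin -u admin -e admin@example.com -f First -l Last -p <password> with the 1.10 CLI) and restart the webserver so the plugin and the extra link are picked up.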

Airflow schedule getting skipped if previous task execution takes too long

I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run at minute 10 of every hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases, none of the schedules that come due while the previous task is still running are executed.
For example, if the run at 01:10 takes 10 hours to finish, the runs at 02:10, 03:10, 04:10, ..., 11:10, which are supposed to happen, get skipped, and only the one at 12:10 is executed.
I am using the LocalExecutor. I am running the Airflow webserver and scheduler using the scripts below.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}

DAG_ID = 'reconciliation_job_pipeline'
MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'
DA_REST_API_CONNECTION_CONFIG = 'rest_api'

recon_schedule = Variable.get('recon_cron_expression', "10 * * * *")

dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)
dag.doc_md = __doc__

spark_job_end_point = conf['sip_da']['spark_job_end_point']
fetch_index_record_count_config_key = conf['reconciliation'][
    'fetch_index_record_count']

fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={"Content-Type": "application/json"}
)

job_endpoint = conf['sip_da']['job_resource_endpoint']

fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)

fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator and JobStatusSensor are my custom classes extending SimpleHttpOperator and HttpSensor.
If I set depends_on_past to True, will it work as expected? Another problem I have with that option is that sometimes the status-check job will fail, but the next schedule should still get triggered. How can I achieve this behavior?
I think the main point here is that you set catchup=False (more detail can be found in the Airflow documentation on catchup). With catchup disabled, the scheduler will skip those missed runs, which is the behavior you are seeing.
It sounds like you need catchup to be performed when the previous process takes longer than expected, so you can try changing it to catchup=True.
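A minimal sketch of that change, keeping the max_active_runs=1 you already have so that only one run executes at a time while the missed ones queue up behind it:
dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=True)  # missed schedules are created and run once the long run finishes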

How to set up Nagios alerts for Apache Airflow DAGs

Is it possible to set up Nagios alerts for Airflow DAGs?
In case a DAG fails, I need to alert the respective groups.
You can add an "on_failure_callback" to any task, which will call an arbitrary failure-handling function. In that function you can then send an error call to Nagios.
For example:
import logging
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id="failure_handling",
          start_date=datetime(2018, 8, 1),
          schedule_interval='@daily')

def handle_failure(context):
    # first get useful fields to send to nagios/elsewhere
    dag_id = context['dag'].dag_id
    ds = context['ds']
    task_id = context['ti'].task_id
    # instead of printing these out - you can send these to somewhere else
    logging.info("dag_id={}, ds={}, task_id={}".format(dag_id, ds, task_id))

def task_that_fails(**kwargs):
    raise Exception("failing test")

task_to_fail = PythonOperator(
    task_id='task_to_fail',
    python_callable=task_that_fails,
    provide_context=True,
    on_failure_callback=handle_failure,
    dag=dag)
If you run a test on this:
airflow test failure_handling task_to_fail 2018-08-10
You get the following in your log output:
INFO - dag_id=failure_handling, ds=2018-08-10, task_id=task_to_fail
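If your Nagios setup accepts passive check results over NSCA, the handler could forward the failure instead of only logging it. This is a minimal sketch, assuming send_nsca is installed and configured; the host name, service naming scheme, and config path are placeholders:
import subprocess

def send_to_nagios(dag_id, task_id, message):
    # Passive check result: host, service, return code (2 = CRITICAL) and
    # plugin output, separated by tabs and terminated by a newline.
    check_line = "airflow-host\t{}.{}\t2\t{}\n".format(dag_id, task_id, message)
    subprocess.run(
        ["send_nsca", "-H", "nagios.example.com", "-c", "/etc/nagios/send_nsca.cfg"],
        input=check_line.encode(),
        check=True,
    )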
