Airflow: Monitoring solutions for a DAG run - airflow

I'm currently trying to set up monitoring for Airflow that would ideally send out an email when a DAG has run, containing some information about all of its tasks, such as each task's final status, runtime, etc.
What I have not been able to resolve so far:
How can I get the states of all task instances associated with a DAG run?
Does it make sense to have the mail-sending as a component within the DAG?
If so, how could I then ensure in a simple way that the task will run after all other tasks?
On top of this, I have the additional constraint that the solution has to be simple, in the sense that it should either be just 2-3 lines of code or generalizable into a Python function, as my less Python-experienced colleagues have to be able to understand and reproduce the steps on other DAGs.
Smarter ideas about how to establish the email sending are very welcome.
Thank you for all the suggestions in advance!

Does it make sense to have the mail-sending as a component within the DAG?
If so, how could I then ensure in a simple way that the task will run after all other tasks?
I think that is one way to achieve what you want. You can connect all the "leaves" (tasks with no downstream dependencies) to a final task that emails the states of the other tasks. Note that the DAG run itself is still running in this scenario.
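from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

def send_task_summary_t(**context):
    # Every task instance of the current DAG run, including this one
    tis = context['dag_run'].get_task_instances()
    for ti in tis:
        print(ti.__dict__)

dag = DAG(...)

job_status = PythonOperator(
    task_id='_job_status',
    python_callable=send_task_summary_t,
    provide_context=True,
    # Run once all upstream tasks are done, whatever their final state
    trigger_rule=TriggerRule.ALL_DONE,
    dag=dag
)

# Connect every leaf (task without downstream tasks) to the final task
leaves = [task for task in dag.tasks if not task.downstream_list]
exclude = ['_job_status']
for l in leaves:
    if l.task_id not in exclude:
        job_status.set_upstream(l)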
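Since the question asks for something that can be reused on other DAGs, the leaf-wiring logic above could be factored into a small helper. This is only a sketch and not part of the original answer: the names find_leaf_tasks and attach_final_task are hypothetical, and the code relies only on the task_id, downstream_list, and set_upstream attributes already used in the snippet above.

```python
def find_leaf_tasks(tasks, exclude=()):
    """Return the tasks that have no downstream dependencies."""
    excluded = set(exclude)
    return [
        task for task in tasks
        if not task.downstream_list and task.task_id not in excluded
    ]


def attach_final_task(dag, final_task):
    """Make final_task run after every current leaf of the DAG."""
    # Compute the leaves before wiring anything, and skip the final task itself.
    leaves = find_leaf_tasks(dag.tasks, exclude=[final_task.task_id])
    for leaf in leaves:
        final_task.set_upstream(leaf)
    return leaves
```

With this in a shared module, each DAG file would only need one extra line at the end, for example attach_final_task(dag, job_status).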
How can I get the states of all task instances associated with a DAG run?
Instead of the EmailOperator, I would suggest the PythonOperator, since you need the context, which contains the information required to look up the state of the tasks. Building on the snippet above, I used the send_email utility to send an email.
from airflow.utils.email import send_email

def send_task_summary_t(**context):
    task = context['task']
    dr = context['dag_run']
    # Render the template at the given path using the task context
    body = task.render_template(None, "path/to/template", context)
    send_email(to="alan@example.com", subject=f"{dr} summary", html_content=body)
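If maintaining a separate template file is more than your colleagues need, the email body could also be built directly in Python. The following is a minimal sketch under that assumption: build_summary_html is a hypothetical helper, and it only reads the task_id, state, and duration attributes of each task instance.

```python
from html import escape


def build_summary_html(task_instances):
    """Render one table row per task instance: task id, state, duration."""
    rows = []
    for ti in sorted(task_instances, key=lambda ti: ti.task_id):
        # duration is None for tasks that never ran
        duration = "-" if ti.duration is None else f"{ti.duration:.1f}s"
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                escape(str(ti.task_id)), escape(str(ti.state)), duration
            )
        )
    header = "<tr><th>Task</th><th>State</th><th>Duration</th></tr>"
    return "<table>" + header + "".join(rows) + "</table>"
```

Inside the callable, the result could be passed as html_content, for example build_summary_html(context['dag_run'].get_task_instances()).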
You can also use Jinja templating to build your email.
<html>
<body>
<div>
<table>
{% for ti in dag_run.get_task_instances(): -%}
<tr>
<td class='{{ti.state}}' >
<a href='{{ host_server }}/admin/airflow/log?execution_date={{ts}}&task_id={{ti.task_id}}&dag_id={{dag.dag_id}}'>{{ti.state}}</a></td>
<td class="{{ti.operator}}">
<a href='{{ host_server }}/admin/airflow/graph?root={{ti.task_id}}&dag_id={{dag.dag_id}}&execution_date={{ts}}'>{{ti.task_id}}</a></td>
<td><a href='{{ host_server }}/admin/airflow/tree?base_date={{ts}}&num_runs=50&root={{ti.task_id}}&dag_id={{dag.dag_id}}'>{{ti.start_date}}</a></td>
<td><a href='{{ host_server }}/admin/airflow/gantt?root={{ti.task_id}}&dag_id={{dag.dag_id}}&execution_date={{ts}}'>{{ti.end_date}}</a></td>
<td><a href='{{ host_server }}/admin/airflow/duration?root={{ti.task_id}}&base_date={{ts}}&days=9999&dag_id={{dag.dag_id}}'>{{ti.duration}}</a></td>
</tr>
{% endfor -%}
</table>
</div>
</body>
</html>
Another way you can go about this is to utilize the on_failure_callback for the DAG object.
from datetime import datetime

from airflow.models import DAG

def send_task_summary(context):
    tis = context['dag_run'].get_task_instances()
    for ti in tis:
        print(ti.__dict__)

dag = DAG(
    dag_id='my_dag',
    schedule_interval='@once',
    start_date=datetime(2020, 1, 1),
    on_failure_callback=send_task_summary
)
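Note that on_failure_callback only fires for failed runs. If a summary is wanted for every run, the same function could presumably also be passed as the DAG's on_success_callback. The sketch below separates collecting the summary from delivering it; summarize_dag_run and make_summary_callback are hypothetical names, and notify stands for whatever sends the mail.

```python
def summarize_dag_run(context):
    """Collect (task_id, state) pairs for the DAG run in the callback context."""
    task_instances = context["dag_run"].get_task_instances()
    pairs = [(ti.task_id, ti.state) for ti in task_instances]
    return sorted(pairs, key=lambda pair: pair[0])


def make_summary_callback(notify):
    """Build a callback that hands the run summary to notify (e.g. a mail sender)."""
    def callback(context):
        notify(summarize_dag_run(context))
    return callback
```

For example, callback = make_summary_callback(print) could be passed as both on_failure_callback and on_success_callback while testing.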

Related

run next tasks in dag if another dag is complete

dag1:
start >> clean >> end
I have a DAG where I run a few tasks, but I want to modify it so that the clean step only runs if another DAG, "dag2", is not running at the moment.
Is there any way I can import information regarding "dag2", check its status, and proceed to the clean step if it is in the success state?
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are several answers, depending on what you want to do:
If you have two DAGs with the same schedule interval and you want each run of the second DAG to wait for the same run of the first one, you can use an ExternalTaskSensor on the last task of the first DAG.
If you want to run dag2 after each run of dag1, even when dag1 is triggered manually, update dag1 to add a TriggerDagRunOperator and set the schedule interval of the second DAG to None.
I want to modify it such that the clean step only runs if another dag "dag2" is not running at the moment.
If you have two DAGs and you don't want them to run at the same time, to avoid a conflict on an external server/service, you can use one of the first two options, or give the tasks of the first DAG a higher priority and use the same pool (with 1 slot) for the conflicting tasks. You will lose parallelism on these tasks, though.
Hossein's approach is the way people usually go. However, if you want information about any DAG run, you can use Airflow's own functionality to get it. The following approach is good when you do not want to (or are not allowed to) modify another DAG:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
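To turn this check into the wait_for_dag2 step, the decision logic could be wrapped in small functions and used, for instance, as the python_callable of a PythonSensor. This is a sketch with hypothetical helper names; it assumes each run exposes state and execution_date as in the snippet above, and that a running DAG run reports the state "running".

```python
def latest_run(dag_runs):
    """Return the run with the most recent execution_date, or None if there is none."""
    return max(dag_runs, key=lambda run: run.execution_date, default=None)


def other_dag_is_idle(dag_runs):
    """True if none of the given runs is currently in the 'running' state."""
    return all(run.state != "running" for run in dag_runs)
```

A sensor callable would then return other_dag_is_idle(DagRun.find(dag_id='dag2')), so that clean only starts once dag2 has no active run.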

How to find out the delayed jobs in airflow

Some of my DAGs are waiting to get scheduled, and some are waiting in the queue. I suspect there are reasons for this delay, but I am not sure how to start debugging the problem. The majority of the pipelines run Spark jobs.
Can someone give me some directions on where to look to 1) analyse which DAGs were delayed (did not start at the scheduled time) and 2) find out whether the resources are sufficient? I'm quite new to scheduling in Airflow. Many thanks. Please let me know if I can describe the question better.
If you are looking for code that takes advantage of Airflow's wider capabilities, there are three modules within airflow.models that can be harnessed.
To programmatically retrieve all DAGs that your Airflow instance is aware of, we import DagBag. From the docs: "A dagbag is a collection of dags, parsed out of a folder tree and has high..."
We use DagModel and its get_current method to load the model for each dag_id present in our bag.
We check whether each DAG is active using the DagModel property is_paused.
We retrieve the DAG runs using DagRun.find.
We sort the individual DAG runs from latest to earliest.
Here you could just take element [0] to get the latest run; however, for debugging purposes I loop through them all.
DagRun returns a lot of information for us to use. In my loop I output print(i, run.state, run.execution_date, run.start_date), so you can see what is going on under the hood. The available fields are:
id
state
dag_id
queued_at
execution_date
start_date
end_date
run_id
data_interval_start
data_interval_end
last_scheduling_decision
I have commented out an if check for any queued Dags for you to uncomment. Additionally you can do some arithmetic on dates if you desire, to add further conditional functionality.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import DagBag, DagModel, DagRun
from airflow.operators.python import PythonOperator

# make a function that returns if a DAG is set to active or paused
def check_dag_active():
    bag = DagBag()
    for dag_id in bag.dags:
        in_bag = DagModel.get_current(dag_id)
        if not in_bag.is_paused:
            latest = DagRun.find(dag_id=dag_id)
            latest.sort(key=lambda x: x.execution_date, reverse=True)
            for i, run in enumerate(latest):
                print(i, run.state, run.execution_date, run.start_date)
                # if run.state == 'queued':
                #     return [run.dag_id, run.execution_date, run.start_date]

with DAG(
    'stack_overflow_ans_3',
    tags = ['SO'],
    start_date = datetime(2022, 1, 1),
    schedule_interval = None,
    catchup = False,
    is_paused_upon_creation = False
) as dag:
    t1 = PythonOperator(
        task_id = 'task_that_will_fail',
        python_callable = check_dag_active
    )
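As an illustration of the date arithmetic mentioned above, one rough measure of delay is the time a run spent between queued_at and start_date, both of which appear in the field list. The helpers below are a sketch with hypothetical names and an arbitrary 5-minute threshold; they are not part of the answer's code.

```python
from datetime import timedelta


def queue_delay(run):
    """Time between a run being queued and starting, or None if either is unknown."""
    if run.queued_at is None or run.start_date is None:
        return None
    return run.start_date - run.queued_at


def delayed_runs(runs, threshold=timedelta(minutes=5)):
    """Return (dag_id, run_id, delay) for runs that waited longer than threshold."""
    result = []
    for run in runs:
        delay = queue_delay(run)
        if delay is not None and delay > threshold:
            result.append((run.dag_id, run.run_id, delay))
    return result
```

Calling delayed_runs(DagRun.find(dag_id=dag_id)) inside the loop above would list only the runs worth a closer look.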
Depending on your version of Airflow and your setup, you should be able to query the Airflow DB directly to get this information.
If you're using Airflow 1.x, there should be an "Ad Hoc Query" executor in the Data Profiling tab in the UI. This was disabled in 2.x though, so if you're running 2.x you'll need to connect directly to your Airflow DB using psql or something similar (this differs from Google to AWS to Docker).
Once you're in, check out this link for some queries on DAG runtime.

Can an Airflow task dynamically generate a DAG at runtime?

I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
Sensor_DAG (FileSensor -> CreateDAGTask)
|-> File1_DAG (Task1 -> Task2 -> ...)
|-> File2_DAG (Task1 -> Task2 -> ...)
In my initial implementation, CreateDAGTask was a PythonOperator that created DAG globals, by placing them in the global namespace (see this SO answer), like so:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime, timedelta
from pathlib import Path

UPLOAD_LOCATION = "/opt/files/uploaded"

# Dynamic DAG generation task code, for the Sensor_DAG below
def generate_dags_for_files(location=UPLOAD_LOCATION, **kwargs):
    dags = []
    for filepath in Path(location).glob('*'):
        dag_name = f"process_{filepath.name}"
        dag = DAG(dag_name, schedule_interval="@once", default_args={
            "depends_on_past": True,
            "start_date": datetime(2020, 7, 15),
            "retries": 1,
            "retry_delay": timedelta(hours=12)
        }, catchup=False)
        dag_task = DummyOperator(dag=dag, task_id=f"start_{dag_name}")
        dags.append(dag)
        # Try to place the DAG into globals(), which doesn't work
        globals()[dag_name] = dag
    return dags
The main DAG then invokes this logic via a PythonOperator:
# File-sensing DAG
default_args = {
    "depends_on_past": False,
    "start_date": datetime(2020, 7, 16),
    "retries": 1,
    "retry_delay": timedelta(hours=5),
}

with DAG("Sensor_DAG", default_args=default_args,
         schedule_interval="50 * * * *", catchup=False) as sensor_dag:
    start_task = DummyOperator(task_id="start")
    stop_task = DummyOperator(task_id="stop")
    sensor_task = FileSensor(task_id="my_file_sensor_task",
                             poke_interval=60,
                             filepath=UPLOAD_LOCATION)
    process_creator_task = PythonOperator(
        task_id="process_creator",
        python_callable=generate_dags_for_files,
    )
    start_task >> sensor_task >> process_creator_task >> stop_task
But that doesn't work, because by the time process_creator_task runs, the globals have already been parsed by Airflow. New globals after parse time are irrelevant.
Interim solution
Per "Airflow dynamic DAG and task Ids", I can achieve what I'm trying to do by omitting the FileSensor task altogether and just letting Airflow generate the per-file DAG at each scheduler heartbeat, replacing the Sensor_DAG with a plain call to generate_dags_for_files (update: never mind; while this does create a DAG in the dashboard, actual execution runs into the "DAG seems to be missing" issue):
generate_dags_for_files()
This does mean that I can no longer regulate the frequency of folder polling with the poke_interval parameter of FileSensor; instead, Airflow will poll the folder every time it collects DAGs.
Is that the best pattern here?
Other related StackOverflow threads
Run Airflow DAG for each file and Airflow: Proper way to run DAG for each file: identical use case, but the accepted answer uses two static DAGs, presumably with different parameters.
Proper way to create dynamic workflows in Airflow - accepted answer dynamically creates tasks, not DAGs, via a complicated XCom setup.
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
That said, here are the approaches that would accomplish what the question is asking, no matter how bad an idea it is, in increasing order of ham-handedness:
If you dynamically generate DAGs from a Variable (like so), modify the Variable.
If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
Use something like Jinja templating to write a new Python file in the dags/ folder.
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
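The last of the three approaches can be sketched as follows. To keep the example dependency-free, the standard library's string.Template stands in for Jinja here; the template contents, the helper names, and the dag_id sanitization rule are all assumptions made for illustration, and the generated file uses the Airflow 1.x import paths from the question.

```python
import re
from pathlib import Path
from string import Template

# Source of the generated DAG file; $dag_id is filled in per uploaded file.
DAG_TEMPLATE = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    "$dag_id",
    schedule_interval=None,
    start_date=datetime(2020, 7, 15),
    catchup=False,
) as dag:
    start = DummyOperator(task_id="start")
''')


def dag_id_for(file_name):
    """Derive a DAG id containing only letters, digits, and underscores."""
    return "process_" + re.sub(r"[^A-Za-z0-9_]", "_", file_name)


def write_dag_file(file_name, dags_folder):
    """Write a DAG definition for file_name into dags_folder and return its path."""
    dag_id = dag_id_for(file_name)
    target = Path(dags_folder) / f"{dag_id}.py"
    target.write_text(DAG_TEMPLATE.substitute(dag_id=dag_id))
    return target
```

A task calling write_dag_file for each uploaded file would leave a stable DAG definition on disk, which is what the next DAG collection needs in order to pick it up.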
Airflow is well suited for building DAGs dynamically, as pointed out by its creator:
https://youtu.be/Fvu2oFyFCT0?t=411 (P.S. thanks to @Yiannis for the video reference.)
Here is an example of how this could be accomplished:
https://docs.astronomer.io/learn/dynamically-generating-dags

Airflow - Get start time of dag run

Is it possible to get the actual start time of a dag in Airflow? By start time I mean the exact time the first task of a dag starts running.
I know I can use macros to get the execution date. If the job is run using trigger_dag, this is what I would call a start time, but if the job is run on a daily schedule, then {{ execution_date }} returns yesterday's date.
I have also tried to place datetime.now().isoformat() in the body of the dag code and then pass it to a task but this seems to return the time the task is first called rather than when the dag itself started.
{{ dag_run.start_date }} provides the actual start time of the dag
This is an old question, but I am answering it because the accepted answer did not work for me. {{ dag_run.start_date }} changes if the DAG run fails and some tasks are retried.
The solution was to use: {{ dag_run.get_task_instance('start').start_date }} which uses the start date of the first task (DummyOperator task with task_id: start).
Going by what you stated:
By start time I mean the exact time the first task of a dag starts running
You can still do this with macros on your first task; try {{ task.start_date }}.
All the variables can be found in TaskInstance class:
https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L746
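If the DAG has no single, known first task such as start, one alternative is to take the earliest start_date across all task instances of the run. This is a sketch with a hypothetical helper name; it assumes the task instances come from dag_run.get_task_instances() and that tasks which have not started yet have start_date set to None.

```python
def first_task_start(task_instances):
    """Earliest start_date among task instances that have started, else None."""
    started = [ti.start_date for ti in task_instances if ti.start_date is not None]
    return min(started, default=None)
```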

How to define Airflow DAG/task that shouldn't run periodically

The goal is pretty simple: I need to create a DAG for a manual task that should not run periodically, but only when admin presses the "Run" button. Ideally without a need to switch "unpause" and "pause" the DAG (you know someone will surely forget to pause).
So far I only came with schedule_interval="0 0 30 2 *" (30th Feb hopefully never occurs), but there must be a better way!
Is there?
Based on the documentation, you can set the schedule preset to None ("Don't schedule, use for exclusively 'externally triggered' DAGs"). Alternatively, you can set it to @once to schedule the DAG once and only once.
Set schedule_interval=None.
For example:
from datetime import datetime

from airflow import models

with models.DAG(
    'your_dag',
    schedule_interval=None,  # never scheduled; runs only when triggered
    start_date=datetime(2021, 1, 1)
) as dag:
    ...

Resources