How to show DAG progress from Apache Airflow to a custom webapp - airflow

Context
I'm building a React webapp to help users start certain jobs on a VM. These jobs can be of different types (for example, downloading data followed by processing it). Each of these jobs can take multiple hours to complete. Whenever a user clicks a button, I send a POST request to the Airflow webserver REST API (sitting on my VM) to trigger a DAG run.
Since I have access to Airflow's UI, I can see the progress of the DAG run and also the progress of each task in the DAG run.
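For reference, the trigger call from the backend looks roughly like this (a minimal sketch only: the host, credentials, and basic-auth setup are placeholder assumptions, the Airflow 2.x stable REST API is assumed, and the dag_id matches the DAG shown further down):
import requests

# Trigger a DAG run for the job the user requested (placeholder host/credentials).
resp = requests.post(
    "http://my-vm:8080/api/v1/dags/some_id/dagRuns",
    auth=("airflow_user", "airflow_password"),       # assumes basic auth is enabled
    json={"conf": {"requested_by": "webapp-user"}},  # optional run configuration
    timeout=10,
)
resp.raise_for_status()
run_id = resp.json()["dag_run_id"]  # kept so the webapp can poll this run's progress later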
Problem:
I want to show the progress and ETA of the DAG run (which was triggered by a user) in the React webapp for the users to see, but I'm not sure how to do it.
Current Solution:
For now, I'm using a workaround: I set Variables as "checkpoints" so that I can keep polling the Variables endpoint to show progress to the user.
code example:
dag.py
from airflow import DAG
from datetime import timedelta
from datetime import datetime
from airflow.operators.python import PythonOperator
from airflow.models import Variable
import json


def task1_callable(**context):
    dag_run_id = context["dag_run"].run_id
    # Some code
    Variable.set(dag_run_id, json.dumps({"result": "first sub task completed"}))
    # Some code
    Variable.set(dag_run_id, json.dumps({"result": "second sub task completed"}))
    # Some code
    Variable.set(dag_run_id, json.dumps({"result": "task1 completed"}))


def task2_callable(**context):
    dag_run_id = context["dag_run"].run_id
    # Some code
    Variable.set(dag_run_id, json.dumps({"result": "first sub task completed"}))
    # Some code
    Variable.set(dag_run_id, json.dumps({"result": "second sub task completed"}))
    # Some code
    Variable.set(dag_run_id, json.dumps({"final": True, "result": "task2 completed and your job is also finished"}))


with DAG(
    dag_id="some_id",
    schedule_interval=None,
    default_args={
        "owner": "airflow",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
        "start_date": datetime(2021, 1, 1),
    },
    catchup=False,
) as f:
    task1 = PythonOperator(
        task_id="task1",
        python_callable=task1_callable,
        run_as_user="user",
    )
    task2 = PythonOperator(
        task_id="task2",
        python_callable=task2_callable,
        run_as_user="user",
    )
    task1 >> task2
Now, on the React side, I'm just polling the Variables endpoint (/variables/{variable_key}) every 10 seconds so that I can show the user custom messages conveying their job's progress.
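For illustration, the polling logic amounts to roughly the following (a minimal sketch in Python rather than the actual React client; the host, credentials, and error handling are placeholder assumptions):
import json
import time

import requests


def poll_progress(run_id: str) -> None:
    """Poll the Variable keyed by the run_id and surface each checkpoint message."""
    while True:
        resp = requests.get(
            f"http://my-vm:8080/api/v1/variables/{run_id}",
            auth=("airflow_user", "airflow_password"),  # assumes basic auth is enabled
            timeout=10,
        )
        if resp.status_code == 404:  # the first checkpoint may not have been written yet
            time.sleep(10)
            continue
        resp.raise_for_status()
        checkpoint = json.loads(resp.json()["value"])  # the JSON written by Variable.set
        print(checkpoint["result"])                    # shown to the user in the real app
        if checkpoint.get("final"):
            break
        time.sleep(10)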
But the main problem with this approach is that it is hard to maintain and does not scale. So I'm looking for the right way to share DAG progress with my users.
Thanks for reading this so far. Please let me know if I can provide any more details to improve clarity.

Related

Airflow - prevent DAG from running immediately during import

I have a DAG with the following steps:
Retrieve a list of items from an API call
For each item in the list, spin up another task that prints the value.
Basically, Step 2 is not deterministic until the API call is made. I want the API call to be made only after I trigger a DAG run.
However, Step 1 of the DAG is executed while the DAG itself is being imported, and if the API call is not working, the DAG is reported as broken. The entire thing is supposed to be dynamic.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import requests

# Default args for the DAG
default_args = {
    'owner': 'me',
    'start_date': datetime(2025, 1, 1),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Create a DAG instance
dag = DAG(
    'my_dag_id',
    default_args=default_args,
    schedule=None,
)


def get_items():
    """
    Makes a HTTP request to an API,
    retrieves a list of items from the response,
    and returns the list
    """
    response = requests.get('https://api.example.com/items')
    items = response.json()['items']
    return items


def process_item(item):
    """
    Processes a single item
    """
    print(f'Processing item {item}')


# Create a PythonOperator to get the items
get_items_task = PythonOperator(
    task_id='get_items',
    python_callable=get_items,
    dag=dag,
)

# Create a PythonOperator to process each item
for item in get_items():
    task = PythonOperator(
        task_id=f'process_item_{item}',
        python_callable=process_item,
        op_args=[item],
        dag=dag,
    )
    task.set_upstream(get_items_task)
Notice that I have set the start date in the future and schedule=None.
As soon as I save this .py file in the /dags folder, it immediately executes the get_items call and reports that the DAG is broken because the API call returned an error.
How can I stop the task from being executed while the DAG is imported?
I want it to be dynamic, i.e., fetch the list of items only once the DAG is triggered, and only then create tasks for each of those items dynamically.
You're calling get_items() in the global scope of the DAG file (the statement for item in get_items():). This gets evaluated every time Airflow parses the DAG file.
To avoid get_items() being executed in the global scope, you can place this functionality inside a task so that tasks are only generated at runtime. For exactly this use case, dynamic task mapping was introduced in Airflow. It allows you to generate a varying number of tasks from a collection of items.
I've refactored your DAG to generate tasks in the process_item task given the output of get_items:
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.decorators import task

# Default args for the DAG
default_args = {
    "owner": "me",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Create a DAG instance
with DAG(
    "my_dag_id",
    default_args=default_args,
    start_date=datetime(2025, 1, 1),
    schedule=None,
):

    @task
    def get_items():
        """
        Makes a HTTP request to an API,
        retrieves a list of items from the response,
        and returns the list
        """
        response = requests.get("https://api.example.com/items")
        items = response.json()["items"]
        return items

    @task
    def process_item(item):
        """
        Processes a single item
        """
        print(f"Processing item {item}")

    process_item.expand(item=get_items())
expand() generates a task for each element in the output of get_items(). The TaskFlow API (the @task decorator) is convenient when dealing with dynamically generated tasks; read more about it in the docs.
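As a side note, classic operators can be mapped dynamically as well. Below is a minimal sketch, assuming Airflow 2.4+ and a hypothetical DAG id, where .partial() holds the constant arguments and .expand() the mapped one:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("map_classic_operator_example", start_date=datetime(2025, 1, 1), schedule=None):
    # One mapped task instance is created per element of the bash_command list.
    BashOperator.partial(task_id="process_item").expand(
        bash_command=["echo 1", "echo 2", "echo 3"]
    )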

When do I use a task vs a DAG?

I'm struggling to understand the difference between a task and a DAG and when to use one over the other. I know a task is more granular and called within a DAG, but so much of Airflow documentation mentions creating DAGs on the go or calling other DAGs instead of tasks. Is there any significant difference between using either of these two options?
A DAG is a collection of tasks with schedule information. Each task can perform different work based on your requirements. Consider the DAG code below as an example: it prints the current time and then sends an e-mail notification.
# importing operators and modules
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # to call a python object
from airflow.operators.email_operator import EmailOperator  # to send email
from datetime import datetime, timedelta, timezone  # to play with date and time
import dateutil

# setting default arguments
default_args = {
    'owner': 'test dag',
    'depends_on_past': False,
    'start_date': datetime(2021, 1, 1),
    'email': ['myemailid@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0
}


def print_time(**context):
    now_utc = datetime.now(timezone.utc)
    print("current time", now_utc)


with DAG('example_dag', schedule_interval='0 12 * * *', max_active_runs=1,
         catchup=False, default_args=default_args) as dag:  # dag name is 'example_dag'
    current_time = PythonOperator(
        task_id='current_time',
        python_callable=print_time,
        provide_context=True,
        dag=dag)  # task to call the print_time definition
    send_email = EmailOperator(
        task_id='send_email',
        to='myemailid@example.com',
        subject='DAG completed successfully',
        html_content="<p>Hi,<br><br>example DAG completed successfully<br>",
        dag=dag)  # task to send the email

    current_time >> send_email  # defining the task dependency
Here current_time and send_email are two different tasks performing different work. There is a dependency between them: the email has to be sent once the current time is printed, so we establish that task dependency at the end. We have also given a schedule_interval to run the DAG every day at 12 PM. Together, this forms a DAG.

Airflow DAG gets stuck in running state

I created a DAG and scheduled it to run daily.
It gets queued every day, but the tasks don't actually run.
This problem was already raised here in the past, but the answers didn't help me, so it seems there is another problem.
My code is shared below. I replaced the SQL of task t2 with a comment.
Each of the tasks runs successfully when I run it separately from the CLI using "airflow test ...".
Can you explain what should be done to make the DAG run?
Thanks!
This is the DAG code:
from datetime import timedelta, datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': 'true',
    'start_date': datetime(2018, 6, 25),
    'email': ['myemail@moovit.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('my_agg_table',
          default_args=default_args,
          schedule_interval="30 4 * * *"
          )

t1 = BigQueryOperator(
    task_id='bq_delete_my_agg_table',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    allow_large_results=True,
    bql='''
    delete `my_project.agg.my_agg_table`
    where date = '{{ macros.ds_add(ds, -1) }}'
    ''',
    dag=dag)

t2 = BigQueryOperator(
    task_id='bq_insert_my_agg_table',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    allow_large_results=True,
    bql='''
    #standardSQL
    Select ... the query continue here.....
    ''',
    destination_dataset_table='my_project.agg.my_agg_table',
    dag=dag)

t1 >> t2
It is usually very easy to find out why a task is not being run. In the Airflow web UI:
select any DAG of interest
now click on the task
again, click on Task Instance Details
In the first row there is a panel Task Instance State
The box Reason next to it shows why the task is being run - or why it is being ignored
It usually makes sense to check the first task that is not being executed. I also saw that you have set depends_on_past=True, which can lead to problems if used in the wrong scenario.
More on that here: Airflow 1.9.0 is queuing but not launching tasks

Airflow trigger_rule using ONE_FAILED causes DAG failure

What I want to achieve is a task that sends a notification if any one of the tasks in the DAG fails. I am applying a trigger rule to that task:
batch11 = BashOperator(
    task_id='Error_Buzz',
    trigger_rule=TriggerRule.ONE_FAILED,
    bash_command='python /home/admin/pythonwork/home/codes/notifications/dagLevel_Notification.py',
    dag=dag,
    catchup=False,
)

batch >> batch11
batch1 >> batch11
The problem is that when no other task fails, the batch11 task will not execute due to the trigger_rule, which is what I wanted, but it results in a DAG failure, since the default trigger_rule for the DAG is ALL_SUCCESS. Is there a way to close this loophole so that the DAG run succeeds?
We do something similar in our Airflow deployment. The idea is to notify Slack when a task in a DAG fails. You can set a DAG-level configuration on_failure_callback, as documented at https://airflow.apache.org/code.html#airflow.models.BaseOperator:
on_failure_callback (callable) – a function to be called when a task
instance of this task fails. a context dictionary is passed as a
single parameter to this function. Context contains references to
related objects to the task instance and is documented under the
macros section of the API.
Here is an example of how I use it. If any of the tasks fails or succeeds, Airflow calls the notify function and I can get a notification wherever I want.
import sys
import os
from datetime import datetime, timedelta

from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from airflow.utils.dates import days_ago

from util.airflow_utils import AirflowUtils

schedule = timedelta(minutes=5)

args = {
    'owner': 'user',
    'start_date': days_ago(1),
    'depends_on_past': False,
    'on_failure_callback': AirflowUtils.notify_job_failure,
    'on_success_callback': AirflowUtils.notify_job_success
}

dag = DAG(
    dag_id='demo_dag',
    schedule_interval=schedule, default_args=args)


def task1():
    return 'Whatever you return gets printed in the logs!'


def task2():
    return 'cont'


task1 = PythonOperator(task_id='task1',
                       python_callable=task1,
                       dag=dag)
task2 = PythonOperator(task_id='task2',
                       python_callable=task2,
                       dag=dag)

task1 >> task2
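For completeness, a callback such as AirflowUtils.notify_job_failure could look roughly like this (a hypothetical sketch: only the context dict that Airflow passes to the callback is assumed, and the notification transport is left open):
def notify_job_failure(context):
    """Called by Airflow with a context dict when a task instance fails."""
    ti = context["task_instance"]
    message = (
        f"Task {ti.task_id} in DAG {ti.dag_id} failed "
        f"(run {context['run_id']})."
    )
    # Forward `message` to Slack, e-mail, or any other channel here.
    print(message)
It would then be referenced from default_args via 'on_failure_callback', exactly as in the example above.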

Is it possible to have a pipeline in Airflow that is not tied to any schedule?

I need to have a pipeline that will be executed either manually or programmatically. Is that possible with Airflow? It looks like right now each workflow MUST be tied to a schedule.
Just set the schedule_interval to None when you create the DAG:
dag = DAG('workflow_name',
          template_searchpath='path',
          schedule_interval=None,
          default_args=default_args)
From the Airflow Manual:
Each DAG may or may not have a schedule, which informs how DAG Runs
are created. schedule_interval is defined as a DAG argument, and
receives preferably a cron expression as a str, or a
datetime.timedelta object.
The manual then goes on to list some cron 'presets' one of which is None.
Yes, this can be achieved by passing None to schedule_interval in default_args.
Check this documentation on DAG runs.
For example:
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': None,  # Check this line
}
In Airflow, every DAG is required to have a start date and schedule interval*, for example hourly:
import datetime

dag = DAG(
    dag_id='my_dag',
    schedule_interval=datetime.timedelta(hours=1),
    start_date=datetime.datetime(2018, 5, 23),
)
(Without a schedule how would it know when to run?)
As an alternative to a cron schedule, you can set the schedule to @once to run only once.
*One exception: You can omit the schedule for externally triggered DAGs because Airflow will not schedule them itself.
That said, if you omit the schedule, then you need to trigger the DAG externally somehow. If you want to be able to trigger a DAG programmatically, for instance as a result of a separate condition occurring in another DAG, you can do that with the TriggerDagRunOperator. You might also hear this idea called externally triggered DAGs.
Here's a usage example from the Airflow Example DAGs:
File 1 - example_trigger_controller_dag.py:
"""This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - DAG being triggered (in example_trigger_target_dag.py)
This example illustrates the following features :
1. A TriggerDagRunOperator that takes:
a. A python callable that decides whether or not to trigger the Target DAG
b. An optional params dict passed to the python callable to help in
evaluating whether or not to trigger the Target DAG
c. The id (name) of the Target DAG
d. The python callable can add contextual info to the DagRun created by
way of adding a Pickleable payload (e.g. dictionary of primitives). This
state is then made available to the TargetDag
2. A Target DAG : c.f. example_trigger_target_dag.py
"""
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from datetime import datetime
import pprint
pp = pprint.PrettyPrinter(indent=4)
def conditionally_trigger(context, dag_run_obj):
"""This function decides whether or not to Trigger the remote DAG"""
c_p = context['params']['condition_param']
print("Controller DAG : conditionally_trigger = {}".format(c_p))
if context['params']['condition_param']:
dag_run_obj.payload = {'message': context['params']['message']}
pp.pprint(dag_run_obj.payload)
return dag_run_obj
# Define the DAG
dag = DAG(dag_id='example_trigger_controller_dag',
default_args={"owner": "airflow",
"start_date": datetime.utcnow()},
schedule_interval='#once')
# Define the single task in this controller example DAG
trigger = TriggerDagRunOperator(task_id='test_trigger_dagrun',
trigger_dag_id="example_trigger_target_dag",
python_callable=conditionally_trigger,
params={'condition_param': True,
'message': 'Hello World'},
dag=dag)
File 2 - example_trigger_target_dag.py:
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime
import pprint

pp = pprint.PrettyPrinter(indent=4)

# This example illustrates the use of the TriggerDagRunOperator. There are 2
# entities at work in this scenario:
# 1. The Controller DAG - the DAG that conditionally executes the trigger
#    (in example_trigger_controller.py)
# 2. The Target DAG - DAG being triggered
#
# This example illustrates the following features:
# 1. A TriggerDagRunOperator that takes:
#    a. A python callable that decides whether or not to trigger the Target DAG
#    b. An optional params dict passed to the python callable to help in
#       evaluating whether or not to trigger the Target DAG
#    c. The id (name) of the Target DAG
#    d. The python callable can add contextual info to the DagRun created by
#       way of adding a Pickleable payload (e.g. dictionary of primitives). This
#       state is then made available to the TargetDag
# 2. A Target DAG : c.f. example_trigger_target_dag.py

args = {
    'start_date': datetime.utcnow(),
    'owner': 'airflow',
}

dag = DAG(
    dag_id='example_trigger_target_dag',
    default_args=args,
    schedule_interval=None)


def run_this_func(ds, **kwargs):
    print("Remotely received value of {} for key=message".
          format(kwargs['dag_run'].conf['message']))


run_this = PythonOperator(
    task_id='run_this',
    provide_context=True,
    python_callable=run_this_func,
    dag=dag)

# You can also access the DagRun object in templates
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: '
                 '{{ dag_run.conf["message"] if dag_run else "" }}" ',
    dag=dag)
