Implementing cross-DAG dependency in Apache airflow - airflow

I am trying to implement DAG dependency between 2 DAGs say A and B. DAG A runs once every hour and DAG B runs every 15 mins.
Each time DAG B starts it's run I want to make sure DAG A is not in running state.
If DAG A is found to be running then DAG B has to wait until DAG A completes the run.
If DAG A is not running, DAG B can proceed with it's tasks.
DAG A :
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'dependency',
'depends_on_past': False,
'start_date': datetime(2020, 9, 10, 10, 1),
'email': ['xxxx.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_A', schedule_interval='0/60 * * * *',max_active_runs=1, catchup=False,
default_args=default_args) as dag:
task1 = DummyOperator(task_id='task1', retries=1, dag=dag)
task2 = DummyOperator(task_id='task2', retries=1, dag=dag)
task3 = DummyOperator(task_id='task3', retries=1, dag=dag)
task1 >> task2 >> task3
DAG B:
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'dependency',
'depends_on_past': False,
'start_date': datetime(2020, 9, 10, 10, 1),
'email': ['xxxx.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_B', schedule_interval='0/15 * * * *',max_active_runs=1, catchup=False,
default_args=default_args) as dag:
task4 = DummyOperator(task_id='task4', retries=1, dag=dag)
task5 = DummyOperator(task_id='task5', retries=1, dag=dag)
task6 = DummyOperator(task_id='task6', retries=1, dag=dag)
task4 >> task5 >> task6
I have tried using ExternalTaskSensor operator. I am unable to understand if the sensor finds DAG A to be in success state it triggers the next task else wait for the task to complete.
Thanks in advance.

I think the only way you can achieve that in "general" way is to use some external locking mechanism
You can achieve quite a good approximation though using pools:
https://airflow.apache.org/docs/apache-airflow/1.10.3/concepts.html?highlight=pool
if you set pool size to 1 and assign both dag A and B to the pool, only one of those can be running at a time. You can also add priority_weight in the way that you see best - in case you need to prioritise A over B or the other way round.

You could use ExternalTaskSensor to achieve what you are looking for. The key aspect is to initialize this sensor with the correct execution_date, being that in your example the execution_date of the last DagRun of DAG_A.
Check this example where DAG_A runs every 9 minutes for 200 seconds. DAG_B runs every 3 minutes and runs for 30 seconds. These values are arbitrary and only for demo purpose, could be pretty much anything.
DAG A (nothing new here):
import time
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
def _executing_task(**kwargs):
print("Starting task_a")
time.sleep(200)
print("Completed task_a")
dag = DAG(
dag_id="example_external_task_sensor_a",
default_args={"owner": "airflow"},
start_date=days_ago(1),
schedule_interval="*/9 * * * *",
tags=['example_dags'],
catchup=False
)
with dag:
start = DummyOperator(
task_id='start')
task_a = PythonOperator(
task_id='task_a',
python_callable=_executing_task,
)
chain(start, task_a)
DAG B:
import time
from airflow import DAG
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor
def _executing_task():
time.sleep(30)
print("Completed task_b")
#provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
dag_a_last_run = get_last_dagrun(
'example_external_task_sensor_a', session)
print(dag_a_last_run)
print(f"EXEC DATE: {dag_a_last_run.execution_date}")
return dag_a_last_run.execution_date
dag = DAG(
dag_id="example_external_task_sensor_b",
default_args={"owner": "airflow"},
start_date=days_ago(1),
schedule_interval="*/3 * * * *",
tags=['example_dags'],
catchup=False
)
with dag:
start = DummyOperator(
task_id='start')
wait_for_dag_a = ExternalTaskSensor(
task_id='wait_for_dag_a',
external_dag_id='example_external_task_sensor_a',
allowed_states=['success', 'failed'],
execution_date_fn=_get_execution_date_of_dag_a,
poke_interval=30
)
task_b = PythonOperator(
task_id='task_b',
python_callable=_executing_task,
)
chain(start, wait_for_dag_a, task_b)
We are using the param execution_date_fn of the ExternalTaskSensor in order to obtain the execution_date of the last DagRun of the DAG_A, if we don't do so, it will wait for DAG_A with the same execution_date as the actual run of DAG_B which may not exists in many cases.
The function _get_execution_date_of_dag_a does a query to the metadata DB to obtain the exec_date by using get_last_dagrun from Airflow models.
Finally the other important parameter is allowed_states=['success', 'failed'] where we are telling it to wait until DAG_A is found in one of those states (i.e if it is in running state will keep executing poke).
Try it out and let me know if it worked for you!.

Related

View on_failure_callback DAG logger

Let's take an example DAG.
Here is the code for it.
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator
def task_failure_notification_alert(context):
logging.info("Task context details: %s", str(context))
def dag_failure_notification_alert(context):
logging.info("DAG context details: %s", str(context))
def red_exception_task(ti: TaskInstance, **kwargs):
raise Exception('red')
default_args = {
"owner": "analytics",
"start_date": datetime(2021, 12, 12),
'retries': 0,
'retry_delay': timedelta(),
"schedule_interval": "#daily"
}
dag = DAG('logger_dag',
default_args=default_args,
catchup=False,
on_failure_callback=dag_failure_notification_alert
)
start_task = DummyOperator(task_id="start_task", dag=dag, on_failure_callback=task_failure_notification_alert)
red_task = PythonOperator(
dag=dag,
task_id='red_task',
python_callable=red_exception_task,
provide_context=True,
on_failure_callback=task_failure_notification_alert
)
end_task = DummyOperator(task_id="end_task", dag=dag, on_failure_callback=task_failure_notification_alert)
start_task >> red_task >> end_task
We can see two functions i.e. task_failure_notification_alert and dag_failure_notification_alert are being called in case of failures.
We can see logs in case of Task failure by the below steps.
We can see logs for the task as below.
but I am unable to find logs for the on_failure_callback of DAG anywhere in UI. Where can we see it?
Under airflow/logs find the "scheduler" folder, under it look for the specific date you ran the Dag for example 2022-12-03 and there you will see name of the dag_file.log.

Airflow tasks not gettin running

I am trying to run a simple BASHOperator task in Airflow. The DAG when trigerred manually lists the tasks in Tree and Graph view but the tasks are always in not started state.
I have restarted my Airflow scheduler. I am running Airflow on local host using a Kubectl image on Docker Compose.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'email': ['vijayraghunath21#gmail.com'],
'email_on_success': True,
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=2),
}
with DAG(
dag_id='bash_demo',
default_args=default_args,
description='Bash Demo',
start_date=datetime(2021, 1, 1),
# schedule_interval='0 2 * * *',
schedule_interval=None,
max_active_runs=1,
catchup=False,
tags=['bash_demo'],
) as dag:
dag.doc_md = __doc__
# Task 1
dummy_task = DummyOperator(task_id='dummy_task')
# Task 2
bash_task = BashOperator(
task_id='bash_task', bash_command="echo 'command executed from BashOperator'")
dummy_task >> bash_task
DAG Image
As shown on the image you added the DAG is set to off thus it's not running. You should click on the toggle button to set it to on.
This issue can be avoided in two ways:
Global solution- if you wills set dags_are_paused_at_creation = False in airflow.cfg - This will effect all DAGs in the system.
Local solution - if you will use is_paused_upon_creation in the DAG contractor:
with DAG(
dag_id='bash_demo',
...
is_paused_upon_creation=False,
) as dag:
This parameter specifies if the dag is paused when created for the first time. If the dag exists already, the parameter is being ignored.

How can I run all all neccessary DAGs / Do I need ExternalTaskSensor?

is it possible to run two dags at another time with the externalTaskSensor?
I have two DAGs.
DAG A runs every two hours
10 a.m. (successful)
12a.m. (failed)
2 p.m. (successful)
Dag B is depended on DAG A. DAG B waits for DAG A at 12 a.m and fails, because DAG A failed. But since DAG A was successful at 2 p.m., Dag B should suppose to run.
How can you implement this? With an ExternalTaskSensor?
I just have a small dummy, to try to understand it.
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.timezone import datetime
from datetime import datetime, timedelta
import airflow
source_dag = DAG(
dag_id='sensor_dag_source',
start_date = datetime(2020, 1, 20),
schedule_interval='* * * * *'
)
first_task = DummyOperator(task_id='first_task', dag=source_dag)
target_dag = DAG(
dag_id='sensor_dag_target',
start_date = datetime(2020, 1, 20),
schedule_interval='* * * * *'
)
task_sensor = ExternalTaskSensor(
dag=target_dag,
task_id='dag_sensor_source_sensor',
retries=100,
retry_delay=timedelta(seconds=30),
mode='reschedule',
external_dag_id='sensor_dag_source',
external_task_id='first_task'
)
first_task = DummyOperator(task_id='first_task', dag=target_dag)
task_sensor >> first_task
you can try and use TriggerDagRunOperator and trigger DAG B from DAG A
here is a full answer-
In airflow, is there a good way to call another dag's task?
there is another good post about it-
Wiring top-level DAGs together

Airflow - TriggerDagRunOperator Cross Check

I am trying to trigger one dag from another. I am using TriggerDagRunOperator for the same.
I have the following two dags.
Dag 1:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
def print_hello():
return 'Hello world!'
dag = DAG('dag_one', description='Simple tutorial DAG',
schedule_interval='0/15 * * * *',
start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
trigger = TriggerDagRunOperator(
task_id="test_trigger_dagrun",
trigger_dag_id="dag_two", # Ensure this equals the dag_id of the DAG to trigger
dag=dag,
)
dummy_operator >> hello_operator >> trigger
Dag 2:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
def print_hello():
return 'Hello XYZABC!'
dag = DAG('dag_two', description='Simple tutorial DAG',
schedule_interval='0 12 * * *',
start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
Going through the webserver, everything seems fine and running (ie: dag one is triggering dag two ).
My question is how to make sure or check that Dag 2 is actually triggered by Dag 1 and it is not triggered because of its schedule or any other manual action.
Basically, where I can find who triggered the Dag or how the Dag was triggered?
If you see Tree view of Dag 1, Dag 2 that was run by Dag 1 is seen as tasks in this view.
If you see Tree view of Dag 2, you can find AIRFLOW_CTX_DAG_RUN_ID=trig__YYYY_MM_DD... in View Log.
If this is scheduled, it should say
AIRFLOW_CTX_DAG_RUN_ID=scheduled__YYYY_MM_DDT...
You can compare the occurrence time of dag2 with the occurrence time of the triggle task in dag1

How to enable Subdag in Airflow?

In Airflow document, it is mentioned as below
"Subdags must have a schedule and be enabled
Even though subdags are triggered as part of a larger dag, if their schedule is set to None or ‘#once’, the subdag operator will succeed without doing anything".
But not clear, how we can enable the Subdags. Is there any way to enable the Subdag?
You can create a SubDAG like this:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
default_args = {
'email_on_failure': False,
'email_on_retry': False,
'start_date': datetime(2017, 12, 16),
}
schedule_interval = "#daily"
def create_subdag(main_dag, subdag_id):
subdag = DAG('{0}.{1}'.format(main_dag.dag_id, subdag_id),
default_args=default_args)
DummyOperator(
task_id='foo',
dag=subdag)
return subdag
main_dag = DAG(
dag_id='main_dag',
schedule_interval=schedule_interval,
default_args=default_args,
max_active_runs=1
)
my_subdag = SubDagOperator(
task_id='subdag',
dag=main_dag,
retries=3,
subdag=create_subdag(main_dag, 'subdag')
)

Resources