Airflow - TriggerDagRunOperator Cross Check

I am trying to trigger one DAG from another, using TriggerDagRunOperator.
I have the following two DAGs.
Dag 1:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator


def print_hello():
    return 'Hello world!'


dag = DAG('dag_one', description='Simple tutorial DAG',
          schedule_interval='0/15 * * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)

hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    trigger_dag_id="dag_two",  # Ensure this equals the dag_id of the DAG to trigger
    dag=dag,
)

dummy_operator >> hello_operator >> trigger
Dag 2:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def print_hello():
    return 'Hello XYZABC!'


dag = DAG('dag_two', description='Simple tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)

hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

dummy_operator >> hello_operator
Going through the webserver, everything seems fine and running (i.e. DAG one is triggering DAG two).
My question is: how do I make sure, or check, that DAG 2 was actually triggered by DAG 1 and not by its own schedule or some other manual action?
Basically, where can I find who triggered the DAG, or how the DAG was triggered?

If you look at the Tree View of DAG 1, the runs of DAG 2 that were started by DAG 1 appear as tasks in that view.
If you look at the Tree View of DAG 2 and open View Log for a run, you will find AIRFLOW_CTX_DAG_RUN_ID=trig__YYYY_MM_DD... in the log.
If the run was scheduled instead, it will say
AIRFLOW_CTX_DAG_RUN_ID=scheduled__YYYY_MM_DDT...
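You can also check this from inside a task of DAG 2. A minimal sketch (the print_trigger_source task below is hypothetical and not part of the original DAGs): the run_id prefix and the external_trigger flag on the current DagRun tell you how the run was created.
# Hypothetical helper task for dag_two: logs how the current run was created.
from airflow.operators.python_operator import PythonOperator

def print_trigger_source(**context):
    dag_run = context['dag_run']
    # run_id starts with "trig__" for runs started by TriggerDagRunOperator and
    # "scheduled__" for scheduled runs (exact prefixes vary a bit across Airflow versions).
    print("run_id:", dag_run.run_id)
    # external_trigger is True for runs created by TriggerDagRunOperator, the UI or the CLI,
    # and False for runs created by the scheduler.
    print("external_trigger:", dag_run.external_trigger)

check_trigger = PythonOperator(
    task_id='print_trigger_source',
    python_callable=print_trigger_source,
    provide_context=True,  # needed on Airflow 1.10.x so the context is passed to the callable
    dag=dag,
)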

You can also compare the start time of the DAG 2 run with the start time of the trigger task in DAG 1.
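If you prefer to do that comparison programmatically rather than in the UI, here is a small sketch (assuming it runs in an environment with access to the Airflow metadata DB) that lists the runs of dag_two so their run ids and dates can be compared with the trigger task in dag_one:
# Hypothetical inspection script: list the runs of dag_two.
from airflow.models import DagRun

for dag_run in DagRun.find(dag_id='dag_two'):
    # The run_id prefix and the dates show how and when each run was created.
    print(dag_run.run_id, dag_run.execution_date, dag_run.start_date)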


View on_failure_callback DAG logger

Let's take an example DAG.
Here is the code for it.
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator


def task_failure_notification_alert(context):
    logging.info("Task context details: %s", str(context))


def dag_failure_notification_alert(context):
    logging.info("DAG context details: %s", str(context))


def red_exception_task(ti: TaskInstance, **kwargs):
    raise Exception('red')


default_args = {
    "owner": "analytics",
    "start_date": datetime(2021, 12, 12),
    'retries': 0,
    'retry_delay': timedelta(),
    "schedule_interval": "@daily"
}

dag = DAG('logger_dag',
          default_args=default_args,
          catchup=False,
          on_failure_callback=dag_failure_notification_alert
          )

start_task = DummyOperator(task_id="start_task", dag=dag, on_failure_callback=task_failure_notification_alert)

red_task = PythonOperator(
    dag=dag,
    task_id='red_task',
    python_callable=red_exception_task,
    provide_context=True,
    on_failure_callback=task_failure_notification_alert
)

end_task = DummyOperator(task_id="end_task", dag=dag, on_failure_callback=task_failure_notification_alert)

start_task >> red_task >> end_task
There are two callback functions here: task_failure_notification_alert is set as the on_failure_callback of the tasks, and dag_failure_notification_alert is set as the on_failure_callback of the DAG.
When a task fails, the output of the task-level callback shows up in that task's log in the UI.
But I am unable to find the logs for the DAG-level on_failure_callback anywhere in the UI. Where can I see them?
Under airflow/logs, find the "scheduler" folder; under it, look for the date you ran the DAG (for example 2022-12-03), and there you will see a log file named after the DAG file. The DAG-level on_failure_callback is executed by the scheduler, which is why its output ends up there instead of in the task logs.

Issues with importing ExternalTaskSensor (from airflow.operators.sensors.external_task) and triggering an external DAG

I am trying to trigger multiple external DAG dataflow jobs via a master DAG.
I plan to use TriggerDagRunOperator and ExternalTaskSensor. I have around 10 dataflow jobs; some are to be executed in sequence and some in parallel.
For example: I want to execute DAGs A, B, C, etc. from the master DAG, and before execution moves to the next task I want to ensure the previous DAG run has completed. But I am having issues with importing the ExternalTaskSensor module.
Is there any alternative way to achieve this?
Note: each DAG (e.g. A/B/C) has 6-7 tasks. Can ExternalTaskSensor check whether the last task of DAG A has completed before DAG B or C starts?
I used the sample code below to run DAGs that use ExternalTaskSensor, and I was able to import the ExternalTaskSensor module successfully.
import time
from datetime import datetime, timedelta
from pprint import pprint

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor
from airflow.utils.state import State

sensors_dag = DAG(
    "test_launch_sensors",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)

dummy_dag = DAG(
    "test_dummy_dag",
    schedule_interval=None,
    start_date=datetime(2020, 2, 14, 0, 0, 0),
    dagrun_timeout=timedelta(minutes=150),
    tags=["DEMO"],
)


def print_context(ds, **context):
    pprint(context['conf'])


with dummy_dag:
    starts = DummyOperator(task_id="starts", dag=dummy_dag)
    empty = PythonOperator(
        task_id="empty",
        provide_context=True,
        python_callable=print_context,
        dag=dummy_dag,
    )
    ends = DummyOperator(task_id="ends", dag=dummy_dag)

    starts >> empty >> ends

with sensors_dag:
    trigger = TriggerDagRunOperator(
        task_id=f"trigger_{dummy_dag.dag_id}",
        trigger_dag_id=dummy_dag.dag_id,
        conf={"key": "value"},
        execution_date="{{ execution_date }}",
    )
    sensor = ExternalTaskSensor(
        task_id="wait_for_dag",
        external_dag_id=dummy_dag.dag_id,
        external_task_id="ends",
        poke_interval=5,
        timeout=120,
    )
    trigger >> sensor
In the above sample code, sensors_dag triggers tasks in dummy_dag using the TriggerDagRunOperator(). The sensors_dag will wait till the completion of the specified external_task in dummy_dag.

Implementing cross-DAG dependency in Apache airflow

I am trying to implement a DAG dependency between 2 DAGs, say A and B. DAG A runs once every hour and DAG B runs every 15 minutes.
Each time DAG B starts its run, I want to make sure DAG A is not in a running state.
If DAG A is found to be running, then DAG B has to wait until DAG A completes its run.
If DAG A is not running, DAG B can proceed with its tasks.
DAG A:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('DAG_A', schedule_interval='0/60 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:

    task1 = DummyOperator(task_id='task1', retries=1, dag=dag)
    task2 = DummyOperator(task_id='task2', retries=1, dag=dag)
    task3 = DummyOperator(task_id='task3', retries=1, dag=dag)

    task1 >> task2 >> task3
DAG B:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('DAG_B', schedule_interval='0/15 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:

    task4 = DummyOperator(task_id='task4', retries=1, dag=dag)
    task5 = DummyOperator(task_id='task5', retries=1, dag=dag)
    task6 = DummyOperator(task_id='task6', retries=1, dag=dag)

    task4 >> task5 >> task6
I have tried using the ExternalTaskSensor operator, but I am unable to understand its behaviour: does the sensor trigger the next task as soon as it finds DAG A in a success state, or does it wait for the run to complete?
Thanks in advance.
I think the only way you can achieve that in a "general" way is to use some external locking mechanism.
You can get quite a good approximation, though, by using pools:
https://airflow.apache.org/docs/apache-airflow/1.10.3/concepts.html?highlight=pool
If you set the pool size to 1 and assign the tasks of both DAG A and DAG B to that pool, only one of them can run at a time. You can also add priority_weight in whatever way you see best, in case you need to prioritise A over B or the other way round.
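A minimal sketch of the pool idea (the pool name cross_dag_lock is made up here, and the pool has to be created first with 1 slot, e.g. via the UI under Admin -> Pools or with the airflow CLI):
# Hypothetical sketch: serialize DAG_A and DAG_B through a single-slot pool.
# Only one task assigned to "cross_dag_lock" can run at a time across both DAGs,
# which approximates mutual exclusion between them, as the answer notes.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('DAG_A', schedule_interval='0 * * * *', max_active_runs=1, catchup=False,
         start_date=datetime(2020, 9, 10)) as dag_a:
    # priority_weight lets DAG A's tasks win the slot when both DAGs are waiting.
    task1 = DummyOperator(task_id='task1', pool='cross_dag_lock', priority_weight=10)

with DAG('DAG_B', schedule_interval='0/15 * * * *', max_active_runs=1, catchup=False,
         start_date=datetime(2020, 9, 10)) as dag_b:
    task4 = DummyOperator(task_id='task4', pool='cross_dag_lock', priority_weight=1)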
You could use ExternalTaskSensor to achieve what you are looking for. The key aspect is to initialize this sensor with the correct execution_date, which in your example is the execution_date of the last DagRun of DAG_A.
Check this example, where DAG_A runs every 9 minutes for 200 seconds and DAG_B runs every 3 minutes for 30 seconds. These values are arbitrary and only for demo purposes; they could be pretty much anything.
DAG A (nothing new here):
import time

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def _executing_task(**kwargs):
    print("Starting task_a")
    time.sleep(200)
    print("Completed task_a")


dag = DAG(
    dag_id="example_external_task_sensor_a",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/9 * * * *",
    tags=['example_dags'],
    catchup=False
)

with dag:
    start = DummyOperator(
        task_id='start')

    task_a = PythonOperator(
        task_id='task_a',
        python_callable=_executing_task,
    )

    chain(start, task_a)
DAG B:
import time

from airflow import DAG
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor


def _executing_task():
    time.sleep(30)
    print("Completed task_b")


@provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
    dag_a_last_run = get_last_dagrun(
        'example_external_task_sensor_a', session)
    print(dag_a_last_run)
    print(f"EXEC DATE: {dag_a_last_run.execution_date}")
    return dag_a_last_run.execution_date


dag = DAG(
    dag_id="example_external_task_sensor_b",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/3 * * * *",
    tags=['example_dags'],
    catchup=False
)

with dag:
    start = DummyOperator(
        task_id='start')

    wait_for_dag_a = ExternalTaskSensor(
        task_id='wait_for_dag_a',
        external_dag_id='example_external_task_sensor_a',
        allowed_states=['success', 'failed'],
        execution_date_fn=_get_execution_date_of_dag_a,
        poke_interval=30
    )

    task_b = PythonOperator(
        task_id='task_b',
        python_callable=_executing_task,
    )

    chain(start, wait_for_dag_a, task_b)
We are using the execution_date_fn param of the ExternalTaskSensor in order to obtain the execution_date of the last DagRun of DAG_A. If we didn't, the sensor would wait for a DAG_A run with the same execution_date as the current run of DAG_B, which may not exist in many cases.
The function _get_execution_date_of_dag_a queries the metadata DB, using get_last_dagrun from the Airflow models, to obtain that execution date.
Finally, the other important parameter is allowed_states=['success', 'failed'], which tells the sensor to wait until DAG_A is found in one of those states (i.e. if it is in the running state the sensor will keep poking).
Try it out and let me know if it worked for you!

How can I run all necessary DAGs / Do I need ExternalTaskSensor?

Is it possible to run two DAGs at different times with the ExternalTaskSensor?
I have two DAGs.
DAG A runs every two hours:
10 a.m. (successful)
12 p.m. (failed)
2 p.m. (successful)
DAG B depends on DAG A. DAG B waits for DAG A at 12 p.m. and fails, because DAG A failed. But since DAG A was successful at 2 p.m., DAG B should then run as well.
How can you implement this? With an ExternalTaskSensor?
I just have a small dummy example to try to understand it.
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.sensors import ExternalTaskSensor
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.timezone import datetime
from datetime import datetime, timedelta
import airflow

source_dag = DAG(
    dag_id='sensor_dag_source',
    start_date=datetime(2020, 1, 20),
    schedule_interval='* * * * *'
)

first_task = DummyOperator(task_id='first_task', dag=source_dag)

target_dag = DAG(
    dag_id='sensor_dag_target',
    start_date=datetime(2020, 1, 20),
    schedule_interval='* * * * *'
)

task_sensor = ExternalTaskSensor(
    dag=target_dag,
    task_id='dag_sensor_source_sensor',
    retries=100,
    retry_delay=timedelta(seconds=30),
    mode='reschedule',
    external_dag_id='sensor_dag_source',
    external_task_id='first_task'
)

first_task = DummyOperator(task_id='first_task', dag=target_dag)

task_sensor >> first_task
You can try using TriggerDagRunOperator and trigger DAG B from DAG A.
Here is a full answer:
In airflow, is there a good way to call another dag's task?
There is another good post about it:
Wiring top-level DAGs together
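A minimal sketch of that suggestion (the DAG and task names here are made up): the trigger task sits at the end of DAG A, so DAG B is only started by runs of DAG A that reach it successfully, which covers the 2 p.m. case above.
# Hypothetical sketch: DAG A triggers DAG B only when its own work succeeded.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator

dag_a = DAG('dag_a', start_date=datetime(2020, 1, 20), schedule_interval='0 */2 * * *')

work = DummyOperator(task_id='work', dag=dag_a)

# By default this task only runs if its upstream task succeeded, so a failed
# 12 p.m. run of DAG A never triggers DAG B.
trigger_dag_b = TriggerDagRunOperator(
    task_id='trigger_dag_b',
    trigger_dag_id='dag_b',  # dag_b would live in its own file, typically with schedule_interval=None
    dag=dag_a,
)

work >> trigger_dag_b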

Airflow long-running hourly DAGs missing a few hours

My DAG is scheduled to run each hour. I'm pulling each hour of data from an S3 source and processing it. Sometimes the task takes more than an hour to complete, and in that case I'm missing an hour of data.
Example:
The 1:00 pm DAG run started and ran for 2 hours, so the next DAG run takes 3 (3 pm) as its parameter, missing the 2 pm data. In other words, how do I make sure the task runs every hour, i.e. 24 times a day?
Here is my DAG
import arrow
from datetime import timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# DAG_ID, DEFAULT_ARGS, EMRStep and emr are defined elsewhere in my project.

HOUR_PACIFIC = arrow.utcnow().shift(hours=-3).to('US/Pacific').format("HH")

dag = DAG(
    DAG_ID,
    catchup=False,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=5),
    schedule_interval='0 * * * *')

start = DummyOperator(
    task_id='Start',
    dag=dag)

my_task = EMRStep(emr,
                  'stg',
                  HOUR_PACIFIC)

end = DummyOperator(
    task_id='End',
    dag=dag
)

start >> my_task >> end
You need to pass catchup=True to the DAG object.
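For example, a minimal sketch that reuses the names from the snippet in the question (DAG_ID, DEFAULT_ARGS, timedelta) and only changes the relevant argument:
# With catchup=True, the scheduler backfills any hourly runs that were skipped
# while an earlier run was still executing, so no hour of data is missed.
dag = DAG(
    DAG_ID,
    catchup=True,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=5),
    schedule_interval='0 * * * *')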
This appears to be a perfect scenario for using TimeDeltaSensor
Note: the following code snippet is just for reference and has NOT been tested
import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.time_delta_sensor import TimeDeltaSensor
from airflow.utils.trigger_rule import TriggerRule

# create DAG object
my_dag: DAG = DAG(dag_id="my_dag",
                  start_date=datetime.datetime(year=2019, month=3, day=11),
                  schedule_interval="0 0 0 * * *")

# create dummy begin & end tasks
my_begin_task: DummyOperator = DummyOperator(dag=my_dag,
                                             task_id="my_begin_task")

my_end_task: DummyOperator = DummyOperator(dag=my_dag,
                                           task_id="my_end_task",
                                           trigger_rule=TriggerRule.ALL_DONE)

# populate the DAG
for i in range(1, 24, 1):
    # create sensors and actual tasks for all hours of the day
    my_time_delta_sensor: TimeDeltaSensor = TimeDeltaSensor(dag=my_dag,
                                                            task_id=f"my_time_delta_sensor_task_{i}_hours",
                                                            delta=datetime.timedelta(hours=i))
    my_actual_task: PythonOperator = PythonOperator(dag=my_dag,
                                                    task_id=f"my_actual_task_{i}_hours",
                                                    python_callable=my_callable
                                                    ..)
    # wire-up tasks together
    my_begin_task >> my_time_delta_sensor >> my_actual_task >> my_end_task
References
Apache Airflow: Delay a task for some period of time
Apache Airflow API Reference: TimeDeltaSensor
Cron Expression (Quartz) for a program to run every midnight at 12 am
