Overwrite Airflow DAG parameter using CLI - airflow

Given the following DAG:
import logging
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
dag = DAG(
dag_id="dag_foo",
start_date=datetime(2022, 2, 28),
default_args={"owner": "Airflow", "params": {"param_a": "foo"}},
schedule_interval="#once",
catchup=False
)
def log_dag_param(param):
logging.info(param)
with dag:
DummyOperator(task_id="dummy") >> PythonOperator(
python_callable=log_dag_param, op_args=[dag.params["param_a"]]
)
I'm wondering if there is any way to overwrite an existing DAG parameter using the CLI. I'm aware of the airflow.models.dagrun.DagRun.conf, --conf parameter and this approach but I'm looking how I could overwrite a DAG parameter instead of a conf value.

Related

View on_failure_callback DAG logger

Let's take an example DAG.
Here is the code for it.
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator
def task_failure_notification_alert(context):
logging.info("Task context details: %s", str(context))
def dag_failure_notification_alert(context):
logging.info("DAG context details: %s", str(context))
def red_exception_task(ti: TaskInstance, **kwargs):
raise Exception('red')
default_args = {
"owner": "analytics",
"start_date": datetime(2021, 12, 12),
'retries': 0,
'retry_delay': timedelta(),
"schedule_interval": "#daily"
}
dag = DAG('logger_dag',
default_args=default_args,
catchup=False,
on_failure_callback=dag_failure_notification_alert
)
start_task = DummyOperator(task_id="start_task", dag=dag, on_failure_callback=task_failure_notification_alert)
red_task = PythonOperator(
dag=dag,
task_id='red_task',
python_callable=red_exception_task,
provide_context=True,
on_failure_callback=task_failure_notification_alert
)
end_task = DummyOperator(task_id="end_task", dag=dag, on_failure_callback=task_failure_notification_alert)
start_task >> red_task >> end_task
We can see two functions i.e. task_failure_notification_alert and dag_failure_notification_alert are being called in case of failures.
We can see logs in case of Task failure by the below steps.
We can see logs for the task as below.
but I am unable to find logs for the on_failure_callback of DAG anywhere in UI. Where can we see it?
Under airflow/logs find the "scheduler" folder, under it look for the specific date you ran the Dag for example 2022-12-03 and there you will see name of the dag_file.log.

Issues with importing airflow.operators.sensors.external_task import ExternalTaskSensor module and triggering external dag

I am trying to trigger multiple external dag dataflow job via master dag.
I plan to use TriggerDagRunOperator and ExternalTaskSensor . I have around 10 dataflow jobs - some are to be executed in sequence and some in parallel .
For example: I want to execute Dag dataflow jobs A,B,C etc from master dag and before execution goes next task I want to ensure the previous dag run has completed. But I am having issues with importing ExternalTaskSensor module.
Is their any alternative path to achieve this ?
Note: Each Dag eg A/B/C has 6- 7 task .Can ExternalTaskSensor check if the last task of dag A has completed before DAG B or C can start.
I Used the below sample code to run dag’s which uses ExternalTaskSensor, I was able to successfully import the ExternalTaskSensor module.
import time
from datetime import datetime, timedelta
from pprint import pprint
from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor
from airflow.utils.state import State
sensors_dag = DAG(
"test_launch_sensors",
schedule_interval=None,
start_date=datetime(2020, 2, 14, 0, 0, 0),
dagrun_timeout=timedelta(minutes=150),
tags=["DEMO"],
)
dummy_dag = DAG(
"test_dummy_dag",
schedule_interval=None,
start_date=datetime(2020, 2, 14, 0, 0, 0),
dagrun_timeout=timedelta(minutes=150),
tags=["DEMO"],
)
def print_context(ds, **context):
pprint(context['conf'])
with dummy_dag:
starts = DummyOperator(task_id="starts", dag=dummy_dag)
empty = PythonOperator(
task_id="empty",
provide_context=True,
python_callable=print_context,
dag=dummy_dag,
)
ends = DummyOperator(task_id="ends", dag=dummy_dag)
starts >> empty >> ends
with sensors_dag:
trigger = TriggerDagRunOperator(
task_id=f"trigger_{dummy_dag.dag_id}",
trigger_dag_id=dummy_dag.dag_id,
conf={"key": "value"},
execution_date="{{ execution_date }}",
)
sensor = ExternalTaskSensor(
task_id="wait_for_dag",
external_dag_id=dummy_dag.dag_id,
external_task_id="ends",
poke_interval=5,
timeout=120,
)
trigger >> sensor
In the above sample code, sensors_dag triggers tasks in dummy_dag using the TriggerDagRunOperator(). The sensors_dag will wait till the completion of the specified external_task in dummy_dag.

Implementing cross-DAG dependency in Apache airflow

I am trying to implement DAG dependency between 2 DAGs say A and B. DAG A runs once every hour and DAG B runs every 15 mins.
Each time DAG B starts it's run I want to make sure DAG A is not in running state.
If DAG A is found to be running then DAG B has to wait until DAG A completes the run.
If DAG A is not running, DAG B can proceed with it's tasks.
DAG A :
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'dependency',
'depends_on_past': False,
'start_date': datetime(2020, 9, 10, 10, 1),
'email': ['xxxx.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_A', schedule_interval='0/60 * * * *',max_active_runs=1, catchup=False,
default_args=default_args) as dag:
task1 = DummyOperator(task_id='task1', retries=1, dag=dag)
task2 = DummyOperator(task_id='task2', retries=1, dag=dag)
task3 = DummyOperator(task_id='task3', retries=1, dag=dag)
task1 >> task2 >> task3
DAG B:
from datetime import datetime,timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'dependency',
'depends_on_past': False,
'start_date': datetime(2020, 9, 10, 10, 1),
'email': ['xxxx.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
with DAG('DAG_B', schedule_interval='0/15 * * * *',max_active_runs=1, catchup=False,
default_args=default_args) as dag:
task4 = DummyOperator(task_id='task4', retries=1, dag=dag)
task5 = DummyOperator(task_id='task5', retries=1, dag=dag)
task6 = DummyOperator(task_id='task6', retries=1, dag=dag)
task4 >> task5 >> task6
I have tried using ExternalTaskSensor operator. I am unable to understand if the sensor finds DAG A to be in success state it triggers the next task else wait for the task to complete.
Thanks in advance.
I think the only way you can achieve that in "general" way is to use some external locking mechanism
You can achieve quite a good approximation though using pools:
https://airflow.apache.org/docs/apache-airflow/1.10.3/concepts.html?highlight=pool
if you set pool size to 1 and assign both dag A and B to the pool, only one of those can be running at a time. You can also add priority_weight in the way that you see best - in case you need to prioritise A over B or the other way round.
You could use ExternalTaskSensor to achieve what you are looking for. The key aspect is to initialize this sensor with the correct execution_date, being that in your example the execution_date of the last DagRun of DAG_A.
Check this example where DAG_A runs every 9 minutes for 200 seconds. DAG_B runs every 3 minutes and runs for 30 seconds. These values are arbitrary and only for demo purpose, could be pretty much anything.
DAG A (nothing new here):
import time
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
def _executing_task(**kwargs):
print("Starting task_a")
time.sleep(200)
print("Completed task_a")
dag = DAG(
dag_id="example_external_task_sensor_a",
default_args={"owner": "airflow"},
start_date=days_ago(1),
schedule_interval="*/9 * * * *",
tags=['example_dags'],
catchup=False
)
with dag:
start = DummyOperator(
task_id='start')
task_a = PythonOperator(
task_id='task_a',
python_callable=_executing_task,
)
chain(start, task_a)
DAG B:
import time
from airflow import DAG
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor
def _executing_task():
time.sleep(30)
print("Completed task_b")
#provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
dag_a_last_run = get_last_dagrun(
'example_external_task_sensor_a', session)
print(dag_a_last_run)
print(f"EXEC DATE: {dag_a_last_run.execution_date}")
return dag_a_last_run.execution_date
dag = DAG(
dag_id="example_external_task_sensor_b",
default_args={"owner": "airflow"},
start_date=days_ago(1),
schedule_interval="*/3 * * * *",
tags=['example_dags'],
catchup=False
)
with dag:
start = DummyOperator(
task_id='start')
wait_for_dag_a = ExternalTaskSensor(
task_id='wait_for_dag_a',
external_dag_id='example_external_task_sensor_a',
allowed_states=['success', 'failed'],
execution_date_fn=_get_execution_date_of_dag_a,
poke_interval=30
)
task_b = PythonOperator(
task_id='task_b',
python_callable=_executing_task,
)
chain(start, wait_for_dag_a, task_b)
We are using the param execution_date_fn of the ExternalTaskSensor in order to obtain the execution_date of the last DagRun of the DAG_A, if we don't do so, it will wait for DAG_A with the same execution_date as the actual run of DAG_B which may not exists in many cases.
The function _get_execution_date_of_dag_a does a query to the metadata DB to obtain the exec_date by using get_last_dagrun from Airflow models.
Finally the other important parameter is allowed_states=['success', 'failed'] where we are telling it to wait until DAG_A is found in one of those states (i.e if it is in running state will keep executing poke).
Try it out and let me know if it worked for you!.

Airflow - TriggerDagRunOperator Cross Check

I am trying to trigger one dag from another. I am using TriggerDagRunOperator for the same.
I have the following two dags.
Dag 1:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
def print_hello():
return 'Hello world!'
dag = DAG('dag_one', description='Simple tutorial DAG',
schedule_interval='0/15 * * * *',
start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
trigger = TriggerDagRunOperator(
task_id="test_trigger_dagrun",
trigger_dag_id="dag_two", # Ensure this equals the dag_id of the DAG to trigger
dag=dag,
)
dummy_operator >> hello_operator >> trigger
Dag 2:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
def print_hello():
return 'Hello XYZABC!'
dag = DAG('dag_two', description='Simple tutorial DAG',
schedule_interval='0 12 * * *',
start_date=datetime(2017, 3, 20), catchup=False)
dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)
dummy_operator >> hello_operator
Going through the webserver, everything seems fine and running (ie: dag one is triggering dag two ).
My question is how to make sure or check that Dag 2 is actually triggered by Dag 1 and it is not triggered because of its schedule or any other manual action.
Basically, where I can find who triggered the Dag or how the Dag was triggered?
If you see Tree view of Dag 1, Dag 2 that was run by Dag 1 is seen as tasks in this view.
If you see Tree view of Dag 2, you can find AIRFLOW_CTX_DAG_RUN_ID=trig__YYYY_MM_DD... in View Log.
If this is scheduled, it should say
AIRFLOW_CTX_DAG_RUN_ID=scheduled__YYYY_MM_DDT...
You can compare the occurrence time of dag2 with the occurrence time of the triggle task in dag1

Airflow : Publish a dynamically created dag

I want to be able to publish and trigger a DAG object from my code which is not in control of scheduler (viz. $AIRFLOW_HOME/dags folder)
My last resort would be to programmatically create a py file containing the DAG definition that I want to publish and save this file to the $AIRFLOW_HOME/dags folder.
I'm sure it should be easier than that.
Below is what I've tried.
import airflow
from airflow import DAG
from datetime import timedelta
from airflow.models import DagPickle
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.db import provide_session
#provide_session
def submit_dag(session=None):
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2)
}
dag = DAG(
dag_id='sample', default_args=args,
schedule_interval=None, start_date=airflow.utils.dates.days_ago(2),
dagrun_timeout=timedelta(minutes=60))
task = DummyOperator(task_id='one', dag=dag)
dag_pickle = DagPickle(task)
session.add(dag_pickle)
session.commit()
submit_dag()
The above code does create entries in dag_pickle table but how do I publish and later trigger this dag?
I can do pickle.dump(dag,open(DAGS_FOLDER/pickled_dags,'wb')) and have a file in DAGS FOLDER that would pickle.load(DAGS_FOLDER/pickled_dags)

Resources