The requirement is to have the DAGs run one after the other, each starting only on success of the previous DAG.
I have a master DAG in which I call all the DAGs so that they get executed one after the other in sequence.
Also, in each of dag_A, dag_B and dag_C I have set schedule_interval=None and manually turned them ON in the GUI.
I am using ExternalTaskSensor because, even before all the tasks in the first dag_A complete, the second dag_B gets kicked off; the sensor is there to avoid such issues. If there is a better implementation, please let me know.
I don't know what I am missing here.
Code: master_dag.py
import datetime
import os
from datetime import timedelta
from airflow.models import DAG, Variable
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.sensors import ExternalTaskSensor
default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime(2020, 1, 7),
    'provide_context': True,
    'execution_timeout': None,
    'retries': 0,
    'retry_delay': timedelta(minutes=3),
    'retry_exponential_backoff': True,
    'email_on_retry': False,
}

dag = DAG(
    dag_id='master_dag',
    schedule_interval='7 3 * * *',
    default_args=default_args,
    max_active_runs=1,
    catchup=False,
)

trigger_dag_A = TriggerDagRunOperator(
    task_id='trigger_dag_A',
    trigger_dag_id='dag_A',
    dag=dag,
)

wait_for_dag_A = ExternalTaskSensor(
    task_id='wait_for_dag_A',
    external_dag_id='dag_A',
    external_task_id='proc_success',
    poke_interval=60,
    allowed_states=['success'],
    dag=dag,
)

trigger_dag_B = TriggerDagRunOperator(
    task_id='trigger_dag_B',
    trigger_dag_id='dag_B',
    dag=dag,
)

wait_for_dag_B = ExternalTaskSensor(
    task_id='wait_for_dag_B',
    external_dag_id='dag_B',
    external_task_id='proc_success',
    poke_interval=60,
    allowed_states=['success'],
    dag=dag,
)

trigger_dag_C = TriggerDagRunOperator(
    task_id='trigger_dag_C',
    trigger_dag_id='dag_C',
    dag=dag,
)
trigger_dag_A >> wait_for_dag_A >> trigger_dag_B >> wait_for_dag_B >> trigger_dag_C
Each of the DAGs has multiple tasks, with the last task being proc_success.
Background
ExternalTaskSensor works by polling the state of the DagRun / TaskInstance of the external DAG or task respectively (based on whether or not external_task_id is passed).
Now, since a single DAG can have multiple active DagRuns, the sensor must be told which of these runs / instances it is supposed to sense.
For that, it uses execution_date as the distinguishing criterion. This can be expressed in (only) one of the following two ways:
:param execution_delta: time difference with the previous execution to
look at, the default is the same execution_date as the current task or DAG.
For yesterday, use [positive!] datetime.timedelta(days=1). Either
execution_delta or execution_date_fn can be passed to
ExternalTaskSensor, but not both.
:type execution_delta: datetime.timedelta
:param execution_date_fn: function that receives the current execution date
and returns the desired execution dates to query. Either execution_delta
or execution_date_fn can be passed to ExternalTaskSensor, but not both.
:type execution_date_fn: callable
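For illustration, here is a minimal sketch of how either parameter could be supplied to one of your sensors; the one-hour offset is purely hypothetical and would have to match the actual difference between the two DAGs' execution_dates:
from datetime import timedelta

wait_for_dag_A = ExternalTaskSensor(
    task_id='wait_for_dag_A',
    external_dag_id='dag_A',
    external_task_id='proc_success',
    execution_delta=timedelta(hours=1),                      # fixed offset, or ...
    # execution_date_fn=lambda dt: dt - timedelta(hours=1),  # ... computed per run (pass only one of the two)
    poke_interval=60,
    allowed_states=['success'],
    dag=dag,
)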
The problem in your implementation
In your ExternalTaskSensors, you are not passing either of the execution_delta or execution_date_fn params;
as a result, the sensor picks up its own execution_date to poll for DagRuns of the child DAGs, thereby getting stuck (clearly the execution_date of your parent / orchestrator DAG would be different from that of the child DAGs).
@provide_session
def poke(self, context, session=None):
    if self.execution_delta:
        dttm = context['execution_date'] - self.execution_delta
    elif self.execution_date_fn:
        dttm = self.execution_date_fn(context['execution_date'])
    else:
        # if neither of the above is passed, use the current DAG's execution date
        dttm = context['execution_date']
Further tips
You can skip passing external_task_id; when you do that, the ExternalTaskSensor in effect becomes an ExternalDagSensor. This is particularly helpful when your child DAGs (A, B & C) have more than one end task (so that completion of any one of those end tasks doesn't guarantee the completion of the entire DAG).
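For example, a sketch of that DAG-level variant (the same sensor as in your master DAG, just without a concrete external_task_id):
wait_for_dag_A = ExternalTaskSensor(
    task_id='wait_for_dag_A',
    external_dag_id='dag_A',
    external_task_id=None,   # sense the whole DagRun rather than a single task
    poke_interval=60,
    allowed_states=['success'],
    dag=dag,
)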
Also have a look at this discussion: Wiring top-level DAGs together
EDIT-1
On an afterthought, my initial judgement appears to be wrong; in particular, the following statement doesn't hold true:
clearly the execution_date of your parent / orchestrator DAG would be
different from that of the child DAGs
Looking at the source, it becomes clear that TriggerDagRunOperator passes its own execution_date to the child DagRun, meaning that the ExternalTaskSensor should then be able to sense that DAG or its task.
trigger_dag(
    dag_id=self.trigger_dag_id,
    run_id=run_id,
    conf=self.conf,
    # own execution date passed to child DAG
    execution_date=self.execution_date,
    replace_microseconds=False,
)
So the earlier explanation doesn't hold.
I would suggest that you:
check the execution_date of your triggered child DAGs / of the tasks whose external_task_id you are passing, in the UI or by querying the meta-db,
and compare it with the execution_date of your orchestrator DAG;
that should clarify certain bits.
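If you prefer querying the meta-db programmatically rather than through the UI, a rough sketch (assuming Airflow 1.10.x models and the DAG ids from your code) could look like this:
from airflow.models import DagRun
from airflow.utils.db import create_session

# Print recent runs of the orchestrator and child DAGs side by side
# so their execution_dates can be compared.
with create_session() as session:
    runs = (
        session.query(DagRun)
        .filter(DagRun.dag_id.in_(['master_dag', 'dag_A', 'dag_B', 'dag_C']))
        .order_by(DagRun.execution_date.desc())
        .limit(20)
        .all()
    )
    for run in runs:
        print(run.dag_id, run.execution_date, run.state)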
Related
We have an ad-hoc Airflow DAG, which anyone from a team of 50+ can trigger to run manually.
We can check the Airflow audit logs to see who triggered the DAG via its dag id, and we can also get an email upon failure.
But we are more curious to know whether we can get an email upon DAG start OR at the start of each task run; this would help us understand and track the activity and usage / commands executed from the ad-hoc DAG.
@Khilesh Chauhan
There are a number of ways to achieve your intended outcome, in no particular order:
1. Task or Dag Level Callbacks
Official Callback Reference
We can take advantage of the on_success_callback callback, which can be harnessed in two distinct places.
# use it inside a specific operator, for task-level control
task = PythonOperator(
    task_id='your_task',
    python_callable=your_python_function,  # the task's normal callable
    on_success_callback=send_mail,
)

# or use it inside your DAG initiation, for DAG-level control
dag = DAG(
    dag_id='your_dag',
    on_failure_callback=send_mail,
)
We can write an example send_mail function, which leverages the send_email utility.
from airflow.utils.email import send_email

def send_mail(context):
    # callbacks are invoked with the task context as a single positional argument
    task = context['task_instance'].task
    subject = f'Airflow task has successfully completed: {task.task_id}'
    body = f'Hi, this is an alert to let you know that your task {task.task_id} has completed successfully.'
    send_email(
        dag.default_args['email'],  # assumes the module-level dag defines an 'email' default_arg
        subject,
        body,
    )
2. Add an EmailOperator to your DAG
Official Email Operator Reference
You could add an EmailOperator task at the beginning of your DAG.
from airflow.operators.email_operator import EmailOperator

email = EmailOperator(
    task_id='alert_DAG_start',
    to='your@email.com',
    subject='DAG Initiated - start {{ ds }}',
    html_content=""" <h1>Some Content</h1> """,
)
3. Create a function that uses a PythonOperator that executes send_email
You might need more control, such as including logging info; in that case a PythonOperator gives you the flexibility you need, as sketched below.
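A rough sketch of option 3 (the task id, callable name and recipient address are all placeholders, assuming a dag object as in option 1 and Airflow 1.x's provide_context):
import logging

from airflow.operators.python_operator import PythonOperator
from airflow.utils.email import send_email

def notify_start(**context):
    # log whatever extra details you need before sending the mail
    logging.info("DAG %s started, execution_date=%s",
                 context['dag'].dag_id, context['execution_date'])
    send_email(
        to='your@email.com',
        subject=f"DAG {context['dag'].dag_id} started",
        html_content="The DAG run has started.",
    )

notify = PythonOperator(
    task_id='notify_start',
    python_callable=notify_start,
    provide_context=True,  # needed on Airflow 1.x, implicit on 2.x
    dag=dag,
)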
I hope this helps you to resolve your problem.
Update
To answer your second question about getting the username, I have created a function for you to use. We import the session context manager and then use the .query method; for debugging purposes, I loop through the returned row. You can see the username at index 3 (the Log.owner column).
from airflow.models.log import Log
from airflow.utils.db import create_session

def return_user_name(**context):
    """
    Return the username that triggered the executed task's DAG.
    """
    dag_id = context['task_instance'].dag_id
    with create_session() as session:
        result = session.query(Log.dttm, Log.dag_id, Log.execution_date, Log.owner, Log.extra) \
            .filter(Log.dag_id == dag_id, Log.event == 'trigger').first()
        for index, value in enumerate(result):
            print(index, value)
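For example, you could wire the helper into the DAG as its first task (the task id is a placeholder, and a PythonOperator import plus a dag object are assumed), so the triggering user is printed at the start of every run:
log_trigger_user = PythonOperator(
    task_id='log_trigger_user',
    python_callable=return_user_name,
    provide_context=True,  # Airflow 1.x
    dag=dag,
)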
Now, I create multiple tasks using a variable like this and it works fine.
with DAG(....) as dag:
    body = Variable.get("config_table", deserialize_json=True)
    for i in range(len(body.keys())):
        simple_task = Operator(
            task_id='task_' + str(i),
            .....
But I need to use an XCom value for some reason instead of using a Variable.
Is it possible to dynamically create tasks with an XCom pull value?
I tried to set the value like this and it's not working:
body = "{{ ti.xcom_pull(key='config_table', task_ids='get_config_table') }}"
It's possible to dynamically create tasks from XComs generated from a previous task, there are more extensive discussions on this topic, for example in this question. One of the suggested approaches follows this structure, here is a working example I made:
sample_file.json:
{
    "cities": [ "London", "Paris", "BA", "NY" ]
}
Get your data from an API or file or any source. Push it as XCom.
def _process_obtained_data(ti):
    list_of_cities = ti.xcom_pull(task_ids='get_data')
    Variable.set(key='list_of_cities',
                 value=list_of_cities['cities'], serialize_json=True)

def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        # push to XCom using return
        return data

with DAG('dynamic_tasks_example', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    get_data = PythonOperator(
        task_id='get_data',
        python_callable=_read_file)
Add a second task which will pull from XCom and set a Variable with the data you will use to iterate over later on.
    preparation_task = PythonOperator(
        task_id='preparation_task',
        python_callable=_process_obtained_data)
*Of course, if you want, you can merge both tasks into one. I prefer not to, because usually I take a subset of the fetched data to create the Variable.
Read from that Variable and later iterate on it. It's critical to define default_var.
    end = DummyOperator(
        task_id='end',
        trigger_rule='none_failed')

    # Top-level code within DAG block
    iterable_list = Variable.get('list_of_cities',
                                 default_var=['default_city'],
                                 deserialize_json=True)
Declare dynamic tasks and their dependencies within a loop. Make the task_ids unique. TaskGroup is optional, but it helps you sort the UI.
    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:
        if iterable_list:
            for index, city in enumerate(iterable_list):
                say_hello = PythonOperator(
                    task_id=f'say_hello_from_{city}',
                    python_callable=_print_greeting,
                    op_kwargs={'city_name': city, 'greeting': 'Hello'}
                )
                say_goodbye = PythonOperator(
                    task_id=f'say_goodbye_from_{city}',
                    python_callable=_print_greeting,
                    op_kwargs={'city_name': city, 'greeting': 'Goodbye'}
                )

                # TaskGroup level dependencies
                say_hello >> say_goodbye

    # DAG level dependencies
    get_data >> preparation_task >> dynamic_tasks_group >> end
DAG Graph View: (screenshot not included)
Imports:
import json
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup
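The original answer doesn't show _print_greeting; a minimal version of that callable could simply be:
def _print_greeting(city_name, greeting):
    print(f'{greeting} from {city_name}!')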
Things to keep in mind:
If you have simultaneous dag_runs of this same DAG, all of them will use the same variable, so you may need to make it 'unique' by differentiating their names.
You must set the default value while reading the Variable; otherwise, the first execution may not be processable by the Scheduler.
The Airflow Graph View UI may not refresh the changes immediately. This happens especially in the first run after adding or removing items from the iterable on which the dynamic task generation is based.
If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article).
Good luck!
Edit:
Another important point to take into consideration:
With this approach, the call to the Variable.get() method is top-level code, so it is read by the scheduler every 30 seconds (the default of the min_file_process_interval setting). This means that a connection to the metadata DB will happen each time.
Edit:
Added an if clause to handle the empty iterable_list case.
This is not possible, and in general dynamic tasks are not recommended:
The way the Airflow scheduler works is by reading the DAG file, loading the tasks into memory and then checking which DAGs and which tasks it needs to schedule. XComs, on the other hand, are runtime values tied to a specific dag run, so the scheduler cannot rely on XCom values.
When using dynamic tasks you're making debugging much harder for yourself, as the values you use for creating the DAG can change, and you'll lose access to logs without even understanding why.
What you can do is use a branch operator, to have those tasks always present and just skip them based on the XCom value.
For example:
def branch_func(**context):
    # 'key' is a placeholder for whatever XCom key holds your value
    return f"task_{context['ti'].xcom_pull(key=key)}"

branch = BranchPythonOperator(
    task_id="branch",
    python_callable=branch_func,
)

# placeholder tasks; in practice use concrete operators here
tasks = [BaseOperator(task_id=f"task_{i}") for i in range(3)]

branch >> tasks
In some cases it's also not good to use this method (for example, when there are 100 possible tasks); in those cases I'd recommend writing your own operator or using a single PythonOperator, as sketched below.
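For instance, a sketch of the single-PythonOperator alternative, reusing the task id and XCom key from the question and iterating over the pulled value at runtime inside one task (a PythonOperator import and a dag object are assumed):
def process_all_items(**context):
    # pull the dict that the question stores under key 'config_table'
    items = context['ti'].xcom_pull(task_ids='get_config_table', key='config_table') or {}
    for key in items:
        # do the per-item work here, inside a single task
        print(f'processing {key}')

process_all = PythonOperator(
    task_id='process_all_items',
    python_callable=process_all_items,
    provide_context=True,  # Airflow 1.x only
    dag=dag,
)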
I'm running a script that checks the status of my database before a DAG runs and compares it to the status after the DAG has finished running.
def pre_dag_db():
    pass

def run_dag():
    pass

def post_dag_db():
    pass
Is there a way for me to know when the DAG finished running so that my script knows when to run post_dag_db? The idea is that my post_dag_db runs after my DAG finished running because the DAG manipulates the db.
The easiest way to do this would be to just run the script as the last task in your DAG, maybe using a BashOperator.
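For instance, a minimal sketch (the script path and the upstream task name last_existing_task are hypothetical placeholders, assuming a dag object is already defined):
from airflow.operators.bash_operator import BashOperator

post_db_check = BashOperator(
    task_id='post_dag_db_check',
    bash_command='python /path/to/post_dag_db.py',
    dag=dag,
)

# make it the final task so it only runs after everything else succeeded:
# last_existing_task >> post_db_check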
Another option would be to trigger a separate DAG (via TriggerDagRunOperator) and have that DAG call your script.
If you really cannot call your script from Airflow itself, you might want to check the REST APIs https://airflow.apache.org/docs/stable/api.html and use them to retrieve information about the dag_run. But this seems overly complicated to me.
A quick and easy way is to add one task to the DAG which will run as its last task; this will work like magic for you.
You can use any operator for it (PythonOperator, BashOperator, etc.).
I think you can use the following code:
from airflow.models import DagRun

# 'args' mirrors the parsed CLI arguments (dag_id, execution_date)
dag = get_dag(args)

dr = DagRun.find(dag.dag_id, execution_date=args.execution_date)
print(dr[0].state if len(dr) > 0 else None)
This code is taken from airflow cli.
Make a custom class that inherits from DAG, and whose dependencies are the same as your DAG's.
Something like (custom_dag.py):
from typing import List

from airflow.models import BaseOperator
from airflow.models.dag import DAG

class PreAndPostDAG(DAG):

    @property
    def tasks(self) -> List[BaseOperator]:
        return [self.pre_graph] + list(self.task_dict.values()) + [self.post_graph]

    @property
    def pre_graph(self):
        # whatever crazy things you want to do here, before the DAG starts
        pass

    @property
    def post_graph(self):
        # whatever crazy things you want to do here, AFTER the DAG finishes
        pass
That's the easiest I can think of, then you just import it when defining your dags:
from custom_dag import PreAndPostDAG
with PreAndPostDAG(
    'LS',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    # t1, t2 and t3 are examples of tasks created by instantiating operators
    t1 = BashOperator(
        task_id='list',
        bash_command='ls',
    )
You get the rest, hope that helps
Let's say I have two DAGs, where dag2 executes dag1 as part of its flow using TriggerDagRunOperator, as follows:
dag1: task1 > task2 > task3
dag2: task4 > dag1 > task5
Now let's say dag2 is scheduled for once a day at 5 PM.
Is there a way for me to get the execution timestamp of dag2 (the parent DAG) while I'm running dag1?
Is there any built-in parameter that holds that value?
And if something happened and dag2 was triggered later than usual, let's say at 6 PM the same day, then I still want to get the original scheduling time, that is 5 PM, while I'm in dag1.
Pass a function to the python_callable argument of TriggerDagRunOperator that injects the execution_date into the triggered DAG:
def inject_execution_date(context, dag_run_obj):
    dag_run_obj.payload = {"parent_execution_date": context["execution_date"]}
    return dag_run_obj

[...]

trigger_dro = TriggerDagRunOperator(python_callable=inject_execution_date, [...])
You can access this in the child DAG with context["dag_run"].conf["parent_execution_date"].
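For example, a task callable in the child DAG (the function name is a placeholder) could read it like this:
def use_parent_execution_date(**context):
    # the payload set by the parent shows up as the triggered run's conf
    parent_execution_date = context['dag_run'].conf.get('parent_execution_date')
    print(f'Parent (orchestrator) execution_date: {parent_execution_date}')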
I have a use case where I have a list of clients. Clients can be added or removed from the list, and they can have different start dates and different initial parameters.
I want to use Airflow to backfill all data for each client based on their initial start date, plus rerun if something fails. I am thinking about creating a SubDag for each client. Will this address my problem?
How can I dynamically create SubDags based on the client_id?
You can definitely create DAG objects dynamically:
def make_client_dag(parent_dag, client):
    return DAG(
        '%s.client_%s' % (parent_dag.dag_id, client.name),
        start_date=client.start_date,
    )
You could then use that method in a SubDagOperator from your main dag:
for client in clients:
    SubDagOperator(
        task_id='client_%s' % client.name,
        dag=main_dag,
        subdag=make_client_dag(main_dag, client),
    )
This will create a subdag specific to each member of the collection clients, and each will run for the next invocation of the main dag. I'm not sure if you'll get the backfill behavior you want.