I have a subdag as one of the nodes of a main DAG. The workflow works fine.
I tried to increase the levels of hierarchy by including another subdag inside the subdag, but Airflow seems to get confused. A couple of questions in this regard:
1) Does Airflow support a subdag inside a subdag? If so, is there a limit to the hierarchy?
2) Are there any best practices in using a subdag inside a subdag?
I have recently started using Airflow myself, and it does indeed support subdags inside subdags. I was able to go up to 4 levels deep, but I'm not sure about the exact limit of the hierarchy. Hope that helps!
Short answer: yes, you can.
I managed to achieve this by using subdag-creation functions, following https://github.com/geosolutions-it/evo-odas/wiki/Airflow---about-subDAGs,-branching-and-xcom
I achieved 3 levels of subdag hierarchy.
The trick is to be extremely careful to follow the maindag.subdag naming convention in your tasks.
task = SubDagOperator(
    subdag=create_subdag(dag_name, task_id, datetime(2019, 1, 29), None),
    task_id=task_id,
    dag=dag
)
And in the create_subdag function you need to work carefully with the parent and child DAG names; nesting several such functions does the job.
You'll see the errors in the UI, if any.
I will update this post with more code if needed, but it really depends on how many levels you need, so design the functions accordingly.
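For reference, a create_subdag function along those lines might look like the sketch below. This is only an assumption based on the call shown above (same argument order); the DummyOperator body is a placeholder to keep it self-contained.

from datetime import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

def create_subdag(parent_dag_name, child_task_id, start_date, schedule_interval):
    # The dag_id must follow the parent.child convention expected by SubDagOperator.
    subdag = DAG(
        dag_id='{}.{}'.format(parent_dag_name, child_task_id),
        start_date=start_date,
        schedule_interval=schedule_interval
    )
    # Placeholder tasks; replace with real work, or with another nested
    # SubDagOperator to add one more level of hierarchy.
    DummyOperator(task_id='start', dag=subdag) >> DummyOperator(task_id='end', dag=subdag)
    return subdag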
I recently started experimenting with subdags. I don't think there is a limit on how deep they can go. A lot of webpages suggest staying away from subdags because of issues with connection pools.
I made an example here with 2 levels. It can be refactored further, but it demonstrates the point.
1) Create a helper method to create dags with subdag names formatted as parent.child.
from datetime import datetime

from airflow import DAG

def create_sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    '''Returns a DAG which has the dag_id formatted as parent.child'''
    return DAG(
        dag_id='{}.{}'.format(parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date,
        max_active_runs=15
    )
2) Then recursively create and assign subdags to parent DAGs. Remember, subdags are still DAGs; SubDagOperator only bundles one as a task for the parent DAG.
from airflow.operators.subdag_operator import SubDagOperator

# `dag` and `dag_id` below refer to the top-level parent DAG defined elsewhere in the
# file; `create_tasks` is a helper that attaches the actual work tasks to a (sub)DAG.
level1_list = ['AWS', 'AZURE']
level2_list = ['eu', 'us', 'ap', 'jp']
tasks = ['task_{}'.format(str(i)) for i in range(0, 10)]

for level1_item in level1_list:
    level1_dag = create_sub_dag(dag_id, level1_item, datetime(2020, 3, 10), '0 6 * * *')
    level1_subdag_operator = SubDagOperator(
        subdag=level1_dag,
        task_id=level1_item,
        dag=dag,
    )
    level1_dag_id = '{}.{}'.format(dag_id, level1_item)

    for level2_item in level2_list:
        level2_dag = create_sub_dag(level1_dag_id, level2_item, datetime(2020, 3, 10), '0 6 * * *')
        level2_subdag_operator = SubDagOperator(
            subdag=level2_dag,
            task_id=level2_item,
            dag=level1_dag,
        )
        level2_dag_id = '{}.{}.{}'.format(dag_id, level1_item, level2_item)
        create_tasks(level2_dag, tasks)
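The create_tasks helper is not shown above; a minimal sketch of what it might do (the DummyOperator chain is purely an assumption) is:

from airflow.operators.dummy_operator import DummyOperator

def create_tasks(dag, task_ids):
    # Chain placeholder tasks onto the given (sub)DAG; in a real workflow
    # these would be the actual work operators.
    previous = None
    for task_id in task_ids:
        current = DummyOperator(task_id=task_id, dag=dag)
        if previous is not None:
            previous >> current
        previous = current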
I am trying to create a graph with a chain of dynamic tasks.
First of all, I started with the expand function. But the problem is that the program waits until all the Add tasks have finished and only then starts the Mul tasks, whereas I need each Mul to run immediately after its Add. I ended up with the following code, which at least builds the graph:
with DAG(dag_id="simple_maping", schedule='* * * * *', start_date=datetime(2022, 12, 22)) as dag:
#task
def read_conf():
conf = Variable.get('tables', deserialize_json=True)
return conf
#task
def add_one(x: str):
sleep(5)
return x + '1'
#task
def mul_two(x: str):
return x * 2
for i in read_conf():
mul_two(add_one(i))
But now there is an error: 'xcomarg' object is not iterable. I can fix it just by removing the task decorator from the read_conf method, but I am not sure that is the best decision, because in my case the list of configuration names could contain more than 1000 elements. Without the decorator, the method has to read the configuration every time the scheduler parses the graph.
Maybe the load without the decorator will not be critical? Or is there a way to make the object iterable? What is the right way to do this?
EDIT: This solution has a bug in 2.5.0 which was solved for 2.5.1 (not released yet).
Yes, when you are chaining dynamically mapped tasks, the latter (mul_two) will by default wait until all mapped instances of the first task (add_one) are done, because the default trigger rule is all_success. While you can change the trigger rule, for example to one_done, this will not solve your issue, because the second task decides only once, when it first starts running, how many mapped task instances it creates (with one_done it only creates one mapped task instance, so that is not helpful for your use case).
The issue with the for-loop (and why Airflow won't allow you to iterate over an XComArg) is that for-loops are evaluated when the DAG code is parsed, which happens outside of runtime, when Airflow does not yet know how many results read_conf() will return. If the number of configurations only rarely changes, then having a for-loop like that iterate over a list in a separate file is an option, but yes, at scale this can cause performance issues.
The best solution in my opinion is to use dynamic task group mapping which was added in Airflow 2.5.0:
A mapped task group is created for every input from read_conf(), and all of them run in parallel, so for every add_one its mul_two will run immediately. I put the code for this below.
One note: you will not be able to see the mapped task groups in the Airflow UI or access their logs just yet; the feature is still quite new and the UI extension should come in 2.5.1. That is why I added a task downstream of the mapped task groups that prints out the list of results of the mul_two tasks, so you can check whether it is working.
from datetime import datetime
from time import sleep

from airflow import DAG
from airflow.decorators import task, task_group

with DAG(
    dag_id="simple_mapping",
    schedule=None,
    start_date=datetime(2022, 12, 22),
    catchup=False
) as dag:

    @task
    def read_conf():
        return [10, 20, 30]

    @task_group
    def calculations(x):

        @task
        def add_one(x: int):
            sleep(x)
            return x + 1

        @task()
        def mul_two(x: int):
            return x * 2

        mul_two(add_one(x))

    @task
    def pull_xcom(**context):
        pulled_xcom = context["ti"].xcom_pull(
            task_ids=['calculations.mul_two'],
            key="return_value"
        )
        print(pulled_xcom)

    calculations.expand(x=read_conf()) >> pull_xcom()
Hope this helps! :)
PS: you might want to set catchup=False unless you want to backfill a few weeks of tasks.
I am reading a list of elements from an external file and looping over the elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This element-reading logic is not part of any task but lives in the DAG file itself, so the Scheduler calls it many times a day while parsing the DAG file. I want it to run only at DAG runtime.
I'm wondering whether there is already a pattern for this kind of use case.
Depending on your requirements, if what you are looking for is to avoid reading a file many times, but you don't mind reading from the metadata database as many times instead, then you could change your approach to use Variables as the source of iteration to dynamically create tasks.
A basic example could be performing the file reading inside a PythonOperator and setting the Variables you will later iterate over (same callable):
sample_file.json:
{
    "cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
import json

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup

def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        Variable.set(key='list_of_cities',
                     value=data['cities'], serialize_json=True)
        print('Loading Variable from file...')

def _say_hello(city_name):
    print('hello from ' + city_name)

with DAG('dynamic_tasks_from_var', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    read_file = PythonOperator(
        task_id='read_file',
        python_callable=_read_file
    )
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
# Top-level code
updated_list = Variable.get('list_of_cities',
                            default_var=['default_city'],
                            deserialize_json=True)
print(f'Updated LIST: {updated_list}')

with TaskGroup('dynamic_tasks_group',
               prefix_group_id=False,
               ) as dynamic_tasks_group:
    for index, city in enumerate(updated_list):
        say_hello = PythonOperator(
            task_id=f'say_hello_from_{city}',
            python_callable=_say_hello,
            op_kwargs={'city_name': city}
        )

# DAG level dependencies
read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
With this approach, the top-level code, and hence what the Scheduler reads continuously, is the call to the Variable.get() method. If you need to read many variables, it's important to remember that it's recommended to store them in one single JSON value, to avoid constantly creating connections to the metadata database (example in this article).
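For example, instead of several Variable.get() calls, a single JSON Variable can be fetched once and unpacked; the sketch below is only illustrative, and the variable and key names are made up:

from airflow.models import Variable

# One Variable holding a JSON document instead of several separate Variables,
# so each parse of the file performs a single metadata-database lookup.
config = Variable.get('dag_config',
                      default_var={'cities': ['default_city'], 'retries': 1},
                      deserialize_json=True)

cities = config['cities']
retries = config['retries']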
Update:
As of 11-2021, this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production-quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file, by default every 30 seconds, and that access has nothing to do with your DAG execution. Full details are in the Airflow best practices on top-level code.
How can this be improved? Consider whether any of the recommended ways of dynamic DAG generation applies to your needs.
When building an Airflow dag, I typically specify a simple schedule to run periodically - I expect this is the most common use.
dag = DAG('my_dag',
          description='this is what it does',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 10, 1),
          catchup=False)
I then need to use the 'date' as a parameter in my actual process, so I just check the current date.
date = datetime.date.today()
# do some date-sensitive stuff
operator = MyOperator(..., params=[date, ...])
My understanding is that setting catchup=True will have Airflow schedule my dag for every schedule interval between start_date and now (or end_date); e.g. every day.
How do I get the scheduled_date for use within my dag instance?
I think you mean the execution date here. You can use macros inside your operators; more detail can be found here: https://airflow.apache.org/code.html#macros. Airflow will supply it, so you don't need to generate the date dynamically yourself.
Inside an operator, you can use {{ ds }} in a templated string directly.
Outside of that, for example in a PythonOperator callable, you need to set provide_context=True first and accept **kwargs as the last argument of your function; then you can read kwargs['ds'].
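Roughly, the two cases look like the sketch below; the DAG, task, and function names here are made up for illustration:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG('ds_example', schedule_interval='0 12 * * *',
          start_date=datetime(2017, 10, 1), catchup=False)

# Case 1: {{ ds }} is rendered by Airflow inside templated string fields.
print_ds = BashOperator(
    task_id='print_ds',
    bash_command='echo "execution date: {{ ds }}"',
    dag=dag)

def _use_ds(**kwargs):
    # Case 2: with provide_context=True the execution date arrives in kwargs['ds'].
    print('execution date: ' + kwargs['ds'])

use_ds = PythonOperator(
    task_id='use_ds',
    python_callable=_use_ds,
    provide_context=True,
    dag=dag)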
I have a workflow that involves many instances of the SubDagOperator, with the tasks generated in a loop. The pattern is illustrated by the following toy dag file:
from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
dag = DAG(
    'subdaggy-2',
    schedule_interval=None,
    start_date=datetime(2017, 1, 1)
)

def make_sub_dag(parent_dag, N):
    dag = DAG(
        '%s.task_%d' % (parent_dag.dag_id, N),
        schedule_interval=parent_dag.schedule_interval,
        start_date=parent_dag.start_date
    )
    DummyOperator(task_id='task1', dag=dag) >> DummyOperator(task_id='task2', dag=dag)
    return dag

downstream_task = DummyOperator(task_id='downstream', dag=dag)

for N in range(20):
    SubDagOperator(
        dag=dag,
        task_id='task_%d' % N,
        subdag=make_sub_dag(dag, N)
    ) >> downstream_task
I find this to be a convenient way to organize tasks, particularly since it helps keep the top-level DAG uncluttered, especially if the subdag itself contains more tasks (i.e. tens, not just a couple).
The problem is that this approach doesn't scale very well as the number of subdags (20 in the example) increases. I find that when the total number of DAG objects created in an overall workflow surpasses about 200 (which can easily happen with a production workflow, especially if that pattern occurs several times), things grind to a halt.
So the question: is there a way to organize tasks this way (many similar subdags) that scales to hundreds or thousands of subdags? Some profiling suggests that the process spends a lot of time in the DAG object constructor. Perhaps there is a way to avoid instantiating a new DAG object for each of the SubDagOperators?
Well, it seems that at least a significant part of this problem is indeed due to the cost of the DAG constructor, which in turn is largely due to the cost of inspect.stack(). I put up a simple patch to replace that with a cheaper method, and there does indeed seem to be an improvement: a flow with several thousand subdags which previously failed to load for me now loads. We'll see if this goes anywhere.
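To see the constructor cost for yourself, a rough timing sketch like the one below can help; it is purely illustrative (the probe DAG ids are made up) and is not part of the patch mentioned above:

# Measure how long it takes to construct a few hundred bare DAG objects.
import timeit
from datetime import datetime

from airflow.models import DAG

def build_dags(n):
    for i in range(n):
        DAG('probe_%d' % i, schedule_interval=None, start_date=datetime(2017, 1, 1))

print(timeit.timeit(lambda: build_dags(200), number=1))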
My idea is to have a task foo which generates a list of inputs (users, reports, log files, etc.), and to launch a task for every element in that list. The goal is to make use of Airflow's retrying and other logic, instead of reimplementing it.
So, ideally, my DAG should look something like this:
The only variable here is the number of tasks generated. I want to do some more tasks after all of these are completed, so spinning up a new DAG for every task does not seem appropriate.
This is my code:
import json
import random
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1)
}

dag = DAG('dynamic_dag_generator', schedule_interval=None, default_args=default_args)

foo_operator = BashOperator(
    task_id='foo',
    bash_command="echo '%s'" % json.dumps(range(0, random.randint(40, 60))),
    xcom_push=True,
    dag=dag)

def gen_nodes(**kwargs):
    ti = kwargs['ti']
    workers = json.loads(ti.xcom_pull(task_ids='foo'))

    for wid in workers:
        print("Iterating worker %s" % wid)
        op = PythonOperator(
            task_id='test_op_%s' % wid,
            python_callable=lambda: print("Dynamic task!"),
            dag=dag
        )
        op.set_downstream(bar_operator)
        op.set_upstream(dummy_op)

gen_subdag_node_op = PythonOperator(
    task_id='gen_subdag_nodes',
    python_callable=gen_nodes,
    provide_context=True,
    dag=dag
)
gen_subdag_node_op.set_upstream(foo_operator)

dummy_op = DummyOperator(
    task_id='dummy',
    dag=dag
)
dummy_op.set_upstream(gen_subdag_node_op)

bar_operator = DummyOperator(
    task_id='bar',
    dag=dag)
bar_operator.set_upstream(dummy_op)
In the logs, I can see that gen_nodes is executed correctly (i.e. Iterating worker 5, etc). However, the new tasks are not scheduled and there is no evidence that they were executed.
I found related code samples online, such as this, but could not make it work. Am I missing something?
Alternatively, is there a more appropriate approach to this problem (isolating units of work)?
At this point in time, airflow does not support adding/removing a task while the dag is running.
The workflow order will be whatever is evaluated at the start of the dag run.
See the second paragraph here.
This means you cannot add/remove tasks based on something that happens in the run. You can add X tasks in a for loop based on something not related to the run, but after the run has begun there is no changing the workflow shape/order.
Many times you can instead use a BranchPythonOperator to make a decision during a dag run (and these decisions can be based on your xcom values), but it must be a decision to go down a branch that already exists in the workflow.
Dag runs and dag definitions are separated in airflow in ways that aren't entirely intuitive, but more or less anything that is created/generated inside a dag run (xcom, dag_run.conf, etc.) is not usable for defining the dag itself.
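To illustrate the branching point, here is a minimal sketch; the task names, the XCom source, and the threshold are all made up:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

dag = DAG('branch_on_xcom', schedule_interval=None, start_date=datetime(2015, 6, 1))

def _choose_branch(**kwargs):
    # Decide at runtime, based on an XCom pushed by an upstream task,
    # which of the branches that already exist in the DAG to follow.
    count = int(kwargs['ti'].xcom_pull(task_ids='foo') or 0)
    return 'many_items_path' if count > 50 else 'few_items_path'

foo = DummyOperator(task_id='foo', dag=dag)  # stand-in for the task that pushes the XCom

branch = BranchPythonOperator(
    task_id='branch',
    python_callable=_choose_branch,
    provide_context=True,
    dag=dag)

many_items_path = DummyOperator(task_id='many_items_path', dag=dag)
few_items_path = DummyOperator(task_id='few_items_path', dag=dag)

foo >> branch >> [many_items_path, few_items_path]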