Using Airflow TaskGroup without setting direct dependencies?

The "mainstream" way to use TG is
t1 >> t2 >> task_group >> t3 ...
But in some cases, I'd like to use TG in a different way:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG(...) as dag:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    t4 = DummyOperator(task_id="t4")

    with TaskGroup(group_id="myTG") as tg:
        tg1 = DummyOperator(task_id="TG1")
        tg2 = DummyOperator(task_id="TG2")
        tg3 = DummyOperator(task_id="TG3")

        ##########################################################
        # setting direct dependencies on the "outer" DAG tasks:  #
        ##########################################################
        tg1.set_upstream(t2)
        tg2.set_upstream(t4)

        # internal TG structure:
        tg1 >> tg3
        tg2 >> tg3

    t1 >> t2
    t3 >> t4
As seen above, the TG is not chained directly to the DAG, but rather the internal tasks have dependencies on the "outer" tasks.
This actually works, and the outcome is as expected (tested on Airflow 2.1.0).
However, this behavior is not documented anywhere and I couldn't find any evidence that this is supported by design. My fear is that this might be a side effect or undefined behavior, and might break in future releases.
Does anybody know if it's safe to use this method?

An Airflow Task Group is a collection of tasks representing a part of the DAG (a sub-DAG). This collection has root tasks (the tasks without any upstream task inside the same Task Group) and leaf tasks (the tasks without any downstream task inside the same Task Group).
When you set an upstream for the Task Group, it is simply set as an upstream of all the root tasks, and when you set a downstream task for the Task Group, it is set as a downstream of all the leaf tasks (see the method source code).
Your code is completely safe. It also lets you leave one inner task without any upstream while adding an upstream to another, which is not possible if you set the upstream directly on the Task Group.
You will not find any example similar to your code in the docs, because the main goal of Task Groups was to replace SubDAGs: you set dependencies for the whole Task Group and use it as a dependency for other tasks/Task Groups. So your code is not the recommended style, but it is 100% safe.
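For illustration, here is a minimal sketch (hypothetical task ids and DAG arguments) contrasting the two styles: setting the dependency on the whole group, versus on a single inner task as in the question.
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup

with DAG("tg_dependency_styles", start_date=days_ago(1),
         schedule_interval=None, catchup=False) as dag:
    start = DummyOperator(task_id="start")

    with TaskGroup(group_id="myTG") as tg:
        a = DummyOperator(task_id="a")
        b = DummyOperator(task_id="b")
        c = DummyOperator(task_id="c")
        [a, b] >> c  # a and b are the roots, c is the only leaf

    # Style 1 (recommended): dependency on the whole group,
    # so start becomes upstream of BOTH roots (a and b).
    start >> tg

    # Style 2 (the question's approach): upstream on a single inner task only,
    # e.g. a.set_upstream(start), leaving b without any outer upstream.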

Related

Is it possible to arrange execution of several tasks in a transaction-like mode?

By transaction-like I mean: either all tasks in a set are successful, or none of them are and they should be retried from the first one.
Consider two operators, A and B: A downloads a file, and B reads it and performs some actions.
A successfully executes, but before B comes into play a blackout/corruption occurs, so B cannot process the file and it fails.
To pass, B needs A to re-download the file, but since A is in success state, there is no direct way to trigger that automatically; a human must go and clear A's status.
Now I wonder: is there a known way to clear the statuses of task instances up to a certain task if some task fails?
I know I can use a hook, as in "clear an upstream task in airflow within the dag", but that looks a bit ugly.
This feature is not supported by Airflow out of the box, but you can use on_failure_callback to achieve it:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

def clear_task_A(context):
    execution_date = context["execution_date"]
    # -s: start date, -t: task_id regex, -d: also clear downstream tasks
    # (so B is cleared and re-run together with A), -y: don't prompt for confirmation
    clear_task = BashOperator(
        task_id='clear_task_A',
        bash_command=f'airflow tasks clear -s {execution_date} -t A -d -y <dag_id>'
    )
    return clear_task.execute(context=context)

with DAG(...) as dag:
    A = DummyOperator(
        task_id='A'
    )
    B = BashOperator(
        task_id='B',
        bash_command='exit 1',
        on_failure_callback=clear_task_A
    )
    A >> B

Multiple applications of >> in Airflow?

Suppose I have Airflow tasks like this:
apple_task = DummyOperator(
    task_id='apple'
)
banana_task = DummyOperator(
    task_id='banana'
)
cherry_task = DummyOperator(
    task_id='cherry'
)
apple_task >> cherry_task
banana_task >> cherry_task
Do the repeated applications of >> stack or replace the previous one?
What will the graph look like?
Airflow 2.2.2
They stack: apple_task and banana_task will run in parallel, and both must succeed for cherry_task to run.
It's equivalent to [apple_task, banana_task] >> cherry_task.
The scheduler parses the DAG files (every 30s by default); the DAG is read and the graph is constructed. An advantage of specifying task dependencies as you did is that you can dynamically create tasks at parse time, since they're just Python objects.
The DAG documentation page has some more examples under the "Task Dependencies" and "Control Flow" headings.
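For illustration, a minimal sketch (same hypothetical task_ids, wrapped in an assumed DAG) showing that the list form builds the same fan-in graph, which you can confirm by inspecting upstream_task_ids:
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

with DAG("fan_in_example", start_date=days_ago(1),
         schedule_interval=None, catchup=False) as dag:
    apple_task = DummyOperator(task_id="apple")
    banana_task = DummyOperator(task_id="banana")
    cherry_task = DummyOperator(task_id="cherry")

    # Same graph as the two separate >> statements in the question:
    [apple_task, banana_task] >> cherry_task

print(cherry_task.upstream_task_ids)  # {'apple', 'banana'}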

Airflow TaskGroups - adding another taskgroup that waits for the other taskgroups to finish running

I currently have 3 TaskGroups that run independently from one another, but I want to add another one (QUERY) that runs after the other TaskGroups finish running:
with TaskGroup(group_id='query') as QUERY:
    affiliates_query = SnowflakeQueryOperator()

t1
t2
t3
I tried this, but it didn't work:
t1
t2
t3
[t1, t2, t3] >> QUERY
I solved this by putting the TaskGroups into another overall TaskGroup and referencing the name of that TaskGroup in the workflow.
with dag:
    with TaskGroup(group_id='big_tent') as big_tent:
        with TaskGroup(group_id='t1') as t1:
            ...  # Do something
        with TaskGroup(group_id='t2') as t2:
            ...  # Do something
        with TaskGroup(group_id='t3') as t3:
            ...  # Do something

    with TaskGroup(group_id='query') as QUERY:
        affiliates_query = SnowflakeQueryOperator()

    big_tent >> QUERY
I actually used a for loop, but it should return the same results (see the sketch below). A DummyOperator between them won't change the behavior, but it does help when reading the DAG by separating the processes.
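Here is a rough sketch of that for-loop variant (group ids t1/t2/t3 are taken from the question; the SnowflakeQueryOperator is replaced by a DummyOperator stand-in and the DAG arguments are assumptions):
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup

with DAG("tg_fan_in", start_date=days_ago(1),
         schedule_interval=None, catchup=False) as dag:
    with TaskGroup(group_id="big_tent") as big_tent:
        for group_id in ("t1", "t2", "t3"):
            with TaskGroup(group_id=group_id):
                DummyOperator(task_id=f"{group_id}_work")  # placeholder work

    with TaskGroup(group_id="query") as query:
        DummyOperator(task_id="affiliates_query")  # stand-in for SnowflakeQueryOperator

    big_tent >> query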

How to run non-dependent sibling tasks sequentially?

Let's suppose I have a dag with 4 tasks, and the structure would be more or less like this if we only care about the order in which they are run:
t1.set_downstream(t2)
t2.set_downstream(t3)
t3.set_downstream(t4)
This lets me run the tasks sequentially: t1 -> t2 -> t3 -> t4.
My issue is that I want them to run in the above sequence, but I want t4 to ignore the failure if and only if it is t3 that fails.
I.e: the graph would look like this if we only care about dependencies
t1.set_downstream(t2)
t2.set_downstream(t3)
t2.set_downstream(t4)
What can I do to have both scenarios?
My current option is to set the trigger rule of t4 to all_done; however, I don't want it to run if t1 or t2 fails.
Any ideas? Is this even possible?
You can use a branch operator (e.g. BranchPythonOperator) to add the logic you are looking for, or even create your own SkipOperator based on SkipMixin (a sketch of that idea follows).
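A minimal sketch of a closely related approach (not a literal SkipMixin subclass): give t4 trigger_rule="all_done" so it still runs after a t3 failure, and have its callable raise AirflowSkipException unless t1 and t2 actually succeeded. Task ids and DAG arguments are assumptions.
from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.utils.state import State

def _t4_work(**context):
    dag_run = context["dag_run"]
    for upstream_id in ("t1", "t2"):
        ti = dag_run.get_task_instance(upstream_id)
        if ti is None or ti.state != State.SUCCESS:
            # skip t4 when the early tasks did not succeed
            raise AirflowSkipException(f"{upstream_id} did not succeed; skipping t4")
    print("t1 and t2 succeeded, so t4 runs even if t3 failed")

with DAG("sequential_with_optional_t3", start_date=days_ago(1),
         schedule_interval=None, catchup=False) as dag:
    t1 = BashOperator(task_id="t1", bash_command="echo 1")
    t2 = BashOperator(task_id="t2", bash_command="echo 2")
    t3 = BashOperator(task_id="t3", bash_command="exit 1")  # simulate a t3 failure
    t4 = PythonOperator(task_id="t4", python_callable=_t4_work,
                        trigger_rule="all_done")

    t1 >> t2 >> t3 >> t4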

Dynamic tasks in airflow based on an external file

I am reading a list of elements from an external file and looping over the elements to create a series of tasks.
For example, if there are 2 elements in the file - [A, B]. There will be 2 series of tasks:
A1 -> A2 ..
B1 -> B2 ...
This element-reading logic is not part of any task but lives in the DAG file itself, so the Scheduler runs it many times a day while parsing the DAG file. I want to run it only at DAG runtime.
I'm wondering if there is already a pattern for this kind of use case.
Depending on your requirements: if what you are looking for is to avoid reading a file many times, and you don't mind reading from the metadata database that many times instead, then you could change your approach and use Variables as the source of iteration for dynamically creating tasks.
A basic example would be performing the file reading inside a PythonOperator and setting, in the same callable, the Variable you will iterate over later:
sample_file.json:
{
    "cities": [ "London", "Paris", "BA", "NY" ]
}
Task definition:
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup
import json

def _read_file():
    with open('dags/sample_file.json') as f:
        data = json.load(f)
        Variable.set(key='list_of_cities',
                     value=data['cities'], serialize_json=True)
        print('Loading Variable from file...')

def _say_hello(city_name):
    print('hello from ' + city_name)

with DAG('dynamic_tasks_from_var', schedule_interval='@once',
         start_date=days_ago(2),
         catchup=False) as dag:

    read_file = PythonOperator(
        task_id='read_file',
        python_callable=_read_file
    )
Then you could read from that variable and create the dynamic tasks. (It's important to set a default_var). The TaskGroup is optional.
    # (still inside the `with DAG(...)` block above)

    # Top-level code
    updated_list = Variable.get('list_of_cities',
                                default_var=['default_city'],
                                deserialize_json=True)
    print(f'Updated LIST: {updated_list}')

    with TaskGroup('dynamic_tasks_group',
                   prefix_group_id=False,
                   ) as dynamic_tasks_group:
        for index, city in enumerate(updated_list):
            say_hello = PythonOperator(
                task_id=f'say_hello_from_{city}',
                python_callable=_say_hello,
                op_kwargs={'city_name': city}
            )

    # DAG-level dependencies
    read_file >> dynamic_tasks_group
In the Scheduler logs, you will only find:
INFO - Updated LIST: ['London', 'Paris', 'BA', 'NY']
DAG Graph View (screenshot omitted): read_file followed by dynamic_tasks_group, which contains one say_hello_from_* task per city.
With this approach, the only top-level code (and hence what the Scheduler reads continuously) is the call to the Variable.get() method. If you need to read from many variables, it's important to remember that it's recommended to store them in one single JSON value to avoid constantly creating connections to the metadata database (example in this article).
Update:
As of 11-2021, this approach is considered a "quick and dirty" kind of solution.
Does it work? Yes, totally. Is it production-quality code? No.
What's wrong with it? The DB is accessed every time the Scheduler parses the file (every 30 seconds by default), which has nothing to do with your DAG execution. Full details in the Airflow best practices on top-level code.
How can this be improved? Consider whether any of the recommended approaches to dynamic DAG generation applies to your needs (a sketch of one option follows).
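For example, one commonly recommended pattern is to generate the tasks from a local configuration file read at parse time, which is much cheaper than hitting the metadata DB on every parse. A hedged sketch, reusing the sample_file.json assumed above:
import json
from pathlib import Path

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# Read the local config at parse time; no metadata-DB access involved.
CONFIG = json.loads(Path("dags/sample_file.json").read_text())

def _say_hello(city_name):
    print(f"hello from {city_name}")

with DAG("dynamic_tasks_from_file", schedule_interval="@once",
         start_date=days_ago(2), catchup=False) as dag:
    for city in CONFIG["cities"]:
        PythonOperator(
            task_id=f"say_hello_from_{city}",
            python_callable=_say_hello,
            op_kwargs={"city_name": city},
        )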
