trigger rule none_skipped is not working as expected - airflow

I have a situation where I need to use the none_skipped trigger rule, but it is behaving strangely. Here is my scenario.
Branch task B:
if true: T1 >> T2 >> joinTask
if false: F1 >> F2 >> joinTask
If the condition in B is false, T1 is skipped but T2 still runs, because its trigger rule was 'all_done' (I need T2 to run even if T1 fails). So I changed T2's trigger rule to 'none_skipped'.
I was expecting T2 to be triggered if T1 succeeds, fails, or is upstream_failed (as per the documentation). Instead, T2 is triggered as soon as the DAG starts; it runs before any other task.

It looks like this was only recently fixed, in Airflow 1.10.5:
https://github.com/apache/airflow/pull/5902
Try updating Airflow to 1.10.5.
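On a version where the rule behaves as documented, the layout from the question would look roughly like the sketch below. This is only a minimal illustration, assuming Airflow 1.10.5+, with placeholder task ids and a placeholder branch callable:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.utils.trigger_rule import TriggerRule

def choose_branch():
    # placeholder for the real condition evaluated in B
    return "T1"

with DAG("branch_none_skipped_sketch",
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    b = BranchPythonOperator(task_id="B", python_callable=choose_branch)
    t1 = DummyOperator(task_id="T1")
    # none_skipped: T2 runs once T1 is done (success, failed or upstream_failed),
    # and is skipped when the branch skipped T1.
    t2 = DummyOperator(task_id="T2", trigger_rule=TriggerRule.NONE_SKIPPED)
    f1 = DummyOperator(task_id="F1")
    f2 = DummyOperator(task_id="F2")
    # none_failed lets the join run even though one branch was skipped
    join = DummyOperator(task_id="joinTask", trigger_rule=TriggerRule.NONE_FAILED)

    b >> t1 >> t2 >> join
    b >> f1 >> f2 >> join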

Related

Using Airflow TaskGroup without setting direct dependencies?

The "mainstream" way to use TG is
t1 >> t2 >> task_group >> t3 ...
But in some cases, I'd like to use TG in a different way:
with DAG(...) as dag:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    t4 = DummyOperator(task_id="t4")

    with TaskGroup(group_id="myTG") as tg1:
        tg1 = DummyOperator(task_id="TG1")
        tg2 = DummyOperator(task_id="TG2")
        tg3 = DummyOperator(task_id="TG3")

        ##########################################################
        # setting direct dependencies on the "outer" DAG tasks:  #
        ##########################################################
        tg1.set_upstream(t2)
        tg2.set_upstream(t4)

        # internal TG structure:
        tg1 >> tg3
        tg2 >> tg3

    t1 >> t2
    t3 >> t4
As seen above, the TG is not chained directly to the DAG, but rather the internal tasks have dependencies on the "outer" tasks.
This actually works, and the outcome is as expected (using Airflow 2.1.0).
However, this behavior is not documented anywhere and I couldn't find any evidence that this is supported by design. My fear is that this might be a side effect or undefined behavior, and might break in future releases.
Does anybody know if it's safe to use this method?
An Airflow Task Group is a collection of tasks which represents a part of the DAG (a sub-DAG). This collection has root tasks (the tasks which don't have any upstream task in the same Task Group) and leaf tasks (the tasks which don't have any downstream task in the same Task Group).
When you set an upstream for the Task Group, it simply sets it as an upstream for all the root tasks, and when you set a downstream task for the Task Group, it sets it as a downstream task for all the leaf tasks (see the method's source code).
Your code is completely safe. This way you can also leave one task without any upstream and add an upstream only to another, which is not possible if you set the upstream directly on the Task Group.
You will not find any example like your code in the docs, because the main goal of Task Groups was to replace SubDAGs: you set dependencies for the whole Task Group and use it as a dependency for other tasks/Task Groups. So your code is not the recommended pattern, but it is 100% safe.
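As a small illustration of that fan-out (just a sketch, assuming Airflow 2.x and made-up task ids), setting a dependency on the group is the same as setting it on every root task:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG("tg_roots_sketch", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    upstream = DummyOperator(task_id="upstream")
    with TaskGroup(group_id="myTG") as tg:
        a = DummyOperator(task_id="a")  # root: no upstream inside the group
        b = DummyOperator(task_id="b")  # root
        c = DummyOperator(task_id="c")  # leaf: no downstream inside the group
        [a, b] >> c

    # Wiring the group fans out to both roots; setting upstream >> a and
    # upstream >> b explicitly would produce the same graph.
    upstream >> tg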

Is it possible to arrange execution of several tasks in a transaction-like mode?

By transaction-like I mean: either all tasks in a set are successful, or none of them are and they should be retried from the first one.
Consider two operators, A and B: A downloads a file, and B reads it and performs some actions.
A executes successfully, but before B comes into play a blackout/corruption occurs, so B cannot process the file and fails.
To pass, B needs A to re-download the file, but since A is in the success state there is no direct way to trigger that automatically; a human must go and clear A's status.
Now I wonder: is there a known way to clear the statuses of task instances up to a certain task when some task fails?
I know I can use a hook, as in "clear an upstream task in airflow within the dag", but that looks a bit ugly.
This feature is not supported by Airflow out of the box, but you can use an on_failure_callback to achieve it:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator

def clear_task_A(context):
    execution_date = context["execution_date"]
    clear_task_A = BashOperator(
        task_id='clear_task_A',
        bash_command=f'airflow tasks clear -s {execution_date} -t A -d -y <dag_id>'
    )  # -s: start date, -t: task_id regex, -d: include downstream, -y: skip the confirmation prompt
    return clear_task_A.execute(context=context)

with DAG(...) as dag:
    A = DummyOperator(
        task_id='A'
    )
    B = BashOperator(
        task_id='B',
        bash_command='exit 1',
        on_failure_callback=clear_task_A
    )
    A >> B

Airflow TaskGroups - adding another taskgroup that waits for the other taskgroups to finish running

I currently have 3 task groups that run independently from one another, but I want to add another one (QUERY) that runs after the other task groups finish running:
with TaskGroup(group_id='query') as QUERY:
    affiliates_query = SnowflakeQueryOperator()
t1
t2
t3
I tried this but it didn't work:
t1
t2
t3
[t1,t2,t3] >> QUERY
I solved this by putting the TaskGroups into another overall TaskGroup and referencing the name of that TaskGroup in the workflow.
with dag:
    with TaskGroup(group_id='big_tent') as big_tent:
        with TaskGroup(group_id='t1') as t1:
            ...  # Do something
        with TaskGroup(group_id='t2') as t2:
            ...  # Do something
        with TaskGroup(group_id='t3') as t3:
            ...  # Do something

    with TaskGroup(group_id='query') as QUERY:
        affiliates_query = SnowflakeQueryOperator()

    big_tent >> QUERY
I used a for loop, but it should return the same results. A DummyOperator between them won't matter, but it does help when reading the DAG by separating the processes.
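For reference, the for-loop version of the same layout would look roughly like this (a sketch, assuming Airflow 2.x; the DummyOperator tasks are placeholders for the real work inside each group):
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

with DAG("big_tent_sketch", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    with TaskGroup(group_id="big_tent") as big_tent:
        for gid in ("t1", "t2", "t3"):
            with TaskGroup(group_id=gid):
                DummyOperator(task_id=f"{gid}_work")  # placeholder task

    with TaskGroup(group_id="query") as query:
        DummyOperator(task_id="affiliates_query")  # stand-in for the SnowflakeQueryOperator

    big_tent >> query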

How to run non-dependent sibling tasks sequentially?

Let's suppose I have a dag with 4 tasks, and the structure would be more or less like this if we only care about the order in which they are run:
t1.set_downstream(t2)
t2.set_downstream(t3)
t3.set_downstream(t4)
This lets me run the tasks sequentially: t1 -> t2 -> t3 -> t4.
My issue is that I want them to run in the above sequence, but I want t4 to ignore a failure if and only if it is t3 that fails.
I.e., the graph would look like this if we only care about dependencies:
t1.set_downstream(t2)
t2.set_downstream(t3)
t2.set_downstream(t4)
What can I do to have both scenarios?
My current option is to set the trigger rule of t4 to all_done; however, I don't want it to run if t1 or t2 fails.
Any ideas? Is this even possible?
You can use a BranchPythonOperator to add the logic you are looking for, or even create your own skip operator based on SkipMixin.
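For example, a minimal sketch of the branching approach (assuming Airflow 2.x; the decide callable and the stop task are made up for illustration). The branch task uses trigger_rule all_done, so it executes even when t3 fails, and it only routes to t4 when t1 and t2 both succeeded:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

def decide(**context):
    # Run t4 only if t1 and t2 both succeeded, regardless of t3's outcome.
    dag_run = context["dag_run"]
    upstream_ok = all(
        dag_run.get_task_instance(tid).state == State.SUCCESS
        for tid in ("t1", "t2")
    )
    return "t4" if upstream_ok else "stop"

with DAG("sequential_skip_sketch", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    t4 = DummyOperator(task_id="t4")
    stop = DummyOperator(task_id="stop")  # dead-end branch so t4 can be skipped

    branch = BranchPythonOperator(
        task_id="branch_after_t3",
        python_callable=decide,
        trigger_rule=TriggerRule.ALL_DONE,  # run even if t3 failed
    )

    t1 >> t2 >> t3 >> branch >> [t4, stop]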

How do I trigger all upstream_failed task to "retry" after their parents state changed to success in Airflow?

I have a simple example showing a DAG with two levels. When run, this DAG fails because of an artificial bug: there is one failed task and one that is upstream_failed.
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

bug = True

def process1(param):
    print("process 1 running {}".format(param))
    if bug and (param == 2):
        raise Exception("failure!!")

def process2(param):
    print("process 2 running {}".format(param))

with dag:  # dag is assumed to be defined elsewhere
    for i in range(10):
        task1 = PythonOperator(
            task_id="process_1_{}".format(i),
            python_callable=process1,
            op_kwargs={'param': i}
        )
        task2 = PythonOperator(
            task_id="process_2_{}".format(i),
            python_callable=process2,
            op_kwargs={'param': i},
            trigger_rule=TriggerRule.ALL_SUCCESS,
            retries=2
        )
        task1 >> task2
Now let's assume I fixed the bug (bug = False) and tried to clear all failed tasks:
airflow clear -s 2001 -e 2019 --only_failed test_resubmit
This command clears the task test_resubmit.process_1_2 and it runs successfully; however, its downstream task (i.e. test_resubmit.process_2_2) is still in the upstream_failed state. How do I trigger all upstream_failed tasks to "retry" after their parents' state has changed to success?
The upstream_failed state is an end state, so it will not retry even if its dependencies are now satisfied (unlike up_for_retry). You'll want to pass --downstream so that tasks downstream of the failed tasks are also cleared.
See all options in https://airflow.readthedocs.io/en/stable/cli.html#clear.
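For example, something along these lines (using the same 1.10-era CLI syntax as the question; -t selects the fixed task and --downstream also clears everything below it):
airflow clear -s 2001 -e 2019 -t process_1_2 --downstream test_resubmit
This clears process_2_2 together with process_1_2, so it leaves the upstream_failed end state and is scheduled again once its parent succeeds.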
