How to trigger a task based on previous task status? - Airflow

I have 4 tasks as shown below. I want Task D to be triggered whether Task C has failed or succeeded. However, neither Task C nor Task D should be triggered if Task A or Task B has failed.
I tried using trigger_rule='all_done' for Task D, but then Task D is also triggered when Task B fails.
Is there a way to accomplish this in Airflow?

In your case, B is the critical task, and C is non-critical, but you want it to at least make an attempt before D.
First you need to remove all the trigger rules you have applied.
You currently have all_done on C, which means that C runs even when B fails -- which you don't want.
Next you need to add a dependency between B and D:
task_b >> task_d
Now B and C are each independently upstream of D.
So what remains are two problems:
D must not run if B fails
D must not run until C is done
You can't do one_success because the important one is B and it's not enough if C alone succeeds.
What you need is "B success and C done".
A relatively clean way to do this is to make C "skip" instead of fail if an error is encountered.
Here's an example of how to do that:
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy import DummyOperator  # Airflow 2.x import path


class MySkippingDummyOperator(DummyOperator):
    def execute(self, context):
        try:
            super().execute(context)
        except Exception as e:
            # Convert any failure into a skip so downstream trigger rules still fire.
            raise AirflowSkipException(f'Skipping instead of failing: {e}')
If MySkippingDummyOperator encounters an error, the task will end in skipped state.
So B is success / fail, and C is success / skip. With this behavior we can use trigger rule none_failed on task D.
none_failed means that all upstream tasks have completed and none of them failed (skipped upstream tasks are allowed).
And this should produce the desired behavior:
if B is unsuccessful, then D can't run
if C is unsuccessful, it will only be a skip, so D can still run
D will not run unless both B and C are done
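Putting it together, here is a minimal sketch of how the DAG might be wired. The task IDs, the linear A >> B >> C chain, and the Airflow 2.x import paths are assumptions based on the description above; MySkippingDummyOperator is the class defined earlier:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

with DAG('example_dag', start_date=days_ago(1), schedule_interval=None) as dag:
    task_a = DummyOperator(task_id='task_a')
    task_b = DummyOperator(task_id='task_b')
    task_c = MySkippingDummyOperator(task_id='task_c')  # skips instead of failing
    task_d = DummyOperator(task_id='task_d', trigger_rule='none_failed')

    # Original chain, plus the extra B -> D dependency discussed above.
    task_a >> task_b >> task_c >> task_d
    task_b >> task_d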
Alternatively, you could let D use all_done, and then from within D retrieve the task instance state of B and then skip D if B failed. But this is more complicated and certainly more of a hack.
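For completeness, a rough sketch of that hack might look like the following. The task id 'task_b' and the body of D are placeholders, and this assumes Airflow 2.x, where the execution context is passed to the callable as keyword arguments:

from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator
from airflow.utils.state import State

def _run_d(**context):
    # Look up B's state in the current DAG run and bail out if it did not succeed.
    b_state = context['dag_run'].get_task_instance('task_b').state
    if b_state != State.SUCCESS:
        raise AirflowSkipException('Task B did not succeed; skipping D.')
    # ... the actual work of D goes here ...

task_d = PythonOperator(task_id='task_d', python_callable=_run_d,
                        trigger_rule='all_done', dag=dag)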

Related

Structuring Airflow DAG with complex trigger rules

This is probably as much a logic problem as anything, but it has me stumped. I have the following dag:
I have two main branching events that are tripping me up:
Following A, only the branch B or C should run.
Following B, D should optionally run before E.
Right now I have this implemented so that E has the trigger rule "none_failed", which was needed to prevent E from being skipped when D is skipped. This setup works for all cases except when C needs to run. In that case, C and E run simultaneously (you can see via the box colors that B/D are skipped as intended, but E is green). After reading the trigger rule docs, I understand why this is happening (both of E's parent tasks are skipped, therefore it runs). However, I can't seem to figure out how to get the intended behavior (i.e. keep the current behavior of B/D/E when that branch runs, but don't run E when C runs).
For additional context, this is not my entire DAG. Tasks C and E converge into another task downstream of these with a trigger rule of ONE_FAILED, but I omitted this from the example for simplicity. Any ideas how to get the intended behavior?
This is probably not the best solution, but it seems to cover all of your scenarios. The main thing is that I added a dummy task before E in order to control E's timing, and changed E's trigger_rule to "one_success".
"one_success" requires at least one immediate parent to have succeeded, so for E to run, either D or dummy has to succeed.
from airflow.operators.dummy import DummyOperator  # Airflow 2.x import paths
from airflow.operators.python import BranchPythonOperator, PythonOperator

# _branch_A, _branch_B and _test are the callables from the original DAG, defined elsewhere.
A = BranchPythonOperator(task_id='A', python_callable=_branch_A, dag=dag)
B = BranchPythonOperator(task_id='B', python_callable=_branch_B, dag=dag)
C = DummyOperator(task_id='C', dag=dag)
D = PythonOperator(task_id='D', python_callable=_test, dag=dag)
dummy = DummyOperator(task_id='dummy', dag=dag)
E = DummyOperator(task_id='E', trigger_rule='one_success', dag=dag)

A >> [B, C]
B >> [D, dummy] >> E
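The branch callables aren't shown in the answer; a hypothetical sketch of what they might look like (some_condition and other_condition are placeholders for your own logic):

def _branch_A(**context):
    # Decide which branch to follow after A: task id 'B' or 'C'.
    return 'B' if some_condition(context) else 'C'

def _branch_B(**context):
    # Either run D, or fall through to 'dummy' so that E can still fire via one_success.
    return 'D' if other_condition(context) else 'dummy'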

Can a task depend on multiple ShortCircuitOperator tasks in Airflow?

For instance, can I set up two short-circuit tasks in front of another task:
[short_A, short_B] >> task_A
Yes.
When depending on multiple ShortCircuitOperators, you are essentially creating an AND gate.
So if you want task_A to run, you need short_A AND short_B AND ... to return True.
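As a sketch of that AND gate (check_a, check_b, do_work and dag are placeholders), assuming Airflow 2.x import paths:

from airflow.operators.python import PythonOperator, ShortCircuitOperator

short_A = ShortCircuitOperator(task_id='short_A', python_callable=check_a, dag=dag)
short_B = ShortCircuitOperator(task_id='short_B', python_callable=check_b, dag=dag)
task_A = PythonOperator(task_id='task_A', python_callable=do_work, dag=dag)

# task_A runs only if both short-circuit tasks return True; if either returns
# False, its downstream tasks (including task_A) are skipped.
[short_A, short_B] >> task_A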
If you want to build an OR gate, you either have to replace the ShortCircuitOperator with a custom one, or raise AirflowSkipException inside normal Python operator tasks.
For instance, for Python tasks A, B, and C:
[A, B] >> C
You want A and B to raise AirflowSkipException where you would have returned False in the short-circuit operator. Finally, you need to set trigger_rule='none_failed_or_skipped' on C.
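A rough sketch of that OR-gate pattern (condition_a, condition_b, do_c and dag are placeholders), again assuming Airflow 2.x import paths:

from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

def _task_a():
    if not condition_a():
        raise AirflowSkipException('Condition A not met, skipping.')

def _task_b():
    if not condition_b():
        raise AirflowSkipException('Condition B not met, skipping.')

A = PythonOperator(task_id='A', python_callable=_task_a, dag=dag)
B = PythonOperator(task_id='B', python_callable=_task_b, dag=dag)
C = PythonOperator(task_id='C', python_callable=do_c,
                   trigger_rule='none_failed_or_skipped', dag=dag)

# C runs as long as neither A nor B failed, even if one (or both) skipped.
[A, B] >> C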

How do I run a single task in an Airflow DAG more than once?

So I have two paths: steps A and C check for updates and do some transformations in databases, and step E unloads a query to an artifact file.
A -> B --\
          --> E
C -> D --/
Now I want step E to run when:
1) Steps A and B are completed
or
2) Steps C and D are completed
I tried to use trigger_rule 'one_success' in step E, but the problem is that if step A starts just before step C, step E will only run once and the data change in C is not unloaded to the final artifact, missing the desired SLA.
Is there a way in Airflow to force a step to execute each time any parent path finishes, regardless of whether it has already run? This seems like a very logical and common use case, but searching the documentation hasn't yielded anything.
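For reference, the setup described above might be wired roughly like this (the operators for A through D and the unload_artifact callable are placeholders). With one_success, E fires as soon as the first branch finishes and is not scheduled again when the second branch completes, which is exactly the problem described:

from airflow.operators.python import PythonOperator

E = PythonOperator(task_id='E', python_callable=unload_artifact,
                   trigger_rule='one_success', dag=dag)

A >> B >> E
C >> D >> E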

Xtext - Get cross-referenced child

I have a grammar that looks like:
A:
    myField=[B]
;
B:
    C | D | E
;
I have a function that gets A (let's say a) as a parameter and I want to access C, for example.
I used a.myField, which returns a B object (let's say b). Then I used
EcoreUtil2.getAllContentsOfType(b, C) - but it returns an empty list.
Maybe the reason is that B is not really parsed again, but cross-referenced. If so, is there any function that allows me to access C/D/E in the above example?
Thank you.
Update
Apparently b is null, so of course getAllContentsOfType() returns an empty list. How do I access B (which is cross-referenced from A)?
It turned out I had to check that a.myField isn't null.

Reloading module/file and task issue

Goal:
To be able to reload a whole module and use its exported functions and types in running tasks without restarting them.
Problem:
I have a problem applying new function definitions while a task that uses them is running. The idea is to reload a module, not just include the file again, but below I show a simplified version of the problem.
A simplified example:
Let me explain the problem using one file defining only one function f, as follows:
#sample_file.jl
f() = info("f version 01")
Run f every 10 seconds from a task:
julia> include("sample_file.jl")
julia> function call_f()
           while true
               f()
               sleep(10)
           end
       end
julia> task = @async call_f()
Then in a REPL every 10 seconds we see:
julia> INFO: f version 01
INFO: f version 01
INFO: f version 01
INFO: f version 01
Now try to change definition in the sample_file.jl, e.g.
#sample_file.jl
f() = info("f version 02")
In the REPL:
julia> reload("sample_file")
julia> f()
INFO: f version 02
...but the infos from the task still give:
julia> INFO: f version 01
INFO: f version 01
INFO: f version 01
INFO: f version 01
INFO: f version 01
...
Question:
Do you have any idea to deal with that?
In your simplified example, this is https://github.com/JuliaLang/julia/issues/265. The function call_f gets compiled with the original definition of f, and currently does not get recompiled when f is changed.
In general, I think you need to consider what you want to happen when f is changed. Do you want call_f to be recompiled? The simple solution, which doesn't need to recompile call_f, is to store the current function f in a non-const variable (f becomes const when you define it as a function). Then the JIT compiler will know that the function can change and will generate an indirect call.
The core of your problem, as you describe it, is sharing data in parallel computing, which is always a reason to sit down and ponder the available options, restrictions, etc.
You could just use @everywhere, which runs a command on all processes, but I'd say this is a bad idea because you will probably bump into another data sharing/synchronization issue.
My best bet, considering the short description, would be to use a "get an update on global state" approach:
# main process
# ...
if should_update
    current_state.updateSomeParameter(newValue)
end

# state is always `@spawn`ed
@spawn current_state

# continue doing main process stuff

# on remote process
while do_stuff
    # do stuff
    fetch(current_state)
    updateSelfTo(current_state)
    # continue doing my remote stuff
end
