This is probably as much a logic problem as anything, but it has me stumped. I have the following DAG:
I have two main branching events that are tripping me up:
Following A, only the branch B or C should run.
Following B, D should optionally run before E.
Right now I have this implemented so that E has the trigger rule "none_failed", which was needed to prevent E from being skipped when D is skipped. This setup works for all cases except when C needs to run. In that case, C and E run simultaneously (you can see via the box colors that B/D are skipped as intended, but E is green). After reading the trigger rule docs, I understand why this is happening (both of E's parent tasks are skipped, therefore it runs). However, I can't seem to figure out how to get the intended behavior (i.e., keep the current behavior of B/D/E when that branch runs, but don't run E when C runs).
For additional context, this is not my entire DAG. Tasks C and E converge into another downstream task with a trigger rule of ONE_FAILED, but I omitted it from the example for simplicity. Any ideas on how to get the intended behavior?
This is probably not the best solution, but it seems to cover all of your scenarios. The main idea is to add a dummy task before E in order to control E's timing, and to change E's trigger_rule to "one_success".
"one_success" requires at least one immediate parent to have succeeded, so for E, either D or dummy has to succeed in order for E to run.
# Imports for Airflow 2.x (on 1.10 use airflow.operators.python_operator / dummy_operator)
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator

# _branch_A, _branch_B and _test are the callables from your DAG;
# _branch_B should return 'D' to run D first, or 'dummy' to go straight to E.
A = BranchPythonOperator(task_id='A', python_callable=_branch_A, dag=dag)
B = BranchPythonOperator(task_id='B', python_callable=_branch_B, dag=dag)
C = DummyOperator(task_id='C', dag=dag)
D = PythonOperator(task_id='D', python_callable=_test, dag=dag)
dummy = DummyOperator(task_id='dummy', dag=dag)  # gives E a parent that can succeed when D is skipped
E = DummyOperator(task_id='E', trigger_rule='one_success', dag=dag)

A >> [B, C]
B >> [D, dummy] >> E
Demo
Related
For instance, can I set up two short-circuit tasks in front of another task:
[short_A, short_B] >> task_A
Yes.
When depending on multiple ShortCircuitOperators, you are essentially creating an AND gate.
So if you want task_A to run, you need short_A AND short_B AND ... to return True.
If you want to build an OR gate, you either have to replace the ShortCircuitOperator with a custom one, or raise AirflowSkipException inside normal PythonOperator tasks.
For instance for python tasks A, B and C:
[A, B] >> C
You want A and B to raise AirflowSkipException where you would have returned False in the short-circuit operator. Finally, you need to set the parameter
trigger_rule='none_failed_or_skipped' on C.
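A minimal sketch of that OR gate, assuming Airflow 2-style imports (the task ids, DAG name, and the flag parameter are illustrative, not from the original post):

from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator


def _maybe_skip(flag: bool):
    # Skip (rather than fail) when the condition is not met -- the equivalent of
    # returning False from a ShortCircuitOperator, but without forcing downstream skips.
    if not flag:
        raise AirflowSkipException("condition not met")


with DAG("or_gate_example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    A = PythonOperator(task_id="A", python_callable=_maybe_skip, op_kwargs={"flag": True})
    B = PythonOperator(task_id="B", python_callable=_maybe_skip, op_kwargs={"flag": False})
    # Runs if at least one parent succeeded and none failed, i.e. A OR B.
    C = PythonOperator(
        task_id="C",
        python_callable=lambda: print("at least one condition was met"),
        trigger_rule="none_failed_or_skipped",
    )

    [A, B] >> C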
I have 4 tasks as shown below. I want Task D to be triggered whether Task C has failed or succeeded. However, Task C or Task D should not be triggered if Task A or Task B has failed.
I tried to use trigger_rule = all_done for Task D, but if Task B fails, it triggers Task D as well.
Is there a way to accomplish this in Airflow?
In your case, B is the critical task, and C is non-critical, but you want it to at least make an attempt before D.
First you need to remove all the trigger rules you have applied.
You currently have all_done on C, which means that C runs even when B fails -- which you don't want.
Next you need to add a dependency between B and D:
task_b >> task_d
Now B and C are each independently upstream of D.
So what remains are two problems:
D must not run if B fails
D must not run until C is done
You can't use one_success, because the important task is B, and it's not enough for C alone to succeed.
What you need is "B success and C done".
A relatively clean way to do this is to make C "skip" instead of fail if an error is encountered.
Here's an example of how to do that:
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy import DummyOperator


class MySkippingDummyOperator(DummyOperator):
    def execute(self, context):
        try:
            super().execute(context)
        except Exception as e:
            # Convert any failure into a skip so downstream trigger rules
            # that tolerate skips (but not failures) can still fire.
            raise AirflowSkipException(f'skipping instead of failing: {e}')
If MySkippingDummyOperator encounters an error, the task will end in the skipped state.
So B is success / fail, and C is success / skip. With this behavior we can use trigger rule none_failed on task D.
none_failed means everything completed and nothing failed.
And this should produce the desired behavior:
if B is unsuccessful, then D can't run
if C is unsuccessful, it will only be a skip, so D can still run
D will not run unless both B and C are done
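Putting it together, a minimal sketch of the wiring (using DummyOperator stand-ins for A, B and D, and the MySkippingDummyOperator class above for C; the ids and DAG name are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG("skip_instead_of_fail", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    task_a = DummyOperator(task_id="A")
    task_b = DummyOperator(task_id="B")                    # critical task
    task_c = MySkippingDummyOperator(task_id="C")          # non-critical: skips on error
    # none_failed: D waits for both B and C and runs as long as neither failed
    task_d = DummyOperator(task_id="D", trigger_rule="none_failed")

    task_a >> task_b >> task_c >> task_d
    task_b >> task_d  # the extra dependency described above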
Alternatively, you could let D use all_done, and then from within D retrieve the task instance state of B and then skip D if B failed. But this is more complicated and certainly more of a hack.
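For completeness, a rough sketch of that alternative (a hypothetical callable for D, assuming D is a PythonOperator with trigger_rule="all_done" and the upstream task id is "B"):

from airflow.exceptions import AirflowSkipException
from airflow.utils.state import State


def _run_d(**context):
    # With all_done on D, we have to check B's outcome ourselves.
    b_state = context["dag_run"].get_task_instance(task_id="B").state
    if b_state != State.SUCCESS:
        raise AirflowSkipException("B did not succeed, skipping D")
    # ... real work for D goes here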
I would like to detect whether there is input on stdin in a short time window, and continue execution either way, with the outcome stored in a Bool. (My real goal is to implement a pause button on a simulation that runs in the terminal. A second keypress should unpause the program, and it should continue executing.) I have tried to use poll_fd but it does not work on stdin:
julia> FileWatching.poll_fd(stdin, readable=true)
ERROR: MethodError: no method matching poll_fd(::Base.TTY; readable=true)
Is there a way that will work in Julia? I have found a solution that works in Python, and I have considered using it via PyCall, but I am looking for
a cleaner, pure-julia way; and
a way that does not fight or potentially interfere with julia's use of libuv.
Use bytesavailable(stdin).
Here is a sample usage. Note that if you capture the keyboard, you also need to handle Ctrl+C yourself (in this example only the first byte of the chunk is checked).
If you want to run it fully asynchronously, put @async in front of the while loop. However, if there is no more code after it, the program will just exit (see the sketch after the loop below).
import REPL

# Put the terminal in raw mode so keypresses arrive immediately (no Enter needed),
# and tell libuv to start buffering bytes arriving on stdin.
term = REPL.Terminals.TTYTerminal("xterm", stdin, stdout, stderr)
REPL.Terminals.raw!(term, true)
Base.start_reading(stdin)

while true
    sleep(1)
    bb = bytesavailable(stdin)
    if bb > 0
        data = read(stdin, bb)
        if data[1] == UInt(3)  # in raw mode Ctrl+C arrives as byte 0x03
            println("Ctrl+C - exiting")
            exit()
        end
        println("Got $bb bytes: $(string(data))")
    end
end
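As mentioned above, a minimal sketch of the asynchronous variant, wrapping the same loop in a task (the wait call is only needed if nothing else keeps the script alive):

poller = @async while true
    sleep(1)
    bb = bytesavailable(stdin)
    if bb > 0
        data = read(stdin, bb)
        data[1] == UInt(3) && (println("Ctrl+C - exiting"); exit())
        println("Got $bb bytes: $(string(data))")
    end
end

# Do other work here; in a standalone script, keep the process alive:
wait(poller)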
Following @Przemyslaw Szufel's response, here is a full solution that allows a keypress to pause/unpause the iteration of a loop:
import REPL

term = REPL.Terminals.TTYTerminal("xterm", stdin, stdout, stderr)
REPL.Terminals.raw!(term, true)
Base.start_reading(stdin)

function read_and_handle_control_c()
    b = bytesavailable(stdin)
    @assert b > 0
    data = read(stdin, b)
    if data[1] == UInt(3)  # Ctrl+C
        println("Ctrl+C - exiting")
        exit()
    end
    nothing
end

function check_for_and_handle_pause()
    if bytesavailable(stdin) > 0
        # First keypress: pause...
        read_and_handle_control_c()
        # ...then block until a second keypress arrives.
        while bytesavailable(stdin) == 0
            sleep(0.1)
        end
        read_and_handle_control_c()
    end
    nothing
end

while true
    # [do stuff]
    sleep(0.05)
    check_for_and_handle_pause()
end
This is somewhat suboptimal in that it requires the process to wake up regularly even when paused, but it achieves my goal nevertheless.
So I have two paths: steps A and C check for updates and do some transformations in databases, and step E unloads a query to an artifact file.
A -> B ->
            --> E
C -> D ->
Now I want step E to run when:
1) Step A and B are completed
or
2) Step C and D are completed
I tried to use trigger_rule 'one_success' on step E, but the problem is that if step A starts just before step C, step E will only run once and the data changes from C are not unloaded to the final artifact, missing the desired SLA.
Is there a way in Airflow to force a step to execute once any parent task finishes executing, regardless of whether it has already run? This seems like a very logical and common use case, but searching the documentation doesn't yield anything.
Goal:
To be able to reload a whole module and use its exported functions and types in running tasks without restarting them.
Problem:
I have a problem applying new function definitions while a task that uses them is running. The real goal is to reload a module, not just to include a file again, but below I show a simplified version of the problem.
A simplified example:
Let me explain the problem using one file defining only one function f, as follows:
#sample_file.jl
f() = info("f version 01")
Run f every 10 seconds from a task:
julia> include("sample_file.jl")

julia> function call_f()
           while true
               f()
               sleep(10)
           end
       end

julia> task = @async call_f()
Then in a REPL every 10 seconds we see:
julia> INFO: f version 01
INFO: f version 01
INFO: f version 01
INFO: f version 01
Now try to change the definition in sample_file.jl, e.g.
#sample_file.jl
f() = info("f version 02")
In the REPL:
julia> reload("sample_file")
julia> f()
INFO: f version 02
...but the info messages from the task still print:
julia> INFO: f version 01
INFO: f version 01
INFO: f version 01
INFO: f version 01
INFO: f version 01
...
Question:
Do you have any idea to deal with that?
In your simplified example, this is https://github.com/JuliaLang/julia/issues/265. The function call_f gets compiled with the original definition of f, and currently does not get recompiled when f is changed.
In general, I think you need to consider what you want to happen when f is changed. Do you want call_f to be recompiled? The simple solution, which doesn't need to recompile call_f, is to store the current function f in a non-const variable (f becomes const when you define your function). Then the JIT compiler will know that the function can change and will generate an indirect call.
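A minimal sketch of that non-const-binding idea, keeping the question's info-style logging (the name current_f is illustrative):

current_f = () -> info("f version 01")   # non-const global, so it can be rebound later

function call_f()
    while true
        current_f()   # non-const binding: looked up (and dispatched) on every call
        sleep(10)
    end
end

task = @async call_f()

# Later, swap in a new definition; the running task picks it up on its next iteration:
current_f = () -> info("f version 02")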
The core of your problem as you describe it is sharing data in parallel computing, which is always a reason to sit down and ponder the options available, the restrictions, etc.
You could just use @everywhere, which runs a command on all processes, but I'd say this is a bad idea because you will probably bump into another data-sharing/synchronization issue.
My best bet, considering the short description, would be to use a "get an update on global state" approach:
# main process
# ...
if should_update
    current_state.updateSomeParameter(newValue)
end
# the state is always `@spawn`ed so workers can fetch it
@spawn current_state
# continue doing main process stuff

# on remote process
while do_stuff
    # do stuff
    fetch(current_state)
    updateSelfTo(current_state)
    # continue doing my remote stuff
end
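A hedged, runnable sketch of that "get an update on global state" idea in current Julia, using a RemoteChannel from the Distributed standard library (the names SimState, state_chan and update_state! are illustrative, not from the original post):

using Distributed
addprocs(1)

@everywhere struct SimState
    parameter::Float64
end

# A channel of size 1 holding the current state; workers read it without removing it.
const state_chan = RemoteChannel(() -> Channel{SimState}(1))
put!(state_chan, SimState(1.0))

function update_state!(new_state::SimState)
    take!(state_chan)          # drop the old value
    put!(state_chan, new_state)
end

# On the worker: fetch() returns the current value without taking it off the channel.
work = @spawnat 2 begin
    for _ in 1:5
        current = fetch(state_chan)
        println("worker sees parameter = $(current.parameter)")
        sleep(1)
    end
end

sleep(2)
update_state!(SimState(2.0))   # the worker sees this on its next fetch
wait(work)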