Can a task depend on multiple ShortCircuitOperator tasks in Airflow? - airflow

For instance, can I set up two shorting-tasks in front of another task:
[short_A, short_B] >> task_A

Yes.
When depending on multiple ShortCircuitOperators, you are essentially creating an AND gate.
So if you want task_A to run, you need short_A AND short_B AND ... to return True.
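For reference, a minimal sketch of that AND behavior, assuming Airflow 2.x (the DAG name and callables are illustrative, not from the original question):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

with DAG("and_gate_example", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # each ShortCircuitOperator skips everything downstream of it when it returns False
    short_A = ShortCircuitOperator(task_id="short_A", python_callable=lambda: True)
    short_B = ShortCircuitOperator(task_id="short_B", python_callable=lambda: True)
    task_A = PythonOperator(task_id="task_A",
                            python_callable=lambda: print("runs only if both returned True"))
    [short_A, short_B] >> task_A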
If you want to build an OR gate instead, you either have to replace the ShortCircuitOperator with a custom one, or raise AirflowSkipException inside normal PythonOperator tasks.
For instance for python tasks A, B and C:
[A, B] >> C
You want A and B to raise AirflowSkipException wherever you would have returned False in the short-circuit operator. Finally, you need to set the parameter
trigger_rule='none_failed_or_skipped' on C.
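A minimal sketch of that OR gate, assuming Airflow 2.x (the callables and DAG name are illustrative; in newer releases this trigger rule is named none_failed_min_one_success):
import random
from datetime import datetime
from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

def _check(name):
    # stand-in for a real condition: where a ShortCircuitOperator would
    # return False, raise AirflowSkipException so the task is marked skipped
    if random.random() < 0.5:
        raise AirflowSkipException(f"{name}: condition not met")

with DAG("or_gate_example", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    A = PythonOperator(task_id="A", python_callable=_check, op_kwargs={"name": "A"})
    B = PythonOperator(task_id="B", python_callable=_check, op_kwargs={"name": "B"})
    # C runs as long as no parent failed and at least one parent was not skipped
    C = PythonOperator(task_id="C", python_callable=lambda: print("C runs"),
                       trigger_rule="none_failed_or_skipped")
    [A, B] >> C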

Related

XCOM is a tuple, how to pass the right value to two different downstream tasks

I have an upstream extract task, that extracts files into two different s3 paths. This operator returns a tuple of the two separate s3 paths as XCOM. How do I pass the appropriate XCOM value to the appropriate task?
extract_task >> [load_task_0, load_task_1]
Probably a little late to the party, but will answer anyways.
With TaskFlow API in Airflow 2.0 you can do something like this using decorators:
from airflow.decorators import dag, task

@task(multiple_outputs=True)
def extract_task():
    return {
        "path_0": "s3://path0",
        "path_1": "s3://path1",
    }
Then in your DAG:
@dag()
def my_dag():
    output = extract_task()
    load_task_0(output["path_0"])
    load_task_1(output["path_1"])
This works with a dictionary; it probably won't work with a tuple, but you can try.
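For completeness, a hedged sketch of what the surrounding pieces might look like (the load tasks and DAG arguments are illustrative, not part of the original answer):
from datetime import datetime
from airflow.decorators import dag, task

@task
def load_task_0(path: str):
    print(f"loading from {path}")  # stand-in for the real S3 load

@task
def load_task_1(path: str):
    print(f"loading from {path}")

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False)
def my_dag():
    output = extract_task()  # extract_task as defined above, with multiple_outputs=True
    load_task_0(output["path_0"])
    load_task_1(output["path_1"])

my_dag()  # with the TaskFlow API, calling the decorated function registers the DAG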

Structuring Airflow DAG with complex trigger rules

This is probably as much a logic problem as anything, but it has me stumped. I have the following dag:
I have two main branching events that are tripping me up:
Following A, only the branch B or C should run.
Following B, D should optionally run before E.
Right now I have this implemented so that E has the trigger rule "none_failed", which was needed to prevent E from being skipped when D is skipped. This setup works for all cases except when C needs to run. In that case, C and E run simultaneously (you can see via the box colors that B/D are skipped as intended, but E is green). After reading the trigger rule docs, I understand why this is happening (both of E's parent tasks are skipped, therefore it runs). However, I can't seem to figure out how to get the intended behavior (i.e. keep the current behavior of B/D/E when that branch is run, but don't run E when C is run).
For additional context, this is not my entire DAG. Tasks C and E converge into another task downstream of these with a trigger rule of ONE_FAILED, but I omitted this from the example for simplicity. Any ideas how to get the intended behavior?
This is probably not the best solution, but it seems to cover all of your scenarios. The main change is that I added a dummy task before E in order to control E's timing, and changed E's trigger_rule to "one_success".
"one_success" requires at least one immediate parent to have succeeded, so for E, either D or dummy has to succeed in order for E to run.
# Airflow 2.x import paths
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator

A = BranchPythonOperator(task_id='A', python_callable=_branch_A, dag=dag)
B = BranchPythonOperator(task_id='B', python_callable=_branch_B, dag=dag)
C = DummyOperator(task_id='C', dag=dag)
D = PythonOperator(task_id='D', python_callable=_test, dag=dag)
dummy = DummyOperator(task_id='dummy', dag=dag)
E = DummyOperator(task_id='E', trigger_rule='one_success', dag=dag)

A >> [B, C]
B >> [D, dummy] >> E
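The branch callables are not shown in the answer; a hypothetical sketch of what they might return (the conditions here are placeholders) could be:
def _branch_A(**context):
    # choose exactly one downstream branch by returning its task_id
    return "B" if int(context["ds"][-2:]) % 2 == 0 else "C"

def _branch_B(**context):
    # either run D before E, or go straight to the dummy that feeds E
    return "D" if int(context["ds"][-2:]) % 2 == 0 else "dummy"

def _test():
    print("running D")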

How to use macros within functions in Airflow

I am trying to calculate a hash for each task in Airflow, using a combination of dag_id, task_id and execution_date. I am doing the calculation in the __init__ of a custom operator, so that I can use it to compute a unique retry_delay for each task (I don't want to use exponential backoff).
I find it difficult to use the {{ execution_date }} macro inside a call to a hash function or int(); in those cases Airflow does not replace it with the specific date (it just keeps the string {{ execution_date }}), so I get the same hash for all execution dates:
self.task_hash = int(hashlib.sha1("{}#{}#{}".format(self.dag_id,
                                                    self.task_id,
                                                    '{{execution_date}}')
                                  .encode('utf-8')).hexdigest(), 16)
I have put task_hash in template_fields, and I have also tried to do the calculation in a custom macro - that works for the hash part, but when I put it inside int(), it is the same issue.
Is there any workaround, or could I perhaps retrieve the execution_date (in the __init__ of an operator) without using macros?
Thanks
Try:
# escape the braces so that after .format() the string still contains {{execution_date}}
self.task_hash = int(hashlib.sha1("{}#{}#{{{{execution_date}}}}".format(
    self.dag_id, self.task_id).encode('utf-8')).hexdigest(), 16)
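If the template route keeps fighting you, a hedged alternative (not part of the original answer, and only available at execute time rather than in __init__) is to build the hash from the runtime context, where execution_date is a real datetime:
import hashlib
from airflow.models import BaseOperator

class HashedOperator(BaseOperator):  # hypothetical operator name
    def execute(self, context):
        execution_date = context["execution_date"]  # resolved per run, no macro needed
        task_hash = int(hashlib.sha1("{}#{}#{}".format(
            self.dag_id, self.task_id, execution_date.isoformat()
        ).encode("utf-8")).hexdigest(), 16)
        self.log.info("task hash: %s", task_hash)
        return task_hash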

Erlang: Make a ring

I'm quite new to Erlang (reading through "Software for a Concurrent World"). From what I've read, we link two processes together to form a reliable system.
But if we need more than two processes, I think we should connect them in a ring. Although this is slightly tangential to my actual question, please let me know if this is incorrect.
Given a list of PIDs:
[1,2,3,4,5]
I want to form these in a ring of {My_Pid, Linked_Pid} tuples:
[{1,2},{2,3},{3,4},{4,5},{5,1}]
I have trouble creating an elegant solution that adds the final {5,1} tuple.
Here is my attempt:
% linkedPairs takes [1,2,3] and returns [{1,2},{2,3}]
linkedPairs([]) -> [];
linkedPairs([_]) -> [];
linkedPairs([X1,X2|Xs]) -> [{X1, X2} | linkedPairs([X2|Xs])].
% joinLinks takes [{1,2},{2,3}] and returns [{1,2},{2,3},{3,1}]
joinLinks([{A, _}|_]=P) ->
    {_, Y} = lists:last(P),
    P ++ [{Y, A}].
% makeRing takes [1,2,3] and returns [{1,2},{2,3},{3,1}]
makeRing(PIDs) -> joinLinks(linkedPairs(PIDs)).
I cringe when looking at my joinLinks function - lists:last is slow (I think), and it doesn't look very "functional".
Is there a better, more idiomatic solution to this?
If other functional programmers (non-Erlang) stumble upon this, please post your solution - the concepts are the same.
Use lists:zip with the original list and its 'rotated' version:
1> L=[1,2,3].
[1,2,3]
2> lists:zip(L, tl(L) ++ [hd(L)]).
[{1,2},{2,3},{3,1}]
If you are manipulating long lists, you can avoid the creation of the intermediate list tl(L) ++ [hd(L)] by using a helper function:
1> L = lists:seq(1,5).
[1,2,3,4,5]
2> Link = fun Link([Last],First,Acc) -> lists:reverse([{Last,First}|Acc]);
Link([X|T],First,Acc) -> Link(T,First,[{X,hd(T)}|Acc]) end.
#Fun<erl_eval.42.127694169>
3> Joinlinks = fun(List) -> Link(List,hd(List),[]) end.
#Fun<erl_eval.6.127694169>
4> Joinlinks(L).
[{1,2},{2,3},{3,4},{4,5},{5,1}]
5>
But if we need more than two processes, I think we should connect them in a ring.
No. For instance, suppose you want to download the text of 10 different web pages. Instead of sending a request, then waiting for the server to respond, then sending the next request, etc., you can spawn a separate process for each request. Each spawned process only needs the pid of the main process, and the main process collects the results as they come in. When a spawned process gets a reply from a server, the spawned process sends a message to the main process with the results, then terminates. The spawned processes have no reason to send messages to each other. No ring.
I would guess that it is unlikely that you will ever create a ring of processes in your erlang career.
I have trouble creating an elegant solution that adds the final {5,1} tuple.
You can create the four other processes passing them self(), which will be different for each spawned process. Then, you can create a separate branch of your create_ring() function that terminates the recursion and returns the pid of the last created process to the main process:
init(N) ->
LastPid = create_ring(....),
create_ring(0, PrevPid) -> PrevPid;
create_ring(N, PrevPid) when N > 0 ->
Pid = spawn(?MODULE, loop, [PrevPid]),
create_ring(.......).
Then, the main process can call (not spawn) the same function that is being spawned by the other processes, passing the function the last pid that was returned by the create_ring() function:
init(N) ->
LastPid = create_ring(...),
loop(LastPid).
As a result, the main process will enter into the same message loop as the other processes, and the main process will have the last pid stored in the loop parameter variable to send messages to.
In erlang, you will often find that while you are defining a function, you won't be able to do everything that you want in that function, so you need to call another function to do whatever it is that is giving you trouble, and if in the second function you find you can't do everything you need to do, then you need to call another function, etc. Applied to the ring problem above, I found that init() couldn't do everything I wanted in one function, so I defined the create_ring() function to handle part of the problem.

Terminology: Partial application where the unbound argument is a function?

... partial application (or partial function application) refers to the process of fixing a
number of arguments to a function, producing another function of smaller arity.
I would like to find out if there is a specific name for the following: (pseudo-code!)
// Given functions:
def f(a, b) := ...
def g(a, b) := ...
def h(a, b) := ...
// And a construct of the following:
def cc(F, A, B) := F(A, B) // cc calls its argument F with A and B as parameters
// Then doing Partial Application for cc:
def call_1(F) := cc(F, 42, "answer")
def call_2(F) := cc(F, 7, "lucky")
// And then calling different matching functions this way:
do call_1(f)
do call_1(g)
do call_2(g)
do call_2(h)
Is there a name for this in functional programming? Or is it just partial application where the unbound parameter just happens to be a function?
Actually, there's more to things like your call_N functions, beyond just partial application. Two things of note:
When you apply call_1 or call_2 to an argument, they can be immediately discarded; everything you do with them will be a tail call.
You could write similar functions that don't just apply the argument, but hold onto it for a while; this essentially lets the functions grab hold of their evaluation context, and gives you techniques for implementing complicated flow control via "jumping back" to previous contexts.
If you take the above two points and run with the concept, you'll eventually end up with continuation-passing style.
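To make the connection concrete, here is a tiny illustrative sketch in Python (not from the original answer): first the cc-style partial application from the question, then the same shape with an explicit continuation:
def cc(f, a, b):
    # cc calls its argument f with a and b, as in the pseudo-code above
    return f(a, b)

def call_1(f):
    # partial application of cc: the data is fixed, the function slot stays open
    return cc(f, 42, "answer")

def add_cps(a, b, k):
    # continuation-passing style: k is "the rest of the computation"
    return k(a + b)

print(call_1(lambda a, b: f"{a} is the {b}"))     # -> 42 is the answer
print(add_cps(1, 2, lambda result: result * 10))  # -> 30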
