I want to launch grouped tasks in Airflow: when the first group ends, the second group should start. For example:
I have tasks A, B, C and D. I want to run A and B together and, once both have finished, start C and D together.
I tried this, but it does not work:
[A,B] >> [C,D]
(The tasks are BashOperators.)
Could you help me?
Thanks!
There are two ways to do this; I will show both.
1. Define the dependencies one by one.
Right now you are trying to do it all in one line.
This is not possible because a dependency can only be set from a list to a single task or from a single task to a list. Going from a list to a list is not supported.
Because your example only has 4 tasks, we can do it in two lines.
# original
# [A,B] >> [C,D]
# new way
[A, B] >> C
[A, B] >> D
2. Create a DummyOperator in the middle.
Let's introduce task E, a DummyOperator.
The DummyOperator will always succeed automatically once its dependencies are all done. Now we can define it as follows.
[A, B] >> E
E >> [C, D]
In general, this is a very nice way of defining your DAGs because it scales to any number of upstream tasks feeding any number of downstream tasks, still in just two lines.
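To see why the list-to-list form fails while both rewrites work, here is a toy model in plain Python. It assumes the task class defines `__rshift__` plus the reflected `__rrshift__`, which is how `[A, B] >> C` can work even though a Python list has no `>>` of its own. The `Task` class is a hypothetical stand-in for illustration, not Airflow's actual code.

```python
class Task:
    """Toy stand-in for an operator (hypothetical, not Airflow's class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = set()

    def _add_downstream(self, other):
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.add(target.task_id)

    def __rshift__(self, other):
        # Handles `task >> task` and `task >> [tasks]`.
        self._add_downstream(other)
        return other

    def __rrshift__(self, other):
        # Handles `[tasks] >> task`: list has no __rshift__, so Python
        # falls back to the right-hand operand's reflected method.
        for upstream in other:
            upstream._add_downstream(self)
        return self


A, B, C, D, E = (Task(name) for name in "ABCDE")

# list >> list: neither operand is a Task, so nothing can handle it.
try:
    [A, B] >> [C, D]
    list_to_list_ok = True
except TypeError:
    list_to_list_ok = False

# The join-task pattern from option 2.
[A, B] >> E
E >> [C, D]
```

After this runs, A and B each point at E, and E points at C and D, which is the "wait for the whole first group" behaviour the question asks for.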
I want to use airflow for image processing.
I have 4 tasks: image pre-process (A), bounding box finder (B), classification (C), image finalize (D).
The chart looks like this:
A -> B1 -> C \
  -> B2 -> C - D
  -> B3 -> C /
  -> Bn -> C /
The output of the image pre-process task is a list of bounding box proposals. For each bounding box I run classification, and once all classification tasks have ended I run the image finalize task.
I want everything to run in parallel.
This will run on 10,000 images per day, so if the UI shows a different representation of the pipeline for each image, I won't be able to keep track of it...
Is this possible in Airflow?
Dynamically creating tasks like this is not something Airflow is best for. Take a look at the answer here to get some insight: Airflow dynamic tasks at runtime.
Airflow is better suited as a scheduling tool, so I propose you delegate the actual work and parallelization to another tool like Celery. You can still use Airflow to schedule this work, in a way that your B step is a simple operator which reads the output from A (via XCom or similar) and distributes actual work to some remote workers.
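A rough sketch of that shape, using a local thread pool as a stand-in for remote workers such as Celery. All function names and the fake data are made up for illustration; only the fan-out/fan-in structure matters.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(image):
    # Hypothetical step A: return a list of bounding box proposals.
    return [(image, index) for index in range(3)]


def classify(box):
    # Hypothetical step C: classify one bounding box.
    _image, index = box
    return f"label-for-{index}"


def finalize(image, labels):
    # Hypothetical step D: combine all classification results.
    return {"image": image, "labels": labels}


def process_image(image, max_workers=4):
    """One scheduled unit of work: fan out over the boxes, then fan in."""
    boxes = preprocess(image)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs classify concurrently and keeps the input order.
        labels = list(pool.map(classify, boxes))
    return finalize(image, labels)


result = process_image("img-001")
```

The scheduler then only sees one task per image (or per batch), and the number of boxes never has to appear in the DAG structure.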
Do you know in advance the maximum possible number of B tasks? If that's manageable, you could get away with creating the maximum number of B tasks and then skipping some of them as needed, depending on the outcome of A.
The implementation might not be trivial, but you could get some hints from this discussion: Launch a subdag with variable parallel tasks in airflow.
Both POSIX and the Windows API have barriers that allow synchronizing n threads: once n threads are waiting at the barrier, they may all proceed.
What I want is for threads to wait on a set of barriers. When any one of the barriers has n threads waiting, it unlocks and reports which barrier was unlocked. As far as I can tell, neither POSIX nor the Windows API offers this natively. Is there a way around it?
Target language is C/C++ but language agnostic solutions are also appreciated.
Background: I'm looking into CSP, the basis of Occam and a source of inspiration for Go. I believe a runtime can treat events as barriers. However, that would require some way of waiting for multiple barriers. I'd like to get around this without putting major effort into a supervisor.
Edit: Making an example in CSP notation.
P = c -> c -> d -> P;
Q = d -> e -> Q;
R = c -> R | d -> SKIP;
RUNTIME = P || Q || R;
For those of you unfamiliar with the syntax: P is a process (thread) that engages in event c, then c again, then d, and then behaves like P again. Q works similarly. R is defined as c then R, or d then SKIP. RUNTIME is the concurrent process made from P, Q and R.
Events are synchronized like barriers: all processes involved must be ready to engage in an event at the same time for it to happen. The trick is that a process may be willing to participate in any of a set of events (R may participate in c or d) yet be unable to, because other processes (P and Q) are not ready to participate.
This is where the "wait for ANY from a SET of barriers" comes in. R may wait for c or d simultaneously, and depending on which one is unlocked will behave differently.
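One language-agnostic way to picture "wait for ANY from a SET of barriers" is to guard all barriers with a single lock and condition variable, so that a thread can register on several barriers atomically. The sketch below is in Python for brevity; `BarrierSet` and `wait_any` are hypothetical names, and it models only the simplified "n threads waiting" rule from the question, not full CSP alphabets.

```python
import threading


class _Ticket:
    """One thread's pending wait; `fired` names the barrier that released it."""

    def __init__(self):
        self.fired = None


class BarrierSet:
    """Hypothetical helper: several named barriers guarded by one lock."""

    def __init__(self, parties):
        # `parties` maps a barrier name to the number of threads it needs.
        self._parties = dict(parties)
        self._waiting = {name: [] for name in self._parties}
        self._cond = threading.Condition()

    def wait_any(self, names):
        """Block until one of the named barriers fires; return its name."""
        ticket = _Ticket()
        with self._cond:
            for name in names:
                self._waiting[name].append(ticket)
            for name in names:
                if len(self._waiting[name]) >= self._parties[name]:
                    self._fire(name)
                    break
            while ticket.fired is None:
                self._cond.wait()
            return ticket.fired

    def _fire(self, name):
        # Lock is held: release everyone queued on `name`, then withdraw
        # their registrations from every other barrier.
        for ticket in self._waiting[name]:
            ticket.fired = name
        for queue in self._waiting.values():
            queue[:] = [t for t in queue if t.fired is None]
        self._cond.notify_all()


barriers = BarrierSet({"c": 2, "d": 2})
results = {}


def worker(label, names):
    results[label] = barriers.wait_any(names)


threads = [
    threading.Thread(target=worker, args=("R", ["c", "d"])),
    threading.Thread(target=worker, args=("P", ["c"])),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join(timeout=5)
```

Here R offers both c and d while P offers only c, so c is the barrier that fills up, and both threads are told that c fired. The same structure maps onto a pthread mutex plus condition variable in C/C++.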
Suppose I have some function that will recurse forever; the simplest one I know is:
f x = f x
How can I write a monad that will modify the behaviour of this function, such that it gives me the value of x, and a continuation that will compute the next step, consisting of the value of x at that step and a continuation...
You could structure this as something like the following (using Haskell for illustrative purposes):
data Steps a where
  Done :: a -> Steps a
  ToDo :: b -> (b -> Steps a) -> Steps a
with the obvious monad implementation.
However, since b is existentially typed in the ToDo constructor, you won't be able to do much with these intermediate results. In effect, I don't think this will give you any more information than just the plain old partiality monad of
data Partial a = Now a | Later (Partial a)
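To make the partiality idea concrete, here is a small sketch of the same structure in Python: `Now` holds a finished value and `Later` holds a thunk for the next step. The names `bind`, `run`, `fuel` and `countdown` are my own choices for illustration, not anything from a library.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Now:
    value: Any


@dataclass
class Later:
    step: Callable[[], Any]  # thunk producing the next Now or Later


def bind(partial, k):
    """Monadic bind: run `k` on the result once `partial` finishes."""
    if isinstance(partial, Now):
        return k(partial.value)
    return Later(lambda: bind(partial.step(), k))


def run(partial, fuel):
    """Take at most `fuel` steps; return the value, or None if unfinished."""
    for _ in range(fuel):
        if isinstance(partial, Now):
            return partial.value
        partial = partial.step()
    return partial.value if isinstance(partial, Now) else None


def f(x):
    # Step-wise version of `f x = f x`: one Later per recursive call.
    return Later(lambda: f(x))


def countdown(n):
    return Now("done") if n == 0 else Later(lambda: countdown(n - 1))


never = run(f(1), 1000)
finished = run(countdown(3), 10)
too_little_fuel = run(countdown(3), 2)
chained = run(bind(countdown(2), lambda s: Now(s + "!")), 10)
```

The caller decides how many steps to take, which is the "value plus a continuation for the next step" behaviour the question describes, minus any view into the intermediate values.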
I am learning more and more about the Erlang language and have recently run into a problem. I read about the foldl(Fun, Acc0, List) -> Acc1 function. I used the learnyousomeerlang.com tutorial, which has this example (a Reverse Polish Notation calculator in Erlang):
% Splits the input on spaces and folds rpn/2 over the tokens
rpn(L) when is_list(L) ->
    [Res] = lists:foldl(fun rpn/2, [], string:tokens(L, " ")),
    Res.
% Converts a string to an integer or floating point value
read(N) ->
    case string:to_float(N) of
        % string:to_float/1 returns {error, no_float} when N is not a float
        {error,no_float} -> list_to_integer(N);
        {F,_} -> F
    end.
% rpn/2 applies one token to the current stack
rpn("+", [N1,N2|S]) -> [N2+N1|S];
rpn("-", [N1,N2|S]) -> [N2-N1|S];
rpn("*", [N1,N2|S]) -> [N2*N1|S];
rpn("/", [N1,N2|S]) -> [N2/N1|S];
rpn("^", [N1,N2|S]) -> [math:pow(N2,N1)|S];
rpn("ln", [N|S]) -> [math:log(N)|S];
rpn("log10", [N|S]) -> [math:log10(N)|S];
rpn(X, Stack) -> [read(X) | Stack].
As far as I understand, lists:foldl executes rpn/2 on every element of the list, but that is as far as my understanding goes. I read the documentation but it did not help me much. Can someone explain to me how lists:foldl works?
Let's say we want to add a list of numbers together:
1 + 2 + 3 + 4.
This is a pretty normal way to write it. But I wrote "add a list of numbers together", not "write numbers with pluses between them". There is something fundamentally different between the way I expressed the operation in prose and the mathematical notation I used. We do this because we know it is an equivalent notation for addition (because addition is associative, the grouping does not matter), and in our heads it reduces immediately to:
3 + 7.
and then
10.
So what's the big deal? The problem is that we have no way of understanding the idea of summation from this example. What if instead I had written "Start with 0, then take one element from the list at a time and add it to the starting value as a running sum"? This is actually what summation is about, and it's not arbitrarily deciding which two things to add first until the equation is reduced.
sum(List) -> sum(List, 0).
sum([], A) -> A;
sum([H|T], A) -> sum(T, H + A).
If you're with me so far, then you're ready to understand folds.
There is a problem with the function above; it is too specific. It braids three ideas together without specifying any independently:
iteration
accumulation
addition
It is easy to miss the difference between iteration and accumulation because most of the time we never give this a second thought. Most languages accidentally encourage us to miss the difference, actually, by having the same storage location change its value each iteration of a similar function.
It is easy to miss the independence of addition merely because of the way it is written in this example: "+" looks like an "operation", not a function.
What if I had said "Start with 1, then take one element from the list at a time and multiply it by the running value"? We would still be doing the list processing in exactly the same way, but with two examples to compare it is pretty clear that multiplication and addition are the only difference between the two:
prod(List) -> prod(List, 1).
prod([], A) -> A;
prod([H|T], A) -> prod(T, H * A).
This is exactly the same flow of execution but for the inner operation and the starting value of the accumulator.
So let's make the addition and multiplication bits into functions, so we can pull that part of the pattern out:
add(A, B) -> A + B.
mult(A, B) -> A * B.
How could we write the list operation on its own? We need to pass a function in -- addition or multiplication -- and have it operate over the values. Also, we have to pay attention to the identity of the type and operation of things we are operating on or else we will screw up the magic that is value aggregation. "add(0, X)" always returns X, so this idea (0 + Foo) is the addition identity operation. In multiplication the identity operation is to multiply by 1. So we must start our accumulator at 0 for addition and 1 for multiplication (and for building lists an empty list, and so on). So we can't write the function with an accumulator value built-in, because it will only be correct for some type+operation pairs.
So this means to write a fold we need to have a list argument, a function to do things argument, and an accumulator argument, like so:
fold([], _, Accumulator) ->
Accumulator;
fold([H|T], Operation, Accumulator) ->
fold(T, Operation, Operation(H, Accumulator)).
With this definition we can now write sum/1 using this more general pattern:
fsum(List) -> fold(List, fun add/2, 0).
And prod/1 also:
fprod(List) -> fold(List, fun mult/2, 1).
And they are functionally identical to the ones we wrote above, but the notation is clearer and we don't have to write a bunch of recursive details that tangle the idea of iteration with the idea of accumulation with the idea of some specific operation like multiplication or addition.
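For readers coming from other languages, the same fold can be sketched in Python. This is only an analogy to the Erlang code above. One detail worth noticing: the hand-written `fold` below passes `(element, accumulator)` like the Erlang version, while Python's built-in `functools.reduce` passes `(accumulator, element)`.

```python
from functools import reduce
import operator


def fold(items, operation, accumulator):
    """Same shape as the Erlang fold/3: operation(element, accumulator)."""
    for item in items:
        accumulator = operation(item, accumulator)
    return accumulator


fsum = fold([1, 2, 3, 4], operator.add, 0)
fprod = fold([1, 2, 3, 4], operator.mul, 1)

# The accumulator does not have to be a number: with an empty list as the
# starting value and "prepend" as the operation, the fold reverses a list.
reversed_items = fold([1, 2, 3], lambda head, acc: [head] + acc, [])

# The standard library equivalent, with the argument order swapped.
reduced_sum = reduce(lambda acc, item: acc + item, [1, 2, 3, 4], 0)
```

Iteration lives in `fold`, the operation is passed in, and the starting accumulator is chosen to suit the operation (0 for addition, 1 for multiplication, an empty list for building lists).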
In the case of the RPN calculator the idea of aggregate list operations is combined with the concept of selective dispatch (picking an operation to perform based on what symbol is encountered/matched). The RPN example is relatively simple and small (you can fit all the code in your head at once, it's just a few lines), but until you get used to functional paradigms the process it manifests can make your head hurt. In functional programming a tiny amount of code can create an arbitrarily complex process of unpredictable (or even evolving!) behavior, based just on list operations and selective dispatch; this is very different from the conditional checks, input validation and procedural checking techniques used in other paradigms more common today. Analyzing such behavior is greatly assisted by single assignment and recursive notation, because each iteration is a conceptually independent slice of time which can be contemplated in isolation of the rest of the system. I'm talking a little ahead of the basic question, but this is a core idea you may wish to contemplate as you consider why we like to use operations like folds and recursive notations instead of procedural, multiple-assignment loops.
I hope this helped more than confused.
First, you have to remember how RPN works. To compute 2 * (3 + 5), you feed the function the input "3 5 + 2 *". This was useful back when you had only 25 steps to enter a program :o)
The first function called simply splits this character list into elements:
1> string:tokens("3 5 + 2 *"," ").
["3","5","+","2","*"]
2>
Then lists:foldl/3 processes the list. For each element, rpn/2 is called with that element (the head of the remaining input) and the current accumulator, and returns a new accumulator. Let's go step by step:
Step  head  accumulator  matched rpn/2                        return value
1     "3"   []           rpn(X, Stack) -> [read(X) | Stack].  [3]
2     "5"   [3]          rpn(X, Stack) -> [read(X) | Stack].  [5,3]
3     "+"   [5,3]        rpn("+", [N1,N2|S]) -> [N2+N1|S];    [8]
4     "2"   [8]          rpn(X, Stack) -> [read(X) | Stack].  [2,8]
5     "*"   [2,8]        rpn("*", [N1,N2|S]) -> [N2*N1|S];    [16]
At the end, lists:foldl/3 returns [16], which matches [Res], and thus rpn/1 returns Res = 16.
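The same trace can be reproduced mechanically. Below is a rough Python port of the calculator (only the binary operators, to keep it short) that uses `itertools.accumulate` to record the stack after every token. It is an illustration of the fold, not a translation of everything in the Erlang module.

```python
import math
import operator
from functools import reduce
from itertools import accumulate

BINARY = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": math.pow,
}


def read(token):
    # Mirror the Erlang read/1: integers stay integers, the rest are floats.
    try:
        return int(token)
    except ValueError:
        return float(token)


def rpn_step(stack, token):
    """One fold step; the stack keeps its top element at index 0."""
    if token in BINARY:
        n1, n2, *rest = stack
        return [BINARY[token](n2, n1), *rest]
    return [read(token), *stack]


def rpn(expression):
    (result,) = reduce(rpn_step, expression.split(), [])
    return result


tokens = "3 5 + 2 *".split()
trace = list(accumulate(tokens, rpn_step, initial=[]))
answer = rpn("3 5 + 2 *")
```

`trace` holds the accumulator before the first token and after each of the five steps, matching the return value column of the table.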
Given an undirected cyclic graph, I want to find all possible traversals with Breadth-First search or Depth-First search. That is given a graph as an adjacency-list:
A-BC
B-A
C-ADE
D-C
E-C
So all BFS paths from root A would be:
{ABCDE,ABCED,ACBDE,ACBED}
and for DFS:
{ABCDE,ABCED,ACDEB,ACEDB}
How would I generate those traversals algorithmically in a meaningful way? I suppose one could generate all permutations of letters and check their validity, but that seems like last-resort to me.
Any help would be appreciated.
Apart from the obvious way where you actually perform all possible DFS and BFS traversals you could try this approach:
Step 1.
In a dfs traversal starting from the root A transform the adjacency list of the currently visited node like so: First remove the parent of the node from the list. Second generate all permutations of the remaining nodes in the adj list.
So if you are at node C having come from node A you will do:
C -> ADE transform into C -> DE transform into C -> [DE, ED]
Step 2.
After step 1 you have the following transformed adj list:
A -> [CB, BC]
B -> []
C -> [DE, ED]
D -> []
E -> []
Now start processing from the pair (A,0), where the first item in the pair is the traversal path and the second is an index. Let's assume we have two queues: a BFS queue and a DFS queue. We put this pair into both queues.
Now we repeat the following, first for one queue until it is empty and then for the other queue.
We pop the first pair off the queue. We get (A,0). The node A maps to [BC, CB]. So we generate two new paths (ACB,1) and (ABC,1). Put these new paths in the queue.
Take the first one of these off the queue to get (ACB,1). The index is 1 so we look at the second character in the path string. This is C. Node C maps to [DE, ED].
The BFS children of this path would be (ACBDE,2) and (ACBED,2) which we obtained by appending the child permutation.
The DFS children of this path would be (ACDEB,2) and (ACEDB,2) which we obtained by inserting the child permutation right after C into the path string.
We generate the new paths according to which queue we are working on, based on the above and put them in the queue. So if we are working on the BFS queue we put in (ACBDE,2) and (ACBED,2). The contents of our queue are now : (ABC,1) , (ACBDE,2), (ACBED,2).
We pop (ABC,1) off the queue and generate (ABC,2), since B has no children. The queue is now: (ACBDE,2), (ACBED,2), (ABC,2).
And so on. At some point we will end up with a bunch of pairs whose index is past the end of the path. For example, if we get (ACBED,5) we know this is a finished path.
BFS should be quite simple: each node has a certain depth at which it will be found. In your example you find A at depth 0, B and C at depth 1 and E and D at depth 2. In each BFS path, you will have the element with depth 0 (A) as the first element, followed by any permutation of the elements at depth 1 (B and C), followed by any permutation of the elements at depth 2 (E and D), etc...
If you look at your example, your 4 BFS paths match that pattern. A is always the first element, followed by BC or CB, followed by DE or ED. You can generalize this to graphs with deeper nodes, with one caveat: when the nodes at one depth have different parents, their order is constrained by the order in which those parents were visited, so not every permutation within a level is a valid BFS order.
To find the depths, all you need is a single shortest-path search from the root (a plain BFS suffices for an unweighted graph; Dijkstra also works), which is quite cheap.
In DFS, you don't have the nice separation by depth which makes BFS straightforward. I don't immediately see an algorithm that is as efficient as the one above. You could set up a graph structure and build up your paths by traversing your graph and backtracking. There are some cases in which this would not be very efficient but it might be enough for your application.
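A minimal backtracking sketch of both enumerations, assuming the graph is given as a dict of adjacency lists and is small enough that exponential blow-up is acceptable. The function names `all_dfs` and `all_bfs` are my own. For BFS, each dequeued node appends its unvisited neighbours in every possible order; for DFS, the node on top of the stack picks any unvisited neighbour next, and is popped when it has none.

```python
from itertools import permutations

graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A", "D", "E"],
    "D": ["C"],
    "E": ["C"],
}


def all_bfs(graph, root):
    """Every visiting order some BFS from `root` can produce."""
    results = set()

    def explore(queue, order):
        if not queue:
            results.add("".join(order))
            return
        node, rest = queue[0], queue[1:]
        fresh = [n for n in graph[node] if n not in order]
        # Try every order in which this node's new neighbours are enqueued.
        for perm in permutations(fresh):
            explore(rest + list(perm), order + list(perm))

    explore([root], [root])
    return results


def all_dfs(graph, root):
    """Every visiting order some DFS from `root` can produce."""
    results = set()

    def explore(stack, order):
        if not stack:
            results.add("".join(order))
            return
        fresh = [n for n in graph[stack[-1]] if n not in order]
        if not fresh:
            explore(stack[:-1], order)  # dead end: backtrack
        for nxt in fresh:
            explore(stack + [nxt], order + [nxt])

    explore([root], [root])
    return results


bfs_orders = all_bfs(graph, "A")
dfs_orders = all_dfs(graph, "A")
```

On the graph from the question this yields the four BFS orders and the four DFS orders listed there. Because the BFS version tracks the queue explicitly, it also handles the case where nodes on the same level have different parents.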