Airflow tasks not running in correct order?

cleanup_pbp is downstream of all 4 of load_pbp_30629, load_pbp_30630, load_to_bq_30629, load_to_bq_30630. cleanup_pbp started at 2021-12-05T08:54:48.
however, load_pbp_30630, one of the 4 upstream tasks, did not end until 2021-12-05T09:02:23.
How is cleanup_pbp running before load_pbp_30630 ends? I've never seen this before. I know our task dependencies have a bit of criss-cross going on, but that shouldn't cause the tasks to run out of order, should it?

We had exactly the same problem, and after checking we finally found out that it was caused by the way tasks are created in a loop in the DAG script.
We use a for loop to create the tasks as well as their relationships. Each iteration creates the upstream task and the downstream task (like cleanup_pbp in your case), always giving the same task id to the downstream one and defining its relations (e.g. load_pbp_xxx >> cleanup_pbp). In the graph or tree view this downstream task appears to have several dependencies, but when the DAG runs it only keeps the relation defined in the last iteration. If you check the Task Instance you will see only one task in its upstream_list.
The solution is to move the definition of this final task out of the loop, but keep the definition of the dependencies inside the loop. This resolved our ordering problem: the final task won't start until all dependencies are finished (with trigger rule all_success, of course).
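Here is a minimal sketch of that fix, assuming Airflow 2.x import paths; the task names follow the question, but the DummyOperator bodies, dag id and dates are placeholders for illustration:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

with DAG("pbp_example", start_date=datetime(2021, 12, 1), schedule_interval=None) as dag:
    # Define the shared downstream task ONCE, outside the loop. Re-creating it
    # inside the loop with the same task_id leaves only the last iteration's
    # dependency in place at run time.
    cleanup_pbp = DummyOperator(task_id="cleanup_pbp", trigger_rule="all_success")

    for game_id in ["30629", "30630"]:
        load_pbp = DummyOperator(task_id=f"load_pbp_{game_id}")
        load_to_bq = DummyOperator(task_id=f"load_to_bq_{game_id}")
        # ...but keep the dependency definitions inside the loop.
        [load_pbp, load_to_bq] >> cleanup_pbp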
I’m not sure if you are in the same situation as us but hope this will give you some idea.

Related

Way to designate certain set of airflow tasks to run before others (order invariant)?

Have an airflow (v1.10.5) dag that looks like...
Is there a way to specify that all of the blue tasks should complete before the scheduler moves on to any downstream tasks (currently the scheduler sometimes goes down an entire branch of tasks before doing the next blue task)?
I want to avoid just putting them in sequence (and using trigger rule TriggerRule.ALL_DONE) because they don't actually have any logical order in which they need to be done (other than that they all need to be done before any other downstream tasks in any branch).
Does anyone know of a way to do this (like some kind of "priority" pool for tasks)? Any other workaround suggestions?
I asked this question on the Airflow mailing list and this is the result...
# white, blue_*, green_*, yellow_* are tasks defined elsewhere in the DAG
from airflow.utils.helpers import cross_downstream

blue = [blue_a, blue_b, blue_c]
green = [green_a, green_b, green_c]
yellow = [yellow_a, yellow_b]

cross_downstream(from_tasks=[white], to_tasks=blue)
cross_downstream(from_tasks=blue, to_tasks=green)
cross_downstream(from_tasks=green, to_tasks=yellow)
This should create the required network of dependencies between tasks.
A visualization is available here:
https://imgur.com/a/2jqyqQO
This is the easiest solution and in my opinion the correct one.
However, if you don't want the dependencies, then you can create a new scheduling rule by editing the BaseOperator.deps property.
The docs for this DAG-building helper function can be found here: https://airflow.apache.org/docs/stable/concepts.html#relationship-helper
Which was a useful solution, but...
One thing about my case is that the next tasks (greens) in each branch should only run if the blue task in that same branch completes successfully (should not care about the success/failure status of the other blue tasks, only that they have been run). Thus I don't think the ALL_DONE trigger rule will help the greens and ALL_SUCCESS would be too strict.
Any ideas for such a thing?
After some more thought, here is my workaround...

Airflow: Concurrency Depth first, rather than breadth first?

In Airflow, the default configuration seems to queue up tasks in parallel across days, from one day to the next.
However, if I spin this process up across, say, two years, then the DAG will churn through the preliminary processes first, across all days, rather than taking, say, 4 days forward from start to finish concurrently.
How do I get Airflow to execute tasks depth first rather than breadth first?
I have come across a similar situation. I used the following trick to achieve that depth-first behaviour.
Assign all tasks of your DAG to a single pool (with a limited number of slots, say 20-30)
Set weight_rule='upstream' on all of the above tasks
Explanation
The upstream weight rule reverses the prioritization of tasks based on their position across the breadth of the workflow, giving all downstream tasks a higher priority than upstream tasks.
This ensures that whichever branches have been launched run to completion before the next branch is picked up, thereby achieving the depth-first behaviour.
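A minimal sketch of that setup, assuming Airflow 2.x import paths and that a pool named "dag_pool" (a name made up here) has already been created in the UI or CLI with a limited number of slots; the dag id and tasks are placeholders:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

default_args = {
    "pool": "dag_pool",         # every task competes for the same limited slots
    "weight_rule": "upstream",  # downstream tasks get a higher priority_weight
}

with DAG("depth_first_example", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", default_args=default_args) as dag:
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    load = DummyOperator(task_id="load")
    extract >> transform >> load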
Try toggling the parallelism and max_active_runs parameters in your airflow.cfg, and the concurrency parameter on your DAGs.

How to run Airflow DAG for specific number of times?

How to run airflow dag for specified number of times?
I tried using TriggerDagRunOperator; this operator works for me.
In the callable function we can check states and decide whether or not to continue.
However, the current count and states need to be maintained.
Using the above approach I am able to repeat the DAG run.
I'd like an expert opinion: is there any other, more solid way to run an Airflow DAG for X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times and a run takes 2 hours and you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, or why it must be 10 runs, nor why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically), and the end_date to today (also fixed) and then add a daily schedule_interval and a max_active_runs of 1 and you'll get exactly 10 runs and it'll run them back to back without overlapping while changing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None scheduled DAG and a range of execution datetimes.
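For illustration, a minimal sketch of that fixed-window setup, assuming Airflow 2.x import paths; the dag id, dates and task are placeholders, so adjust the window to get the number of runs you want:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

with DAG(
    "run_n_times",
    start_date=datetime(2021, 1, 1),   # fixed, not dynamic
    end_date=datetime(2021, 1, 10),    # fixed; bounds the total number of runs
    schedule_interval="@daily",
    max_active_runs=1,                 # runs execute back to back, never overlapping
    catchup=True,
) as dag:
    DummyOperator(task_id="do_work")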
Do you mean that you want it to run every 2 hours continuously, but sometimes it will be running longer and you don't want it to overlap runs? Well, you definitely can schedule it to run every 2 hours (0 0/2 * * *) and set the max_active_runs to 1, so that if the prior run hasn't finished the next run will wait then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
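A small sketch of those two options, assuming Airflow 2.x import paths; the task names are placeholders, and the single-slot pool "temp_table_pool" is a made-up name that would need to be created beforehand in the UI or CLI:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from datetime import datetime

with DAG("every_two_hours", start_date=datetime(2021, 1, 1),
         schedule_interval="0 0/2 * * *") as dag:
    # depends_on_past: each run of this task waits for its own previous run to succeed.
    create_temp_table = DummyOperator(task_id="create_temp_table", depends_on_past=True)
    # Alternatively, a pool with a single slot serializes every task assigned to it
    # across concurrent DAG runs.
    drop_temp_table = DummyOperator(task_id="drop_temp_table", pool="temp_table_pool")
    create_temp_table >> drop_temp_table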
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. Also, you could create manual dag_runs for an unscheduled DAG, creating 10 at a time whenever you feel like it (using the UI or CLI instead of the API, but the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow, but by exploiting the meta-db we can cook up this functionality ourselves:
write a custom operator / PythonOperator
before running the actual computation, check whether 'n' runs of the task (TaskInstance table) already exist in the meta-db (refer to task_command.py for help)
and if they do, just skip the task (raise AirflowSkipException)
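A rough sketch of that check, assuming Airflow 2.x; MAX_RUNS and the callable name are made up for illustration, and the callable would be wired up as the python_callable of a PythonOperator:

from airflow.exceptions import AirflowSkipException
from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State

MAX_RUNS = 10  # the 'n' above

@provide_session
def skip_after_n_runs(session=None, **context):
    ti = context["ti"]
    # Count how many successful TaskInstances of this task exist in the meta-db.
    finished = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == ti.dag_id,
            TaskInstance.task_id == ti.task_id,
            TaskInstance.state == State.SUCCESS,
        )
        .count()
    )
    if finished >= MAX_RUNS:
        raise AirflowSkipException(f"Task already ran {finished} times; skipping")
    # ...actual computation goes here...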
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
Note
The downside of this approach is that it assumes the historical runs of the task (TaskInstances) are preserved forever (and correctly);
in practice, though, I've often found task_instances to be missing (we have catchup set to False).
Furthermore, on large Airflow deployments one might need to set up routine cleanup of the meta-db, which would make this approach impossible.

Apache Airflow "greedy" DAG execution

Situation:
We have a DAG (daily ETL process) with 20 tasks. Some tasks are independent and most have a dependency structure.
Problem:
When an independent task fails, Airflow stops the whole DAG execution and marks it as failed.
Question:
Would it be possible to force Airflow to keep executing the DAG as long as all dependencies are satisfied? This way one failed independent task would not block the whole execution of all the other streams.
It seems like such a trivial and fundamental problem that I was really surprised nobody else has an issue with that behaviour. (Maybe I'm just missing something.)
You can set the trigger rule for each individual operator; see the sketch after the list below.
All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value of trigger_rule is all_success, which can be described as "trigger this task when all directly upstream tasks have succeeded". All other rules described here are based on the direct parent tasks and are values that can be passed to any operator while creating tasks:
all_success: (default) all parents have succeeded
all_failed: all parents are in a failed or upstream_failed state
all_done: all parents are done with their execution
one_failed: fires as soon as at least one parent has failed, it does not wait for all parents to be done
one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
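For example, a minimal sketch (assuming Airflow 2.x import paths; the dag id and task names are placeholders) where the final task runs once both parents are done, regardless of whether the independent one failed:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
from datetime import datetime

with DAG("greedy_example", start_date=datetime(2021, 1, 1), schedule_interval="@daily") as dag:
    independent = DummyOperator(task_id="independent_task")
    main_etl = DummyOperator(task_id="main_etl")
    # all_done: runs once both parents are finished, whether they succeeded or failed.
    final = DummyOperator(task_id="final_report", trigger_rule=TriggerRule.ALL_DONE)
    [independent, main_etl] >> final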

Erlang supervisor with one critical child

We are in the process of re-organizing our application's supervision tree to make it handle failures and restarts more robustly. However, we have a scenario where one parent supervisor starts four child supervisors. The problem is that the first child supervisor starts several child gen_servers that must be started and initialized before the second child supervisor starts, or it will fail.
So, I need a startup like the following:
test_app.erl -> super_supervisor -> [config_supervisor, auth_supervisor, rest_supervisor]
The trick I'm having trouble with is that config_supervisor must complete all of its initialization before auth_supervisor or rest_supervisor is started. With the rest_for_one startup strategy I essentially get this behavior, but only by letting auth_supervisor fail because the needed config is not there. I would prefer to simply require that config_supervisor has completed its initialization (which includes starting several gen_servers) before moving on to auth_supervisor.
This seems like a common scenario that must have been conquered before, but I am having a hard time "googling" a solution. Does anybody have advice or sample code for handling this scenario?
Supervisors do a synchronous start of their children, starting each one in turn before starting the next in the order they occur in the childspeclist. So your super_supervisor should start its children in the right order, first config_supervisor, then auth_supervisor and finally rest_supervisor by having them in that order. A supervisor must (successfully) start all its children before it is considered to be started. So if config_supervisor has all the necessary processes which must be started during the initialization as its children then super_supervisor will not start the other supervisors until the config_supervisor is done.
In this case you would not need rest_for_one to ensure starting in the right order if the children are in the right order in the childspeclist.
Worker processes (gen_server/gen_fsm/gen_event) are considered started when their init callback returns.
Have I understood your description and question correctly?
You may try moving config_supervisor into its own application and setting that application as a requirement of the main one; in that case the config application will be started first, and then the main supervisor, auth_supervisor, etc. will start their initialisation.
Did you look at the rest_for_one restart strategy? It seems that it should be convenient in this case: the middle supervisor starts the gen_servers in a defined order, and last the leaf supervisor, which in turn starts the critical process.
