Airflow: Concurrency Depth first, rather than breadth first? - airflow

In airflow, the default configuration seems to be to queue up tasks, in parallel, across days--from one day to the next.
However, if I spin this process up across, say, two years, then the airflow dag will churn through preliminary processes first, across all days, rather than taking, say, 4 days forward from start to finish concurrently.
How do I toggle airflow to execute tasks according to a depth first paradigm rather than a breadth first paradigm?

I have come across a similar situation. I used the following trick to achieve that depth-first behaviour.
Assign all tasks of your DAG to a single pool (with limited number of slots like, say, 20-30)
Set weight_rule=upstream to all the above tasks
Explaination
The UPSTREAM weight_rule reverses prioritization of tasks based on their position across breadth of workflow, resulting in all downstream tasks to have a higher priority than upstream tasks.
This would ensure that whatever branches are launched will go onto completion before next branch is picked, thereby achieving that depth-first behaviour

Try to toggle with the parallelism and max_active_runs parameters in your airflow.cfg and the concurrency parameter at your DAGs.

Related

Airflow Deferrable Operator Pattern for Event-driven DAGs

I'm looking for examples of patterns in place for event-driven DAGs, specifically those with dependencies on other DAGs. Let's start with a simple example:
dag_a -> dag_b
dag_b depends on dag_a. I understand that at the end of dag_a I can add a trigger to launch dag_b. However, this philosophically feels misaligned from an abstraction standpoint: dag_a does not need to understand or know that dag_b exists, yet this pattern would enforce the responsibility of calling dag_b on dag_a.
Let's consider a slightly more complex example (pardon my poor ASCII drawing skills):
dag_a ------> dag_c
/
dag_b --/
In this case, if dag_c depends on both dag_a and dag_b. I understand that we could set up a sensor for the output of each dag_a and dag_b, but with the advent of deferrable operators, it doesn't seem that this remains a best practice. I suppose I'm wondering how to set up a DAG of DAGs in an async fashion.
The potential for deferrable operators for event-driven DAGs is introduced in Astronomer's guide here: https://www.astronomer.io/guides/deferrable-operators, but it's unclear how it would be best applied these in light of the above examples.
More concretely, I'm envisioning a use case where multiple DAGs run every day (so they share the same run date), and the output of each DAG is a date partition in a table somewhere. Downstream DAGs consume the partitions of the upstream tables, so we want to schedule them such that downstream DAGs don't attempt to run before the upstream ones complete.
Right now I'm using a "fail fast and often" approach in downstream dags, where they start running at the scheduled date, but first check if the data they need exists upstream, and if not the task fails. I have these tasks set to retry every x interval, with high number of retries (e.g. retry every hour for 24 hours, if it's still not there then something is wrong and the DAG fails). This is fine since 1) it works for the most part and 2) I don't believe the failed tasks continue to occupy a worker slot between retries, so it actually is somewhat async (I could be wrong). It's just a little crude, so I'm imagining there is a better way.
Any tactical advice for how to set this relationship up to be more event driven while still benefitting from the async nature of deferrable operators is welcome.
we are using an event bus to connect the DAGs, the end task of a DAG will send the event out, and followed DAG will be triggered in the orchestrator according to the event type.
Starting from Airflow 2.4 you can use data-aware scheduling. It would look like this:
dag_a_finished = Dataset("dag_a_finished")
with DAG(dag_id="dag_a", ...):
# Last task in dag_a
BashOperator(task_id="final", outlets=[dag_a_finished], ...)
with DAG(dag_id="dag_b", schedule=[dag_a_finished], ...):
...
with DAG(dag_id="dag_c", schedule=[dag_a_finished], ...):
...
In theory Dataset should represent some piece of data, but technically nothing prevents you from making it just a string identifier used for setting up DAG dependency - just like in above example.

Airflow tasks not running in correct order?

cleanup_pbp is downstream of all 4 of load_pbp_30629, load_pbp_30630, load_to_bq_30629, load_to_bq_30630. cleanup_pbp started at 2021-12-05T08:54:48.
however, load_pbp_30630, one of the 4 upstream tasks, did not end until 2021-12-05T09:02:23.
How is cleanup_pbp running before load_pbp_30630 ends? I've never seen this before. I know our task dependencies have a bit of criss-cross going on, but that shouldn't explain why the tasks run out of order?
We have exactly the same problem and after checking, finally we found out that the problem is caused by the loop function in Dag script.
Actually we use a for loop to create tasks as well as their relationships. As in each iteration it will create the upstream task and the downstream task (like the cleanup_pbp in your case) and always give the same id to downstream one and define its relations (ex. Load_pbp_xxx >> cleanup_pbp), in graph or tree view it considers this downstream id has serval dependences but when Dag runs it takes only the relation defined in the last iteration. If you check Task Instance you will see only one task in its upstream_list.
The solution is to move the definition of this final task out of the loop but keep the definition of dependencies inside the loop. This well resolves our problem of running order, the final task won’t start until all dependences are finished. (with tigger rule all_success of course)
I’m not sure if you are in the same situation as us but hope this will give you some idea.

How to run Airflow DAG for specific number of times?

How to run airflow dag for specified number of times?
I tried using TriggerDagRunOperator, This operators works for me.
In callable function we can check states and decide to continue or not.
However the current count and states needs to be maintained.
Using above approach I am able to repeat DAG 'run'.
Need expert opinion, Is there is any other profound way to run Airflow DAG for X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times and a run takes 2 hours and you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, or why it must be 10 runs, nor why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically), and the end_date to today (also fixed) and then add a daily schedule_interval and a max_active_runs of 1 and you'll get exactly 10 runs and it'll run them back to back without overlapping while changing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None scheduled DAG and a range of execution datetimes.
Do you mean that you want it to run every 2 hours continuously, but sometimes it will be running longer and you don't want it to overlap runs? Well, you definitely can schedule it to run every 2 hours (0 0/2 * * *) and set the max_active_runs to 1, so that if the prior run hasn't finished the next run will wait then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. Also you could create manual dag_runs for an unscheduled DAG; creating 10 at a time when you feel like (using the UI or CLI instead of the API, but the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow
But by exploiting the meta-db, we can cook-up this functionality ourselves
we can write a custom-operator / python operator
before running the actual computation, check if 'n' runs for the task (TaskInstance table) already exist in meta-db. (Refer to task_command.py for help)
and if they do, just skip the task (raise AirflowSkipException, reference)
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
Note
The downside of this approach is that it assumes historical runs of task (TaskInstances) would forever be preserved (and correctly)
in practise though, I've often found task_instances to be missing (we have catchup set to False)
furthermore, on large Airflow deployments, one might need to setup routinal cleanup of meta-db, which would make this approach impossible

Intel MPI distributed memory: building a wall out of M*N blocks using q<M processors

Imagine I have M independent jobs, each job has N steps. Jobs are independent from each other but steps of each job should be serial. In other words J(i,j) should be started only after J(i,j-1) is finished (i indicates the job index and j indicates the step). This is isomorphic to building a wall with width of M and hight of N blocks.
Each block of job should be executed only once. The time that it takes to do one block of work using one CPU (also the same order) is different for different blocks and is not known in advance.
The simple way of doing this using MPI is to assign blocks of work to processors and wait until all of them finish their blocks before the next assignment. This way we can make ensure that priorities are enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean when a processor finishes its job, using some kind of environmental variables or shared memory, could decide which block of job it should do next, without waiting for other processors to finish their jobs and make a collective decision using communications.
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current job is allowed to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.

How many mappers/reducers should be set when configuring Hadoop cluster?

When configuring a Hadoop Cluster whats the scientific method to set the number of mappers/reducers for the cluster?
There is no formula. It depends on how many cores and how much memory do you have. The number of mapper + number of reducer should not exceed the number of cores in general. Keep in mind that the machine is also running Task Tracker and Data Node daemons. One of the general suggestion is more mappers than reducers. If I were you, I would run one of my typical jobs with reasonable amount of data to try it out.
Quoting from "Hadoop The Definite Guide, 3rd edition", page 306
Because MapReduce jobs are normally
I/O-bound, it makes sense to have more tasks than processors to get better
utilization.
The amount of oversubscription depends on the CPU utilization of jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
A processor in the quote above is equivalent to one logical core.
But this is just in theory, and most likely each use case is different than another, some tests need to be performed. But this number can be a good start to test with.
Probably, you should also look at reducer lazy loading, which allows reducers to start later when required, so basically, number of maps slots can be increased. Don't have much idea on this though but, seems useful.
Taken from Hadoop Gyan-My blog:
No. of mappers is decided in accordance with the data locality principle as described earlier. Data Locality principle : Hadoop tries its best to run map tasks on nodes where the data is present locally to optimize on the network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to that map task available on a single node.Since HDFS only guarantees data having size equal to its block size (64M) to be present on one node, it is advised/advocated to have the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64M of data per mapper. If we see some mappers running for a very small period of time, try to bring down the number of mappers and make them run longer for a minute or so.
No. of reducers should be slightly less than the number of reduce slots in the cluster (the concept of slots comes in with a pre-configuration in the job/task tracker properties while configuring the cluster) so that all the reducers finish in one wave and make full utilisation of the cluster resources.

Resources