Airflow Task Priority

I have 6 tasks (t1, t2, t3, t4, t5 and t6) that should all run at the same time.
These tasks are wired as t1 >> t2 >> t3 >> t4 >> t5 >> t6.
When t3 has issues at some point, t4 is not executed; in general, if any task has issues, the later tasks are not executed.
Can someone please let me know how I can avoid this problem? I do not want the other tasks to wait for an earlier task to finish. If an earlier task does not finish within 5 minutes, it should be skipped.

The way you've wired your workflow doesn't suit your requirements.
Tasks that don't really depend on one another shouldn't be wired together. I've seen some people do it just to limit the load on an external system; but for such a use case, one should employ Pools instead.
If you want a task to time out after a given interval (so that downstream tasks can continue), you can use the execution_timeout argument of BaseOperator.
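For illustration, here is a minimal sketch of that combination (DAG id, schedule and commands are placeholders; Airflow 2.x import paths assumed): execution_timeout caps each task at 5 minutes, and trigger_rule=ALL_DONE lets the chain continue even when an upstream task fails or times out.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="timeout_chain_sketch",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    tasks = [
        BashOperator(
            task_id=f"t{i}",
            bash_command="sleep 10",  # placeholder work
            execution_timeout=timedelta(minutes=5),  # kill the task after 5 minutes
            trigger_rule=TriggerRule.ALL_DONE,  # run even if the upstream task failed or timed out
        )
        for i in range(1, 7)
    ]

    # t1 >> t2 >> t3 >> t4 >> t5 >> t6
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream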

Related

airflow concurrency per dag run

I am trying to implement airflow and, per dag run, I need to run a few tasks concurrently. The problem is that I can find a setting for how many tasks can run concurrently per dag, but not per dag run.
t1 >> t2 >> [t3, t4, t5, t6] >> t7
e.g. I have this dag and I run it three times in parallel.
What I want is for every dag run to have its own concurrent task execution limit, not a per-dag limit.
Any help is appreciated. Thanks.
Maybe one of these three configuration options in airflow.cfg would be helpful for your issue:
Try parallelism, max_active_tasks_per_dag, or max_active_runs_per_dag:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html?highlight=parallelism#parallelism
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html?highlight=parallelism#max-active-tasks-per-dag
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html?highlight=parallelism#max-active-runs-per-dag
If that doesn't work, try reading this detailed post about concurrency configuration:
https://www.astronomer.io/guides/airflow-scaling-workers/
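For reference, the per-DAG counterparts of those options can also be set directly on the DAG object; a rough sketch (parameter names assume Airflow 2.2+, where concurrency was renamed to max_active_tasks; ids and commands are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="per_dag_limits_sketch",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_tasks=8,   # max task instances of this DAG running at once (all runs combined)
    max_active_runs=3,    # max concurrent runs of this DAG
) as dag:
    t1 = BashOperator(task_id="t1", bash_command="echo t1")
    t2 = BashOperator(task_id="t2", bash_command="echo t2")
    middle = [BashOperator(task_id=f"t{i}", bash_command=f"echo t{i}") for i in range(3, 7)]
    t7 = BashOperator(task_id="t7", bash_command="echo t7")

    t1 >> t2 >> middle >> t7

Note that max_active_tasks (like max_active_tasks_per_dag) counts task instances across all active runs of the DAG, not per individual dag run.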

Airflow Scheduler handling queueing of dags

I have the following airflow setup
Executor : KubernetesExecutor
airflow version : 2.1.3
airflow config : parallelism = 256
I have the below scenario
I have a number of dags (e.g. 10) which depend on the success state of a task from another dag. The tasks kept failing, with retries enabled for 6 attempts.
All the dependent dags run hourly and, as a result, they were put into the queued state by the scheduler. I could see around 800 dag runs in the queue and nothing was running, so I ended up manually changing their state to failed.
Below are my questions from this event.
Is there a limit on the number of dags that can run concurrently in an airflow setup?
Is there a limit on how many dags can be enqueued?
When dags are queued, how does the scheduler decide which one to pick? Is it based on queued time?
Is it possible to set up priority among the queued dags?
How does airflow 2.1.3 treat tasks in the queue? Are they counted against the max_active_runs parameter?

Limiting concurrency for a single task across DAG instances

I have a DAG (a >> b >> c >> d). This DAG can have up to 100 instances running at a time. It is fine for tasks a, b, and d to run concurrently; however, I would like only one dag_run to run task c at a time. How do I do this? Thanks!
You could try using Pools.
Pools are a classic way of limiting task execution in Airflow. You can assign individual tasks to a specific pool and control how many TaskInstances of that task are running concurrently. In your case, you could create a pool with a single slot, assign task c to this pool, and Airflow should only have one instance of that task running at any given time.
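A rough sketch of that setup, assuming a pool named single_slot_pool has already been created with 1 slot (via Admin -> Pools in the UI, or `airflow pools set single_slot_pool 1 "serialize task c"`); all ids and commands are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="single_slot_pool_sketch",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
    max_active_runs=100,  # many runs of this DAG may be active at once
) as dag:
    a = BashOperator(task_id="a", bash_command="echo a")
    b = BashOperator(task_id="b", bash_command="echo b")
    c = BashOperator(
        task_id="c",
        bash_command="echo c",
        pool="single_slot_pool",  # 1 slot, so only one "c" runs across all dag_runs
    )
    d = BashOperator(task_id="d", bash_command="echo d")

    a >> b >> c >> d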

Tasks in airflow not running in order

I'm using airflow to automate some machine learning models. Everything runs successfully, but I have issues with the order of the tasks.
I have 7 tasks running in parallel, and the last two tasks must start when those 7 tasks finish.
As soon as 6 of the tasks finish, the last two start without waiting for the 7th task to finish.
Here's an image of what's happening.
It appears that the trigger_rule of your creation_order_cell_task is incorrect (for the desired behaviour).
To get the behaviour you want, it should be either ALL_SUCCESS (the default) or ALL_DONE.
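A minimal sketch of the intended wiring (placeholder ids and commands, only creation_order_cell_task taken from the question): with the default ALL_SUCCESS trigger rule, the downstream task waits for all seven upstream tasks.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="trigger_rule_sketch",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # the seven model tasks that run in parallel (placeholders)
    model_tasks = [BashOperator(task_id=f"model_{i}", bash_command="echo model") for i in range(7)]

    creation_order_cell_task = BashOperator(
        task_id="creation_order_cell_task",
        bash_command="echo downstream",
        trigger_rule=TriggerRule.ALL_SUCCESS,  # the default; ALL_DONE also waits for all upstreams
    )

    model_tasks >> creation_order_cell_task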

Control airflow task parallelism per dag for specific tasks?

Is there a way to control the parallelism for particular tasks in an airflow dag? E.g., say I have a dag definition like...
for dataset in list_of_datasets:
    # some simple operation
    task_1 = BashOperator(task_id=f'task_1_{dataset.name}', ...)
    # load intensive operation
    task_2 = BashOperator(task_id=f'task_2_{dataset.name}', ...)
    # another simple operation
    task_3 = BashOperator(task_id=f'task_3_{dataset.name}', ...)
    task_1 >> task_2 >> task_3
Is there a way to have something where task_1 can have, say, 5 of its kind running in a dag instance, while only 2 instances of task_2 may be running in a dag instance (also implying that if there are 2 instances of task_2 already running, then only 3 instances of task_1 can run)? Any other common ways to work around this kind of requirement (I imagine this must come up often for pipelines)?
From discussions on the apache airflow email list...
You can use pools (https://airflow.apache.org/docs/stable/concepts.html#pools) to limit how many tasks can run in parallel for a given pool.
So you can create named "pools" with task count limits (via the webserver admin menu) and then assign tasks to those pools in the dag definition file when they are created:
# Airflow 2.x import paths (assumption; older versions use airflow.operators.bash_operator)
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

for dataset in datasets:
    high_load_task = BashOperator(
        task_id='high_load_task_%s' % dataset["id"],
        bash_command='some_command',
        pool='example_pool',
        trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag)
Also...
There is another way to limit the parallelism of tasks, applicable for example when you have different kinds of machines with different capabilities (for example with/without GPU). You can define affinities between tasks and the machines that execute them: with the Celery executor you can define queues in your celery configuration and assign each task to one of the queues. Then you can have a number of workers/slots defined for all machines in a queue and, as a result, you can also limit the parallelism of tasks this way: https://airflow.apache.org/docs/stable/concepts.html#queues
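As a rough illustration of the queue approach (CeleryExecutor assumed; the queue name, task id and command are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="queue_sketch",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Only Celery workers started with `airflow celery worker --queues heavy_queue`
    # pick up this task, so their slot count caps its parallelism.
    heavy_task = BashOperator(
        task_id="heavy_task",
        bash_command="python heavy_job.py",  # placeholder command
        queue="heavy_queue",
    )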
