Control airflow task parallelism per dag for specific tasks?

Is there a way to control the parallelism for particular tasks in an airflow dag? Eg. say I have a dag definition like...
for dataset in list_of_datasets:
    # some simple operation
    task_1 = BashOperator(task_id=f'task_1_{dataset.name}', ...)
    # load intensive operation
    task_2 = BashOperator(task_id=f'task_2_{dataset.name}', ...)
    # another simple operation
    task_3 = BashOperator(task_id=f'task_3_{dataset.name}', ...)
    task_1 >> task_2 >> task_3
Is there a way to have, say, 5 instances of task_1 running within a dag run while only 2 instances of task_2 may be running in that dag run (also implying that if there are 2 instances of task_2 already running, then only 3 instances of task_1 can run)? Are there other common ways to work around this kind of requirement? I imagine this must come up often for pipelines.

From discussions on the apache airflow email list...
You can use pools (https://airflow.apache.org/docs/stable/concepts.html#pools) to limit how many tasks can run in parallel for a given pool.
So you can create named "pools" with task-count limits (via the webserver admin menu) and then assign tasks to those pools in the dag definition file when they are created:
for dataset in datasets:
    high_load_task = BashOperator(
        task_id='high_load_task_%s' % dataset["id"],
        bash_command='some_command',
        pool='example_pool',
        trigger_rule=TriggerRule.ALL_SUCCESS,
        dag=dag)
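Pools can also be created from the command line instead of the web UI; the exact subcommand differs between Airflow versions, so the following is only a sketch (pool name and slot count match the example above):
# Airflow 2.x
airflow pools set example_pool 2 "limit heavy load tasks"
# Airflow 1.10.x
airflow pool --set example_pool 2 "limit heavy load tasks"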
Also...
There is another way to limit parallelism of tasks, applicable for example when you have different kinds of machines with different capabilities (for example with/without GPU). You can define affinities between tasks and the machines that execute them: with the Celery executor you can define queues in your celery configuration and assign each task to one of those queues. You can then set the number of workers/slots for all machines in a queue, and as an effect you also limit the parallelism of those tasks: https://airflow.apache.org/docs/stable/concepts.html#queues
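A rough sketch of that approach (queue name and concurrency value are illustrative, using the pre-2.0 CLI; on Airflow 2.x the command is airflow celery worker):
# in the DAG file: route the load-intensive task to a dedicated queue
heavy_task = BashOperator(
    task_id='heavy_task',
    bash_command='some_command',
    queue='gpu_queue',
    dag=dag)

# on the machine(s) meant to run such tasks, start a worker bound to that
# queue with only a couple of slots:
#   airflow worker --queues gpu_queue --concurrency 2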

Related

Airflow different parallelism for different types of task

We have certain tasks which require a huge amount of resources and can't be run with high parallelism, and many other smaller tasks which can run at a parallelism of 32.
I am aware of the parallelism config:
The amount of parallelism as a setting to the executor. This defines the max number of task instances that should run simultaneously on this airflow installation
parallelism = 32
Is there a way to tag tasks and set different levels of parallelism for different tasks at the entire Airflow level?
For example, having smaller tasks run at the default parallelism [32] but heavy tasks at a much lower parallelism [1-4].
Pools (docs: https://airflow.apache.org/docs/apache-airflow/stable/concepts/pools.html) serve exactly this purpose: to limit the parallelism for a specific set of tasks.
You can create pools with your desired # of "slots" in the Airflow UI, and assign the pool to your task:
my_task = BashOperator(
    ...,
    pool="heavy_task_pool",
    ...,
)
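If one heavy task should count for more than one slot of its pool, recent Airflow versions also accept a pool_slots argument on operators; a minimal sketch (pool name and values are illustrative):
heavy_task = BashOperator(
    task_id='heavy_task',
    bash_command='some_command',
    pool='heavy_task_pool',  # pool created in the UI with e.g. 4 slots
    pool_slots=2,            # this task instance occupies 2 of those slots
    dag=dag,
)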

Run parallel tasks in Apache Airflow

I am able to configure the airflow.cfg file to run tasks one after the other.
What I want to do is execute tasks in parallel, e.g. 2 at a time, until I reach the end of the list.
How can I configure this?
Executing tasks in Airflow in parallel depends on which executor you're using, e.g., SequentialExecutor, LocalExecutor, CeleryExecutor, etc.
For a simple setup, you can achieve parallelism by just setting your executor to LocalExecutor in your airflow.cfg:
[core]
executor = LocalExecutor
Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L76
This will spin up a separate process for each task.
(Of course you'll need to have a DAG with at least 2 tasks that can execute in parallel to see it work.)
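Such a DAG could be as small as the following sketch (names are illustrative; on Airflow 2.x the import is airflow.operators.bash):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(dag_id='parallel_demo',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    # no dependency between these two tasks, so the LocalExecutor
    # can run them at the same time
    task_a = BashOperator(task_id='task_a', bash_command='sleep 30')
    task_b = BashOperator(task_id='task_b', bash_command='sleep 30')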
Alternatively, with CeleryExecutor, you can spin up any number of workers by just running (as many times as you want):
$ airflow worker
The tasks will go into a Celery queue and each Celery worker will pull off of the queue.
You might find the section Scaling out with Celery in the Airflow Configuration docs helpful.
https://airflow.apache.org/howto/executor/use-celery.html
For any executor, you may want to tweak the core settings that control parallelism once you have that running.
They're all found under [core]. These are the defaults:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L99

How can Airflow be used to run distinct tasks of one workflow in separate machines?

Disclaimer: I'm not (yet) a user of Airflow, I just found out about it today and I'm starting to explore whether it may fit my use cases.
I have one data processing workflow that is a sequential (not parallel) execution of multiple tasks. However, some of the tasks need to run on specific machines. Can Airflow manage this? What would be the advised implementation model for this use case?
Thanks.
Yes, you can achieve this in Airflow with queues. You can tie tasks to a specific queue. Then for each worker on a machine, you can set it to only pick up tasks from select queues.
In code, it would look like this:
task_1 = BashOperator(
    dag=dag,
    task_id='task_a',
    ...
)
task_2 = PythonOperator(
    dag=dag,
    task_id='task_b',
    queue='special',
    ...
)
Note that there is this setting in airflow.cfg:
# Default queue that tasks get assigned to and that worker listen on.
default_queue = default
So if you started your workers with this:
Server A> airflow worker
Server B> airflow worker --queues special
Server C> airflow worker --queues default,special
Then task_1 can be picked up by servers A+C and task_2 can be picked up by servers B+C.
If you're running Airflow in Docker, you should do the following:
Set the queue name in the DAG file:
with DAG(dag_id='dag_v1',
         default_args={
             'retries': 1,
             'retry_delay': timedelta(seconds=30),
             'queue': 'server-1',
             ...
         },
         schedule_interval=None,
         tags=['my_dags']) as dag:
    ...
Set the default queue in the docker-compose.yml file:
AIRFLOW__OPERATORS__DEFAULT_QUEUE: 'server-1'
Restart the Airflow Webserver, Scheduler etc.
Note: You have to do this for each worker, but I assume that you have 1 worker per machine. This means that each machine needs a different AIRFLOW__OPERATORS__DEFAULT_QUEUE name, and the corresponding DAGs you want to run on that machine need to use the same name for their queue (then you can indeed use ${HOSTNAME} as the name).

Running a particular task on airflow master node

I have a dag with a list of tasks that are run using the celery executor on different worker nodes. However I would like to run one of the tasks on the master node. Is that possible?
Yes it is possible. You are able to set specific tasks to listen to specific queues in Celery. The airflow documentation covers it quite nicely but the gist of it is:
Set the queue attribute on the operator representing the task you want to run on a specific node to a value different from the celery -> default_queue value in airflow.cfg.
Run the worker process on your master node and specify the queue it needs to listen on: airflow worker -q queue_name. If you want your worker to listen to multiple queues, you can use a comma-delimited list: airflow worker -q default_queue,queue_name
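A minimal sketch of both pieces (the queue name is illustrative):
# in the DAG file: pin the task to a queue that only the master node listens on
master_task = BashOperator(
    task_id='master_task',
    bash_command='some_command',
    queue='master_node',
    dag=dag)

# on the master node, start a worker that listens only to that queue:
#   airflow worker -q master_node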

Airflow parallelism

The LocalExecutor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates? I need to change it. I also need to know the difference between the scheduler's "max_threads" and "parallelism" in airflow.cfg.
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name, based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more up-to-date docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to
schedule dags. This is controlled by max_threads with a default value of
2. Users should increase this value to a larger value (e.g. the number of CPUs where the scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion, not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. The max_threads cannot exceed the cpu count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use python's multiprocessing library for parallelism.
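Putting the pieces together, an airflow.cfg that caps overall task parallelism, limits per-DAG concurrency, and raises the scheduler threads might look like this (values are illustrative; option names follow the pre-2.0 layout quoted above):
[core]
# max task instances running at once across the whole installation
parallelism = 32
# max task instances running at once within a single DAG
dag_concurrency = 16

[scheduler]
# number of scheduler processes used to schedule DAGs
max_threads = 4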
