How to manage Airflow pool slots?

I have the default pool with 128 slots.
Now I have defined some other pools, one for each business_unit. A business_unit is a department, so the important data (priority 1) uses the default pool, while the priority 2 data has a pool per business_unit.
As I have 4 business_units, I have 5 pools:
1. default --> 128 slots
2. business_unit_A --> 8 slots
3. business_unit_B --> 8 slots
4. business_unit_C --> 8 slots
5. business_unit_D --> 8 slots
Here I have a doubt about how to manage the default pool. Since I created 4 new pools with 8 slots each, I am using a total of 32 slots out of the default. Should I redefine the default pool as 96 slots?
Is the total number of available slots 128, so that I have to treat it as 100% of the "available resources"? Or can I add new pools with extra slots and Airflow manages it behind the scenes?
Which approach is recommended?
Does a task use just 1 slot by default? If I increase it for a large task, should the execution be faster? (Does it relate to host resources?)

Pools are a way to control/limit the resources consumed by your Airflow tasks. There is no limit on the number of pool slots; you can set it to 99999 if you like. You'll have to estimate whether your hardware provides enough resources at peak moments, given the number of running tasks.
By default, each task consumes one pool slot. There is, however, a pool_slots argument on the BaseOperator to claim more than one slot:
BashOperator(
    task_id="large_task",
    ...,
    pool_slots=5,
)
Docs: https://airflow.apache.org/docs/apache-airflow/stable/concepts/pools.html#using-multiple-pool-slots
Note: there are more settings in Airflow controlling/limiting the number of parallel tasks, see https://www.astronomer.io/guides/airflow-scaling-workers.
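To connect this to the pools in the question: pools are independent of one another, so creating the business_unit pools does not subtract slots from the default pool; the global ceiling on concurrently running task instances is the parallelism setting among the configs mentioned above. Here is a minimal sketch, assuming the business_unit_A pool has already been created (in the UI or via the airflow pools CLI) and using made-up task ids and commands:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="pool_example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    # Priority 1 work: no pool argument, so the task runs in the default pool (128 slots).
    prio1_task = BashOperator(
        task_id="prio1_task",
        bash_command="echo 'priority 1 work'",
    )

    # Priority 2 work: explicitly assigned to the business unit pool (8 slots).
    # pool_slots=2 makes this single task occupy 2 of those 8 slots while it runs.
    prio2_task = BashOperator(
        task_id="prio2_task",
        bash_command="echo 'business unit A work'",
        pool="business_unit_A",
        pool_slots=2,
    )

    prio1_task >> prio2_task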

Related

How to define the number of slots in an Airflow pool

I'm using Airflow through Cloud Composer (image: composer-2.0.29-airflow-2.3.3). I have defined 5 DAGs that run concurrently, with at most 22 tasks running concurrently, distributed among the 5 DAGs. These DAGs are in the default pool, with the default number of slots set to 128.
My composer instance has:
1 Scheduler: 0.5 vCPUs, 1.875 GB memory, 1 GB storage
Worker: 0.5 vCPUs, 1.875 GB memory, 1 GB storage
Autoscaling worker: from 1 to 3.
I would like to create different pools to separate my 5 systems. How do I define the number of slots in each pool? Suppose a pool has 1 DAG with 10 tasks (with 5 of the 10 tasks running concurrently). How many slots should I assign to each task?
DAG example:
task1.x is the ingestion of a JDBC table, while task2.x is the update of the corresponding BigQuery table.
Thank you all!
Airflow pools are designed to avoid overwhelming external systems used by a group of tasks. For example, if you have tasks in different DAGs which use a machine learning model API, an RDBMS, an API with quotas or any other system with limited scaling, you can use an Airflow pool to limit the number of parallel tasks which interact with that system.
In your case you have two systems, the JDBC database and BigQuery. You need to create just two pools, jdbc_pool and bigquery_pool, assign all the tasks (from all the DAGs) which interact with the JDBC table to the first, and assign all the tasks which interact with BigQuery to the second. You can define the slots for each pool based on the performance of each system and the computational weight of each task.
If you have a monitoring tool (Prometheus, Datadog, ...), you can run one of the tasks and watch the resource usage on your database. Let's assume it uses 10% of the resources; in that case you can create a pool with 8 slots so that at most 80% of the resources are in use (you should avoid aiming for 100%, to keep headroom for unexpected load). Then, for the pool slots of each task:
if all the tasks are similar, you can use pool_slots=1 for all of them: at most 8 parallel tasks, using about 80% of the resources;
if some tasks are heavier than the one you tested (they use more than 10% of the DB resources), give them a higher pool_slots value based on their resource usage: say one task consumes 20% of the resources, then use pool_slots=2 for that task only and keep 1 for the others. You can then have 8 parallel simple tasks, or 6 simple tasks plus the heavy one, staying around 80% resource usage in both cases (see the sketch below).
For bigquery_pool, you need to check the quotas, but I think you can use a high value without any problem, since BigQuery is a very scalable serverless data warehouse.
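Putting that together, here is a minimal sketch under the assumptions above (the pools jdbc_pool and bigquery_pool must already exist with the slot counts you chose; the DAG id, task names, callables and slot values are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_jdbc_table():
    ...  # placeholder for the JDBC ingestion logic


def update_bigquery_table():
    ...  # placeholder for the BigQuery update logic


with DAG(dag_id="system_a", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    # Light ingestion task: assumed to use ~10% of the DB resources, so it takes 1 of the 8 slots.
    ingest_small = PythonOperator(
        task_id="ingest_small_table",
        python_callable=ingest_jdbc_table,
        pool="jdbc_pool",
        pool_slots=1,
    )

    # Heavy ingestion task: assumed to use ~20% of the DB resources, so it claims 2 slots.
    ingest_large = PythonOperator(
        task_id="ingest_large_table",
        python_callable=ingest_jdbc_table,
        pool="jdbc_pool",
        pool_slots=2,
    )

    # BigQuery scales well, so its pool can have a generous slot count.
    update_bq = PythonOperator(
        task_id="update_bigquery",
        python_callable=update_bigquery_table,
        pool="bigquery_pool",
    )

    [ingest_small, ingest_large] >> update_bq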
If you just want to limit the number of tasks executed on each worker, to avoid OOM problems for example, you can set the worker concurrency config.
And if you want to limit the number of tasks executed across the whole Airflow installation, you can set the parallelism config.

Airflow parallelism based on operator type

Does Airflow support throttling or a parallelism limit by operator type?
I want to limit the number of Spark submits across different DAGs, but without ending up limiting parallelism for everything else.
You want to use an Airflow pool. See for details: https://airflow.apache.org/docs/apache-airflow/stable/concepts.html#pools.
In short, what a pool does is limit the number of available slots for tasks to run in. In your case, you can subclass the operators in question and give them a default pool parameter pointing at a particular pool, so that developers are nudged in the correct direction.
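As a rough sketch of that subclassing idea (the import path is the Spark provider's SparkSubmitOperator, and the pool name spark_pool is an assumption; the pool itself still has to be created with the slot limit you want):

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


class ThrottledSparkSubmitOperator(SparkSubmitOperator):
    """A SparkSubmitOperator that defaults to a shared pool, so all Spark submits are throttled together."""

    def __init__(self, pool: str = "spark_pool", **kwargs):
        # Every instance lands in "spark_pool" unless the caller overrides it explicitly;
        # the pool's slot count (created in the UI or CLI) caps concurrent Spark submits.
        super().__init__(pool=pool, **kwargs)

Developers then use ThrottledSparkSubmitOperator instead of SparkSubmitOperator and automatically share that pool's slot budget, while tasks from other operators remain unaffected.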

Test an Apache Airflow DAG while it is already scheduled and running?

I ran the following test command:
airflow test events {task_name_redacted} 2018-12-12
...and got the following output:
Dependencies not met for <TaskInstance: events.{redacted} 2018-12-12T00:00:00+00:00 [None]>, dependency 'Task Instance Slots Available' FAILED: The maximum number of running tasks (16) for this task's DAG 'events' has been reached.
[2019-01-17 19:47:48,978] {models.py:1556} WARNING -
--------------------------------------------------------------------------------
FIXME: Rescheduling due to concurrency limits reached at task runtime. Attempt 1 of 6. State set to NONE.
--------------------------------------------------------------------------------
[2019-01-17 19:47:48,978] {models.py:1559} INFO - Queuing into pool None
My Airflow is configured with a maximum concurrency of 16. Does this mean that I cannot test a task while the DAG is currently running and has used all of its task slots?
Also, it was a little unclear from the docs, but does airflow test actually execute the task? For instance, if it were a SparkSubmitOperator, would it actually submit the job?
While I am yet to reach the phase of deployment where concurrency will matter, the docs do give a fairly good indication of the problem at hand.
Since at any point in time just one scheduler is running (and you shouldn't be running multiple anyway), it appears that irrespective of whether the DAG runs are live runs or test runs, this limit applies to them collectively. So that is certainly a hurdle.
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
But beware that merely increasing this number is not enough (assuming you have big-enough boxes for hefty workers / multiple workers); several other configurations will have to be tweaked as well to achieve the kind of parallelism I sense you want.
They are all listed under the [core] section:
# The amount of parallelism as a setting to the executor. This
# defines the max number of task instances that should run
# simultaneously on this airflow installation
parallelism = 32

# When not using pools, tasks are run in the "default pool", whose
# size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
But we are still not there, because once you spawn so many tasks simultaneously, the backend metadata DB will start choking. While this is likely a minor problem (and might not affect you unless you have some really huge DAGs / a very large number of Variable interactions in your tasks), it's still worth noting as a potential roadblock.
# The SqlAlchemy pool size is the maximum number of database
# connections in the pool. 0 indicates no limit.
sql_alchemy_pool_size = 5

# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does not
# apply to sqlite. If the number of DB connections is ever exceeded, a
# lower config value will allow the system to recover faster.
sql_alchemy_pool_recycle = 1800

# How many seconds to retry re-establishing a DB connection after
# disconnects. Setting this to 0 disables retries.
sql_alchemy_reconnect_timeout = 300
Needless to say, all this is pretty much futile unless you pick the right executor; SequentialExecutor, in particular, is only intended for testing.
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, KubernetesExecutor
executor = SequentialExecutor
But then, params to BaseOperator like depends_on_past and wait_for_downstream are there to spoil the party as well.
Finally I leave you with this link related to Airflow + Spark combination: How to submit Spark jobs to EMR cluster from Airflow?
(Pardon me if the answer confused you more than you already were, but..)

Apache Airflow pools: used slots > available slots

I have 6 subdags. Each of them contains a task with pool='crawler' that requires a lot of resources, so I created a pool named crawler with only 1 slot.
When I run the DAG, it seems that the pool restriction is bypassed and all six tasks are executed at the same time (as you can see from the screenshot).
How can I force used slots to be <= available slots?
From the source code:
Airflow pool is not honored by SubDagOperator. Hence resources could be consumed by SubdagOperators.
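A hedged workaround, not from the original answer: if you are on Airflow 2.x, consider replacing the subdags with TaskGroups (SubDagOperator is deprecated in their favor). Tasks inside a TaskGroup are ordinary tasks of the parent DAG, so a 1-slot crawler pool is enforced as expected. A minimal sketch with made-up DAG and task ids:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="crawler_dag", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    for i in range(6):
        with TaskGroup(group_id=f"unit_{i}"):
            BashOperator(
                task_id="crawl",
                bash_command="echo 'heavy crawl'",
                pool="crawler",  # a regular task instance, so the 1-slot pool limit is honored
            )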

Spreading a job over different nodes of a cluster in sun grid engine (SGE)

I'm trying to get Sun Grid Engine (SGE) to run the separate processes of an MPI job over all of the nodes of my cluster.
What is happening is that each node has 12 processors, so SGE is assigning my 60 processes as 12 apiece to 5 separate nodes.
I'd like it to assign 2 processes to each of the 30 available nodes, because with 12 processes (DNA sequence alignments) running on each node, the nodes run out of memory.
So I'm wondering if it's possible to explicitly get SGE to assign the processes to a given node?
Thanks,
Paul.
Check out "allocation_rule" in the configuration for the parallel environment; either with that or then by specifying $pe_slots for allocation_rule and then using the -pe option to qsub you should be able to do what you ask for above.
You can do it by creating a queue in which you define that only 2 of the 12 processors on each node are used.
You can see the configuration of the current queue with the command
qconf -sq queuename
You will see something like the following in the queue configuration. This queue is configured so that it uses only 5 execution hosts with 4 slots (processors) each.
....
slots 1,[master=4],[slave1=4],[slave2=4],[slave3=4],[slave4=4]
....
Use the following command to change the queue configuration:
qconf -mq queuename
then change each of those 4s into 2s.
From an admin host, run "qconf -msconf" to edit the scheduler configuration. It will bring up a list of configuration options in an editor. Look for the one called "load_formula". Set the value to "-slots" (without the quotes).
This tells the scheduler that a machine is least loaded when it has the fewest slots in use. If your exec hosts each have a similar number of slots, you will get an even distribution. If some exec hosts have more slots than the others, they will be preferred, but your distribution will still be more even than with the default value of load_formula (which I don't remember, having changed this in my cluster quite some time ago).
You may need to set the slots on each host. I have done this myself because I need to limit the number of jobs on a particular set of boxes to fewer than their maximum, because they don't have as much memory as some of the other ones. I don't know if it is required for this load_formula configuration, but if it is, you can add a slots consumable to each host. Do this with "qconf -me hostname" and add a value to "complex_values" that looks like "slots=16", where 16 is the number of slots you want that host to use.
This is what I learned from our sysadmin. Put this SGE resource request in your job script:
#$ -l nodes=30,ppn=2
This requests 2 MPI processes per node (ppn) and 30 nodes. I think there is no guarantee that this 30x2 layout will work on a 30-node cluster if other users also run lots of jobs, but perhaps you can give it a try.
