I have 6 subdags. Each of them contains a task with pool='crawler' that requires a lot of resources so I created a pool crawler with only 1 slot.
When I run the DAG it seems that the pool restriction is bypassed and all six tasks are executed at the same time (as you can see from the screenshot).
How can I force used slots to be <= available slots?
From the source code:
Airflow pool is not honored by SubDagOperator. Hence resources could
be consumed by SubdagOperators
I'm using Airflow through Cloud Composer (image: composer-2.0.29-airflow-2.3.3). I have defined 5 DAGs that run concurrently, with at most 22 tasks running at the same time across the 5 DAGs. All of them use the default pool, with its default of 128 slots.
My composer instance has:
1 Scheduler: 0.5 vCPUs, 1.875 GB memory, 1 GB storage
Worker: 0.5 vCPUs, 1.875 GB memory, 1 GB storage
Autoscaling worker: from 1 to 3.
I would like to create different pools to separate my 5 systems. How do I define the number of slots in each pool? Suppose a pool has 1 DAG with 10 tasks (with 5/10 concurrent tasks). How many slots should I assign to each task?
DAG example:
task1.x is ingestion of JDBC table; while task2.x is update of the corresponding BigQuery table.
Thank you all!
Airflow pools are designed to avoid overwhelming the external systems used by a group of tasks. For example, if you have tasks in different DAGs that use a machine learning model API, an RDBMS, an API with quotas, or any other system with limited scaling, you can use an Airflow pool to limit the number of parallel tasks that interact with this system.
In your case, you have two systems: the JDBC database and BigQuery. You need to create just two pools, jdbc_pool and bigquery_pool, then assign all the tasks (from all the DAGs) that interact with the JDBC tables to the first one and all the tasks that interact with BigQuery to the second one. For the slots, you can define them based on the performance of each system and the computational weight of each task.
If you have a monitoring tool (Prometheus, Datadog, ...), you can run one of the tasks and watch the resource usage on your database. Let's assume it uses 10% of the resources: in that case you can create a pool with 8 slots to cap usage at 80% (avoid targeting 100%, so that there is headroom for unexpected load). Then, for the pool_slots of each task:
if all the tasks are similar, you can use pool_slots=1 for all of them: at most 8 parallel tasks, at 80% resource usage
if some tasks are heavier than the one you tested (they use more than 10% of the database resources), you can give them a higher pool_slots value based on their resource usage: if a task consumes 20% of the resources, use pool_slots=2 for that task only and keep 1 for the others. You can then run either 8 simple tasks in parallel, or 6 simple tasks alongside the heavy one, at 80% resource usage in both cases.
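The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not an Airflow API: the function names are made up here, and the percentages are the ones assumed in the example (10% per simple task, 20% for the heavy task, 80% target).

```python
def pool_size(baseline_pct: int, target_pct: int = 80) -> int:
    """Number of pool slots, where one slot equals the load of one baseline task."""
    return target_pct // baseline_pct


def pool_slots_for(task_pct: int, baseline_pct: int) -> int:
    """pool_slots for a task, rounded up to whole baseline units (at least 1)."""
    return max(1, -(-task_pct // baseline_pct))  # ceiling division on integers


size = pool_size(baseline_pct=10)                      # slots in the pool
heavy = pool_slots_for(task_pct=20, baseline_pct=10)   # pool_slots of the heavy task
simple_alongside_heavy = size - heavy                  # simple tasks that still fit

print(size, heavy, simple_alongside_heavy)
```

With the assumed numbers this reproduces the figures in the list: a pool of 8 slots, 2 slots for the heavy task, and room for 6 simple tasks next to it.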
For bigquery_pool, you need to check the applicable quotas, but I think you can use a high value without any problem, since BigQuery is a very scalable serverless data warehouse.
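As a rough sketch, the two pools could be written to a JSON file and loaded with the `airflow pools import` command. The slot counts below are assumptions for illustration only (8 comes from the 80% example, 64 is an arbitrary "high value"), and the descriptions and file name are made up; measure your own systems before picking numbers.

```python
import json

# Hypothetical pool definitions; slot counts are placeholders, not recommendations.
pools = {
    "jdbc_pool": {"slots": 8, "description": "Tasks reading from the JDBC database"},
    "bigquery_pool": {"slots": 64, "description": "Tasks updating BigQuery tables"},
}

pools_json = json.dumps(pools, indent=2)
print(pools_json)  # save this output as e.g. pools.json, then: airflow pools import pools.json
```

Each task then opts in through its pool argument (pool="jdbc_pool" or pool="bigquery_pool"), optionally with pool_slots for heavier tasks.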
If you just want to limit the number of tasks executed on each worker, for example to avoid OOM problems, you can set the worker concurrency setting (worker_concurrency in the [celery] section for Celery workers).
And if you want to limit the number of tasks executed across the whole Airflow installation, you can set the parallelism setting in the [core] section.
I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that the tasks can be performed on different workers. The file will be downloaded on the first worker and transformed on another, and an error will occur because the file is missing on the second worker. Is it possible to configure the DAG so that all tasks are performed on one worker?
It's bad practice. Even if you find a workaround, it will be very unreliable.
In general, if your executor allows it, you can configure tasks to execute on a specific worker type. For example, with CeleryExecutor you can assign tasks to a specific queue. Assuming there is only 1 worker consuming from that queue, your tasks will be executed on the same worker, BUT the fact that it's 1 worker doesn't mean it will always be the same machine. That depends heavily on the infrastructure you use. For example: when you restart your machines, do you get the exact same machine back, or is a new one spawned?
I highly advise you - don't go down this road.
To solve your issue, either download the file to shared storage such as S3 or Google Cloud Storage, so that all workers can read it, or combine the download and transform into a single operator so that both actions are executed together.
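The second option (one task doing both steps) might look roughly like the sketch below. It is plain Python with no Airflow or FTP calls: fake_download and fake_transform are hypothetical stand-ins for the real FTP download and transformation logic, and the file name is arbitrary.

```python
import tempfile
from pathlib import Path
from typing import Callable


def download_and_transform(
    download: Callable[[Path], None],
    transform: Callable[[Path], str],
) -> str:
    """Run both steps in one process so the file never has to cross workers."""
    with tempfile.TemporaryDirectory() as workdir:
        local_path = Path(workdir) / "input.dat"
        download(local_path)          # writes the file into the temp directory
        return transform(local_path)  # reads it back on the same machine


# Hypothetical stand-ins for the real download and transform logic.
def fake_download(path: Path) -> None:
    path.write_text("a,b,c\n")


def fake_transform(path: Path) -> str:
    return path.read_text().upper()


result = download_and_transform(fake_download, fake_transform)
print(repr(result))
```

A function like this could be used as the python_callable of a single PythonOperator, replacing the two separate tasks. The temporary directory is removed when the function returns, so anything worth keeping should be written to shared storage inside the transform step.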
I'm looking at using Airflow to schedule test-case execution against shared hardware in a lab, and I have some best-practice questions on how to use the resource pool concept for a whole DAG instance instead of just at task level.
Basically, a test case (executed as an instance of a test-case DAG: deploy/execute/collect/un-deploy) needs certain physical resources, and should therefore request them from the different resource pools (modelling the physical resources) so that it does not run into conflicting concurrent usage with other triggered DAG instances.
My question is whether it's possible to define resource usage at DAG-instance level, or only at task level. If the latter, would one parallel task claiming the resource during the whole DAG-instance execution be the best way to avoid having to pass the resource claim between all tasks in the DAG? Are there other alternatives?
Update after questions from Viraj and dlamblin:
Running 1.10.1
Running LocalExecutor
Have verified that I can run parallel DAGS with concurrent tasks
The resources I want custom pools for are not worker resources but peripheral hardware units such as relays, routers, etc. Tasks running in parallel on the LocalExecutor should block on them when they are occupied (0 slots left in the custom resource pool) by one or more other tasks.
The Kubernetes Executor allows for certain node type affinity to be configured on the task or dag level. The Celery Executor has a queue concept to select from a worker group with certain resources available to the worker. You're probably not using a Local Executor as your question doesn't quite make sense for that case.
I ran the following test command:
airflow test events {task_name_redacted} 2018-12-12
...and got the following output:
Dependencies not met for <TaskInstance: events.{redacted} 2018-12-12T00:00:00+00:00 [None]>, dependency 'Task Instance Slots Available' FAILED: The maximum number of running tasks (16) for this task's DAG 'events' has been reached.
[2019-01-17 19:47:48,978] {models.py:1556} WARNING -
--------------------------------------------------------------------------------
FIXME: Rescheduling due to concurrency limits reached at task runtime. Attempt 1 of 6. State set to NONE.
--------------------------------------------------------------------------------
[2019-01-17 19:47:48,978] {models.py:1559} INFO - Queuing into pool None
My Airflow is configured with a maximum concurrency of 16. Does this mean that I cannot test a task while the DAG is running and has used all of its task slots?
Also, it was a little unclear from the docs, but does the airflow test actually execute the task, as in if it was a SparkSubmitOperator, it would actually submit the job?
While I have yet to reach the phase of deployment where concurrency matters, the docs do give a fairly good indication of the problem at hand.
Since at any point of time just one scheduler is running (and you shouldn't be running multiple anyways), indeed it appears that irrespective of whether the DAG-runs are live-runs or test-runs, this limit will apply on them collectively. So that is certainly a hurdle.
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
But beware that merely increasing this number is not enough (even assuming you have big enough boxes for hefty workers, or multiple workers): several other settings will have to be tweaked as well to achieve the kind of parallelism I sense you want.
They are all listed under the [core] section:
# The amount of parallelism as a setting to the executor. This
# defines the max number of task instances that should run
# simultaneously on this airflow installation
parallelism = 32
# When not using pools, tasks are run in the "default pool", whose
# size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
But we are still not there, because once you spawn that many tasks simultaneously, the backend metadata DB will start choking. While this is likely a minor problem (and might not affect you unless you have some really huge DAGs or a very large number of Variable interactions in your tasks), it's still worth noting as a potential roadblock:
# The SqlAlchemy pool size is the maximum number of database
# connections in the pool. 0 indicates no limit.
sql_alchemy_pool_size = 5
# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does not
# apply to sqlite. If the number of DB connections is ever exceeded, a
# lower config value will allow the system to recover faster.
sql_alchemy_pool_recycle = 1800
# How many seconds to retry re-establishing a DB connection after
# disconnects. Setting this to 0 disables retries.
sql_alchemy_reconnect_timeout = 300
Needless to say, all this is pretty much futile unless you pick the right executor; SequentialExecutor, in particular, is only intended for testing:
# The executor class that airflow should use. Choices include SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor,
# KubernetesExecutor
executor = SequentialExecutor
But then params to BaseOperator like depends_on_past, wait_for_downstream are there to spoil the party as well
Finally I leave you with this link related to Airflow + Spark combination: How to submit Spark jobs to EMR cluster from Airflow?
(Pardon me if the answer confused you more than you already were, but..)
The LocalExecutor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates? I need to change it. I also need to know the difference between the scheduler's "max_threads" and "parallelism" in airflow.cfg.
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
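These variable names follow Airflow's AIRFLOW__{SECTION}__{KEY} convention for overriding airflow.cfg entries from the environment. A small sketch of the mapping is below; the helper name is made up for illustration, and the values are just the defaults quoted elsewhere in this thread.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name that overrides [section] key in airflow.cfg."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Illustrative values only; set these in the environment of the Airflow processes.
overrides = {
    airflow_env_var("core", "parallelism"): "32",
    airflow_env_var("core", "dag_concurrency"): "16",
}

for name, value in overrides.items():
    print(f"{name}={value}")
```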
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
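The interaction of the two limits can be checked with a toy calculation. This is a simplified model, not scheduler code: it assumes every task is ready to run and ignores pools, max_active_runs, and executor capacity, and the function name is made up.

```python
from typing import Sequence


def max_running(
    runnable_per_dag: Sequence[int],
    parallelism: int = 32,
    dag_concurrency: int = 16,
) -> int:
    """Upper bound on simultaneously running tasks under the two limits."""
    # Each DAG is capped individually by dag_concurrency ...
    per_dag = [min(n, dag_concurrency) for n in runnable_per_dag]
    # ... and the installation as a whole is capped by parallelism.
    return min(sum(per_dag), parallelism)


two_dags = max_running([16, 16])         # two DAGs with 16 runnable tasks each
one_big_dag = max_running([50])          # one DAG with 50 runnable tasks
three_dags = max_running([16, 16, 16])   # a third DAG has to wait for free capacity

print(two_dags, one_big_dag, three_dags)
```

Under these assumptions the two DAGs reach 32 running tasks, the single 50-task DAG stays at 16, and adding a third DAG does not raise the total above 32.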
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to
schedule dags. This is controlled by max_threads with default value of
2. User should increase this value to a larger value(e.g numbers of cpus where scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion. Not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. The max_threads cannot exceed the cpu count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use python's multiprocessing library for parallelism.
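The CPU-count cap described above can be pictured as a simple clamp. This sketch only illustrates the idea using Python's multiprocessing module; it is not copied from Airflow's scheduler code, and the function name is hypothetical.

```python
import multiprocessing


def effective_scheduler_processes(max_threads: int) -> int:
    """Clamp the configured max_threads to the number of CPUs (and to at least 1)."""
    return max(1, min(max_threads, multiprocessing.cpu_count()))


print(effective_scheduler_processes(2))      # the default value quoted in the FAQ
print(effective_scheduler_processes(10**6))  # an oversized value falls back to the CPU count
```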