Run parallel tasks in Apache Airflow - airflow

I am able to configure airflow.cfg file to run tasks one after the other.
What I want to do is, execute tasks in parallel, e.g. 2 at a time and reach the end of list.
How can I configure this?

Executing tasks in Airflow in parallel depends on which executor you're using, e.g., SequentialExecutor, LocalExecutor, CeleryExecutor, etc.
For a simple setup, you can achieve parallelism by just setting your executor to LocalExecutor in your airflow.cfg:
[core]
executor = LocalExecutor
Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L76
This will spin up a separate process for each task.
(Of course you'll need to have a DAG with at least 2 tasks that can execute in parallel to see it work.)
Alternatively, with CeleryExecutor, you can spin up any number of workers by just running (as many times as you want):
$ airflow worker
The tasks will go into a Celery queue and each Celery worker will pull off of the queue.
You might find the section Scaling out with Celery in the Airflow Configuration docs helpful.
https://airflow.apache.org/howto/executor/use-celery.html
For any executor, you may want to tweak the core settings that control parallelism once you have that running.
They're all found under [core]. These are the defaults:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = True
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
Reference: https://github.com/apache/incubator-airflow/blob/29ae02a070132543ac92706d74d9a5dc676053d9/airflow/config_templates/default_airflow.cfg#L99

Related

Airflow different parallelism for different types of task

we have certain task which requires huge amount of resources which can't be run with high parallelism and many other smaller tasks which are can run at parallelism of 32.
I am aware of parallelism config
The amount of parallelism as a setting to the executor. This defines the max number of task instances that should run simultaneously on this airflow installation
parallelism = 32
Is there a way where we can tag tasks and different level of parallelism for different tasks at entire airflow level.
Like having smaller task to run at default parallelism [32] but heavy task at much lower parallelism [1-4]
Pools (docs: https://airflow.apache.org/docs/apache-airflow/stable/concepts/pools.html) serve exactly this purpose: to limit the parallelism for a specific set of tasks.
You can create pools with your desired # of "slots" in the Airflow UI, and assign the pool to your task:
my_task = BashOperator(
...,
pool="heavy_task_pool",
...,
)

Is there a way to view a list of all airflow workers?

New to Airflow, so apologies if this question doesn't really make sense. Is there a command or place in the webserver UI where one can see a list of all running workers? Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Reguarding this part of your question:
Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Take a look at the docs on executors.
Using celery requires some configuration changes. If you haven't configured airflow to use celery, then even if you start a celery worker, the worker won't pick up any tasks.
Conversely, if you have configured airflow to use celery, and you have not started any celery workers, then your cluster will not execute a single task.
If you are using SequentialExecutor (the default) or LocalExecutor (requires configuration), with both of these executors, tasks are executed by the scheduler, and no celery workers are used (and if you spun some up, then they wouldn't execute any tasks).
Regarding this:
Is there a way to view a list of all airflow workers?
If you have configured airflow to use celery, then you can run flower to see monitoring of celery workers. In airflow >= 2.0.0 flower is launched with airflow celery flower.

Maximum number of DAGs in Airflow and Cloud Composer

Is there a maximum number of DAGs that can be run in 1 Airflow or Cloud Composer environment?
If this is dependent on several factors (Airflow infrastructure config, Composer cluster specs, number of active runs per DAG etc..) what are all the factors that affect this?
I found from Composer docs that Composer uses CeleryExecutor and runs it on Google Kubernetes Engine (GKE).
There is no limit on the maximum number of dags in Airflow and it is a function of the resources (nodes, CPU, memory) available and then assuming there are resources available, the Airflow configuration options are just a limit setting that will be a bottleneck and have to be modified.
There is a helpful guide on how to do this in Cloud Composer here. So once you enable autoscaling in the underlying GKE cluster, and unlock the hard-limits specified in the Airflow configuration, there should be no limit to maximum number of tasks.
For vanilla Airflow, it will depend on the executor you are using in Airflow, and it will be easier to scale up if you use the KubernetesExecutor and then handle the autoscaling in K8s.
If you are using LocalExecutor then you can improve this if you are facing slow performance by increasing the resources allocated to your Airflow installation (CPU, memory).
it depends on the available resources allowed to your airflow and the type of the executor. And there is a maximum amount of allowed tasks and dags to run concurrently and simultaneously defined in the [core] section of airflow.cfg :
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 124
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 124
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 500

Test an Apache Airflow DAG while it is already scheduled and running?

I ran the following test command:
airflow test events {task_name_redacted} 2018-12-12
...and got the following output:
Dependencies not met for <TaskInstance: events.{redacted} 2018-12-12T00:00:00+00:00 [None]>, dependency 'Task Instance Slots Available' FAILED: The maximum number of running tasks (16) for this task's DAG 'events' has been reached.
[2019-01-17 19:47:48,978] {models.py:1556} WARNING -
--------------------------------------------------------------------------------
FIXME: Rescheduling due to concurrency limits reached at task runtime. Attempt 1 of 6. State set to NONE.
--------------------------------------------------------------------------------
[2019-01-17 19:47:48,978] {models.py:1559} INFO - Queuing into pool None
My Airflow is configured with a maximum concurrency of 16. Does this mean that I cannot test a task when the DAG is currently running, and has used all of it's task slots?
Also, it was a little unclear from the docs, but does the airflow test actually execute the task, as in if it was a SparkSubmitOperator, it would actually submit the job?
While I am yet to reach that phase of deployment where concurrency will matter, the docs do give a fairly good indication of problem at hand
Since at any point of time just one scheduler is running (and you shouldn't be running multiple anyways), indeed it appears that irrespective of whether the DAG-runs are live-runs or test-runs, this limit will apply on them collectively. So that is certainly a hurdle.
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
But beware that merely increasing this number (assuming you have big-enough boxes for hefty workers / multiple workers), several other configurations will have to be tweaked as well to achieve the kind of parallelism I sense you want.
They are all listed under [core] section
# The amount of parallelism as a setting to the executor. This
defines the max number of task instances that should run
simultaneously on this airflow installation
parallelism = 32
# When not using pools, tasks are run in the "default pool", whose
size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
But we are still not there, because once you spawn so many tasks simultaneously, the backend metadata-db will start choking. While this is likely a minor problem (and might not be affecting unless you have some real huge DAGs / very large no of Variable interactions in your tasks), its still worth noting as a potential roadblock
# The SqlAlchemy pool size is the maximum number of database
connections in the pool. 0 indicates no limit.
sql_alchemy_pool_size = 5
# The SqlAlchemy pool recycle is the number of seconds a connection
can be idle in the pool before it is invalidated. This config does not
apply to sqlite. If the number of DB connections is ever exceeded, a
lower config value will allow the system to recover faster.
sql_alchemy_pool_recycle = 1800
# How many seconds to retry re-establishing a DB connection after
disconnects. Setting this to 0 disables retries.
sql_alchemy_reconnect_timeout = 300
Needless to say, all this is pretty much futile unless you pick the right executor; SequentialExecutor, in particular is only intended for testing
# The executor class that airflow should use. Choices include SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor,
KubernetesExecutor
executor = SequentialExecutor
But then params to BaseOperator like depends_on_past, wait_for_downstream are there to spoil the party as well
Finally I leave you with this link related to Airflow + Spark combination: How to submit Spark jobs to EMR cluster from Airflow?
(Pardon me if the answer confused you more than you already were, but..)

Airflow parallelism

the Local Executor spawns new processes while scheduling tasks. Is there a limit to the number of processes it creates. I needed to change it. I need to know what is the difference between scheduler's "max_threads" and
"parallelism" in airflow.cfg ?
parallelism: not a very descriptive name. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this max_active_tasks.
dag_concurrency: Despite the name based on the comment this is actually the task concurrency, and it's per worker. I'd name this max_active_tasks_for_worker (per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
max_active_runs_per_dag: This one's kinda alright, but since it seems to be just a default value for the matching DAG kwarg, it might be nice to reflect that in the name, something like default_max_active_runs_for_dags
So let's move on to the DAG kwargs:
concurrency: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this max_active_tasks.
max_active_runs: This one sounds alright to me.
source: https://issues.apache.org/jira/browse/AIRFLOW-57
max_threads gives the user some control over cpu usage. It specifies scheduler parallelism.
It's 2019 and more updated docs have come out. In short:
AIRFLOW__CORE__PARALLELISM is the max number of task instances that can run concurrently across ALL of Airflow (all tasks across all dags)
AIRFLOW__CORE__DAG_CONCURRENCY is the max number of task instances allowed to run concurrently FOR A SINGLE SPECIFIC DAG
These docs describe it in more detail:
According to https://www.astronomer.io/guides/airflow-scaling-workers/:
parallelism is the max number of task instances that can run
concurrently on airflow. This means that across all running DAGs, no
more than 32 tasks will run at one time.
And
dag_concurrency is the number of task instances allowed to run
concurrently within a specific dag. In other words, you could have 2
DAGs running 16 tasks each in parallel, but a single DAG with 50 tasks
would also only run 16 tasks - not 32
And, according to https://airflow.apache.org/faq.html#how-to-reduce-airflow-dag-scheduling-latency-in-production:
max_threads: Scheduler will spawn multiple threads in parallel to
schedule dags. This is controlled by max_threads with default value of
2. User should increase this value to a larger value(e.g numbers of cpus where scheduler runs - 1) in production.
But it seems like this last piece shouldn't take up too much time, because it's just the "scheduling" portion. Not the actual running portion. Therefore we didn't see the need to tweak max_threads much, but AIRFLOW__CORE__PARALLELISM and AIRFLOW__CORE__DAG_CONCURRENCY did affect us.
The scheduler's max_threads is the number of processes to parallelize the scheduler over. The max_threads cannot exceed the cpu count. The LocalExecutor's parallelism is the number of concurrent tasks the LocalExecutor should run. Both the scheduler and the LocalExecutor use python's multiprocessing library for parallelism.

Resources