I am facing the following performance issues:
Tasks are either stuck in the queued state for a few seconds/minutes, or are waiting for the scheduler with no status update.
Once a task is set to running, I can see a delay (from 10+ seconds up to minutes) in the hand-off between the executor/task_runner and the worker.
I am using Airflow 2.3.4 on AWS EKS (Kubernetes), deployed via the official Airflow Helm chart (https://airflow.apache.org/docs/helm-chart/stable/parameters-ref.html) with worker and scheduler pods, an AWS RDS PostgreSQL database instance, and an AWS ElastiCache Redis cluster (Celery broker), with the Airflow configuration below:
1. parallelism: 3200
2. max_active_runs_per_dag: 200
3. dag_concurrency: 1000
4. worker_concurrency: 16
5. scheduler/triggerer replicas: 4
There are no CPU/memory resource constraints on the worker or scheduler pods in Kubernetes, nor on Redis, the database, etc.
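A quick way to confirm that these values actually took effect inside the running pods (and are not just set in the chart values) is to read the live configuration from a scheduler or worker pod. A minimal sketch, assuming the Airflow 2.3 option names, where dag_concurrency is exposed as max_active_tasks_per_dag:

# Minimal sketch: run inside a scheduler or worker pod (e.g. via kubectl exec)
# to print the effective Airflow settings.
from airflow.configuration import conf

print("parallelism:", conf.getint("core", "parallelism"))
print("max_active_runs_per_dag:", conf.getint("core", "max_active_runs_per_dag"))
print("max_active_tasks_per_dag:", conf.getint("core", "max_active_tasks_per_dag"))  # formerly dag_concurrency
print("worker_concurrency:", conf.getint("celery", "worker_concurrency"))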
(Attachments: airflow_config.yml, a screenshot of the task stuck in queued, and task_log_with_exector_dealy_60secs)
Airflow Version 2.0.2
I have three schedulers running in a Kubernetes cluster with the CeleryExecutor and a Postgres backend. Everything runs fine for a couple of weeks, but then the Airflow scheduler stops scheduling some tasks. I've done an airflow db reset followed by an airflow db init and a fresh deployment of the Airflow-specific images. Below are some of the errors I've received from logging in the database:
According to https://github.com/apache/airflow/issues/19811, the slot_pool locking shown in the logs below is expected behavior, but I cannot figure out why DAGs suddenly stop being scheduled on time. For reference, there are ~500 DAGs being run every 15 minutes.
LOG: could not receive data from client: Connection timed out
STATEMENT: SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
FROM slot_pool FOR UPDATE NOWAIT
The slot_pool table looks like this:
select * from slot_pool;
id | pool | slots | description
----+--------------+-------+--------------
1 | default_pool | 128 | Default pool
(1 row)
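For completeness, the same pool data can be read through the Airflow ORM instead of raw SQL; a minimal sketch, assuming Airflow 2.x where airflow.utils.session.create_session is available:

# Minimal sketch: inspect pools via the Airflow ORM instead of psql.
from airflow.models import Pool
from airflow.utils.session import create_session

with create_session() as session:
    for pool in session.query(Pool).all():
        print(pool.pool, pool.slots, pool.description)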
I have looked at several posts, but none of them seem to explain the issue or provide a solution. Below are a few of them:
Airflow initdb slot_pool does not exists
Running multiple Airflow Schedulers cause Postgres locking issues
Airflow tasks get stuck at "queued" status and never gets running
New to Airflow, so apologies if this question doesn't really make sense. Is there a command or place in the webserver UI where one can see a list of all running workers? Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Regarding this part of your question:
Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Take a look at the docs on executors.
Using celery requires some configuration changes. If you haven't configured airflow to use celery, then even if you start a celery worker, the worker won't pick up any tasks.
Conversely, if you have configured airflow to use celery, and you have not started any celery workers, then your cluster will not execute a single task.
If you are using the SequentialExecutor (the default) or the LocalExecutor (requires configuration), tasks are executed by the scheduler, and no celery workers are used (and if you spun some up, they wouldn't execute any tasks).
Regarding this:
Is there a way to view a list of all airflow workers?
If you have configured airflow to use celery, then you can run flower to see monitoring of celery workers. In airflow >= 2.0.0 flower is launched with airflow celery flower.
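If you don't want to run flower, the Celery control API can also list the workers that are currently reachable over the broker; a minimal sketch, assuming you substitute the broker_url from your [celery] configuration for the placeholder below:

# Minimal sketch: ping the Celery workers registered against the broker.
from celery import Celery

app = Celery(broker="redis://my-redis:6379/0")  # placeholder; use your broker_url

replies = app.control.inspect(timeout=5.0).ping() or {}
for worker_name, reply in replies.items():
    print(worker_name, reply)  # e.g. celery@hostname {'ok': 'pong'}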
Is there a maximum number of DAGs that can be run in 1 Airflow or Cloud Composer environment?
If this is dependent on several factors (Airflow infrastructure config, Composer cluster specs, number of active runs per DAG, etc.), what are all the factors that affect it?
I found from Composer docs that Composer uses CeleryExecutor and runs it on Google Kubernetes Engine (GKE).
There is no hard limit on the maximum number of DAGs in Airflow; it is a function of the resources available (nodes, CPU, memory). Assuming resources are available, the Airflow configuration options are just limit settings that can become a bottleneck and may have to be modified.
There is a helpful guide on how to do this in Cloud Composer here. Once you enable autoscaling in the underlying GKE cluster and unlock the hard limits specified in the Airflow configuration, there should be no limit to the maximum number of tasks.
For vanilla Airflow, it will depend on the executor you are using, and it will be easier to scale up if you use the KubernetesExecutor and then handle the autoscaling in K8s.
If you are using the LocalExecutor and are facing slow performance, you can improve it by increasing the resources allocated to your Airflow installation (CPU, memory).
It depends on the resources available to your Airflow installation and on the type of executor. The maximum number of tasks and DAG runs allowed to run concurrently is defined in the [core] section of airflow.cfg:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 124
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 124
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 500
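Besides these global settings, the same limits can be tuned per DAG; a minimal sketch, assuming the Airflow 2.2+ parameter names (older versions use concurrency instead of max_active_tasks):

# Minimal sketch: per-DAG limits that override the global defaults above.
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="example_limits",            # placeholder DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
    max_active_runs=3,                  # per-DAG cap on concurrent DAG runs
    max_active_tasks=16,                # per-DAG cap on concurrent task instances
) as dag:
    pass  # task definitions go here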
Some of the Airflow tasks are automatically getting shut down.
I am using Airflow version 1.10.6 with Celery Executor. The Database is PostgreSQL and Broker is Redis. The airflow infrastructure is deployed on Azure.
A few tasks are getting shut down after 15 hours, and a few are getting stopped after 30 minutes. These are long-running tasks, and I have set the execution_timeout to 100 hours.
Is there any configuration that can prevent these tasks from being shut down by Airflow?
{local_task_job.py:167} WARNING - State of this instance has been externally set to shutdown. Taking the poison pill.
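For reference, this is roughly how the execution_timeout mentioned above is usually attached to a task (a sketch using the 1.10-era import path; the DAG id and command are placeholders). Note that execution_timeout only puts an upper bound on how long a task may run; it does not itself prevent the task state from being changed externally:

# Minimal sketch: setting execution_timeout on a long-running task (Airflow 1.10.x).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "execution_timeout": timedelta(hours=100),  # upper bound on a single task run
    "retries": 0,
}

with DAG(
    dag_id="long_running_example",      # placeholder DAG
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    default_args=default_args,
) as dag:
    long_task = BashOperator(
        task_id="long_task",
        bash_command="sleep 360000",    # placeholder for the real long-running command
    )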
I have an Airflow environment running on Cloud Composer (3 n1-standard-1 nodes; image version: composer-1.4.0-airflow-1.10.0; config override: core catchup_by_default=False; PyPI packages: kubernetes==8.0.1).
During a DAG run, a few tasks (all GKEPodOperators) failed due to airflow worker pod eviction. All of these tasks were set to retries=0. One of them was requeued and retried. Why would this happen when the task is set to 0 retries? And why would it only happen to one of the tasks?
"airflow worker pod eviction" means that some pods needed more resources hence some pods were evicted.
To fix this you can use larger machine types or try to reduce the DAGs memory consumption.
Review his document to have a better view.
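Besides changing machine types, another common mitigation is to give the task pods explicit resource requests/limits so they are scheduled onto nodes with enough memory and are less likely to be evicted under node pressure. A rough sketch against a recent cncf.kubernetes provider KubernetesPodOperator; the import path and the parameter name vary by version (the contrib-era operator used by that Composer image takes a resources argument instead of container_resources), so treat this as illustrative only:

# Rough sketch: explicit requests/limits reduce the chance of eviction under
# node memory pressure. Parameter names vary across operator versions.
from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

task = KubernetesPodOperator(
    task_id="my_gke_task",                      # placeholder task
    name="my-gke-task",
    namespace="default",
    image="gcr.io/my-project/my-image:latest",  # placeholder image
    container_resources=k8s.V1ResourceRequirements(
        requests={"memory": "512Mi", "cpu": "250m"},
        limits={"memory": "1Gi", "cpu": "500m"},
    ),
    retries=0,
)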