Airflow stops scheduling tasks after a few days of runs

Airflow Version 2.0.2
I have three schedulers running in a Kubernetes cluster with the CeleryExecutor and a Postgres backend. Everything seems to run fine for a couple of weeks, but then the Airflow scheduler stops scheduling some tasks. I've done an airflow db reset followed by an airflow db init and a fresh deployment of the Airflow-specific images. Below are some of the errors I've seen in the database logs:
LOG: could not receive data from client: Connection timed out
STATEMENT: SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
FROM slot_pool FOR UPDATE NOWAIT
The slot_pool table looks like this:
select * from slot_pool;
 id |     pool     | slots | description
----+--------------+-------+--------------
  1 | default_pool |   128 | Default pool
(1 row)
According to https://github.com/apache/airflow/issues/19811 the slot_pool issue is expected behavior, but I cannot figure out why DAGs suddenly stop being scheduled on time. For reference, there are ~500 DAGs being run every 15 minutes.
I have looked at several posts, but none of the posts seem to explain the issue or provide a solution. Below are a few of them:
Airflow initdb slot_pool does not exists
Running multiple Airflow Schedulers cause Postgres locking issues
Airflow tasks get stuck at "queued" status and never gets running
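To see where the contention on the slot_pool FOR UPDATE NOWAIT statement comes from while all three schedulers are running, here is a hypothetical diagnostic sketch (connection details are placeholders) that lists the sessions holding or waiting on locks against the slot_pool table:

# Hypothetical diagnostic: which backends hold or wait on locks on slot_pool?
# Connection parameters are placeholders for the metadata database.
import psycopg2

conn = psycopg2.connect(
    host="airflow-metadata-db",   # placeholder
    dbname="airflow",
    user="airflow",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT a.pid, a.application_name, a.state, l.mode, l.granted, a.query
        FROM pg_locks l
        JOIN pg_class c ON c.oid = l.relation
        JOIN pg_stat_activity a ON a.pid = l.pid
        WHERE c.relname = 'slot_pool'
        ORDER BY l.granted DESC;
        """
    )
    for row in cur.fetchall():
        print(row)
conn.close()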

Related

How to kill ghost jobs in Airflow?

I am new to Airflow and have followed a few bad practices, including manually marking a task as Failed while it was running. Airflow then leaves the task behind as a ghost running job, and I cannot find a way to kill it. I have seen the on_kill() option, but I am not really sure how to implement it.
More specifically, I start some servers via SSH with a task called get_proxies, which spins up servers that produce a list of proxies to be used in the pipeline. It usually takes a few minutes to finish, and then the pipeline continues. At the end of the pipeline I have a task called destroy_proxies (also via SSH) which, as its name suggests, destroys the servers where the proxies are running and then also stops the containers (docker-compose down). Its trigger rule is all_done, so even if the pipeline fails, the proxies are destroyed.
Last time, while doing some tests, I decided to manually mark the get_proxies task as Failed while it was still running. A few seconds later the destroy_proxies task executed successfully and destroyed the proxies that had been created so far. However, I noticed afterwards that the proxies were still running. So I need a way to handle these crashes or manual Failed/Success markings, because they leave running jobs behind in the background and there is no way for me to stop or kill them (I already have a workaround to destroy such proxies, but still).
More specifically, I have two questions: 1) How do I kill those running jobs? 2) How do I handle tasks being killed externally so that they do not leave ghost jobs behind that are hard to access?
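Since on_kill() is mentioned, here is a minimal sketch, assuming a custom operator that wraps the SSHOperator used for get_proxies; the class name and the destroy_command parameter are hypothetical, not the questioner's actual code:

# Minimal sketch: override on_kill() so that an externally killed or manually
# failed get_proxies task still tears down the proxy servers.
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator


class GetProxiesOperator(SSHOperator):
    def __init__(self, *, destroy_command: str, **kwargs):
        super().__init__(**kwargs)
        self.destroy_command = destroy_command  # hypothetical teardown command

    def on_kill(self):
        # Airflow calls on_kill() when the running task is terminated externally,
        # e.g. after it is manually marked as Failed; open a fresh SSH connection
        # using the same connection id and run the teardown command.
        hook = SSHHook(ssh_conn_id=self.ssh_conn_id)
        client = hook.get_conn()
        try:
            client.exec_command(self.destroy_command)  # e.g. "docker-compose down"
        finally:
            client.close()

With something like this, get_proxies would be created with its usual command plus a destroy_command, so a manual failure would trigger the same cleanup that destroy_proxies performs at the end of the pipeline.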
Some information about my Airflow version:
Apache Airflow
version | 2.3.2
executor | CeleryExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
dags_folder | /opt/airflow/airflow-app/dags
plugins_folder | /opt/airflow/plugins
base_log_folder | /opt/airflow/logs
remote_base_log_folder |
System info
OS | Linux
architecture | x86_64
locale | ('en_US', 'UTF-8')
python_version | 3.9.13 (main, May 28 2022, 14:03:04) [GCC 10.2.1 20210110]
python_location | /usr/local/bin/python
For the last task, destroy_proxies, which is an SSHOperator, I was thinking of adding some wait time to the command being executed, but that is not a great solution, since the get_proxies task does not always take the same amount of time.
On the other hand (though maybe it is worth asking a separate question about this), sometimes when I trigger the DAG manually I get some failed tasks with no logs. I suspect this is related, since the extra jobs running in the background might be causing memory issues...
It is my first time writing a question here, so I hope I am being clear, but in any case if more information is needed, or what I wrote is not entirely clear, I am always willing to re-write it and provide more info. Thank you all!

Airflow Scheduler handling queueing of dags

I have the following airflow setup
Executor : KubernetesExecutor
airflow version : 2.1.3
airflow config : parallelism = 256
I have the below scenario:
I have a number of DAGs (e.g. 10) that depend on the success state of a task from another DAG. That task kept failing, with retries set to 6.
All the dependent DAGs run hourly, so the scheduler kept adding them to the queued state. I could see around 800 DAG runs in the queued state and nothing running, so I ended up manually changing their state to Failed.
Below are my questions from this event.
Is there a limit on the number of DAGs that can run concurrently in an Airflow setup?
Is there a limit on how many DAG runs can be queued?
When DAG runs are queued, how does the scheduler decide which one to pick? Is it based on queued time?
Is it possible to set up priorities among the queued DAGs?
How does Airflow 2.1.3 treat tasks in the queue? Are they counted against the max_active_runs parameter?
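As a reference point for these knobs, here is a minimal sketch (the DAG id and task are made up) of the per-DAG and per-task settings that interact with the global parallelism value: max_active_runs caps how many runs of one DAG can be active at once, and priority_weight biases which queued tasks are picked first when pool or parallelism slots free up.

# Minimal sketch with hypothetical names: per-DAG and per-task scheduling knobs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hourly_dependent_dag",   # hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,               # at most one active run of this DAG at a time
) as dag:
    work = BashOperator(
        task_id="do_work",
        bash_command="echo working",
        priority_weight=10,          # higher-weight queued tasks are scheduled first
        pool="default_pool",         # queued tasks compete for slots in this pool
    )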

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure managed PostgreSQL (8 CPUs). We have a DAG with about 30 tasks; each task uses a KubernetesPodOperator (from apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is deployed with the official Airflow Helm chart. The executor is Celery.
Issue
Usually the first five or so tasks execute successfully (taking one or two minutes each) and get marked as done (colored green) in the Airflow UI. The tasks after that also execute successfully on AKS, but are not marked as completed in Airflow. Eventually this leads to the error message below, and the already finished task is marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866]{base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y where the link to Stack Overflow in that post no longer works.
The metadata database (Azure managed PostgreSQL) is not overloaded, and the AKS node pool we are using does not show any sign of stress. It seems like the scheduler cannot pick up / detect a finished task after a couple of tasks have run.
We also looked at several configuration options as stated here.
We have been trying to get this solved for a number of days now, but unfortunately without success.
Does anyone have any idea what the cause could be? Any help is appreciated!
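For what it is worth, a quick hypothetical check from a worker or scheduler pod is whether idle connections to the Azure PostgreSQL server are being dropped before the heartbeat fires. The sketch below (credentials are placeholders) uses libpq TCP keepalive parameters; the same parameters can typically be appended to the sql_alchemy_conn URI as query arguments, e.g. ?keepalives=1&keepalives_idle=60.

# Hypothetical connectivity test against the metadata database; the host is taken
# from the error message above, user and password are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="psql-airflow-dev-01.postgres.database.azure.com",
    port=5432,
    dbname="airflow",
    user="airflow",          # placeholder
    password="...",          # placeholder
    connect_timeout=10,
    keepalives=1,            # enable TCP keepalives on this connection
    keepalives_idle=60,      # seconds of idleness before the first probe
    keepalives_interval=10,  # seconds between probes
    keepalives_count=5,      # failed probes before the connection is declared dead
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()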

Is there a way to have 3 sets of worker nodes (groups) for Airflow?

We are setting up Airflow for scheduling/orchestration. Currently we have Spark Python loads and non-Spark loads on different servers, and we push files to GCP from yet another server. Is there an option to decide which worker nodes the Airflow tasks are submitted to? Currently we are using SSH connections to run all workloads. Our processing is mostly on-prem.
We use the Celery executor model. How do we make sure that a specific task runs on its appropriate node?
Task 1 runs on a non-Spark server (no Spark binaries available).
Task 2 executes a PySpark submit (this server has the Spark binaries).
Task 3 pushes the files created by task 2 from another server/node (only this one has the GCP utilities installed to push the files, for security reasons).
If I create a DAG, is it possible to specify that a task should execute on a particular set of worker nodes?
Currently we have a wrapper shell script for each task and make 3 SSH runs to complete the process. We would like to avoid such wrapper shell scripts and instead use the built-in PythonOperator, SparkSubmitOperator, SparkJdbcOperator and SFTPToGCSOperator, and make sure each specific task runs on a specific server or set of worker nodes.
In short, can we have 3 worker node groups and make each task execute on a group of nodes based on the operation?
We can assign a queue to each worker node.
Start each Airflow worker with its queue specified:
airflow worker -q sparkload
airflow worker -q non-sparkload
airflow worker -q gcpload
Then start each task with its queue specified. A similar thread was found as well:
How can Airflow be used to run distinct tasks of one workflow in separate machines?
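As a sketch of that approach (the DAG id, commands, and queue names below are illustrative; note that in Airflow 2.x the worker command is airflow celery worker -q <queue>), every operator accepts a queue argument that routes the task to workers listening on that queue, including the provider operators mentioned above:

# Minimal sketch: route tasks to dedicated Celery queues via the queue argument.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="queue_routing_example",   # hypothetical
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Picked up only by workers subscribed to the non-sparkload queue
    prepare = BashOperator(
        task_id="non_spark_load",
        bash_command="echo 'non-spark work'",
        queue="non-sparkload",
    )

    # Picked up only by workers subscribed to the sparkload queue
    spark_job = BashOperator(
        task_id="pyspark_submit",
        bash_command="spark-submit --version",
        queue="sparkload",
    )

    # Picked up only by workers subscribed to the gcpload queue
    push_files = BashOperator(
        task_id="push_files_to_gcp",
        bash_command="echo 'gsutil cp ...'",
        queue="gcpload",
    )

    prepare >> spark_job >> push_files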

Airflow task retried after failure despite retries=0

I have an Airflow environment running on Cloud Composer (3 n1-standard-1 nodes; image version: composer-1.4.0-airflow-1.10.0; config override: core catchup_by_default=False; PyPI packages: kubernetes==8.0.1).
During a DAG run, a few tasks (all GKEPodOperators) failed due to airflow worker pod eviction. All of these tasks were set to retries=0. One of them was requeued and retried. Why would this happen when the task is set to 0 retries? And why would it only happen to one of the tasks?
"airflow worker pod eviction" means that some pods needed more resources hence some pods were evicted.
To fix this you can use larger machine types or try to reduce the DAGs memory consumption.
Review his document to have a better view.
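One thing worth double-checking (purely as an illustration, not a confirmed cause) is where retries is configured: a value in default_args applies to every task unless a task overrides it explicitly. A minimal sketch with hypothetical names, using Airflow 1.10-style imports to match the Composer image:

# Minimal sketch: the two places retries can be set for a DAG's tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"retries": 0}  # no automatic retries for any task by default

with DAG(
    dag_id="retries_example",        # hypothetical
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    catchup=False,
    default_args=default_args,
) as dag:
    no_retry = BashOperator(
        task_id="no_retry_task",
        bash_command="echo runs at most once",
    )
    overridden = BashOperator(
        task_id="overridden_task",
        bash_command="echo may be retried",
        retries=1,                   # a task-level value overrides default_args
    )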
