Airflow task retried after failure despite retries=0 - airflow

I have an Airflow environment running on Cloud Composer (3 n1-standard-1 nodes; image version: composer-1.4.0-airflow-1.10.0; config override: core catchup_by_default=False; PyPI packages: kubernetes==8.0.1).
During a DAG run, a few tasks (all GKEPodOperators) failed due to airflow worker pod eviction. All of these tasks were set to retries=0. One of them was requeued and retried. Why would this happen when the task is set to 0 retries? And why would it only happen to one of the tasks?

"airflow worker pod eviction" means that some pods needed more resources hence some pods were evicted.
To fix this, you can use larger machine types or try to reduce the DAGs' memory consumption.
Review this document to get a better view.

Related

MWAA restart functionality for requirements.txt updates

Every time our team puts a new requirements.txt file in place for our MWAA environment, the environment requires a restart.
Regardless of the environment being in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGs. I would expect something to at least be unavailable or locked during this process from a user perspective.
So, my questions are: in the MWAA way of things, what exactly is being "restarted" during this process, and why is it applied to the entire so-called MWAA environment?
The Airflow DAG processor, the Airflow workers and the Airflow scheduler are rebooted, but not the Airflow web server.
This can be confirmed by checking their respective logs.
Beware: some long-running tasks can fail during a reboot.
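Because long-running tasks can fail while the workers and scheduler restart, one option is to hold off on triggering them until the environment has left the PENDING/UPDATING states. A minimal Python sketch of such a wait loop follows. It assumes a client object exposing get_environment(Name=...) that returns the status under ["Environment"]["Status"], which is the shape of the boto3 MWAA client response as far as I know. The function name wait_until_settled, the polling defaults and the environment name are hypothetical.

```python
import time

# Statuses treated as "restart in progress". PENDING and UPDATING are the two
# named in the question; treating only these as busy is an assumption.
BUSY_STATUSES = {"PENDING", "UPDATING"}


def wait_until_settled(client, name, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll the environment until it leaves a busy status; return the final status."""
    status = None
    for _ in range(max_polls):
        # Assumed response shape: {"Environment": {"Status": "..."}}
        status = client.get_environment(Name=name)["Environment"]["Status"]
        if status not in BUSY_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{name} still {status} after {max_polls} polls")


# Possible usage with boto3 (not imported here; environment name is hypothetical):
#   import boto3
#   status = wait_until_settled(boto3.client("mwaa"), "my-environment")
```

The returned status should still be checked by the caller, since leaving the busy states does not by itself mean the update succeeded.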

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure managed PostgreSQL (8 CPUs). We have a DAG with about 30 tasks; each task uses a KubernetesPodOperator (from apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is deployed with the official Airflow Helm chart. The executor is Celery.
Issue
Usually the first 5 or so tasks execute successfully (taking 1 or 2 minutes each) and get marked as done (and colored green) in the Airflow UI. The tasks after that are also executed successfully on AKS, but Airflow does not mark them as completed. In the end this leads to the error message below, and the already finished task is marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866]{base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y (the link to Stack Overflow in that post no longer works).
The metadata database (Azure managed PostgreSQL) is not overloaded. The AKS node pool we are using does not show any sign of stress either. It seems like the scheduler cannot pick up / detect a finished task after a few tasks have run.
We also looked at several configuration options as stated here.
We have been trying to solve this for a number of days now, unfortunately without success.
Does anyone have an idea what the cause could be? Any help is appreciated!

Running airflow DAG/tasks on different hosts

We currently have a bunch of independent jobs running on different servers & being scheduled with crontab. The goal would be to have a single view of all the jobs across the servers and whether they've run successfully etc.
Airflow is one of the tools we are considering to achieve this, but our servers are configured very differently. Is it possible to set up Airflow so that DAG1 (and the Airflow scheduler & webserver) runs on server1 and DAG2 runs on server2, without RabbitMQ?
Essentially I'd like to achieve something like the first answer given here (or just at a DAG level): Airflow DAG tasks parallelism on different worker nodes
in the quickest & simplest way possible!
Thanks
You can check out Running Apache-Airflow with Celery Executor in Docker.
To use Celery, you can instantiate a Redis node as a pod and then manage tasks across multiple hosts.
The link above also provides a starter docker-compose YAML file to help you get started quickly with Apache Airflow on the Celery executor.
Is it possible to set up airflow so that DAG1 (and the airflow
scheduler & webserver) runs on server1 and DAG2 runs on server2
without RabbitMQ.
With the Celery Executor, Airflow will by default try to use multiple hosts, and the division is always at the task level, not the DAG level.
This post might help you with spawning specific tasks on a specific worker node.

Is there a way to view a list of all airflow workers?

New to Airflow, so apologies if this question doesn't really make sense. Is there a command or place in the webserver UI where one can see a list of all running workers? Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Regarding this part of your question:
Also, if a Celery worker node is not explicitly started with airflow worker, are there "default" workers that are initialized with either the webserver or scheduler?
Take a look at the docs on executors.
Using Celery requires some configuration changes. If you haven't configured Airflow to use Celery, then even if you start a Celery worker, the worker won't pick up any tasks.
Conversely, if you have configured Airflow to use Celery and you have not started any Celery workers, then your cluster will not execute a single task.
If you are using SequentialExecutor (the default) or LocalExecutor (requires configuration), tasks are executed by the scheduler and no Celery workers are used (if you spun some up, they wouldn't execute any tasks).
Regarding this:
Is there a way to view a list of all airflow workers?
If you have configured airflow to use celery, then you can run flower to see monitoring of celery workers. In airflow >= 2.0.0 flower is launched with airflow celery flower.

Maximum number of DAGs in Airflow and Cloud Composer

Is there a maximum number of DAGs that can be run in 1 Airflow or Cloud Composer environment?
If this is dependent on several factors (Airflow infrastructure config, Composer cluster specs, number of active runs per DAG etc..) what are all the factors that affect this?
I found from Composer docs that Composer uses CeleryExecutor and runs it on Google Kubernetes Engine (GKE).
There is no limit on the maximum number of DAGs in Airflow; it is a function of the resources (nodes, CPU, memory) available. Assuming resources are available, the Airflow configuration options are just limit settings that can become a bottleneck and then have to be modified.
There is a helpful guide on how to do this in Cloud Composer here. Once you enable autoscaling in the underlying GKE cluster and raise the hard limits specified in the Airflow configuration, there should be no limit to the maximum number of tasks.
For vanilla Airflow, it will depend on the executor you are using in Airflow, and it will be easier to scale up if you use the KubernetesExecutor and then handle the autoscaling in K8s.
If you are using LocalExecutor and are facing slow performance, you can improve this by increasing the resources allocated to your Airflow installation (CPU, memory).
It depends on the resources allocated to your Airflow installation and on the type of executor. There is also a maximum number of tasks and DAG runs allowed to run concurrently, defined in the [core] section of airflow.cfg:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 124
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 124
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 500
