Airflow scheduler does not start after Google Composer update - airflow

I have a composer-2.0.25-airflow-2.2.5. I need to update the number of workers and environment variables in an environment that is already running. After update the environment the sheduler monitoring is unhealthy and the pod continues restarting alone. Sometimes appears the CrashLoopBackOff so indicates that a container is repeatedly crashing after restarting.
I looked the info of the pod where I saw the scheduler restarts.
I need the environment to continue running after the updates.
Do you have any idea about this issue?

Related

mwaa restart functionality for requirements.txt updates

Every time our team puts another requirements.txt file for our MWAA environment, it requires a restart.
Regardless of the environment being in a PENDING or UPDATING state, I can still access the UI and run/monitor DAGS. I expect something to at least be unavailable or locked during this process from a user perspective.
So, my questions are: in the MWAA way of things, what exactly is being "restarted" during this process and why is applied to the entire so-called MWAA environment?
Airflow DAG processor, Airflow workers and Airflow scheduler are reboot
but not Airflow web server
This can be confirmed checking their respective logs.
Beware, some long-running task can fail during a reboot.

airflow jobs stucked in running status

There is a strange phenomenon when using air flow. When I run the airflow scheduler, the unidentified jobs suddenly become running, and when I exit the scheduler, it turns into success.
I tried airflow db reset, but I keep getting that job if I just run the scheduler. can you tell me why?

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4. on a AKS cluster. The Airflow metadata database is an Azure managed postgreSQL(8 cpu). We have a DAG that has like 30 tasks, each task use a KubernetesPodOperator (using the apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is configured with the Airflow official HELM chart. The executor is Celery.
Issue
Usually the first like 5 tasks execute successfully (taking like 1 or 2 minute each) and get marked as done (and colored green) in the Airflow UI. The tasks after that are also successfully executed on AKS, but Airflow not marked as completed in Airflow as such. In the end this leads up to this error message and marking the already finished task as a fail:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866]{base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y Where in the post a link to Stackoverflow does not work anymore.
The metadata database (Azure managed postgreSQL) is not overloading. Also the AKS node pool we are using does not show any sign of stress. It seems like the scheduler cannot pick up / detect a finished task after like a couple of tasks have run.
We also looked at several configuration option as stated here
We are looking now for a number of days now to get this solved but unfortunately no success.
Anyone any idea's what the cause could be? Any help is appreciated!

Airflow task retried after failure despite retries=0

I have an Airflow environment running on Cloud Composer (3 n1-standard-1 nodes; image version: composer-1.4.0-airflow-1.10.0; config override: core catchup_by_default=False; PyPI packages: kubernetes==8.0.1).
During a DAG run, a few tasks (all GKEPodOperators) failed due to airflow worker pod eviction. All of these tasks were set to retries=0. One of them was requeued and retried. Why would this happen when the task is set to 0 retries? And why would it only happen to one of the tasks?
"airflow worker pod eviction" means that some pods needed more resources hence some pods were evicted.
To fix this you can use larger machine types or try to reduce the DAGs memory consumption.
Review his document to have a better view.

Does Airflow restart affect current running jobs?

This seems like a mundane question but just to be on the safe side,
what are the effects of restarting the airflow service on the jobs which are currently running?
If you only restart the airflow webserver/scheduler processes then the running jobs are not affected. However restarting the worker process kills the job (killed as zombie - http://airflow.incubator.apache.org/concepts.html#zombies-undeads) and then it may or may not be retried accordingly to the dag/task rules.

Resources