Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator - airflow

Context
We are running Airflow 2.1.4. on a AKS cluster. The Airflow metadata database is an Azure managed postgreSQL(8 cpu). We have a DAG that has like 30 tasks, each task use a KubernetesPodOperator (using the apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is configured with the Airflow official HELM chart. The executor is Celery.
Issue
Usually the first like 5 tasks execute successfully (taking like 1 or 2 minute each) and get marked as done (and colored green) in the Airflow UI. The tasks after that are also successfully executed on AKS, but Airflow not marked as completed in Airflow as such. In the end this leads up to this error message and marking the already finished task as a fail:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866]{base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y Where in the post a link to Stackoverflow does not work anymore.
The metadata database (Azure managed postgreSQL) is not overloading. Also the AKS node pool we are using does not show any sign of stress. It seems like the scheduler cannot pick up / detect a finished task after like a couple of tasks have run.
We also looked at several configuration option as stated here
We are looking now for a number of days now to get this solved but unfortunately no success.
Anyone any idea's what the cause could be? Any help is appreciated!

Related

What is the recommended best practice for managing airflow logs?

We started to implement Airflow for task scheduling about a year ago, and we are slowly migrating more and more tasks to Airflow. At some point we noticed that the server was filling up with logs, even after we implemented remote logging to S3. I'm trying to understand what the best way to handle logs is, and I've found a lot of conflicting advice, such as in this stackoverflow question from 4 years ago.
Implementing maintenance dags to clean out logs (airflow-maintenance-dags)
Implementing our own FileTaskHandler
Using the logrotate linux utility
When we implemented remote logging, we expected the local logs to be removed after they were shipped to S3, but this is not the case. Local logs remain on the server. I thought this might be a problem with our configuration but I haven't found any way to fix that. Also, remote logging only applies to task logs, but process logs (specifically scheduler logs) are always local, and they took up the most space.
We tried to implement maintenance dags, but our workers are running from a different location to the rest of airflow, particularly the scheduler, so only task logs were getting cleaned. We could get around this by creating a new worker that shares logs with the scheduler, but we prefer not to create extra workers.
We haven't tried to implement either of the other two suggestions yet. But that is why I want to understand, how are other people solving this, and what is the recommended way?

Airflow scheduler failure notification via slack

I am a newbie to airflow so pardon me for any stupid assumptions I make about it, I have ETL set up at my work where I am running Airflow on company cluster and have a dag with few tasks in it. It is a possible scenario that the cluster on which airflow runs crashes , in that event the DAG will not run.
I wanted to check if we can set up notification on failure of airflow scheduler , my online reading has thrown up several useful articles to monitor the DAG itself , but if the scheduler fails then these failure notifications won't be triggered (correct me if thats not how it works)
Open the below link in incognito if you face firewall and don't have subscription
https://medium.com/datareply/integrating-slack-alerts-in-airflow-c9dcd155105

You have to use an external software for this, like Datadog.
Here you can find more information:
https://docs.datadoghq.com/integrations/airflow/?tab=host
Basically, you have to connect externally Datadog to Airflow through statsD.
In my case, I have Airflow deployed through docker-compose, and Datadog is another container (from Datadog official Docker image), linked to the scheduler and webserver containers.
You can also use Grafana and Prometeus (also through statsD), which is the Open Source way https://databand.ai/blog/everyday-data-engineering-monitoring-airflow-with-prometheus-statsd-and-grafana/

Airflow 2 - error MySQL server has gone away

I am running Airflow with backend mariaDB and periodically when a DAG task is being scheduled, I noticed the following error in airflow worker
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away').
I am not sure if the issue occurres due to misconfiguration of airflow, or it has to do that the backend is mariaDB, which as I saw it is not a recommended database.
Also, in mariaDB logs, I see the following warning repeating almost every minute
[Warning] Aborted connection 305627 to db: 'airflow' user: 'airflow' host: 'hostname' (Got an error reading communication packets)
I've seen some similar issues mentioned, but whatever I tried so far it didn't help.
The question is, Should I change database to MySQL? Or some configuration has to be done in mariaDB's end?
Airflow v2.0.1
MariaDB 10.5.5
SQLAlchemy 1.3.23

Hard to say - you need to look for the reason why your DB connection get aborted. MariaDB for quick testing with single scheduler might work, but there is some reason why your connection to the DB gets disconnected.
There are few things you can do:
airflow has db check command line command and you can run it to test if the DB configuration is working - maybe the errors that you will see will be obious when you try
airflow also has another useful command db shell - it will allow you to connect to the DB and run sql queries for example. This migh tell you if your connection is "stable". You can connect and run some queries and see if your connection is not interrupted in the meantime
You can see at more logs and your network connectivity to see if you have problems
finally check if you have enough resources to run airflow + DB. Ofthen things like that happen when you do not have enough memory for example. Airflow + DB requires at least 4GB RAM minimum from the experience (depends on the DB configuration) and if you are on Mac or Windows and using Docker, Docker VM by default has less memory than that available and you need to increase it.
look at other resources - disk space, memory, number of connections, etc. all can be your problem.

Troubleshoot AWS Fargate healthcheck for spring actuator

I have spring boot application with /health endpoint accessible deployed in AWS ECS Fargate. Sometimes the container is stopped with Task failed container health checks message. Sometimes happens once daily, sometimes once a week, maybe depends on the load. This is the healthcheck command specified in the Task Definition:
CMD-SHELL,curl -f http://localhost/actuator/health || exit 1
My question is how to troubleshoot what AWS receive when health-check is failed.

In case anyone else lands here because of failing container health checks (not the same as ELB health checks), AWS provides some basic advice:
Check that the command works from inside the container. In my case I had not installed curl in the container image, but when I tested it from outside the container it worked fine, which fooled me into thinking it was working.
Check the task logs in CloudWatch
If the checks are only failing sometimes (especially under load), you can try increasing the timeout, but also check the task metrics (memory and CPU usage). Garbage collection can cause the task to pause, and if all the vCPUs are busy handling other requests, the health check may be delayed, so you may need to allocate more memory and/or vCPUs to the task.

Thank #John Velonis,
I don't have enough reputation for commenting on your answer, so I post that in a different answer
For my case, the ecs container keeps getting UNKNOWN status from the ecs cluster. But I can access the healthcheck successfully. when I read this post, and check my base image which is node:14.19.1-alpine3.14, it doesn't have curl command.
so I have to install that in the Dockerfile
RUN apk --no-cache add curl

Airflow 504 gateway time-out

Many times when I try to open the tree view or task duration page of some DAGs in the UI I get the error: 504 gateway time-out.
Sometimes after that I can't even open the page with the list of DAGs.
Do you know where this problem could come from?
The CPU and memory of the machine running Airflow seem to be fine and I use RDS for the metadata.
Thanks!

I've experienced this before as well. I believe it's caused by an HTTP request that takes longer than expected for the webserver's gunicorn worker to fulfill. For example, if you set the DAG tree view to a high setting like 365 DAG runs for a DAG with a lot of tasks, you may be able to reproduce this consistently.
Can you try bumping up the timeout settings on the webserver to see if it makes a difference?
First, try increasing web_server_worker_timeout (default = 120 seconds) under the [webserver] group.
If that doesn't resolve it, you might also try increasing web_server_master_timeout under the same group.
Another technique to try is switching the webserver worker_class (default = sync) to eventlet or gevent.
Reference: https://github.com/apache/incubator-airflow/blob/c27098b8d31fee7177f37108a6c2fb7c7ad37170/airflow/config_templates/default_airflow.cfg#L225-L229
Note that the alternative worker classes require installing Airflow with the async extras like:
pip install apache-airflow[async]
You can find more info about gunicorn worker timeouts in this question: How to resolve the gunicorn critical worker timeout error?.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex