Airflow 504 gateway time-out

Many times when I try to open the tree view or task duration page of some DAGs in the UI I get the error: 504 gateway time-out.
Sometimes after that I can't even open the page with the list of DAGs.
Do you know where this problem could come from?
The CPU and memory of the machine running Airflow seem to be fine and I use RDS for the metadata.
Thanks!

I've experienced this before as well. I believe it's caused by an HTTP request that takes longer than expected for the webserver's gunicorn worker to fulfill. For example, if you set the DAG tree view to a high setting like 365 DAG runs for a DAG with a lot of tasks, you may be able to reproduce this consistently.
Can you try bumping up the timeout settings on the webserver to see if it makes a difference?
First, try increasing web_server_worker_timeout (default = 120 seconds) under the [webserver] group.
If that doesn't resolve it, you might also try increasing web_server_master_timeout under the same group.
Another technique to try is switching the webserver worker_class (default = sync) to eventlet or gevent.
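For reference, the relevant part of the [webserver] section in airflow.cfg might then look something like this (the values below are illustrative, not recommendations):

[webserver]
# Give gunicorn workers more time to render heavy views (default is 120 seconds).
web_server_worker_timeout = 300
# Timeout for the gunicorn webserver master process (default is 120 seconds).
web_server_master_timeout = 300
# Optionally switch from the default sync workers to an async worker class.
worker_class = gevent

The same settings can also be supplied through the corresponding environment variables, e.g. AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT.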
Reference: https://github.com/apache/incubator-airflow/blob/c27098b8d31fee7177f37108a6c2fb7c7ad37170/airflow/config_templates/default_airflow.cfg#L225-L229
Note that the alternative worker classes require installing Airflow with the async extras like:
pip install apache-airflow[async]
You can find more info about gunicorn worker timeouts in this question: How to resolve the gunicorn critical worker timeout error?.

Related

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure-managed PostgreSQL instance (8 CPUs). We have a DAG with about 30 tasks; each task uses a KubernetesPodOperator (from apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is deployed with the official Airflow Helm chart. The executor is Celery.
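For reference, the DAG is structured roughly like this (the DAG id, task names, namespace, and image below are placeholders, not the real values):

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("container_logic_dag", start_date=datetime(2021, 12, 1), schedule_interval=None) as dag:
    for i in range(30):
        KubernetesPodOperator(
            task_id=f"task_{i}",
            name=f"task-{i}",
            namespace="airflow",                      # placeholder namespace
            image="example.azurecr.io/logic:latest",  # placeholder image
            get_logs=True,
            is_delete_operator_pod=True,
        )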
Issue
Usually the first five or so tasks execute successfully (taking one or two minutes each) and get marked as done (colored green) in the Airflow UI. The tasks after that are also executed successfully on AKS, but are not marked as completed in Airflow. In the end this leads to the following error message, and the already finished task is marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866]{base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y where a link to Stack Overflow in the post no longer works.
The metadata database (Azure-managed PostgreSQL) is not overloaded. The AKS node pool we are using also shows no sign of stress. It seems like the scheduler cannot pick up / detect a finished task once a couple of tasks have run.
We also looked at several configuration options as stated here.
We have been trying to solve this for a number of days now, but unfortunately without success.
Does anyone have an idea what the cause could be? Any help is appreciated!

nginx: only 1 worker process is ever used

I have the following configuration:
worker_processes 4;
But I noticed that it always hits only 1 worker.
I am testing on a local CentOS VM. I am making curl HTTP calls to a specific port; I put 1000 curl requests in a file and ran them from multiple terminal windows.
But all of them hit only 1 worker. Is there a way I can get at least more than one worker handling requests? Can someone please share their knowledge on this?
https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
In the epoll-and-accept the load balancing algorithm differs: Linux seems to choose the last added process, a LIFO-like behavior. The process added to the waiting queue most recently will get the new connection. This behavior causes the busiest process, the one that only just went back to event loop, to receive the majority of the new connections. Therefore, the busiest worker is likely to get most of the load.
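One commonly suggested way to get connections spread across all workers is the reuseport parameter on the listen directive (available since nginx 1.9.1), which gives every worker process its own listening socket so the kernel balances new connections between them. A minimal sketch, assuming the test port is 8080:

worker_processes 4;

events {}

http {
    server {
        # With reuseport each worker gets its own listening socket and the
        # kernel distributes incoming connections across them, instead of
        # the most recently idle worker grabbing most of the accepts.
        listen 8080 reuseport;

        location / {
            return 200 "hello\n";
        }
    }
}

The accept_mutex setting in the events block also affects this behaviour, and its default has changed between nginx versions, so results may differ depending on your build.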

Troubleshoot AWS Fargate healthcheck for spring actuator

I have a Spring Boot application with an accessible /health endpoint, deployed on AWS ECS Fargate. Sometimes the container is stopped with a Task failed container health checks message. Sometimes it happens once a day, sometimes once a week; maybe it depends on the load. This is the health check command specified in the Task Definition:
CMD-SHELL,curl -f http://localhost/actuator/health || exit 1
My question is: how can I troubleshoot what AWS receives when the health check fails?
In case anyone else lands here because of failing container health checks (not the same as ELB health checks), AWS provides some basic advice:
Check that the command works from inside the container. In my case I had not installed curl in the container image, but when I tested it from outside the container it worked fine, which fooled me into thinking it was working.
Check the task logs in CloudWatch
If the checks are only failing sometimes (especially under load), you can try increasing the timeout, but also check the task metrics (memory and CPU usage). Garbage collection can cause the task to pause, and if all the vCPUs are busy handling other requests, the health check may be delayed, so you may need to allocate more memory and/or vCPUs to the task.
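In case it helps, the health check timing can be tuned in the container definition's healthCheck block; a sketch with illustrative (not recommended) values:

"healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost/actuator/health || exit 1"],
    "interval": 30,
    "timeout": 10,
    "retries": 3,
    "startPeriod": 120
}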
Thanks #John Velonis,
I don't have enough reputation to comment on your answer, so I'm posting this as a separate answer.
In my case, the ECS container kept getting an UNKNOWN health status from the ECS cluster, but I could access the health check endpoint successfully. When I read this post and checked my base image, which is node:14.19.1-alpine3.14, I found that it doesn't have the curl command.
So I had to install it in the Dockerfile:
RUN apk --no-cache add curl

Gunicorn CPU usage increasing to a very high value

We are using Gunicorn with Nginx. Every time we restart Gunicorn, its CPU usage gradually keeps increasing, from 0.5% to around 85% over 3-4 days. On restarting Gunicorn, it comes back down to 0.5%.
Please suggest what could cause this issue and how to go about debugging and fixing it.
Check your worker configuration. Try the commonly recommended formula of (cores * 2) + 1 workers.
Check your application; it seems that it is blocking / freezing threads. Add timeouts to all API calls, database queries, etc.
You can add APM software to analyze your application, for example Datadog.
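As a sketch of the first two suggestions, a gunicorn.conf.py along these lines (values are examples, not tuned recommendations) could look like:

# gunicorn.conf.py
import multiprocessing

# Rule of thumb from the gunicorn docs: (2 x cores) + 1 workers.
workers = multiprocessing.cpu_count() * 2 + 1

# Kill and restart a worker that is silent for more than 60 seconds,
# so a blocked request cannot pin a worker (and its CPU) indefinitely.
timeout = 60

# Recycle workers after a number of requests to limit slow resource
# build-up; the jitter avoids restarting all workers at the same time.
max_requests = 1000
max_requests_jitter = 100

Start it with gunicorn -c gunicorn.conf.py yourapp:app (the module name here is a placeholder).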

php5-fpm crashes

I have a web server (nginx) running on Debian, and php5-fpm randomly seems to crash; it replies with 504 Bad Gateway when I call PHP files.
When it is in a crashed state and I run sudo /etc/init.d/php5-fpm, it says that it is running, but it still gives 504 Bad Gateway until I restart php5-fpm.
I'm thinking it may have to do with one of my PHP files, which runs in an infinite loop until a certain event occurs (a change in the MySQL database) or until it times out. I don't know whether that is generally a good idea, or whether I should make the loop exit itself before a timeout occurs.
Thanks in advance!
First, look at the nginx error.log for the actual error. I don't think PHP crashed; your loop is just using up all available php-fpm processes, so there is none free to serve the next request from nginx. That should produce a timeout error in the logs (nginx will wait for some time for an available php-fpm process).
Regarding your second question: you should not use infinite loops for this. If you do, insert a sleep() call inside the loop; otherwise you will overload your CPU with that loop and also the database with queries.
Also, I guess it is enough to have one PHP process in that loop waiting for an event. In that case, use some type of semaphore (a file or a flag in the DB) to let other processes know that one is already waiting for that event. Otherwise you will always eat up all available PHP processes.
