I have a Spring Boot application with an accessible /health endpoint, deployed on AWS ECS Fargate. Sometimes the container is stopped with a "Task failed container health checks" message. It happens sometimes once a day, sometimes once a week; maybe it depends on the load. This is the health check command specified in the Task Definition:
CMD-SHELL,curl -f http://localhost/actuator/health || exit 1
My question is: how can I troubleshoot what AWS receives when the health check fails?
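One way to see what ECS actually recorded for the stopped task is describe-tasks from the AWS CLI; it returns the task's stoppedReason and each container's health status. This is a sketch assuming the CLI is configured, with placeholder cluster and task identifiers:
aws ecs describe-tasks --cluster my-cluster --tasks <task-id> \
    --query 'tasks[].{stoppedReason: stoppedReason, containers: containers[].{name: name, healthStatus: healthStatus, exitCode: exitCode, reason: reason}}'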
In case anyone else lands here because of failing container health checks (not the same as ELB health checks), AWS provides some basic advice:
Check that the command works from inside the container. In my case I had not installed curl in the container image, but when I tested it from outside the container it worked fine, which fooled me into thinking it was working.
Check the task logs in CloudWatch
If the checks are only failing sometimes (especially under load), you can try increasing the timeout (see the sketch below), but also check the task metrics (memory and CPU usage). Garbage collection can pause the task, and if all the vCPUs are busy handling other requests the health check may be delayed, so you may need to allocate more memory and/or vCPUs to the task.
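For reference, the timeout lives next to the command in the task definition's healthCheck block; a minimal sketch with more generous values (the numbers here are just examples, not recommendations):
"healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost/actuator/health || exit 1"],
    "interval": 30,
    "timeout": 10,
    "retries": 3,
    "startPeriod": 60
}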
Thanks @John Velonis,
I don't have enough reputation to comment on your answer, so I'm posting this as a separate answer.
In my case, the ECS container kept getting UNKNOWN health status from the ECS cluster, even though I could access the health check endpoint successfully myself. When I read this post and checked my base image, node:14.19.1-alpine3.14, I found that it doesn't have the curl command.
So I had to install it in the Dockerfile:
RUN apk --no-cache add curl
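For context, a minimal Dockerfile sketch around that line (everything except the base image tag and the apk line is illustrative, including the app entry point):
FROM node:14.19.1-alpine3.14
# the alpine base image does not ship curl, so the CMD-SHELL health check would fail without this
RUN apk --no-cache add curl
WORKDIR /app
COPY . .
CMD ["node", "server.js"]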
I have the following configuration with
worker_processes 4;
But I noticed that it always hits only one worker.
I am testing on a local CentOS VM. I am making curl HTTP calls to a specific port; I put 1000 curl requests in a file and ran them from multiple terminal windows.
But I see that all of them hit only one worker. Is there a way to get at least more than one worker handling requests? Can someone please share their knowledge on this?
https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
In the epoll-and-accept the load balancing algorithm differs: Linux seems to choose the last added process, a LIFO-like behavior. The process added to the waiting queue most recently will get the new connection. This behavior causes the busiest process, the one that only just went back to event loop, to receive the majority of the new connections. Therefore, the busiest worker is likely to get most of the load.
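If you want connections spread more evenly across workers, one option discussed in that article is the reuseport flag on the listen directive (nginx 1.9.1+), which gives each worker its own listening socket and lets the kernel distribute new connections. A minimal sketch; the port and the trivial response are placeholders:
worker_processes 4;
events { }
http {
    server {
        # each worker gets its own listening socket, so the kernel spreads
        # new connections across workers instead of waking the most recently idle one
        listen 8080 reuseport;
        return 200 "ok\n";
    }
}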
Many times when I try to open the tree view or task duration page of some DAGs in the UI I get the error: 504 gateway time-out.
Sometimes after that I can't even open the page with the list of DAGs.
Do you know where this problem could come from?
The CPU and memory of the machine running Airflow seem to be fine and I use RDS for the metadata.
Thanks!
I've experienced this before as well. I believe it's caused by an HTTP request that takes longer than expected for the webserver's gunicorn worker to fulfill. For example, if you set the DAG tree view to a high setting like 365 DAG runs for a DAG with a lot of tasks, you may be able to reproduce this consistently.
Can you try bumping up the timeout settings on the webserver to see if it makes a difference?
First, try increasing web_server_worker_timeout (default = 120 seconds) under the [webserver] group.
If that doesn't resolve it, you might also try increasing web_server_master_timeout under the same group.
Another technique to try is switching the webserver worker_class (default = sync) to eventlet or gevent.
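A minimal sketch of where those settings live in airflow.cfg (the values are just examples, not recommendations):
[webserver]
# seconds a gunicorn worker has to serve a request before it is killed and restarted
web_server_worker_timeout = 300
# seconds the gunicorn master waits for workers at startup/refresh
web_server_master_timeout = 300
# sync (default), eventlet, or gevent
worker_class = sync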
Reference: https://github.com/apache/incubator-airflow/blob/c27098b8d31fee7177f37108a6c2fb7c7ad37170/airflow/config_templates/default_airflow.cfg#L225-L229
Note that the alternative worker classes require installing Airflow with the async extras like:
pip install apache-airflow[async]
You can find more info about gunicorn worker timeouts in this question: How to resolve the gunicorn critical worker timeout error?.
Unless someone proves otherwise: after installing ShinyProxy from ShinyProxy.io (a well-documented piece of software), the machine started a Docker image running XMRig, which takes 100% of the CPU and might be used for bitcoin mining. Below are some screenshots. If anyone has a similar problem, please let us know.
The first thing is to ensure that the Docker daemon API is not reachable from the outside world. Scans are performed all day long to track down open Docker daemon API services and launch containers through them.
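A quick way to check, assuming a standard systemd-based Docker install, is to look for the daemon listening on the unauthenticated TCP API ports and for any tcp:// host flag in its configuration:
# is dockerd listening on the remote API ports?
sudo ss -lntp | grep -E ':2375|:2376'
# was the daemon started with a tcp:// host?
ps aux | grep dockerd
grep -r 'tcp://' /etc/docker/daemon.json /etc/systemd/system/docker.service.d/ 2>/dev/null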
Second, as this is not a software issue but a suspected breach, I suggest we close this topic and continue via mail. You can reach OA security support at itsupport.at.openanalytics.eu
Could you send us an md5sum of the deployed jar file to the above-mentioned e-mail?
I cannot figure out why my ECS service will not launch; it keeps giving the error "service unable to place a task because the resources could not be found".
In my task definition, I have 500 CPU units and 250 MiB of memory dedicated to a very small sample Node app that just serves my static assets.
I am launching my service with 1 task and no ELB.
My guess is that your CPU units value is too high. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_environment
CPU is a harder metric to estimate if you haven't really measured your app's usage.
Anyway, I'm hitting a similar issue myself, so I'm right there in the same boat, but I'd try removing the cpu value, since it is optional at the container level, and see if that resolves it; something like the sketch below.
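A minimal container definition sketch with the container-level cpu omitted (the name, image, and port are placeholders):
{
  "containerDefinitions": [
    {
      "name": "static-assets",
      "image": "my-node-app:latest",
      "memory": 250,
      "essential": true,
      "portMappings": [
        { "containerPort": 3000 }
      ]
    }
  ]
}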
I have a web server (nginx) running on Debian, and php5-fpm randomly seems to crash; it replies with 504 Bad Gateway when I call PHP files.
When it is in a crashed state, sudo /etc/init.d/php5-fpm reports that it is running, but it still gives 504 Bad Gateway until I restart php5-fpm.
I'm thinking it might have to do with one of my PHP files, which runs in an infinite loop until a certain event occurs (a change in the MySQL database) or until it times out. I don't know if that is generally a good thing, or if I should make the loop quit itself before a timeout occurs.
Thanks in advance!
First, look at the nginx error.log for the actual error. I don't think PHP crashed; rather, your loop is using up all available php-fpm processes, so there is none free to serve the next request from nginx. That should produce a timeout error in the logs (nginx will wait for some time for an available php-fpm process).
Regarding your second question: you should not use infinite loops for this. If you do, insert a sleep() call inside the loop; otherwise you will overload your CPU with that loop, and the database with queries.
Also, I think it is enough to have one PHP process waiting in that loop for the event. In that case, use some kind of semaphore (a file, or a flag in the DB) to let other processes know that one is already waiting for that event; otherwise you will always eat up all available PHP processes. A sketch of both ideas follows below.
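A minimal sketch of both ideas, assuming a hypothetical event_has_occurred() function that runs the actual MySQL query, and a lock file as the semaphore:
<?php
// Hypothetical check against the database for the event; replace with the real query.
function event_has_occurred() {
    return false;
}

$lock = fopen('/tmp/event-waiter.lock', 'c');

// Only one process gets the lock; the others return immediately
// instead of tying up php-fpm workers.
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    echo "another process is already waiting for the event\n";
    exit;
}

$deadline = time() + 50;                 // quit well before nginx/php-fpm time out
while (!event_has_occurred() && time() < $deadline) {
    sleep(1);                            // avoid hammering the CPU and the database
}

flock($lock, LOCK_UN);
fclose($lock);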