GKE persistent-disk container randomly dies and the volume won't remount because an unmount is still pending

For some reason our Google containers sometimes restart and we cannot find the cause. This would be fine, I guess, if they started up quickly again, which is sadly not the case.
The issue seems to be that the persistent disk volume isn't unmounted quickly enough, and when the restarted container tries to mount it, it just gets stuck and you have to kill it and start it manually.
Is it possible to configure the containers to wait for the unmount before re-mounting persistent disks? What would be the correct solution to tackle this problem?
kubelet gke-xxx-xxx-xx Warning FailedMount
Unable to mount volumes for pod "xxx-xxx-xxx-hdki0_default(UUID)":
timeout expired waiting for volumes to attach/mount
for pod "xxx-xxx-xxx-hdki0"/"default".
list of unattached/unmounted volumes=[xxx-volume]
kubelet gke-xxx-xxx-xx Warning FailedSync Error syncing pod, skipping:
timeout expired waiting for volumes to attach/mount for pod
"xxx-xxx-xxx-hdki0"/"default".
list of unattached/unmounted volumes=[xxx-volume]
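Not part of the original report, but a rough sketch of how this situation is usually inspected. The disk name and zone below are placeholders, and the commands assume kubectl and gcloud are configured against the affected cluster and project:
# Recent events for the stuck pod (FailedMount / FailedAttachVolume details)
kubectl describe pod xxx-xxx-xxx-hdki0
kubectl get events --sort-by=.metadata.creationTimestamp | grep -i volume
# Check whether the persistent disk is still attached to the old node;
# the "users" field in the output lists the instances currently holding the disk.
# (disk name and zone are hypothetical placeholders)
gcloud compute disks describe xxx-volume-disk --zone us-central1-a
If the disk still lists the previous node under users long after the pod has been rescheduled, that matches the "unmount pending" behaviour described above.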

Related

Airflow scheduler does not start after Google Composer update

I have composer-2.0.25-airflow-2.2.5. I need to update the number of workers and the environment variables in an environment that is already running. After updating the environment, the scheduler monitoring is unhealthy and the pod keeps restarting on its own. Sometimes CrashLoopBackOff appears, which indicates that a container is repeatedly crashing after restarting.
I looked at the info of the pod where I saw the scheduler restarts.
I need the environment to continue running after the updates.
Do you have any idea about this issue?
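For reference, a hedged sketch of the kind of update being described. The environment name, location, variable, and pod name are hypothetical; the worker-scaling flags assume a Composer 2 environment, so check gcloud composer environments update --help for the exact flags available in your gcloud version:
# Update environment variables and worker scaling on a running environment
gcloud composer environments update my-composer-env \
  --location us-central1 \
  --update-env-variables=MY_VAR=my_value \
  --min-workers=2 --max-workers=6
# Inspect why the scheduler pod keeps restarting (last state, exit code, CrashLoopBackOff reason);
# assumes you have fetched credentials for the environment's GKE cluster
kubectl describe pod <scheduler-pod-name> -n <composer-namespace>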

Airflow 2 - error MySQL server has gone away

I am running Airflow with a MariaDB backend, and periodically, when a DAG task is being scheduled, I notice the following error in the Airflow worker:
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2006, 'MySQL server has gone away').
I am not sure whether the issue occurs due to a misconfiguration of Airflow, or because the backend is MariaDB, which, as far as I have seen, is not a recommended database.
Also, in the MariaDB logs, I see the following warning repeating almost every minute:
[Warning] Aborted connection 305627 to db: 'airflow' user: 'airflow' host: 'hostname' (Got an error reading communication packets)
I've seen some similar issues mentioned, but whatever I have tried so far hasn't helped.
The question is: should I change the database to MySQL, or does some configuration need to be done on MariaDB's end?
Airflow v2.0.1
MariaDB 10.5.5
SQLAlchemy 1.3.23
Hard to say - you need to look for the reason why your DB connection gets aborted. MariaDB might work for quick testing with a single scheduler, but there is some reason why your connection to the DB gets disconnected.
There are a few things you can do (see the sketch after this list):
Airflow has a db check CLI command; you can run it to test whether the DB configuration is working - maybe the errors you see will be obvious when you try.
Airflow also has another useful command, db shell - it allows you to connect to the DB and run SQL queries, for example. This might tell you whether your connection is "stable": you can connect, run some queries, and see if the connection gets interrupted in the meantime.
Look at more logs and at your network connectivity to see if you have problems there.
Finally, check if you have enough resources to run Airflow + the DB. Often things like that happen when you do not have enough memory, for example. Airflow + DB requires at least 4 GB of RAM in my experience (depending on the DB configuration), and if you are on Mac or Windows and using Docker, the Docker VM by default has less memory than that available, so you need to increase it.
Look at other resources - disk space, memory, number of connections, etc.; all of these can be your problem.
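A minimal sketch of the checks described above, assuming Airflow 2.x with the CLI available on the machine that hosts the scheduler/worker:
# Verify that Airflow can reach the metadata database at all
airflow db check
# Open an interactive shell against the metadata DB; keep it open and run a few
# simple queries to see whether the connection gets dropped mid-session
airflow db shell
# Quick look at available memory, since Airflow + DB want roughly 4 GB
free -h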

Troubleshoot AWS Fargate healthcheck for spring actuator

I have a Spring Boot application with the /health endpoint accessible, deployed in AWS ECS Fargate. Sometimes the container is stopped with a Task failed container health checks message. Sometimes it happens once a day, sometimes once a week; maybe it depends on the load. This is the health check command specified in the Task Definition:
CMD-SHELL,curl -f http://localhost/actuator/health || exit 1
My question is how to troubleshoot what AWS receives when the health check fails.
In case anyone else lands here because of failing container health checks (not the same as ELB health checks), AWS provides some basic advice:
Check that the command works from inside the container. In my case I had not installed curl in the container image, but when I tested it from outside the container it worked fine, which fooled me into thinking it was working.
Check the task logs in CloudWatch
If the checks are only failing sometimes (especially under load), you can try increasing the timeout (see the example settings below), but also check the task metrics (memory and CPU usage). Garbage collection can cause the task to pause, and if all the vCPUs are busy handling other requests, the health check may be delayed, so you may need to allocate more memory and/or vCPUs to the task.
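For illustration, a hedged example of the health check portion of a container definition with more forgiving timings; the numbers are placeholders to adjust for your own load, not recommendations:
"healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost/actuator/health || exit 1"],
    "interval": 30,
    "timeout": 10,
    "retries": 3,
    "startPeriod": 120
}
interval, timeout, and startPeriod are in seconds and retries is a count; raising timeout and retries gives GC pauses and busy vCPUs more slack before ECS marks the container unhealthy.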
Thanks #John Velonis,
I don't have enough reputation to comment on your answer, so I am posting this as a different answer.
In my case, the ECS container kept getting UNKNOWN status from the ECS cluster, but I could access the health check successfully. When I read this post and checked my base image, which is node:14.19.1-alpine3.14, it turned out it doesn't have the curl command.
So I had to install it in the Dockerfile:
RUN apk --no-cache add curl

Jenkins spikes the CPU usage up to 100%

I have a Jenkins master with 3 Docker slaves and 2 VM slaves. Jenkins is installed as a service on Red Hat Linux. The CPU utilization sometimes goes up to 100%, and I then have to reboot the box. When I check the processes, I can see a main master Jenkins process and several other child Jenkins processes (exact replicas of the master process) that are hung and causing the spike (confirmed through New Relic).
I have been trying to reproduce this issue, but have been unsuccessful so far.
Below are my queries:
I know the previous process ID; can I get some logs or dumps related to it after the server restart?
Is there a better approach to troubleshooting this, so that I can narrow down the issue?
At this point I cannot tell where these child processes are being spawned from or how to find the culprit.
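Not from the original question, but a hedged sketch of the kind of diagnostics being asked about, assuming Jenkins runs as the jenkins service under systemd with the default RPM log path, and <pid> standing in for the busy Java process:
# Logs that survive a restart: service journal plus Jenkins' own log
journalctl -u jenkins --since "1 hour ago"
tail -n 500 /var/log/jenkins/jenkins.log
# While the spike is happening: which JVM threads are burning CPU,
# and a thread dump to match those thread IDs against stack traces (requires a JDK)
top -H -p <pid>
jstack <pid> > /tmp/jenkins-threaddump.txt
# Parent/child relationship of the extra Jenkins processes
ps -ef --forest | grep -i jenkins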

Airflow 504 gateway time-out

Many times, when I try to open the tree view or the task duration page of some DAGs in the UI, I get a 504 gateway time-out error.
Sometimes after that I can't even open the page with the list of DAGs.
Do you know where this problem could come from?
The CPU and memory of the machine running Airflow seem to be fine and I use RDS for the metadata.
Thanks!
I've experienced this before as well. I believe it's caused by an HTTP request that takes longer than expected for the webserver's gunicorn worker to fulfill. For example, if you set the DAG tree view to a high setting like 365 DAG runs for a DAG with a lot of tasks, you may be able to reproduce this consistently.
Can you try bumping up the timeout settings on the webserver to see if it makes a difference?
First, try increasing web_server_worker_timeout (default = 120 seconds) under the [webserver] group.
If that doesn't resolve it, you might also try increasing web_server_master_timeout under the same group.
Another technique to try is switching the webserver worker_class (default = sync) to eventlet or gevent; an example configuration is sketched at the end of this answer.
Reference: https://github.com/apache/incubator-airflow/blob/c27098b8d31fee7177f37108a6c2fb7c7ad37170/airflow/config_templates/default_airflow.cfg#L225-L229
Note that the alternative worker classes require installing Airflow with the async extras like:
pip install apache-airflow[async]
You can find more info about gunicorn worker timeouts in this question: How to resolve the gunicorn critical worker timeout error?.
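As an illustration of the settings mentioned above, a hedged airflow.cfg sketch; the values are placeholders rather than recommendations:
[webserver]
# Seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 300
# Seconds the webserver waits before killing a gunicorn master that doesn't respond
web_server_master_timeout = 300
# Switch from the default sync workers to an async worker class
# (requires the async extras mentioned above)
worker_class = gevent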
