Nginx-ingress worker processes constantly restarting - nginx
I recently upgraded my ingress controller to kubernetes-ingress v1.10.0. The ingresses seem to route traffic correctly, but after checking the pod logs I noticed that a huge number of notices were being generated:
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2748
2021/02/10 09:40:23 [notice] 19#19: worker process 2748 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2745
2021/02/10 09:40:23 [notice] 19#19: worker process 2745 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
W0210 09:40:23.416499 1 listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
W0210 09:40:23.416812 1 listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
W0210 09:40:23.416912 1 listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2735
2021/02/10 09:40:23 [notice] 19#19: worker process 2735 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2737
2021/02/10 09:40:23 [notice] 19#19: worker process 2737 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2742 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2746
2021/02/10 09:40:23 [notice] 19#19: worker process 2746 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2744
2021/02/10 09:40:23 [notice] 19#19: worker process 2744 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2740
2021/02/10 09:40:23 [notice] 19#19: worker process 2740 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2736
2021/02/10 09:40:23 [notice] 19#19: worker process 2736 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2741
2021/02/10 09:40:23 [notice] 19#19: worker process 2734 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2741 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2739
2021/02/10 09:40:23 [notice] 19#19: worker process 2739 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2738
2021/02/10 09:40:23 [notice] 19#19: worker process 2738 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2743
2021/02/10 09:40:23 [notice] 19#19: worker process 2743 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2749
2021/02/10 09:40:23 [notice] 19#19: worker process 2749 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2747
2021/02/10 09:40:23 [notice] 19#19: worker process 2747 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [warn] 2718#2718: *6697105 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/79/0000214796 while reading upstream, client: xxxx, server: xxxx, request: "GET /xxxx HTTP/1.1", upstream: "xxxx", host: "xxxx", referrer: "xxxx"
2021/02/10 09:40:23 [notice] 2769#2769: signal process started
2021/02/10 09:40:23 [notice] 19#19: signal 1 (SIGHUP) received from 2769, reconfiguring
2021/02/10 09:40:23 [notice] 19#19: reconfiguring
2021/02/10 09:40:23 [notice] 19#19: using the "epoll" event method
2021/02/10 09:40:23 [notice] 19#19: start worker processes
2021/02/10 09:40:23 [notice] 19#19: start worker process 2770
2021/02/10 09:40:23 [notice] 19#19: start worker process 2771
2021/02/10 09:40:23 [notice] 19#19: start worker process 2772
2021/02/10 09:40:23 [notice] 19#19: start worker process 2773
2021/02/10 09:40:23 [notice] 19#19: start worker process 2774
2021/02/10 09:40:23 [notice] 19#19: start worker process 2775
2021/02/10 09:40:23 [notice] 19#19: start worker process 2776
2021/02/10 09:40:23 [notice] 19#19: start worker process 2777
2021/02/10 09:40:23 [notice] 19#19: start worker process 2778
2021/02/10 09:40:23 [notice] 19#19: start worker process 2779
2021/02/10 09:40:23 [notice] 19#19: start worker process 2780
2021/02/10 09:40:23 [notice] 19#19: start worker process 2781
2021/02/10 09:40:23 [notice] 19#19: start worker process 2782
2021/02/10 09:40:23 [notice] 19#19: start worker process 2783
2021/02/10 09:40:23 [notice] 19#19: start worker process 2784
2021/02/10 09:40:23 [notice] 19#19: start worker process 2785
90.114.22.230 - - [10/Feb/2021:09:40:23 +0000] "GET /xxxx HTTP/1.1" 200 352910 "xxxx" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0" "-"
2021/02/10 09:40:23 [notice] 2753#2753: gracefully shutting down
2021/02/10 09:40:23 [notice] 2755#2755: gracefully shutting down
2021/02/10 09:40:23 [notice] 2760#2760: gracefully shutting down
2021/02/10 09:40:23 [notice] 2755#2755: exiting
2021/02/10 09:40:23 [notice] 2753#2753: exiting
2021/02/10 09:40:23 [notice] 2762#2762: gracefully shutting down
2021/02/10 09:40:23 [notice] 2760#2760: exiting
2021/02/10 09:40:23 [notice] 2766#2766: gracefully shutting down
2021/02/10 09:40:23 [notice] 2762#2762: exiting
2021/02/10 09:40:23 [notice] 2766#2766: exiting
2021/02/10 09:40:23 [notice] 2759#2759: gracefully shutting down
2021/02/10 09:40:23 [notice] 2759#2759: exiting
2021/02/10 09:40:23 [notice] 2763#2763: gracefully shutting down
2021/02/10 09:40:23 [notice] 2761#2761: gracefully shutting down
2021/02/10 09:40:23 [notice] 2767#2767: gracefully shutting down
2021/02/10 09:40:23 [notice] 2763#2763: exiting
2021/02/10 09:40:23 [notice] 2767#2767: exiting
2021/02/10 09:40:23 [notice] 2761#2761: exiting
2021/02/10 09:40:23 [notice] 2760#2760: exit
2021/02/10 09:40:23 [notice] 2753#2753: exit
2021/02/10 09:40:23 [notice] 2766#2766: exit
2021/02/10 09:40:23 [notice] 2764#2764: gracefully shutting down
2021/02/10 09:40:23 [notice] 2764#2764: exiting
2021/02/10 09:40:23 [notice] 2752#2752: gracefully shutting down
2021/02/10 09:40:23 [notice] 2752#2752: exiting
2021/02/10 09:40:23 [notice] 2763#2763: exit
2021/02/10 09:40:23 [notice] 2762#2762: exit
2021/02/10 09:40:23 [notice] 2764#2764: exit
2021/02/10 09:40:23 [notice] 2759#2759: exit
2021/02/10 09:40:23 [notice] 2755#2755: exit
2021/02/10 09:40:23 [notice] 2752#2752: exit
2021/02/10 09:40:23 [notice] 2767#2767: exit
2021/02/10 09:40:23 [notice] 2761#2761: exit
2021/02/10 09:40:23 [notice] 2758#2758: gracefully shutting down
2021/02/10 09:40:23 [notice] 2758#2758: exiting
2021/02/10 09:40:23 [notice] 2756#2756: gracefully shutting down
2021/02/10 09:40:23 [notice] 2756#2756: exiting
2021/02/10 09:40:23 [notice] 2758#2758: exit
2021/02/10 09:40:23 [notice] 2756#2756: exit
2021/02/10 09:40:23 [notice] 2765#2765: gracefully shutting down
2021/02/10 09:40:23 [notice] 2765#2765: exiting
2021/02/10 09:40:23 [notice] 2757#2757: gracefully shutting down
2021/02/10 09:40:23 [notice] 2757#2757: exiting
2021/02/10 09:40:23 [notice] 2754#2754: gracefully shutting down
2021/02/10 09:40:23 [notice] 2754#2754: exiting
2021/02/10 09:40:23 [notice] 2754#2754: exit
2021/02/10 09:40:23 [notice] 2765#2765: exit
2021/02/10 09:40:23 [notice] 2757#2757: exit
I0210 09:40:23.604803 1 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"xxxx", Name:"xxxx", UID:"82a71705-194e-4919-a7e2-a511d52c1a7a", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"77919848", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for xxxx/xxxx was added or updated
I0210 09:40:23.604873 1 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"xxxx", Name:"xxxx", UID:"10246997-07ae-41e1-b811-0ec630647f3b", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"182677830", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for xxxx/xxxx was added or updated
I0210 09:40:23.605520 1 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"xxxx", Name:"xxxx", UID:"d628825f-1b06-4719-b4b0-4d971b8c0a54", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"182677778", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for xxxx/xxxx was added or updated
I0210 09:40:23.605557 1 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"xxxx", Name:"xxxx", UID:"4b7b1fa1-1d7d-41a5-9d97-5f5aee52ade7", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"182678922", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for xxxx/xxxx was added or updated
I0210 09:40:23.605569 1 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"xxxx", Name:"xxxx", UID:"b86b8b8e-82b9-40d0-b02d-073db557c0e1", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"182678955", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for xxxx/xxxx was added or updated
I0210 09:40:23.605577 1 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"xxxx", Name:"xxxx", UID:"585ccdee-9807-442e-9b4f-7d1a97264216", APIVersion:"networking.k8s.io/v1beta1", ResourceVersion:"182677754", FieldPath:""}): type: 'Normal' reason: 'AddedOrUpdated' Configuration for xxxx/xxxx was added or updated
W0210 09:40:23.614001 1 listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
W0210 09:40:23.614213 1 listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
W0210 09:40:23.614304 1 listers.go:79] can not retrieve list of objects using index : Index with name namespace does not exist
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2763
2021/02/10 09:40:23 [notice] 19#19: worker process 2755 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2763 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2767 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2766
2021/02/10 09:40:23 [notice] 19#19: worker process 2752 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2753 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2766 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2756
2021/02/10 09:40:23 [notice] 19#19: worker process 2756 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2758 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2759 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2760 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2761 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2762 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: worker process 2764 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2754
2021/02/10 09:40:23 [notice] 19#19: worker process 2754 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
2021/02/10 09:40:23 [notice] 19#19: signal 17 (SIGCHLD) received from 2765
2021/02/10 09:40:23 [notice] 19#19: worker process 2765 exited with code 0
2021/02/10 09:40:23 [notice] 19#19: signal 29 (SIGIO) received
This seems to loop forever, and very quickly, on all the pods.
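One rough way to put a number on "very quickly" is to count how many times per second the master process logs a reload. The following is a minimal sketch, assuming the log format shown above; the function name `count_reloads` and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Matches the master-process notice logged once per reload, e.g.
# "2021/02/10 09:40:23 [notice] 19#19: reconfiguring".
# The "signal 1 (SIGHUP) received ..., reconfiguring" line is deliberately
# not matched, so each reload is counted once.
RELOAD_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[notice\] \d+#\d+: reconfiguring$"
)

def count_reloads(lines):
    """Return a Counter mapping a timestamp (1 s resolution) to reload count."""
    counts = Counter()
    for line in lines:
        match = RELOAD_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the log above.
sample = [
    "2021/02/10 09:40:23 [notice] 19#19: signal 1 (SIGHUP) received from 2769, reconfiguring",
    "2021/02/10 09:40:23 [notice] 19#19: reconfiguring",
    "2021/02/10 09:40:23 [notice] 19#19: start worker processes",
]
reloads = count_reloads(sample)
for timestamp, n in sorted(reloads.items()):
    print(timestamp, n)
```

Feeding it the full pod log (for example the output of `kubectl logs`, read line by line) would show whether reloads happen once in a while or several times every second.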
I deployed my controller using these manifests and recreated the default server secret as mentioned in the release note.
The controller arguments are:
args:
  - -nginx-configmaps=$(POD_NAMESPACE)/nginx-config
  - -default-server-tls-secret=$(POD_NAMESPACE)/default-server-secret
  - -global-configuration=$(POD_NAMESPACE)/nginx-configuration
  - -report-ingress-status
  - -enable-prometheus-metrics
  - -enable-snippets
And here is the content of my nginx-config CM:
data:
  client-max-body-size: 50m
  proxy-read-timeout: 5m
  server-tokens: "False"
Any idea what is happening there and how to solve this issue?
Edit:
After some more research I found out that two of my ingresses are constantly being updated:
Name: xxxx
Namespace: xxxx
Address:
Default backend: default-http-backend:80 (<none>)
TLS:
xxxx terminates xxxx
Rules:
Host Path Backends
---- ---- --------
* * default-http-backend:80 (<none>)
Annotations:
ingress.kubernetes.io/ssl-redirect: true
kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"ingress.kubernetes.io/ssl-redirect":"true","kubernetes.io/ingress.class":"nginx","nginx.org/mergeable-ingress-type":"master"},"labels":{"app.kubernetes.io/component":"xxxx","app.kubernetes.io/instance":"xxxx","app.kubernetes.io/name":"xxxx","app.kubernetes.io/part-of":"xxxx","argocd.argoproj.io/instance":"xxxx"},"name":"xxxx","namespace":"xxxx"},"spec":{"rules":[{"host":"xxxx"}],"tls":[{"hosts":["xxxx"],"secretName":"xxxx"}]}}
kubernetes.io/ingress.class: nginx
nginx.org/mergeable-ingress-type: master
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal AddedOrUpdated 3m5s (x2600127 over 6d) nginx-ingress-controller Configuration for xxxx/xxxx was added or updated
Normal AddedOrUpdated 2m12s (x2599793 over 6d) nginx-ingress-controller Configuration for xxxx/xxxx was added or updated
Normal AddedOrUpdated 66s (x2600182 over 6d) nginx-ingress-controller Configuration for xxxx/xxxx was added or updated
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/ssl-redirect: "true"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"ingress.kubernetes.io/ssl-redirect":"true","kubernetes.io/ingress.class":"nginx","nginx.org/mergeable-ingress-type":"master"},"labels":{"app.kubernetes.io/component":"xxxx","app.kubernetes.io/instance":"xxxx","app.kubernetes.io/name":"xxxx","app.kubernetes.io/part-of":"xxxx","argocd.argoproj.io/instance":"xxxx"},"name":"xxxx","namespace":"xxxx"},"spec":{"rules":[{"host":"xxxx"}],"tls":[{"hosts":["xxxx"],"secretName":"xxxx"}]}}
    kubernetes.io/ingress.class: nginx
    nginx.org/mergeable-ingress-type: master
  creationTimestamp: "2021-01-18T09:55:07Z"
  generation: 1
  labels:
    app.kubernetes.io/component: xxxx
    app.kubernetes.io/instance: xxxx
    app.kubernetes.io/name: xxxx
    app.kubernetes.io/part-of: xxxx
    argocd.argoproj.io/instance: xxxx
  name: xxxx
  namespace: xxxx
  resourceVersion: "182677754"
  selfLink: /apis/extensions/v1beta1/namespaces/xxxx/ingresses/xxxx
  uid: 585ccdee-9807-442e-9b4f-7d1a97264216
spec:
  rules:
  - host: xxxx
  tls:
  - hosts:
    - xxxx
    secretName: xxxx
status:
  loadBalancer:
    ingress:
    - {}
My environment is managed by ArgoCD but after checking the logs it doesn't look like the updates are coming from ArgoCD. I wonder if the updates are related to the -report-ingress-status option.
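The Events section above already hints at how fast the updates arrive: each line carries a repeat count and a time span, such as `(x2600127 over 6d)`. Here is a minimal sketch that turns that column into an approximate rate. The helper name `event_rate` is hypothetical, and the result is only a rough estimate because the displayed duration is rounded.

```python
import re

# Seconds per unit for the duration suffixes seen in the output above.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Matches the "(x<count> over <duration>)" part of an event's Age column.
AGE_RE = re.compile(r"\(x(\d+) over ((?:\d+[smhd])+)\)")

def event_rate(age_column):
    """Return approximate events per second, or None if there is no repeat count."""
    match = AGE_RE.search(age_column)
    if match is None:
        return None
    count = int(match.group(1))
    # Sum every "<number><unit>" component, so "1h30m" is handled as well as "6d".
    seconds = sum(
        int(number) * UNIT_SECONDS[unit]
        for number, unit in re.findall(r"(\d+)([smhd])", match.group(2))
    )
    return count / seconds if seconds else None

# Age column copied from the first event line above.
rate = event_rate("3m5s (x2600127 over 6d)")
print(f"about {rate:.1f} updates per second")
```

For the figures shown above this works out to something on the order of five updates per second for a single ingress, which is consistent with NGINX being reloaded almost continuously.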
Edit II:
I removed the -report-ingress-status argument and it didn't change anything.
I don't know the actual root cause, but I deleted all the TLS secrets, certificates and ingresses that were constantly being updated and recreated them. That solved the issue.
Several incidents happened prior to this issue and might be related to it: 2 of my 3 ingress nodes failed, and during the upgrade the wrong CRDs were applied before being quickly fixed.
That's all I can say at the moment, but deleting the resources related to the constantly updated ingresses and recreating them did solve the issue.
Related
NGINX in containerized deployment returns error "getrlimit(RLIMIT_NOFILE): 1048576:1048576"
When we moved to Azure for testing our deployments, NGINX started returning an error that seems to be an OS-level error. The same deployment works well on other cloud platforms, and the OS version is kept uniform across all our testing cloud platforms.
OS version:
Linux version 4.19.0-18-cloud-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.208-1 (2021-09-29)
Docker container error:
2022/03/04 14:42:58 [notice] 14#14: using the "epoll" event method
2022/03/04 14:42:58 [notice] 14#14: nginx/1.21.6
2022/03/04 14:42:58 [notice] 14#14: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
2022/03/04 14:42:58 [notice] 14#14: OS: Linux 4.19.0-18-cloud-amd64
2022/03/04 14:42:58 [notice] 14#14: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2022/03/04 14:42:58 [notice] 14#14: start worker processes
2022/03/04 14:42:58 [notice] 14#14: start worker process 15
2022/03/04 14:42:58 [notice] 14#14: start worker process 16
Please advise.
Kong k8s deployment fails after seemingly innocent eks worker ami upgrade
After upgrading our AWS AMI workers to a new version, our Kong deployment on k8s fails.
kong version: 1.4
old ami version: amazon-eks-node-1.14-v20200423
new ami version: amazon-eks-node-1.14-v20200723
kubernetes version: 1.14
I see that the new AMI comes with a new Docker version, 19.03.06, while the old one ships with 18.09.09. Could this cause the issue?
I can see a lot of signal 9 exits in the Kong pod logs:
2020/08/11 09:00:48 [notice] 1#0: using the "epoll" event method
2020/08/11 09:00:48 [notice] 1#0: openresty/1.15.8.2
2020/08/11 09:00:48 [notice] 1#0: built by gcc 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
2020/08/11 09:00:48 [notice] 1#0: OS: Linux 4.14.181-140.257.amzn2.x86_64
2020/08/11 09:00:48 [notice] 1#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/08/11 09:00:48 [notice] 1#0: start worker processes
2020/08/11 09:00:48 [notice] 1#0: start worker process 38
2020/08/11 09:00:48 [notice] 1#0: start worker process 39
2020/08/11 09:00:48 [notice] 1#0: start worker process 40
2020/08/11 09:00:48 [notice] 1#0: start worker process 41
2020/08/11 09:00:50 [notice] 1#0: signal 17 (SIGCHLD) received from 40
2020/08/11 09:00:50 [alert] 1#0: worker process 40 exited on signal 9
2020/08/11 09:00:50 [notice] 1#0: start worker process 42
2020/08/11 09:00:51 [notice] 1#0: signal 17 (SIGCHLD) received from 39
2020/08/11 09:00:51 [alert] 1#0: worker process 39 exited on signal 9
2020/08/11 09:00:51 [notice] 1#0: start worker process 43
2020/08/11 09:00:52 [notice] 1#0: signal 17 (SIGCHLD) received from 41
2020/08/11 09:00:52 [alert] 1#0: worker process 41 exited on signal 9
2020/08/11 09:00:52 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:52 [notice] 1#0: start worker process 44
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:243: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:269: randomseed(): random seed: 255136921215 for worker nb 0
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events, event=started, pid=38, data=nil
2020/08/11 09:00:48 [notice] 38#0: *1 [lua] cache_warmup.lua:42: cache_warmup_single_entity(): Preloading 'services' into the cache ..., context: init_worker_by_lua*
2020/08/11 09:00:48 [warn] 38#0: *1 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/08/11 09:00:53 [notice] 1#0: signal 17 (SIGCHLD) received from 38
2020/08/11 09:00:53 [alert] 1#0: worker process 38 exited on signal 9
2020/08/11 09:00:53 [notice] 1#0: start worker process 45
2020/08/11 09:00:54 [notice] 1#0: signal 17 (SIGCHLD) received from 42
2020/08/11 09:00:54 [alert] 1#0: worker process 42 exited on signal 9
2020/08/11 09:00:54 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:54 [notice] 1#0: start worker process 46
2020/08/11 09:00:55 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:55 [notice] 1#0: signal 17 (SIGCHLD) received from 43
2020/08/11 09:00:55 [alert] 1#0: worker process 43 exited on signal 9
2020/08/11 09:00:55 [notice] 1#0: start worker process 47
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 44
2020/08/11 09:00:56 [alert] 1#0: worker process 44 exited on signal 9
2020/08/11 09:00:56 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:56 [notice] 1#0: start worker process 48
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 45
2020/08/11 09:00:56 [alert] 1#0: worker process 45 exited on signal 9
2020/08/11 09:00:58 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:58 [notice] 1#0: start worker process 49
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 46
2020/08/11 09:00:59 [alert] 1#0: worker process 46 exited on signal 9
2020/08/11 09:00:59 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:59 [notice] 1#0: start worker process 50
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 47
The only critical message is:
[crit] 235#0: *45 [lua] balancer.lua:749: init(): failed loading initial list of upstreams: failed to get from node cache: could not acquire callback lock: timeout, context: ngx.timer
Looking at kubectl describe pod kong... I see OOMKilled. Could this be a memory issue?
The nofile ulimit on the new node AMI changed to 1048576, a big jump from the previous 65536. This caused memory issues with our current Kong setup, so the deployment failed. Changing the new node's file limit back to the previous value fixed the Kong deployment, although we decided to increase Kong's memory request instead, which also fixes the issue. See the relevant GitHub issue.
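To see which open-file limit a container actually inherits, the limit can be read from inside a process running in it. This is a minimal sketch using Python's standard `resource` module (Unix only). The names `nofile_limits`, `exceeds_old_limit` and `OLD_NODE_LIMIT` are illustrative, and 65536 is simply the previous value quoted above.

```python
import resource

# Previous node limit quoted in the answer above; used only for comparison.
OLD_NODE_LIMIT = 65536

def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def exceeds_old_limit(soft):
    """True when the soft limit is unlimited or above the old node value."""
    return soft == resource.RLIM_INFINITY or soft > OLD_NODE_LIMIT

soft, hard = nofile_limits()
print(f"RLIMIT_NOFILE soft={soft} hard={hard} above_old={exceeds_old_limit(soft)}")
```

Running this in the same image on an old and a new node would confirm whether the limit seen by the process really changed, which is the same value NGINX reports in its `getrlimit(RLIMIT_NOFILE)` notice.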
Nginx cache, redis_pass
I have been banging my head against a wall all day. I am using the following nginx configuration to test something:
location /help {
    set $redis_key "cache:$scheme://$host$request_uri";
    default_type text/html;
    redis_pass 127.0.0.1:6379;
    error_page 404 = @upstream;
}
There is a key and value inside my redis instance for cache:$scheme.... (in my case cache:http://localhost/help). I know they exist because I can monitor redis-cli for the nginx redis request, copy the "get" "cache:http://localhost/help", paste it into another redis-cli window and get the expected response.
The problem is with nginx: it's not getting the response. Again, I can see it connect from inside redis-cli -> monitor, and I know the key and value exist. From the nginx error log I can see this:
2016/04/08 16:52:42 [notice] 9304#0: worker process 6328 exited with code 0
2016/04/08 16:52:42 [notice] 9304#0: signal 29 (SIGIO) received
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::append
2016/04/08 16:52:49 [notice] 9304#0: signal 17 (SIGCHLD) received
2016/04/08 16:52:49 [alert] 9304#0: worker process 7328 exited on signal 6 (core dumped)
2016/04/08 16:52:49 [notice] 9304#0: start worker process 7516
2016/04/08 16:52:49 [notice] 9304#0: signal 29 (SIGIO) received
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::append
2016/04/08 16:52:50 [notice] 9304#0: signal 17 (SIGCHLD) received
2016/04/08 16:52:50 [alert] 9304#0: worker process 7335 exited on signal 6 (core dumped)
2016/04/08 16:52:50 [notice] 9304#0: start worker process 7544
2016/04/08 16:52:50 [notice] 9304#0: signal 29 (SIGIO) received
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::append
Has this happened to anyone else, or can someone kick me in the right direction? Thanks in advance.
For anyone reading this in the future. Firstly, hello from the past! Secondly, it turns out the nginx pagespeed module and this kind of caching are incompatible.
How to solve exit signal Segmentation fault (11)
After I migrated WordPress from a shared host to my own VPS, I got the unpleasant surprise that half of the back-end pages on my websites rendered a "No data received ERR_EMPTY_RESPONSE" error. Determined to find out what caused the problem, I started troubleshooting.
7 of my 8 websites were affected, all running WordPress 4.1.5. Upgrading to 4.2.2 did not fix the problem. The only unaffected website is an old one running WordPress 3.3.1. Upgrading this website to 4.2.2 results in the same errors. When I try a fresh WordPress install, the same error pops up after step one (both when installing 4.2.2 and 3.3.1). The 7 sites run on 4 different themes, and I tried disabling all plugins, still no luck.
I had a look at the error logs; I will copy a fragment here since it might provide some useful info. I've been googling all these lines but can't find the solution yet.
[Sun Jun 21 10:22:41 2015] [notice] caught SIGTERM, shutting down
[Sun Jun 21 10:22:42 2015] [notice] SSL FIPS mode disabled
[Sun Jun 21 10:22:42 2015] [warn] RSA server certificate CommonName (CN) `localhost' does NOT match server name!?
[Sun Jun 21 10:22:42 2015] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Sun Jun 21 10:22:43 2015] [notice] SSL FIPS mode disabled
[Sun Jun 21 10:22:43 2015] [warn] RSA server certificate CommonName (CN) `localhost' does NOT match server name!?
[Sun Jun 21 10:22:43 2015] [notice] Apache/2.2.22 (Unix) mod_ssl/2.2.22 OpenSSL/1.0.0-fips DAV/2 PHP/5.2.17 configured -- resuming normal operations
[Sun Jun 21 10:22:59 2015] [notice] child pid 3943 exit signal Segmentation fault (11)
[Sun Jun 21 10:23:00 2015] [notice] child pid 3944 exit signal Segmentation fault (11)
[Sun Jun 21 10:23:02 2015] [notice] child pid 3945 exit signal Segmentation fault (11)
[Sun Jun 21 10:23:03 2015] [notice] child pid 3942 exit signal Segmentation fault (11)
[Sun Jun 21 10:23:04 2015] [notice] child pid 4080 exit signal Segmentation fault (11)
[Sun Jun 21 10:23:05 2015] [notice] child pid 3946 exit signal Segmentation fault (11)
[Sun Jun 21 10:23:06 2015] [notice] child pid 4083 exit signal Segmentation fault (11)
[Sun Jun 21 10:23:07 2015] [notice] child pid 4082 exit signal Segmentation fault (11)
NGINX + PHP5-FPM segfaults under high load
I have been dealing with this problem all day and it is driving me insane. All Google results and searches here lead to dead ends. I hope someone can work with me to provide a solution for myself and future victims. Here we go.
I am running a very popular website with over 3M page views a day. On average that is 34 page views per second, but more realistically, during peak hours, it gets to over 300 page views per second. Think of these as requests.
I am running an Ubuntu 10.04 64-bit server with 2 E5620 CPUs, 12GB RAM, and a Micron P300 6Gb/s SSD. During peak hours the CPU and memory load is average (20-30% CPU and half of the memory is used).
The software that powers this site is: NGINX, MySQL, PHP5-FPM, PHP-APC, and Memcached.
OK, now finally the meat of the post: here are my error logs. There are a bunch of these errors logged.
/var/log/php5-fpm
Jul 20 14:49:47.289895 [NOTICE] fpm is running, pid 29373
Jul 20 14:49:47.337092 [NOTICE] ready to handle connections
Jul 20 14:51:23.957504 [ERROR] [pool www] unable to retrieve process activity of one or more child(ren). Will try again later.
Jul 20 14:51:41.846439 [WARNING] [pool www] child 29534 exited with code 1 after 114.518174 seconds from start
Jul 20 14:51:41.846797 [NOTICE] [pool www] child 29597 started
Jul 20 14:51:41.896653 [WARNING] [pool www] child 29408 exited on signal 11 SIGSEGV after 114.596706 seconds from start
Jul 20 14:51:41.897178 [NOTICE] [pool www] child 29598 started
Jul 20 14:51:41.903286 [WARNING] [pool www] child 29398 exited with code 1 after 114.605761 seconds from start
Jul 20 14:51:41.903719 [NOTICE] [pool www] child 29600 started
Jul 20 14:51:41.907816 [WARNING] [pool www] child 29437 exited with code 1 after 114.601417 seconds from start
Jul 20 14:51:41.908253 [NOTICE] [pool www] child 29601 started
Jul 20 14:51:41.916002 [WARNING] [pool www] child 29513 exited with code 1 after 114.592514 seconds from start
Jul 20 14:51:41.916501 [NOTICE] [pool www] child 29602 started
Jul 20 14:51:41.916558 [WARNING] [pool www] child 29494 exited on signal 11 SIGSEGV after 114.597355 seconds from start
Jul 20 14:51:41.916873 [NOTICE] [pool www] child 29603 started
Jul 20 14:51:41.921389 [WARNING] [pool www] child 29502 exited with code 1 after 114.600405 seconds from start
/var/log/nginx/error.log
2011/07/20 15:48:42 [error] 29583#0: *569743 readv() failed (104: Connection reset by peer) while reading upstream, client: 77.223.197.193, server: domain.com, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29578#0: *571695 readv() failed (104: Connection reset by peer) while reading upstream, client: 150.70.64.196, server: domain.com, request: "GET /page HTTP/1.0", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29581#0: *571050 readv() failed (104: Connection reset by peer) while reading upstream, client: 110.136.157.66, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29581#0: *564892 readv() failed (104: Connection reset by peer) while reading upstream, client: 110.136.161.214, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29585#0: *456171 readv() failed (104: Connection reset by peer) while reading upstream, client: 93.223.33.135, server: domain.com, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29585#0: *471192 readv() failed (104: Connection reset by peer) while reading upstream, client: 74.90.33.142, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
2011/07/20 15:48:42 [error] 29580#0: *570132 readv() failed (104: Connection reset by peer) while reading upstream, client: 180.246.182.191, server: domain.com, request: "GET /page HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
Finally, I want to point out that I did try disabling PHP-APC to see if it was a bug in the opcode cacher, but the segfaults persisted. I also have PHP5-SUHOSIN installed and I disabled it too, but the errors still keep happening.
This issue just happened to me. PHP5-FPM was segfaulting on most of its children. In my case, we had 0 bytes available on the hard disk. A quick log shredding stopped the segfaults.
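A full disk is cheap to rule out before digging into PHP itself. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the helper names and the `min_free_bytes` threshold are illustrative choices, not part of any PHP-FPM tooling.

```python
import shutil

def free_bytes(path="/"):
    """Return the number of free bytes on the filesystem containing `path`."""
    return shutil.disk_usage(path).free

def is_disk_full(path="/", min_free_bytes=1):
    """True when fewer than `min_free_bytes` bytes remain on that filesystem."""
    return free_bytes(path) < min_free_bytes

# Check the root filesystem; in practice the log directory's filesystem
# (for example the one holding /var/log) is the one worth checking.
print(f"free bytes on /: {free_bytes('/')}, full: {is_disk_full('/')}")
```

Pointing `path` at the partition that holds the PHP-FPM and nginx logs is the relevant check here, since that is where the space ran out in the case described above.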
2011/07/20 15:48:42 [error] 29583#0: *569743 readv() failed (104: Connection reset by peer) while reading upstream, client: 77.223.197.193, server: domain.com, request: "GET /favicon.ico HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.domain.com"
That's just some problem with your config for your upstream server, a router or client reset, or nginx dropping the request. Running a site at 3 times the load you described, I never saw that message. The requested resource isn't even handed to a PHP-FPM process; it's a favicon.
As for the PHP-FPM messages, the children seem to stop after a 114-second limit. Is that a limit set by your php.ini file? Segfaults in PHP often occur under high memory use: your PHP scripts could leak memory and will eventually reach the memory limit. Having each PHP-FPM process serve fewer requests helps in dealing with memory leaks.
See my answer here that's related to your question (about nginx + Magento and high load): NGINX-FPM configuration settings for magento. It's not a direct answer per se, but it may help you configure your nginx + PHP-FPM to help eliminate the faults.
You are probably using suhosin. Disable the suhosin.ini under /etc/php5/fpm/conf.d and restart the php5-fpm service. Check the suhosin version and try installing another one.