Nginx Status Code 499 using Google Cloud Kubernetes TCP Load Balancer - nginx

We are running on GKE using a public-facing Nginx Ingress Controller exposed under a TCP Load Balancer which is automatically configured by Kubernetes.
The problem is that 0.05% of our requests have status code 499 (an Nginx-specific status code which means that the client closed the connection before the response was sent). Our P99 latency on average is always below 100ms.

This error code 499 relates to the client's browser closing the connection before a response is sent from the backends.

As per DerSkythe's answer.
My problem is solved by adding the following in the config map.
apiVersion: v1
kind: ConfigMap
data:
  http-snippet: |
    proxy_ignore_client_abort on;
See http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_ignore_client_abort
After turning this on, I have almost zero 499 errors!
I highly recommend trying this configuration if you are encountering the same problem.

Related

Nginx reports "upstream connection timeout"

Description:
The k8s nginx-ingress-controllers are exposed through a LoadBalancer-type Service (implemented by MetalLB) with IP address 192.168.1.254. Another nginx cluster sits in front of the k8s cluster, and it has only one upstream, which is 192.168.1.254 (the LB IP address). The request flow is: client -> nginx cluster -> nginx-ingress-controllers -> services.
Question:
Occasionally, for a very small number of requests, the nginx cluster reports "upstream(192.168.1.254) time out" and the client finally gets a 504 timeout from nginx.
But when I removed the nginx cluster and switched the request flow to client -> nginx-ingress-controllers -> services, it worked well and the client didn't get 504 timeouts any more. I am sure the network between the nginx cluster and the nginx ingress controller works well.
Most requests are handled by the nginx cluster and return status 200. I have no idea why a few requests report "upstream time out" and return status 504.
[Images: system architecture; nginx cluster timeout report; tcpdump packet trace]
That's most likely slow file uploads (the requests you've shown are all POSTs), which can't complete within the timeout.
You can set a greater timeout value for application paths where uploads are possible. If you are using an ingress controller, you'd better create a separate ingress object for that. You can manage timeouts with these annotations, for example:
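annotations:
  nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
These two annotations raise the proxy send and read timeouts to 5 minutes. The ingress-nginx timeout annotations take unitless values in seconds, so the value is written as "300" rather than 300s.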
If you are configuring nginx manually, you can set limits with proxy_read_timeout and proxy_send_timeout.
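The "separate ingress object" suggestion could look roughly like the sketch below. The object name, host, path and backend service are hypothetical and are not taken from the question; only the two timeout annotations come from the answer.

```yaml
# Sketch only: name, host, path and backend service are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: upload-ingress
  annotations:
    # Unitless values, interpreted by ingress-nginx as seconds.
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          # Only the upload path gets the longer timeouts.
          - path: /upload
            pathType: Prefix
            backend:
              service:
                name: upload-service
                port:
                  number: 80
```

Keeping the long timeouts on a dedicated object means the remaining paths, served by another Ingress, keep the controller's default timeouts.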

How to fix catch-22 with GCLB and Wordpress returning 301

I have set up a Kubernetes cluster on GKE, installed the stable/wordpress Helm chart, and added an Ingress with an SSL certificate. But now the Google load balancer reports that my service is unhealthy. This is caused by the WordPress pod returning a 301 on the health check because it wants to enforce HTTPS, which is good. But the Google load balancer refuses to send an x-forwarded-proto: https header, so the pod thinks the health check was done over HTTP. How can I work around this?
I have tried to add an .htaccess which always returns 200 for the GoogleHC User-agent but this is not possible with the helm chart which overrides the .htaccess after start-up.
Also see: https://github.com/kubernetes/ingress-gce/issues/937 and https://github.com/helm/charts/issues/18779
WAY : 1
If you are using a Kubernetes cluster on GKE, you can use an Ingress, which will create the load balancer for you indirectly.
You can store the SSL certificate inside a secret and apply it to the ingress. For SSL you can also choose another approach and install cert-manager on GKE.
If you want to set up nginx-ingress with cert-manager, you can also follow this guide:
https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nginx-ingress-with-cert-manager-on-digitalocean-kubernetes
WAY : 2
Edit the helm chart locally and add liveness and readiness probes to the deployment, so that the WordPress health check is done over HTTP only.
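As a rough sketch of what such probes might look like in the chart's Deployment template: the path, port and timing values below are assumptions, not values taken from the chart, and the chosen endpoint has to answer 200 over plain HTTP for this to help.

```yaml
# Fragment of the WordPress container spec (sketch only).
# Path, port and delays are assumptions; pick an endpoint that
# returns 200 over plain HTTP in your installation.
livenessProbe:
  httpGet:
    path: /wp-login.php
    port: 80
    scheme: HTTP
  initialDelaySeconds: 120
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /wp-login.php
    port: 80
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 10
```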
Update :
To add x-forwarded-proto in ingress you can use this annotation
nginx.ingress.kubernetes.io/server-snippet: |
  location /service {
    proxy_set_header X-Forwarded-Proto https;
  }
As the HTTPS load balancer terminates the client SSL/TLS session at the LB, you would need to configure HTTPS between the load balancer and your application (WordPress). Health checks use HTTP by default; to use HTTPS health checks with your backend services, the backend services would also require their own SSL/TLS certificate (see #4 of HTTP load balancing, which HTTPS load balancing inherits). To make the backend certificates simpler to configure, you can use self-signed certificates, which do not interfere with any client <-> load balancer encryption as the client session is terminated at the LB.
You can of course use HTTP health checks (less configuring!) for your backend(s), this will not cause any client traffic encryption issues, as it only affects the health check and not the data being sent to your application.
Why do you need https between Load Balancer and Wordpress in the first place? Wouldn't it be enough to have https on Load Balancer frontend side(between LB and outside world)?
Do you have SSL termination done twice?
This is what I did when I was migrating my Wordpress site to GKE:
Removed all WordPress plugins related to https/ssl/tls. Luckily for me, it didn't even require any DB changes.
Added a Google-managed certificate, which is very easy to do: GKE even has a separate resource definition for a certificate. On top of that, you just need to update your DNS records:
apiVersion: networking.gke.io/v1beta2
kind: ManagedCertificate
metadata:
  name: my-certificate
  namespace: prod
spec:
  domains:
    # Wildcard domains are not supported (https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs).
    - example.com
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: prod-ingress
  namespace: prod
  annotations:
    kubernetes.io/ingress.allow-http: "false"
    kubernetes.io/ingress.global-static-ip-name: load-balancer-ip
    networking.gke.io/managed-certificates: my-certificate
I realize you have Helm on top of it, but there's always a way to edit the chart or its configs/params.

Client never receives response from service on kubernetes

I have the following configuration:
A kubernetes cluster based on k3s, deployed on a private cloud.
A service runs on this cluster behind an nginx-based ingress. The service's task is to read a file sent by the client, and upload it to an S3 database.
I have a client (running on a docker container) that sends https requests to upload files to this service.
For each of those requests, it is expected that the server takes about 2 to 3 minutes to respond.
My problem is that some responses issued by our service don't reach the client (the client times out after waiting 15 minutes), even though the client itself successfully sent its HTTP request, and the service successfully processed it and responded. In particular, I can see from the nginx logs that a 200 response is returned by our service, yet the client does not receive this response.
Here is an example of nginx log from which I deduce our service responded 200:
10.42.0.0 - [10.42.0.0] - - [10/Oct/2019:09:55:12 +0000] "POST /api/upload/ot_35121G3118C_1.las HTTP/1.1" 200 0 "-" "reqwest/0.9.20" 440258210 91.771 [file-upload-service-8000] [] 10.42.0.6:8000 0 15.368 200 583d70db408b6be596f5012e7893a3c3
For example, I let the client perform requests continuously for 24 hours (waiting for the server to respond before issuing a new request), and about one or two requests per hour have this problem while the 18 others behave as expected.
Because nginx tells me a 200 response was returned by our service, it feels like the response was lost somewhere between the nginx ingress and the client. Any idea about what causes this problem? Is there a way for me to debug this?
EDIT:
To clarify, the exact ingress controller I'm using is nginx-ingress ingress controller, deployed with helm.
Also, the failure rate is completely random. The tendency is 1 or 2 per hour, but it can sometimes be more or less. In addition it does not seem to be correlated to the size of the file to upload nor to the number of requests that succeeded so far.

Nginx ingress 504 gateway timeout on EKS with NLB connected to Nginx ingress

We are using an NLB in AWS connected to our EKS cluster via an nginx ingress controller. Some of our requests get a random 504 gateway timeout.
We think we debugged the problem to our nginx ingress.
Based on some Stack Overflow recommendations, we played around with Connection headers.
1) We set Connection "close"; this had no effect.
2) We set Connection "keep-alive"; again, no effect.
We also noticed another behavior with our proxy_read_timeout: when it was 60 seconds, our request from the browser would be fulfilled at 60.xx seconds. When we reduced it to 30 it became 30.xx, and with 20 it became 20.xx.
We went down to 1 but still get random 504 gateway timeouts, and we do not understand why proxy_read_timeout has this behavior in our environment.
We want to understand what the effect of proxy_read_timeout is and why we get the above behavior. Also, is there a way to set Connection "" on our nginx ingress? (We are not able to do this via nginx.ingress.kubernetes.io/connection-proxy-header: "".)
Thanks in advance!
We think our issue was related to this:
https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html#loopback-timeout
We're using an internal nlb with our nginx ingress controller, with targets registered by instance ID. We found that the 504 timeouts and the X second waits were only occurring on applications that were sharing a node with one of our ingress controller replicas. We used a combination of nodeSelectors, labels, taints, and tolerations to force the ingress controllers onto their own node, and it appears to have eliminated the timeouts.
We also changed our externalTrafficPolicy setting to Local.
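The two changes described in this answer might be sketched as follows. The Service name, selector labels, node label and taint key are hypothetical; only externalTrafficPolicy: Local and the general use of nodeSelector and tolerations come from the answer.

```yaml
# Sketch only: names, labels and the taint key are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
spec:
  type: LoadBalancer
  # Only route to nodes that actually run a controller pod.
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: https
      port: 443
      targetPort: https
---
# Fragment of the controller Deployment's pod template (not a complete manifest).
spec:
  template:
    spec:
      # Schedule controller pods only on nodes labelled for ingress.
      nodeSelector:
        role: ingress
      # Allow them onto nodes tainted to keep application pods away.
      tolerations:
        - key: dedicated
          operator: Equal
          value: ingress
          effect: NoSchedule
```

Under these assumptions the dedicated nodes would carry the label role=ingress and a taint such as dedicated=ingress:NoSchedule, so application pods without the toleration never share a node with a controller replica.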

Nginx server behind Nginx-Ingress controller

We decided to move our apps from Service:LoadBalancer to Ingress, and I've chosen the Nginx Ingress Controller, as I'm familiar with it and because it's one of the most popular controllers in the Kubernetes world.
Previously we had an Nginx => uWSGI combination behind an ELB. We compile nginx from source, as we need some 3rd-party modules and Lua support.
Before: ELB => Nginx Server => UWSGI
After: ELB => Nginx Ingress (Load Balancer) => Nginx (Server) => UWSGI
My question is: is it okay to have 2 nginx in a proxy chain?
I understand that one plays the role of LoadBalancer, and another is a server itself. But for me it comes with a pain, like if I change some options in Server nginx.conf, like increase the size of client body to 8MB, I should do the same on Nginx-Ingress. Also I'm wondering how to set timeouts: as there is a timeout between ingress=>server and server=>uwsgi, and in general how to tune the performance while having 3 proxies before request hits the app?
Is it a good practice to remove Nginx Server, so Ingress Controller acts like a server and loadbalancer at the same time? What about 3rd party modules that we use?
There's nothing wrong in principle with having 2 or more nginx in a proxy chain, other than the extra complexity, as alluded to in the question and described below.
It is a pain to maintain consistent configuration across multiple proxies, and in particular to have upstream configuration bleed into ingress. It can get very complicated when the same ingress serves multiple upstreams each with different traffic requirements. But this is often nevertheless unavoidable.
Each hop will have its own distinct timeout and retry configuration, and managing them can be complicated, especially the downstream timeout when upstream has retries. One can wind up with very strange failure patterns.
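For the ingress hop, the settings mentioned in the question can be mirrored with per-Ingress annotations. The sketch below assumes the ingress-nginx controller; the 8m body size follows the question's 8MB example, and the timeout values are placeholders that would need to be chosen to fit the server => uwsgi timeouts behind them.

```yaml
# Fragment of an Ingress object (sketch only; timeout values are placeholders).
metadata:
  annotations:
    # Keep in step with client_max_body_size in the upstream nginx.conf.
    nginx.ingress.kubernetes.io/proxy-body-size: "8m"
    # Unitless values in seconds for the ingress => server hop.
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
```

One common rule of thumb, not stated in the answer, is to make the outer hop's read timeout at least as long as the inner hop's timeout multiplied by its retry count, so the outer proxy does not give up while the inner one is still retrying.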
It is not a good idea to bundle an application with an ingress controller. Ingress is about offering a stable entry point into the cluster for out-of-cluster traffic, and distributing that traffic to multiple upstream applications in the cluster. If there is only one upstream application, one really does not need ingress, so if possible much better to just expose it as a Service, either using NodePort or LoadBalancer, depending on circumstance.

Resources