Nginx reports "upstream connection timeout" - nginx

Description:
The Kubernetes nginx-ingress-controllers are exposed through a Service of type LoadBalancer (implemented by MetalLB) with the IP address 192.168.1.254. Another nginx cluster sits in front of the Kubernetes cluster, and its only upstream is 192.168.1.254 (the load balancer IP). The request flow is: client -> nginx cluster -> nginx-ingress-controllers -> services.
Question:
Sometimes the nginx cluster reports a small number of "upstream (192.168.1.254) timed out" errors, and in those cases the client ends up getting a 504 timeout from nginx.
But when I removed the nginx cluster and switched the request flow to client -> nginx-ingress-controllers -> services, everything worked and the client no longer got 504 timeouts. I am sure the network between the nginx cluster and the nginx ingress controllers works well.
Most requests are handled by the nginx cluster and return status 200. I have no idea why a few requests report "upstream timed out" and return status 504.
[Attachments: system architecture diagram, nginx cluster timeout log, tcpdump packet trace]

That's most likely slow file uploads (the requests you've shown are all POSTs) that take longer than the configured proxy timeouts allow.
You can set a larger timeout value for the application paths where uploads can happen. If you are using an ingress controller, it is better to create a separate Ingress object for those paths. You can manage the timeouts with these annotations, for example:
annotations:
  nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
These two annotations raise the proxy send and read timeouts to 300 seconds (5 minutes), which effectively bounds how long an upload may take.
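For reference, a minimal sketch of what such a separate Ingress object could look like; the host, path, and backend Service name below are hypothetical placeholders, not taken from the question:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: upload-timeouts                                   # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: upload.example.com                              # hypothetical host
    http:
      paths:
      - path: /upload                                     # hypothetical upload path
        pathType: Prefix
        backend:
          service:
            name: upload-service                          # hypothetical backend Service
            port:
              number: 8000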
If you are configuring nginx manually, you can set the same timeouts with the proxy_read_timeout and proxy_send_timeout directives.
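And a minimal sketch of the manual nginx equivalent, assuming a hypothetical upstream that points at the load balancer IP from the question:

upstream k8s_ingress {
    server 192.168.1.254:80;          # the MetalLB load balancer IP from the question
}

server {
    listen 80;

    location / {
        proxy_pass http://k8s_ingress;
        proxy_send_timeout 300s;      # time allowed to send the request body upstream
        proxy_read_timeout 300s;      # time allowed to wait for the upstream response
    }
}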

Related

Nginx Status Code 499 using Google Cloud Kubernetes TCP Load Balancer

We are running on GKE using a public-facing Nginx Ingress Controller exposed under a TCP Load Balancer which is automatically configured by Kubernetes.
The problem is that 0.05% of our requests get status code 499 (a nonstandard nginx status code which means the client cancelled the request). Our P99 latency is on average always below 100 ms.
Error code 499 means the client's browser closed the connection before a response was sent by the backends.
As per DerSkythe's answer.
My problem was solved by adding the following to the ingress controller's ConfigMap.
apiVersion: v1
kind: ConfigMap
data:
  http-snippet: |
    proxy_ignore_client_abort on;
See http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_ignore_client_abort
After turning this on, I have almost zero 499 errors!
I highly recommend trying this configuration if you are encountering the same problem.

Client never receives response from service on kubernetes

I have the following configuration:
A Kubernetes cluster based on k3s, deployed on a private cloud.
A service runs on this cluster behind an nginx-based ingress. The service's task is to read a file sent by the client and upload it to an S3 bucket.
I have a client (running in a Docker container) that sends HTTPS requests to upload files to this service.
For each of those requests, the server is expected to take about 2 to 3 minutes to respond.
My problem is that some responses issued by our service never reach the client (the client times out after waiting 15 minutes), even though the client successfully sent its HTTP request, and the service successfully processed it and responded. In particular, I can see from the nginx logs that a 200 response is returned by our service, yet the client does not receive it.
Here is an example of nginx log from which I deduce our service responded 200:
10.42.0.0 - [10.42.0.0] - - [10/Oct/2019:09:55:12 +0000] "POST /api/upload/ot_35121G3118C_1.las HTTP/1.1" 200 0 "-" "reqwest/0.9.20" 440258210 91.771 [file-upload-service-8000] [] 10.42.0.6:8000 0 15.368 200 583d70db408b6be596f5012e7893a3c3
For example, I let the client perform requests continuously for 24 hours (waiting for the server to respond before issuing a new request), and about one or two requests per hour hit this problem while the other eighteen or so behave as expected.
Because nginx tells me a 200 response was returned by our service, it feels like the response was lost somewhere between the nginx ingress and the client. Any idea about what causes this problem? Is there a way for me to debug this?
EDIT:
To clarify, the exact ingress controller I'm using is the nginx-ingress controller, deployed with Helm.
Also, the failures are completely random. The tendency is 1 or 2 per hour, but it can sometimes be more or less. In addition, it does not seem to be correlated with the size of the uploaded file nor with the number of requests that have succeeded so far.

Nginx ingress 504 gateway timeout on EKS with NLB connected to Nginx ingress

We are using an NLB in AWS connected to our EKS cluster via an nginx ingress controller. Some of our requests get a random 504 gateway timeout.
We think we have traced the problem to our nginx ingress.
Based on some Stackoverflow recommendations we played around with Connection headers.
1) We set Connection "close"; this had no effect.
2) We set Connection "keep-alive"; again, no effect.
We also noticed another behavior with our proxy_read_timeout: when it was 60 seconds, a request from the browser would be fulfilled at 60.xx seconds. When we reduced it to 30 it became 30.xx, and at 20 it became 20.xx.
We went down to 1 second but still get random 504 gateway timeouts, and we do not understand why proxy_read_timeout behaves this way in our environment.
We want to understand what the effect of proxy_read_timeout is and why we get the behavior above. Also, is there a way to set Connection "" on our nginx ingress? (We are not able to do this via nginx.ingress.kubernetes.io/connection-proxy-header: "".)
Thanks in advance!
We think our issue was related to this:
https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html#loopback-timeout
We're using an internal NLB with our nginx ingress controller, with targets registered by instance ID. We found that the 504 timeouts and the X-second waits were only occurring on applications that were sharing a node with one of our ingress controller replicas. We used a combination of nodeSelectors, labels, taints, and tolerations to force the ingress controllers onto their own node, and that appears to have eliminated the timeouts.
We also changed our externalTrafficPolicy setting to Local.
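As a rough sketch of that workaround (the label, taint, and Service names below are hypothetical, not taken from the answer), the controller pods can be pinned to dedicated nodes and the Service switched to the Local traffic policy:

# Excerpt of the ingress controller Deployment's pod spec (hypothetical label and taint names):
spec:
  template:
    spec:
      nodeSelector:
        node-role/ingress: "true"     # only schedule onto nodes labelled for ingress
      tolerations:
      - key: dedicated
        operator: Equal
        value: ingress
        effect: NoSchedule            # tolerate the taint that keeps other workloads off
---
# The controller's LoadBalancer Service, keeping traffic on the node it arrives at:
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller      # hypothetical Service name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local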

Nginx server behind Nginx-Ingress controller

We decided to move our apps from Service:LoadBalancer to Ingress, and I've chosen the Nginx Ingress Controller, as I'm familiar with it and because it's one of the most popular controllers in the Kubernetes world.
Previously we had an Nginx => uWSGI combination standing behind an ELB. We compile nginx from source, as we need some third-party modules and Lua support.
Before: ELB => Nginx Server => UWSGI
Now: ELB => Nginx Ingress (Load Balancer) => Nginx (Server) => UWSGI
My question is: is it okay to have 2 nginx in a proxy chain?
I understand that one plays the role of load balancer and the other is the server itself. But for me it comes with some pain: if I change an option in the server's nginx.conf, for example increasing the client body size limit to 8 MB, I have to do the same on the Nginx-Ingress side (see the sketch right after this paragraph). I'm also wondering how to set timeouts, since there is one timeout between ingress => server and another between server => uwsgi, and in general how to tune performance when there are three proxies in front of the app.
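To illustrate that duplication, here is a hedged sketch of the two places the same 8 MB limit would have to live; the annotation is one common way to do this with the NGINX ingress controller, and all names are placeholders:

# On the backend nginx server the limit is a directive in nginx.conf:
#     client_max_body_size 8m;
# On the ingress side the same limit has to be repeated, for example via an annotation:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress                                     # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "8m"   # must be kept in sync with the server
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com                               # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-nginx                             # hypothetical backend Service
            port:
              number: 80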
Is it good practice to remove the Nginx server, so that the Ingress Controller acts as both server and load balancer at the same time? And what about the third-party modules we use?
There's nothing wrong in principle with having two or more nginx instances in a proxy chain, other than, as alluded to in the question and below, the extra complexity.
It is a pain to maintain consistent configuration across multiple proxies, and in particular to have upstream configuration bleed into ingress. It can get very complicated when the same ingress serves multiple upstreams each with different traffic requirements. But this is often nevertheless unavoidable.
Each hop will have its own distinct timeout and retry configuration, and managing them can be complicated, especially the downstream timeout when upstream has retries. One can wind up with very strange failure patterns.
It is not a good idea to bundle an application with an ingress controller. Ingress is about offering a stable entry point into the cluster for out-of-cluster traffic and distributing that traffic to multiple upstream applications in the cluster. If there is only one upstream application, one really does not need ingress, so if possible it is much better to just expose it as a Service, of type NodePort or LoadBalancer depending on the circumstances.
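A minimal sketch of that simpler setup, with a hypothetical application name and ports (not taken from the question):

apiVersion: v1
kind: Service
metadata:
  name: app-nginx                # hypothetical name for the nginx + uWSGI application
spec:
  type: LoadBalancer             # or NodePort, depending on the environment
  selector:
    app: app-nginx               # matches the labels on the application's pods
  ports:
  - port: 80                     # port exposed by the load balancer
    targetPort: 8080             # hypothetical container port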

haproxy how to make it switch to a different backend in case of 404

I need your help.
I have implemented a haproxy configuration which correctly manages both http and websocket backends, except in one specific scenario.
Below is a summary of how this works:
When I connect to :2703/webapp, haproxy correctly routes the request to one of the two configured HTTP backends (webapp-lb1 or webapp-lb2).
When I connect to :2703/webapp/events, haproxy correctly routes the request to one of the two configured websocket backends (websocket-lb1 or websocket-lb2).
The webapp is a servlet running in Apache Tomcat.
When I stop one of the two backend tomcats, haproxy correctly switches to the other one (for both the http and the websocket).
On the contrary, when I try to simulate an outage of one of the HTTP backends by stopping the webapp via the Tomcat manager, haproxy reports an HTTP status 404 error but does not switch to the other backend.
Since I explicitly configured the http-check expect status 302 directive, I would expect that, in case of a 404 status, haproxy switches to the other backend.
I had a look at the official haproxy documentation and I also tested the http-check disable-on-404 configuration, but this is not what I need, as the haproxy behavior remains exactly the same as above.
For information: with http-check disable-on-404 enabled, haproxy detects that the backend I stopped is stopping but does nothing more (which, as far as I understand, is exactly what we should expect from http-check disable-on-404 in case of a 404 status); below is the haproxy log with this option enabled:
Jul 23 14:19:23 localhost haproxy[4037]: Server webapp-lb/webapp-lb2 is stopping, reason: Layer7 check conditionally passed, code: 404, info: "Not Found", check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Below is an extract of my haproxy configuration:
frontend haproxy-webapp
    bind *:2703
    monitor-uri /haproxy_check
    stats enable
    stats uri /haproxy_stats
    stats realm Strictly Private
    stats auth admin:zqxwcevr
    acl is_websocket url_beg /webapp/events
    use_backend websocket-lb if is_websocket
    default_backend webapp-lb
    log global

backend webapp-lb
    server webapp-lb1 192.168.136.129:8888 maxconn 400 check cookie webapp-lb1
    server webapp-lb2 192.168.136.130:8888 maxconn 400 check cookie webapp-lb2
    balance roundrobin
    cookie JSESSIONID prefix nocache
    log global
    #http-check disable-on-404
    option httpchk GET /webapp
    http-check expect status 302

backend websocket-lb
    server websocket-lb1 192.168.136.129:8888 maxconn 400 check
    server websocket-lb2 192.168.136.130:8888 maxconn 400 check
    balance roundrobin
    log global
Please give me a hint, as I have spent ages reading documentation and forums with no success.
Thanks!
