I have the following configuration:
A kubernetes cluster based on k3s, deployed on a private cloud.
A service runs on this cluster behind an nginx-based ingress. The service's task is to read a file sent by the client and upload it to an S3 bucket.
I have a client (running in a Docker container) that sends HTTPS requests to upload files to this service.
For each of those requests, it is expected that the server takes about 2 to 3 minutes to respond.
My problem is that some responses issued by our service never reach the client (the client times out after waiting 15 minutes), even though the client successfully sent its HTTP request and the service successfully processed it and responded. In particular, I can see from the nginx logs that a 200 response is returned by our service, yet the client does not receive it.
Here is an example of nginx log from which I deduce our service responded 200:
10.42.0.0 - [10.42.0.0] - - [10/Oct/2019:09:55:12 +0000] "POST /api/upload/ot_35121G3118C_1.las HTTP/1.1" 200 0 "-" "reqwest/0.9.20" 440258210 91.771 [file-upload-service-8000] [] 10.42.0.6:8000 0 15.368 200 583d70db408b6be596f5012e7893a3c3
For example, I let the client perform requests continuously for 24 hours (waiting for the server to respond before issuing a new request), and about one or two requests per hour hit this problem while the remaining ~18 behave as expected.
Because nginx tells me a 200 response was returned by our service, it feels like the response was lost somewhere between the nginx ingress and the client. Any idea what causes this problem? Is there a way for me to debug it?
EDIT:
To clarify, the exact ingress controller I'm using is the nginx-ingress controller, deployed with Helm.
Also, the failures are completely random. The tendency is 1 or 2 per hour, but it can sometimes be more or fewer. In addition, the problem does not seem to be correlated with the size of the uploaded file, nor with the number of requests that have succeeded so far.
Related
Let’s say we have a pretty standard blue/green deployment setup. NGINX is currently proxying all traffic to the live server (BLUE). We then deploy an updated version of our code to the idle server (GREEN). Finally, we refresh the NGINX configuration so that all HTTP traffic is being routed to GREEN.
From my understanding, NGINX will handle this reload gracefully. All future HTTP requests will be routed to GREEN and all pending HTTP requests that made it to BLUE will be handled by BLUE.
But what about WebSockets? Let’s say that BLUE had ten live WebSocket connections when the configuration was reloaded. What happens to them?
From my reading of the NGINX documentation, the connections will be terminated after 60 seconds if no new data is sent.
However, if the clients are using some sort of keep-alive or ping then I’d imagine the WebSocket connections will be maintained indefinitely even though BLUE is no longer serving any other HTTP traffic.
Is my intuition right here? If so, I’d imagine that BLUE either needs to close the connections itself, or the client-side code does, unless NGINX has a feature that I am missing.
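For reference, the standard WebSocket proxy block looks roughly like this (the upstream name is hypothetical, and the 60-second value is just the stock proxy_read_timeout default made explicit):

```nginx
location /ws/ {
    proxy_pass http://green_upstream;   # hypothetical upstream name
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    # An idle tunnel is closed once proxy_read_timeout (60s by default)
    # elapses with no data; pings from either side reset this timer.
    proxy_read_timeout 60s;
}
```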
Description:
The k8s nginx-ingress-controllers are exposed as a LoadBalancer-type service (implemented by MetalLB) with IP address 192.168.1.254. Another nginx cluster sits in front of the k8s cluster, and it has only one upstream: 192.168.1.254 (the LB IP address). The request flow is: client -> nginx cluster -> nginx-ingress-controllers -> services.
Question:
Sometimes the nginx cluster reports an occasional "upstream (192.168.1.254) timed out", and the client ultimately gets a 504 timeout from nginx.
But when I dropped the nginx cluster and switched the request flow to client -> nginx-ingress-controllers -> services, everything went well and the client no longer got 504 timeouts. I am sure the network between the nginx cluster and the nginx ingress controllers works well.
Most requests are handled by the nginx cluster and return status 200. I have no idea why a few requests report "upstream timed out" and return status 504.
(Screenshots: system architecture, nginx cluster timeout logs, tcpdump packet trace.)
That's most likely caused by slow file uploads (the requests you've shown are all POSTs): an upload that can't finish within the timeout limit.
You can set a greater timeout value for the application paths where uploads are possible. If you are using an ingress controller, you'd better create a separate Ingress object for that. You can manage the timeouts with these annotations, for example:
annotations:
  nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
  nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
These two annotations raise the proxy send and read timeouts to 5 minutes, which bounds how long an upload may take.
If you are configuring nginx manually, you can set limits with proxy_read_timeout and proxy_send_timeout.
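Put together, a separate Ingress object for the upload path might look like the sketch below (the object name, service name, port, and the /api/upload path are assumptions based on the question's log line):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: file-upload-ingress          # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"   # no upload size limit
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /api/upload
        pathType: Prefix
        backend:
          service:
            name: file-upload-service   # hypothetical service name
            port:
              number: 8000
```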
I've put my Linux Apache web server running on GCP behind the Google load balancer. Because I only want HTTPS traffic, I've redirected port 80 to 443 as shown below:
<VirtualHost *:80>
ServerName spawnparty.com
ServerAlias www.spawnparty.com
DocumentRoot /var/www/html/wwwroot
Redirect permanent / https://www.spawnparty.com/
</VirtualHost>
I've given the VM an external IP address to test whether the redirect works, and it does.
I've then configured the load balancer. I've made it so that the frontend accepts both HTTP and HTTPS. For the backend I made 2 services:
one that uses HTTP and one that uses HTTPS, so that if someone enters through HTTP they are forwarded and then redirected to HTTPS by the code shown above.
For both backend services I made a basic health check:
for HTTP: port 80, timeout 5s, check interval 5s, unhealthy threshold 2 attempts
for HTTPS: port 443, timeout 5s, check interval 5s, unhealthy threshold 2 attempts
The HTTPS one works fine and states 1 of 1 instance healthy, but the HTTP health check states 0 of 1 instance healthy.
If I change the health check from HTTP to HTTPS and back again for the HTTP backend service, it works for a short period of time, but after a few minutes it states 0 of 1 instance healthy again.
What must I change to keep it healthy?
TL;DR - Use the same HTTPS health check for both the backend services.
Health Checking and response codes
You will need to respond with a 200 response code and close the connection normally within the configured period.
HTTP and HTTPS health checks
If traffic from the load balancer to your instances uses the HTTP or HTTPS protocol, then HTTP or HTTPS health checks verify that the instance is healthy and the web server is up and serving traffic.
For an HTTP(S) health check probe to be deemed successful, the instance must return a valid HTTP response with code 200 and close the connection normally within the configured period. If it does this a specified number of times in a row, the health check returns a status of HEALTHY for that instance. If an instance fails a specified number of health check probes in a row, it is marked UNHEALTHY without any notification being sent. UNHEALTHY instances do not receive new connections, but existing connections are allowed to continue. UNHEALTHY instances continue to receive health check probes. If an instance later passes a health check by successfully responding to a specified number of consecutive health check probes, it is marked HEALTHY and starts receiving new connections, again without any notification.
Since you have 2 separate backend services (one for HTTP and the other for HTTPS), you will need 2 health checks (although backend services allow reusing the same health check too, if needed; read on), since the load balancer considers them independent services.
As you have already confirmed, using the HTTPS health check will work with the HTTPS-based service, but using the HTTP health check will not. The reason is that your redirect actually returns an HTTP 301 response code (permanent URL redirection) instead of the expected HTTP 200 response code.
Possible Solution(s)
One way to fix this is to use HTTPS health checks for both the backend services, since your underlying service is still the same. You lose the ability to health check the redirection, but that unfortunately is not supported by the Google Cloud Load Balancer. You can share the same HTTPS health check resource too for both the backend services.
The solution posted by CharlesB will also work, but I feel you're adding additional redirection rules just to satisfy the health checks; they are not used on your service path anyway. You will also need a separate HTTP health check resource. Using the same HTTPS health check for both backend services is much simpler and also verifies that your service is alive to handle new requests.
Redirect everything except your health check page to HTTPS. The question "How to force rewrite to HTTPS except for a few pages in Apache?" explains how you can do that. GCE Network load balancing mentions this requirement, saying "Even if your service does not use HTTP, you'll need to at least run a basic web server on each instance that the health check system can query."
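A minimal sketch of that exception, assuming the HTTP health check is pointed at a hypothetical /health.html page (adapt the path to whatever your check actually requests):

```apache
<VirtualHost *:80>
    ServerName spawnparty.com
    ServerAlias www.spawnparty.com
    DocumentRoot /var/www/html/wwwroot
    RewriteEngine On
    # Let the load balancer's health check through over plain HTTP...
    RewriteCond %{REQUEST_URI} !^/health\.html$
    # ...and redirect everything else to HTTPS.
    RewriteRule ^(.*)$ https://www.spawnparty.com$1 [R=301,L]
</VirtualHost>
```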
I'm seeing a weird situation where either Nginx or uwsgi seems to be building up a long queue of incoming requests, and attempting to process them long after the client connection timed out. I'd like to understand and stop that behavior. Here's more info:
My Setup
My server uses Nginx to pass HTTPS POST requests to uWSGI and Flask via a Unix file socket. I have basically the default configurations on everything.
I have a Python client sending 3 requests per second to that server.
The Problem
After running the client for about 4 hours, the client machine started reporting that all the connections were timing out. (It uses the Python requests library with a 7-second timeout.) About 10 minutes later, the behavior changed: the connections began failing with 502 Bad Gateway.
I powered off the client. But for about 10 minutes AFTER powering off the client, the server-side uWSGI logs showed uWSGI attempting to answer requests from that client! And top showed uWSGI using 100% CPU (25% per worker).
During those 10 minutes, each uwsgi.log entry looked like this:
Thu May 25 07:36:37 2017 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /api/polldata (ip 98.210.18.212) !!!
Thu May 25 07:36:37 2017 - uwsgi_response_writev_headers_and_body_do(): Broken pipe [core/writer.c line 296] during POST /api/polldata (98.210.18.212)
IOError: write error
[pid: 34|app: 0|req: 645/12472] 98.210.18.212 () {42 vars in 588 bytes} [Thu May 25 07:36:08 2017] POST /api/polldata => generated 0 bytes in 28345 msecs (HTTP/1.1 200) 2 headers in 0 bytes (0 switches on core 0)
And the Nginx error.log shows a lot of this:
2017/05/25 08:10:29 [error] 36#36: *35037 connect() to unix:/srv/my_server/myproject.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 98.210.18.212, server: example.com, request: "POST /api/polldata HTTP/1.1", upstream: "uwsgi://unix:/srv/my_server/myproject.sock:", host: "example.com:5000"
After about 10 minutes the uWSGI activity stops. When I turn the client back on, Nginx happily accepts the POST requests, but uWSGI gives the same "writing to a closed pipe" error on every request, as if it's permanently broken somehow. Restarting the webserver's docker container does not fix the problem, but rebooting the host machine fixes it.
Theories
In the default Nginx -> socket -> uWSGI configuration, is there a long queue of requests with no timeout? I looked in the uWSGI docs and I saw a bunch of configurable timeouts, but all default to around 60 seconds, so I can't understand how I'm seeing 10-minute-old requests being handled. I haven't changed any default timeout settings.
The application uses almost all of the 1 GB of RAM on my small dev server, so I think resource limits may be triggering the behavior.
Either way, I'd like to change my configuration so that requests > 30 seconds old get dropped with a 500 error, rather than getting processed by uWSGI. I'd appreciate any advice on how to do that, and theories on what's happening.
This appears to be an issue downstream on the uWSGI side.
It sounds like your backend code may be at fault: it takes too long to process requests, does not implement any rate limiting, and does not properly detect when the underlying connection has been terminated. Hence the errors about writing to closed pipes, and possibly even requests that start processing long after the underlying connections have been closed.
As per http://lists.unbit.it/pipermail/uwsgi/2013-February/005362.html, you might want to abort processing within your backend if not uwsgi.is_connected(uwsgi.connection_fd()).
You might want to explore https://uwsgi-docs.readthedocs.io/en/latest/Options.html#harakiri.
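A sketch of the relevant uWSGI settings (the 30-second value matches the cutoff you asked for; the socket path follows your logs, and the worker count and backlog size are assumptions):

```ini
[uwsgi]
socket = /srv/my_server/myproject.sock
workers = 4
; Kill any worker that spends more than 30 seconds on a single request.
harakiri = 30
harakiri-verbose = true
; Keep the listen backlog short so stale requests are rejected
; instead of queuing up for minutes.
listen = 64
```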
As a last resort, as per Re: Understanding "proxy_ignore_client_abort" functionality (2014), you might want to change uwsgi_ignore_client_abort from off to on. That way nginx does not drop ongoing uWSGI connections that have already been passed upstream, even if the client subsequently disconnects, so you stop receiving the closed-pipe errors from uWSGI. It also lets nginx enforce any concurrent-connection limits itself; otherwise the connections to uWSGI are dropped by nginx when the client disconnects, and nginx has no idea how many requests are queued up within uWSGI for subsequent processing.
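In nginx terms, that directive would go into the location that proxies to uWSGI; a sketch, with the location path and socket taken from your error log:

```nginx
location /api/ {
    include uwsgi_params;
    uwsgi_pass unix:/srv/my_server/myproject.sock;
    # Keep processing the uWSGI response even if the client goes away,
    # so nginx (not uWSGI) accounts for the in-flight request.
    uwsgi_ignore_client_abort on;
}
```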
This looks like a possible DoS attack: uWSGI at 100% CPU while nginx returns 502, 504, and 500. IP spoofing is common in DoS attacks; rule it out by checking the logs.
Seeing this in nginx logs:
"upstream": "52.86.112.192:443, 52.86.78.197:443",
"upstream_response_time": "7.005, 7.016",
The documentation of nginx says:
If several servers were contacted during request processing, their addresses are separated by commas, e.g. "192.168.1.1:80, 192.168.1.2:80, unix:/tmp/sock".
Unfortunately it is not clear to me how two servers can be contacted while processing one request. There are no internal redirects, so was it a retry after the first attempt failed?
The DNS query returns two A records:
xxx.us-east-1.elb.amazonaws.com. 60 IN A 54.84.139.107
xxx.us-east-1.elb.amazonaws.com. 60 IN A 52.71.207.21
Does it mean that nginx retries every failed request automatically? Or can it be configured? (This is AWS, so the IPs of the load balancers change constantly.)
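For context, nginx's retry behavior is governed by proxy_next_upstream; when the proxied hostname resolves to several A records, nginx treats them as one upstream group and, by default, retries errors and timeouts on the next address. A sketch of how this could be tuned (the location and exact directives here are illustrative, not taken from the poster's config):

```nginx
location / {
    proxy_pass https://xxx.us-east-1.elb.amazonaws.com;
    # Default retry conditions; set to "off" to disable retries entirely.
    proxy_next_upstream error timeout;
    # Cap the number of attempts across the resolved addresses.
    proxy_next_upstream_tries 2;
}
```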