nginx 499 errors with node upstream and http2

Recently we switched from Apache to nginx with HTTP/2 support for our web application, and we're seeing quite a lot of 499 errors.
Our setup:
Ubuntu machine running on Amazon AWS
Nginx 1.9.12 acting as a proxy (and SSL offload) for a node application on the same machine
Single Page App on the client side
My initial thought was that clients simply close their browsers, but from the logs I see that ~95% of clients are alive and issue further requests after getting the 499.
55% of the 499 errors occur on HTTP/2 and 45% on HTTP/1.1, so no clear trend there.
80% of the requests come from mobile devices (bad connection?)
Of particular worry is one endpoint (a PUT request) which might take 5-15 seconds to complete. For that endpoint:
~95% of the 499 errors are for HTTP/2
~95% of the requests are from mobile devices
almost all clients are alive (we can see this in the logs because, after the failed request, the client-side JavaScript issues another request to a different endpoint)
there is no time pattern - sometimes the client gets a 499 after just 0.1 seconds, sometimes after 3-9 seconds
Logs don't indicate any problems on the node upstream; this happens regularly, and there is no heavy load.
I've tried adding keepalive to the upstream and enabling proxy_ignore_client_abort (roughly as sketched below), but that does not seem to help.
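For reference, a minimal sketch of the configuration tried above, assuming a single local node process; the upstream name node_backend and port 3000 are placeholders, not taken from the original setup:

upstream node_backend {
    server 127.0.0.1:3000;
    keepalive 32;                      # keep idle connections to the node app open
}

server {
    listen 443 ssl http2;
    server_name example.com;

    location / {
        proxy_pass http://node_backend;
        proxy_http_version 1.1;        # required for upstream keepalive
        proxy_set_header Connection "";
        proxy_ignore_client_abort on;  # finish the upstream request even if the client goes away
        proxy_read_timeout 60s;        # should comfortably exceed the slow PUT endpoint's worst case
    }
}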
Any hints how to troubleshoot this?

I was reading this unanswered question, which suggests that one potential source is impatient clients hitting the refresh button.
This seems to be consistent with your observation that clients are alive and issue further requests after getting the 499.
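One way to test that theory is to log timing and protocol details next to the status code and look at what the 499 entries have in common. A sketch, assuming a custom access log; the format name timing499 and the log path are made up here:

log_format timing499 '$remote_addr "$request" $status '
                     'protocol=$server_protocol http2=$http2 '
                     'request_time=$request_time upstream_time=$upstream_response_time';

access_log /var/log/nginx/access_timing.log timing499;

For the 499 entries, $request_time shows how long the client waited before disconnecting, which helps distinguish impatient refreshes from flaky mobile connections.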

Related

What happens to existing HTTP and Websocket connections when the NGINX configuration reloads?

Let’s say we have a pretty standard blue/green deployment setup. NGINX is currently proxying all traffic to the live server (BLUE). We then deploy an updated version of our code to the idle server (GREEN). Finally, we refresh the NGINX configuration so that all HTTP traffic is being routed to GREEN.
From my understanding, NGINX will handle this reload gracefully. All future HTTP requests will be routed to GREEN and all pending HTTP requests that made it to BLUE will be handled by BLUE.
But what about WebSockets? Let’s say that BLUE had ten live WebSocket connections when the configuration was reloaded. What happens to them?
From my reading of the NGINX documentation, the connections will be terminated after 60 seconds if no new data is sent.
However, if the clients are using some sort of keep-alive or ping then I’d imagine the WebSocket connections will be maintained indefinitely even though BLUE is no longer serving any other HTTP traffic.
Is my intuition right here? If so, I’d imagine that BLUE either needs to close the connections itself, or the client-side code does, unless NGINX has a feature that I am missing.
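For context, the 60-second figure above corresponds to the default proxy_read_timeout: the tunnel is closed if the proxied server sends nothing for that long, and ping/pong traffic resets it, so connections kept alive by pings can indeed outlive the reload until BLUE or the clients close them. On nginx 1.11.11+ there is also worker_shutdown_timeout, which caps how long the old worker (and its remaining WebSocket connections) may linger after a reload. A sketch, assuming a typical WebSocket proxy block; the upstream name and values are illustrative:

worker_shutdown_timeout 120s;          # main context, nginx 1.11.11+: old workers close remaining connections after this

location /ws/ {
    proxy_pass http://app_backend;     # hypothetical upstream pointing at BLUE or GREEN
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 60s;            # tunnel is closed if the upstream sends nothing for this long
}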

Nginx Keep Alive: Simultaneous SSL Handshakes Taking 25s

Thanks for reading :)
This is a super tough issue, and I would love any ideas to figure it out.
Problem: On login, the application initiates ~20 API requests in parallel. The first request does the SSL handshake, and then around the 10th to 13th request I see two requests initiate the SSL handshake at the same time, with each handshake getting stuck and taking over 25 seconds before it is repeated. The issue manifests for users as a 30-second login.
Setup: I have a hardware-based load balancer and about 8 nginx nodes that reverse proxy for a Java application running on the same node. The front end is a SPA, and all traffic flowing through nginx is dynamic content.
Additional Details
Tweaking the keepalive from 65s to 10s (see the sketch after this list) reduced the total SSL handshake time from >30s (which is the front-end timeout) to 25s, so the issue is related to keepalive in some way.
The issue used to be present only in Firefox, and has now spread to Safari.
Nginx has been upgraded to the latest LTS release.
Load balancer is distributing requests round robin.
Nginx logs do not include any mention of the issue.
The API requests are ordered, and the issue usually affects two of the same three requests.
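For reference, the directives involved in the keepalive tweak mentioned in the list above look roughly like this; the values are illustrative, not the production settings:

keepalive_timeout  10s;                # the client keepalive value tweaked from 65s to 10s above
keepalive_requests 100;                # a connection is closed after this many requests, forcing a fresh handshake

ssl_session_cache   shared:SSL:10m;    # lets returning clients resume TLS sessions instead of doing full handshakes
ssl_session_timeout 10m;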

Nginx is giving uWSGI very old requests?

I'm seeing a weird situation where either Nginx or uwsgi seems to be building up a long queue of incoming requests, and attempting to process them long after the client connection timed out. I'd like to understand and stop that behavior. Here's more info:
My Setup
My server uses Nginx to pass HTTPS POST requests to uWSGI and Flask via a Unix file socket. I have basically the default configurations on everything.
I have a Python client sending 3 requests per second to that server.
The Problem
After running the client for about 4 hours, the client machine started reporting that all the connections were timing out. (It uses the Python requests library with a 7-second timeout.) About 10 minutes later, the behavior changed: the connections began failing with 502 Bad Gateway.
I powered off the client. But for about 10 minutes AFTER powering off the client, the server-side uWSGI logs showed uWSGI attempting to answer requests from that client! And top showed uWSGI using 100% CPU (25% per worker).
During those 10 minutes, each uwsgi.log entry looked like this:
Thu May 25 07:36:37 2017 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /api/polldata (ip 98.210.18.212) !!!
Thu May 25 07:36:37 2017 - uwsgi_response_writev_headers_and_body_do(): Broken pipe [core/writer.c line 296] during POST /api/polldata (98.210.18.212)
IOError: write error
[pid: 34|app: 0|req: 645/12472] 98.210.18.212 () {42 vars in 588 bytes} [Thu May 25 07:36:08 2017] POST /api/polldata => generated 0 bytes in 28345 msecs (HTTP/1.1 200) 2 headers in 0 bytes (0 switches on core 0)
And the Nginx error.log shows a lot of this:
2017/05/25 08:10:29 [error] 36#36: *35037 connect() to unix:/srv/my_server/myproject.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 98.210.18.212, server: example.com, request: "POST /api/polldata HTTP/1.1", upstream: "uwsgi://unix:/srv/my_server/myproject.sock:", host: "example.com:5000"
After about 10 minutes the uWSGI activity stops. When I turn the client back on, Nginx happily accepts the POST requests, but uWSGI gives the same "writing to a closed pipe" error on every request, as if it's permanently broken somehow. Restarting the web server's Docker container does not fix the problem, but rebooting the host machine does.
Theories
In the default Nginx -> socket -> uWSGI configuration, is there a long queue of requests with no timeout? I looked in the uWSGI docs and I saw a bunch of configurable timeouts, but all default to around 60 seconds, so I can't understand how I'm seeing 10-minute-old requests being handled. I haven't changed any default timeout settings.
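For reference, the nginx-side uwsgi_* timeouts in such a default setup also sit at 60 seconds; a sketch with the documented defaults spelled out (the location and socket path follow the error log above). Note that these only control when nginx gives up waiting; a request already accepted into uWSGI's own listen queue can still be processed by uWSGI later.

location /api/ {
    uwsgi_pass unix:/srv/my_server/myproject.sock;
    include uwsgi_params;
    uwsgi_connect_timeout 60s;   # time allowed to establish the connection to uWSGI (default 60s)
    uwsgi_send_timeout    60s;   # timeout between two successive writes to uWSGI (default 60s)
    uwsgi_read_timeout    60s;   # timeout between two successive reads from uWSGI (default 60s)
}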
The application uses almost all the 1GB RAM in my small dev server, so I think resource limits may be triggering the behavior.
Either way, I'd like to change my configuration so that requests > 30 seconds old get dropped with a 500 error, rather than getting processed by uWSGI. I'd appreciate any advice on how to do that, and theories on what's happening.
This appears to be an issue on the uWSGI side of things.
It sounds like your backend code may be at fault in that it takes too long to process the requests, does not implement any sort of rate limiting, and does not check whether the underlying connection has already been terminated (hence the errors about writing to closed pipes, and possibly even the processing of requests long after the underlying connections have been closed).
As per http://lists.unbit.it/pipermail/uwsgi/2013-February/005362.html, you might want to abort processing within your backend if not uwsgi.is_connected(uwsgi.connection_fd()).
You might want to explore https://uwsgi-docs.readthedocs.io/en/latest/Options.html#harakiri.
As a last resort, as per Re: Understanding "proxy_ignore_client_abort" functionality (2014), you might want to change uwsgi_ignore_client_abort from off to on. That keeps ongoing uWSGI requests that have already been passed to the upstream running even if the client subsequently disconnects, so you stop receiving the closed-pipe errors from uWSGI, and it lets nginx enforce any concurrent-connection limits itself (otherwise nginx drops its connection to uWSGI as soon as the client disconnects, and has no idea how many requests are still queued up inside uWSGI for later processing).
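On the nginx side, those two suggestions would look roughly like the sketch below; the location, zone name, and limit are placeholders, and the harakiri option from the answer would be set separately in the uWSGI configuration:

limit_conn_zone $binary_remote_addr zone=peraddr:10m;   # http context: track concurrent connections per client address

location /api/ {
    uwsgi_pass unix:/srv/my_server/myproject.sock;
    include uwsgi_params;
    uwsgi_ignore_client_abort on;   # keep the already-forwarded request running even if the client disconnects
    limit_conn peraddr 10;          # cap concurrent connections per client inside nginx
}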
This also looks like it could be a DoS attack on nginx/uWSGI: 100% CPU usage together with nginx 502, 504, and 500 errors. IP spoofing is common in DoS attacks; rule it out by checking the logs.

Why does the user agent resubmit a request after server does a TCP reset?

We've recently noticed a problem where some user agents would repeat the same POST request without the user actually physically triggering it twice.
After further study, we noticed this only happens when the request goes through our load balancer and when the server took a long time to process the request. A packet capture session eventually revealed that the load balancer drops the connection after a 5 minute timeout by sending a TCP Reset to the client; however, the client automatically resubmitted the request without user intervention.
We observed this behavior in Apache HTTP client for Java, Firefox and IE 8. (I cannot install other browsers to test.) This makes me think this behavior is part of the HTTP standard, but this is not very easy to google.
Also, it seems this only happens if the first request was submitted over a kept-alive TCP connection.
This is part of the HTTP/1.1 protocol for handling connections that are closed prematurely by servers:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.2.4.
I experienced a similar situation where we had the same form posted a couple of times, milliseconds apart.
A packet capture via Wireshark confirmed the retransmission by the browser, and the server logs showed the duplicate requests arriving.
Further investigation also revealed that load balancer vendors such as F5 have reported this retransmission behavior, so it is worth checking with your load balancer vendor as well.

HTTP Proxy/FastCGI/SCGI not closing connection when client disconnected - bug or feature?

I'm working on Comet support for the CppCMS framework via long XMLHttpRequest polls. In many cases such a request is closed by the client before any response from the server is given -- for example, the page is closed, the user moves to another page, or the page is simply refreshed.
On the server side I expect to receive a notification that the connection was dropped. I tested the application via three connectors: FastCGI, SCGI and plain HTTP proxy.
Of the three major UNIX web servers -- Apache2, lighttpd and Nginx -- only the last one closed the connection as expected, allowing my application to remove the request from the wait queue; this worked for both the FastCGI and HTTP proxy connectors. (Nginx does not have an SCGI module by default.)
The others, Apache and lighttpd, do not close the connection or inform the backend about disconnected clients; they proceed as if the client were still online. This happens for all three supported APIs: FastCGI, SCGI and HTTP proxy.
I have opened an issue for lighttpd, but what concerns me more is the fact that Apache -- a web server as mature and well supported as lighttpd -- also does not disclose to the backend that the client has gone.
Questions:
Is this a bug or a feature? Is there any reason not to close the connection between the web server and the application backend?
Are there real-life Comet applications working behind these servers via FastCGI/SCGI/HTTP-proxy backends?
If the above is true, how do they deal with this issue? I understand that I can time out all connections every 10 seconds, but I would like to keep them idle for as long as the client is listening -- because this allows easier scale-up -- each connection is very cheap; the cost is only the open socket.
Thanks!
(1) Feature. Or, more specifically, fallout from an implementation detail.
A TCP/IP connection does not involve a constant flow of traffic back and forth. Thus, there is no way to know that a client is gone without (a) the client telling you it is closing the connection or (b) a timeout.
(2) I'm not specifically familiar with Comet or CppCMS. But, yes, there are all kinds of CMS servers running behind the mentioned web servers and they all have to deal with this issue (and, yes, it is a pain).
(3) Timeouts are the only way, but you can mitigate the pain, so to speak. Have the client ping the server across the connection every N seconds when there is otherwise no activity. The ping doesn't have to do anything, and you can tack extra data onto the reply: notifications of concurrent edits or whatever you need.
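On the nginx side specifically (the one server that did behave as hoped in the question), the knobs relevant to this kind of long-poll setup look roughly like this; the location and upstream name are placeholders and the values are illustrative:

location /comet/ {
    proxy_pass http://cppcms_backend;      # hypothetical upstream for the CppCMS application
    proxy_buffering off;                   # stream the response instead of buffering it
    proxy_read_timeout 300s;               # let the poll stay open longer than the 60s default
    proxy_ignore_client_abort off;         # default: close the upstream connection when the client goes away, which is what lets the backend notice
}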
You are correct in that it is surprising that mod_fastcgi doesn't support telling the backend that Apache has detected the disconnect or the connection timed out. And you aren't the first to be dismayed.
The second patch on this page should fix that particular issue:
http://osdir.com/ml/web.fastcgi.devel/2006-02/msg00015.html
http://ncannasse.fr/blog/tora_comet
I don't have any concrete information for you, but this article does mention that they can detect when the client has disconnected from Apache. See tora.Queue. And it sounds like the source is available in the neko CVS, so you might be able to find some clues there. Good luck.
