Thanks for reading :)
This is a super tough issue, and I'd love any ideas to help figure it out.
Problem: When a user logs in, the application initiates ~20 API requests in parallel. The first request performs the SSL handshake, and then around the 10th to 13th request I see two requests initiate the SSL handshake at the same time, with each handshake getting stuck and taking over 25 seconds to complete. For users, the issue manifests as a 30-second login.
Setup: A hardware load balancer sits in front of about 8 nginx nodes, each reverse-proxying a Java application running on the same node. The FE is a SPA, and all traffic flowing through nginx is dynamic content.
Additional Details
Tweaking the keepalive from 65s to 10s reduced the total SSL handshake time from >30s (the FE timeout) to 25s, so the issue is related to keepalive in some way (see the sketch after this list).
The issue used to be present only in Firefox, and has now spread to Safari.
Upgraded nginx to the latest LTS release.
The load balancer distributes requests round robin.
The nginx logs contain no mention of the issue.
The API requests are issued in a fixed order, and the issue usually affects 2 of the same 3 requests.
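Assuming the keepalive being tweaked is nginx's keepalive_timeout (a guess on my part; it could equally be a load balancer setting), the change would look like this:

http {
    keepalive_timeout 10s;   # was 65s; shortening it cut the stall from >30s to ~25s
}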
Related
Let’s say we have a pretty standard blue/green deployment setup. NGINX is currently proxying all traffic to the live server (BLUE). We then deploy an updated version of our code to the idle server (GREEN). Finally, we refresh the NGINX configuration so that all HTTP traffic is being routed to GREEN.
From my understanding, NGINX will handle this reload gracefully. All future HTTP requests will be routed to GREEN and all pending HTTP requests that made it to BLUE will be handled by BLUE.
But what about WebSockets? Let’s say that BLUE had ten live WebSocket connections when the configuration was reloaded. What happens to them?
From my reading of the NGINX documentation, the connections will be terminated after 60 seconds if no new data is sent.
However, if the clients are using some sort of keep-alive or ping then I’d imagine the WebSocket connections will be maintained indefinitely even though BLUE is no longer serving any other HTTP traffic.
Is my intuition right here? If so, I’d imagine that BLUE either needs to close the connections itself, or the client-side code does, unless NGINX has a feature that I am missing.
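For reference, I believe these are the directives in play (a sketch; the upstream name is made up). proxy_read_timeout (default 60s) is what closes idle proxied connections, and worker_shutdown_timeout (available since nginx 1.11.11, set in the main context) can force old workers to stop after a reload rather than lingering on long-lived connections:

# main context: cap how long old workers linger after a reload (1.11.11+)
worker_shutdown_timeout 30s;

# inside the server block: a typical WebSocket proxy location
location /ws/ {
    proxy_pass http://blue_green_upstream;     # illustrative upstream name
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 60s;                    # idle connections closed after this
}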
What I did: I tuned a WordPress website so that it loads in 2.3 seconds instead of the former 5 seconds. As a final step, I enabled HTTP/2 on the LAMP server (PHP 7.0, Apache 2.4) and restarted Apache.
Issues: After doing this, the Time to First Byte jumped from 500ms to 9 seconds, and in the speed load tests (GTmetrix and WebPageTest) the website's PNG and JPEG images now return 404. Please see the screenshot here: https://www.diigo.com/item/image/5pj5q/vt0g
Overall, the speed load tests show the website loading in 20-30 seconds, instead of the former 2.3 seconds.
I am at a loss about how to solve these two issues caused by HTTP/2. Any advice is welcome.
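For reference, enabling HTTP/2 on Apache 2.4 typically looks like the sketch below (module path may differ per distro). One caveat worth checking: as far as I know, mod_http2 does not work with the prefork MPM that mod_php typically requires, which alone can produce odd behaviour like this.

# load the module and prefer h2 over HTTP/1.1 on TLS connections
LoadModule http2_module modules/mod_http2.so
Protocols h2 http/1.1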
I may have found the answer. It has to do with SSL certs, at least in my situation. I use WHM/cPanel, and the SSL certs are shared across hosts. HTTP/2 detects that and sends a 421 error.
From the Apache docs: https://httpd.apache.org/docs/2.4/mod/mod_http2.html
Multiple Hosts and Misdirected Requests
Many sites use the same TLS certificate for multiple virtual hosts. The certificate either has a wildcard name, such as '*.example.org', or carries several alternate names. Browsers using HTTP/2 will recognize that and reuse an already opened connection for such hosts.
While this is great for performance, it comes at a price: such vhosts need more care in their configuration. The problem is that you will have multiple requests for multiple hosts on the same TLS connection. And that makes renegotiation impossible; in fact, the HTTP/2 standard forbids it.
So, if you have several virtual hosts using the same certificate and want to use HTTP/2 for them, you need to make sure that all vhosts have exactly the same SSL configuration. You need the same protocol, ciphers and settings for client verification.
If you mix things, Apache httpd will detect it and return a special response code, 421 Misdirected Request, to the client.
I disabled HTTP/2 and the 404/421 errors stopped.
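If you'd rather keep HTTP/2 than disable it, my reading of the docs above is that the vhosts sharing the cert must have identical SSL settings. A sketch with hypothetical hostnames and paths:

<VirtualHost *:443>
    ServerName a.example.org
    Protocols h2 http/1.1
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/example.org.crt   # shared wildcard cert
    SSLCertificateKeyFile /etc/ssl/private/example.org.key
    SSLCipherSuite HIGH:!aNULL:!MD5
</VirtualHost>

<VirtualHost *:443>
    ServerName b.example.org
    Protocols h2 http/1.1
    SSLEngine on
    # the SSL* directives must match a.example.org exactly,
    # including any client verification settings
    SSLCertificateFile    /etc/ssl/certs/example.org.crt
    SSLCertificateKeyFile /etc/ssl/private/example.org.key
    SSLCipherSuite HIGH:!aNULL:!MD5
</VirtualHost>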
I have an ejabberd cluster in AWS that I want to load balance. I initially tried putting an ELB in front of the nodes, but that made the sessions non-sticky. I then enabled proxy protocol on the ELB and introduced an HAProxy node between the ELB and the ejabberd cluster. My assumption was that the HAProxy instance would use TCP proxying and ensure the sessions stay sticky on the ejabberd servers.
However, that still does not seem to be happening! Is this even possible in the first place? Adding the cookie config to haproxy.cfg gives an error that cookies are enabled only for HTTP, so how can I have TCP sessions stay sticky on the server?
Please do help, as I seem to be out of ideas here!
ejabberd does not require sticky load balancing, so you do not need to implement this. Just use the ejabberd cluster with the ELB or HAProxy in front, without stickiness.
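That said, if you did want TCP-mode stickiness, HAProxy supports it via stick-tables rather than cookies (which are HTTP-only). A minimal sketch with illustrative addresses and ports:

backend ejabberd_nodes
    mode tcp
    balance roundrobin
    # cookies only exist in HTTP mode; in TCP mode, stick on the client source IP
    stick-table type ip size 200k expire 30m
    stick on src
    server node1 10.0.0.1:5222 check
    server node2 10.0.0.2:5222 check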
Thanks @Michael-sqlbot and @Mickael. It seems it had to do with the idle timeout on the ELB: it was set to 60 seconds, so the TCP connection was getting dropped if I didn't push any data from the client to the ejabberd server. After playing with that plus the health check interval, I can see the ELB giving me a long-running connection. Thanks.
I still have to figure out how to get the client IPs captured in ejabberd (I believe enabling proxy protocol on the ELB would help), but that is a separate investigation.
We are experiencing a weird problem with Varnish 3.0: a rate of 10-20 failures per node per minute across our Varnish farm. Varnish talks to a backend server fronted by a load balancer (an F5 in this case). We took TCP dumps at both the Varnish layer and the load balancer layer. The backend server responds in around 3 seconds; in the TCP dump we see the 200 OK being received by Varnish after 3 seconds. After this is where we see the strange behaviour: Varnish sends the ACK to the load balancer within milliseconds, but the FIN, ACK is sent only after a delay of about 10 seconds. That delay matches a 10-second timeout configured in Varnish, and we see a 503 error returned from the Varnish layer. This is the Varnish backend configuration (the backend has been renamed for security reasons):
backend backend1 {
    .host = "<load balancer virtual server name>";
    .port = "<port>";
    .first_byte_timeout = 120s;
    .connect_timeout = 10s;
    .between_bytes_timeout = 10s;
}
Have any of you experienced a similar issue? Any pointers on troubleshooting it would be greatly appreciated.
The problem seems to be the between_bytes_timeout configuration. You have set it to 10 seconds, and, according to you, the load balancer takes 10 seconds to send the FIN, ACK message.
From the Varnish docs:
between_bytes_timeout
Units: s
Default: 60
Default timeout between bytes when receiving data from the backend. We only wait for this many seconds between bytes before giving up. A value of 0 means it will never time out. VCL can override this default value for each backend or backend request. This parameter does not apply to pipe.
Try increasing this value and see what happens.
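Applied to your backend definition above, that would mean something like raising it back toward the 60-second default:

backend backend1 {
    .host = "<load balancer virtual server name>";
    .port = "<port>";
    .first_byte_timeout = 120s;
    .connect_timeout = 10s;
    .between_bytes_timeout = 60s;  # raised from 10s, matching the default
}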
We recently switched from Apache to nginx with HTTP/2 support for our web application, and we're seeing quite a lot of 499 errors.
Our setup:
Ubuntu machine running on Amazon AWS
Nginx/1.9.12 acting as a proxy (and SSL offload) for a node application (same machine)
Single Page App on the client side
My initial thought was that clients simply close their browsers, but the logs show that ~95% of clients are alive, and there are further requests after the 499.
55% of the 499 errors occur on HTTP/2 and 45% on HTTP/1.1, so no trend there.
80% of the requests come from mobile devices (bad connections?)
But of particular worry, there is one endpoint which can take 5-15 seconds to complete (a PUT request). For that endpoint:
~95% of the 499 errors are for the HTTP/2 version
~95% of the requests are from mobile devices
almost all clients are alive (we can see that from the logs, because after the failed request the client-side JavaScript issues another request to a different endpoint)
There is no time pattern: sometimes the client gets a 499 after just 0.1 seconds, sometimes after 3-9 seconds
The logs don't indicate any problems on the node upstream; this happens regularly, and not under heavy load.
I've tried adding keepalive to the upstream and enabling proxy_ignore_client_abort, but that does not seem to help.
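For reference, what I tried looks roughly like this (the upstream name and port are made up here):

upstream node_app {
    server 127.0.0.1:3000;
    keepalive 16;                       # pool of idle upstream connections
}

server {
    location / {
        proxy_pass http://node_app;
        proxy_http_version 1.1;         # required for upstream keepalive
        proxy_set_header Connection ""; # clear the default "close" header
        proxy_ignore_client_abort on;   # finish the upstream request even if the client disconnects
    }
}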
Any hints on how to troubleshoot this?
I was reading this unanswered question, which suggests that one potential source is impatient clients hitting the refresh button.
That seems consistent with your observation that clients are alive and issue further requests after getting a 499.