Connection closed in AKS with Nginx ingress

I have an AKS cluster using Nginx ingress controllers, and I infrequently but reliably receive errors like this:
The connection was closed unexpectedly
An existing connection was forcibly closed by the remote host
What can I do to fix this?

I found the solution to these errors in the documentation at https://blogs.msdn.microsoft.com/jpsanders/2009/01/07/you-receive-one-or-more-error-messages-when-you-try-to-make-an-http-request-in-an-application-that-is-built-on-the-net-framework-2-0, which says:
Also check and ensure the Keep-Alive timeouts on the server, load balancer and client (.NET) are set so that the client is set to less than the load balancer, which in turn is set less than the server.
In my case I needed to increase the upstream-keepalive-timeout setting to something larger than the default idle timeout of the Azure load balancer (which is 4 minutes). I set the value to 300 seconds, and the errors went away.
Edit
I also had to set the worker-shutdown-timeout value, as described in https://github.com/kubernetes/minikube/issues/3039.
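For reference, both knobs are keys in the ingress controller's ConfigMap. A minimal sketch, assuming a standard ingress-nginx install where the ConfigMap is called nginx-configuration in the ingress-nginx namespace (your names may differ):

    # ConfigMap read by the Nginx ingress controller; name and namespace
    # are assumptions, so match them to your own installation.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nginx-configuration
      namespace: ingress-nginx
    data:
      # Keep idle upstream keep-alive connections open longer than the
      # Azure load balancer's 4-minute idle timeout (value in seconds).
      upstream-keepalive-timeout: "300"
      # Give old Nginx workers time to drain connections during reloads.
      worker-shutdown-timeout: "240s"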

Related

R-Shiny script timeout behind a load-balancer

I am testing one Shiny script on an instance in GCP. The instance resides behind a load-balancer that serves as a front end with a static IP address and SSL certificate to secure connections. I configured the GCP instance as part of a backend service to which the load-balancer forwards the requests. The connection between the load-balancer and the instance is not secured!
The issue:
Accessing the Shiny script via the load-balancer works, but the web browser's screen gets grayed out (time-out) on the client side a short time after initiating the connection. When the browser screen grays out, I have to start over.
If I try to access the Shiny script on the GCP instance directly (not through the load-balancer), the script works fine. I suppose that the problem is in the load-balancer, not the script.
I appreciate any help with this issue.
Context: Shiny uses a websocket (RFC 6455) for its constant client-server communication. If, for whatever reason, this websocket connection gets disconnected, the user experience is the described "greying out". Fortunately, GCP supports websockets.
However, it seems that your load balancer has an unexpectedly low HTTP timeout value set.
Depending on what type of load balancer you are using (TCP, HTTPS) this can be configured differently. For their HTTPS offering:
The default value for the backend service timeout is 30 seconds. The full range of timeout values allowed is 1-2,147,483,647 seconds.
Consider increasing this timeout under any of these circumstances:
[...]
The connection is upgraded to a WebSocket.
Answer:
You should be able to increase the timeout for your backend service with the help of this support document.
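As an illustration (not from the original answer), the backend service timeout can be raised with gcloud; BACKEND_SERVICE_NAME is a placeholder, and --global assumes a global external HTTP(S) load balancer (use --region for a regional one):

    # Raise the backend service timeout to 1 hour (value is in seconds)
    gcloud compute backend-services update BACKEND_SERVICE_NAME \
        --global \
        --timeout=3600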
Mind you, depending on your configuration there could be more proxies involved which might complicate things.
Alternatively you can try to prevent any timeout by adding a heartbeat mechanism to the Shiny application. Some ways of doing this have been discussed in this issue on GitHub.

What would a WebException for a connection killed by a Heroku server look like?

Background:
I am working with an external vendor where requests for a file download go through a web service. I believe their service is hosted on Heroku, and recently we have been seeing connections get killed after exactly 30 seconds. 99.99% of our requests receive sub-second responses; occasionally one or two requests take 20+ seconds, and any that hit the 30-second mark hit this issue. So I only see the problem on about 1 in 10,000 requests (it happens once every few days).
I've done some looking around, and the only common factor I've found is that Heroku has a 30-second HTTP request timeout that a few people have had issues with; on the server side it's pretty easy to spot one of these timeouts. The issue is that we don't have access to the server-side logs and only get a generic error back on the client side.
What I have tried:
In terms of debugging, I have pointed the service endpoint at a local dummy web service that literally just sleeps for 3 minutes, and it doesn't time out until the 120-second mark (which is our server's default).
Error in the WebException message after 30 seconds against the external vendor's service: "The underlying connection was closed: An unexpected error occurred on a receive"
As a note on the above error message, TLS 1.2 is already being forced for these requests.
Actual Question:
Is it possible, based on these observations, that Heroku is actually killing the connection server-side, and that this would result in the generic error we see?
Sources:
Heroku HTTP request timeouts: https://devcenter.heroku.com/articles/request-timeout
Outsystems blaming these errors on remote server, not client: https://www.outsystems.com/forums/discussion/15641/tip-the-underlying-connection-was-closed-an-unexpected-error-occurred-on-a-rece/

Problems with k8s service after few minutes

We deployed a Redis chart (Bitnami's Redis Helm chart, cluster mode disabled) and connected our Java app (using Redisson) to the service.
After 5 minutes the connection to Redis is closed ("Reading from client: Connection reset by peer" in the Redis debug log). The networking still seems fine and we can establish new connections, but the old ones are closed, even though the Redis timeout config is 0.
When we configure the Java app to access the Redis pod directly (without the k8s service in the middle), the problem doesn't happen.
We didn't find any similar problem on the web, which is weird (pretty out-of-the-box settings, nothing special).
We have no idea what can cause such problems; any ideas?
Kubernetes 1.11.7 installed on AWS via kops. We tried a kubenet cluster and a new Calico-Flannel (Canal) cluster (without policies), switching Redis versions (4 and 5), and accessing the service by IP and by name, but it didn't help.
Redis timeout is 0 (disabled).
Bitnami's Helm chart: https://github.com/helm/charts/tree/master/stable/redis
Using usePassword: false and log level debug.

Fixing kubernetes service redeploy errors with keep-alive enabled

We have a Kubernetes service running on three machines. Clients both inside and outside of our cluster talk to this service over HTTP with the keep-alive option enabled. During a deploy of the service, the exiting pods have a readiness check that starts to fail when shutdown starts, and they are removed from the service endpoints list appropriately; however, they still receive traffic, and some requests fail as the container abruptly exits. We believe this is because of keep-alive, which allows the client to re-use connections that were established when the host was Ready. Is there a series of steps one should follow to make sure we don't run into these issues? We'd like to allow keep-alive connections if at all possible.
The issue happens if the proxying/load balancing happens at layer 4 instead of layer 7. For internal services (a Kubernetes Service of type ClusterIP), kube-proxy proxies at layer 4, so clients keep the connection open even after the pod is no longer ready to serve. Similarly, for Services of type LoadBalancer, if the backend type is set to TCP (which is the default with AWS ELB), the same issue happens. Please see this issue for more details.
The solution to this problem as of now is:
If you are using a cloud LoadBalancer, set the backend to HTTP. For example, you can add the service.beta.kubernetes.io/aws-load-balancer-backend-protocol annotation to the Kubernetes Service and set it to HTTP so that the ELB uses HTTP proxying instead of TCP (see the sketch after this list).
Use a layer 7 proxy/ingress controller within the cluster to route the traffic instead of sending it via kube-proxy
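As a sketch of the first option, assuming an ELB-backed Service on AWS (the service name, selector and ports are placeholders):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-service                  # placeholder name
      annotations:
        # Tell the AWS cloud provider to create an ELB that proxies at
        # layer 7 (HTTP) instead of layer 4 (TCP)
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    spec:
      type: LoadBalancer
      selector:
        app: my-app                     # placeholder selector
      ports:
        - port: 80
          targetPort: 8080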
We're running into the same issue, so I'm just wondering if you figured out a way around it. According to this link, it should be possible to do so by having a load balancer in front of the service which makes direct requests to the pods and handles Keep-Alive connections on its own.
We will continue to investigate this issue and see if we can find a way of doing zero downtime deployments with keep-alive connections.

1-second connection times for localhost vs. nil connection times against my IP 192.168.1.x

I'm seeing an odd behavior connecting to a development webserver running on my local machine.
When I connect to http://192.168.1.14 (my IP) everything loads fast (sub-200ms for a page + 20 resources, the expected result).
When I connect to http://localhost I get 5 second load times for the same page and what I see when I profile the page using Chrome is:
1st page, ~1 sec connection time (varies from 0.98 to 1.02), then loads fast
The next 3 resources have a ~1 sec connection time then load fast.
The next 6 then wait ~1 sec connection time (after the last group loaded) then load fast.
This pattern repeats itself in batches of 6 resources until everything is loaded.
Any ideas on this? Where could the 1 sec connection time delay be coming from?
I'm referencing "Connection Time" from the resource profile graph provided in the Google Chrome Developer Tools. For example, one resource shows:
Proxy: 0ms
DNS Lookup: 1ms
Connecting: 1.00s
Sending: 0ms
Waiting: 3ms
Receiving: 2ms
I had a similar problem recently.
The problem was that I had configured a proxy server in Windows' proxy settings and every request went through the proxy even though I had explicitly enabled the option "Bypass proxy server for local addresses". Apparently this setting was not working or didn't do what I expected.
So I went ahead and configured an exception for localhost in the advanced proxy settings. Now everything from my development server loads pretty fast.
What happens when you try 127.0.0.1 directly? I imagine it could be the fact that something still has to translate localhost into 127.0.0.1, be it a DNS server, your /etc/hosts file, or the yellow-pages service.
If 127.0.0.1 is fast, I would say it's a translation delay in which case the next step is to figure out what's doing the translation.
If it's slow, I'd be looking at the appserver or webserver logs. Apache, if that's what you're using, is one of the only two infinitely-configurable applications ever built, the other being Emacs, of course :-)
The only solution I found was adding a hosts file entry and referencing the local website through that entry. For some inexplicable reason, this rendered the page in real time.
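For anyone who wants to try the same workaround, the entry looks roughly like the one below, added to the hosts file (C:\Windows\System32\drivers\etc\hosts on Windows, /etc/hosts elsewhere); mydevsite.local is a made-up name, and you then browse to http://mydevsite.local instead of http://localhost:

    # Map a custom hostname to the local dev server (the IP is an example)
    192.168.1.14    mydevsite.local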
