Fixing Kubernetes service redeploy errors with keep-alive enabled

We have a Kubernetes service running on three machines. Clients both inside and outside of our cluster talk to this service over HTTP with the keep-alive option enabled. During a deploy of the service, the exiting pods have a readiness check that starts to fail when shutdown starts, and they are removed from the service endpoints list appropriately. However, they still receive traffic, and some requests fail because the container exits abruptly. We believe this is because keep-alive allows the client to re-use connections that were established when the host was Ready. Is there a series of steps one should follow to make sure we don't run into these issues? We'd like to allow keep-alive connections if at all possible.

The issue happens when the proxying/load balancing is done at layer 4 instead of layer 7. For internal services (Kubernetes services of type ClusterIP), kube-proxy proxies at layer 4, so clients keep their established connection even after the pod is no longer ready to serve. Similarly, for services of type LoadBalancer, if the backend protocol is set to TCP (which is the default with AWS ELB), the same issue happens. Please see this issue for more details.
The solution to this problem as of now is:
If you are using a cloud load balancer, set the backend protocol to HTTP. For example, you can add the service.beta.kubernetes.io/aws-load-balancer-backend-protocol annotation to the Kubernetes service and set it to HTTP so that the ELB uses HTTP proxying instead of TCP.
Use a layer 7 proxy/ingress controller within the cluster to route the traffic instead of sending it via kube-proxy.
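Independently of where the proxying happens, the server itself can ask clients to stop re-using a keep-alive connection once shutdown has begun. This is not something the answers above prescribe; it is a minimal sketch of one complementary approach, assuming the process gets some grace period between the start of shutdown and the actual exit. The names `draining`, `begin_drain`, `Handler` and `fetch`, and the `/ready` path, are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once shutdown begins. Hypothetical name, not part of any framework.
draining = threading.Event()


def begin_drain(signum=None, frame=None):
    """Could be registered with signal.signal(signal.SIGTERM, begin_drain)."""
    draining.set()


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # http.server only keeps connections alive on 1.1

    def do_GET(self):
        if self.path == "/ready":
            # The readiness check starts failing as soon as draining starts.
            status, body = (503, b"draining") if draining.is_set() else (200, b"ok")
        else:
            status, body = 200, b"hello"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        if draining.is_set():
            # Ask the client not to re-use this connection; http.server also
            # closes its side of the connection after this response.
            self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch(port, path):
    """Small client helper: returns (status, value of the Connection header)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()
    result = (resp.status, resp.getheader("Connection"))
    conn.close()
    return result
```

In-flight requests still complete normally while draining; well-behaved clients open a fresh connection for their next request, and that connection is routed to a pod that is still in the endpoints list. The process should exit only after the grace period has passed.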

We're running into the same issue, so just wondering if you figured out a way around it. According to this link, it should be possible by putting a load balancer in front of the service which makes direct requests to the pods and handles keep-alive connections on its own.
We will continue to investigate this issue and see if we can find a way of doing zero downtime deployments with keep-alive connections.

Related

Correct way to get a gRPC client to communicate with one of many ECS instances of the gRPC service?

I have a gRPC client, not dockerised, and server application, which I want to dockerise.
What I don't understand is that gRPC first creates a connection with a server, which involves a handshake. So, if I want to deploy the dockerised server on ECS with multiple instances, then how will the client switch from one to the other (e.g., if one gRPC server falls over).
I know the AWS load balancer now works with HTTP/2, but I can't find information on how to handle the fact that the server might change after the client has already opened a connection to another one.
What is involved?
You don't necessarily need an in-line load balancer for this. By using a Round Robin client-side load balancing policy along with a DNS record that points to multiple backend instances, you should be able to get some level of redundancy.
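As a rough illustration of that suggestion, the sketch below builds the channel target and options for a gRPC Python client that resolves a DNS name and spreads calls across all returned addresses with the round_robin policy. The helper names and the example host are hypothetical; `open_channel` assumes the `grpcio` package is installed and an insecure (plaintext) connection.

```python
import json


def round_robin_channel_args(host, port):
    """Return (target, options) for a channel that uses round_robin over DNS."""
    # The dns:/// scheme makes the client resolve every address behind the name.
    target = f"dns:///{host}:{port}"
    # The service config selects the round_robin load balancing policy.
    service_config = json.dumps({"loadBalancingConfig": [{"round_robin": {}}]})
    options = [("grpc.service_config", service_config)]
    return target, options


def open_channel(host, port):
    """Open an insecure channel; imported lazily so the helper above has no dependency."""
    import grpc  # requires the grpcio package

    target, options = round_robin_channel_args(host, port)
    return grpc.insecure_channel(target, options=options)
```

With this policy the channel keeps a connection to each resolved backend, so if one instance falls over, calls continue on the remaining ones while the channel tries to reconnect.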

R-Shiny script timeout behind a load-balancer

I am testing one Shiny script on an instance in GCP. The instance resides behind a load-balancer that serves as a front end with a static IP address and SSL certificate to secure connections. I configured the GCP instance as part of a backend service to which the load-balancer forwards the requests. The connection between the load-balancer and the instance is not secured!
The issue:
Accessing the Shiny script via the load-balancer works, but the browser screen gets grayed out (time-out) on the client side a short time after the connection is initiated. When the browser screen is grayed out, I have to start over.
If I try to access the Shiny script on the GCP instance directly (not through the load-balancer), the script works fine. I suppose that the problem is in the load-balancer, not the script.
I appreciate any help with this issue.
Context: Shiny uses a websocket (RFC 6455) for its constant client-server communication. If, for whatever reason, this websocket connection gets disconnected, the user experience is the described "greying out". Fortunately GCP supports websockets.
However, it seems that your load balancer has an unexpected HTTP timeout value set.
Depending on what type of load balancer you are using (TCP, HTTPS) this can be configured differently. For their HTTPS offering:
The default value for the backend service timeout is 30 seconds. The full range of timeout values allowed is 1-2,147,483,647 seconds.
Consider increasing this timeout under any of these circumstances:
[...]
The connection is upgraded to a WebSocket.
Answer:
You should be able to increase the timeout for your backend service with the help of this support document.
Mind you, depending on your configuration there could be more proxies involved which might complicate things.
Alternatively you can try to prevent any timeout by adding a heartbeat mechanism to the Shiny application. Some ways of doing this have been discussed in this issue on GitHub.

How should I healthcheck an event-driven service

Suppose I have a service which, rather than listening for HTTP requests or gRPC procedure calls, only consumes messages from a broker (Kafka, RabbitMQ, Google Pub/Sub, what have you). How should I go about healthchecking the service (e.g. k8s liveness and readiness probes)?
Should the service also listen for HTTP solely for the purpose of healthchecking, or is there some other technique which can be used?
Having the service listen to HTTP solely to expose a liveness/readiness check (although in services that pull input from a message broker, readiness isn't necessarily something that a container scheduler like k8s would be concerned with) isn't really a problem (and it also opens up the potential to expose diagnostic and control endpoints).
Kubernetes supports three different types of probes, see also Kubernetes docs:
Running a command
Making an HTTP request
Checking a TCP socket
So, in your case you can run a command that fails when your service is unhealthy.
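Also be aware that liveness probes can be dangerous to use: a failing liveness probe makes Kubernetes restart the container, so a probe that fails whenever the broker is unreachable can cause restart loops without fixing anything.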
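One way to build a command that fails when the service is unhealthy is a heartbeat file: the consumer loop touches a file after each successfully handled message (or idle poll), and the probe command checks how old that file is. The sketch below is only an illustration under that assumption; the function names, the file path and the 30-second threshold are all hypothetical.

```python
import os
import time


def touch_heartbeat(path):
    """Called by the consumer loop after each successful poll or message."""
    with open(path, "a"):
        pass  # create the file if it does not exist yet
    os.utime(path, None)  # set its modification time to now


def is_healthy(path, max_age_seconds, now=None):
    """True if the heartbeat file exists and was touched recently enough."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False  # no heartbeat file yet, or it is unreadable
    return age <= max_age_seconds


def probe(path, max_age_seconds=30):
    """Exit code for an exec probe: 0 means healthy, 1 means unhealthy."""
    return 0 if is_healthy(path, max_age_seconds) else 1
```

A small script that ends with `sys.exit(probe("/tmp/heartbeat"))` can then be used as the probe command. Because the heartbeat reflects the loop itself rather than the broker's availability, a broker outage alone does not trigger restarts as long as the loop keeps polling.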

Intra ServiceFabric communication with internal reverse proxy on localhost

I have a ServiceFabric with two applications. One application gets invoked from outside the ServiceFabric and then issues HTTP GET requests to the other application inside the ServiceFabric.
My first attempt was to address the second application with the ServiceFabric's reverse proxy IP, the same as the first application is addressed with:
http://10.0.0.1:19081/App2/App2.Service/
This led to unreliable communication inside the ServiceFabric and the first request always failed, while the second mostly succeeded.
Then I read about internal ServiceFabric communication at https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reverseproxy. Now I address my second application with localhost and it seems to work as expected:
http://localhost:19081/App2/App2.Service/
The only open question is: Does addressing applications inside the ServiceFabric with localhost only work because the application is also running on the same node? Or does it work because there is real reverse proxy behavior and even if the application does not run on the same node, the request gets to it regardless?
The reverse proxy runs on all nodes, so it can be reached on localhost at all times. It forwards your call to the second service, which is resolved automatically.
You could also use the built-in DNS service to resolve internal services. This way, you save some of the overhead of the reverse proxy.
As opposed to using the IP address, you don't need to know whether the service runs on the local node or on a different one. Also, you don't get into trouble if your service is moved at run-time.
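To keep callers from hard-coding node addresses, the reverse proxy URL can be built in one place. The helper below is a hypothetical sketch that reproduces the localhost URL from the question; the port 19081 is taken from the question and may differ per cluster, and the optional query parameters are passed through unchanged, on the assumption that the reverse proxy accepts them.

```python
from urllib.parse import quote, urlencode


def reverse_proxy_url(service_name, suffix="", port=19081, host="localhost", **query):
    """Build a reverse proxy URL such as http://localhost:19081/App2/App2.Service/."""
    # service_name is the application/service path, e.g. "App2/App2.Service".
    service_path = "/".join(quote(part) for part in service_name.strip("/").split("/"))
    url = f"http://{host}:{port}/{service_path}/{suffix.lstrip('/')}"
    if query:
        # Optional query parameters, appended as-is.
        url += "?" + urlencode(query)
    return url
```

For example, `reverse_proxy_url("App2/App2.Service")` yields the same address used in the question, regardless of which node the calling service happens to run on.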

gRPC client reconnect inside Kubernetes

If we define our microservice inside Kubernetes pods, do we need to instrument a gRPC client reconnection if the service pod is restarting?
When the pod restarts, the host name does not change, but we cannot guarantee that the IP address remains the same. So will the gRPC client still be able to detect the new server to reconnect to?
When the TCP connection is disconnected (because the old pod stopped) gRPC's channel will attempt to reconnect with exponential backoff. Each reconnect attempt implies resolving the DNS address, although it may not detect the new address immediately because of the TTL (time-to-live) of the old DNS entry. Also, I believe some implementations resolve the address when a failure is detected instead of before an attempt.
This process happens naturally without your application doing anything, although it may experience RPC failures until the connection is re-established. Enabling "wait for ready" on an RPC would reduce the chances the RPC fails during this period of transition, although such an RPC generally implies you don't care about response latency.
If the DNS address is not (eventually) re-resolved, then that would be a bug and you should file an issue.
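To get a feel for how long re-establishing the connection can take, the sketch below computes the sequence of reconnect delays under exponential backoff. The default values (1 second initial delay, multiplier 1.6, 120 second cap) follow gRPC's connection backoff document as I understand it; real implementations also add random jitter, which is omitted here, and individual language implementations may differ. The function name is hypothetical.

```python
def backoff_delays(attempts, initial=1.0, multiplier=1.6, max_backoff=120.0):
    """Return the wait before each reconnect attempt, without jitter."""
    delays = []
    delay = initial
    for _ in range(attempts):
        # Each delay grows geometrically but never exceeds the cap.
        delays.append(min(delay, max_backoff))
        delay *= multiplier
    return delays
```

The first few delays are short (roughly 1, 1.6 and 2.6 seconds), so a pod that comes back quickly is usually picked up within a few seconds, subject to the DNS TTL mentioned above.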
You need client-side load balancing as described here. You can watch the endpoints of a service with the Kubernetes API. I have created a package for the Go programming language and it is on GitHub. Sorry, but I haven't written documentation yet. The basic concept is to get the service endpoints at the beginning, then watch the service endpoints for changes.
