gRPC client reconnect inside Kubernetes - grpc

If we define our microservice inside Kubernetes pods, do we need to instrument a gRPC client reconnection if the service pod is restarting?
When the pod restarts the host name is not changed, but we cannot guarantee the IP address remains the same. So is the gRPC client still be able to detect the new server to reconnect to?

When the TCP connection is disconnected (because the old pod stopped) gRPC's channel will attempt to reconnect with exponential backoff. Each reconnect attempt implies resolving the DNS address, although it may not detect the new address immediately because of the TTL (time-to-live) of the old DNS entry. Also, I believe some implementations resolve the address when a failure is detected instead of before an attempt.
This process happens naturally without your application doing anything, although it may experience RPC failures until the connection is re-established. Enabling "wait for ready" on an RPC would reduce the chances the RPC fails during this period of transition, although such an RPC generally implies you don't care about response latency.
If the DNS address is not (eventually) re-resolved, then that would be a bug and you should file an issue.

You need client-side load balancing as described here. You can watch the endpoints of a service with Kubernetes api. I have created a package for Go programming language and it is on github. Sorry but I didn't write a documentation yet. Basic concept is get service endpoints at beginning than watch service endpoints for changes.

Related

R-Shiny script timeout behind a load-balancer

I am testing one Shiny script on an instance in GCP. The instance resides behind a load-balancer that serves as a front end with a static IP address and SSL certificate to secure connections. I configured the GCP instance as part of a backend service to which the load-balancer forwards the requests. The connection between the load-balancer and the instance is not secured!
The issue:
accessing the Shiny script via the load-balancer works, but the web browser's screen gets grayed (time-out) on the client-side after a short time of initiating the connection!! When the browser screen grayed out, I have to start over!!
If I try to access the Shiny script on the GCP instance directly (not through the load-balancer), the script works fine. I suppose that the problem is in the load-balancer, not the script.
I appreciate any help with this issue.
Context: Shiny uses a websocket (RFC 6455) for its constant client-server communication. If, for whatever reason, this websocket connection gets dicsonnected, the user experience is the described "greying out". Fortunately GCP supports websockets.
However, it seems that your load balancer has an unexpected http time out value set.
Depending on what type of load balancer you are using (TCP, HTTPS) this can be configured differently. For their HTTPS offering:
The default value for the backend service timeout is 30 seconds. The full range of timeout values allowed is 1-2,147,483,647 seconds.
Consider increasing this timeout under any of these circumstances:
[...]
The connection is upgraded to a WebSocket.
Answer:
You should be able to increase the timeout for your backend service with the help of this support document.
Mind you, depending on your configuration there could be more proxies involved which might complicate things.
Alternatively you can try to prevent any timeout by adding a heartbeat mechanism to the Shiny application. Some ways of doing this have been discussed in this issue on GitHub.

How should I healthcheck an event-driven service

Suppose I have a service which rather than listening for http request, or gRPC procedure calls only consumes messages from a broker (Kafka, rabbitMQ, Google Pub/Sub, what have you). How should I go about healthchecking the service (eg. k8s liveness and readyness probes) ?
Should the service also listen for http solely for the purpose of healthchecking or is there some other technique which can be used ?
Having the service listen to HTTP solely to expose a liveness/readiness check (although in services that pull input from a message broker, readiness isn't necessarily something that a container scheduler like k8s would be concerned with) isn't really a problem (and it also opens up the potential to expose diagnostic and control endpoints).
Kubernetes supports three different types of probes, see also Kubernetes docs:
Running a command
Making an HTTP request
Checking a TCP socket
So, in your case you can run a command that fails when your service is unhealthy.
Also be aware that liveness probes may be dangerous to use.

Trouble connecting to gRPC server on AWS Fargate

I have a Python gRPC server running on AWS Fargate (configured very similar to this AWS guide here), and another AWS Fargate task (call it the "client") that attempts to make a connection to my gRPC server (also using Python gRPC). However, the client is unable to make a call to my server, with the following error:
<_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1619057124.216955000","description":"Failed to pick subchannel",
"file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5397,
"referenced_errors":[{"created":"#1619057124.216950000","description":"failed to connect to all addresses",
"file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc",
"file_line":398,"grpc_status":14}]}"
Based on my reading online, it seems like there are myriad situations in which this error is thrown, and I'm having trouble figuring out which one pertains to my case. Here is some additional information:
When running client and server locally, I am able to successfully connect by having the client connect to localhost:[PORT]
I have configured an application load balancer target group following the guide from AWS here that makes health check requests to the / route of my gRPC server, using the gRPC protocol, and expect gRPC response code 12 (UNIMPLEMENTED); these health check requests are coming back as expected, which I believe implies the load balancer is able to successfully communicate with the server (although I could be misunderstanding)
I configured a service discovery system (following this guide here) that should allow me to reach my gRPC server within my VPC via the name service-name.dev.co.local. I can confirm that the corresponding DNS record exists in Route 53, and when I SSH into my VPC, I am indeed able to ping service-name.dev.co.local successfully.
Anyone have any ideas? Would appreciate any and all advice, and I'm happy to answer any further questions.
Thank you for your help!
on your grpc server use 0.0.0.0:[port] and expose this port with TCP on your container.

Fixing kubernetes service redeploy errors with keep-alive enabled

We have a kubernetes service running on three machines. Clients both inside and outside of our cluster talk to this service over http with the keep-alive option enabled. During a deploy of the service, the exiting pods have a readiness check that starts to fail when shutdown starts, and are removed from the service endpoints list appropriately, however they still receive traffic and some requests fail as the container will abruptly exit. We believe this is because of the keep-alive which allows the the client to re-use these connections that were established when the host was Ready. Is there a series of steps one should follow to make sure we don't run into these issues? We'd like to allow keep-alive connections if at all possible.
The issue happens if the proxying/load balancing happens in layer 4 instead of layer 7. For the internal services (Kubernetes service of type ClusterIP), since the Kube-proxy does the proxying using layer 4 proxying, the clients will keep the connection even after the pod isn't ready to serve anymore. Similarly, for the services of type LoadBalancer, if the backend type is set to TCP (which is by default with AWS ELB), the same issue happens. Please see this issue for more details.
The solution to this problem as of now is:
If you are using a cloud LoadBalancer, go ahead and set the backend to HTTP. For example, You can add service.beta.kubernetes.io/aws-load-balancer-backend-protocol annotation to kubernetes service and set it to HTTP so that ELB uses HTTP proxying instead of TCP.
Use a layer 7 proxy/ingress controller within the cluster to route the traffic instead of sending it via kube-proxy
We're running into the same issue, so just wondering if you figured out a way around this issue. According to this link it should be possible to do so by having a Load Balancer in front of the service which will make direct requests to the pods and handle Keep-Alive connections on it's own.
We will continue to investigate this issue and see if we can find a way of doing zero downtime deployments with keep-alive connections.

Service Fabric and TCP connections

We have developed a TeamViewer-like service where clients connect via SSL to our centralized servers. Other clients can connect to the server as well and we can setup a tunnel through our service to allow peer-to-peer connectivity without NAT or firewall issues.
This works fine with Azure Cloud Services, but we would like to move away from Azure Cloud Services. Service Fabric seems to be the way to go, because it supports ARM and also allows much fine-grained services and make updating parts of the system much more easy.
I know that microservices in Service Fabric can be stateful, but all examples use persistent data as state. In my situation the TCP connection is also part of the state. Is it possible to use TCP with service fabric?
The TCP endpoint should be kept alive on the same instance (for several days), so this makes the entire service fabric model much more difficult.
Sure, you can have users connect to your services over any protocol you want. Your service sounds very stateful to me in the same way that user session state is stateful - you want users to return to the same place where their data is. In your case, that "data" is a TCP connection. But there's no guarantee a TCP endpoint will be kept alive for days in any system - machines fail, software crashes, OSes get patched, etc. You need to be prepared for the connection to break so you can quickly re-establish it. Service Fabric stateful services are great for this. Failover of a stateful service to another machine is extremely fast (milliseconds). Of course, you can't actually replicate a live connection, but you sure can replicate all the metadata you need to re-establish a connection if it breaks.

Resources