Spring Boot RabbitMQ health check reports UP even when network connection is down - spring-boot-actuator

Here is the scenario I'm testing. I require a VPN connection to reach the RabbitMQ server. When I start my Spring Boot based application with the VPN up, all is good: the application starts up fine and the RabbitMQ health check reports UP, with the proper version of the RabbitMQ server.
When I disconnect the VPN, the RabbitMQ health check continues to report UP even though the RabbitMQ server isn't reachable anymore.
In the end, what I want is some retry capability (with backoff) for normal RabbitMQ operations, but not for the health check. Health checks should always fail fast while the application tries to recover from a lost network connection. But even in the simplest scenario (without any retries), the health check doesn't hang (yay!) yet doesn't report any failure.
I even followed this with no luck: SpringBoot Disable rabbitTemplate retry policy for rabbit health check
I would appreciate any help/pointers.
Environment:
Java 8 (1.8.0_162)
Spring Boot 1.5.4
spring-amqp & spring-rabbit 1.7.3
RabbitMQ server: 3.7.5
No special rabbitmq configuration aside from host, port, username and password.
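For now I am experimenting with replacing the auto-configured indicator with one that opens a fresh, un-cached connection on every check, along these lines (a rough sketch: the bean name rabbitHealthIndicator is what Boot 1.5 appears to key on when backing off its own indicator, and the host and timeout values are placeholders):

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import org.springframework.boot.actuate.health.AbstractHealthIndicator;
import org.springframework.boot.actuate.health.Health;
import org.springframework.stereotype.Component;

// Replaces the auto-configured indicator (Boot 1.5 matches on this bean name).
@Component("rabbitHealthIndicator")
public class FailFastRabbitHealthIndicator extends AbstractHealthIndicator {

    private final ConnectionFactory factory;

    public FailFastRabbitHealthIndicator() {
        factory = new ConnectionFactory();
        factory.setHost("rabbit-host");     // placeholder; port/credentials set the same way
        factory.setConnectionTimeout(2000); // fail fast instead of hanging on a dead network
    }

    @Override
    protected void doHealthCheck(Health.Builder builder) throws Exception {
        // A fresh connection per check: if the VPN is down, newConnection()
        // throws immediately and the check reports DOWN.
        try (Connection connection = factory.newConnection()) {
            builder.up().withDetail("version",
                    connection.getServerProperties().get("version").toString());
        }
    }
}

Opening a connection per check is heavier than reusing the cached one, but for a health endpoint polled every few seconds that seems an acceptable trade for an honest answer.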

Related

Trouble connecting to gRPC server on AWS Fargate

I have a Python gRPC server running on AWS Fargate (configured very similarly to this AWS guide here), and another AWS Fargate task (call it the "client") that attempts to make a connection to my gRPC server (also using Python gRPC). However, the client is unable to make a call to my server, failing with the following error:
<_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1619057124.216955000","description":"Failed to pick subchannel",
"file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5397,
"referenced_errors":[{"created":"#1619057124.216950000","description":"failed to connect to all addresses",
"file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc",
"file_line":398,"grpc_status":14}]}"
Based on my reading online, it seems like there are myriad situations in which this error is thrown, and I'm having trouble figuring out which one pertains to my case. Here is some additional information:
When running client and server locally, I am able to successfully connect by having the client connect to localhost:[PORT]
I have configured an application load balancer target group following the guide from AWS here that makes health check requests to the / route of my gRPC server, using the gRPC protocol, and expect gRPC response code 12 (UNIMPLEMENTED); these health check requests are coming back as expected, which I believe implies the load balancer is able to successfully communicate with the server (although I could be misunderstanding)
I configured a service discovery system (following this guide here) that should allow me to reach my gRPC server within my VPC via the name service-name.dev.co.local. I can confirm that the corresponding DNS record exists in Route 53, and when I SSH into my VPC, I am indeed able to ping service-name.dev.co.local successfully.
Anyone have any ideas? Would appreciate any and all advice, and I'm happy to answer any further questions.
Thank you for your help!
On your gRPC server, bind to 0.0.0.0:[port] rather than localhost, and expose that port with TCP on your container.
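The server in the question is Python, where the fix is to pass '0.0.0.0:[port]' (or '[::]:[port]') instead of 'localhost:[port]' to add_insecure_port. To illustrate the same idea, here is a wildcard bind sketched with grpc-java (the port is a placeholder and the service registration is left as a comment):

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.net.InetSocketAddress;

public class WildcardBindServer {
    public static void main(String[] args) throws Exception {
        Server server = NettyServerBuilder
                .forAddress(new InetSocketAddress("0.0.0.0", 50051)) // all interfaces, placeholder port
                // .addService(new MyServiceImpl())  // your generated service implementation
                .build()
                .start();
        server.awaitTermination();
    }
}

Bound to localhost only, the server is reachable from inside its own container but not through the container's mapped port, which is consistent with the "failed to connect to all addresses" symptom.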

AMQP handshake timeout error while deploying AWS Corda Enterprise Template

I am deploying the AWS Corda Enterprise Template. The Quick Start deployed the stack as per the defined CloudFormation template. I can see two AWS instances up and running as Corda nodes, in a hot-cold setup with a load balancer.
However, the log for the Corda node has the following ERROR related to AMQP communication.
[ERROR] 2018-10-18T05:47:55,743Z [Thread-3 (ActiveMQ-scheduled-threads)] core.server.lambda$channelActive$0 - AMQ224088: Timeout (10 seconds) while handshaking has occurred. {}
What could be the possible reason for this error? It keeps occurring at a regular interval, so it looks like a connectivity issue to me.
Note: The load balancer shows the status of these AWS Corda instances as healthy (In Service), so I believe the Corda node has booted up successfully.
The ERROR message isn't necessarily tied to AMQP. Perhaps you were confused by the "AMQ" in the error ID (AMQ224088)?
In any event, this error indicates that something on the network is connecting to the ActiveMQ Artemis broker, but it's not completing any protocol handshake. This is commonly seen with, for example, load balancers that do a health check by creating a socket connection without sending any real data just to see if the port is open on the target machine.
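To make that concrete, a port-only health check is essentially the following (the host and port are placeholders):

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortOnlyHealthCheck {
    public static void main(String[] args) throws Exception {
        // Open the port, send nothing, close. The load balancer sees "port
        // open" and reports healthy, while the broker sees a connection that
        // never completes a handshake and logs AMQ224088 once its 10-second
        // timer fires.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("corda-node-host", 10002), 2000);
        }
    }
}

If the log noise is a problem, newer Artemis releases expose a handshake-timeout setting on the acceptor URL that can be raised or disabled, though that is worth verifying against the Artemis version embedded in your Corda build.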

Kafka Receives Messages But Fails To Add To Topic - With Setup Local Kafka VM and Minikube Kubernetes Cluster

Set Up
Laptop with:
Kafka in a VirtualBox VM (Vagrant), with port 9092 forwarded from the laptop's localhost
Kubernetes cluster in a VirtualBox VM: minikube
Desired Outcome
Microservices on my minikube cluster can fire messages to the Kafka VM.
Note that this works in Google Container Engine (GKE)
Actual Outcome
From the laptop I can use a console producer to send messages to the Kafka VM, and it happily obliges, adding them to the topic. But when a microservice from the Kubernetes cluster sends a message, the message is received yet never added to the topic.
Instead I get this error on the microservice ...
Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for generated-test-script-0
If I tail kafka-request.log I see ...
[2017-02-08 21:57:05,891] TRACE Completed request:{api_key=3,api_version=1,correlation_id=0,client_id=producer-5} -- {topics=[generated-test-script]} from connection 10.0.2.15:9092-10.0.2.2:50124;totalTime:0,requestQueueTime:0,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:User:ANONYMOUS (kafka.request.logger)
Whereas in the "success" case, when I simply use a console producer on the laptop, I see two lines: one the same as above, and a second that I guess is the ACK ...
[2017-02-08 22:08:12,764] TRACE Completed request:{api_key=3,api_version=2,correlation_id=0,client_id=console-producer} -- {topics=[test]} from connection 10.0.2.15:9092-10.0.2.2:50748;totalTime:6,requestQueueTime:0,localTime:6,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:User:ANONYMOUS (kafka.request.logger)
[2017-02-08 22:08:13,799] TRACE Completed request:{api_key=0,api_version=2,correlation_id=1,client_id=console-producer} -- {acks=1,timeout=1500,topic_data=[{topic=test,data=[{partition=0,record_set=java.nio.HeapByteBuffer[pos=0 lim=39 cap=39]}]}]} from connection 10.0.2.15:9092-10.0.2.2:53696;totalTime:22,requestQueueTime:1,localTime:21,remoteTime:0,responseQueueTime:0,sendTime:0,securityProtocol:PLAINTEXT,principal:User:ANONYMOUS (kafka.request.logger)
Conclusion And Thoughts
So there is no ERROR as such on the Kafka server side, just on the client. My guess is that this is a network setup issue (NAT?) whereby the microservice in the virtual Kubernetes cluster can talk to my Kafka VM but the reply route is dropped.
Kafka must return metadata in response to the first message sent, so hacks like setting the batch size to 0 or "acks" to 0 don't really help, because that initial metadata still has to make it back to the producer.
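One way to test that guess is to ask the broker what address it is handing back in that metadata (a diagnostic sketch, assuming a kafka-clients version that ships AdminClient, i.e. 0.11+; the bootstrap address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class AdvertisedAddressCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "10.0.2.2:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // This host:port is what every producer is told to connect to for
            // the actual produce request; it must be reachable from inside the
            // minikube VM, not just from the laptop.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.println("Broker advertises " + node.host() + ":" + node.port());
            }
        }
    }
}

If the printed address is only reachable from the laptop (for example, only via the Vagrant port forward), produce requests from the pods will expire exactly as above; advertised.listeners in the broker's server.properties controls what gets handed out.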
Any thoughts or pointers would be great as I really want to run this cluster and Kafka VM locally for dev work.

Service Fabric and TCP connections

We have developed a TeamViewer-like service where clients connect via SSL to our centralized servers. Other clients can connect to the server as well and we can setup a tunnel through our service to allow peer-to-peer connectivity without NAT or firewall issues.
This works fine with Azure Cloud Services, but we would like to move away from them. Service Fabric seems to be the way to go, because it supports ARM and also allows much finer-grained services, making it much easier to update parts of the system.
I know that microservices in Service Fabric can be stateful, but all the examples use persistent data as state. In my situation the TCP connection is also part of the state. Is it possible to use TCP with Service Fabric?
The TCP endpoint should be kept alive on the same instance (for several days), which makes the whole Service Fabric model much more difficult.
Sure, you can have users connect to your services over any protocol you want. Your service sounds very stateful to me in the same way that user session state is stateful - you want users to return to the same place where their data is. In your case, that "data" is a TCP connection. But there's no guarantee a TCP endpoint will be kept alive for days in any system - machines fail, software crashes, OSes get patched, etc. You need to be prepared for the connection to break so you can quickly re-establish it. Service Fabric stateful services are great for this. Failover of a stateful service to another machine is extremely fast (milliseconds). Of course, you can't actually replicate a live connection, but you sure can replicate all the metadata you need to re-establish a connection if it breaks.
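To illustrate that last point generically (this is not Service Fabric API code; every name below is made up): keep the connection metadata as the replicated state, and rebuild the socket from it with backoff whenever it drops:

import java.io.Serializable;
import java.net.InetSocketAddress;
import java.net.Socket;

public class TunnelReconnector {
    // The replicable "state": everything needed to rebuild the connection,
    // but never the live socket itself, which cannot be replicated.
    static class TunnelMetadata implements Serializable {
        final String host; final int port; final String sessionToken;
        TunnelMetadata(String host, int port, String sessionToken) {
            this.host = host; this.port = port; this.sessionToken = sessionToken;
        }
    }

    static Socket reconnect(TunnelMetadata meta) throws InterruptedException {
        long backoffMs = 100;
        while (true) {
            try {
                Socket socket = new Socket();
                socket.connect(new InetSocketAddress(meta.host, meta.port), 5000);
                // Re-authenticate / re-bind the tunnel using meta.sessionToken here.
                return socket;
            } catch (Exception e) {
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 30_000); // exponential backoff, capped
            }
        }
    }
}

In a stateful service, TunnelMetadata is what would live in the replicated store; after a failover the new primary simply re-runs the reconnect loop.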

gRPC client reconnect inside Kubernetes

If we define our microservice inside Kubernetes pods, do we need to instrument gRPC client reconnection in case the service pod restarts?
When the pod restarts, the host name does not change, but we cannot guarantee the IP address stays the same. So will the gRPC client still be able to detect the new server and reconnect to it?
When the TCP connection is disconnected (because the old pod stopped) gRPC's channel will attempt to reconnect with exponential backoff. Each reconnect attempt implies resolving the DNS address, although it may not detect the new address immediately because of the TTL (time-to-live) of the old DNS entry. Also, I believe some implementations resolve the address when a failure is detected instead of before an attempt.
This process happens naturally without your application doing anything, although it may experience RPC failures until the connection is re-established. Enabling "wait for ready" on an RPC would reduce the chances the RPC fails during this period of transition, although such an RPC generally implies you don't care about response latency.
If the DNS address is not (eventually) re-resolved, then that would be a bug and you should file an issue.
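For reference, "wait for ready" in grpc-java is a per-stub setting; a minimal sketch (GreeterGrpc stands in for whatever stub your .proto generates, and the address and deadline are placeholders):

import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class WaitForReadyClient {
    public static void main(String[] args) {
        // The channel targets the service's DNS name; on a pod restart gRPC
        // re-resolves the name and reconnects with exponential backoff on its own.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("my-service.default.svc.cluster.local", 50051)
                .usePlaintext()
                .build();

        // withWaitForReady(): while the channel is in TRANSIENT_FAILURE during
        // the restart, RPCs queue until the connection is back (or the
        // deadline expires) instead of failing immediately.
        GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel)
                .withWaitForReady()
                .withDeadlineAfter(10, TimeUnit.SECONDS);
    }
}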
You need client-side load balancing, as described here. You can watch the endpoints of a service with the Kubernetes API. I have created a package for the Go programming language; it is on GitHub. Sorry, I haven't written documentation for it yet. The basic concept is to get the service endpoints at startup, then watch the service endpoints for changes.
