Kafka Listeners stop reading from topics after a few hours - spring-kafka

An app I have been working on has started causing issues in our staging and production environments, seemingly because the Kafka listeners stop reading anything from their assigned topics a few hours after the app starts.
The app runs in a Cloud Foundry environment and has 13 @KafkaListener listeners, each reading from multiple topics matched by its topic pattern. Each listener's pattern matches the same number of topics (each user of the app gets their own topic for each of the 13 listeners, named to match the pattern). Topics have 3 partitions. Auto-scaling is also used, with a minimum of 2 instances of the app running at the same time. One of the topics is under heavier load than the others, receiving between 1 and 200 messages per second. The processing time for each message is short, as we receive batches and processing only writes the batch to a DB.
The current issue is, as stated, that the app works for a while after starting and then the listeners suddenly stop picking up messages, with no apparent error or warning in the logs. A temporary endpoint was created that uses KafkaListenerEndpointRegistry to inspect the listener containers, and all of them appear to be running and have proper partitions assigned. Doing a .stop() and .start() on the containers leads to one additional batch of messages being processed, and then nothing else.
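For reference, a minimal sketch of what such a temporary diagnostic endpoint could look like (the path, wiring, and output format here are illustrative assumptions, not the actual code):

@RestController
public class ListenerDebugController {

    private final KafkaListenerEndpointRegistry registry;

    public ListenerDebugController(KafkaListenerEndpointRegistry registry) {
        this.registry = registry;
    }

    // Reports, for every listener container, whether it is running and which
    // partitions are currently assigned to it.
    @GetMapping("/debug/listeners")
    public Map<String, String> listeners() {
        Map<String, String> status = new LinkedHashMap<>();
        for (MessageListenerContainer container : registry.getListenerContainers()) {
            status.put(container.getListenerId(),
                    "running=" + container.isRunning()
                            + ", assignedPartitions=" + container.getAssignedPartitions());
        }
        return status;
    }
}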
The following are the configs used:
@Bean
public ConsumerFactory<String, String> consumerFactory() {
    return new DefaultKafkaConsumerFactory<>(kafkaConfig.getConfiguration());
}

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory());
    factory.setBatchListener(true);
    factory.setConcurrency(3);
    factory.getContainerProperties().setPollTimeout(5000);
    factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
    return factory;
}
The kafkaConfig sets the following settings:
PARTITION_ASSIGNMENT_STRATEGY_CONFIG: RoundRobinAssignor
MAX_POLL_INTERVAL_MS_CONFIG: 60000
MAX_POLL_RECORDS_CONFIG: 10
MAX_PARTITION_FETCH_BYTES_CONFIG: Integer.MAX_VALUE
ENABLE_AUTO_COMMIT_CONFIG: false
METADATA_MAX_AGE_CONFIG: 15000
REQUEST_TIMEOUT_MS_CONFIG: 30000
HEARTBEAT_INTERVAL_MS_CONFIG: 15000
SESSION_TIMEOUT_MS_CONFIG: 60000
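The settings above would correspond to a consumer properties map roughly along these lines (a sketch only; the bootstrap-servers value is a placeholder, and any other properties the real kafkaConfig sets are omitted):

Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<bootstrap-servers>"); // placeholder
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 60000);
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, Integer.MAX_VALUE);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, 15000);
props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15000);
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60000);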
Additionally, each listener is in its own class and has the listen method as follows:
@KafkaListener(id = "<patternName>-container", topicPattern = "<patternName>.*", groupId = "<patternName>Group")
public void listen(@Payload List<String> payloads,
                   @Header(KafkaHeaders.RECEIVED_TOPIC) String topics,
                   Acknowledgment acknowledgment) {
    // processPayload...
    acknowledgment.acknowledge();
}
The spring-kafka version is 2.7.4.
Is there an issue with this config that could explain the behaviour? I have recently tried multiple changes with no success: changing these config settings around, moving the @KafkaListener annotation to class level, restarting the listener containers when they stop reading, and even doing all message processing asynchronously and acknowledging the messages the moment they are picked up by the listener method. There were no error or warning logs, and I wasn't able to see anything helpful with debug logging because of the number of messages sent each second. We also have another app running with the same settings in the same environments, but with only 3 listeners (different topic patterns), where this issue does not occur. It is under a similar load, as the messages received by those 3 listeners are output to the topic causing the large load on the app with the problem.
I would very much appreciate any help or pointers to what else I can do, since this issue is blocking us heavily in production. Let me know if I missed something that could help.
Thank you.

Most problems like this are due to the listener thread being stuck in user code someplace; take a thread dump when this happens to see what the threads are doing.
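If attaching jstack or jvisualvm to a Cloud Foundry instance is awkward, one option (a sketch; the class and path names here are made up) is a temporary endpoint that returns a programmatic thread dump, so the consumer threads' stacks can be inspected at the moment the listeners stall:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

@RestController
public class ThreadDumpController {

    // Dumps every live thread's stack; look for the Kafka consumer threads and
    // check whether they are blocked inside the listener's processing code.
    @GetMapping(value = "/debug/threads", produces = MediaType.TEXT_PLAIN_VALUE)
    public String threadDump() {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
            sb.append(info);
        }
        return sb.toString();
    }
}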

Related

Out of Memory due to multiple consumers ActiveMQ

I am using DefaultMessageListenerContainer as below:
private static final AnnotationConfigApplicationContext context = new AnnotationConfigApplicationContext(MessageConsumer.class);
public static final DefaultMessageListenerContainer container = context.getBean(DefaultMessageListenerContainer.class);
For a given queue that my listener listens to, the first time my program runs it starts the listener, and this creates a Consumer for the queue that I can see in the ActiveMQ console.
The problem I have is that every time I check in some new code, another new Consumer is created for the queue, and the old one is still hanging around, eventually causing an Out of Memory issue.
What am I doing wrong here? How do I make sure there is only 1 consumer and the old consumer is killed with every new code check-in? Hope I explained the issue clearly.

How to deduplicate events when using RabbitMQ Publish/Subscribe Microservice Event Bus

I have been reading This Book on page 58 to understand how to do asynchronous event integration between microservices.
Using RabbitMQ and publish/subscribe patterns facilitates pushing events out to subscribers. However, given microservice architectures and Docker usage, I expect to have more than one instance of a microservice 'type' running. From what I understand, all instances will subscribe to the event and therefore would all receive it.
The book doesn't clearly explain how to ensure only one of the instances handle the request.
I have looked into the duplication section, but that describes a pattern for deduplicating within a service instance, not necessarily across instances...
Each microservice instance would subscribe using something similar to:
public void Subscribe<T, TH>()
    where T : IntegrationEvent
    where TH : IIntegrationEventHandler<T>
{
    var eventName = _subsManager.GetEventKey<T>();
    var containsKey = _subsManager.HasSubscriptionsForEvent(eventName);
    if (!containsKey)
    {
        if (!_persistentConnection.IsConnected)
        {
            _persistentConnection.TryConnect();
        }
        using (var channel = _persistentConnection.CreateModel())
        {
            channel.QueueBind(queue: _queueName,
                              exchange: BROKER_NAME,
                              routingKey: eventName);
        }
    }
    _subsManager.AddSubscription<T, TH>();
}
I need to understand how multiple instances of the same 'type' of microservice can deduplicate without losing the message if a service goes down while processing.
"From what I understand, all instances will subscribe to the event and therefore would all receive it."
Only one instance of a subscriber will process a given message/event. When you have multiple instances of a service running and subscribed to the same subscription, the first one to pick up the message makes it invisible to the others (the visibility timeout). If that service instance is able to process the message in the given time, it tells the queue to delete the message; if it is not able to process it in time, the message reappears in the queue for any instance to pick up again.
All standard service buses (RabbitMQ, SQS, Azure Service Bus, etc.) provide this feature out of the box.
By the way, I have read this book and used the above code from eShopOnContainers, and it works the way I described.
You should look into the following pattern as well:
Competing Consumers pattern
Hope that helps!
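As a rough illustration of the competing-consumers idea described above, here is a minimal sketch using the plain RabbitMQ Java client (the queue name, host, and handler are assumptions): every service instance consumes from the same named queue, so the broker delivers each message to only one of them, and manual acks ensure a message is redelivered if an instance dies mid-processing.

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class CompetingConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumption: local broker
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // All instances declare and consume from the same durable queue.
        channel.queueDeclare("order-events", true, false, false, null);
        channel.basicQos(1); // hand each instance one unacked message at a time

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            // process the event...
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        // autoAck=false: if this instance dies before acking, the broker redelivers.
        channel.basicConsume("order-events", false, onDeliver, consumerTag -> { });
    }
}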

JMS - Cannot retrieve message from queue. Happens intermittently

We have a Java class that listens to a database (Oracle) queue table and processes records placed in that queue. It worked normally in the UAT and development environments. Since deployment to production, there are times when it cannot read a record from the queue: when a record is inserted, it is not detected and the record remains in the queue. This seldom happens, but it happens. To give a statistic, out of 30 records queued in a day, about 8 don't make it. We would need to restart the whole app for it to be able to read the records.
Here is a code snippet of my class:
public class SomeListener implements MessageListener {

    public void onMessage(Message msg) {
        InputStream input = null;
        try {
            TextMessage txtMsg = (TextMessage) msg;
            String text = txtMsg.getText();
            input = new ByteArrayInputStream(text.getBytes());
        } catch (Exception e1) {
            logger.error("Parsing from the queue.... failed", e1);
            e1.printStackTrace();
        }
        // process text message
    }
}
The weird thing is we can't find any traces of exceptions in the logs.
Can anyone help? By the way, we set the receiveTimeout to 10 seconds.
We would need to restart the whole app for it to be able to read the records.
The most common reason for this is the listener thread is "stuck" in user code (//process text message). You can take a thread dump with jstack or jvisualvm or similar to see what the thread is doing.
Another possibility (with low volume apps like this) is the network (most likely a router someplace in the network) silently closes an idle socket because it has not been used for some time. If the container (actually the broker's JMS client library) doesn't know the socket is dead, it will never receive any more messages.
The solution to the first is to fix the code; the solution to the second is to enable some kind of heartbeat or keepalives on the connection so that the network/router does not close the socket when it has no "real" traffic on it.
You would need to consult your broker's documentation about configuring heartbeats/keepalives.
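For example (illustrative only, since the broker in this question isn't named and the exact options differ per broker), ActiveMQ's JMS client exposes keep-alive and inactivity options on the connection URI; the option names below are taken from ActiveMQ's TCP transport reference and are an assumption about what a fix might look like, not the asker's setup:

// keepAlive enables keep-alive traffic on the socket, and maxInactivityDuration
// controls how long the connection may stay silent before being considered dead.
ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory(
        "failover:(tcp://broker-host:61616?keepAlive=true&wireFormat.maxInactivityDuration=30000)");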

ActiveMQ Override scheduled message

I am trying to implement a delayed queue with overriding of messages using ActiveMQ.
Each message is scheduled to be delivered with a delay of x (say 60 seconds).
If the same message is received again within that window, it should override the previous message, so even if I receive, say, 10 messages within x seconds, only one message should be processed.
Is there a clean way to accomplish this?
The question has two parts that need to be addressed separately:
Can a message be delayed in ActiveMQ?
Yes - see Delay and Schedule Message Delivery. You need to set <broker ... schedulerSupport="true"> in your ActiveMQ config, as well as setting the AMQ_SCHEDULED_DELAY property of the JMS message to say how long you want the message to be delayed (60000 in your case).
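A minimal sketch of the producer side (session and producer setup omitted; ScheduledMessage here refers to org.apache.activemq.ScheduledMessage):

// Delay delivery by 60 seconds, matching the x = 60s from the question.
TextMessage message = session.createTextMessage("payload");
message.setLongProperty(ScheduledMessage.AMQ_SCHEDULED_DELAY, 60_000L);
producer.send(message);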
Is there any way to prevent the same message being consumed more than once?
Yes, but that's an application concern rather than an ActiveMQ one. It's often referred to as de-duplication or idempotent consumption. The simplest way, if you only have one consumer, is to keep track of messages already received in a map and check that map when you receive a message. If it has been seen before, discard it.
For more complex use cases where you have multiple consumers on different machines, or you want that state to survive application restart, you will need to keep a table of messages seen in a database, and query it each time.
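For the single-consumer case, a rough sketch of the in-memory approach (the "eventId" property name and the Set-based store are assumptions for illustration):

// Track identifiers of messages already handled; duplicates are discarded.
private final Set<String> seenIds = ConcurrentHashMap.newKeySet();

public void onMessage(Message msg) throws JMSException {
    String id = msg.getStringProperty("eventId"); // assumed business key on the message
    if (!seenIds.add(id)) {
        return; // already processed by this consumer, drop the duplicate
    }
    // process the message...
}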
Please vote this answer up if it helps, as it encourages people to help you out.
Also, according to this method from the ActiveMQ BrokerService class, you should configure persistence to be able to use the scheduler functionality.
public boolean isSchedulerSupport() {
    return this.schedulerSupport && (isPersistent() || jobSchedulerStore != null);
}
You can configure the ActiveMQ broker to enable schedulerSupport with the following entry in your activemq.xml file, located in the conf directory of your ActiveMQ home directory:
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="localhost" dataDirectory="${activemq.data}" schedulerSupport="true">
You can override the BrokerService in your configuration:
@Configuration
@EnableJms
public class JMSConfiguration {

    @Bean
    public BrokerService brokerService() throws Exception {
        BrokerService brokerService = new BrokerService();
        brokerService.setSchedulerSupport(true);
        return brokerService;
    }
}

Are group subscriptions automatically handled on Reconnect?

I have a chat room using a SignalR Hub for its messaging. Occasionally I get reports from users that it 'freezes', which I interpret as no messages coming through; I suspect the connections have been dropped from a group.
My question is, does the connection get re-subscribed back into its groups automatically, or do you have to do something yourself in the Reconnect method:
public Task Reconnect(IEnumerable<string> groups)
{
    return Clients.rejoined(Context.ConnectionId, DateTime.Now.ToString());
}
Yes, in 1.0.0.0-alpha1 you can enable auto-rejoining of groups with the new AutoRejoiningGroupsModule pipeline module, via the EnableAutoRejoiningGroups extension method on the hub pipeline you build. This feature was not available in previous versions of the framework.
So you would end up with this somewhere in your startup code:
GlobalHost.HubPipeline.EnableAutoRejoiningGroups();
UPDATE:
Please note that the final version of SignalR 1.0 made auto-rejoining of groups the default behavior and so EnableAutoRejoiningGroups was removed. You can see this answer for more details.
