Spring kafka - multiple consumers receiving same message - spring-kafka

I am using Spring Kafka to consume messages from Kafka. The consumer listener is as below:
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord record) {
    logger.info("Received Message from kafka topic " + record.topic() + " with record key " + record.key()
            + " partition " + record.partition() + " offset " + record.offset());
}
A single instance of the application runs with a ConcurrentKafkaListenerContainerFactory with concurrency=6.
The topic has 6 partitions.
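For reference, the listener container factory is assumed to look roughly like this (a sketch; the class name and the String key/value types are assumptions, only the bean name and concurrency come from the setup above):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // One consumer thread per partition: concurrency 6 for the 6-partition topic
        factory.setConcurrency(6);
        return factory;
    }
}

The duplicate delivery then shows up in the application logs: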
Time: 5/27/22 6:28:52.864 PM
message: Received Message from kafka topic payment-topic with record key ti9:a1956769-28d2-4329-a0ff-9003003a3cde partition 4 offset 325
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
Time: 5/27/22 6:28:52.864 PM
message: Received Message from kafka topic payment-topic with record key ti9:a1956769-28d2-4329-a0ff-9003003a3cde partition 4 offset 325
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1
threadId: 66
From the above logs, it is clear that two consumers received the same message from the same partition and offset at exactly the same time.
Each thread continues processing the message. In the end, one of the consumers fails with the error below.
Time: 5/27/22 6:28:52.887 PM
message: [Consumer clientId=consumer-payment-consumer-5, groupId=payment-consumer] Offset commit failed on partition payment-topic-4 at offset 326: The coordinator is not aware of this member.
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
Time: 5/27/22 6:28:53.902 PM
message: Error handler threw an exception
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
threadPriority: 5
thrown: org.springframework.kafka.KafkaException: Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
I understand the above error can occur when there is load or when processing a message takes a long time. In this case processing takes less than a second and there are fewer than 10 messages in the Kafka topic.
Please advise on why multiple consumers are receiving the same message.
Also, the error log says "Offset commit failed on partition payment-topic-4 at offset 326" for the message at offset 325.
Library versions:
Spring Boot - 2.5.7
org.springframework.kafka:spring-kafka - 2.7.8
org.apache.kafka:kafka-clients - 2.8.1

The processing time of a record has to be less than max.poll.interval.ms; otherwise a rebalance happens and the offset of the record currently being processed is likely not committed, so the newly assigned consumer fetches again from the previously committed offset for that partition. That also explains the offsets in the log: the committed offset is the offset of the next record to consume, which is why the commit for the record at offset 325 is reported at offset 326.
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#consumerconfigs_max.poll.interval.ms
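For completeness, a sketch of where those two settings live in a Spring Kafka consumer configuration (the broker address and the numeric values are illustrative assumptions, not recommendations for any particular workload):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class ConsumerFactoryConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "groupId");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Give the consumer more time between polls before the group coordinator evicts it ...
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000); // example: 10 minutes
        // ... and/or hand back fewer records per poll so each batch finishes sooner.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100); // example value
        return new DefaultKafkaConsumerFactory<>(props);
    }
}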

Related

Spring kafka MessageListener and max.poll.records

I am using Spring Kafka 2.7.8 to consume messages from Kafka.
The consumer listener is as below:
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord record) {
}
The above onMessage method receives a single message at a time.
Does this mean max.poll.records is set to 1 by the Spring library, or does it poll 500 records at a time (the default value) and hand them to the method one by one?
The reason for this question is that we often see the errors below together in prod.
We received all 4 errors below for multiple consumers in under a minute.
I am trying to understand whether this is due to an intermittent Kafka broker connectivity issue or due to load. Please advise.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
seek to current after exception; nested exception is org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {topic-9=OffsetAndMetadata{offset=2729058, leaderEpoch=null, metadata=''}}
Consumer clientId=consumer-groupName-5, groupId=consumer] Offset commit failed on partition topic-33 at offset 2729191: The coordinator is not aware of this member.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records
max.poll.records is not changed by Spring; it will take the default (or whatever you set it to). The records are handed to the listener one at a time before the next poll.
This means that your listener must be able to process max.poll.records within max.poll.interval.ms.
You need to reduce max.poll.records and/or increase max.poll.interval.ms so that you can process the records in that time, with a generous margin, to avoid these rebalances.
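If the change should only apply to this particular listener, spring-kafka also lets the annotation supersede matching consumer-factory properties via its properties attribute; a sketch with placeholder values:

@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory",
        // per-listener overrides of the consumer configuration (values are placeholders)
        properties = {
                "max.poll.records=100",
                "max.poll.interval.ms=600000"
        })
public void onMessage(ConsumerRecord<String, String> record) {
    // all records returned by one poll must be processed within max.poll.interval.ms
}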

SCS kafka consumer attempts to acquire info from a partition that is no longer assigned to it

spring-cloud-stream-binder-kafka 3.0.9-RELEASE
spring-boot 2.2.13.RELEASE
Hi, we have a project using Spring Cloud Stream with Kafka and we are having a problem reconnecting the consumers when the broker nodes are down for a period of time.
The problem is that the consumer is not able to reconnect and acquire the partitions because it is trying to check the offset position of a partition that is no longer assigned to it. How can this happen?
The logs are shown below:
2021-06-09T09:39:25.358Z [mecstkac-45-6gvd4] [WARN] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] [Consumer clientId=clientid-0, groupId=groupid-v1] Connection to node 2147483644 (hostnode/10.71.34.4:9092) could not be established. Broker may not be available.
2021-06-09T09:42:30.217Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] [Consumer clientId=clientid-0, groupId=groupid-v1] User provided listener org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer$ListenerConsumerRebalanceListener failed on invocation of onPartitionsAssigned for partitions [topicName1-1]
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition topicName1-1 could be determined
2021-06-09T09:42:30.217Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] Error while processing: null
org.apache.kafka.common.KafkaException: User rebalance callback throws an error
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:403)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition topicName1-1 could be determined
2021-06-09T09:43:03.924Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] Error while processing: null
java.lang.IllegalStateException: You can only check the position for partitions assigned to this consumer.
    at org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1717)
Is it possible that the kafka binder stores information about the previously assigned partition and tries to connect to it even though a rebalance has already been performed and it is now assigned to another consumer?
NOTE: The configuration of the consumer Assignor is the default (RangeAssignor).

Spring cloud stream unexpected shutdown is not covered by DLQ

We are using Spring Cloud Stream 2.2 with the Kafka binder. Something we have noticed is that if the pod is killed in the middle of a job for whatever reason, the message is never sent to the DLQ.
We manage exceptions by catching the failure to log it first, then sending the failure to another service to keep track of the situation, and finally rethrowing the exception so it is caught by the error channel and captured by the DLQ. This approach works seamlessly for normal failures, but if the failure is triggered externally (like an unexpected shutdown), we miss the DLQ part, as the corresponding process seems to be killed before reaching the error channel. I wonder if this is a known issue, as it impacts the at-least-once guarantee of this framework in our use case.
22:34:48.077 INFO Shutting down ExecutorService
22:34:48.135 INFO Consumer stopped
22:34:48.136 INFO stopped org.springframework.integration.kafka.inbound.KafkaMessageDrivenChannelAdapter#5174b135
22:34:48.155 INFO Registering MessageChannel outbox-usermgmt.event.job-creator-outbox-event-syncs.errors
22:34:48.241 INFO Channel 'application.outbox-usermgmt.event.job-creator-outbox-event-syncs.errors' has 1 subscriber(s).
22:34:48.241 INFO Channel 'application.outbox-usermgmt.event.job-creator-outbox-event-syncs.errors' has 0 subscriber(s).
22:34:48.246 INFO Registering MessageChannel progress-report.errors
22:34:48.258 INFO Channel 'application.progress-report.errors' has 0 subscriber(s).
22:34:48.262 INFO Registering MessageChannel job-created.errors
22:34:48.273 INFO Registering MessageChannel progress-report.errors
22:34:48.350 INFO Channel 'application.job-created.errors' has 0 subscriber(s).
22:34:48.366 INFO Registering MessageChannel job-created.errors
22:34:48.458 INFO Removing {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
22:34:48.458 INFO Channel 'application.errorChannel' has 1 subscriber(s).
22:34:48.459 INFO stopped _org.springframework.integration.errorLogger
22:34:48.459 INFO Shutting down ExecutorService 'taskScheduler'
22:34:48.467 WARN Destroy method 'close' on bean with name 'genericSpecificFlexibleDeserializer' threw an exception: java.lang.NullPointerException
22:34:48.472 ERROR Job has failed, Fail to retrieve record's full tree, Connection closed unexpectedly
22:34:48.472 ERROR Fail to retrieve record's full tree
22:34:48.472 DEBUG Sending progress update of 0.0 with status of failed
22:34:48.474 ERROR Job has failed, Fail to retrieve record's full tree
22:34:48.538 INFO Closing JPA EntityManagerFactory for persistence unit 'default'
22:34:48.538 INFO Shutting down ExecutorService
22:34:48.541 INFO HikariPool-1 - Shutdown initiated...
22:34:48.543 INFO HikariPool-1 - Shutdown completed.
Code snippet:
try {
...
} catch (Exception ex) {
//capture the failure details in logs
//send failure progress update to another service
throw new JobProcessingException(ex);
}
It appears the framework commits the message before ensuring that the DLQ message is published to Kafka, so the offset has moved but the message was skipped, as nothing was published to the DLQ.
P.S.: This scenario happens for us whenever Kubernetes sends a restart signal to the pod for whatever reason, like pod eviction, a new release, etc. So I suppose if the kill signal were forced, we would not have the commit in the first place and the job would be restarted.
This is a known problem - see https://github.com/spring-projects/spring-integration/issues/3450
The issue is that a PublishSubscribeChannel allows zero subscribers and no exception is thrown if there are none.
It has been resolved in Spring Integration (5.4.x) but is still a problem in the binder because it creates a pub/sub error channel by default.
See my comment there...
Yes; I think that solution makes sense; it shouldn't cause any real problems because the default errorChannel always gets one subscriber.
However, it won't solve the problem in the binder because the message producer gets a binding-specific error channel (which is bridged to the global error channel), so we'd need a similar change there.
It should be possible to work around it by declaring the binding's error channel as a DirectChannel @Bean, in which case an exception will be thrown if the consumer has unsubscribed (during shutdown). However, this means errors will only go to the binding-specific error channel and won't be bridged to the global errorChannel.
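A minimal sketch of that workaround, using the binding error-channel name that appears in the log above (the class and method names are placeholders; verify the channel name against your own binding, which generally follows the "<destination>.<group>.errors" pattern):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.messaging.MessageChannel;

@Configuration
public class BindingErrorChannelConfig {

    // The bean name must match the binder's error channel name for the binding.
    // With a DirectChannel, a send with no subscribers (e.g. during shutdown) throws
    // an exception instead of being silently dropped by the default pub/sub error channel.
    @Bean(name = "outbox-usermgmt.event.job-creator-outbox-event-syncs.errors")
    public MessageChannel jobCreatorOutboxErrorChannel() {
        return new DirectChannel();
    }
}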
https://github.com/spring-cloud/spring-cloud-stream/issues/2082

ChangeFeedProcessorBuilder checkpointing after unsuccessful processing

I was investigating the behavior of a ChangeFeedProcessorBuilder processor that throws an exception or goes down while processing a particular change. Upon recovery, the same change is not picked up anymore. Is there any way to checkpoint only after successful processing of the notification?
The delegate is as follows:
var builder = container.GetChangeFeedProcessorBuilder("migrationProcessor",
    (IReadOnlyCollection<object> input, CancellationToken cancellationToken) =>
    {
        Console.WriteLine(input.Count + " Changes Received by " + a);
        // just the first try will fail ('a' is a static variable)
        if (a++ == 0)
        {
            throw new Exception();
        }
        return Task.CompletedTask;
    });
Thank you!
The default behavior of the Change Feed Processor is to checkpoint after a successful delegate execution: https://learn.microsoft.com/azure/cosmos-db/change-feed-processor#processing-life-cycle
The normal life cycle of a host instance is:
Read the change feed.
If there are no changes, sleep for a predefined amount of time (customizable with WithPollInterval in the Builder) and go to #1.
If there are changes, send them to the delegate.
When the delegate finishes processing the changes successfully, update the lease store with the latest processed point in time and go to #1.
If your delegate handler throws an unhandled exception, there is no checkpoint.
Adding from the comments: the only scenario where the batch might not be retried is if the batch that throws is the first ever (the lease has no continuation), because when the host picks up the lease again to reprocess, it has no point in time to retry from. Based on the official documentation, one lease is owned by a single instance, so there is no way another instance could have picked up the same lease and be processing it in parallel (within the same deployment unit context).

Writing more events to the channel than the transaction capacity leads to a channel full exception

I am using the Flume JMS source to dequeue messages from ActiveMQ and convert each message into a List<Event> using a custom converter.
Channel configuration
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 1500
When the size of the List<Event> is less than or equal to 1500 (the channel transaction capacity), Flume writes the events to the channel, but if the number of events is greater than 1500 then I get the exception below.
Error log
21 Apr 2015 12:19:28,245 WARN [PollableSourceRunner-JMSSource-s1] (org.apache.flume.source.jms.JMSSource.doProcess:263) - Error appending event to channel. Channel might be full. Consider increasing the channel capacity or make sure the sinks perform faster.
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: c1}
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:200)
at org.apache.flume.source.jms.JMSSource.doProcess(JMSSource.java:257)
at org.apache.flume.source.AbstractPollableSource.process(AbstractPollableSource.java:54)
at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.flume.ChannelException: Put queue for MemoryTransaction of capacity 1500 full, consider committing more frequently, increasing capacity or increasing thread count
at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doPut(MemoryChannel.java:84)
at org.apache.flume.channel.BasicTransactionSemantics.put(BasicTransactionSemantics.java:93)
at org.apache.flume.channel.BasicChannelSemantics.put(BasicChannelSemantics.java:80)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:189)
... 4 more
How can I solve this problem?
Note: the number of events varies dynamically based on the ActiveMQ message.
