SCS kafka consumer attempts to acquire info from a partition that is no longer assigned to it - spring-cloud-stream-binder-kafka

spring-cloud-stream-binder-kafka 3.0.9-RELEASE
spring-boot 2.2.13.RELEASE
Hi, we have a project using Spring Cloud Stream with Kafka and we are having a problem reconnecting the consumers when the broker nodes are down for a period of time.
The problem is that the consumer is not able to reconnect and acquire its partitions because it is trying to check the offset position of a partition that is no longer assigned to it. How can this happen?
The logs are shown below:
2021-06-09T09:39:25.358Z [mecstkac-45-6gvd4] [WARN] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] [Consumer clientId=clientid-0, groupId=groupid-v1] Connection to node 2147483644 (hostnode/10.71.34.4:9092) could not be established. Broker may not be available.
2021-06-09T09:42:30.217Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] [Consumer clientId=clientid-0, groupId=groupid-v1] User provided listener org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer$ListenerConsumerRebalanceListener failed on invocation of onPartitionsAssigned for partitions [topicName1-1]
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition topicName1-1 could be determined
2021-06-09T09:42:30.217Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] Error while processing: null
org.apache.kafka.common.KafkaException: User rebalance callback throws an error
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:403)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition topicName1-1 could be determined
2021-06-09T09:43:03.924Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] Error while processing: null
java.lang.IllegalStateException: You can only check the position for partitions assigned to this consumer.
    at org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1717)
Is it possible that the kafka binder stores information about the previously assigned partition and tries to connect to it even though a rebalance has already been performed and it is now assigned to another consumer?
NOTE: The configuration of the consumer Assignor is the default (RangeAssignor).

Related

Spring kafka - multiple consumers receiving same message

I am using Spring Kafka to consume messages from Kafka. The consumer listener is as below.
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord<String, String> record) {
    logger.info("Received Message from kafka topic " + record.topic() + " with record key " + record.key() + " partition " + record.partition() + " offset " + record.offset());
}
A single instance of the application runs with ConcurrentKafkaListenerContainerFactory concurrency=6.
The topic has 6 partitions.
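For context, the container factory setup being described presumably looks roughly like the sketch below (the bean name, key/value types and consumer factory wiring are assumptions, not taken from the question):

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // Six listener threads for a six-partition topic: each container should own exactly one partition.
    factory.setConcurrency(6);
    return factory;
}

The duplicate delivery then shows up in the logs: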
Time: 5/27/22 6:28:52.864 PM
message: Received Message from kafka topic payment-topic with record key ti9:a1956769-28d2-4329-a0ff-9003003a3cde partition 4 offset 325
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
Time: 5/27/22 6:28:52.864 PM
message: Received Message from kafka topic payment-topic with record key ti9:a1956769-28d2-4329-a0ff-9003003a3cde partition 4 offset 325
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1
threadId: 66
From the above logs, it is clear that two consumers received the same message from the same partition and offset at exactly the same time.
Each thread continues processing the message. In the end, one of the consumers fails with the error below:
Time: 5/27/22 6:28:52.887 PM
message: [Consumer clientId=consumer-payment-consumer-5, groupId=payment-consumer] Offset commit failed on partition payment-topic-4 at offset 326: The coordinator is not aware of this member.
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
Time: 5/27/22 6:28:53.902 PM
message: Error handler threw an exception
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
threadPriority: 5
thrown: org.springframework.kafka.KafkaException: Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
I understand the above error can occur when there is load or when processing a message takes time. In this case processing takes less than a second and there are fewer than 10 messages in the Kafka topic.
Please advise on why multiple consumers are receiving the same message.
Also, the error log says "Offset commit failed on partition payment-topic-4 at offset 326" for the message at offset 325.
Library versions
Spring boot - 2.5.7
org.springframework.kafka.spring-kafka - 2.7.8
org.apache.kafka.kafka-clients - 2.8.1
The processing time of a record has to be less than max.poll.interval.ms; otherwise a rebalance happens, and the offset of the record currently being processed is likely not committed, so the newly assigned consumer fetches again from the previously committed offset for that partition. (The committed offset is always the offset of the next record to consume, which is why the commit for the message at offset 325 is reported as offset 326.)
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#consumerconfigs_max.poll.interval.ms
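If the processing genuinely needs more headroom, both properties can be tuned on the consumer factory. A minimal sketch, assuming a plain Spring Kafka setup with String keys and values (the bootstrap server and the concrete values are placeholders, not taken from the question):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Bean
public ConsumerFactory<String, String> consumerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-consumer");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    // Allow more time between polls and fetch fewer records per poll (example values only).
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000);
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
    return new DefaultKafkaConsumerFactory<>(props);
}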

Spring kafka MessageListener and max.poll.records

I am using Spring Kafka 2.7.8 to consume messages from Kafka.
The consumer listener is as below.
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord<String, String> record) {
}
The onMessage method above receives a single message at a time.
Does this mean max.poll.records is set to 1 by the Spring library, or does it poll 500 records at a time (the default value) and hand them to the method one by one?
The reason for this question is that we often see the errors below together in prod.
We received all 4 errors below for multiple consumers in under a minute.
I am trying to understand whether this is due to an intermittent Kafka broker connectivity issue or due to load. Please advise.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
seek to current after exception; nested exception is org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {topic-9=OffsetAndMetadata{offset=2729058, leaderEpoch=null, metadata=''}}
Consumer clientId=consumer-groupName-5, groupId=consumer] Offset commit failed on partition topic-33 at offset 2729191: The coordinator is not aware of this member.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records
max.poll.records is not changed by Spring; it will take the default (or whatever you set it to). The records are handed to the listener one at a time before the next poll.
This means that your listener must be able to process max.poll.records within max.poll.interval.ms.
You need to reduce max.poll.records and/or increase max.poll.interval.ms so that you can process the records in that time, with a generous margin, to avoid these rebalances.
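To make the one-record-at-a-time behaviour concrete: with the default record-mode container the listener is invoked once per record, even though a single poll() may return up to max.poll.records records; only switching the factory to batch mode changes the method signature. A rough sketch, assuming String keys and values and a batch-enabled factory named batchListenerFactory (these names are illustrative, not from the question):

import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;

// Record mode (default): one record per invocation, delivered from the current poll batch.
// All records from that poll must still be processed within max.poll.interval.ms.
@KafkaListener(topics = "topicName", groupId = "groupId", containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord<String, String> record) {
    // process a single record
}

// Batch mode: requires factory.setBatchListener(true) on the container factory.
@KafkaListener(topics = "topicName", groupId = "groupId", containerFactory = "batchListenerFactory")
public void onBatch(List<ConsumerRecord<String, String>> records) {
    // process the whole poll batch in one invocation
}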

How to resolve notary double spend error in the running network? - Unable to notarise the transactions

There is a dev network in my project which includes the Accounts SDK and Token SDK. The project is about transferring custom tokens between accounts on the same node as well as on different nodes. The network has suddenly stopped notarising the transfer transaction, which had been working for a long time. Now I am not able to transfer tokens; I can only transfer a token if I issue a fresh token and transfer it. After two or three transfer transactions, the network again stops notarising the transaction.
Throwing
java.util.concurrent.ExecutionException: net.corda.core.flows.NotaryException: Unable to notarise transaction AD57C34EA40953561B90C5EB832336E4E42CD2C876CA5030C986B9EB50909AE7 : One or more input states or referenced states have already been used as input states in other transactions. Conflicting state count: 1, consumption details:
14BBC38DF7BDA1FD6CA8E97DB16967FBFC9B18AB95DCD7B1990892C206759FA3(1) -> StateConsumptionDetails(hashOfTransactionId=6190DA0ABAD95B767AB42351368A1D51FCCE164DD08D256F98E065F54B6EE789, type=INPUT_STATE).
To find out if any of the conflicting transactions have been generated by this node you can use the hashLookup Corda shell command.
with root cause net.corda.core.flows.NotaryException (the same "Unable to notarise transaction" message as above).
I am not able to understand why the token transfer is not being notarised.
Why did the network stop notarising the token transfer transactions?
How can I resolve the problem?
It looks like the transaction used to transfer the token uses an input reference that has already been consumed. You can see the transaction id and the output index causing the problem in the error description, which should help you debug the issue.
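If it helps while debugging, the vault can be queried for UNCONSUMED token states before building the transfer, so the flow never selects an input that has already been spent. A rough Java sketch, assuming it runs inside a FlowLogic subclass and that the tokens are Tokens SDK FungibleToken states (illustrative only, not the code from the question):

import java.util.List;
import com.r3.corda.lib.tokens.contracts.states.FungibleToken;
import net.corda.core.contracts.StateAndRef;
import net.corda.core.node.services.Vault;
import net.corda.core.node.services.vault.QueryCriteria;

// List only unconsumed token states held by this node; states already spent by another
// transaction (as reported by the notary) will not appear here.
QueryCriteria.VaultQueryCriteria unconsumedOnly =
        new QueryCriteria.VaultQueryCriteria(Vault.StateStatus.UNCONSUMED);
List<StateAndRef<FungibleToken>> spendableTokens =
        getServiceHub().getVaultService().queryBy(FungibleToken.class, unconsumedOnly).getStates();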

Spring cloud stream unexpected shutdown is not covered by DLQ

We are using Spring Cloud Stream 2.2 with the Kafka binder. Something we have noticed is that if the pod is killed in the middle of doing its job, for whatever reason, the message is never sent to the DLQ.
We manage exceptions by catching the failure to log it first, then sending the failure to another service to keep track of the situation, and finally rethrowing the exception so it is caught by the error channel and captured by the DLQ. This approach works seamlessly for normal failures, but if the failure is triggered externally (like an unexpected shutdown), then we miss the DLQ part, as it seems the corresponding process is killed before it reaches the error channel. I wonder if this is a known issue, as it impacts the at-least-once guarantee of this framework in our use case.
22:34:48.077 INFO Shutting down ExecutorService
22:34:48.135 INFO Consumer stopped
22:34:48.136 INFO stopped org.springframework.integration.kafka.inbound.KafkaMessageDrivenChannelAdapter#5174b135
22:34:48.155 INFO Registering MessageChannel outbox-usermgmt.event.job-creator-outbox-event-syncs.errors
22:34:48.241 INFO Channel 'application.outbox-usermgmt.event.job-creator-outbox-event-syncs.errors' has 1 subscriber(s).
22:34:48.241 INFO Channel 'application.outbox-usermgmt.event.job-creator-outbox-event-syncs.errors' has 0 subscriber(s).
22:34:48.246 INFO Registering MessageChannel progress-report.errors
22:34:48.258 INFO Channel 'application.progress-report.errors' has 0 subscriber(s).
22:34:48.262 INFO Registering MessageChannel job-created.errors
22:34:48.273 INFO Registering MessageChannel progress-report.errors
22:34:48.350 INFO Channel 'application.job-created.errors' has 0 subscriber(s).
22:34:48.366 INFO Registering MessageChannel job-created.errors
22:34:48.458 INFO Removing {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
22:34:48.458 INFO Channel 'application.errorChannel' has 1 subscriber(s).
22:34:48.459 INFO stopped _org.springframework.integration.errorLogger
22:34:48.459 INFO Shutting down ExecutorService 'taskScheduler'
22:34:48.467 WARN Destroy method 'close' on bean with name 'genericSpecificFlexibleDeserializer' threw an exception: java.lang.NullPointerException
22:34:48.472 ERROR Job has failed, Fail to retrieve record's full tree, Connection closed unexpectedly
22:34:48.472 ERROR Fail to retrieve record's full tree
22:34:48.472 DEBUG Sending progress update of 0.0 with status of failed
22:34:48.474 ERROR Job has failed, Fail to retrieve record's full tree
22:34:48.538 INFO Closing JPA EntityManagerFactory for persistence unit 'default'
22:34:48.538 INFO Shutting down ExecutorService
22:34:48.541 INFO HikariPool-1 - Shutdown initiated...
22:34:48.543 INFO HikariPool-1 - Shutdown completed.
Code snippet:
try {
    ...
} catch (Exception ex) {
    // capture the failure details in logs
    // send failure progress update to another service
    throw new JobProcessingException(ex);
}
It appears the framework commits the offset before ensuring that the DLQ message is published to Kafka, so the offset has moved on but the message was skipped because nothing was published to the DLQ.
P.S.: This scenario happens for us whenever Kubernetes sends a restart signal to the pod for whatever reason, like pod eviction, a new release, etc. So I suppose that if the kill signal had been forced, we would not have had the commit in the first place and the job would have been restarted.
This is a known problem - see https://github.com/spring-projects/spring-integration/issues/3450
The issue is that a PublishSubscribeChannel allows zero subscribers and no exception is thrown if there are none.
It has been resolved in Spring Integration (5.4.x) but is still a problem in the binder because it creates a pub/sub error channel by default.
See my comment there...
Yes; I think that solution makes sense; it shouldn't cause any real problems because the default errorChannel always gets one subscriber.
However, it won't solve the problem in the binder because the message producer gets a binding-specific error channel (which is bridged to the global error channel), so we'd need a similar change there.
It should be possible to work around it by declaring the binding's error channel as a DirectChannel @Bean, in which case an exception will be thrown if the consumer has unsubscribed (during shutdown). However, this means errors will only go to the binding-specific error channel and won't be bridged to the global errorChannel.
https://github.com/spring-cloud/spring-cloud-stream/issues/2082
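A minimal sketch of that DirectChannel workaround, assuming the binding's error channel name follows the destination.group.errors pattern shown in the logs above (the bean name must match your own binding):

import org.springframework.context.annotation.Bean;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.messaging.MessageChannel;

// Declaring the binding-specific error channel as a DirectChannel means a send with no
// subscribers (e.g. during shutdown) throws instead of being silently dropped.
@Bean(name = "outbox-usermgmt.event.job-creator-outbox-event-syncs.errors")
public MessageChannel jobCreatorBindingErrors() {
    return new DirectChannel();
}

As noted above, with this in place errors go only to the binding-specific channel and are no longer bridged to the global errorChannel.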

Suspended orchestration service instance re-throwing the same unexpected exception after Resume

I am getting the error below when I try to resume a Suspended (resumable) orchestration instance.
Scenario: The request went through a DB2 static solicit-response port and failed because access permission was denied. I can see two suspended instances in the admin console: one related to the port and another related to the orchestration. After fixing the credentials, the suspended port instance resumed, but the orchestration instance keeps failing.
Uncaught exception (see the 'inner exception' below) has suspended an instance of service 'Orchestration name'.
The service instance will remain suspended until administratively resumed or terminated.
If resumed the instance will continue from its last persisted state and may re-throw the same unexpected exception.
InstanceId: ca927086-465d-40e8-93fe-c3a0e4c161f7
Shape name:
ShapeId:
Exception thrown from: segment -1, progress -1
Inner exception: An error occurred while processing the message, refer to the details section for more information
Message ID: {96B72521-9833-48EF-BB2F-4A2E2265D697}
Instance ID: {F6FBC912-C9DC-489C-87F3-103FA1273FDC}
Error Description: The user does not have the authority to access the host resource. Check your authentication credentials or contact your system administrator. SQLSTATE: HY000, SQLCODE: -1000
Exception type: XlangSoapException
Source: Microsoft.XLANGs.BizTalk.Engine
Target Site: Void VerifyTransport(Microsoft.XLANGs.Core.Envelope, Int32, Microsoft.XLANGs.Core.Context)
The following is a stack trace that identifies the location where the exception occured
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXPortBase.VerifyTransport(Envelope env, Int32 operationId, Context ctx)
at Microsoft.XLANGs.Core.Subscription.Receive(Segment s, Context ctx, Envelope& env, Boolean topOnly)
at Microsoft.XLANGs.Core.PortBase.GetMessageIdForSubscription(Subscription subscription, Segment currentSegment, Context cxt, Envelope& env, CachedObject location)
at Microsoft.XLANGs.Core.PortBase.GetMessageId(Subscription subscription, Segment currentSegment, Context cxt, Envelope& env, CachedObject location)
at (StopConditions stopOn)
at Microsoft.XLANGs.Core.SegmentScheduler.RunASegment(Segment s, StopConditions stopCond, Exception& exp)
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Any thoughts on how to fix this?
Creating the above scenario using the samples:
1. Go to the BizTalk samples/orchestrations/consumeWebservice folder, install the ConsumeWebService application and publish POWebservice to IIS.
2. Change the IIS directory security permissions for POWebservice: remove anonymous or any other access.
3. Now drop the message and you will see suspended messages because of HTTP status 401: Access Denied; then give access to POWebservice, either anonymous or Windows.
4. Then resume the suspended instances: one will disappear, but the other (orchestration) one won't.
The orchestration will continue to fail with the exception because, when it was suspended, the last persistence point was the receipt of the exception. This means that when resumed the orchestration will restart and re-throw the exception.
Here's an article discussing some points at which orchestration state is persisted to the database: http://blogs.msdn.com/b/sanket/archive/2006/11/12/understanding-persistence-points-in-biztalk-orchestration.aspx
You can manipulate this to some extent in your orchestration design, as Richard Seroter discusses here, but generally you would do better to use failed message routing, which enables you to handle the failed messages and terminate the failed orchestration instance.
Please correct me if I'm wrong, but is this not just normal BizTalk behavior? I am not 100% sure, so please let me know if this is wrong:
The outbound messaging instance was suspended because the credentials the port was using to connect to the DB were wrong.
This caused the orchestrations making these calls to also suspend.
The suspended message instance was resumed and processed correctly because the problem was fixed, so the call was made to the DB.
However, the orchestration instance may not be able to resume, because when resumed it finds itself at the most recent persistence point, and the original error delivered back from the send port is still available to the orchestration, causing it to re-suspend.
In the error message, it actually says "If resumed the instance will continue from its last persisted state and may re-throw the same unexpected exception."
If you want to handle this sort of thing, you could make the call to the database atomic. That way the orchestration will not persist itself at the point of making the DB call. If the orchestration then suspends, it will resume at a point before the DB call is made and will make the DB call as normal, which should succeed this time because you have fixed the original issue.
The only problem with this is if your DB call cannot be executed more than once with the same data without bad things happening (i.e. it is not idempotent).
I am not 100% sure about the above explanation. Please point out if my understanding is incorrect.
This scenario not being handled by Microsoft BizTalk = middleware FAIL.
You have to solve this at the orchestration design stage, up front...
http://seroter.wordpress.com/2007/01/02/orchestration-handling-of-suspended-messages/
