Spring kafka MessageListener and max.poll.records - spring-kafka

I am using Spring Kafka 2.7.8 to consume messages from Kafka.
The consumer listener is as below:
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord record) {
}
The above onMessage method receives a single message at a time.
Does this mean max.poll.records is set to 1 by the Spring library, or does it poll 500 records at a time (the default value) and hand them to the method one by one?
The reason for this question is that we often see the errors below together in prod. We received all 4 errors below for multiple consumers in under a minute.
I am trying to understand whether this is due to an intermittent Kafka broker connectivity issue or due to load. Please advise.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
seek to current after exception; nested exception is org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {topic-9=OffsetAndMetadata{offset=2729058, leaderEpoch=null, metadata=''}}
Consumer clientId=consumer-groupName-5, groupId=consumer] Offset commit failed on partition topic-33 at offset 2729191: The coordinator is not aware of this member.
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records

max.poll.records is not changed by Spring; it will take the default (or whatever you set it to). The records are handed to the listener one at a time before the next poll.
This means that your listener must be able to process max.poll.records within max.poll.interval.ms.
You need to reduce max.poll.records and/or increase max.poll.interval.ms so that you can process the records in that time, with a generous margin, to avoid these rebalances.
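For example, both settings can be applied on the consumer factory backing kafkaListenerFactory. The sketch below is illustrative only; the bean names, bootstrap servers, and chosen values are assumptions, not your actual configuration:
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Fewer records per poll ...
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
        // ... and/or more time to process them before the broker evicts the consumer.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        return factory;
    }
}
The right numbers depend on how long your listener takes per record under production load; leave a generous margin below max.poll.interval.ms.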

Related

Spring kafka - multiple consumers receiving same message

I am using Spring Kafka to consume messages from Kafka. The consumer listener is as below:
@KafkaListener(topics = "topicName",
        groupId = "groupId",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord record) {
    logger.info("Received Message from kafka topic " + record.topic() + " with record key "
            + record.key() + " partition " + record.partition() + " offset " + record.offset());
}
A single instance of the application runs with ConcurrentKafkaListenerContainerFactory concurrency=6.
The topic has 6 partitions.
Time: 5/27/22 6:28:52.864 PM
message: Received Message from kafka topic payment-topic with record key ti9:a1956769-28d2-4329-a0ff-9003003a3cde partition 4 offset 325
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
Time: 5/27/22 6:28:52.864 PM
message: Received Message from kafka topic payment-topic with record key ti9:a1956769-28d2-4329-a0ff-9003003a3cde partition 4 offset 325
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-3-C-1
threadId: 66
From the above logs, it is clear that 2 consumers received the same message from the same partition and offset at exactly the same time.
Each thread continues processing the message. In the end, one of the consumers fails with the error below:
Time: 5/27/22 6:28:52.887 PM
message: [Consumer clientId=consumer-payment-consumer-5, groupId=payment-consumer] Offset commit failed on partition payment-topic-4 at offset 326: The coordinator is not aware of this member.
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
Time: 5/27/22 6:28:53.902 PM
message: Error handler threw an exception
thread: org.springframework.kafka.KafkaListenerEndpointContainer#0-4-C-1
threadId: 69
threadPriority: 5
thrown:
localizedMessage: Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
message: Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
name: org.springframework.kafka.KafkaException
I understand the above error can occur when there is load or when processing a message takes a long time. In this case processing takes less than a second and there are fewer than 10 messages in the Kafka topic.
Please advise on why multiple consumers are receiving the same message.
Also, the error log says "Offset commit failed on partition payment-topic-4 at offset 326" for the message at offset 325.
Library versions
Spring boot - 2.5.7
org.springframework.kafka.spring-kafka - 2.7.8
org.apache.kafka.kafka-clients - 2.8.1
The time spent processing the records returned by a poll has to be less than max.poll.interval.ms; otherwise a rebalance happens, and the offset of the record currently being processed is likely not committed. The newly assigned consumer then fetches from the previously committed offset for that partition, so the same record is delivered again. (Note also that a commit for the record at offset 325 is written as offset 326, because the committed offset is the position of the next record to fetch.)
https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#consumerconfigs_max.poll.interval.ms
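If you want to confirm whether processing time is actually the problem, one option is to time the listener body and log slow records. This is a minimal sketch, not your code: process() and logger are assumed placeholders, and the 60-second warning threshold is an arbitrary choice well below the default 300000 ms max.poll.interval.ms.
@KafkaListener(topics = "payment-topic", groupId = "payment-consumer",
        containerFactory = "kafkaListenerFactory")
public void onMessage(ConsumerRecord record) {
    long start = System.currentTimeMillis();
    try {
        process(record); // placeholder for your existing business logic
    } finally {
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed > 60_000) { // warn well before max.poll.interval.ms is reached
            logger.warn("Slow record " + record.topic() + "-" + record.partition()
                    + "@" + record.offset() + " took " + elapsed + " ms");
        }
    }
}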

What happens when two nodes attempt to write at the same time in 2PC?

Does anyone know what happens when two nodes try to write data at the same time and both initiate the 2PC protocol? Does the request from one node get aborted while the other succeeds, with the failed node retrying after some exponential backoff?
If not, what happens?
Usually resource managers do not allow two nodes to participate in the same transaction branch at the same time. The second node/binary/thread that tries to join the transaction will probably get a timeout or some other error on the xa_start(..., TMJOIN) call.
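In JTA terms, that failed join surfaces as an XAException thrown by XAResource.start. The sketch below is only an illustration under that assumption; the resource and xid objects would come from your resource manager and transaction manager, and the exact error code depends on the implementation:
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

public class JoinBranchExample {
    // Attempt to join an existing transaction branch from a second participant.
    static void tryJoin(XAResource resource, Xid xid) {
        try {
            resource.start(xid, XAResource.TMJOIN);
        } catch (XAException e) {
            // Depending on the resource manager this may be XAER_PROTO, XAER_DUPID,
            // or a timeout-related failure.
            System.err.println("Could not join branch, XA error code: " + e.errorCode);
        }
    }
}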

SCS kafka consumer attempts to acquire info from a partition that is no longer assigned to it

spring-cloud-stream-binder-kafka 3.0.9-RELEASE
spring-boot 2.2.13.RELEASE
Hi, we have a project using Spring Cloud Stream with Kafka, and we are having a problem reconnecting the consumers when the broker nodes are down for a period of time.
The problem is that the consumer is not able to reconnect and acquire the partitions, because it is trying to check the offset position of a partition that is no longer assigned to it. How can this happen?
The logs are shown below:
2021-06-09T09:39:25.358Z [mecstkac-45-6gvd4] [WARN] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] [Consumer clientId=clientid-0, groupId=groupid-v1] Connection to node 2147483644 (hostnode/10.71.34.4:9092) could not be established. Broker may not be available.
2021-06-09T09:42:30.217Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] [Consumer clientId=clientid-0, groupId=groupid-v1] User provided listener org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer$ListenerConsumerRebalanceListener failed on invocation of onPartitionsAssigned for partitions [topicName1-1]org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition topicName1-1 could be determined
2021-06-09T09:42:30.217Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] Error while processing: nullorg.apache.kafka.common.KafkaException: User rebalance callback throws an error\\n at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:403)\\nCaused by: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition topicName1-1 could be determined"
2021-06-09T09:43:03.924Z [mecstkac-45-6gvd4] [ERROR] [KafkaConsumerDestination{consumerDestinationName='topicName1', partitions=0, dlqName='null'}.container-0-C-1] [messageKey=] Error while processing: nulljava.lang.IllegalStateException: You can only check the position for partitions assigned to this consumer.\\n at org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1717)
Is it possible that the kafka binder stores information about the previously assigned partition and tries to connect to it even though a rebalance has already been performed and it is now assigned to another consumer?
NOTE: The configuration of the consumer Assignor is the default (RangeAssignor).

update policy query and ingestion retry in ADX

By default, an update policy on a Kusto table is non-transactional. Let's say I have an update policy defined on a table MyTarget, whose source is defined in the update policy as MySource, and the update policy is defined as transactional. Ingestion has been set up on the table MySource, so data is continuously being loaded into MySource. Now say a certain ingestion batch is loaded into MySource; right after that, the query defined in the update policy is triggered. Now let's say this query fails, due to memory issues etc. Because the update policy is transactional, even the data batch loaded into MySource will not be committed. I have heard that in this case the ingestion is retried automatically. Is that so? I haven't found any documentation regarding this retry. Anyway, my simple question is: how many times will the retry be attempted, and how long is the interval after each attempt? Are these configurable properties if I am the owner of the ADX cluster (I am talking about the ADX cluster available through Azure)?
Yes, there is an automatic retry for ingestions that failed due to a failure in a transactional update policy.
The full details can be found here: https://learn.microsoft.com/en-us/azure/kusto/management/updatepolicy#failures
Failures are treated as follows:
Non-transactional policy: The failure is ignored by Kusto. Any retry is the responsibility of the data owner.
Transactional policy: The original ingestion operation that triggered the update will fail as well. The source table and the database will not be modified with new data.
In case the ingestion method is pull (Kusto's Data Management service is involved in the ingestion process), there's an automated retry on the entire ingestion operation, orchestrated by Kusto's Data Management service, according to the following logic:
Retries are done until reaching the earliest between the maximum retry period (2 days) and maximum retry attempts (10 attempts).
The backoff period starts from 2 minutes, and grows exponentially (2 -> 4 -> 8 -> 16 ... minutes)
In any other case, any retry is the responsibility of the data owner.
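As a rough sanity check on that schedule (this arithmetic is mine, not from the documentation), doubling from 2 minutes means the 10-attempt cap is normally reached before the 2-day window:
public class RetryScheduleSketch {
    public static void main(String[] args) {
        long backoffMinutes = 2;   // backoff starts at 2 minutes and doubles each attempt
        long totalMinutes = 0;
        // Capped at 10 attempts or 2 days (2880 minutes), whichever comes first.
        for (int attempt = 1; attempt <= 10 && totalMinutes < 2 * 24 * 60; attempt++) {
            totalMinutes += backoffMinutes;
            System.out.println("Attempt " + attempt + ": wait " + backoffMinutes
                    + " min (cumulative " + totalMinutes + " min)");
            backoffMinutes *= 2;
        }
        // The 10th retry lands at roughly 2046 cumulative minutes (~34 hours),
        // inside the 2-day maximum retry period.
    }
}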

Flink + Kafka: Why am I losing messages?

I have written a very simple Flink streaming job which takes data from Kafka using a FlinkKafkaConsumer082.
protected DataStream<String> getKafkaStream(StreamExecutionEnvironment env, String topic) {
    Properties result = new Properties();
    result.put("bootstrap.servers", getBrokerUrl());
    result.put("zookeeper.connect", getZookeeperUrl());
    result.put("group.id", getGroup());
    return env.addSource(
            new FlinkKafkaConsumer082<>(
                    topic,
                    new SimpleStringSchema(), result));
}
This works very well and whenever I put something into the topic on Kafka, it is received by my Flink job and processed. Now I tried to see what happens if my Flink Job isn't online for some reason. So I shut down the flink job and kept sending messages to Kafka. Then I started my Flink job again and was expecting that it would process the messages that were sent meanwhile.
However, I got this message:
No prior offsets found for some partitions in topic collector.Customer. Fetched the following start offsets [FetchPartition {partition=0, offset=25}]
So it basically ignored all messages that came since the last shutdown of the Flink job and just started to read at the end of the queue. From the documentation of FlinkKafkaConsumer082 I gathered, that it automatically takes care of synchronizing the processed offsets with the Kafka broker. However that doesn't seem to be the case.
I am using a single-node Kafka installation (the one that comes with the Kafka distribution) with a single-node Zookeeper installation (also the one bundled with the Kafka distribution).
I suspect it is some kind of misconfiguration or something like that, but I really don't know where to start looking. Has anyone else had this issue and maybe solved it?
I found the reason. You need to explicitly enable checkpointing in the StreamExecutionEnvironment to make the Kafka connector write the processed offsets to Zookeeper. If you don't enable it, the Kafka connector will not write the last read offset and it will therefore not be able to resume from there when the collecting Job is restarted. So be sure to write:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(); // <-- this is the important part
Anatoly's suggestion for changing the initial offset is probably still a good idea, in case checkpointing fails for some reason.
https://kafka.apache.org/08/configuration.html
Set auto.offset.reset to smallest (by default it's largest).
auto.offset.reset:
What to do when there is no initial offset in Zookeeper or if an
offset is out of range:
smallest : automatically reset the offset to the smallest offset
largest : automatically reset the offset to the largest offset
anything else: throw exception to the consumer.
If this is set to largest, the consumer may lose some messages when the number of partitions, for the topics it subscribes to, changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest.
Also make sure getGroup() returns the same group id after a restart.
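Putting both suggestions together, the consumer setup from the question would look roughly like this. This is a sketch built on the question's own helper methods, and it assumes env.enableCheckpointing() has been called as in the accepted answer; the smallest value applies to the old 0.8-style consumer configuration used by FlinkKafkaConsumer082:
protected DataStream<String> getKafkaStream(StreamExecutionEnvironment env, String topic) {
    Properties result = new Properties();
    result.put("bootstrap.servers", getBrokerUrl());
    result.put("zookeeper.connect", getZookeeperUrl());
    result.put("group.id", getGroup());           // must stay the same across restarts
    result.put("auto.offset.reset", "smallest");  // start from the earliest offset when none is committed
    return env.addSource(new FlinkKafkaConsumer082<>(topic, new SimpleStringSchema(), result));
}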
