This is a question related to:
https://github.com/spring-projects/spring-kafka/issues/575
I'm using spring-kafka 1.3.7 and transactions in a read-process-write cycle.
For this purpose, I should use a KTM (KafkaTransactionManager) on the spring-kafka container to enable transactions for the whole listener process, with automatic handling of the transactional id based on the partition for zombie fencing (the 1.3.7 changes).
If I understand issue #575 correctly, I cannot use a RetryTemplate in a container when using a transaction manager.
How am I supposed to handle errors and retries in such a case?
Is the default behavior with transactions infinite retries? That seems really dangerous; an unexpected exception could simply block the whole process in production.
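For context, this is roughly the setup being described - a KafkaTransactionManager set on the container factory; the bean wiring and generic types below are an illustrative sketch, not the actual configuration from the question:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.transaction.KafkaTransactionManager;

@Configuration
public class KafkaTxConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            ProducerFactory<String, String> producerFactory) {

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);

        // The KafkaTransactionManager starts a transaction around each listener invocation;
        // with a transactional producer factory the framework handles the transactional id.
        KafkaTransactionManager<String, String> tm = new KafkaTransactionManager<>(producerFactory);
        factory.getContainerProperties().setTransactionManager(tm);
        return factory;
    }
}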
The upcoming 2.2 release adds recovery to the DefaultAfterRollbackProcessor - so you can stop retrying after some number of attempts.
Docs Here, PR here.
It also provides an optional mechanism to send the failed record to a dead-letter topic.
If you can't move to 2.2 (release candidate due at the end of this week, with GA in October), you can provide a custom AfterRollbackProcessor with similar functionality.
EDIT
Or, you could add code to keep track of how many times the same record has been delivered, and handle the error in your listener, or in its listener-level error handler.
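A minimal sketch of that delivery-counting idea, assuming a record listener running under the transaction; the topic name, attempt limit, and key format are just illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class TrackingListener {

    private static final int MAX_ATTEMPTS = 3; // illustrative limit

    // key: topic-partition@offset, value: number of deliveries seen so far
    private final Map<String, Integer> attempts = new ConcurrentHashMap<>();

    @KafkaListener(topics = "someTopic")
    public void listen(ConsumerRecord<String, String> record) {
        String key = record.topic() + "-" + record.partition() + "@" + record.offset();
        int attempt = this.attempts.merge(key, 1, Integer::sum);
        try {
            process(record);
            this.attempts.remove(key);
        }
        catch (RuntimeException e) {
            if (attempt >= MAX_ATTEMPTS) {
                // give up: log (or publish to a dead-letter topic) and swallow the exception
                // so the transaction commits and the offset moves past the bad record
                this.attempts.remove(key);
                return;
            }
            throw e; // roll back; the record will be redelivered
        }
    }

    private void process(ConsumerRecord<String, String> record) {
        // ... business logic ...
    }
}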
Related
After using the SeekToCurrentErrorHandler, I am looking for a non-blocking Kafka ErrorHandler. Because of some unstable subsystems we need to set high retry intervals, such as 5 minutes or more, which would block our processing.
My idea is to use the topic itself to re-queue failing messages, but with two additional header values, kafka_try-counter and kafka_try-timestamp.
Based on the SeekToCurrentErrorHandler and the DeadLetterPublishingRecoverer, I implemented a draft of a RePublishingErrorHandler and a RePublishingRecoverer.
The RePublishingRecoverer updates the Kafka headers and produces the message to the same topic.
The RePublishingErrorHandler checks the header values and, if kafka_try-counter exceeds max-attempts, calls another ConsumerRecordRecoverer such as the dead-letter or logging recoverer.
The kafka_try-timestamp is used to determine the wait time of a message. If it comes back too fast, it should be re-queued without incrementing the try-counter.
The expectation with this approach is to get a non-blocking listener.
Since I am new to the spring-kafka implementation, and to Kafka itself, I'm not sure if this approach is OK.
I am also somewhat stuck in the implementation of that concept.
My idea is to use the topic itself to re-queue failing messages.
That won't work; you would have to publish it to another topic and have a (delaying) consumer on that topic, perhaps polling at some interval rather than using a message-driven consumer. Then have that consumer publish it back to the original topic.
All of this assumes that strict ordering within a partition is not a requirement for you.
It's easy enough to subclass the DeadLetterPublishingRecoverer and override the createProducerRecord() method. Call super() and then add your headers.
Set the BackOff in the SeekToCurrentErrorHandler to have a zero back off and 0 retries to immediately publish to the DLT.
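A rough sketch of that combination, reusing the RePublishingRecoverer name from the question. It assumes a spring-kafka version (2.3+) whose SeekToCurrentErrorHandler takes a BackOff and whose createProducerRecord() has the parameter list shown (the signature differs across 2.x versions); the header bookkeeping is illustrative:

import java.nio.charset.StandardCharsets;
import java.util.function.BiFunction;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;

public class RePublishingRecoverer extends DeadLetterPublishingRecoverer {

    public RePublishingRecoverer(KafkaOperations<Object, Object> template,
            BiFunction<ConsumerRecord<?, ?>, Exception, TopicPartition> destinationResolver) {
        super(template, destinationResolver);
    }

    @Override
    protected ProducerRecord<Object, Object> createProducerRecord(ConsumerRecord<?, ?> record,
            TopicPartition topicPartition, Headers headers, byte[] key, byte[] value) {
        // build the standard dead-letter record first, then add the custom retry headers
        ProducerRecord<Object, Object> out =
                super.createProducerRecord(record, topicPartition, headers, key, value);
        out.headers().add("kafka_try-counter", nextCounter(record));
        out.headers().add("kafka_try-timestamp",
                Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
        return out;
    }

    private byte[] nextCounter(ConsumerRecord<?, ?> record) {
        Header previous = record.headers().lastHeader("kafka_try-counter");
        int count = previous == null ? 1
                : Integer.parseInt(new String(previous.value(), StandardCharsets.UTF_8)) + 1;
        return Integer.toString(count).getBytes(StandardCharsets.UTF_8);
    }
}

Then give the error handler a zero back off with no retries, so the failing record is handed to the recoverer immediately:

// zero interval, zero retries: publish to the retry/DLT topic right away
SeekToCurrentErrorHandler errorHandler =
        new SeekToCurrentErrorHandler(rePublishingRecoverer, new FixedBackOff(0L, 0L));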
I'm using Spring Boot 2.1.7.RELEASE and spring-kafka 2.2.8.RELEASE, and I'm using the @KafkaListener annotation to create a consumer with all default settings.
Now, in my consumer, the processing logic includes a DB call, and I'm sending the record to a DLT if there is an error/exception during processing.
With this setup, if the DB is down for a few minutes for some reason, I want to pause/stop my consumer from consuming more records; otherwise it keeps consuming messages, hitting the DB exception, and eventually filling up my DLT, which I don't want to happen until the DB is back (based on some health check).
Now I have a few questions here.
Does spring-kafka provide an option to trigger infinite retries based on the exception type (in this case a DB exception, but I want to add a few more exception types based on my consumer logic)?
Does spring-kafka provide an option to trigger message consumption based on a condition?
There is a ContainerStoppingErrorHandler but it will stop the container for all exceptions.
You would need to create a custom error handler that stops (or pauses) the container after a specific failure as well as some mechanism to restart (or resume) the container.
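A rough sketch of such a handler, building on the SeekToCurrentErrorHandler available in 2.2; the exception type used to detect the DB outage is an assumption:

import java.util.List;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.dao.DataAccessResourceFailureException;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;

public class PauseOnDbDownErrorHandler extends SeekToCurrentErrorHandler {

    @Override
    public void handle(Exception thrownException, List<ConsumerRecord<?, ?>> records,
            Consumer<?, ?> consumer, MessageListenerContainer container) {
        if (thrownException.getCause() instanceof DataAccessResourceFailureException) {
            // DB looks down: pause this container; something else (e.g. a scheduled health
            // check that looks the container up in the KafkaListenerEndpointRegistry)
            // must call resume() once the DB is reachable again
            container.pause();
        }
        // seek the unprocessed records back so they are redelivered after resume/restart
        super.handle(thrownException, records, consumer, container);
    }
}

Register it with factory.setErrorHandler(new PauseOnDbDownErrorHandler()). Pausing keeps the consumer polling (it just fetches nothing), so no rebalance occurs while the DB is down.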
In the below scenario, what would be the behavior of Axon?
The Command Bus receives the command
It creates an event
However, the messaging infrastructure is down (say, Kafka)
Does Axon have re-queuing capability for events, or any other alternative to handle this scenario?
If you're using Axon, you know it differentiates between Command, Event and Query messages. I'd suggest being specific in your question about which message type you want to retry.
However, I am going to assume it's about events, as you're mentioning Kafka.
If this is the case, I'd highly recommend reading the reference guide on the matter, as it states how you can uncouple Kafka publication from actual event storage in Axon.
Simply put, use a TrackingEventProcessor as the means to publish events to Kafka, as this will ensure a dedicated thread is used for publication instead of the same thread that stores the event. In addition, the TrackingEventProcessor can be replayed, and thus "re-process" events.
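If it helps, a rough Axon 4 / Spring Boot sketch of switching that processing group to a tracking processor; the group name is a placeholder, and the wiring of the Kafka extension itself is not shown:

import org.axonframework.config.EventProcessingConfigurer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AxonProcessorConfig {

    @Autowired
    public void configure(EventProcessingConfigurer configurer) {
        // "kafka-event-publisher" stands in for whatever processing group your
        // Kafka-publishing event handler belongs to; a tracking processor runs it on
        // its own thread(s) and can be replayed to re-process events.
        configurer.registerTrackingEventProcessor("kafka-event-publisher");
    }
}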
I have a question, more about trying to understand so I can implement it the right way. The requirement is to stop the ConcurrentMessageListenerContainer by invoking the stop method on the container, which out of the box iterates over the KafkaMessageListenerContainers (based on the concurrency defined) and invokes stop for each consumer thread.
Just FYI, I am on 1.3.5 and I cannot migrate to 2.* due to Spring Boot 1.5.*.
Configuration:
Let's say I have a topic with 5 partitions and concurrency defined as 5 as well. I am using a BatchListener with a batch record count of 100 for each poll.
Question:
When we invoke stop on the container, it appears that internally it sets running to false for each KafkaMessageListenerContainer and calls wakeup on the listenerConsumer:
setRunning(false);
this.listenerConsumer.consumer.wakeup();
During testing, what I have observed is that invoking stop on the container from a separate thread does the following:
1) It stops the listenerConsumer, and that definitely works; no more polling happens after calling stop.
2) It seems that if any listenerConsumer has already polled 100 records and is in the middle of processing them, it completes the execution before stopping.
Is #2 by design, i.e. does invoking container stop only send a wakeup to stop the next poll? I ask because I don't see any handling of the below in KafkaMessageListenerContainer.run():
catch (WakeupException e) {
    // Ignore, we're stopping
}
One more thing: even in spring-kafka 2.1, the ContainerStoppingBatchErrorHandler calls the same container stop, so I guess it's more a question of my understanding of how to handle this scenario.
To conclude, if the above is too much detail: I want to terminate the listener thread when stop is invoked from a separate thread. I have manual offset commits, so replaying a batch is fine.
Hi Gary,
As you suggested having a consumer-aware listener, my question is specific to the listener being stopped through the container. Once the container invokes stop, the listener thread (a BatchListener in this case) is supposed to be interrupted from its execution. I know the entire poll of records has already been received by the listener, and the question is not about losing offsets, since the ack is at the batch level.
It's not clear what your issue is; the container stops when the listener returns after processing the batch.
If you want to manage offsets at the record level, don't use a BatchMessageListener; use a record-level MessageListener instead.
With the former, the whole batch of records returned by the poll is considered a unit of work; offsets for the entire batch are committed, or not.
Even with a record-level listener, the container will not stop until the current batch of records has been sent to the listener; in that case, what you should do depends on the acknowledge mode. With manual acks, simply ignore those records that are received after the stop. If the container manages the acks, throw an exception for the records you want to discard (with ackOnError=false - the default is true).
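A minimal sketch of the manual-ack approach (MANUAL ack mode assumed); the stop flag, and setting it just before calling container.stop(), are illustrative:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.listener.AcknowledgingMessageListener;
import org.springframework.kafka.support.Acknowledgment;

public class StopAwareListener implements AcknowledgingMessageListener<String, String> {

    private volatile boolean stopRequested;

    // call this just before invoking container.stop() from the other thread
    public void requestStop() {
        this.stopRequested = true;
    }

    @Override
    public void onMessage(ConsumerRecord<String, String> record, Acknowledgment ack) {
        if (this.stopRequested) {
            // ignore records delivered after the stop was requested; since they are
            // not acknowledged, they will be redelivered after the next start
            return;
        }
        process(record);
        ack.acknowledge();
    }

    private void process(ConsumerRecord<String, String> record) {
        // ... business logic ...
    }
}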
It's unfortunate that you can't move to a more recent release.
With 2.1 there is much more flexibility. For example, the SeekToCurrentErrorHandler and SeekToCurrentBatchErrorHandler are provided.
The SeekToCurrentErrorHandler extends RemainingRecordsErrorHandler, which means the remaining records in the poll are sent to the error handler instead of the listener; when the container detects there is a RemainingRecordsErrorHandler, the listener won't get the remaining records in the batch, and the handler can decide what to do - stop the container, perform seeks, etc.
So you no longer need to stop the container when a bad record is received.
This is all designed for handling errors/stopping the container due to some data error.
There currently is no stopNow() option on the containers - i.e. immediately stop calling the listener after the current record is processed and discard any remaining records so they will be redelivered on the next start. Currently, you have to discard them yourself, as I described above.
We could consider adding such an option but it would be a 2.2 feature at the earliest.
We only provide bug fixes in 1.3.x and, then, only if there is no work around.
It's open source, so you can always fork it yourself, and make contributions back to the framework.
Let's assume Rebus could not publish a message to RabbitMQ or some other queue; what is the best practice for handling this exception?
I stopped the RabbitMQ service and Rebus threw an AggregateException. I can manually catch this exception in a try-catch block, but is there a better solution for catching exceptions when such situations happen?
First off: If you get an exception when initially sending/publishing a message (e.g. while handling a web request), there's nothing you can do, really. Sorry ;)
You should probably log - thoroughly - all the information you can, and then be sure to set up logging so that the information ends up in a file or in some other persistent log. And then you should have some kind of notification or a process in place that ensures that someone will at some point look at the log.
You should probably have this kind of logging in place, regardless of the type of work you do.
Depending on how important your information is, you could also set up some kind of retry mechanism (although you should be careful that you do not consume threads and too much memory while retrying). Also, since your web application should be able to be recycled at any time, you probably should not rely (too much) on retries.
You can do some things, though, in order to minimize the risk of ending up in a situation where you can't send/publish.
I can recommend that you use some kind of high-availability transport, like MSMQ (because it has local outgoing queues), RabbitMQ (with a shovel on each machine), or Azure Service Bus or Azure Storage Queues if you're in Azure.
Moreover - if you were using MSMQ, and you want to publish an event - I would recommend that you await bus.Send(theEvent) first, and then when you handle the message, you await bus.Publish(theEvent). This is because Rebus (with the MSMQ transport) needs to do a lookup in the subscription storage in order to get all subscribers for the given event. This is not a problem with RabbitMQ though, because Rebus will use Rabbit's topics to do pub/sub and will be just as safe as doing an ordinary send.
When you're sending/publishing from within a Rebus message handler, there is of course no problem, since the receive operation will be rolled back, and eventually the incoming message will end up in an error queue.
I hope that casts some light on the situation :)