I would like to know how Spring Kafka handles retries when multiple partitions are assigned to an instance. Does Spring Kafka keep retrying the same message according to the retry policy and backoff policy, or does it deliver messages from other partitions in between retries?
Is the behavior:
A) retry message -> retry message -> retry message
B) retry message -> other message -> retry message -> retry message
I've looked at other Stack Overflow questions that seem to confirm that, given a single partition, Spring Kafka will not move on to another offset, but I found no information on the behavior when multiple partitions are assigned to the instance. I've implemented a factory that has a retry template and stateful retry.
@Bean
public KafkaListenerContainerFactory<ConcurrentMessageListenerContainer<Integer, String>>
kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<Integer, String> factory =
new ConcurrentKafkaListenerContainerFactory<>();
ListenerExceptions listenerExceptions = new ListenerExceptions();
factory.setConsumerFactory(consumerFactory());
factory.setConcurrency(KafkaProperties.CONCURRENCY);
factory.getContainerProperties().setPollTimeout(KafkaProperties.POLL_TIMEOUT_VLAUE);
factory.setRetryTemplate(retryTemplate());
factory.setErrorHandler(new SeekToCurrentErrorHandler());
factory.setStatefulRetry(true);
factory.setRecoveryCallback((RetryContext context) -> listenerExceptions.recover(context));
return factory;
}
The retry configuration from the mentioned factory is delegated to the RetryingMessageListenerAdapter, whose logic looks like this:
public void onMessage(final ConsumerRecord<K, V> record, final Acknowledgment acknowledgment,
final Consumer<?, ?> consumer) {
RetryState retryState = null;
if (this.stateful) {
retryState = new DefaultRetryState(record.topic() + "-" + record.partition() + "-" + record.offset());
}
getRetryTemplate().execute(context -> {
context.setAttribute(CONTEXT_RECORD, record);
switch (RetryingMessageListenerAdapter.this.delegateType) {
case ACKNOWLEDGING_CONSUMER_AWARE:
context.setAttribute(CONTEXT_ACKNOWLEDGMENT, acknowledgment);
context.setAttribute(CONTEXT_CONSUMER, consumer);
RetryingMessageListenerAdapter.this.delegate.onMessage(record, acknowledgment, consumer);
break;
case ACKNOWLEDGING:
context.setAttribute(CONTEXT_ACKNOWLEDGMENT, acknowledgment);
RetryingMessageListenerAdapter.this.delegate.onMessage(record, acknowledgment);
break;
case CONSUMER_AWARE:
context.setAttribute(CONTEXT_CONSUMER, consumer);
RetryingMessageListenerAdapter.this.delegate.onMessage(record, consumer);
break;
case SIMPLE:
RetryingMessageListenerAdapter.this.delegate.onMessage(record);
}
return null;
},
getRecoveryCallback(), retryState);
}
So, we do retry per message. According to Apache Kafka recommendations, we process one partition in one thread, so the next record in that partition won't be handled until the retries are exhausted or the call has succeeded.
Given your multiple-partitions condition and the factory.setConcurrency(KafkaProperties.CONCURRENCY); configuration, different partitions may be processed in different threads. Therefore records from different partitions may be retried at the same time, simply because a retry is tied to the thread and call stack.
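To make the threading argument concrete, here is a minimal sketch in plain Java (no Spring or Kafka dependencies). It assumes one worker thread per partition, which is only the case when the configured concurrency is at least the number of assigned partitions. The class name, the `failuresBeforeSuccess` knob, and the log format are all hypothetical and exist only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PartitionRetrySketch {

    // Processes one partition's records in order, retrying each record in place.
    // failuresBeforeSuccess (hypothetical): how many times every record fails first.
    static void consumePartition(int partition, List<String> records,
                                 int failuresBeforeSuccess,
                                 ConcurrentLinkedQueue<String> log) {
        for (String record : records) {
            for (int attempt = 1; ; attempt++) {
                log.add("p" + partition + ":" + record + ":attempt" + attempt);
                if (attempt > failuresBeforeSuccess) {
                    break; // success: only now does this thread move to the next record
                }
            }
        }
    }

    // Runs two partitions on two threads and returns the combined event log.
    static List<String> run() {
        ConcurrentLinkedQueue<String> log = new ConcurrentLinkedQueue<>();
        Map<Integer, List<String>> partitions =
                Map.of(0, List.of("a", "b"), 1, List.of("x", "y"));
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        partitions.forEach((p, recs) ->
                pool.submit(() -> consumePartition(p, recs, 2, log)));
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return List.copyOf(log);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Within each partition the log always shows all attempts for one record before the next record starts (behavior A), while entries from the two partitions can interleave in any order.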
Related
Hello, I'm developing a Server-Client application that communicates over SignalR. What I have to implement is a mechanism that allows my server to call a method on the client and get the result of that call. Both applications are developed with .NET Core.
My concept is: the Server invokes a method on the Client, providing an Id for that invocation; the Client executes the method and in response calls a method on the Server with the method result and the provided Id, so the Server can match the invocation with the result.
Usage looks like this:
var invocationResult = await Clients
.Client(connectionId)
.GetName(id)
.AwaitInvocationResult<string>(ClientInvocationHelper._invocationResults, id);
AwaitInvocationResult is an extension method on Task:
public static Task<TResultType> AwaitInvocationResult<TResultType>(this Task invoke, ConcurrentDictionary<string, object> lookupDirectory, InvocationId id)
{
return Task.Run(() =>
{
while (!ClientInvocationHelper._invocationResults.ContainsKey(id.Value)
|| ClientInvocationHelper._invocationResults[id.Value] == null)
{
Thread.Sleep(500);
}
try
{
object data;
var stingifyData = lookupDirectory[id.Value].ToString();
//First we should check if invocation response contains exception
if (IsClientInvocationException(stingifyData, out ClientInvocationException exception))
{
throw exception;
}
if (typeof(TResultType) == typeof(string))
{
data = lookupDirectory[id.Value].ToString();
}
else
{
data = JsonConvert.DeserializeObject<TResultType>(stingifyData);
}
var result = (TResultType)data;
return Task.FromResult(result);
}
catch (Exception e)
{
Console.WriteLine(e);
throw;
}
});
}
As you can see, I basically have a dictionary where the key is the invocation Id and the value is the result of that invocation as reported by the client. In a while loop I check whether the result is already available for the server to consume; if it is, the result is converted to the specific type.
This mechanism works pretty well, but I'm observing weird behaviour that I don't understand.
If I call this method with the await modifier, the Hub method responsible for receiving the result from the client is never invoked.
///This method gets called by the client to return a value of specific invocation
public Task OnInvocationResult(InvocationId invocationId, object data)
{
ClientInvocationHelper._invocationResults[invocationId.Value] = data;
return Task.CompletedTask;
}
As a result, the while loop in AwaitInvocationResult never ends and the Hub is blocked.
Maybe someone can explain this behaviour to me so I can change my approach or improve my code.
As mentioned in the answer by Brennan, before ASP.NET Core 5.0 a SignalR connection could handle only one non-streaming hub method invocation at a time. And since your invocation was blocked, the server wasn't able to handle the next invocation.
But in this case you can probably try to handle client responses in a separate hub, like below.
public class InvocationResultHandlerHub : Hub
{
public Task HandleResult(int invocationId, string result)
{
InvoctionHelper.SetResult(invocationId, result);
return Task.CompletedTask;
}
}
While a hub method invocation is blocked, the calling connection cannot invoke any other hub methods. But since the client has a separate connection for each hub, it will be able to invoke methods on other hubs. This is probably not the best way, because the client won't be able to reach the first hub until the response has been posted.
Another way you can try is streaming invocations. Currently SignalR doesn't await them before handling the next message, so the server will handle invocations and other messages between streaming calls.
You can check this behavior in the Invoke method here; the invocation isn't awaited when it is a stream:
https://github.com/dotnet/aspnetcore/blob/c8994712d8c3c982111e4f1a09061998a81d68aa/src/SignalR/server/Core/src/Internal/DefaultHubDispatcher.cs#L371
So you can try to add some dummy streaming parameter that you will not use:
public async Task TriggerRequestWithResult(string resultToSend, IAsyncEnumerable<int> stream)
{
var invocationId = InvoctionHelper.ResolveInvocationId();
await Clients.Caller.SendAsync("returnProvidedString", invocationId, resultToSend);
var result = await InvoctionHelper.ActiveWaitForInvocationResult<string>(invocationId);
Debug.WriteLine(result);
}
and on the client side you will also need to create and populate this parameter:
var stringResult = document.getElementById("syncCallString").value;
var dummySubject = new signalR.Subject();
resultsConnection.invoke("TriggerRequestWithResult", stringResult, dummySubject);
dummySubject.complete();
More details: https://learn.microsoft.com/en-us/aspnet/core/signalr/streaming?view=aspnetcore-5.0
If you can use ASP.NET Core 5, you can try the new MaximumParallelInvocationsPerClient hub option. It allows several invocations to execute in parallel for one connection. But if your client calls too many hub methods without providing a result, the connection will hang.
More details: https://learn.microsoft.com/en-us/aspnet/core/signalr/configuration?view=aspnetcore-5.0&tabs=dotnet
Actually, since returning values from client invocations isn't implemented by SignalR, maybe you can try to look into streams to return values into hubs?
This is supported in .NET 7 now https://learn.microsoft.com/en-us/aspnet/core/signalr/hubs?view=aspnetcore-7.0#client-results
By default a client can only have one hub method running at a time on the server. This means that when you wait for a result in the first hub method, the second hub method will never run since the first hub method is blocking the processing loop.
It would be better if the OnInvocationResult method ran the logic in your AwaitInvocationResult extension and the first hub method just registers the id and calls the client.
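The "register the id, then let the result handler complete it" idea can be sketched without any polling loop. The example below is a language-neutral illustration written in Java rather than C#; it is not SignalR code, and the class and method names are hypothetical. In .NET the analogous building block would likely be a TaskCompletionSource stored per invocation id.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class InvocationRegistry {

    // One pending future per invocation id that is still waiting for a client result.
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Role of the first hub method: remember the id, call the client, return right away.
    CompletableFuture<String> register(String invocationId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(invocationId, future);
        return future;
    }

    // Role of OnInvocationResult: look up the id and hand the result to whoever registered it.
    // Returns false when the id is unknown or was already completed.
    boolean complete(String invocationId, String result) {
        CompletableFuture<String> future = pending.remove(invocationId);
        return future != null && future.complete(result);
    }

    public static void main(String[] args) {
        InvocationRegistry registry = new InvocationRegistry();
        registry.register("42")
                .thenAccept(name -> System.out.println("client returned " + name));
        registry.complete("42", "Alice"); // runs the continuation registered above
    }
}
```

Because nothing blocks while waiting, the processing loop stays free to receive the client's reply, which is the point the answer above makes.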
I'm running into a problem sending massive requests to a .NET Core web service. I'm using a SemaphoreSlim to limit the number of simultaneous requests. When I get a 10061 error (the web service has refused the connection), I want to dial back the number of simultaneous requests. My idea at the moment is to de-reference the SemaphoreSlim and create another:
await this.semaphoreSlim.WaitAsync().ConfigureAwait(false);
counter++;
Uri uri = new Uri($"{api}/{keyProperty}", UriKind.Relative);
string rowVersion = string.Empty;
try
{
HttpResponseMessage getResponse = await this.httpClient.GetAsync(uri).ConfigureAwait(false);
if (getResponse.IsSuccessStatusCode)
{
using (HttpContent httpContent = getResponse.Content)
{
JObject currentObject = JObject.Parse(await httpContent.ReadAsStringAsync().ConfigureAwait(false));
rowVersion = currentObject.Value<string>("rowVersion");
}
}
}
catch (HttpRequestException httpRequestException)
{
SocketException socketException = httpRequestException.InnerException as SocketException;
if (socketException != null && socketException.ErrorCode == PutHandler.ConnectionRefused)
{
this.semaphoreSlim = new SemaphoreSlim(counter * 90 / 100, counter * 90 / 100);
}
}
finally
{
this.semaphoreSlim.Release();
}
If I do this, what will happen to the other tasks that are waiting on the Semaphore that I just de-referenced? My guess is that nothing will happen until the object is garbage collected and disposed.
A SemaphoreSlim (just like any other object in .NET) will exist as long as there are references to it.
However, there is a bug in your code: the SemaphoreSlim being released is this.semaphoreSlim, and if this.semaphoreSlim is changed between being acquired and being released, then the code will release a different semaphore than the one that was acquired. To avoid this problem, copy this.semaphoreSlim into a local variable at the beginning of your method, and acquire and release that local variable.
More broadly, there's a difficulty in the attempted solution. If you start 1000 tasks, they will all reference the old semaphore and ignore the updated this.semaphoreSlim. So you'd need a separate solution. For example, you could define a disposable "token" which is permission to call the API. Then have an asynchronous collection of these tokens (e.g., a Channel). This gives you full control over how many tokens are released at once.
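A rough sketch of the token idea follows, written in Java with a blocking queue standing in for the asynchronous Channel mentioned above. The TokenPool class and its methods are hypothetical names for illustration, not an existing library API. Shrinking retires idle tokens immediately and drops in-flight tokens as they are returned, so every caller keeps using the same pool object.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class TokenPool {

    private final BlockingQueue<Object> idle = new LinkedBlockingQueue<>();
    private int issued; // tokens in existence, whether idle or held by a caller
    private int target; // how many tokens should exist

    TokenPool(int size) {
        issued = size;
        target = size;
        for (int i = 0; i < size; i++) {
            idle.add(new Object());
        }
    }

    // Blocks until a token is available.
    Object acquire() throws InterruptedException {
        return idle.take();
    }

    // Non-blocking variant: returns null when no token is idle.
    Object tryAcquire() {
        return idle.poll();
    }

    // Returns a token, or retires it if the pool has been shrunk in the meantime.
    synchronized void release(Object token) {
        if (issued > target) {
            issued--;
        } else {
            idle.add(token);
        }
    }

    // Lowers the limit, e.g. after a "connection refused" error.
    synchronized void shrinkTo(int newTarget) {
        target = Math.min(target, newTarget);
        while (issued > target && idle.poll() != null) {
            issued--; // an idle token was removed from circulation
        }
    }

    synchronized int size() {
        return issued;
    }

    public static void main(String[] args) throws InterruptedException {
        TokenPool pool = new TokenPool(10);
        Object token = pool.acquire();
        pool.shrinkTo(9); // dial back by 10%
        pool.release(token);
        System.out.println("tokens in circulation: " + pool.size());
    }
}
```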
I am new to Spring-Kafka and trying to implement retry in case of failure or any exception during kafka message processing using Spring Kafka RetryTemplate.
I have used the following code:
//This is KafkaListenerContainerFactory:
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactoryRetry() {
ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.setRetryTemplate(retryTemplate());
factory.setRecoveryCallback(retryContext -> {
ConsumerRecord consumerRecord = (ConsumerRecord) retryContext.getAttribute("record");
logger.info("Recovery is called for message {} ", consumerRecord.value());
return Optional.empty();
});
return factory;
}
// Retry template
public RetryTemplate retryTemplate() {
RetryTemplate retryTemplate = new RetryTemplate();
FixedBackOffPolicy fixedBackOffPolicy = new FixedBackOffPolicy();
// Todo: take from config
fixedBackOffPolicy.setBackOffPeriod(240000);// 240seconds
retryTemplate.setBackOffPolicy(fixedBackOffPolicy);
SimpleRetryPolicy simpleRetryPolicy = new SimpleRetryPolicy();
// Todo: take from config
simpleRetryPolicy.setMaxAttempts(3);
retryTemplate.setRetryPolicy(simpleRetryPolicy);
return retryTemplate;
}
// This is consumerFactory
public ConsumerFactory<String, String> consumerFactory() {
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
return new DefaultKafkaConsumerFactory<>(props);
}
When an exception occurs, the message is retried as expected according to the retry policy. Once the maximum number of retries is exhausted, the recovery callback is called.
But soon after that, it throws "java.lang.IllegalStateException: This error handler cannot process 'org.apache.kafka.clients.consumer.CommitFailedException's; no record information is available" with some detail like:
Failing OffsetCommit request since the consumer is not part of an active group.
It seems the offset cannot be committed because the consumer has been kicked out of the group: it was idle for a long time (backoffperiod*(maxretry-1)) before the next poll.
Do I need to set max.poll.interval.ms to some large value?
Is there any other way to avoid this commit failure even when the consumer takes a long time to process and retries are scheduled with a long interval?
Please help me with this.
The aggregate backOff delay must be less than max.poll.interval.ms to avoid a rebalance.
It is now preferred to use a SeekToCurrentErrorHandler instead of a RetryTemplate, because then only each individual delay (instead of the aggregate) needs to be less than max.poll.interval.ms.
Documentation here.
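As a quick check of the numbers in the question, the arithmetic can be written out. This assumes the Kafka consumer default of 300000 ms (5 minutes) for max.poll.interval.ms and ignores the time spent in the listener itself; the class and method names are hypothetical.

```java
class BackOffBudget {

    // Time a blocking retry loop spends sleeping: one back-off between each pair of attempts.
    static long aggregateBackOffMs(long backOffPeriodMs, int maxAttempts) {
        return backOffPeriodMs * (maxAttempts - 1);
    }

    public static void main(String[] args) {
        long defaultMaxPollIntervalMs = 300_000L; // assumed consumer default (5 minutes)
        long backOffPeriodMs = 240_000L;          // FixedBackOffPolicy period from the question
        int maxAttempts = 3;                      // SimpleRetryPolicy setting from the question

        long aggregate = aggregateBackOffMs(backOffPeriodMs, maxAttempts);
        System.out.println("aggregate back-off: " + aggregate + " ms");
        System.out.println("aggregate exceeds limit: " + (aggregate > defaultMaxPollIntervalMs));
        System.out.println("single delay exceeds limit: " + (backOffPeriodMs > defaultMaxPollIntervalMs));
    }
}
```

With these settings the aggregate is 480000 ms, which is above the assumed 300000 ms limit, while a single 240000 ms delay stays below it. That is why retrying in the error handler, where the consumer polls between delays, avoids the rebalance here.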
I'm using spring-kafka '2.1.7.RELEASE' and here are my consumer settings.
public Map<String, Object> setConsumerConfigs() {
Map<String, Object> configs = new HashMap<>();
configs.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
configs.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
configs.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer2.class);
configs.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer2.class);
configs.put(ErrorHandlingDeserializer2.KEY_DESERIALIZER_CLASS, stringDeserializerClass);
configs.put(ErrorHandlingDeserializer2.VALUE_DESERIALIZER_CLASS, kafkaAvroDeserializerClass.getName());
configs.setPartitionAssignmentStrategyConfig(Collections.singletonList(RoundRobinAssignor.class));
// Set this to true so that you will have consumer record value coming as your pre-defined contract instead of a generic record
sapphireKafkaConsumerConfig.setSpecificAvroReader("true");
}
and here are my factory settings
@Bean
public <K,V> ConcurrentKafkaListenerContainerFactory<String, Object> kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<String, Object> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(getConsumerConfigs));
factory.getContainerProperties().setMissingTopicsFatal(false);
factory.getContainerProperties().setAckMode(AckMode.RECORD);
factory.setErrorHandler(myCustomKafkaSeekToCurrentErrorHandler);
factory.setRetryTemplate(retryTemplate());
factory.setRecoveryCallback(myCustomKafkaRecoveryCallback);
factory.setStatefulRetry(true);
return factory;
}
public RetryTemplate retryTemplate() {
RetryTemplate retryTemplate = new RetryTemplate();
retryTemplate.setListeners(new RetryListener[]{myCustomKafkaRetryListener});
retryTemplate.setRetryPolicy(myCustomKafkaConsumerRetryPolicy);
FixedBackOffPolicy backOff = new FixedBackOffPolicy();
backOff.setBackOffPeriod(1000);
retryTemplate.setBackOffPolicy(backOff);
return retryTemplate;
}
Here is my consumer where I've added a delay of 6 minutes which is greater than the default max.poll.interval.ms
@KafkaListener(topics = TestConsumerConstants.CONSUMER_LONGRUNNING_RECORDS_PROCESSSING_TEST_TOPIC
, clientIdPrefix = "CONSUMER_LONGRUNNING_RECORDS_PROCESSSING"
, groupId = "kafka-lib-comp-test-consumers")
public void consumeLongRunningRecord(ConsumerRecord message) throws InterruptedException {
System.out.println(String.format("\n \n Received message at %s offset %s of partition %s of topic %s with key %s \n\n", DateTime.now(),
message.offset(), message.partition(), message.topic(), message.key()));
TimeUnit.MINUTES.sleep(6);
System.out.println(String.format("\n \n Processing done for the message at %s offset %s of partition %s of topic %s with key %s \n\n", DateTime.now(),
message.offset(), message.partition(), message.topic(), message.key()));
}
Now I'm getting the error below, and the consumer tries to process the same record again and again because it couldn't commit the offset (which is expected).
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
I then tried setting 'session.timeout.ms' = 420000. Now I'm getting the error below, even though I didn't set any values for group.min.session.timeout.ms and group.max.session.timeout.ms, whose default values are 6000 and 1800000 respectively. Can someone help me understand why I am getting this error?
Caused by: org.apache.kafka.common.errors.InvalidSessionTimeoutException: The session timeout is not within the range allowed by the broker (as configured by group.min.session.timeout.ms and group.max.session.timeout.ms).
I don't know why you are getting that error, but session timeout is no longer relevant; see KIP-62. Maybe the defaults have been changed and the docs not updated.
You need to increase max.poll.interval.ms to avoid the rebalance.
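A minimal sketch of that change, using plain string keys so it compiles without kafka-clients on the classpath. The one-minute headroom and the choice of max.poll.records = 1 are assumptions for illustration, not recommendations from the answer; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

class LongRunningConsumerConfig {

    // Builds consumer properties whose poll interval covers the slowest expected record.
    static Map<String, Object> consumerConfigs(long worstCaseProcessingMs) {
        Map<String, Object> configs = new HashMap<>();
        long headroomMs = TimeUnit.MINUTES.toMillis(1); // assumed safety margin
        // Same key as ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG in kafka-clients.
        configs.put("max.poll.interval.ms", Math.toIntExact(worstCaseProcessingMs + headroomMs));
        // Same key as ConsumerConfig.MAX_POLL_RECORDS_CONFIG; one record per poll means
        // the interval only has to cover a single slow record.
        configs.put("max.poll.records", 1);
        return configs;
    }

    public static void main(String[] args) {
        // The listener in the question sleeps for 6 minutes per record.
        System.out.println(consumerConfigs(TimeUnit.MINUTES.toMillis(6)));
    }
}
```

These entries would be merged into the existing consumer configuration map before the DefaultKafkaConsumerFactory is created.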
I'm trying out the simple deferred message timeout pattern in Rebus along the outline here http://mookid.dk/oncode/archives/3043 in order to have alternative behaviors depending on whether or not I receive a timely response from another service hooked up to the same bus. Code has been modified to use async/await.
(yes, I know that article is really about unit testing. I'm just trying out the same timeout thing)
The handler in my case is a saga. Two message send operations are awaited as the last two calls in the handler of the message that starts the saga. The first message uses request/reply with the external service, and the reply is eventually handled in the same saga.
The second message is a deferred message that is supposed to enforce the alternative action in case a timeout occurs, just like in the Rebus unit test example. I've verified that messages are sent from and received by the saga/handler without any problems. It looks something like this:
public class TestSaga : Saga<TestSagaData>, IAmInitiatedBy<SomeMessage>, IHandleMessages<SomeReply>, IHandleMessages<TimeOutMessage>
{
private readonly IBus _bus;
public TestSaga(IBus bus)
{
_bus = bus;
}
protected override void CorrelateMessages(ICorrelationConfig<TestSagaData> config)
{
config.Correlate<SomeMessage>(s => s.Identifier, d => d.OriginalMessageIdentifier);
config.Correlate<SomeMessage>(s => s.Tag, d => d.CorrelationIdentifier);
config.Correlate<SomeReply>(s => s.Tag, d => d.CorrelationIdentifier);
config.Correlate<TimeOutMessage>(s => s.Tag, d => d.CorrelationIdentifier);
}
public async Task Handle(SomeMessage message)
{
if (!IsNew)
return;
Data.CorrelationIdentifier = message.Tag;
Data.OriginalMessageIdentifier = message.Identifier;
Data.ReplyReceived = false;
await _bus.Send(new SomeRequest {Tag = message.Tag});
await _bus.Defer(TimeSpan.FromSeconds(30), new TimeOutMessage() {Tag = message.Tag});
}
public async Task Handle(SomeReply message)
{
// Even if we would get here loooong before...
Data.ReplyReceived = true;
await DoStuffIfNotTimedout();
}
public async Task Handle(TimeOutMessage message)
{
// ...this, DoStuffIfTimeout below is always called
// since state is preserved from the _bus.Defer call. Correct?
if (!Data.ReplyReceived)
await DoStuffIfTimedout();
}
private async Task DoStuffIfNotTimedout()
{
// some more async stuff here
MarkAsComplete();
}
private async Task DoStuffIfTimedout()
{
// some more async stuff here
MarkAsComplete();
}
}
I have added a boolean saga data flag/property to indicate that the reply to the first message was received first, setting it to false initially before both awaited Send/Defer calls and setting it to true immediately in the message handler of the reply.
The flag is supposed to prevent the timeout actions from starting in case the deferred timeout message is received after the first reply but before the subsequent actions are finished and the saga is marked as completed.
However, in this case the state of the saga seems to 'follow' the message received. The reply handler is entered first and sets the saga data flag to true. Then, when the deferred message handler is entered later, something has reset the flag again, seemingly ignoring the action taken in the reply handler (setting the flag to true). I'm not sure whether the 'Revision' number of the saga is supposed to change or not, but it seems to remain unchanged (zero) the whole time.
I also noted that it doesn't matter if the timeout occurs long after the reply handler is entered: when the timeout handler is entered, the flag is 'false'.
Is this saga state behavior by design? I thought saga state would somehow be persisted between message handler calls. If it's the wrong behavior, what could possibly cause it?
I'm quite new to Rebus, so I'm sure I've misunderstood something here, but in that case I would like to know what :).
Transport used under the hood is RabbitMQ.
Test code: saga state test