Spring cloud stream unexpected shutdown is not covered by DLQ - spring-kafka

We are using Spring Cloud Stream 2.2 with the Kafka binder. We have noticed that if the pod is killed in the middle of a job, for whatever reason, the message is never sent to the DLQ.
We handle exceptions by catching the failure, logging it, sending the failure to another service that keeps track of it, and finally rethrowing the exception so that it is caught by the error channel and captured by the DLQ. This approach works seamlessly for normal failures, but if the failure is triggered externally (for example by an unexpected shutdown), we miss the DLQ part, apparently because the process is killed before the exception reaches the error channel. I wonder if this is a known issue, as it breaks the at-least-once guarantee of the framework in our use case.
22:34:48.077 INFO Shutting down ExecutorService
22:34:48.135 INFO Consumer stopped
22:34:48.136 INFO stopped org.springframework.integration.kafka.inbound.KafkaMessageDrivenChannelAdapter#5174b135
22:34:48.155 INFO Registering MessageChannel outbox-usermgmt.event.job-creator-outbox-event-syncs.errors
22:34:48.241 INFO Channel 'application.outbox-usermgmt.event.job-creator-outbox-event-syncs.errors' has 1 subscriber(s).
22:34:48.241 INFO Channel 'application.outbox-usermgmt.event.job-creator-outbox-event-syncs.errors' has 0 subscriber(s).
22:34:48.246 INFO Registering MessageChannel progress-report.errors
22:34:48.258 INFO Channel 'application.progress-report.errors' has 0 subscriber(s).
22:34:48.262 INFO Registering MessageChannel job-created.errors
22:34:48.273 INFO Registering MessageChannel progress-report.errors
22:34:48.350 INFO Channel 'application.job-created.errors' has 0 subscriber(s).
22:34:48.366 INFO Registering MessageChannel job-created.errors
22:34:48.458 INFO Removing {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
22:34:48.458 INFO Channel 'application.errorChannel' has 1 subscriber(s).
22:34:48.459 INFO stopped _org.springframework.integration.errorLogger
22:34:48.459 INFO Shutting down ExecutorService 'taskScheduler'
22:34:48.467 WARN Destroy method 'close' on bean with name 'genericSpecificFlexibleDeserializer' threw an exception: java.lang.NullPointerException
22:34:48.472 ERROR Job has failed, Fail to retrieve record's full tree, Connection closed unexpectedly
22:34:48.472 ERROR Fail to retrieve record's full tree
22:34:48.472 DEBUG Sending progress update of 0.0 with status of failed
22:34:48.474 ERROR Job has failed, Fail to retrieve record's full tree
22:34:48.538 INFO Closing JPA EntityManagerFactory for persistence unit 'default'
22:34:48.538 INFO Shutting down ExecutorService
22:34:48.541 INFO HikariPool-1 - Shutdown initiated...
22:34:48.543 INFO HikariPool-1 - Shutdown completed.
Code snippet:
try {
    ...
} catch (Exception ex) {
    // capture the failure details in logs
    // send failure progress update to another service
    throw new JobProcessingException(ex);
}
It appears the framework commits the offset before ensuring that the DLQ message has been published to Kafka, so the offset has moved on but the message is skipped because nothing was published to the DLQ.
P.S.: This happens for us whenever Kubernetes sends a restart signal to the pod for whatever reason (pod eviction, new release, etc.). I suppose that if the kill were forced, the commit would not happen in the first place and the job would be restarted.

This is a known problem - see https://github.com/spring-projects/spring-integration/issues/3450
The issue is that a PublishSubscribeChannel allows zero subscribers and no exception is thrown if there are none.
It has been resolved in Spring Integration (5.4.x) but is still a problem in the binder because it creates a pub/sub error channel by default.
See my comment there...
Yes; I think that solution makes sense; it shouldn't cause any real problems because the default errorChannel always gets one subscriber.
However, it won't solve the problem in the binder because the message producer gets a binding-specific error channel (which is bridged to the global error channel), so we'd need a similar change there.
It should be possible to work around it by declaring the binding's error channel as a DirectChannel @Bean, in which case an exception will be thrown if the consumer has unsubscribed (during shutdown). However, this will mean errors will only go to the binding-specific error channel and won't be bridged to the global errorChannel.
https://github.com/spring-cloud/spring-cloud-stream/issues/2082
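To make the difference concrete, here is a minimal plain-Java sketch of the two dispatch behaviours described in this answer. It does not use Spring Integration: PubSubModel and DirectModel are hypothetical stand-ins for PublishSubscribeChannel and DirectChannel, and offsetCommittedAfterFailure encodes the assumption that the offset is committed only when the send to the error channel returns normally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-ins, NOT the Spring Integration classes. They model only
// what happens when an error message is sent after the subscribers are gone.
final class ErrorChannelModel {

    // Publish/subscribe style: zero subscribers is allowed and send() still succeeds.
    static final class PubSubModel {
        final List<Consumer<String>> subscribers = new ArrayList<>();

        boolean send(String message) {
            subscribers.forEach(s -> s.accept(message));
            return true; // with an empty list the message is silently dropped
        }
    }

    // Direct style: sending with no subscriber fails loudly.
    static final class DirectModel {
        final List<Consumer<String>> subscribers = new ArrayList<>();

        boolean send(String message) {
            if (subscribers.isEmpty()) {
                throw new IllegalStateException("no subscribers for: " + message);
            }
            subscribers.get(0).accept(message); // simplification: first subscriber only
            return true;
        }
    }

    // Assumption: the offset is committed only when the error send returns normally.
    static boolean offsetCommittedAfterFailure(Predicate<String> errorChannelSend) {
        try {
            errorChannelSend.test("failed record");
            return true;  // error looks handled, so the offset moves on
        } catch (RuntimeException e) {
            return false; // error propagates, so the record can be redelivered
        }
    }
}
```

With no subscribers left, passing pubSub::send to offsetCommittedAfterFailure reports a commit even though nothing reached the DLQ, while direct::send makes the failure visible. That is the whole point of the workaround; the real channels do considerably more than this model.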

Related

Will Spring KafkaContainerStoppingErrorHandler commits offset for batch listener

I am working on a Spring Kafka implementation, and my use case is to consume messages from a Kafka topic as a batch (using a batch listener). When I consume the list of messages, I iterate over it and call a REST endpoint for message enrichment. If the REST API fails with a runtime exception, I retry using Spring Retry. I want to stop the container once the retries are exhausted, so I am planning to use KafkaContainerStoppingErrorHandler for this. Does the KafkaContainerStoppingErrorHandler commit the previously successful messages? Say we receive 10 messages, the enrichment call succeeds for messages 1, 2, 3 and 4, and fails for message 5. When we restart the container, will I get all 10 again, or will I receive messages 5-10?
Or is there another way to achieve this use case? I have looked into all the error handler types in Spring Kafka and need input on how to achieve the above requirement.
You will get them all again.
You can use the DefaultErrorHandler (with a custom recoverer) and throw a BatchListenerFailedException to indicate which record in the batch failed.
The error handler will commit the offsets up to that record and call the recoverer with the failed record; in your custom recoverer you can stop the container (use the same logic as the container stopping error handler).
In versions before 2.8, this same functionality is provided by the RecoveringBatchErrorHandler.
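The commit arithmetic can be sketched in plain Java without any Spring Kafka types. FailedAt is a hypothetical stand-in for BatchListenerFailedException (it only carries the zero-based index of the failed record), and the model assumes the custom recoverer stops the container without the failed record's own offset being committed; whether that holds depends on what the recoverer does.

```java
import java.util.List;

// Plain-Java model (no Spring Kafka types) of a batch that fails at a given index.
final class BatchFailureModel {

    // Hypothetical stand-in for BatchListenerFailedException: carries the failed index.
    static final class FailedAt extends RuntimeException {
        final int index;

        FailedAt(int index, Throwable cause) {
            super("record at index " + index + " failed", cause);
            this.index = index;
        }
    }

    // Offsets of the records before the failed one; these are the ones committed.
    static List<Long> committed(List<Long> batchOffsets, FailedAt failure) {
        return batchOffsets.subList(0, failure.index);
    }

    // Offsets delivered again after a restart, ASSUMING the failed record
    // itself was not committed by the recoverer path.
    static List<Long> redelivered(List<Long> batchOffsets, FailedAt failure) {
        return batchOffsets.subList(failure.index, batchOffsets.size());
    }
}
```

For the example in the question (ten records, the fifth one fails, so the index is 4), committed returns the first four offsets and redelivered returns the remaining six.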

setAckOnError() method for Spring kafka 2.7.8

While upgrading our Spring Kafka to 2.7.8, we are getting an error on the setAckOnError(false) call because the method is now deprecated. Is there any way to set acknowledgement on errors to false now? Are there any other methods that let me disable acknowledgement for errors?
P.S: I am new to Kafka, any help appreciated!
That property was found to have a (very small) timing hole in that a record could be ack'd before the error handler handles it; if the app dies at that time, the record could be "lost".
It was replaced by a new property on the error handlers, ackAfterHandle, which is true by default; i.e. the record's offset is only committed if the error handler "handles" the error.
Records are now never ack'd if the error handler (such as the SeekToCurrentErrorHandler) throws an exception (after it repositions the partitions).
There is no extra configuration needed any more.
See Spring Kafka AckOnError for more details.
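The timing hole is easier to see as an ordering problem. The sketch below is a plain-Java model with hypothetical names, not Spring Kafka API: it assumes only what is stated above, namely that the old behaviour could commit before the handler ran and the new one commits after the handler has handled the error.

```java
import java.util.List;

// Plain-Java model of the timing hole; the names are hypothetical, not Spring Kafka API.
final class AckOrderingModel {

    enum Step { COMMIT_OFFSET, HANDLE_ERROR }

    // Old behaviour as described above: the ack could happen before the handler ran.
    static final List<Step> ACK_ON_ERROR = List.of(Step.COMMIT_OFFSET, Step.HANDLE_ERROR);

    // Newer behaviour: the offset is committed only after the error was handled.
    static final List<Step> ACK_AFTER_HANDLE = List.of(Step.HANDLE_ERROR, Step.COMMIT_OFFSET);

    // True if a crash after `stepsCompleted` steps leaves the offset committed
    // while the error was never handled, i.e. the record is effectively lost.
    static boolean recordLost(List<Step> order, int stepsCompleted) {
        List<Step> done = order.subList(0, stepsCompleted);
        return done.contains(Step.COMMIT_OFFSET) && !done.contains(Step.HANDLE_ERROR);
    }
}
```

A crash after the first step loses the record only in the ACK_ON_ERROR ordering; with ACK_AFTER_HANDLE the worst case is that the record is handled and then redelivered.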

skip retry mechanism and go straight to DLT

My ErrorHandler logs and rethrows any exception that is not handled by the KafkaListener, so that the message is retried and eventually goes to the DLT.
There are some failures which should not be retried, but should go straight to DLT e.g. json parsing errors.
Is there a way to skip retry mechanism for certain exceptions?
See Spring Retry project: https://github.com/spring-projects/spring-retry and its ExceptionClassifierRetryPolicy: https://github.com/spring-projects/spring-retry/blob/master/src/main/java/org/springframework/retry/policy/ExceptionClassifierRetryPolicy.java. That is what you can inject into the RetryTemplate for the KafkaListenerContainer.
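The idea behind classifying exceptions can be sketched in plain Java. The class below is a hypothetical model, not the Spring Retry ExceptionClassifierRetryPolicy: it only shows how a lookup by exception type can decide between "retry up to the limit" and "give up after the first attempt". Walking the cause chain is a design choice of this sketch so that a wrapped parsing error is still recognised.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Plain-Java model of classifying exceptions as retryable or not. The class
// name is hypothetical; this is not the Spring Retry policy itself.
final class RetryClassifierModel {

    private final Set<Class<? extends Throwable>> notRetryable = new LinkedHashSet<>();
    private final int maxAttempts;

    RetryClassifierModel(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    RetryClassifierModel notRetryable(Class<? extends Throwable> type) {
        notRetryable.add(type);
        return this;
    }

    // Checks the exception and each of its causes against the registered types.
    boolean isRetryable(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            for (Class<? extends Throwable> type : notRetryable) {
                if (type.isInstance(t)) {
                    return false;
                }
            }
        }
        return true; // anything not registered is retried
    }

    // Number of delivery attempts before the record would go to the DLT.
    int attemptsBeforeDlt(Throwable error) {
        return isRetryable(error) ? maxAttempts : 1;
    }
}
```

Registering a parsing exception type as not retryable makes attemptsBeforeDlt return 1 for it, while every other exception still gets the full number of attempts.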

Timing issue in a C++/WinRT BLE connection attempt?

I am using C++/WinRT UWP to discover and connect to Bluetooth Low Energy devices. I am using the advertisement watcher to look for advertisements from devices I can support. This works.
Then I pick one to connect to. The connection procedure is a little weird to my way of thinking, but according to the Microsoft docs one calls FromBluetoothAddressAsync() with the BluetoothAddress and two things happen: one gets the BluetoothLEDevice AND a connection attempt is made. One needs to register a handler for the connection status changed event, BUT you can't do that until you get the BluetoothLEDevice.
Is there a timing issue causing the exception? Has the connection already happened BEFORE I get the BluetoothLEDevice object? Below is the code and below that is the log:
void BtleHandler::connectToDevice(BluetoothLEAdvertisementReceivedEventArgs eventArgs)
{
    OutputDebugStringA("Connect to device called\n");
    // My God this async stuff took me a while to figure out! See https://msdn.microsoft.com/en-us/magazine/mt846728.aspx
    IAsyncOperation<Windows::Devices::Bluetooth::BluetoothLEDevice> async = // assuming the address type is how I am to behave ..
        BluetoothLEDevice::FromBluetoothAddressAsync(eventArgs.BluetoothAddress(), BluetoothAddressType::Random);
    bluetoothLEDevice = async.get();
    OutputDebugStringA("BluetoothLEDevice returned\n");
    bluetoothLEDevice.ConnectionStatusChanged({ this, &BtleHandler::onConnectionStatusChanged });
    // This method not only gives you the device but it also initiates a connection
}
The above code generates the following log:
New advertisment/scanResponse with UUID 00001809-0000-1000-8000-00805F9B34FB
New ad/scanResponse with name Philips ear thermometer and UUID 00001809-0000-1000-8000-00805F9B34FB
Connect to device called
ERROR here--> onecoreuap\drivers\wdm\bluetooth\user\winrt\common\bluetoothutilities.cpp(509)\Windows.Devices.Bluetooth.dll!03BEFDD6: (caller: 03BFB977) ReturnHr(1) tid(144) 80070490 Element not found.
ERROR here--> onecoreuap\drivers\wdm\bluetooth\user\winrt\device\bluetoothledevice.cpp(428)\Windows.Devices.Bluetooth.dll!03BFB9B7: (caller: 03BFAF01) ReturnHr(2) tid(144) 80070490 Element not found.
BluetoothLEDevice returned
Exception thrown at 0x0F5CDF2F (WindowsBluetoothAdapter.dll) in BtleScannerTest.exe: 0xC0000005: Access violation reading location 0x00000000.
It sure looks like there is a timing issue. But if it is, I have no idea how to resolve it. I cannot register for the event if I don't have a BluetoothLEDevice object! I cannot figure out a way to get the BluetoothLEDevice object without invoking a connection.
================================ UPDATE =============================
Changed the methods to IAsyncAction and used co_await as suggested by @IInspectable. No difference. The problem is clearly that the registered handler is out of scope or something is wrong with it. I tried get_strong() instead of 'this' in the registration, but the compiler would not accept it (it said identifier 'get_strong()' is undefined). However, if I comment out the registration, no exception is thrown, but I still get these log messages:
onecoreuap\drivers\wdm\bluetooth\user\winrt\common\bluetoothutilities.cpp(509)\Windows.Devices.Bluetooth.dll!0F27FDD6: (caller: 0F28B977) ReturnHr(3) tid(253c) 80070490 Element not found.
onecoreuap\drivers\wdm\bluetooth\user\winrt\device\bluetoothledevice.cpp(428)\Windows.Devices.Bluetooth.dll!0F28B9B7: (caller: 0F28AF01) ReturnHr(4) tid(253c) 80070490 Element not found.
But the program continues to run and I continue to discover and connect. But since I can't get the connection event, it is kind of useless at this stage.
I hate my answer. But after asynching and co-routining everything under the sun, the problem is unsolvable by me:
This method
bluetoothLEDevice = co_await BluetoothLEDevice::FromBluetoothAddressAsync(eventArgs.BluetoothAddress(), BluetoothAddressType::Random);
returns NULL. That should not happen and there is not much I can do about it. I read that as a broken BLE API.
A BTLE Central should be able to do as follows:
Discover a device; if new, then:
    If the user selects connect, connect to the device
    perform service discovery
    read/write/enable characteristics as needed
    handle indications/notifications
    If at any time the peripheral sends a security request or insufficient authentication error, start pairing
    repeat the action that caused the insufficient authentication.
    On disconnect, save the paired and bonded state if the device is pairable.
On rediscovery of the device, if unpaired (not a pairable device)
    repeat above
If paired and bonded
    start encryption
    work with the device; no need to re-enable or do service discovery
========================= MORE INFO ===================================
This is what the log shows when the method is called
Connect to device called
onecoreuap\drivers\wdm\bluetooth\user\winrt\common\bluetoothutilities.cpp(509)\Windows.Devices.Bluetooth.dll!0496FDD6: (caller: 0497B977) ReturnHr(1) tid(3b1c) 80070490 Element not found.
onecoreuap\drivers\wdm\bluetooth\user\winrt\device\bluetoothledevice.cpp(428)\Windows.Devices.Bluetooth.dll!0497B9B7: (caller: 0497AF01) ReturnHr(2) tid(3b1c) 80070490 Element not found.
BluetoothLEDevice returned is NULL. Can't register
Since the BluetoothLEDevice is NULL, I do not attempt to register.
================= MORE INFO ===================
I should also add that taking an over-the-air sniff reveals that there is never a connection event. Though the method is supposed to initiate a connection as well as return the BluetoothLEDevice object, it ends up doing neither. My guess is that the method requires more pre-use setup of the system that only the DeviceWatcher does. The AdvertisementWatcher probably does not.
In BLE you always have to wait for every operation to complete.
I am not an expert in C++, but in C# the async connection procedure returns a bool if it was successful.
In C++ the IAsyncOperation does not have a return type, so there is no way to know if the connection procedure was successful or completed.
You will have to await the IAsyncOperation and make sure that you have a BluetoothLEDevice object, before you attach the event handler.
For how to await an IAsyncOperation, see this question/answer:
How to wait for an IAsyncAction?

Suspended orchestration service instance re-throwing the same unexpected exception after Resume

I am getting the error below when I try to resume a Suspended (resumable) orchestration instance.
Scenario: The request went through a DB2 static solicit-response port and failed because access permission was denied. I can see two suspended instances in the admin console: one related to the port and another related to the orchestration. After fixing the credentials, the suspended port instance resumed successfully, but the orchestration instance keeps failing.
Uncaught exception (see the 'inner exception' below) has suspended an instance of service 'Orchestration name'.
The service instance will remain suspended until administratively resumed or terminated.
If resumed the instance will continue from its last persisted state and may re-throw the same unexpected exception.
InstanceId: ca927086-465d-40e8-93fe-c3a0e4c161f7
Shape name:
ShapeId:
Exception thrown from: segment -1, progress -1
Inner exception: An error occurred while processing the message, refer to the details section for more information
Message ID: {96B72521-9833-48EF-BB2F-4A2E2265D697}
Instance ID: {F6FBC912-C9DC-489C-87F3-103FA1273FDC}
Error Description: The user does not have the authority to access the host resource. Check your authentication credentials or contact your system administrator. SQLSTATE: HY000, SQLCODE: -1000
Exception type: XlangSoapException
Source: Microsoft.XLANGs.BizTalk.Engine
Target Site: Void VerifyTransport(Microsoft.XLANGs.Core.Envelope, Int32, Microsoft.XLANGs.Core.Context)
The following is a stack trace that identifies the location where the exception occured
at Microsoft.BizTalk.XLANGs.BTXEngine.BTXPortBase.VerifyTransport(Envelope env, Int32 operationId, Context ctx)
at Microsoft.XLANGs.Core.Subscription.Receive(Segment s, Context ctx, Envelope& env, Boolean topOnly)
at Microsoft.XLANGs.Core.PortBase.GetMessageIdForSubscription(Subscription subscription, Segment currentSegment, Context cxt, Envelope& env, CachedObject location)
at Microsoft.XLANGs.Core.PortBase.GetMessageId(Subscription subscription, Segment currentSegment, Context cxt, Envelope& env, CachedObject location)
at (StopConditions stopOn)
at Microsoft.XLANGs.Core.SegmentScheduler.RunASegment(Segment s, StopConditions stopCond, Exception& exp)
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Any thoughts how to fix this?
Creating the above scenario using samples:
Go to the BizTalk samples/orchestrations/consumeWebservice folder, install the ConsumeWebService application and publish POWebservice to IIS.
Change the IIS Directory security permissions for POWebservice; remove anonymous or any other access.
Now drop the message; you will see suspended messages because of HTTP status 401: Access Denied. Then give access to POWebservice, either anonymous or Windows.
Then resume the suspended instances; one will disappear but the other one (the orchestration) won't.
The orchestration will continue to fail with the exception because when it was suspended, the last persistence point was the receipt of the exception. This means that the orchestration will re-start (when resumed) and re-throw the exception.
Here's an article discussing some points at which orchestration state is persisted to the database: http://blogs.msdn.com/b/sanket/archive/2006/11/12/understanding-persistence-points-in-biztalk-orchestration.aspx
You can manipulate this to some extent in your orchestration design, as Richard Seroter discusses here, but generally you would do better to use failed message routing, enabling you to handle the failed messages, and terminate the failed orchestration instance.
Please correct me if I'm wrong, but is this not just normal BizTalk behavior? I am not 100% sure so please let me know if this is wrong:
The outbound messaging instance was suspended because the credentials the port was using to connect to the DB were wrong.
This caused the orchestrations making these calls to also suspend.
The suspended message instance was resumed and was processed correctly because the problem was fixed. So the call was made to the DB.
However, the orchestration instance may not be able to resume because when resumed it found itself at the most recent persistence point and the original error which was delivered back from the send port is still available to the orchestration, causing it to re-suspend.
In the error message, it actually says "If resumed the instance will continue from its last persisted state and may re-throw the same unexpected exception."
If you want to handle this sort of thing you could make the call to the database atomic. That way the orchestration will not persist itself at the point of making the DB call. If the orchestration then suspends it will resume at a point before the DB call is made, and will make the DB call as normal, which should succeed this time because you have fixed the original issue.
The only problem with this is if your DB call cannot be executed more than once with the same data without bad things happening (i.e. it is not idempotent).
I am not 100% sure about the above explanation. Please point out if my understanding is incorrect.
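As a rough illustration of the reasoning in this answer, here is a plain-Java model of resuming from the last persistence point. The names are hypothetical and nothing here is BizTalk API; it only encodes the two cases discussed: state persisted after the failed response was received, versus state persisted before the DB call was made.

```java
// Plain-Java model of resuming from the last persistence point. The names are
// hypothetical; nothing here is BizTalk API.
final class PersistencePointModel {

    enum PersistedAt { BEFORE_DB_CALL, AFTER_FAILED_RESPONSE }

    enum Outcome { COMPLETED, SUSPENDED }

    static Outcome resume(PersistedAt lastPersisted, boolean credentialsFixed) {
        if (lastPersisted == PersistedAt.AFTER_FAILED_RESPONSE) {
            // The stored failure is part of the persisted state, so it is re-thrown.
            return Outcome.SUSPENDED;
        }
        // The call is executed again, so it must be safe to repeat (idempotent).
        return credentialsFixed ? Outcome.COMPLETED : Outcome.SUSPENDED;
    }
}
```

In this model, fixing the credentials helps only when the instance resumes from a point before the call, which matches the behaviour reported in the question.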
this scenario not treated by Microsoft Biztalk = Middleware FAIL.
you have to solve this at the orchestration design stage up front...
http://seroter.wordpress.com/2007/01/02/orchestration-handling-of-suspended-messages/

Resources