Say that you are designing a system with two subsystems: one is a state machine A and the other is a functional subsystem B. Subsystem B has methods like doAsync(), that perform some operation and later return a completion or error event back to system A.
Now subsystem A can receive events from other external system, like ModeChange which results in changing to the proper state.
Here is the problem. Say that subsystem A is in state AA that invoked the function doAsync() and is waiting for the completion event. Yet, before the completion event is received, I get a ModeChange event that now pushes subsystem A to state AB. However, the completion event for doAsync() is now received on state AB instead of AA and thus captured as an unexpected event.
What would be the best way to solve this problem?
Here are some possible solutions and my thoughts:
Implement cancel operation for every async operation in system B.
I would prefer not to do this since it add a lot more work. It would increase the complexity of system B.
Ignore the unexpected Events.
I don’t like this solution because it would hide other problems.
Related
I am using an Axon Event Tracking processor. Sometimes events take longer that 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this BUT the event is still processed twice so I think this might be misleading. (I think I was mistaken about this)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval. None of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making a HTTP request to an external service.
I'm using axon 4.1.2 with all default configuration when using with Spring auto configuration. I cannot see the Releasing claim on token and preparing for retry in [timeout]s log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have for example 2 segments and 6 applications it still doesn't reappear, however I'm not sure how this is different to my original scenario of 1 segment and 2 application?
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idem-potency check before the HTTP call?
The Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node. message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with a exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handled the same event. Hence why you see the event handler being invoked twice.
Obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have to little to go on. Let us figure this out #Dan!
Update
Thanks for updating your question #dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes to long. If it takes to long, the token will be claimed by another process. This other process in this case is the thread of the TEP of your other application instance. The "claim timeout" is defaulted to 10 seconds, which also corresponds with the long running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, giving the fact you have a long running operation.
There are of course different ways of tackling this. Like, making sure the TEP is only ran on a single instance (not really fault tolerant though), or offloading this long running operation to a schedule task which is triggered by the event.
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end #dan!
I am trying to use Axon's TrackingEventProcessor to replay our events.
While resetting the token, I am getting an UnableToClaimTokenException, since the service runs in a distributed setup.
Is there any way to solve this issue without using Axon Server?
Axon Server has nothing to do with the occurrence of the UnableToClaimTokenException, Axon Server just greatly simplifies the process of initiating a replay.
As the exception and javadoc of the TrackingEventProcessor state, you will need to stop all instances of a given TrackingEventProcessor prior to initiating a replay.
Thus, in a distributed set up, you will have to stop each and every duplication of the given TrackingEventProcessor prior to being able to actual call resetTokens on one of them.
Without Axon Server, that means you will have to create your own endpoints or CLI within your application to stop the given processors. To simplify this, you would essentially want to have a centralized dashboard which shows all occurrence of a given TrackingEventProcessor.
This is exactly what Axon Server is, hence simplifying the process tremendously.
Regardless, it's definitely doable to create this yourself.
Thus, when it comes to triggering a replay, you will first shut down each instance of the TEP prior to a reset.
I want to pull messages off a MQS queue in a C client, and would love to do so asynchronously so I don't have to start (explicitly) multithreading. The messages will be forwarded to another system that acts "transactionally" but is completely incompatible with XA. So I'd like to have a way to explicitly commit (and thereby remove) a message that's been successfully handed off to the other system, and not commit if this failed, so that the last message is retained for a more successful later attempt.
I've read about the SYNCPOINT option and understand how I'd use that around a regular GET, but I haven's seen any hints on how to make asynchronous message retrieval have transactional behavior like this. Any hints, please?
I think you are describing using the asynchronous callback capability, ie you register a routine to be called when a message arrives, and ask for any get to be under syncpoint... An explanation of how some of it works is in here, https://share.confex.com/share/117/webprogram/Handout/Session9513/share_advanced_mqi.pdf page 4+
Effectively you get called with the MQ message under syncpoint, do your processing with another system, then commit or rollback the message before returning.
Be aware without the use of e.g. XA 2 phase commit, there is always going to be the windows of e.g. committing to the external system and a power outage means the message under the unit of work gets rolled back inside MQ as you didnt have time to perform the commit.
Edit: my misunderstanding, didn't realise that the application was using a callback to retrieve messages, which is indeed fully asynchronous behavior. Disregard the answer below.
Do MQGET with MQGMO_SYNCPOINT, then issue either MQCMIT or MQBACK.
"Asynchronous" and "synchronous" may be misnomers - these are your patterns of using MQ - whether you wait for a reply message or not, these patterns do not affect how MQ processes your calls. Transaction management (unit of work management) works across any MQI calls that use SYNCPOINT, no matter if they are part of a request/reply pattern or not.
I'm responsible for the maintenance and evolution of an already developed SMTP proxy for about a month now.
During some research about other aspects of this application I found the link below that states that channel writes are not synchronous and subsequent closes must be done using a listener.
Netty Channel closed detection
Questions:
Is this valid for version 3.6.3 (Final) of Netty API?
Considering that this is valid, is there any problem to store the ChannelFuture returned by a channel's "write" operation for future usage?
The reason for question "2" is:
I've made an initial analysis of the application's code and there is lots of places with writes, closes and write with subsequent close (one command right after the other).
The only way, I could realize, to handle all operations, given the asynchronous nature of this aspect of Netty, involves the storing of ChannelFuture returned by all writing operations so I can use them to schedule close operations.
Basically I would create a helper with methods "write" and "close" and hold a Map<Channel, ChannelFuture>. Always that a write is called, I put a new record on this map with the channel itself and the ChannelFuture returned by the channel's "write" operation.
When a "close" helper method is invoked, I firstly try to find the channel on this map. If I can't find it, there is no pending write operation, so I can close the channel right away. Otherwise, I have a pending write, so I use the stored ChannelFuture to register a listener scheduling the channel's closing.
I've started experimenting with SignalR. I've been trying to come up with a flexible way of storing information about each connected client. For example, storing the name in a chat app rather than passing it with each message.
At the moment, I have a static dictionary which matches the connectionId to an object which contains these properties. I add to this dictionary on connection, and remove on disconnection.
The issue I'm having is that I don't seem to get all disconnect events. If I close a tab in Chrome, the disconnect seems to go through. However, if I rapidly reload a tab, the disconnect doesn't seem to occur (at least not 'cleanly'). For example, if I reload the same tab over and over, it'll tell me my dictionary has multiple items when it should - in theory still be one.
Is there a standard way of storing this kind of per-connection information? Otherwise, what might be causing the issue I'm having?
You are actually handling connection id data correctly. Ensure that you are only instantiating your user data in OnConnected and uninstantiating it in OnDisconnected.
When spamming refresh on your page there are situations which result in the OnDisconnected event not being triggered immediately. However you should not worry about this because SignalR will actually time-out the connection and trigger the OnDisconnected event after a designated timeout (DisconnectTimeout).
If you do come across scenarios where there is not a 1-to-1 correlation for OnConnected and OnDisconnected events (after a significant amount of time) make sure to file a bug at https://github.com/SignalR/SignalR/issues.
Lastly if you're looking at doing some advanced chat mechanics and looking for some inspiration check out JabbR, it's open source!
https://github.com/davidfowl/JabbR
Hope this helps!