I have the following problem: I have two network clients, where one is a device that is to be "claimed" by its owner, and another is the program which claims it. When the claimee hits the server, it announces it's available to be claimed, and then the claimer can claim it (after authenticating and supplying information only it could know of course). However, if claimer hits the server first, then I have a classic "lost signal" problem. The claimer can retry and that's fine, but I can end up with the following race condition, the main point in question:
1) Claimee hits the server and announces, then its connection fails.
2) Claimer comes in, finds the announced record, and claims it.
3) Claimee reconnects with a status of unclaimed, and overwrites the claim.
I've thought of a few solutions:
1) Expire old claimee announces after 60 seconds, and have the claimer retry. This is still susceptible to the above problem, but shrinks the window to about 60 seconds. In addition, the claimee takes about 30-40 seconds to bootstrap, so it should pragmatically make the problem very hard to encounter, or reproduce.
2) Have claims issued by claimer be valid for any claimee announce up to 30 seconds after the claim came in. This works, but starts to muddle the definition of a claimee announce: the announce no longer always means "reset the claimee status," because for up to 30 seconds after the last claim it means "join to last claim."
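A minimal sketch of option 2, assuming a hypothetical in-memory `ClaimRegistry` on the server keyed by device ID. The class name, method names and injectable clock are illustrative assumptions; only the 30-second window comes from the description above.

```python
import time

CLAIM_GRACE_SECONDS = 30  # window taken from option 2 above


class ClaimRegistry:
    """Hypothetical in-memory server state, keyed by device ID."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the window can be tested
        self._claims = {}     # device_id -> (owner, time the claim arrived)

    def claim(self, device_id, owner):
        # Record the claim even if the device has not announced yet.
        self._claims[device_id] = (owner, self._clock())

    def announce(self, device_id):
        """Return the owner the device is joined to, or None if it was reset."""
        entry = self._claims.get(device_id)
        if entry is not None:
            owner, claimed_at = entry
            if self._clock() - claimed_at <= CLAIM_GRACE_SECONDS:
                return owner  # inside the window: "join to last claim"
        # Outside the window an announce means "reset the claimee status".
        self._claims.pop(device_id, None)
        return None
```

With this shape, a reconnecting claimee that announces shortly after the claim keeps its owner, while a later announce resets it, which is exactly the double meaning of "announce" that the option describes.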
Those are the high points, but may not be enough of a description of the problem, so let me know if I can add any comments to elucidate further. In terms of the solution, these are workable solutions, but I'm looking for an analogy to a known problem perhaps, and to see if there're ideas I haven't thought of.
Maybe I didn’t understand the problem description correctly, but you also have another problem: what if both are connected just fine and then the claimee fails? The claimer will need to deal with this issue as well, unless you’re assuming that this scenario can never happen.
In general there are several ways to implement a solution for both problems, but probably the most reliable one would be inspired by the implementation used by Java’s RMI.
When you send a message to the claimee, add a unique ID to it. When you don’t get an answer, you can retry sending the message several times with the same ID (messages can get lost), and after some longer timeout you can accept that the claimee is unavailable. Now you can look up the connection information at the server again and restart the process.
For this you’d need to cache all messages which haven’t yet been processed on the claimer’s side. Additionally, on the claimee’s side you’ll need to cache the last X message IDs and their results (if available). This is necessary so that the operation in a message is not performed multiple times, and so that the correct result can be sent again (since result messages can get lost too).
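The retry-with-the-same-ID idea can be sketched roughly as follows. This is only an illustration under assumptions: `Claimee`, `send_with_retry` and the `transport` callable are hypothetical names, and a real system would use a network call with a timeout instead of a plain function.

```python
import uuid
from collections import OrderedDict


class Claimee:
    """Hypothetical receiver that remembers the last `cache_size` message IDs."""

    def __init__(self, cache_size=100):
        self._results = OrderedDict()  # message_id -> cached result
        self._cache_size = cache_size
        self.executions = 0            # how often the operation really ran

    def handle(self, message_id, payload):
        if message_id in self._results:
            # Duplicate delivery: replay the stored result, do not re-run.
            return self._results[message_id]
        self.executions += 1
        result = f"processed:{payload}"
        self._results[message_id] = result
        if len(self._results) > self._cache_size:
            self._results.popitem(last=False)  # evict the oldest ID
        return result


def send_with_retry(transport, payload, max_attempts=3):
    """Resend with the SAME message ID; give up after max_attempts."""
    message_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return transport(message_id, payload)
        except TimeoutError:
            continue  # request or reply was lost; resend with the same ID
    raise ConnectionError("claimee unavailable; look it up on the server again")
```

Because the ID stays the same across retries, a lost reply leads to a replayed result rather than a second execution.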
Having spent long hours trying to find documentation and help around this resulting in nothing, I have decided to reach out to the community.
I would like to read messages from a topic subscription. Using the message, a UI is populated for a human to work on it. Processing each message takes approximately 15 minutes, and each client can work on only one message at a time. At the end of processing the message, the client can either decide to stop processing messages or request a new message.
With the max lock time set at 5 minutes on the subscription, I need to be able to automatically renew my lock for up to 15 minutes.
The first attempted approach was to use CreateReceiver and fetch the message, read it and Complete message when done. The issue with this is I have not been able to figure out how to automatically renew the lock for 15 minutes. I see the RenewLockAsync function but would like for this to be automatic and not have to run a background timer to keep track of the expiring lock.
The second attempted approach was to try using ServiceBusClient.CreateProcessor() with options to set the AutoLockRenewal timespan. The issue faced here is with the processor itself running based on events in the background. Since I need to populate a UI, I need to be able to stop the processor after the message has been read, return from the callback and, once the human interaction is done, complete the message. I have been unable to find a way to do this.
What would be a good approach to achieve this? The subscription acts as a work queue that multiple people pull items from and individually work them. Any help in proposing an approach to this is appreciated.
I am using an Axon Event Tracking processor. Sometimes events take longer than 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this BUT the event is still processed twice so I think this might be misleading. (I think I was mistaken about this)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval. None of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making an HTTP request to an external service.
I'm using Axon 4.1.2 with all default configuration, using Spring auto configuration. I cannot see the "Releasing claim on token and preparing for retry in [timeout]s" log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have, for example, 2 segments and 6 applications it still doesn't reappear; however, I'm not sure how this is different from my original scenario of 1 segment and 2 applications.
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idempotency check before the HTTP call?
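A rough sketch of what such an idempotency check could look like. `IdempotentHandler` is a hypothetical wrapper and not part of Axon's API; the event ID and the in-memory set are assumptions made for illustration.

```python
class IdempotentHandler:
    """Hypothetical event handler wrapper; not part of Axon's API."""

    def __init__(self, side_effect, processed_ids=None):
        self._side_effect = side_effect  # e.g. the external HTTP call
        self._processed = processed_ids if processed_ids is not None else set()

    def on_event(self, event_id, payload):
        if event_id in self._processed:
            return False  # already handled, skip the external call
        self._side_effect(payload)
        # Mark the event as processed only after the call succeeded.
        self._processed.add(event_id)
        return True
```

An in-memory set only guards a single process. With two application instances, the processed IDs would need to live in a store both can reach, with an atomic "insert if absent" (for example a unique constraint), otherwise both instances can still pass the check at the same time.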
The "Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node." message can only occur in three scenarios:
1) You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
2) The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
3) The main event processing loop has caught an Exception, making it retry with an exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handle the same event. Hence why you see the event handler being invoked twice.
Obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have too little to go on. Let us figure this out @Dan!
Update
Thanks for updating your question @dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes too long. If it takes too long, the token will be claimed by another process. This other process in this case is the thread of the TEP of your other application instance. The "claim timeout" is defaulted to 10 seconds, which also corresponds with the long running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, given the fact that you have a long running operation.
There are of course different ways of tackling this. Like, making sure the TEP is only run on a single instance (not really fault tolerant though), or offloading this long running operation to a scheduled task which is triggered by the event.
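To illustrate why the timeout matters, here is a toy model of a claim-based token store. It is not Axon's implementation: `TokenStore`, `try_claim` and `can_steal` are hypothetical names, and the sketch assumes the owning node cannot extend its claim while a handler is still running.

```python
class TokenStore:
    """Toy model of a claim-based token store; not Axon's implementation."""

    def __init__(self, claim_timeout):
        self.claim_timeout = claim_timeout
        self.owner = None
        self.claimed_at = None

    def try_claim(self, node, now):
        expired = (self.claimed_at is not None
                   and now - self.claimed_at > self.claim_timeout)
        if self.owner in (None, node) or expired:
            self.owner, self.claimed_at = node, now
            return True
        return False


def can_steal(claim_timeout, handler_seconds):
    """Does node B obtain the token while node A is still handling an event?"""
    store = TokenStore(claim_timeout)
    store.try_claim("A", now=0)
    # Assumption: node A is busy and does not refresh its claim until the
    # handler returns. Node B attempts a claim once per second meanwhile.
    return any(store.try_claim("B", now=t) for t in range(1, handler_seconds))
```

Under these assumptions a 15-second handler loses the token with a 10-second timeout (`can_steal(10, 15)`), but keeps it with a 60-second timeout (`can_steal(60, 15)`).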
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end @dan!
Problem description
I have an ASP.NET app in which the users have different rights, and are logged in through Facebook. The app includes (among other things) filling out some forms. Some users have access to forms others don't. The forms can sometimes require some searching in books and/or on the internet before being able to submit them.
As such, we're having problems with session time-outs (it seemed), where users would be met with "Not authorized to see this page/form" after doing research somewhere else.
Attempted solutions
I've created a log function that logs the state of a handful of variables on strategic points in the application. I've pinpointed the problem to the fact that the Session variable "UserRole" is null when the problem occurs.
Relogging
The obvious solution is: "Have you tried relogging?" - which should reset the session and allow the user back to the form they want. On logout, I use
Session.Clear();
Session.RemoveAll();
and I create a new session with relevant variables (including UserRole) on login. This doesn't help, though.
Keeping session alive
One way to do it is to just increase the standard 20-minute Session length to an arbitrary, higher number (say 2 hours). Although that could be viable during beta (there are only around 5 users right now), it is not a viable solution in the long haul, as the server would have to keep the Session objects from many users for a longer time, increasing the demands on the server.
Instead, I created a 'dummy' .ashx handler "RefreshSession.ashx", that can receive a POST request and return a "200" status code. I then created a jQuery function in the shared part of the app (that all the pages use) that calls this handler every 10 minutes in order to refresh the session as long as the tab is open in the browser. I've checked the network traffic, and it works as intended, calling the handler even if the window is minimized or the user is viewing another tab. This did not solve the problem either.
A caveat
When one of the users encounters the problem, they call me or my programming partner up. Of course, we go and see if we get the same issue. We all have the same (admin) rights. The 'funny' thing is that we see the exact same error on the same subpage - even if we haven't had any contact with the application for days.
The problem will 'fix itself' (i.e. let users with proper role back on the subpage) after a while, but not even republishing the app to the server will reset it manually.
Therefore, it does not seem to be a simple session error, as the "UserRole" session variable being null after 15-20 minutes of inactivity would suggest. It seems to be saved somewhere internally in the server state.
My problem is that I now have no idea where to look and how to progress. I was hoping that someone here might have an idea for a solution, or at least be able to point me in the right direction? :-)
Thank you all for your time, it is much appreciated.
Based on MaCron's comment to the question, we decided to keep the information in the user's cookies instead of the session variables. Everything seemed to point to us having exactly that issue, and deadlines being deadlines and with me not being able to figure out how to disable the synchronization of worker processes, this seemed to be a feasible and comparatively easy fix.
I didn't see a situation quite like mine, so here goes:
Scenario highlights: The user wants a system that includes custom SMS alerts. A component of the functionality is to have a way to identify a start based on user input, then send SMS with personalized message according to a pre-defined interval after the trigger. I've never used Twilio before and am noodling around with the implementation.
First Pass Solution: Using a Twilio account, I designated the .aspx that will receive the inbound triggering alert/SMS via GET. The receiving page declares and instantiates my SMSAlerter object within page load, which responds immediately with a first SMS and kicks off the System.Timers.Timer. Elementary, and functional to a point.
Problem: The alerts continue to be sent only if the timer interval is short. I tested it at a one-minute interval and was successful. When I went to 10 minutes, the immediate SMS was sent and the first message 10 minutes later was sent, but nothing after that.
My Observation: Since there is no interaction with the resource after the inbound text, the Session times out if left at default 20 minutes. Increasing Session timeout doesn't work, and even if it did does not seem correct since the interval will be on the order of hours, not minutes.
Using Cache to store each new SMSAlerter might be the way to go. For any SMSAlerter that is created, the schedule is used for roughly 12 hours and is replaced with a new SMSAlerter object when the same user notifies the system the following day. Is there a better way? Am I over/under-simplifying? I am not anticipating heavy traffic now (tens of users), but the user is thinking big.
Thank you for comments, suggestions. I didn't include the code, because the question is about design, not syntax.
I think your timer is going out of scope about 20 minutes after the original request, killing the timer. I have a feeling that if you keep refreshing the aspx page it won't happen - but obviously that doesn't help much.
You could launch a new thread that has the System.Timers.Timer object so it stays alive, and doesn't go out of scope when there are no follow up requests to the server. But this isn't a great idea to be honest - although it might help with understanding the issue.
Ultimately, you'll need some sort of continuously running service - as you don't want to depend on the app pool for this, so I'd suggest a Windows Service running in the background to handle it, which is going to be suitable for a long term solution.
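As a rough illustration of what such a continuously running service might do internally, here is a sketch of a due-time queue that a worker loop could poll. `AlertScheduler` and its methods are hypothetical names, times are plain seconds, and `send` stands in for whatever actually sends the SMS.

```python
import heapq


class AlertScheduler:
    """Hypothetical due-time queue that a long-running service would poll."""

    def __init__(self):
        # Heap of (due_time, user, message, interval, remaining_sends).
        self._queue = []

    def schedule(self, user, message, start, interval, repeats):
        heapq.heappush(
            self._queue, (start + interval, user, message, interval, repeats))

    def run_due(self, now, send):
        """Send every alert whose due time has passed and re-queue repeats."""
        sent = 0
        while self._queue and self._queue[0][0] <= now:
            due, user, message, interval, remaining = heapq.heappop(self._queue)
            send(user, message)
            sent += 1
            if remaining > 1:
                heapq.heappush(
                    self._queue,
                    (due + interval, user, message, interval, remaining - 1))
        return sent
```

The service would call `run_due` periodically. Since the schedule does not depend on any web request, it survives session and app pool timeouts; keeping the queue in a database rather than in memory would also let it survive a restart of the service.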
Hope this helps!
(Edited slightly to make the windows service aspect clearer)
When considering a service in NServiceBus, at what point do you start questioning whether a service handles too many messages, and start to break these out into a new service?
Consider the following: I have a sales service which can currently be broken into a few distinct business components, these are sales order validation, sales order processing, purchase order validation and purchase order processing.
There are currently about 20 message handlers and 2 sagas used within this service. My concern is that during high volume traffic from my website the number of queued messages can spike into the hundreds. Considering that the messages need to be processed in the order they are taken off the queue, this can cause a delay for the last in the queue (depending on what processing each message does).
When separating concerns within a service into smaller business components I find this makes things a little easier. Sure, it's a logical separation, but it seems to provide a layer of clarity and understanding. To me it seems an easier option to do this than creating new services, where in the end the more services I have, the more maintenance I need to do.
Does anyone have any similar concerns to this?
I think you have actually answered your own question :)
As soon as the message volume reaches a point where the lag becomes an issue you could look to instance your endpoint. You do not necessarily need to reduce the number of handlers. You could simply install the service a number of times and have specific message types sent to the relevant endpoint by mapping.
So it becomes a matter of a simple instance installation and some config changes. So you can then either split messages on sending so that messages from a particular source end up on a particular endpoint (maybe priority) or on message type.
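The two ways of splitting, by source or by message type, can be pictured as a simple routing table. This is only an illustration of the idea, not NServiceBus configuration: the message type names, queue names and the `route` function are all hypothetical.

```python
# Hypothetical message types and endpoint queue names.
ROUTES_BY_TYPE = {
    "ValidateSalesOrder": "sales.validation",
    "ProcessSalesOrder": "sales.processing",
}
PRIORITY_SOURCES = {"website"}  # hypothetical source with its own instance
DEFAULT_ENDPOINT = "sales"


def route(message_type, source=None):
    """Pick the endpoint instance for a message, by source first, then type."""
    if source in PRIORITY_SOURCES:
        return "sales.priority"
    return ROUTES_BY_TYPE.get(message_type, DEFAULT_ENDPOINT)
```

The handlers stay the same in every instance; only the mapping decides which queue a message lands on.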
I happened to do the same thing on a previous project (not using NServiceBus though) where we needed document conversion messages coming from the UI to be processed ASAP. We simply installed the conversion service again with its own set of queues and changed the UI configuration to send the conversion messages to the new endpoint. The background conversion messages were still going to the previous endpoint. So here the source determined the separation.