How to initiate a correct reset of a Tracking Event Processor - axon

I am trying to use Axon's TrackingEventProcessor to replay our events.
While resetting the token, I am getting an UnableToClaimTokenException, since the service runs in a distributed setup.
Is there any way to solve this issue without using Axon Server?

Axon Server has nothing to do with the occurrence of the UnableToClaimTokenException; it just greatly simplifies the process of initiating a replay.
As the exception and javadoc of the TrackingEventProcessor state, you will need to stop all instances of a given TrackingEventProcessor prior to initiating a replay.
Thus, in a distributed setup, you will have to stop each and every instance of the given TrackingEventProcessor before you can actually call resetTokens on one of them.
Without Axon Server, that means you will have to create your own endpoints or CLI within your application to stop the given processors. To simplify this, you would essentially want a centralized dashboard which shows all occurrences of a given TrackingEventProcessor.
This is exactly what Axon Server provides, hence it simplifies the process tremendously.
Regardless, it's definitely doable to create this yourself.
Thus, when it comes to triggering a replay, you will first have to shut down every instance of the TEP prior to the reset.
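For illustration, here is a minimal sketch of what such endpoints could delegate to, assuming Axon 4's configuration API; the class and the surrounding wiring are placeholders, while shutDown, resetTokens and start are the actual TrackingEventProcessor methods:

import org.axonframework.config.EventProcessingConfiguration;
import org.axonframework.eventhandling.TrackingEventProcessor;

public class ReplayService {

    private final EventProcessingConfiguration processingConfiguration;

    public ReplayService(EventProcessingConfiguration processingConfiguration) {
        this.processingConfiguration = processingConfiguration;
    }

    // Step 1: invoke this on EVERY node, for example through a REST endpoint or CLI command.
    public void shutDownProcessor(String processorName) {
        processingConfiguration
                .eventProcessor(processorName, TrackingEventProcessor.class)
                .ifPresent(TrackingEventProcessor::shutDown);
    }

    // Step 2: invoke this on ONE node, once every node has shut the processor down.
    public void resetAndRestartProcessor(String processorName) {
        processingConfiguration
                .eventProcessor(processorName, TrackingEventProcessor.class)
                .ifPresent(tep -> {
                    tep.resetTokens(); // still throws UnableToClaimTokenException if an instance runs elsewhere
                    tep.start();
                });
    }
}

Afterwards, don't forget to start the processor on the other nodes again.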

Related

Axon FW 4.6 comes with Dead Letter Queue support, but is it possible to still have rollback support in case of exception?

We're looking into using the new Axon feature, the dead-letter queue (DLQ).
In our previous application (Axon 4.5.x) we have a lot of event handlers updating projections. Our default is to rethrow exceptions when they occur, which triggers a rollback of the database updates. It is perhaps not best practice to rely on this behaviour, because it cannot roll back everything (e.g. sending an email from an event handler cannot be reverted, of course).
This behaviour seems to get lost when introducing the DLQ in our applications, which has a big impact on our current application flow (projections are updated where they previously weren't). This makes upgrading not that easy.
Is it possible to still get the old behaviour (transaction rolled back in case of exceptions) together with DLQ processing?
We tried building a test application to exercise the new DLQ features. While playing around, everything looked fine in case of exceptions (they were moved to the DLQ), but the projections still got updated (not rolled back as before).
We throw an exception after the .save() of the projection, simulating a database failure, to see if the events involved (we have multiple event handlers per event updating projections) got rolled back.
You need to choose here, #davince. Storing a dead letter in the dead-letter queue similarly requires a database transaction.
To ensure the token still progresses and the dead letter is entered, the framework uses the existing transaction.
Furthermore, in practical terms event handling was successful as far as the framework is concerned: the letter was stored and processing can move on.
Hence, a rollback for some of the parts wouldn't be feasible.
The best way to deal with this, as is mentioned in the Reference Guide, is to make your event handlers idempotent:
Before configuring a SequencedDeadLetterQueue it is vital to validate whether your event handling functions are idempotent. As a processing group consists of several Event Handling Components (as explained in the intro of this chapter), some handlers may succeed in event handling while others will not. As a configured dead-letter queue does not stall event handling, a failure in one Event Handling Component does not cause a rollback for other event handlers. Furthermore, as the dead-letter support is on the processing group level, dead-letter processing will invoke all event handlers for that event within the processing group.
Thus, if your event handlers are not idempotent, processing letters may result in undesired side effects.
Hence, we strongly recommend making your event handlers idempotent when using the dead-letter queue.
The principle of exactly-once delivery is no longer guaranteed; at-least-once delivery is the reality to cope with.
I understand this requires some rework on your end, #davince. Although the feature is powerful, it does come with a certain level of responsibility on your side.
I hope the above clarifies this for you.
In addition, I'd like to point out that the version upgrade in itself does not require you to use the dead-letter queue. Hence, this change shouldn't impose any strain on updating to the latest release.
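To make "idempotent" concrete, here is a minimal sketch of an event handler that does a find-or-create keyed on the event's identifier instead of a blind insert, so a redelivered event overwrites the same row rather than creating a duplicate. Every projection type below (CourseCreatedEvent, CourseSummary, CourseSummaryRepository) is a hypothetical stand-in for your own JPA entities and repositories:

import java.util.Optional;
import org.axonframework.eventhandling.EventHandler;

// Hypothetical event; replace with your own.
class CourseCreatedEvent {
    private final String courseId;
    private final String name;
    CourseCreatedEvent(String courseId, String name) { this.courseId = courseId; this.name = name; }
    String getCourseId() { return courseId; }
    String getName() { return name; }
}

// Hypothetical projection entity; replace with your own JPA entity.
class CourseSummary {
    private final String courseId;
    private String name;
    CourseSummary(String courseId) { this.courseId = courseId; }
    void setName(String name) { this.name = name; }
}

// Hypothetical repository; a Spring Data repository would fit here.
interface CourseSummaryRepository {
    Optional<CourseSummary> findById(String courseId);
    void save(CourseSummary summary);
}

public class CourseSummaryProjector {

    private final CourseSummaryRepository repository;

    public CourseSummaryProjector(CourseSummaryRepository repository) {
        this.repository = repository;
    }

    @EventHandler
    public void on(CourseCreatedEvent event) {
        // Find-or-create on the event's identifier, then overwrite the fields.
        // Handling the same event twice yields the same row, not a duplicate.
        CourseSummary summary = repository.findById(event.getCourseId())
                .orElseGet(() -> new CourseSummary(event.getCourseId()));
        summary.setName(event.getName());
        repository.save(summary);
    }
}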
Update 1
Sometimes you need to think about an issue a bit longer... I was just wondering the following things about your setup. Perhaps I can help out on that front:
What storage mechanism do you use to store projections in?
Where are you storing your tokens?
Where are you planning to store your dead-letters?
How are you invoking the storage layer from your event handlers?
Update 2
Thanks for sharing that you're using PostgreSQL for your projections, tokens, and dead letters. And that you're using JPA entities for storage.
It gives more certainty about your setup, as it may impact how your system would react in case of a rollback.
However, as it's rather vanilla/regular, the shared comment from the Reference Guide still applies.
This, sadly enough, means some work on your end, #davince. I hope the road forward to start using the SequencedDeadLetterQueue from Axon Framework is clear.
By the way, if you ever have recommendations on how the framework or the documentation may be improved, be sure to file issues in GitHub here and here, respectively.

WF4 - Getting exceptions when using multiple parallel ReceiveAndSendReply activities

I have a scenario where I want multiple ReceiveAndSendReply activities running in parallel; each of them is put in an infinite while loop to make sure all activities are always running and listening. So I used a Parallel activity to pack all those ReceiveAndSendReply activities, and each ReceiveAndSendReply was put in a While activity with its condition set to true. And of course, I put some activities with business logic between the Receive activity and the SendReplyToReceive activity.
Now I have a problem: if it takes a long time to process a request in one branch, all other branches are blocked during that time. Requests for the other Receive activities will not be processed, and both clients will get exceptions: the one that called the long-running service and the one that called another service while the server was busy with the long-running request.
Does anybody have an idea how to fix this? Sorry, since I am a new user I cannot post an image of my workflow.
The workflow runtime is single threaded in that a given workflow instance only executes on a single thread at any given moment. So while your workflow is busy doing work, it can't react to other incoming messages. Normally this isn't a problem, as workflows normally aren't compute intensive and doing async IO is easy. One thing that might help is adding Delay activities with a very short timeout. They cause the workflow to pause, letting it start processing the next request. Also make sure you put as few activities as possible between the Receive and the SendReply, and add a short Delay right after the SendReply.

Designing an asynchronous task library for ASP.NET

The ASP.NET runtime is meant for short workloads that can be run in parallel. I need to be able to schedule periodic events and background tasks that may or may not run for much longer periods.
Given the above I have the following problems to deal with:
The AppDomain can shutdown due to changes (Web.config, bin, App_Code, etc.)
IIS recycles the AppPool on a regular basis (daily)
IIS itself might restart, or for that matter the server might crash
I'm not convinced that running this code inside ASP.NET is the wrong thing to do, because it would allow for a simpler programming model. But doing so would require that an external service periodically makes requests to the app, so that the application is kept running, and that all background tasks are programmed with the utmost care. They would have to be able to pause and resume their work in the event of an unexpected error.
My current line of thinking goes something like this:
If all jobs are registered in the database, it should be possible to use the database as a bookkeeping mechanism. In the case of an error, the database would contain all state necessary to resume the operation at the next opportunity given.
I'd really appreciate some feedback/advice on this matter. I've been considering running a Windows service and using some RPC solution as well, but it doesn't have the same appeal to me, and I'd instead have a lot of deployment issues and have to synchronize tasks and code across several applications. Due to my business needs this is less than optimal.
This is a shot in the dark since I don't know what database you use, but I'd recommend you consider dialog timers and activation. Assuming that most of the jobs have to do some data manipulation, and it is likely that all have to do only data manipulation, leveraging activation and timers gives an extremely reliable job scheduling solution, entirely embedded in the database (no need for an external process/service, no dependencies outside the database bounds like msdb), and it is a solution that ensures scheduled jobs survive restarts, failover events and even disaster recovery restores. Simply put, once a job is scheduled it will run even if the database is restored one week later on a different machine.
Have a look at Asynchronous procedure execution for a related example.
And if this is too radical, at least have a look at Using Tables as Queues since storing the scheduled items in the database often falls under the 'pending queue' case.
I recommend that you have a look at Quartz.Net. It is open source and it will give you some ideas.
Using the database as a state-keeping mechanism is a completely valid idea. How complex it will be depends on how far you want to take it. In many cases you will end up pairing your database logic with a Windows service to achieve the desired result.
FWIW, it is typically not a good practice to manually use the thread pool inside an ASP.Net application, though (contrary to what you may read) it actually works quite nicely other than the huge caveat that you can't guarantee it will work.
So if you needed a background thread that examined the state of some object every 30 seconds and you didn't care if it fired every 30 seconds or 29 seconds or 2 minutes (such as in a long app pool recycle), an ASP.Net-spawned thread is a quick and very dirty solution.
Asynchronously fired callbacks (such as on the ASP.Net Cache object) can also perform a sort of "behind the scenes" role.
I have faced similar challenges and ultimately opted for a Windows service that uses a combination of building blocks for maximum flexibility. Namely, I use:
1) WCF with implementation-specific types OR
2) Types that are meant to transport and manage objects that wrap a job OR
3) Completely generic, serializable objects contained in a custom wrapper. Since they are just a binary payload, this allows any object to be passed to the service. Once in the service, the wrapper defines what should happen to the object (e.g. invoke a method, gather a result, and optionally make that result available for return).
Ultimately, the web site is responsible for querying the service about its state. This querying can be as simple as polling or can use asynchronous callbacks with WCF (though I believe this also uses some sort of polling behind the scenes).
I'll tell you what I have done.
I created a class called Atzenta that has a timer (with a 1-2 second trigger).
I also created a table in my temporary database that keeps the jobs. The table knows the job ID, other parameters, the priority, the job status, and messages.
I can add or delete a job through this class. When there is no action to be done, the timer is stopped; when I add a job, the timer starts again. (The timer is a thread by itself that can do parallel work.) I use System.Timers and not other timers for this.
The jobs can have different priorities.
Now let's say that I place a job in this table using the Atzenta class. The next time the timer triggers, it runs the query on this table, finds the first available job and just runs it. No other job runs until that one has ended.
All synchronization and flags are handled through the table. In the table I have flags for every job that show whether it is |waiting to run|requested to run|running|paused|finished|killed|.
All jobs are already-known functions or classes (e.g. the creation of statistics).
For stop and start, I use global.asax and Application_Start/Application_End to start and pause the object that keeps the tasks. For example, when a job is running and Application_End fires, I either wait for it to finish and then stop the app, or I stop the action, notify the table, and start the job again on Application_Start.
So I say Atzenta.RunTheJob(Jobs.StatisticUpdate, ProductID); this adds the job to the table and starts the timer, and on the next trigger the job runs and updates the statistics for the given product ID.
I use a table in a database to synchronize the many pools that run the same web app, and in fact it works that way. With a common table, synchronizing the jobs is easy and you avoid two pools running the same job at the same time.
In my back office I have a simple table view to see the status of all jobs.

Queuing using the Database or MSMQ?

A part of the application I'm working on is an SWF that shows a test with some 80 questions. Each question is saved in SQL Server through WebORB and ASP.NET.
If a candidate finishes the test, the session needs to be validated. The problem is that sometimes 350 candidates finish their test at the same moment, and the CPU on the web server and SQL Server explodes (350 validations concurrently).
Now, how should I implement queuing here? In the database, there's a table that has a record for each session. One column holds the status. 1 is finished, 2 is validated.
I could implement queuing in two ways (as I see it; maybe you have other suggestions):
A process that checks the table for records with status 1. If it finds one, it validates the session. So, sessions are validated one after another.
If a candidate finishes their session, a message is sent to an MSMQ queue. Another process listens to the queue and validates sessions one after another.
Now:
What would be the best approach?
Where do you start the process that will validate sessions? In your global.asax (Application_Start)? As a Windows service? As an exe in the root of the website that is started in Application_Start?
To me, using the table and looking for records with status 1 seems the easiest way.
The MSMQ approach decouples your web-facing application from the validation logic service and the database.
This brings many advantages, a few of which are:
It would be easier to handle situations where the validation logic can handle 5 sessions per second but receives 300 all at once. Otherwise you would have to handle complicated timeouts, re-attempts, etc.
It would be easier to do maintenance on the validation service without having to interrupt the rest of the application. While the validation service is down, messages simply queue up in MSMQ and get processed again as soon as it comes back up.
The same as above applies for database maintenance.
If you have no experience using MSMQ and no infrastructure set up, I would advise against it. Sure, it might be the "proper" way of doing queueing on the Microsoft platform, but it is not very straightforward and has quite a learning curve.
The same goes for creating a Windows Service; don't do it unless you are familiar with it. For simple cases such as this I would argue that the pain is greater than the rewards.
The simplest solution would probably be to use the table and run the process on a background thread that you start up in global.asax. You probably also want to create an admin page that can report some status information about the process (number of pending jobs etc) and maybe a button to restart the process if it for some reason fails.
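To illustrate that polling loop in code (sketched in Java here; the same shape applies to a .NET background thread started from global.asax), assuming a Sessions table with a SessionId key and a Status column where 1 means finished and 2 means validated:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SessionValidationWorker implements Runnable {

    private final String connectionString;

    public SessionValidationWorker(String connectionString) {
        this.connectionString = connectionString;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try (Connection con = DriverManager.getConnection(connectionString)) {
                validatePendingSessions(con);
                Thread.sleep(5_000); // poll interval; tune to taste
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut the worker down cleanly
            } catch (Exception e) {
                // log and keep looping; one failure should not kill the worker
            }
        }
    }

    private void validatePendingSessions(Connection con) throws Exception {
        try (PreparedStatement select = con.prepareStatement(
                "SELECT SessionId FROM Sessions WHERE Status = 1");
             ResultSet rs = select.executeQuery()) {
            while (rs.next()) {
                long sessionId = rs.getLong("SessionId");
                validate(sessionId); // the actual validation logic
                markValidated(con, sessionId);
            }
        }
    }

    private void validate(long sessionId) { /* the expensive part */ }

    private void markValidated(Connection con, long sessionId) throws Exception {
        try (PreparedStatement update = con.prepareStatement(
                "UPDATE Sessions SET Status = 2 WHERE SessionId = ?")) {
            update.setLong(1, sessionId);
            update.executeUpdate();
        }
    }
}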
What is validating? Before working on your queuing strategy, I would try to make validating as fast as possible, including making it set based if it isn't already so.
I have recently been investigating this myself, so I wanted to mention my findings. The location of the database relative to your application is a big factor in deciding which option is faster.
I timed inserting 100 database entries versus logging the exact same data into a local MSMQ message. I then took the average of the results of performing this test several times.
What I found was that when the database is on the local network, inserting a row was up to 4 times faster than logging to an MSMQ.
When the database was being accessed over a decent internet connection, inserting a row into the database was up to 6 times slower than logging to an MSMQ.
So:
Local database: inserting a row is faster. Remote database: MSMQ is faster.

Long-running ASP.NET tasks

I know there's a bunch of APIs out there that do this, but I also know that the hosting environment (being ASP.NET) puts restrictions on what you can reliably do in a separate thread.
I could be completely wrong, so please correct me if I am, this is however what I think I know.
A request typically times out after 120 seconds (this is configurable), but eventually the ASP.NET runtime will kill a request that's taking too long to complete.
The hosting environment, typically IIS, employs process recycling and can at any point decide to recycle your app. When this happens, all threads are aborted and the app restarts. I'm however not sure how aggressive it is; it would be kind of stupid to assume it would abort a normal ongoing HTTP request, but I would expect it to abort a background thread, because it doesn't know anything about that thread's unit of work.
If you had to create a programming model that could easily and reliably run a long-lived task, one that might have to run for days, how would you accomplish this from within an ASP.NET application?
The following are my thoughts on the issue:
I've been thinking along the lines of hosting a WCF service in a Win32 service and talking to the service through WCF. This is however not very practical, because the only reason I would choose to do so is to send tasks (units of work) from several different web apps. I'd then eventually ask the service for status updates and act accordingly. My biggest concern with this is that it would NOT be a particularly great experience if I had to deploy every task to the service for it to be able to execute some instructions. There's also the issue of input: how would I feed this service with data if I had a large data set and needed to chew through it?
What I typically do right now is this:
SELECT TOP 10 *
FROM WorkItem WITH (ROWLOCK, UPDLOCK, READPAST) -- UPDLOCK claims the rows; READPAST skips rows other workers hold
WHERE WorkCompleted IS NULL
It allows me to use a SQL Server database as a work queue and periodically poll the database with this query for work. If a work item completes with success, I mark it as done and proceed until there's nothing more to do. What I don't like is that I could theoretically be interrupted at any point, and if I'm in between succeeding and marking the item as done, I could end up processing the same work item twice. I might be a bit paranoid and this might be all fine, but as I understand it there's no guarantee that it won't happen...
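One way to close that gap, sketched below in Java/JDBC (not the only option, and it assumes WorkItem has an Id key column): keep the claim, the work and the completion update inside one database transaction, so that a crash in between rolls the rows back to their unclaimed state. Note that this only helps when the work writes to the same database; external side effects still need to be idempotent in their own right.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class WorkItemProcessor {

    public void processNextBatch(String connectionString) throws Exception {
        try (Connection con = DriverManager.getConnection(connectionString)) {
            con.setAutoCommit(false); // one transaction around claim + work + completion
            try (PreparedStatement claim = con.prepareStatement(
                    // UPDLOCK keeps the claimed rows locked for this transaction;
                    // READPAST makes other workers skip them instead of blocking.
                    "SELECT TOP 10 Id FROM WorkItem WITH (ROWLOCK, UPDLOCK, READPAST) "
                            + "WHERE WorkCompleted IS NULL");
                 ResultSet rs = claim.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("Id");
                    doWork(id); // must itself be safe to re-run after a rollback
                    markCompleted(con, id);
                }
                con.commit(); // a crash before this line leaves the items unclaimed
            } catch (Exception e) {
                con.rollback();
                throw e;
            }
        }
    }

    private void doWork(long id) { /* the actual processing */ }

    private void markCompleted(Connection con, long id) throws Exception {
        try (PreparedStatement done = con.prepareStatement(
                "UPDATE WorkItem SET WorkCompleted = GETDATE() WHERE Id = ?")) {
            done.setLong(1, id);
            done.executeUpdate();
        }
    }
}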
I know there have been similar questions on SO before, but none really arrives at a definitive answer. This is a really common need, yet the ASP.NET hosting environment is ill-equipped to handle long-running work.
Please share your thoughts.
Have a look at NServiceBus:
NServiceBus is an open source communications framework for .NET with built-in support for publish/subscribe and long-running processes.
It is a technology built upon MSMQ, which means that your messages don't get lost, since they are persisted to disk. Nevertheless, the framework has impressive performance and an intuitive API.
John,
I agree that ASP.NET is not suitable for async tasks as you have described them, nor should it be. It is designed as a web hosting platform, not a back-of-house processor.
We have had similar situations in the past and we have used a solution similar to what you have described. In summary: keep your WCF service under ASP.NET, and use a "Queue" table with a Windows service as the "QueueProcessor". The client should poll to see if work is done (or use messaging to notify the client).
We used a table that contained the process and its information (e.g. InvoicingRun). On that table was a status (Pending, Running, Completed, Failed). The client would submit a new InvoicingRun with a status of Pending. A Windows service (the processor) would poll the database for any runs in the Pending stage (you could also use SQL Notification so you don't need to poll). If a pending run was found, it would move it to Running, do the processing, and then move it to Completed/Failed.
In the case where the process failed fatally (e.g. DB down, process killed), the run would be left in a Running state, and human intervention was required. If the process failed in a non-fatal way (exception, error), the run would be moved to Failed, and you can choose to retry or have human intervention.
If there were multiple processors, the first one to move a run to the Running state got that job; you can use this method to prevent the job being run twice. The alternative is to do the select and then the update to Running under a transaction. Make sure to keep either of these outside any larger transaction. Sample (rough) SQL:
UPDATE InvoicingRun
SET Status = 2   -- Running
WHERE ID = 1
  AND Status = 1 -- Pending

IF @@ROWCOUNT = 0
    SELECT CAST(0 AS bit) -- someone else already claimed the run
ELSE
    SELECT CAST(1 AS bit) -- we own the run now
Rob
Use a simple background tasks/jobs framework like Hangfire and apply these best practice principles to the design of the rest of your solution (see the sketch after this list):
Keep all actions as small as possible; to achieve this, you should:
Divide long-running jobs into batches and queue them (in a Hangfire queue or on a bus of another sort).
Make sure your small jobs (the batched parts of long jobs) are idempotent and have all the context they need to run in any order. This way you don't have to use a queue which maintains a sequence, because then you can:
Parallelise the execution of the jobs in your queue depending on how many nodes you have in your web server farm. You can even control how much load this subjects your farm to (as a trade-off against servicing web requests). This ensures that you complete the whole job (all batches) as quickly and as efficiently as possible, while not stopping your cluster from servicing web clients.
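Hangfire itself is .NET; purely to illustrate the batching-plus-idempotency shape from the list above (every name in this sketch is invented), here is a small example:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchedJobRunner {

    // A long "reindex everything" job split into small, idempotent batches
    // that carry all their context (an id range) and can run in any order.
    public void runLongJob(List<long[]> idRanges) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // caps the load on this node
        for (long[] range : idRanges) {
            long fromId = range[0];
            long toId = range[1];
            pool.submit(() -> reindex(fromId, toId));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void reindex(long fromId, long toId) {
        // Idempotent: overwrites index entries for ids in [fromId, toId],
        // so re-running a batch after a crash just redoes the same slice.
    }
}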
Have you thought about using Workflow Foundation instead of your custom implementation? It also allows you to persist state. Tasks could be defined as workflows in this case.
Just some thoughts...
Michael
