Siebel 8 race condition - asynchronous

Imagine the following set up
A set of n independent tasks in a task list must be completed in Siebel
Tasks a, b etc can be worked on by separate threads
When a thread starts the work flow records the states of all n tasks
The threads continue to completion and eventually end up sending a JMS message to a queue
We have the following problem:
Thread 1 that works on task a completes its work and marks task a as closed
At the same time thread 2 that works on task b also completes its work and marks task b as closed
Two JMS messages are placed on the queue and sent to another back end system
Here's the problem: The first JMS message says that the state of the task list is a=closed b=open and the second JMS message says a=open b=closed
Tasks can legitimately be re-opened by a user of Siebel (let's say for fraud checking purposes)
The back end system receives the two JMS messages in any order since the middleware does not guarantee ordering
The back end system receives one JMS message saying closed,open and another shortly afterwards that says open,closed. This could happen in any order but the result is the same. It appears to the back end system that the entire task list has not been closed whilst in Siebel all tasks (a and b in this example) have been closed
I am told that there is no way in Siebel that the commit to the database that modifies the state of the tasks being acted upon in the workflow thread can only happen at the very end of the work flow. That means crucially after the JMS messages have been sent out with the misleading state. This is apparently because of the need to roll back a workflow upon error.
Questions: Is the above statement true meaning that this problem can never be solved in Siebel? If not then can someone tell me whether it is possible to fix this in Siebel such that a JMS message is sent with the correct state of the tasks. I naively think this is solved using semaphores, but truth be told I've been spoiled in the sense I've never had to deal with semaphores and I sure don't know if that concept even exists in Siebel. Any help?

Can't read data before it's committed to the database, can only control the timing.
Use a business service to call workflow(s) synchronously, or use business service instead of workflow, and send JMS message after database commit. Instructions to call a workflow process from business service.

Related

Axon Event Processing Timeout

I am using an Axon Event Tracking processor. Sometimes events take longer that 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this BUT the event is still processed twice so I think this might be misleading. (I think I was mistaken about this)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval. None of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making a HTTP request to an external service.
I'm using axon 4.1.2 with all default configuration when using with Spring auto configuration. I cannot see the Releasing claim on token and preparing for retry in [timeout]s log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have for example 2 segments and 6 applications it still doesn't reappear, however I'm not sure how this is different to my original scenario of 1 segment and 2 application?
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idem-potency check before the HTTP call?
The Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node. message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with a exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handled the same event. Hence why you see the event handler being invoked twice.
Obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have to little to go on. Let us figure this out #Dan!
Update
Thanks for updating your question #dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes to long. If it takes to long, the token will be claimed by another process. This other process in this case is the thread of the TEP of your other application instance. The "claim timeout" is defaulted to 10 seconds, which also corresponds with the long running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, giving the fact you have a long running operation.
There are of course different ways of tackling this. Like, making sure the TEP is only ran on a single instance (not really fault tolerant though), or offloading this long running operation to a schedule task which is triggered by the event.
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end #dan!

Corda: Making HTTP request within a responder flow?

Is it okay to make HTTP requests to a counter party's external service from within a responder flow?
My use case is a Party invokes a "request-token" flow with an exchange node. That exchange node makes a HTTP request (on the responder flow) to move cash from that parties account to an exchange account in the external payment system. The event of the funds actually hitting the count and hence the issuance of the tokens would happen with another flow.
If it is not okay, what may be an alternative design to achieve the task?
It is not always a good idea to make HTTP request that way.
Unless you think very carefully about what happens when the previous checkpoint is replayed.so dedupe and idempotence are key considerations.plus what happens if target is down? plus this may exhaust the thread pool upon which the fibers operate.
Flows are run on fibers. CordaServices can spawn their own threads
threads can block on I/O, fibers can only do so for short periods and we make no guarantees about freeing resources, or ordering unless it is the DB. Also threads can register observables
The real challenge is restart-ability and for that they need to test the hell out of their code with a random kills.
You need to be aware that steps can be replayed in the event of a crash. this is true of any server-side work based system that restarts work.
Effectively, you should:
Step 1) execute an on-ledger Corda transaction to move one or more
assets into a locked state (analogous to XA 'prepare'). When
successfully notarised,
Step 2) execute the off-ledger transaction
with an idempotent call that succeeds or fails. When we know if it
succeeded or failed, move to
Step 3) execute a second Corda
transaction that either reverts the status of the asset or moves it
to its intended final state

Using SQS or DynamoDB to control order status

I am building a system that processes orders. Each order will follow a workflow. So this order can be, e.g., booked,accepted,payment approved,cancelled and so on.
Every time a status of a order changes I will post this change to SNS. To know if a status order has changed I will need to make a request to a external API, and compare to the last known status.
The question is: What is the best place to store the last known order status?
1. A SQS queue. So every time I read a message from queue, check status using the external API, delete the message and insert another one with the new status.
2. Use a database (like Dynamo DB) to control the order status.
You should not use the word "store" to describe something happening with stateful facts and a queue. Stateful, factual information should be stored -- persisted -- to a database.
The queue messages should be treated as "hints" on what work needs to be done -- a request to consider the reasonableness of a proposed action, and if reasonable, perform the action.
What I mean by this, is that when a queue consumer sees a message to create an order, it should check the database and create the order if not already present. Update an order? Check the database to see whether the order is in a correct status for the update to occur. (Canceling an order that has already shipped would be an example of a mismatched state).
Queues, by design, can't be as precise and atomic in their operation as a database should. The Two Generals Problem is one of several scenarios that becomes an issue in dealing with queues (and indeed with designing a queue system) -- messages can be lost or delivered more than once.
What happens in a "queue is authoritative" scenario when a message is delivered (received from the queue) more than once? What happens if a message is lost? There's nothing wrong with using a queue, but I respectfully suggest that in this scenario the queue should not be treated as authoritative.
I will go with the database option instead of SQS:
1) option SQS:
You will have one application which will change the status
Add the status value into SQS
Now another application will check your messages and send notification, delete the message
2) Option DynamoDB:
Insert you updated status in DynamoDB
Configure a Lambda function on update of that field
Lambda function will send notifcation
The database option looks clear additionally, you don't have to worry about maintaining any queue plus you can read one message from the queue at a time unless you implement parallel reader to read from the queue. In a database, you can update multiple rows and it will trigger the lambda and you don't have to worry about it.
Hope that helps

WF4 -Get exceptions when using multiple ReceiveAndSendReply activities paralleled

I have a scenario and want to use multiple ReceiveAndSendReply activities running in parallel situation, each of them will be put in an infinite while loop to make sure all activities are always running and listening. So I used a parallel activity to pack all those ReceiveAndSendReply, and each ReceiveAndSendReply was put in a While activity with condition set to true. And of cause, I put some activities with business logic between Receive activity and SendReplyToRecieve activity.
Now I have a problem if it takes a long time to process a request in one branch, then during that time all other branches will be blocked. Any request for other Receive activities will not be processed, and both client, which include the one called long time run service and the other one who called other service during server process long time run service process, will also get exceptions.
Did anybody have an idea to fix it? Sorry since I am new user, can put post image of my workflow.
The workflow runtime is single treaded in that a given workflow instance only executes on a single thread at any given moment. So while your workflow is busy doing work it can't react to other incoming messages. Normally this isn't a problem as workflow's normally aren't compute intensive and doing async IO is real easy. One thing that might help is adding Delay activities with a real short timeout. They cause the workflow to pause letting it start processing the next request. Also make sure you put as few activities as you can between the Receive and the SendReply and add a short delay right after the SendReply.

Long-running ASP.NET tasks

I know there's a bunch of APIs out there that do this, but I also know that the hosting environment (being ASP.NET) puts restrictions on what you can reliably do in a separate thread.
I could be completely wrong, so please correct me if I am, this is however what I think I know.
A request typically timeouts after 120 seconds (this is configurable) but eventually the ASP.NET runtime will kill a request that's taking too long to complete.
The hosting environment, typically IIS, employs process recycling and can at any point decide to recycle your app. When this happens all threads are aborted and the app restarts. I'm however not sure how aggressive it is, it would be kind of stupid to assume that it would abort a normal ongoing HTTP request but I would expect it to abort a thread because it doesn't know anything about the unit of work of a thread.
If you had to create a programming model that easily and reliably and theoretically put a long running task, that would have to run for days, how would you accomplish this from within an ASP.NET application?
The following are my thoughts on the issue:
I've been thinking a long the line of hosting a WCF service in a win32 service. And talk to the service through WCF. This is however not very practical, because the only reason I would choose to do so, is to send tasks (units of work) from several different web apps. I'd then eventually ask the service for status updates and act accordingly. My biggest concern with this is that it would NOT be a particular great experience if I had to deploy every task to the service for it to be able to execute some instructions. There's also this issue of input, how would I feed this service with data if I had a large data set and needed to chew through it?
What I typically do right now is this
SELECT TOP 10 *
FROM WorkItem WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE WorkCompleted IS NULL
It allows me to use a SQL Server database as a work queue and periodically poll the database with this query for work. If the work item completed with success, I mark it as done and proceed until there's nothing more to do. What I don't like is that I could theoretically be interrupted at any point and if I'm in-between success and marking it as done, I could end up processing the same work item twice. I might be a bit paranoid and this might be all fine but as I understand it there's no guarantee that that won't happen...
I know there's been similar questions on SO before but non really answers with a definitive answer. This is a really common thing, yet the ASP.NET hosting environment is ill equipped to handle long-running work.
Please share your thoughts.
Have a look at NServiceBus
NServiceBus is an open source
communications framework for .NET with
build in support for publish/subscribe
and long-running processes.
It is a technology build upon MSMQ, which means that your messages don't get lost since they are persisted to disk. Nevertheless the Framework has an impressive performance and an intuitive API.
John,
I agree that ASP.NET is not suitable for Async tasks as you have described them, nor should it be. It is designed as a web hosting platform, not a back of house processor.
We have had similar situations in the past and we have used a solution similar to what you have described. In summary, keep your WCF service under ASP.NET, use a "Queue" table with a Windows Service as the "QueueProcessor". The client should poll to see if work is done (or use messaging to notify the client).
We used a table that contained the process and it's information (eg InvoicingRun). On that table was a status (Pending, Running, Completed, Failed). The client would submit a new InvoicingRun with a status of Pending. A Windows service (the processor) would poll the database to get any runs that in the pending stage (you could also use SQL Notification so you don't need to poll. If a pending run was found, it would move it to running, do the processing and then move it to completed/failed.
In the case where the process failed fatally (eg DB down, process killed), the run would be left in a running state, and human intervention was required. If the process failed in an non-fatal state (exception, error), the process would be moved to failed, and you can choose to retry or have human intervantion.
If there were multiple processors, the first one to move it to a running state got that job. You can use this method to prevent the job being run twice. Alternate is to do the select then update to running under a transaction. Make sure either of these outside a transaction larger transaction. Sample (rough) SQL:
UPDATE InvoicingRun
SET Status = 2 -- Running
WHERE ID = 1
AND Status = 1 -- Pending
IF ##RowCount = 0
SELECT Cast(0 as bit)
ELSE
SELECT Cast(1 as bit)
Rob
Use a simple background tasks / jobs framework like Hangfire and apply these best practice principals to the design of the rest of your solution:
Keep all actions as small as possible; to achieve this, you should-
Divide long running jobs into batches and queue them (in a Hangfire queue or on a bus of another sort)
Make sure your small jobs (batched parts of long jobs) are idempotent (have all the context they need to run in any order). This way you don't have to use a quete which maintains a sequence; because then you can
Parallelise the execution of jobs in your queue depending on how many nodes you have in your web server farm. You can even control how much load this subjects your farm to (as a trade off to servicing web requests). This ensures that you complete the whole job (all batches) as fast and as efficiently as possible, while not compromising your cluster from servicing web clients.
Have thought about the use the Workflow Foundation instead of your custom implementation? It also allows you to persist states. Tasks could be defined as workflows in this case.
Just some thoughts...
Michael

Resources