How to programmatically decide whether a session have unresolved errors in Ensemble? - intersystems

My client is using IRIS 2021.1 Interoperability and they want to highlight the sessions with unresolved(without resent and completed operations) errors so their maintenance team can have a list of unresolved sessions to check against.
When I'm working on it, it seems to be it is is difficult to decide whether a session have unresolved errors. For example,
in this session, the entry async request is labelled as error after the following dispatch is resent and completed. So it is not correct to decide with the entry. Does it mean that we'll need to fetch all messages from a session and traverse the messages flow, depending on the status of all downstream messages to decided whether the session need to be handled?
I wonder what solution could be used to fulfill this requirement?
Thanks.

Related

Axon Event Processing Timeout

I am using an Axon Event Tracking processor. Sometimes events take longer that 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this BUT the event is still processed twice so I think this might be misleading. (I think I was mistaken about this)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval. None of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making a HTTP request to an external service.
I'm using axon 4.1.2 with all default configuration when using with Spring auto configuration. I cannot see the Releasing claim on token and preparing for retry in [timeout]s log.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have for example 2 segments and 6 applications it still doesn't reappear, however I'm not sure how this is different to my original scenario of 1 segment and 2 application?
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idem-potency check before the HTTP call?
The Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node. message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with a exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not options 1 and 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, very likely you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handled the same event. Hence why you see the event handler being invoked twice.
Obviously undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have to little to go on. Let us figure this out #Dan!
Update
Thanks for updating your question #dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes to long. If it takes to long, the token will be claimed by another process. This other process in this case is the thread of the TEP of your other application instance. The "claim timeout" is defaulted to 10 seconds, which also corresponds with the long running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And, I think this would be required on your end, giving the fact you have a long running operation.
There are of course different ways of tackling this. Like, making sure the TEP is only ran on a single instance (not really fault tolerant though), or offloading this long running operation to a schedule task which is triggered by the event.
But, I think we've found the issue at least, so I'd suggest to tweak the claimTimeout and see if the problem persists.
Let us know if this resolves the problem on your end #dan!

Take Set-Cookie from incoming message and place in Cookie in outgoing message

I have a scenario where I receive a request, and based on that request I have to do a few web service calls to a backend system. All is done in an orchestration. The backend system is session based, so first I perform a login and then I want to do my stuff. The login operation replies with a Set-Cookie header, I want to place that value in the Cookie header in the subsequent calls. However, when trying to do this in a Message assignment shape:
msg_request2(HTTP.HttpCookie) = msg_loginresponse(HTTP.HttpCookie)
I get an error in the event viewer:
Inner exception: There is no value associated with the property 'HTTP.HttpCookie' in the message.
Exception type: MissingPropertyException
Source: Microsoft.XLANGs.BizTalk.Engine
I've also tried accessing the HTTP.InboundHttpHeaders of msg_loginresponse, same error message. I can see the InboundHttpHeaders context property in the suspended message that results, so i "know" that it's there.
Adding a reference to Microsoft.BizTalk.GlobalPropertySchemas.dll in my project did not help.
Any clever suggestions?
After you have informed your management that the Trading Partner is using a very old and unusual pattern for which you are going to have to spend extra time, meaning money, to accommodate (this problem is 100% created by the Trading Partner), it should be pretty simple.
For the Adapter, Set-Cookie does not result in an actual cookie since the Adapter has no notion of sessions.
You need to parse the cookie value out of HTTP.InboundHttpHeaders, then use that value to set the cookie by either HTTP.HttpCookie or HTTP.UserHttpHeaders.
BTW, since token auth in 99.9% of scenarios is silly and unnecessary, I always avoid maintaining the token and just 'auth' every time. No reason for you to make you app more complicated. The only exception would be if the same calls are performed in the same sequence every time, then you can hold the token in the Orchestration.

MQSeries: Is syncpoint/rollback possible when getting asynchronously with MCB?

I want to pull messages off a MQS queue in a C client, and would love to do so asynchronously so I don't have to start (explicitly) multithreading. The messages will be forwarded to another system that acts "transactionally" but is completely incompatible with XA. So I'd like to have a way to explicitly commit (and thereby remove) a message that's been successfully handed off to the other system, and not commit if this failed, so that the last message is retained for a more successful later attempt.
I've read about the SYNCPOINT option and understand how I'd use that around a regular GET, but I haven's seen any hints on how to make asynchronous message retrieval have transactional behavior like this. Any hints, please?
I think you are describing using the asynchronous callback capability, ie you register a routine to be called when a message arrives, and ask for any get to be under syncpoint... An explanation of how some of it works is in here, https://share.confex.com/share/117/webprogram/Handout/Session9513/share_advanced_mqi.pdf page 4+
Effectively you get called with the MQ message under syncpoint, do your processing with another system, then commit or rollback the message before returning.
Be aware without the use of e.g. XA 2 phase commit, there is always going to be the windows of e.g. committing to the external system and a power outage means the message under the unit of work gets rolled back inside MQ as you didnt have time to perform the commit.
Edit: my misunderstanding, didn't realise that the application was using a callback to retrieve messages, which is indeed fully asynchronous behavior. Disregard the answer below.
Do MQGET with MQGMO_SYNCPOINT, then issue either MQCMIT or MQBACK.
"Asynchronous" and "synchronous" may be misnomers - these are your patterns of using MQ - whether you wait for a reply message or not, these patterns do not affect how MQ processes your calls. Transaction management (unit of work management) works across any MQI calls that use SYNCPOINT, no matter if they are part of a request/reply pattern or not.

Error during GENERAL_REQUEST_ENTITY for POST request causes ASP .NET session state to never get unlocked

(Cross posting from Server Fault where I wasn't getting any traction):
I have been trying to chase down the root cause of a condition where ASP .NET session state remains locked after a web request has been terminated due to an unexpected error. We use the SQL Server session state provider for session because we have several servers in a web farm. This issue first presented itself in the form of many requests getting stuck on the 'AcquireRequestState' event of their lifecycle for no apparent reason. I was able to finding corresponding entries for these requests in the session state database in SQL server that were all locked (column Locked = 1). I was also able to correlate these requests to entries in the IIS log with HTTP status codes of 500 (with a sub status of 0). These findings lead me to believe that, in some cases, a request was erroring out but was NOT releasing its lock on session state like it should.
I enabled Failed Request Tracing in IIS for the website in question for status code 500 with all available providers selected each with the 'Verbose' setting for verbosity. I've since gathered several failed traces that have caused permanently locked ASP .NET sessions. They all share the same characteristics:
They are all 'POST' requests where the browser is posting data to be processed/saved.
They all have events indicating that the 'Session' module was invoked during the REQUEST_ACQUIRE_STATE event. At this point the request would have marked the row in the session state database as being "locked". This is normal and expected.
They all have GENERAL_READ_ENTITY_START, GENERAL_READ_ENTITY_END, and GENERAL_REQUEST_ENTITY entries that appear to be reading in the data that was posted to the server as part of the request. This appears to be a buffered operation as these events get repeated over and over with each one reading in some subset of the posted data.
At some point during the 'read entity' related events an error occurs. Some have the error code "Incorrect function. (0x80070001)" and others have "The I/O operation has been aborted because of either a thread exit or an application request. (0x800703e3)".
Once the error has been encountered, they all jump directly to the END_REQUEST events.
The issue here is that, under normal circumstances, there should be a RELEASE_REQUEST_STATE event that will allow the Session module to release the lock it has on the session. This event is being skipped in this scenario. Just to be sure, I enabled failed request tracing for the '200' status code as well and generated several traces of successful requests that do have the RELEASE_REQUEST_STATE event being handled by the Session module.
A co-worker pointed out that you can also cause a request to skip directly to the 'END_REQUEST' event by calling HttpContext.Current.ApplicationInstance.CompleteRequest(). I tested this out and saw that using this method during a post request creates a trace very similar to the ones I've been capturing when this issue has been happening, but session does still get cleaned up properly. This lead to me to running SQL Profiler on the SQL Server database where the session is stored to trace all calls to stored procedures. When we skip directly to END_REQUEST due to calling CompleteRequest(), a call is made to update the session state (and release the lock) as expected. When we skip to END_REQUEST as a result of an error during GENERAL_REQUEST_ENTITY, the call to update or release the lock on session state is never made.
My theory at this point is that some kind of network issue is causing the 'Incorrect function' and 'I/O operation has been aborted because of either a thread exit or an application request' errors, but I don't understand why this seems to be causing the request handling to skip over the releasing the lock on session state. If the request went through REQUEST_ACQUIRE_STATE it seems like it should also release the lock at some point toward the end of the request as well. I'm loathe to say that this is a bug in IIS or ASP .NET, but it certainly appears that way to me at this point.
Are there any known conditions under which errors will lead to a session state lock not being released?
As it turns out, this was related to this question: ManagedPipelineHandler for an AJAX POST crashes if an IE9 user navigates away from a page while that call was in progress
The workaround specified in the accepted answer on that question does work, but Microsoft has also since released a hotfix (not yet publicly available as of this writing) that patches the session handling logic to avoid the issue all together.

Event Driven Architecture - Service Contract Design

I'm having difficulty conceptualising a requirement I have into something that will fit into our nascent SOA/EDA
We have a component I'll call the Data Downloader. This is a facade for an external data provider that has both high latency and a cost associated with every request. I want to take this component and create a re-usable service out of it with a clear contract definition. It is up to me to decide how that contract should work, however its responsibilities are two-fold:
Maintain the parameter list (called a Download Definition) for an upcoming scheduled download
Manage the technical details of the communication to the external service
Basically, it manages the 'how' of the communication. The 'what' and the 'when' are the responsibilities of two other components:
The 'what' is managed by 'Clients' who are responsible for
determining the parameters for the download.
The 'when' is managed by a dedicated scheduling component. Because of the cost associated with the downloads we'd like to batch the requests intraday.
Hopefully this sequence diagram explains the responsibilities of the services:
Because each of the responsibilities are split out in three different components, we get all sorts of potential race conditions with async messaging. For instance when the Scheduler tells the Downloader to do its work, because the 'Append to Download Definition' command is asynchronous, there is no guarantee that the pending requests from Client A have actually been serviced. But this all screams high-coupling to me; why should the Scheduler necessarily know about any 'prerequisite' client requests that need to have been actioned before it can invoke a download?
Some potential solutions we've toyed with:
Make the 'Append to Download Definition' command a blocking request/response operation. But this then breaks the perf. and scalability benefits of having an EDA
Build something in the Downloader to ensure that it only runs when there are no pending commands in its incoming request queue. But that then introduces a dependency on the underlying messaging infrastructure which I don't like either.
Makes me think I'm thinking about this problem in a completely backward way. Or is this just a classic case of someone trying to fit a synchronous RPC requirement into an async event-driven architecture?
The thing I like most about EDA and SOA, is that it almost completely eliminates the notion of race condition. As long as your events are associated with some association key (e.g. downloadId), the problem you describe can be addressed with several solutions of different complexities - depending on your needs. I'm not sure I totally understand the described use-case but I will try my best
Out of the top of my head:
DataDownloader maintains a list of received Download Definitions and a list of triggered downloads. When a definition is received it is checked against the triggers list to see if the associated download has already been triggered, and if it was, execute the download. When a TriggerDownloadCommand is recieved, the definitions list is checked against a definition with the associated downloadId.
For more complex situation, consider using the Saga pattern, which is implemented by some 3rd party messaging infrastructures. With some simple configuration, it will handle both messages, and initiate the actual download when the required condition is satisfied. This is more appropriate for distributed systems, where an in-memory collection is out of the question.
You can also configure your scheduler (or the trigger command handler) to retry when an error is signaled (e.g. by an exception), in order to avoid that race condition, and ultimately give up after a specified timeout.
Does this help?

Resources