I have a BizTalk 2006 R2 application that works perfectly.
It receives the messages, processes them and sends correct responses.
But although everything is correct (the messages are successfully picked up by the orchestrations and the response is sent without errors), BizTalk still generates a "Message not consumed" error related to the response message...
I've debugged every bit of the application and there is no error, no duplicated message, no message left behind, nothing... I googled the error, and the few links I found on the subject are mostly related to zombie clean-up scripts. This makes me wonder whether this is actually a common issue in BizTalk...
Does anybody have any idea on what may be causing this error?
Yeah ... this is a common issue which can most of the time be overcome by a slight change in the way your solution is put together.
Zombies usually occur when using correlations and time-outs, but that is not the only time.
The orchestration is dehydrated, waiting for either a response on the correlation set or the time-out. If the time-out fires first, the orchestration proceeds to process, usually past the receive shape that was waiting for the correlated response. The message box then gets the response, but there is no longer anything waiting for it. Hence your error.
I've also seen this behavior when calling a web service and waiting for a response; but this had to do with how I was handling errors. A small change to my process resolved that problem.
One way to minimise the occurrence of this problem is to shorten the amount of work the orchestration does after the time-out, so the window for zombies to occur is as small as possible.
Sometimes it is not possible to avoid this non-deterministic termination issue, so I've found myself building a "ZombieHandler" process which receives these messages and cleans up after itself.
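For reference, here is a rough C# sketch of the clean-up-script approach mentioned in the question: it queries the BizTalk WMI provider for suspended instances and terminates them. The ServiceStatus filter and property names are assumptions and should be verified against your BizTalk version before running anything like this.

```csharp
// Rough sketch of a zombie clean-up script; run under an account with BizTalk admin rights.
// Verify the MSBTS_ServiceInstance members and the ServiceStatus filter on your installation.
using System;
using System.Management;

class ZombieCleanup
{
    static void Main()
    {
        var scope = new ManagementScope(@"root\MicrosoftBizTalkServer");
        // ServiceStatus = 4 is taken here to mean "suspended (resumable)"; adjust the filter
        // so it matches only the instances you actually consider zombies in your group.
        var query = new ObjectQuery(
            "SELECT * FROM MSBTS_ServiceInstance WHERE ServiceStatus = 4");

        using (var searcher = new ManagementObjectSearcher(scope, query))
        {
            foreach (ManagementObject instance in searcher.Get())
            {
                Console.WriteLine("Terminating instance {0}", instance["InstanceID"]);
                instance.InvokeMethod("Terminate", null);
            }
        }
    }
}
```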
If you could post more information about your process, we could try to assist some more.
This sounds like a zombie. Does your orchestration use correlation and a wait time? If so, you're in Zombie Land. The issue is that you have a wait and a secondary receive racing to see which triggers first. If the wait triggers first and then a new message on the correlation comes in... Zombie.
Let us know more about your orchestration and we can further discuss a solution.
The error is shown in the BizTalk group hub page, not in the event log, and reads: "The instance completed without consuming all of its messages. The instance and its unconsumed messages have been suspended." Basically, I have a main orchestration that receives a message through a two-way port and sends it to the message box while initializing a correlation. The next shape in this orchestration waits for a message (without any timeout logic) and follows the correlation created in the previous send shape. When a response is received it is forwarded back to the original requester.
It is a very simple orchestration (screenshot: http://img139.imageshack.us/img139/2307/orchestration.jpg) with almost no logic. The point is that I'm always getting correct responses, so I cannot figure out what is triggering the "message not consumed" error. By the way, the message flagged as not consumed is the response message.
Any further ideas?
P.S. ryancrawcour, can you elaborate on your ZombieHandler? To which properties do you bind such an orchestration?
Why are you using a correlation set? You have an initializing receive for the correlation set, where is the following receive?
Can you take a step back and explain what the requirement for correlation is? What messages are you trying to tie together here? I am guessing that if you remove correlation from this orchestration, it will work perfectly.
Here's a link to a correlation tutorial if you want to take a look.
@ChrisLoris:
Screenshot of the orchestration: http://img139.imageshack.us/img139/2307/orchestration.jpg
In the screenshot above you can see that I have an orchestration linked to a send/receive port. Basically I'm getting a message to process, updating a few attributes on it, and sending it to the message box while initializing a correlation based on a specific property (let's call it MsgIdentifier). Other orchestrations will pick up this message and do the real processing. When a response is dropped into the message box with the same MsgIdentifier (custom property), this orchestration picks it up and sends it back to the original requester.
The correlation is initialized in the send shape that sends the request to the message box and the following receive shape waits for a response that follows the same correlation, i.e. that has the same value in the MsgIdentifier property.
Think of this orchestration as a broker, an intermediary between the external system and the inner workings of the BizTalk application.
Again, all is working properly and the correct messages are being picked up without any problems, and that is exactly the strange behavior I'm trying to analyse. It shouldn't mark the response as a message not being consumed, because it's being detected, consumed and returned.
Is there any chance that the original message is being processed by multiple orchestrations? In that case, there may be two messages put back into the message box as a response to the orchestration we are discussing. The first message would be picked up by the correlation set; because there is no looping construct on the following receive, the second message would have nowhere to go - Zombie.
I am using an Axon tracking event processor. Sometimes events take longer than 10 seconds to process.
This seems to cause the message to be processed again and this appears in the log "Releasing claim of token X/0 failed. It was owned by another node."
If I up the number of segments it does not log this, BUT the event is still processed twice, so I think this might be misleading. (I think I was mistaken about this.)
I have tried adjusting the fetchDelay, cleanupDelay and tokenClaimInterval, none of which has fixed this. Is there a property or something that I am missing?
Edit
The scenario taking longer than 10 seconds is making a HTTP request to an external service.
I'm using Axon 4.1.2 with all default configuration via Spring auto-configuration. I cannot see the "Releasing claim on token and preparing for retry in [timeout]s" log message.
I was having this issue with a single segment and 2 instances of the application. I realised I hadn't increased the number of segments like I thought I had.
After further investigation I have discovered that adding an additional segment seems to have stopped this. Even if I have, for example, 2 segments and 6 application instances it still doesn't reappear; however, I'm not sure how this is different from my original scenario of 1 segment and 2 applications.
I didn't realise it would be possible for multiple threads to grab the same tracking token and process the same event. It sounds like the best action would be to put an idempotency check before the HTTP call?
The "Releasing claim of token [event-processor-name]/[segment-id] failed. It was owned by another node." message can only occur in three scenarios:
You are performing a merge operation of two segments which fails because the given thread doesn't own both segments.
The main event processing loop of the TrackingEventProcessor is stopped, but releasing the token claim fails because the token is already claimed by another thread.
The main event processing loop has caught an Exception, making it retry with an exponential back-off, and it tries to release the claim (which might fail with the given message).
I am guessing it's not option 1 or 2, so that would leave us with option 3. This should also mean you are seeing other WARN level messages, like:
Releasing claim on token and preparing for retry in [timeout]s
Would you be able to share whether that's the case? That way we can pinpoint a little better what the exact problem is you are encountering.
By the way, it is very likely that you have several processes (event handling threads of the TrackingEventProcessor) stealing the TrackingToken from one another. As they're stealing an un-updated token, both (or more) will handle the same event. Hence you see the event handler being invoked twice.
Obviously this is undesirable behavior and something we should resolve for you. I would like to ask you to provide answers to my comments under the question, as right now I have too little to go on. Let us figure this out, @Dan!
Update
Thanks for updating your question @dan, that's very helpful.
From what you've shared, I am fairly confident that both instances are stealing the token from one another. This does depend though on whether both are using the same database for the token_entry table (although I am assuming they are).
If they are using the same table, then they should "nicely" share their work, unless one of them takes too long. If it takes too long, the token will be claimed by another process. That other process, in this case, is the thread of the TEP of your other application instance. The "claim timeout" defaults to 10 seconds, which also corresponds with the long-running event handling process.
This claimTimeout is adjustable though, by invoking the Builder of the JpaTokenStore/JdbcTokenStore (depending on which you are using / auto-wiring) and calling the JpaTokenStore.Builder#claimTimeout(TemporalAmount) method. And I think this would be required on your end, given the fact that you have a long-running operation.
There are of course different ways of tackling this, like making sure the TEP is only run on a single instance (not really fault tolerant though), or offloading this long-running operation to a scheduled task which is triggered by the event.
But I think we've found the issue at least, so I'd suggest tweaking the claimTimeout and seeing if the problem persists.
Let us know if this resolves the problem on your end, @dan!
I have a BizTalk orchestration that is picking up messages from an MSMQ. It processes the message and sends it on to another system.
The thing is, whenever a message is put on the queue, BizTalk dequeues it immediately even if it is still processing the previous message. This is a real pain because if I restart the orchestration then all the unprocessed messages get deleted.
Is there any way to make BizTalk only take one message at a time, so that it completely finishes processing the message before taking the next one?
Sorry if this is an obvious question, I have inherited a BizTalk system and can't find the answer online.
There are three properties of the BizTalk MSMQ adapter you could try to play around with:
batchSize
Specifies the number of messages that the adapter will take off the queue at a time. The default value is 20.
This may or may not help you. Even when set to 1, I suspect BTS will try to consume remaining "single" messages concurrently as it will always try parallel processing, but I may be wrong about that.
serialProcessing
Specifies that messages are dequeued in the order they were enqueued. The default is false.
This is more likely to help because to guarantee ordered processing, you are fundamentally limited to single threaded processing. However, I'm not sure if this will be enough on its own, or whether it will only mediate the ordering of message delivery to the message box database. You may need to enable ordered delivery throughout the BTS application too, which can only be done at design time (i.e. require code changes).
transactional
Specifies that messages will be sent to the message box database as part of a DTC transaction. The default is false.
This will likely help with your other problem where messages are "getting lost". If the queue is non-transactional, and moreover, not enlisted in a larger transaction scope which reaches down to the message box DB, that will result in message loss if messages are dequeued but not processed. By making the whole process atomic, any messages which are not committed to the message box will be rolled back onto the queue.
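This is not BizTalk configuration as such, but a small C# sketch of the underlying MSMQ transactional-receive behavior the adapter relies on when transactional is enabled: if the receiving transaction aborts, the message goes back onto the queue instead of being lost. The queue path is invented for illustration and the queue must have been created as transactional.

```csharp
using System;
using System.Messaging;

class TransactionalReceiveDemo
{
    static void Main()
    {
        // Assumed: .\private$\orders exists and was created as a transactional queue.
        var queue = new MessageQueue(@".\private$\orders");

        using (var tx = new MessageQueueTransaction())
        {
            tx.Begin();
            try
            {
                var message = queue.Receive(TimeSpan.FromSeconds(5), tx);
                // ... process the message / hand it off atomically here ...
                tx.Commit();   // the message leaves the queue only on commit
            }
            catch
            {
                tx.Abort();    // the message is rolled back onto the queue
                throw;
            }
        }
    }
}
```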
Sources:
https://msdn.microsoft.com/en-us/library/aa578644.aspx
While you can process the messages in order by using Ordered Delivery, there is no way to serialize them in the way you're asking.
However, merely stopping the Orchestration should not delete anything, much less 'all the unprocessed messages'. It seems that's your problem.
You should be able to stop processing without losing anything.
If the Orchestration is going into a Suspended state, then all you need to do is Resume that one orchestration and any messages queued will remain and be processed. This would be the default behavior even if the app was created 'correctly' by accident ;).
When you Stop the Application, you're actually Terminating the existing Orchestration and everything associated with it, including any queued messages.
Here's your potential problem: if the original developer didn't properly handle the Port error, the Orchestration might get stuck in an unfinishable loop. That would require a (very minor) mod to the Orchestration itself.
I'm using Rebus SQLTransport with XML serialized messages for integration with SQL Server. Messages represent changes done in SQL Server. Because of that the order of message delivery is essential.
This is because, for example, message1 may contain an object that is referenced (by id) in message2. Another example is that message1 may contain a remove request for some object that is required to accept a new object from message2.
Aggregating messages into one message would be quite complicated because messages are generated by triggers.
With message idempotence and one worker, I guess that would work, except that it won't if an error happens and the message is moved to the error queue. An error is quite likely to happen because of a validation or business-logic exception. Because of that, I believe only a human can fix the problem with the message, and until then the other messages should not be delivered. So I wanted to ask for advice on what would be best to do in that situation. As far as I saw, the retry count cannot be set to infinity, so should I stop the service inside the handler until the problem is solved by a human?
Thanks in advance
If it's important that the messages are processed in order without any "holes", I suggest you assign a sequence number to each message.
This way, if the endpoint gets a message whose sequence number is greater than the expected sequence number, it can throw an exception, thus preventing out-of-order messages from being processed.
I would only do this if errors are uncommon though, and only if the message volume is fairly small.
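To make the suggestion concrete, here is a minimal Rebus handler sketch that gates on a sequence number. Only IHandleMessages&lt;T&gt; is Rebus' own interface; MyMessage and ISequenceStore are made-up names standing in for your message type and whatever storage you use to track the last processed number.

```csharp
using System;
using System.Threading.Tasks;
using Rebus.Handlers;

// Illustrative message type carrying a per-stream sequence number (not part of Rebus).
public class MyMessage
{
    public long SequenceNumber { get; set; }
    public string Payload { get; set; }
}

// Hypothetical store remembering the last successfully processed sequence number.
public interface ISequenceStore
{
    Task<long> GetLastProcessedAsync();
    Task SetLastProcessedAsync(long sequenceNumber);
}

public class MyMessageHandler : IHandleMessages<MyMessage>
{
    readonly ISequenceStore _sequenceStore;

    public MyMessageHandler(ISequenceStore sequenceStore) => _sequenceStore = sequenceStore;

    public async Task Handle(MyMessage message)
    {
        var expected = await _sequenceStore.GetLastProcessedAsync() + 1;

        if (message.SequenceNumber > expected)
        {
            // A hole in the sequence: fail (and eventually dead-letter) rather than
            // process out of order, so a human can intervene.
            throw new InvalidOperationException(
                $"Expected sequence {expected} but got {message.SequenceNumber}");
        }

        if (message.SequenceNumber < expected)
        {
            // Already processed: ignore the duplicate.
            return;
        }

        // ... apply the change represented by the message here ...

        await _sequenceStore.SetLastProcessedAsync(message.SequenceNumber);
    }
}
```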
If in-order processing is required, a much better design would be to use another message processing library that supports a pull model, which I think would fit your scenario much better than Rebus' push model.
When considering a service in NServiceBus, at what point do you start questioning whether the number of messages handled by the service is too high, and start breaking those out into a new service?
Consider the following: I have a sales service which can currently be broken into a few distinct business components: sales order validation, sales order processing, purchase order validation and purchase order processing.
There are currently about 20 message handlers and 2 sagas used within this service. My concern is that during high-volume traffic from my website, an initial spike can push the number of messages into the hundreds. Considering that the messages need to be processed in the order they are taken off the queue, this can cause a delay for the last in the queue (depending on what processing each message does).
When separating concerns within a service into smaller business components, I find this makes things a little easier. Sure, it's a logical separation, but it seems to provide a layer of clarity and understanding. To me it seems an easier option to do this than creating new services, where in the end the more services I have, the more maintenance I need to do.
Does anyone have any similar concerns to this?
I think you have actually answered your own question :)
As soon as the message volume reaches a point where the lag becomes an issue, you could look at adding another instance of your endpoint. You do not necessarily need to reduce the number of handlers. You could simply install the service a number of times and have specific message types sent to the relevant endpoint by mapping.
It then becomes a matter of a simple instance installation and some config changes. You can then split messages either on sending, so that messages from a particular source end up on a particular endpoint (maybe by priority), or on message type.
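For the "split on message type" option, here is a rough sketch of how that routing can be expressed with the NServiceBus 6+ routing API (older versions use MessageEndpointMappings in app.config instead). The endpoint names and message types are invented for illustration:

```csharp
using System.Threading.Tasks;
using NServiceBus;

// Illustrative command types; in a real solution these live in a shared messages assembly.
class ValidatePurchaseOrder : ICommand { }
class ProcessPurchaseOrder : ICommand { }

class Program
{
    static async Task Main()
    {
        var endpointConfiguration = new EndpointConfiguration("Sales");

        var transport = endpointConfiguration.UseTransport<MsmqTransport>();
        var routing = transport.Routing();

        // Send purchase order commands to a separately installed instance so that a
        // spike in sales order traffic does not delay them.
        routing.RouteToEndpoint(typeof(ValidatePurchaseOrder), "Sales.PurchaseOrders");
        routing.RouteToEndpoint(typeof(ProcessPurchaseOrder), "Sales.PurchaseOrders");

        var endpointInstance = await Endpoint.Start(endpointConfiguration);

        // ... run until shutdown ...

        await endpointInstance.Stop();
    }
}
```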
I happened to do the same thing on a previous project (not using NServiceBus though) where we needed document conversion messages coming from the UI to be processed ASAP. We simply installed the conversion service again with its own set of queues and changed the UI configuration to send the conversion messages to the new endpoint. The background conversion messages were still going to the previous endpoint. So here the source determined the separation.
I have a request to my own service that takes 15 seconds for the service to complete. That should not be a problem, right? We actually have a service-side timer so that it will take at most 15 seconds to complete. However, the client is seeing "the connection was forcibly closed" and is automatically retrying the GET request twice (within the System.Net layer; I have seen it by turning on the diagnostics).
Oh, BTW, this is a non-SOAP situation (WCF 4 REST Service) so there is none of that SOAP stuff in the middle. Also, my client is a program, not a browser.
If I shrink the time down to 5 seconds (which I can do artificially), the retries stop, but I am at a loss to explain how the connection could be dropped so quickly. The HttpWebRequest.KeepAlive flag is, by default, true and is not being modified, so the connection should be kept open.
The timing of the retries is interesting. They come at the end of whatever timeout we choose (e.g. 10, 15 seconds or whatever), so the client side seems to be reacting only after getting the first response.
Another thing: there is no indication of a problem on the service side. It works just fine, but it sees a surprising (to me) couple of retries of the request from the client.
I have Googled this issue and come up empty. The standard for keep-alive is over 100 seconds AFAIK, so I am still puzzled why the client is acting the way it is; and since the behavior is within the System.Net layer, I cannot step through it.
Any help here would be very much appreciated!
== Tevya ==
Change your service so it sends a timeout indication to the client before closing the connection.
Sounds like a piece of hardware (router, firewall, load balancer?) is sending a RST because of some configuration choice.
I found the answer and it was almost totally unrelated to the timeout. Rather, the problem is related to the use of custom serialization of response data. The response data structures have some dynamically appearing types and thus cannot be serialized by the normal ASP.NET mechanisms.
To solve that problem, we create an XmlObjectSerializer object and then pass it plus the objects to serialize to System.ServiceModel.Channels.Message.CreateMessage(). Here two things went wrong:
An unexpected type was added to the message [how and why I will not go into here] which caused the serialization to fail and
It turns out that the CreateMessage() method does not immediately serialize the contents but defers the serialization until sometime later (probably just-in-time).
The two facts together caused an uncatchable serialization failure on the server side because the actual attempt to serialize the objects did not occur until the user-written service code had returned control to the WCF infrastructure.
Now why did it look like a timeout? Well, it turns out that not all the objects being returned had the unexpected object type in them. In particular, the first N objects did not. So, when the time limit was lengthened beyond 5 seconds, the N+1th object did reference the unknown type and was included in the download which broke the serialization. Later testing confirmed that the failure could happen even when only one object was being passed back.
The solution was to pre-process the objects so that no unexpected types are referenced. Then all goes well.
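One way this could have been surfaced earlier: run the same serializer over the response objects inside the service method, where the exception is still catchable, before handing them to CreateMessage(), which defers serialization. A rough sketch, where MyResponse and DataContractSerializer stand in for the real contract types and the custom XmlObjectSerializer:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.ServiceModel.Channels;
using System.Xml;

// Placeholder contract type standing in for the real response structures.
[DataContract]
public class MyResponse
{
    [DataMember] public string Value { get; set; }
}

public static class ResponseSerializationCheck
{
    // Eagerly serialize the body to a throwaway buffer so that unknown-type failures
    // throw here, inside catchable service code, instead of later when WCF lazily
    // serializes the deferred body created by Message.CreateMessage().
    public static void ValidateSerializable(XmlObjectSerializer serializer, object body)
    {
        using (var buffer = new MemoryStream())
        using (var writer = XmlDictionaryWriter.CreateBinaryWriter(buffer))
        {
            serializer.WriteObject(writer, body);
            writer.Flush();
        }
    }

    public static Message BuildResponse(MyResponse response, IEnumerable<Type> knownTypes)
    {
        var serializer = new DataContractSerializer(typeof(MyResponse), knownTypes);
        ValidateSerializable(serializer, response);   // fail fast, while we can still catch it
        return Message.CreateMessage(MessageVersion.None, "ResponseAction", response, serializer);
    }
}
```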
== Tevya ==