MSDTC fails intermittently - transactionscope

I have a Windows 2003 server where MSDTC is running. I have set it to No Authentication Mode, with allow inbound-outbound settings. My MSDTC works but it fails for the 1st transaction of the day. On immediate another transaction it will start working.
Error is : The transaction has already been implicitly or explicitly committed or
aborted (Exception from HRESULT: 0x8004D00E).
So I started MSDTC tracing, in trace file it shows this :
pid=2144;tid=2528;time=12/02/2011-10:49:39.140;seq=531;eventid=TRACING_STARTED;;"MSDTC is resuming the tracing of long -lived transactions"
pid=2144;tid=2528;time=12/02/2011-10:49:39.140;seq=532; eventid=TRANSACTION_BEGUN; tx_guid=4df1b0cf-26a0-43ba-8f41-965d80f92441;"transaction got begun, description : ''"
pid=2144;tid=3288;time=12/02/2011-10:49:39.140;seq=533 ; eventid=RM_ENLISTED_IN_TRANSACTION; tx_guid=4df1b0cf-26a0-43ba-8f41-965d80f92441 ;"resource manager #1002 enlisted as transaction enlistment #1. RM guid = '4e45a393-b02a-42bf-8f66-62bcb17fee8e'"
pid=2144;tid=4164;time=12/02/2011-0:49:58.390;seq=534; eventid=TRANSACTION_PROPOGATION_FAILED_CONNECTION_DOWN_FROM_REMOTE_TM ;tx_guid=4df1b0cf-26a0-43ba-8f41-965d80f92441 ;"failed to propogate transaction to child node 'DBSERVER' because the connection with the remote transaction manager went down"
pid=2144;tid=4164;time=12/02/2011-10:49:58.390;seq=535; eventid=TRANSACTION_ABORTING;tx_guid=4df1b0cf-26a0-43ba-8f41-965d80f92441;"transaction is aborting"
pid=2144;tid=4164;time=12/02/2011-10:49:58.390;seq=536; eventid=RM_ISSUED_ABORT;tx_guid=4df1b0cf-26a0-43ba-8f41-965d80f92441 ;"abort request issued to resource manager #1002 for transaction enlistment #1"
pid=2144;tid=2528;time=12/02/2011-10:49:58.422;seq=537; eventid=RM_ACKNOWLEDGED_ABORT;tx_guid=4df1b0cf-26a0-43ba-8f41-965d80f92441;"received acknowledgement of abort request from the resource manager #1002 for transaction enlistment #1"
pid=2144;tid=2528;time=12/02/2011-10:49:58.422;seq=538; eventid=TRANSACTION_ABORTED;tx_guid=4df1b0cf-26a0-43ba-8f41-965d80f92441 ;"transaction has been aborted"
pid=2144;tid=3640;time=12/02/2011-10:50:29.437;seq=539;eventid=TRACING_STOPPED;;"MSDTC is suspending the tracing of long - lived transactions due to lack of activity"
I have applied hack of Davy Brion from here, http://davybrion.com/blog/2010/03/msdtc-woes-with-nservicebus-and-nhibernate/
Also set timeout interval to 10 minutes in Transaction Options.
If the server remains idle for a while again the transaction will fail.
Thanks in advance..

Here is the KB article that helped: 922430
If the MS DTC transaction trace log file contains this data, follow these steps:
Click Start, click Run, type regedit, and then click OK.
Locate the following registry subkey:
HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC
Right-click MSDTC, point to New, and then click DWORD Value.
Type CmMaxNumberBindRetries, and then press ENTER.
Right-click CmMaxNumberBindRetries, and then click Modify.
Click Decimal.
In the Value data box, type 60.
This value increases the length of time that the client computer waits for the bind packet response from the server computer. This value is double the number of seconds before the client computer stops the transaction if the client computer does not receive the bind packet response. For example, a value of 60 equals 30 seconds.
Note The value of 60 is only a recommended value. Additional testing on your configuration may be required.
Click OK.
Restart MS DTC.
Note For the slow response scenario, make sure that the ports that are required by Kerberos authentication (UDP 88 and TCP 88) are open when a firewall is involved in the Perimeter Network. The ports UDP 389 and TCP 389 (both for LDAP to find KDC) must also be open.

Related

Continuously running send pipeline instance

An instance of a BizTalk send pipeline has started to run continuously. On 09/12/2021 an attempt was made to send a file via SFTP, which retried several times but ultimately failed due to a network issue. The error from the event logs is:
The adapter failed to transmit message going to send port "Deliver Outgoing - SFTP" with URL "sftp://xxx.xxxxxx.co.nz:22/To_****/%SourceFileName%". It will be retransmitted after the retry interval specified for this Send Port. Details:"WinSCP.SessionRemoteException: Network error: Software caused connection abort.
For some reason BizTalk made another send attempt at 1:49pm on 10/12/2021 which succeeded as confirmed by the administrator of the SFTP site. Despite this, BizTalk continued making intermittent send attempts and the pipeline instance is still running. The same file has been sent 4 times to the SFTP server.
The pipeline instance in theory should have suspended at 9:47pm on 09/12/2021. I have been able to confirm definitively whether anybody resumed it, but it seems unlikely at this stage. In any case, after sending successfully the pipeline instance should have terminated and should not be re-executing intermittently.
Does anybody know what could account for this behaviour? This is occurring on BTS2020 with CU2 applied.
I've sent messages over SFTP where the WinSCP interpretation of the date-modified attribute doesn't work with a specific type of SFTP server.
With the WinSCP GUI a dialogue box appears and you can disregard this error, but this option isn't available with BizTalk's GUI. This error appears when a file with the same filename already exists on the server and is supposed to be overwritten.
My solution was to create a pipeline component that removed %SourceFileName% on the server. The pipeline component (just like WinSCP GUI) can disregard the modified-date.

IBM MQ :: Remote Configuration - Can't Start Sender Channel

I am working with IBM MQ. I managed to get a basic Handshake / Put Message(s) / Get Message(s) / Disconnect .net solution going on, a couple of days ago, but it only works on a local level, and I now need to update the solution so it works remotely as well.
After reading and experimenting for a while, I decided to follow IBM Knowledge Center's Point to Point scenario step by step. However, I can't start the Sender Channel as instructed by the guide's last step; the Sender Channel's status ping-pongs between Binding and Retrying, and the logs come up with the following error codes; AMQ9002, AMQ9202 and AMQ9999, meaning, as far as I can tell, there is some kind of trouble finding and/or connecting with the host, as explained by the error logs.
I have looked through a lot of questions regarding these errors in particular, but while I have followed most of the proposed solutions (I made sure the Receiver's listener is running, I tried turning off Firewalls, I tried with different ports, I have performed tests Telnet, I have stopped/restarted/resolved the Sender channel a few times, and I have tried setting this up from both, the command line and MQ Explorer), I have yet to get a successful communication going on between two different PCs.
I am aware the error could be either temporary, or the result of problems within the Network itself, but I have been trying to establish a successful connection for almost three days now, and before I pass this unto my bosses I would like to make sure I have exhausted every other possibility.
How can I complete IBM's Point To Point set up guide, or is there anything that could point me towards a different / better approach to get two PCs talking with each other via IBM MQ v9?
Although hastily translated from Japanese, you can find the detailed error logs below.
2017/09/19 17:34:09 - Process (234212.1) User (MUSR_MQADMIN) Program
(runmqchl.exe)
Host (DESKTOP - UP 4 D 363) Installation (Installation 1)
VRMF (9.0.3.0) QMgr (QM 1)
Time (2017-09-19T08: 34: 09.201 Z)
AMQ9002: Channel 'TO.QM2' is starting.
Description: Channel 'TO.QM2' is starting.
ACTION: None.
2017/09/19 17:34:30 - Process (234212.1) User (MUSR_MQADMIN) Program
(runmqchl.exe)
Host (DESKTOP - UP4D363) Installation (Installation 1)
VRMF (9.0.3.0) QMgr (QM 1)
Time (2017-09-19T08: 34: 30.824Z)
AMQ 9202: The remote host 'DESKTOP-1AV4LM3 (The correct ip address) (1415)' can not be used.Please try again later.
Description: Using TCP / IP to host 'DESKTOP-1AV4LM3 (The correct ip
address) of channel TO.QM2 (1415) 'trying to allocate a conversation,
but it did not succeed. However, It is temporary and there is also the
possibility that TCP / IP conversation can be allocated normally
later.
If the remote host can not be determined, '????' is displayed. .
ACTION: Please try the connection later. If the failure persists,
record the error value Please contact the stem administrator. The
return code from TCP / IP is 10060 (X'274C ').The cause of this
failure may be that the host can not reach the destination host.
Alternatively, There is a possibility that the host 'DESKTOP-1AV4LM3
(The correct ip address) (1415)' listener isn't running. If that is
the case, start the listener and try again.
2017/09/19 17:34:30 - Process (234212.1) User (MUSR_MQADMIN) Program (runmqchl.exe)
Host (DESKTOP - UP 4 D 363) Installation (Installation 1)
VRMF (9.0.3.0) QMgr (QM 1)
Time (2017-09-19T08: 34: 30.825Z)
AMQ9999: Channel 'TO.QM2' for host 'DESKTOP-1AV4LM3 (1415)' terminated abnormally
Description: The host 'DESKTOP-1AV4LM3 (1415)' cannot be determined.
ACTION: Check the error log for the preceding error message for
this channel program Please determine the cause of failure....
".
The 'interesting' bit of the error messages above is that the sender is attempting to start a channel to port 1415 on the destination and is getting a 10060 return code (WSAETIMEDOUT). This is different from an immediate rejection because the other end doesnt have a socket open, for example.
You will also note its timing out after about 21 seconds if your times are to be believed. The only time I've seen this kind of things is DNS resolution - There was an APAR for example showing that reverse DNS can cause delays in channel startup, and this could be for a successful or unsuccessful startup
http://www-01.ibm.com/support/docview.wss?uid=swg1IC96408
A new attribute was added to MQ to disable reverse DNS lookups if its the cause - See https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_8.0.0/com.ibm.mq.pro.doc/q113120_.htm#q113120___chlauth
If this is the case, on the receiving end (or both!) try runmqsc , 'ALTER QMGR REVDNS(DISABLED)'. You might have to restart the qmgr for it to be effective (I'm not sure, sorry)
I'd also echo the comment added to your question by JoshMc, to check the receiving end logs for messages (both global errors but more likely the qmgr specific AMQERR01.LOG files) when this occurs - I have a feeling that the timeout is only part of your problem.

Unable to enable receive locations (many of them)

Whenever I try to enable these receive locations, I get the sql time out error. Moreover there are many service instances and messages which are in active state or Dehydrated (Queued awaiting processing).
TITLE: BizTalk Server Administration
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. (.Net SqlClient Data Provider)
For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft+SQL+Server&EvtSrc=MSSQLServer&EvtID=-2&LinkId=20476
ADDITIONAL INFORMATION:
The wait operation timed out
BUTTONS:
OK
I had tried restarting the host instances, but it did not work. The environment is a clustered server. There looks like no performance issue: no memory or cpu spikes.

what causes the data power object to goto pending state and how can it be resolved?

In datapower, the operational state of queue manager object is pending. The information provided for this operational state is as follows : "This message indicates that the configuration of the object has changed, but has not been committed and has yet to take effect. No user intervention is required." What is exactly causing this problem and how can this be resolved?
If it is "pending" for a MQ QM object it means that DataPower is trying to figure out if it has a connection to it or not.
Normally if a QM object is in "pending" for a while, more than 20 seconds, it would mean that it didn't get the connection.
Check the System log and you'll probably see a ton of connection errors to the QM server.
First go to Troubleshooting from the Control Panel and do a TCP test to make sure you have a connection to the MQ server using the IP and port of the listener on the QM.
If you get a connection then check the MQ logs for any authentication issues, eg. user and/or auth-records. You need a Server-Connection channel for DataPower!
If you don't get a connection in TCP test then check your firewalls and also make sure that the DataPower network is setup correctly if you have multiple network cards (NIC) setup and set a static route for the MQ on the correct NIC.

Implementing TCP keep alive at the application level

We have a shell script setup on one Unix box (A) that remotely calls a web service deployed on another box (B). On A we just have the scripts, configurations and the Jar file needed for the classpath.
After the batch job is kicked off, the control is passed over from A to B for the transactions to happen on B. Usually the processing is finished on B in less than an hour, but in some cases (when we receive larger data for processing) the process continues for more than an hour. In those cases the firewall tears down the connection between the 2 hosts after an inactivity of 1 hour. Thus, the control is never returned back from B to A and we are not notified that the batch job has ended.
To tackle this, our network team has suggested to implement keep-alives at the application level.
My question is - where should I implement those and how? Will that be in the web service code or some parameters passed from the shell script or something else? Tried to google around but could not find much.
You basically send an application level message and wait for a response to it. That is, your applications must support sending, receiving and replying to those heart-beat messages. See FIX Heartbeat message for example:
The Heartbeat monitors the status of the communication link and identifies when the last of a string of messages was not received.
When either end of a FIX connection has not sent any data for [HeartBtInt] seconds, it will transmit a Heartbeat message. When either end of the connection has not received any data for (HeartBtInt + "some reasonable transmission time") seconds, it will transmit a Test Request message. If there is still no Heartbeat message received after (HeartBtInt + "some reasonable transmission time") seconds then the connection should be considered lost and corrective action be initiated....
Additionally, the message you send should include a local timestamp and the reply to this message should contain that same timestamp. This allows you to measure the application-to-application round-trip time.
Also, some NAT's close your TCP connection after N minutes of inactivity (e.g. after 30 minutes). Sending heart-beat messages allows you to keep a connection up for as long as required.

Resources