Biztalk Cluster Servers - biztalk

we used to have 1 biztalk 2006R2 32bit server. We recently upgraded it to Enterprise. But because our traffic size we didn't have enough power and memory with only one. So we also recently installed a second biztalk server, a 2006R2 64-bit, and we put them in a shared cluster. Since then a problem arose, actually two but I'm guessing they probably are connected. One of our (19) host instances keeps getting in the "stopped" status. This host instance is mainly connected with TCP ports. We have a script which checks if host instances are in the stopped state and starts them again, but this obviously has very little use since it keeps resetting to the stopped state. There also is an error in our event viewer, namely:
Faulting application btsntsvc.exe, version 3.6.1404.0, stamp 4674b0a4, faulting module kernel32.dll, version 5.2.3790.4480, stamp 49c51f0a, debug? 0, fault address 0x0000bef7.
Anyone has any idea?
Thanks

Having automated scripts to restart the host instance is not a good idea IMO, you need to get to the bottom of the problem. It looks like a known issue and a hot fix is availble. Worth lookint at this KB http://support.microsoft.com/kb/978059

Related

Some BizTalk Receive Locations are disabled after server reboot

It is found that some BizTalk Receive Locations are disabled after server reboot (BizTalk server and SQL Server are separately installed to different physical servers)
Is there any idea on this scenario? Due to the boot sequence or other issues?
I will assume that, once you enable the receive locations manually, they are working correctly.
Typically, when FILE receive locations fail while pointing to an external server/share, it is because they are no longer available.
Make sure that, during the night, there are no network issues, planned/unplanned downtime of the share (= here your SQL server). A BizTalk receive location will retry to access a share for quite a while before disabling itself. Check the event log(s) for more information. You would want to look for errors/warnings there indicating an issue with connectivity between BizTalk and SQL.
Another issue might be that there are too many connections between your BizTalk server and SQL server. You can provide a maximum number of connections in the FILE share properties.
Also, you could try this link: https://serverfault.com/questions/235032/intermittent-connection-to-windows-7-shared-folder-from-windows-xp-workstations
It describes a potential fix for optimizing throughput for file sharing, although it depends on your operating system.

Troubleshooting an IIS .NET website outage

Last night one of the websites (.NET 4.0 forms) hosted on my Win 2008 R2 (IIS 7.5) Server started to time out throwing the following error for all connected users.
TYPE System.Web.HttpException
MESSAGE Request timed out.
DETAIL System.Web.HttpException (0x80004005): Request timed out.
The outage was confined to just one website within IIS, the others continued to work fine.
Unfortunately I was unable to identify why the website was timing out. Here are the steps I took:
First thing I did was look at the task manager which revealed normal CPU and memory usage. Network activity was also moderate.
I then opened IIS to look at the live connections under 'Worker Processes'. There were about 60 live connections, so it didn't look like anything DDoS related.
Checked database connectivity (hosted on a separate server), all fine!
I then reset the website on IIS. That didn't work
I tried to then do a complete iisreset...still no luck :(
In the end (and under some duress) the only thing I could think to do to resolve this was to restart the server.
Restarting the server worked but I am nervous not knowing why this happened in the first place. Can anyone recommend any checks that I failed to carryout? Is there an official checklist for working through these sorts of IIS problems? I have reviewed the IIS logs but don't see anything unusual on the run up to the outage.
Any pointers or links to useful resources to help me understand and mitigate against this in future will be much appreciated.
EDIT
The only time I logged into the server that day was to add an additional web handler component (for remote deploy) to IIS Web Deploy. I'm doubtful this caused the outage as the server worked for for 6 hours after.
Because iisreset didn't helped and you had to restart whole machine, I would suspect it was a global resources shortage and mostly used website (or most resource consuming) was impacted. It could be because of not available RAM, network connections congestion due to some malfunctioning calls (for example a lot of CLOSE_WAIT sockets exhausting connections pool, we've seen that in production because of malfunction of external service). It could be also one specific client problem, which was disconnected after machine restart so eventually the problem disappeared.
I would start from:
Historical analysis
review Event Viewer to see any errors/warnings from that period of time,
although you have already looked into IIS logs, I would do it once again with help of Log Parser Lizard to make some statistics like number of request per client, network bandwith per client, average response time per client and so on.
Monitoring
continuously monitor Performance Counters:
\Processor(_Total_)\% Processor Time,
\.NET CLR Exceptions(_Global_)\# of Exceps Thrown / sec,
\Memory\Available MBytes,
\Web Service(Default Web Site)\Current Connections (per each your site name),
\ASP.NET v4.0.30319\Request Wait Time,
\ASP.NET v4.0.30319\Requests Current,
\ASP.NET v4.0.30319\Request Queued,
\Process(XXX)\Working Set,
\Process(XXX)\% Processor Time (XXX per each w3wp process),
\Network Interface(XXX)\Bytes total / sec
run Performance Analysis of Logs (PAL) Tool during time of failure to make a very detailed analysis of performance counters data,
run netstat -ano to analyze network traffic (or TCPView tool even better)
If all this will not lead you to any conclusion, create a Debug Diagnostic rule to create a memory dump of the process for long running requests and analyze it with WinDbg and PSSCor extension for .NET debugging.

MSMQ Inconsistent State After Restart

I'm seeing a really strange error that I'm having a difficult time
tracking down. I think its related to my configuration of Rhino ESB, though I'm not sure
if RSB is actually causing it, so I figured I'd ask and see if
anyone else has come across this in any other usages of MSMQ.
I'm using RSB as a client in a web app (ASP.NET, the client runs in the background). The client talks to a windows service via the MSMQ binding for RSB. Restarting the service never appears to have an effect on MSMQ, neither does restarting IIS by hand. However, whenever I actually restart the computer itself, MSMQ always refuses to start back up, claiming that a "queue is in an inconsistent state". Attempting to start MSMQ manually results in the same error, effectively rendering the MSMQ install completely useless. The only way to solve it is to actually remove then reinstall MSMQ.
The only information I've found via the almighty Google are references to a problem in MSMQ 2.0 (this problem is occurring in MSMQ 4.0). I've verified that Dispose is being called on on the bus at shutdown, in both the service and the web site.
Does anyone have any idea why this could be occurring? Thanks!
I faced the same issue on Window 2008 Server (Virtual Machine). Although the environment was not related to rhino tools.
The error in the event log:
"The Message Queuing service cannot start because a queue is in an inconsistent state. For more information, see Microsoft Knowledge Base article 827493 at support.microsoft.com."
As Roy pointed out, this is happening every 2-3 days. Every time we would follow the steps below to recover - instead re-installing the MSMQ.
1) Stop all applications and services that uses MSMQ.
2) Kill the mqsvc.exe from the Task Manager
3) Go to C:\Windows\System32\msmq\storage and delete any .mq files
4) Start the MSMQ Service
4) Start your application
In my scenario I've been able to fix "queue is in an inconsistent state" error after MSMQ service restart.
Turns out the computer name was too long, so changing computer name to a name with less than 15 characters fixed the issue.
My team is experiencing a similar issue, with MSMQ getting called by NSB 2.5. The issue came up recently after Infrastructure moved our VM to another physical server and for some reason lowered available RAM. We think the issue may be memory-related.
EDIT
After a week of no more issues with this, I can confidently say that raising RAM on the server solved our MSMQ's "Inconsistent state" issue. Mind you, we did have to re-install MSMQ first -- but the issue never came back, and before the RAM update the issue popped up every 2 days.
Regularly on Windows 2008RC2, MSMQ cannot start after reboot.
The two regular issues for me are:
"The Message Queuing service cannot start because a queue is in an inconsistent state"
and
"The dependency service does not exist or has been marked for deletion"
Sometimes, the following has helped (although we are seeking a more solid answer)
rename msmq folder to msmq_old
net stop wuauserv
net stop bits
Delete “%windir%\softwaredistribution” directory
Reboot
This has occurred 5 times this year, and each time, a variety of the above with plenty of reboots.
Sometimes, we revert to Remove Feature / Add Feature, however you may get yourself in a loop. As it boots up, a rollback occurs in the windows update service, so the Feature is never uninstalled, and the problem is never repaired.
Following the steps above can help with that.

What Causes "Internal connection fatal errors"

I've got a number of ASP.Net websites (.Net v3.5) running on a server with a SQL 2000 database backend. For several months, I've been receiving seemingly random InvalidOperationExceptions with the message "Internal connection fatal error". Sometimes there's a few days in between, while other times there are multiple errors per day.
The exception is not limited to one site in particular, though they share business and data access assemblies. The error seems to always be thrown from SqlClient.TdsParser.Run(). It sometimes is thrown from old-school direct SqlCommand.Execute() calls, while other times it is thrown from Linq2Sql code.
I've been assured by the network guys that there are no errors or packets lost on their end. Has anyone else experienced this? Could it be a driver problem? We have been unable as of yet to pinpoint a specific trigger for this exception.
We're running II6 on Windows Server 2003.
After a few months of ignoring this issue, it started to reach a critical mass as traffic gradually increased. Under heavy load, including some crawlers, things got crazy and these errors poured in nonstop.
Through trial and error, we eventually tracked down a handful of SqlCommand or LINQ queries whose SqlConnection wasn't closed immediately after use. Instead, through some sloppy programming originating from a misunderstanding of LINQ connections, the DataContext objects were disposed (and connections closed) only at the end of a request rather than immediately.
Once we refactored these methods to immediately close the connection with a C# "using" block (freeing up that pool for the next request), we received no more errors. While we still don't know the underlying reason that a connection pool would get so mixed up, we were able to cease all errors of this type. This problem was resolved in conjunction with another similar error I posted, found here: Why is my SqlCommand returning a string when it should be an int?
Sounds like the database connection is getting dropped or timing out.
We recently had similar issues moving to IIS 6 from IIS 5 connecting to SQL 2000. Our issue was solved by increasing number of ephemeral ports available.
Look at the usage of the ephemeral ports by the IIS server. The default max no. of ports available is normally 4000. You might want to consider increasing this if the sites on your server are particularly busy or your application is making a lot of database calls.
You can monitor these first to see if going over max limit.
Search Microsoft Knowledge base for "MaxUserPort" and "TcpTimedWaitDelay" and make necessary registry changes. Make sure you back up registry or snapshot server before making the changes. Will need to reboot for changes to take effect.
You should double check your database and recordset connection are being closed after use. Not closing will use up this port range unnecessarily.
Check the efficiency of your stored procedures anyway as they might be taking longer than they need too.
"If you rapidly open and close 4000 sockets in less than four minutes, you will reach the default maximum setting for client anonymous ports, and new socket connection attempts fail until the existing set of TIME_WAIT sockets times out." - from http://support.microsoft.com/kb/328476
Check your server's LOG folder (\program files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG or similar) for files named SqlDump*.mdmp and SqlDump*.txt. If you do find any you'll have to take it to Product Support.
I was creating a new EF Core project and was trying to create the database to an external Linux server instead of a Windows Server or local one. After hours of searching I found out that I am using MySQL instead of the Microsoft SQL server.
I found it weird that everyone was using 1433 instead of the usual 3306. So to fix my 'Internal connection fatal error' I had to set up a docker instance of SQL Server bound to its default port of 1433.
It literally was that simple. In the docker repo look for "microsoft-mssql-server" and run the image as described neatly in the description below. Everything works now and I am able to push my database from my EF Core project to an external server.

Slow BizTalk File Receive

I have an application with a file receive location. After the host instance has been running for a few hours the receive location fails to identify new files dropped into the folder that it is monitoring. It doesn't forget about them altogether, it's just that performance grinds to a crawl. The receive location is configured to poll the target folder every 60 seconds but after host instance has been running for an hour or so, then it seems that the target folder is being polled only every thirty minutes. If I restart the host instance then the files waiting in the target folder are collected right away and performance is fine for the next hour or so.
The same application runs fine in a different environment.
There are now obvious entries in the event log related to the problem.
All the BizTalk SQL jobs are running fine except for Backup BizTalk Server (BizTalkMgmtDb).
Any suggestions gratefully received.
Thanks
Rob
Here are some additional tools which may help you identify and diagnose BizTalk database issues.
BizTalk MsgBox Viewer
Here is a tool to repair identified errors:
Terminator
Use at your own risk... read the glogs and docs. Start with the message box viewer and let us know our results.
Without more details, the biggest tell is that your Backup Job is failing. If the backup job is failing, it may not be properly configured. If it is properly configured and still failing, then you've got other issues. Can you give us some more information about your BizTalk install.
What version are you running?
What are our database sizes?
What are your purge and archive settings like?
Is there any long running blocks in your SQL Server DB coming from BizTalk?
Another thing to consider is the user accounts the send, receive and orchestration hosts are running under. Please check the BizTalk Administration Console. If they are all running the same account, sometimes the orchestrations can starve the send and receive processes of CPU time. I believe priority is given to orchestrations then receive, then send. Even if you are just developing, it is useful to use separate accounts for this. This also improves security.
The Wrox BizTalk Server 2006 will also supply tuning advice.
What other things are going on with the server? Is BizTalk pegged otherwise or is it idle?
You mention that the solution does not have any problems in another environment, so it's likely that there is a configuration problem.
Check the following:
** On SQL Server, set some upper memory limit for SQL Server. By default, SQL Server uses whatever it can get and then hangs onto it, so set a reasonable limit so that your system can operate without spending a lot of time paging memory onto and from your hard drive(s).
** Ensure that you have available disk space - maybe you are running low - this can lead to all kinds of strange problems.
** Try to split up the system's paging file among its physical drives (if you have more than one drive on the system). Also consider using a faster drive, or if you have lots of cash laying around, get a SAN.
** In BizTalk, is tracking enabled? If so, are you also tracking message bodies? Disable tacking or message body tracking and see if there is a difference.
** Start performance monitor and monitor the following counters when running your solution
Object: BizTalk Messaging
Instance: (select the receiving host) %%
Counter: Documents Received/Sec
Object: BizTalk Messaging
Instance: (select the transmitting host) %%
Counter: Documents Sent/Sec
Object: XLANG/s Orchestrations
Instance: (select the processing host) %%
Counter: Orchestrations Completed/Sec.
%% You may have only one host, so just use it. Since BizTalk configurations vary, I am using generic names for hosts.
The preceding counters monitor the most basic aspects of your server, but may help to narrow down places to look further. You can, of course, add CPU and Memory too. If you have time (days...maybe weeks) you could monitor for processes that allocate memory and never release it. Use the following counter...
Object: Memory
Counter: Pool Nonpaged Bytes
Slow decline of this counter indicates that a process is not releasing memory, which affects everything on the system.
Let us know how things turn out!
I had the same problem with, when my orchestration was idle for some time it took a long time to process the first msg. A article of EvYoung helped me solve this problem.
"This is caused by application domain unloading within the BizTalk host process. If an AppDomain is shutdown after idle, the next message that comes needs to wait for the Orchestration to compile again. Depending on the complexity of your design, this can be a noticeable wait. To prevent this in low latency requirement scenario, you can modify the BTSNTSVC.EXE.config file and set SecondsIdleBeforeShutdown property to -1. This will prevent AppDomain shutdown due to idle."
You can find the article in here:
http://blogs.msdn.com/b/biztalkcpr/archive/2008/05/08/thoughts-on-orchestration-performance.aspx
It took me to long to respond but i thought i might help someone. cheers :)
Some good suggestions from others. I will add :
Do you have any custom receive pipeline components on the receive location ? If so perhaps one is leaking memory, calling some external component eg database which is taking a long time ?
How big are the files you are receiving ?
On the File transport properties of your receive location, set "file renaming" on, do the files get renamed within 60s.

Resources