Debugging requests which are 'stuck' in an IIS worker process - asp.net

In case of TL;DR - I basically need guidance regarding what tools are available to debug requests which are issued to IIS and which stall inside a module.
I have a problem with an old ASP 2.0 app at the moment whereby it will periodically become unavailable and recycling the app pool (horrible as that may be) doesn't bring it back up 100% of the time.
So first of all it presents itself as requests entering the app pool and being trapped in state 'BeginRequest' in RewriteModule.
It is not a specific request which is always the first to experience this issue. The issue cannot be easily recreated either.
Eventually requests join this backlog and when it becomes 70+ deep the app pool fails to respond to pings from WAS and it forcibly recycles. Predictably it doesn't stop on-time and the old app pool is forced to stop. When the new app pool comes up it either works just fine or it instantly experiences the same issue as the outgoing one and requests begin to queue.
In issues like this all the official guidance is understandably focussed around looking at why the RewriteModule may choke.
I have validated my redirections and though complex there are no obvious issues with syntax (XML validates).
Likewise in inetmgr loading up the URL Rewrite Module seems to parse the configs fine and show them visually.
Basic stuff like permissions is all fine.
When the app is working normally I also used Failed Request Tracing/Logging to look at the request pipeline for a sample URL which stalled and I can confirm that there is no circular logic or weird errors presenting - the request seems to be handled just fine. This also showed me how high up the rewritemodule is invoked and from this I really don't see how the issue could be app-related as .NET isn't invoked at this point.
Annoyingly when an app pool is experiencing this issue and I can throw in requests which just stall Failed Request Tracing is no good because you actually need a request to get to the end of it's journey and fail otherwise it refuses to log anything out.
I resorted to taking process dumps of affected w3wp.exe's and running them through DebugDiag. Unfortunately the only thing I see is that threads are open accessing the rewritemodule but precious little about what they are stuck on.
As anyone else would do I've tried to track the start of the issue back to any recently installed patches or code changes but nothing matches. Likewise this is happening on 3x servers otherwise I would try reinstalling the rewritemodule. Other sites on the same server which invoke rewritemodule are unaffected.
Has anyone else experienced issues like this - the net seems to have relatively little info in this case. Perhaps you can recommend further debugging tools or approaches for IIS which I can adapt to this scenario? This is sort of a cry for help from someone more used to Apache/Nginx - sorry for the long post.

Related

Some requests on IIS hang for minutes and end in a lost connection

I have an awkward issue with IIS 10.0 on Windows Server 2016 and ASP.Net 4.5.2 and MVC 5.2.7.
At times, certain requests do not receive a response and run for minutes, maybe 10 or so, before ending in a lost connection (PR_CONNECT_RESET_ERROR in Firefox on Windows, NSURLDomainError in Firefox on iOS). These are mostly POST requests. When this issue occurs, other GET requests will receive a swift response and a correct result. Normally, POST-request do no take long to be processed, typically less than 3 seconds.
Recycling the associated worker process will make the issue go away, for hours or days.
When today inspected the web server when the issue was going on, I saw little CPU usage, less than 10%, memory 56%, the worker process a modest 615 MB. I saw neither logging in the W3C log of these requests, nor in my custom application logs.
I added the Web-Request-Monitor conform How do I see currently executing web request on IIS 8, but in doing so, the the worker process probably got recycled, as the issue is not currently occurring.
There are a reverse proxy and an access manager between the internet and my web server. I suppose they can have something to do with this issue, but it certainly is related to IIS, as recycling helps.
All of this is happening on a acceptation web server running a newer version of my application. I am not aware of any big changes to the application's architecture that could be involved. Also, there will be very little traffic from other clients, if none at all.
What could be next steps to investigate this issue further?
Update
This issue was definitely caused by log4net. However, it was not related to the log4net.Internal.Debug setting. It was caused by two application domains accessing the same log file. This occasionally resulted in concurrency issues with accessing the log file. It appeared that log4net could not properly handle this and got stuck while writing to the log file.
This log file was configured with the RollingFileAppender option. Since we also used AdoNetAppender, we decided to remove file logging all together.
Original
I have found a probable cause. I'll report the steps I took to investigate the issue.
I activated the Worker Processes feature in IIS.
When, after a couple of days of waiting, the issue started again, I found long running requests. They all had State ExecuteRequestHandler and Module Name ManagedPipelineHandler. They had Time Elapsed of hundreds of seconds.
I also activated the Failed Requests Tracing with a rule for long running requests with a Time Taken of 1 minute.
After a couple of days, I started to receive failed request reports. The failed request all have a GENERAL_SET_RESPONSE_HEADER event as their last event.
I added additional debug logging events for each requests. When debugging in my development environment, at one point, I started to see the hanging behaviour there, on one of the new logging statements(!). The application uses log4net.
I captured a stack trace:
log4net.dll!log4net.Appender.AppenderSkeleton.DoAppend(log4net.Core.LoggingEvent loggingEvent) log4net.dll!log4net.Util.AppenderAttachedImpl.AppendLoopOnAppenders(log4net.Core.LoggingEvent loggingEvent) log4net.dll!log4net.Repository.Hierarchy.Logger.CallAppenders(log4net.Core.LoggingEvent loggingEvent) log4net.dll!log4net.Repository.Hierarchy.Logger.Log(System.Type callerStackBoundaryDeclaringType, log4net.Core.Level level, object message, System.Exception exception) log4net.dll!log4net.Core.LogImpl.DebugFormat(string format, object arg0)
The DoAppend method uses lock(this), which may very well cause hangs.
I also found out that the config setting log4net.Internal.Debug was set to true, which I do not want under normal circumstances and this may be related. I did not attempt to understand the log4net code, but I remember that logging initially did not work, in the acceptance environment, so the setting may very well have been set to true then, causing the issue to start.
Another indication that this is happening with log4net is that when the issue last occurred, I realized that logging of level standard, only occurs in some POST requests. I found a POST-request that does not log and requests to it where handled normally, while the other POST-requests still hung.
For now, I have set log4net.Internal.Debug to false and will wait to see what happens.
IIS recycle fix this issue doesn't mean that this is an IIS issue because all asp.net application run in .net runtime unless it is proved that the request is hang in IIS module.
So you may need to wait this issue happen again, then create a Failed request tracing rule for time-taken. Then it will tell us this issue is happening on IIS pipeline module or .net runtime.
If all request hang in .net runtime. Then you may have to capture a hang dump and do a deep analysis via WINDGB and mex extension. It will tell us what's happening there.

Application Takes too long to respond

I'm using devexpress grid view to view and saving data,my problem is when the browser is first loaded application respond normally,after while (less than 5 minute),application response become very low and takes too long time to respond for any action ,i tried to to expand recycle time(regular time interval) on IIS but seems that the problem is not from IIS at all
also i made session time out after 60 minute and likewise there is no difference.
note that:
All of my request is Ajax calls.
When i deploy the application using development Server (in visual studio) application respond normally.
could any one please suggest me where the problem is??
Its hard to say where is the problem , but first thing came in to my head
It may be possible connection are opened but not closed, In general not disposing objects.
Make sure that the database connection pool settings are correct.
You may try to check the IIS logs to see if there is any unusual response rate i.e. errors (http 500, 404, etc)
also I will suggests fetching multiple result sets in one database call , try to reduce number of round trips to the database
It may possible that you have installed so many plugins in your browsers, and they are making it slow.
These are just few tips, hope it will help.
i investigate my problem further more i guess the problem is from browser, when i deploy application using Firefox issue totally disappeared and app respond normally,when using IE or chrome problem reveal

Re-enable ASP.NET session that caused IIS hang

I'm trying to implement some fail safes on a client's web server which is running two of their most important sites (ASP.NET on IIS7). I'm going to set up application pool limiting so that if any w3wp process uses 90%+ CPU for longer than a minute then it gets killed (producing a temporary 503 Service Unavailable message to any visitors), and based on my local testing will be restarted within a minute - a much better solution than having one CPU-hogging process taking down the whole server for any length of time.
This seems to work, however during my fiddling on my local IIS7 instance I've noticed that if a request calls my "Kill.aspx", even when the site comes back up IIS will not serve the session that caused it to hang. I can only restart the test site from a different session - but as soon as I clear my cookies on the "killer" browser I can get to the site again.
So, whatever malicious behaviour IIS is trying to curb with this would not work against an even slightly determined opponent. In most cases, if excrement does hit fan it will be coding/configuration error and not the fault of the user who happened to request a page at that time.
Therefore, I'd like to turn this feature off as the theoretical user would have no idea that they need to clear their cookies before they can access the site again. I would really appreciate any ideas on how this might be possible.
Yous should be using ASP.Net Session StateServer instead of In-Proc (see msdn for details). That way, you session will run in different process and won't be affected by IIS crash.
Turn what "feature" off? If the worker process is reset (and your using in-proc session) then the session is blown away on a reset.
You might want to investigate moving your session storage to a state server or some other out of process scenario.
Also, you might want to set the application pool to use several worker processes (aka: web garden) this way if one process is killed the others continue serving content.
Next, as another option you might want to set up multiple web servers and load balance them.
Finally, you might want to profile the app to see exactly how they are causing it to spin into nothingness. My guess is that there are a number of code issues you are simply covering up with this idea.

Validation of viewstate MAC failed

Ran into this issue yesterday on one of our sites. First of all the site is hosted in a web farm environment and for the time being I have added a static machineKey to the web.config on both nodes (2 node environment). This has solved the issue and everything is running fine now.
This raised the following question:
Why is it that all our other sites that run on this environment does not require this (machineKey in the web.config).
I checked event logs to make sure that we are not having the same issue on other sites and everything looks fine. I also confirmed that the app pool is not recycling too often and everything was normal with regards to app pool settings.
The only explanation I can come up with is that the site is rendered by one node and subsequent post backs go to another node - which would leave me to believe that the problem lies with the load balancer. Our infrastructure guys tells me that everything is as it should be with regards to the load balancer and the scenario that I am proposing will not happen.
Am I missing the obvious here or are there anything else that I can consider?
Thanks in advance
Basically, yes, you're right - you generally see this in a web farm environment when "Sticky Sessions" aren't properly configured in the load balancer, and the users postback is sent to a different server.
To be fair to your network guys, it's possible that most requests are being sent to one server, but that this application is tipping the usage such that requests are often sent to another server - but you should be seeing that across all sites, unless the traffic patterns are completely different.
The other possible cause is that either your page is taking too long to load, and the users are posting back before the page has completely finished loading - I'd managed to get one of my sites doing that with a couple of remote advert calls buried halfway through the page load, or the users are waiting too long between page render and postback and the session on the loadbalancer is timing out so it thinks it's a new request.
If you are working with a web-farm environment, machine key values, if specified in the web.config need to be synced. In addition, you will want to make sure that the machine key values in the machine.config file are also synced between the two.

Mysterious IIS Problem: Site stops serving dynamic pages, no errors in logs

This may be the most mysterious problem I've ever encountered.
We have an IIS7 install with 3 Web Sites on it, each with it's own Application Pool. Once a day, for about an hour, a specific one of them goes down.
What I mean by "goes down" is:
It stops responding to requests for dynamic pages (ex. default.aspx) but will serve static files fine (logo.png).
Wireshark tells me that these dynamic page requests are actually return HTTP 500 Internal Server errors, but in the browser, I don't see an error. I just see the browser spinning.
If I log on locally to the box and surf around everything runs fine. All the pages pull up, so the database is being queried. It all seems perfectly normal.
There are no errors in the event log.
There are no errors recorded that have been captured by our internal (Application-level) error logging.
The basic IIS log file, which I thought logged every request, shows no record of these requests coming in.
And, if I restart the App Pool for the Web Site, everything comes back immediately. Or, if I just wait an hour or so, it comes back.
So, I've ruled out:
DNS issues, since I have no problem terminal servicing into the box by hostname.
Database issues, since the site works fine when I'm local to the box and surfing around
HTTP firewall issues, since I'm seeing the requests in wireshark, and am even getting images to serve up.
I have to assume it's a problem with my application, but IIS doesn't even show that these requests ever happened, and nothing in IIS or my app is logging errors.
It also doesn't even go down at the same time each day. This started at night (#midnight) and seems that it's gradually started moving it's daily time by an hour or so, until the point now where it hit at 9AM.
Any clues you might have for further troubleshooting would be greatly appreciated.
Tom
I'd fire up performance monitor and look for requests and exceptions being thrown. Not a whole lot of value in my answer but it might started pointing you in the right direction.
Actually, check the event logs first, see if something is throwing errors. Also, check memory usage and paging.

Resources