Mysterious IIS Problem: Site stops serving dynamic pages, no errors in logs - asp.net

This may be the most mysterious problem I've ever encountered.
We have an IIS7 install with 3 Web Sites on it, each with its own Application Pool. Once a day, for about an hour, a specific one of them goes down.
What I mean by "goes down" is:
It stops responding to requests for dynamic pages (ex. default.aspx) but will serve static files fine (logo.png).
Wireshark tells me that these dynamic page requests are actually returning HTTP 500 Internal Server Error responses, but in the browser, I don't see an error. I just see the browser spinning.
If I log on locally to the box and surf around everything runs fine. All the pages pull up, so the database is being queried. It all seems perfectly normal.
There are no errors in the event log.
There are no errors recorded that have been captured by our internal (Application-level) error logging.
The basic IIS log file, which I thought logged every request, shows no record of these requests coming in.
And, if I restart the App Pool for the Web Site, everything comes back immediately. Or, if I just wait an hour or so, it comes back.
So, I've ruled out:
DNS issues, since I have no problem terminal servicing into the box by hostname.
Database issues, since the site works fine when I'm local to the box and surfing around.
HTTP firewall issues, since I'm seeing the requests in Wireshark, and am even getting images to serve up.
I have to assume it's a problem with my application, but IIS doesn't even show that these requests ever happened, and nothing in IIS or my app is logging errors.
It also doesn't go down at the same time each day. This started at night (around midnight), and the daily time seems to have gradually moved by an hour or so, to the point where it now hit at 9 AM.
Any clues you might have for further troubleshooting would be greatly appreciated.
Tom

I'd fire up Performance Monitor and look for requests and exceptions being thrown. There's not a whole lot of value in my answer, but it might start pointing you in the right direction.
Actually, check the event logs first, see if something is throwing errors. Also, check memory usage and paging.

Related

Some requests on IIS hang for minutes and end in a lost connection

I have an awkward issue with IIS 10.0 on Windows Server 2016 and ASP.Net 4.5.2 and MVC 5.2.7.
At times, certain requests do not receive a response and run for minutes, maybe 10 or so, before ending in a lost connection (PR_CONNECT_RESET_ERROR in Firefox on Windows, NSURLDomainError in Firefox on iOS). These are mostly POST requests. When this issue occurs, other GET requests will receive a swift response and a correct result. Normally, POST requests do not take long to process, typically less than 3 seconds.
Recycling the associated worker process will make the issue go away, for hours or days.
When I inspected the web server today while the issue was going on, I saw little CPU usage (less than 10%), memory at 56%, and the worker process at a modest 615 MB. I saw no logging of these requests, neither in the W3C log nor in my custom application logs.
I added the Web-Request-Monitor as described in "How do I see currently executing web request on IIS 8", but in doing so the worker process probably got recycled, as the issue is not currently occurring.
There are a reverse proxy and an access manager between the internet and my web server. I suppose they can have something to do with this issue, but it certainly is related to IIS, as recycling helps.
All of this is happening on an acceptance web server running a newer version of my application. I am not aware of any big changes to the application's architecture that could be involved. Also, there is very little traffic from other clients, if any at all.
What could be next steps to investigate this issue further?
Update
This issue was definitely caused by log4net. However, it was not related to the log4net.Internal.Debug setting. It was caused by two application domains accessing the same log file. This occasionally resulted in concurrency issues with accessing the log file. It appeared that log4net could not properly handle this and got stuck while writing to the log file.
This log file was configured with the RollingFileAppender. Since we also used AdoNetAppender, we decided to remove file logging altogether.
Original
I have found a probable cause. I'll report the steps I took to investigate the issue.
I activated the Worker Processes feature in IIS.
When, after a couple of days of waiting, the issue started again, I found long running requests. They all had State ExecuteRequestHandler and Module Name ManagedPipelineHandler. They had Time Elapsed of hundreds of seconds.
I also activated the Failed Requests Tracing with a rule for long running requests with a Time Taken of 1 minute.
After a couple of days, I started to receive failed request reports. The failed requests all have a GENERAL_SET_RESPONSE_HEADER event as their last event.
I added additional debug logging events for each request. When debugging in my development environment, at one point I started to see the hanging behaviour there, on one of the new logging statements(!). The application uses log4net.
I captured a stack trace:
log4net.dll!log4net.Appender.AppenderSkeleton.DoAppend(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Util.AppenderAttachedImpl.AppendLoopOnAppenders(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Repository.Hierarchy.Logger.CallAppenders(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Repository.Hierarchy.Logger.Log(System.Type callerStackBoundaryDeclaringType, log4net.Core.Level level, object message, System.Exception exception)
log4net.dll!log4net.Core.LogImpl.DebugFormat(string format, object arg0)
The DoAppend method uses lock(this), which may very well cause hangs.
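To illustrate why a single lock around the write path can stall every request that logs, here is a minimal Python sketch. It is not log4net code: the Appender class, its do_append method, and the timeout parameter are hypothetical stand-ins, and the timeout exists only so the demo can finish (a plain lock(this) would simply keep waiting).

```python
import threading
import time


class Appender:
    """Toy appender: one lock guards every write, similar in spirit to lock(this)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.lines = []

    def do_append(self, message, write_delay=0.0, timeout=-1, holding=None):
        # Every caller must take the same lock before writing.
        if not self._lock.acquire(timeout=timeout):
            return False  # gave up waiting; without a timeout this would block
        try:
            if holding is not None:
                holding.set()        # signal that the lock is now held
            time.sleep(write_delay)  # stands in for a write stuck on the file
            self.lines.append(message)
            return True
        finally:
            self._lock.release()


appender = Appender()
holding = threading.Event()

# One "request" gets stuck inside the write while holding the lock.
stuck = threading.Thread(
    target=appender.do_append,
    args=("slow write",),
    kwargs={"write_delay": 1.0, "holding": holding},
)
stuck.start()
holding.wait()  # make sure the stuck thread really owns the lock

# A second "request" that logs cannot get past the lock.
blocked_result = appender.do_append("second request", timeout=0.2)

stuck.join()
# Once the lock is released, logging works again.
after_result = appender.do_append("third request", timeout=0.2)
```

In this sketch the second call gives up while the first write is stuck, and the third succeeds once the lock is free. That matches the observation that only requests which log were hanging.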
I also found out that the config setting log4net.Internal.Debug was set to true, which I do not want under normal circumstances and this may be related. I did not attempt to understand the log4net code, but I remember that logging initially did not work, in the acceptance environment, so the setting may very well have been set to true then, causing the issue to start.
Another indication that this is happening in log4net: when the issue last occurred, I realized that logging at the standard level only occurs in some POST requests. I found a POST request that does not log, and requests to it were handled normally while the other POST requests still hung.
For now, I have set log4net.Internal.Debug to false and will wait to see what happens.
The fact that an IIS recycle fixes the issue doesn't mean this is an IIS issue, because every ASP.NET application runs in the .NET runtime. It is only an IIS issue if the request is proven to hang in an IIS module.
So you may need to wait for the issue to happen again, then create a Failed Request Tracing rule for time-taken. It will tell us whether the issue is happening in an IIS pipeline module or in the .NET runtime.
If all requests hang in the .NET runtime, you may have to capture a hang dump and do a deep analysis with WinDbg and the MEX extension. It will tell us what's happening there.
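As a rough complement to a time-taken tracing rule, the W3C log can be scanned for slow entries. This only helps if the stuck requests are eventually written to the log and the time-taken field (in milliseconds) is enabled. The sketch below is a minimal Python example; the function name slow_requests and the sample log lines are made up for illustration.

```python
def slow_requests(log_lines, threshold_ms):
    """Yield (uri, time_taken_ms) for W3C log entries at or above threshold_ms."""
    fields = []
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the entries that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        entry = dict(zip(fields, line.split()))
        taken = entry.get("time-taken", "")
        if taken.isdigit() and int(taken) >= threshold_ms:
            yield entry.get("cs-uri-stem", "?"), int(taken)


# Invented sample data, not taken from the server in question.
sample = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2020-01-01 10:00:00 GET /logo.png 200 15",
    "2020-01-01 10:00:05 POST /orders/save 500 612345",
]
result = list(slow_requests(sample, 60000))
```

With a threshold of 60000 ms (the 1 minute used for the tracing rule), only the POST entry in the sample is reported.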

Debugging requests which are 'stuck' in an IIS worker process

In case of TL;DR - I basically need guidance regarding what tools are available to debug requests which are issued to IIS and which stall inside a module.
I have a problem with an old ASP 2.0 app at the moment whereby it will periodically become unavailable and recycling the app pool (horrible as that may be) doesn't bring it back up 100% of the time.
So first of all it presents itself as requests entering the app pool and being trapped in state 'BeginRequest' in RewriteModule.
It is not a specific request which is always the first to experience this issue. The issue cannot be easily recreated either.
Eventually requests join this backlog and when it becomes 70+ deep the app pool fails to respond to pings from WAS and it forcibly recycles. Predictably it doesn't stop on-time and the old app pool is forced to stop. When the new app pool comes up it either works just fine or it instantly experiences the same issue as the outgoing one and requests begin to queue.
In issues like this all the official guidance is understandably focussed around looking at why the RewriteModule may choke.
I have validated my redirections and though complex there are no obvious issues with syntax (XML validates).
Likewise in inetmgr loading up the URL Rewrite Module seems to parse the configs fine and show them visually.
Basic stuff like permissions is all fine.
When the app is working normally I also used Failed Request Tracing/Logging to look at the request pipeline for a sample URL which stalled and I can confirm that there is no circular logic or weird errors presenting - the request seems to be handled just fine. This also showed me how high up the rewritemodule is invoked and from this I really don't see how the issue could be app-related as .NET isn't invoked at this point.
Annoyingly, when an app pool is experiencing this issue and I can throw in requests which just stall, Failed Request Tracing is no good, because a request actually needs to get to the end of its journey and fail; otherwise it refuses to log anything.
I resorted to taking process dumps of affected w3wp.exe's and running them through DebugDiag. Unfortunately the only thing I see is that threads are open accessing the rewritemodule but precious little about what they are stuck on.
As anyone else would do I've tried to track the start of the issue back to any recently installed patches or code changes but nothing matches. Likewise this is happening on 3x servers otherwise I would try reinstalling the rewritemodule. Other sites on the same server which invoke rewritemodule are unaffected.
Has anyone else experienced issues like this - the net seems to have relatively little info in this case. Perhaps you can recommend further debugging tools or approaches for IIS which I can adapt to this scenario? This is sort of a cry for help from someone more used to Apache/Nginx - sorry for the long post.

iis startup delay with aspx pages

Environment: Windows Server 2003; IIS 6, ASP.NET 2.0.50727
I'm going crazy with a brand new web server that we set up (note that this problem doesn't happen on our other web servers which have the same configuration). When loading an ASP.NET app the first time, the page hangs for over a full minute before showing the page in the browser. After it loads the first page, everything runs very quickly.
Note 1: You will probably say that the application is being compiled for the first time. But I've ruled that out. I put trace messages EVERYWHERE in the app and all the trace messages run within a second of requesting the page. Thus, the app compiles and runs immediately. But when the app is finished rendering the page and my last trace message is printed, nothing happens. IIS is doing something behind the scenes for a full minute before transferring the finished page along http to the user's browser.
Note 2: We found that after hitting the app the first time and things run fine, if we wait an hour then we get the delay again. Thus, IIS has something in its cache that it clears out after an hour and causes our site to stall again.
Note 3: Between each test we stop/start IIS to force it to hang upon loading the app.
Note 4: We watched the Task Manager to see if IIS was spiking and taking up a lot of resources processing something. But that wasn't it. We did see a very quick spike to 50% immediately before the browser showed the page, but for the previous 60 seconds there was only 1% usage on the server.
Note 5: On another test I created a HelloWorld.html page and this does not cause IIS to hang. Thus, it has something to do with calling the ASP.NET library the very first time it sends a rendered page across http. Also, since the app has already been compiled and runs instantly, it's just the part of asp.net that sends the rendered page to the user's browser that causes the delay.
Any ideas? We are at a loss here. All of our other web servers are set up the same way and work fine, but this is a new install. So there must be a configuration setting that was missed, or maybe something needs to be installed?
Thanks,
Brian
If you have access to the servers, then make sure that app pool recycling is actually logged to the event logs
cscript adsutil.vbs get w3svc/AppPools/DefaultAppPool/LogEventOnRecycle
you can set it to log everything with
cscript adsutil.vbs Set w3svc/AppPools/DefaultAppPool/LogEventOnRecycle 255
See more here
Then check if there were any recycles.
App initialization (creating the worker process and threads, loading the app domain and all the referenced DLLs) can take some time, and that's normal, but a 1 minute delay is probably something else.
Try to precompile the app on the server and see if that helps
aspnet_compiler -m /LM/W3SVC/[site id ]/Root/[your appname]
If you want to dig deeper, you can check the event trace ETW.
logman query providers
Save the IIS /ASP.NET related Guids to a file like iisproviders.txt
logman start ExampleTrace -pf iisproviders.txt -ets -rt
reproduce
LogParser "SELECT * FROM ExampleTrace" -i:ETW
logman stop ExampleTrace -ets
You can find more here: Troubleshooting appdomain restarts and other issues with ETW tracing
I would also check the w3wp.exe with procexp if it has a TCP connection time out or with Procmon for other clues.
If you have experience with windbg, then you can make a request to the app then quickly attach the debugger to the process
windbg -p [process id of the app pool]
.loadby sos mscorwks
g
and take it from there. If there are exceptions, process crash, etc you should be able to catch it...
Once we had a weird server issue like this and a .NET reinstall solved the problem, still not sure what was the culprit.
Could be some aspnet.config settings on this box that are different from the others. Have you tried copying over their config files to this server? There appear to be certificate options, along with registry modifications, that you can use to remove some lag time during the initial load of a page (precompiling aside).
See here and here
One thing you might want to check is whether there is any database access going on in your page load. That might be blocking the creation of the page during the initial page load. Then, when the query is cached (either by the db engine or another cache mechanism like memcached), subsequent page loads work as normal.
As per your last comment,
I could stop/start IIS multiple times and the app always ran instantly. I thought it was fixed for good. But now I just tried again (it has been sitting idle for the past couple of hours) and now it is back to hanging on the first request.
This could mean that the cache has expired and thus needs to hit the database once again, causing the delay in page load.
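The pattern this answer describes (fast while an entry is cached, slow again after it expires) can be sketched with a small time-to-live cache. This is only an illustration of the hypothesis, not a diagnosis of the server. The ExpiringCache class, the one-hour TTL, and load_from_database are all hypothetical, and the clock is injected so the example does not need to wait.

```python
class ExpiringCache:
    """Tiny time-to-live cache; the clock is injected so no real waiting is needed."""

    def __init__(self, ttl_seconds, loader, clock):
        self._ttl = ttl_seconds
        self._loader = loader
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # fresh: served from memory
        value = self._loader(key)  # missing or expired: slow path
        self._entries[key] = (now, value)
        return value


fake_now = [0.0]
slow_loads = []


def load_from_database(key):
    slow_loads.append(key)  # hypothetical stand-in for a slow query
    return f"rows for {key}"


cache = ExpiringCache(ttl_seconds=3600, loader=load_from_database,
                      clock=lambda: fake_now[0])

cache.get("home")       # first request: slow path
cache.get("home")       # soon after: served from the cache
fake_now[0] = 2 * 3600  # the site sits idle for a couple of hours
cache.get("home")       # entry expired: slow path again
```

Here the loader runs on the first request and again after the idle period, but not in between.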

Asp.net slow first load per user

I have a website set up in IIS 7 with HTTPS, and every time a user accesses it for the first time the load time is about 15 sec.
THIS IS NOT the compile/warm up "problem" described for instance here: Slow first page load on asp.net site
I know about that "problem" and I also have that, but that is of course expected and not the issue here.
It does not only happen when the application loads for the first time after a recycle/start. If I open another browser and access the site after already having done so in a first browser, it takes the same amount of time. So it seems the delay happens every time a session is started. All following requests from the same user/browser are as quick as expected.
This is for an admin interface site I have, and I use ASP.NET membership. However, the delay happens even before the user has logged in, so I'm not sure if that is the culprit.
I am a bit unsure where to look for killing the delay. I am running session state in process. With cookies.
Any ideas?
You need to get a little more information. Enable trace and track how long each step takes. You could also use Wireshark and have a look at the traffic between the client and the server. If there is a big gap in traffic, you can see that something is hanging at the server's end. If you see constant traffic, perhaps you have too much going on with your landing page. Other simple things to do would be to enable dynamic caching/compression on the server to speed things up.
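Spotting a "big gap in traffic" can be done by eye in Wireshark, but with packet timestamps exported it is also easy to compute. The following is a minimal Python sketch; the function name largest_gap and the sample timestamps are invented for illustration.

```python
def largest_gap(timestamps):
    """Return (gap_seconds, index) for the longest pause between consecutive packets.

    index is the position of the packet that precedes the pause.
    """
    if len(timestamps) < 2:
        return 0.0, None
    gaps = [
        (later - earlier, i)
        for i, (earlier, later) in enumerate(zip(timestamps, timestamps[1:]))
    ]
    return max(gaps)


# Invented capture: seconds since the first packet of the connection.
packet_times = [0.00, 0.02, 0.05, 15.10, 15.12]
gap, index = largest_gap(packet_times)
```

In this made-up capture the pause of roughly 15 seconds follows the third packet (index 2). That is the kind of gap that would point at the server rather than at a heavy landing page.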
Warm it up...
http://learn.iis.net/page.aspx/688/using-the-iis-application-warm-up-module/

Validation of viewstate MAC failed

Ran into this issue yesterday on one of our sites. First of all the site is hosted in a web farm environment and for the time being I have added a static machineKey to the web.config on both nodes (2 node environment). This has solved the issue and everything is running fine now.
This raised the following question:
Why is it that all our other sites that run on this environment do not require this (machineKey in the web.config)?
I checked event logs to make sure that we are not having the same issue on other sites and everything looks fine. I also confirmed that the app pool is not recycling too often and everything was normal with regards to app pool settings.
The only explanation I can come up with is that the site is rendered by one node and subsequent postbacks go to another node, which would lead me to believe that the problem lies with the load balancer. Our infrastructure guys tell me that everything is as it should be with regards to the load balancer and that the scenario I am proposing will not happen.
Am I missing the obvious here, or is there anything else that I can consider?
Thanks in advance
Basically, yes, you're right - you generally see this in a web farm environment when "Sticky Sessions" aren't properly configured in the load balancer, and the user's postback is sent to a different server.
To be fair to your network guys, it's possible that most requests are being sent to one server, but that this application is tipping the usage such that requests are often sent to another server - but you should be seeing that across all sites, unless the traffic patterns are completely different.
The other possible cause is one of two things. Either your page is taking too long to load and users are posting back before it has completely finished loading (I managed to get one of my sites doing that with a couple of remote advert calls buried halfway through the page load), or users are waiting too long between page render and postback and the session on the load balancer is timing out, so it treats the postback as a new request.
If you are working with a web-farm environment, machine key values, if specified in the web.config need to be synced. In addition, you will want to make sure that the machine key values in the machine.config file are also synced between the two.
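Why mismatched keys produce a MAC validation failure can be shown with a generic keyed hash. The sketch below uses Python's hmac module with SHA-256 purely as an illustration. It is not ASP.NET's actual implementation, and the key values, payload, and function names are made up.

```python
import hashlib
import hmac


def sign(view_state: bytes, key: bytes) -> bytes:
    """Compute a keyed hash (MAC) over the serialized page state."""
    return hmac.new(key, view_state, hashlib.sha256).digest()


def validate(view_state: bytes, mac: bytes, key: bytes) -> bool:
    """Recompute the MAC with this node's key and compare in constant time."""
    return hmac.compare_digest(sign(view_state, key), mac)


# Hypothetical keys: each node generating its own versus one shared static key.
node_a_key = b"auto-generated-key-on-node-a"
node_b_key = b"auto-generated-key-on-node-b"
shared_key = b"static-key-in-both-web-configs"

payload = b"serialized-page-state"

# Page rendered on node A, postback lands on node B with a different key.
different_keys_ok = validate(payload, sign(payload, node_a_key), node_b_key)

# Both nodes configured with the same static key.
shared_keys_ok = validate(payload, sign(payload, shared_key), shared_key)
```

With different keys per node the check fails as soon as a postback lands on the other node. With a synced key it passes regardless of which node rendered the page.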

Resources