IIS Worker process hangs forever on first request - asp.net

I am working on solving a problem that I have had for a couple of days now. Every time one of my sites is rebuilt or the AppPool is recycled, the first page load hangs forever (well, I've only waited up to 30 minutes). It only happens on one particular site out of ~10 sites. It is an ASP.NET site.
Here are the things I have observed:
In IIS Manager under Worker Processes I can see the request. Verb = GET, State = ExecuteRequestHandler, Module Name = ManagedPipelineHandler. Time Elapsed just keeps increasing, of course.
If I close down the browser in which I made the initial request and then open a new one to make another request, the page will load instantly.
In my code the Application_Start of my Global.asax file is not called on the first request. It is called on the second request.
The worker process is causing the memory usage on my machine to go through the roof.
I'm inexperienced in troubleshooting IIS, and hours and hours of searching have led me nowhere.
The only major code change we have made on the site recently is that we have started implementing logging using log4net. I have tried removing all log4net code, both from my web.config file and from Global.asax, but still no luck.
Has anyone else experienced this and if so how did you solve it?
Any and all help will be much appreciated.
ADD:
If I place a .txt file in the root of the site and load that as the first thing after a build it will load instantly.
However, the worker process still behaves exactly as before and the memory usage still goes through the roof.
Final edit:
I feel like such an idiot. I can't explain why, but for some reason my break points in Global.asax suddenly got hit and I was able to identify the problem. It was a call to a database via Entity Framework that was badly written - i.e. the filtering was done after all the rows from the column in question had been fetched. And to make it worse, the filtering was done inside a foreach loop. Anyway, now everything is back to normal and I'm happy.
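The original query is not shown, so the following is only a minimal sketch of the pattern described above, not the asker's code. The `Order` type, its properties, and both method names are hypothetical. It contrasts fetching every row and filtering inside a `foreach` loop with keeping the filter on the `IQueryable`, which is what gives a LINQ provider such as Entity Framework the chance to translate it into a SQL `WHERE` clause.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity, used only for illustration.
public class Order
{
    public int Id { get; set; }
    public int CustomerId { get; set; }
}

public static class OrderQueries
{
    // Slow shape: ToList() materializes every row first,
    // then the loop filters the rows in memory.
    public static List<Order> ForCustomerSlow(IQueryable<Order> orders, int customerId)
    {
        var result = new List<Order>();
        foreach (var order in orders.ToList())
        {
            if (order.CustomerId == customerId)
            {
                result.Add(order);
            }
        }
        return result;
    }

    // Preferred shape: Where() is applied to the IQueryable before ToList(),
    // so the query provider sees the filter and can run it at the source.
    public static List<Order> ForCustomer(IQueryable<Order> orders, int customerId)
    {
        return orders.Where(o => o.CustomerId == customerId).ToList();
    }
}
```

Both methods return the same rows. With a database-backed `IQueryable` the difference is how many rows cross the wire and how much memory the worker process needs to hold them.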

Possibly stating the obvious but you haven't got any silly code in your global asax in the app_start that could be causing this?
Sounds like an infinite loop or something?

Just a quick note on what happened in my case:
Neither Process Monitor nor Failed Request Tracing was of any help. The website simply loaded (nearly) forever.
Finally, after waiting for several minutes, an error occurred stating that it "cannot locate the network path".
The reason was that I had entered a connection string pointing to a non-existent SQL Server instance, so it kept searching for the server until a timeout finally occurred.
The solution was to simply specify the correct SQL Server in the connection string inside Web.Config.

Related

Some requests on IIS hang for minutes and end in a lost connection

I have an awkward issue with IIS 10.0 on Windows Server 2016 and ASP.Net 4.5.2 and MVC 5.2.7.
At times, certain requests do not receive a response and run for minutes, maybe 10 or so, before ending in a lost connection (PR_CONNECT_RESET_ERROR in Firefox on Windows, NSURLDomainError in Firefox on iOS). These are mostly POST requests. When this issue occurs, other GET requests still receive a swift response and a correct result. Normally, POST requests do not take long to be processed, typically less than 3 seconds.
Recycling the associated worker process will make the issue go away, for hours or days.
When I inspected the web server today while the issue was going on, I saw little CPU usage (less than 10%), memory at 56%, and the worker process at a modest 615 MB. I saw no entries for these requests in the W3C log, nor in my custom application logs.
I added the Web-Request-Monitor feature, following "How do I see currently executing web request on IIS 8", but in doing so the worker process probably got recycled, as the issue is not currently occurring.
There are a reverse proxy and an access manager between the internet and my web server. I suppose they can have something to do with this issue, but it certainly is related to IIS, as recycling helps.
All of this is happening on an acceptance web server running a newer version of my application. I am not aware of any big changes to the application's architecture that could be involved. Also, there will be very little traffic from other clients, if any at all.
What could be next steps to investigate this issue further?
Update
This issue was definitely caused by log4net. However, it was not related to the log4net.Internal.Debug setting. It was caused by two application domains accessing the same log file. This occasionally resulted in concurrency issues with accessing the log file. It appeared that log4net could not properly handle this and got stuck while writing to the log file.
This log file was configured with the RollingFileAppender option. Since we also used AdoNetAppender, we decided to remove file logging all together.
Original
I have found a probable cause. I'll report the steps I took to investigate the issue.
I activated the Worker Processes feature in IIS.
When, after a couple of days of waiting, the issue started again, I found long running requests. They all had State ExecuteRequestHandler and Module Name ManagedPipelineHandler. They had Time Elapsed of hundreds of seconds.
I also activated Failed Request Tracing with a rule for long-running requests with a Time Taken of 1 minute.
After a couple of days, I started to receive failed request reports. The failed requests all have a GENERAL_SET_RESPONSE_HEADER event as their last event.
I added additional debug logging events for each request. When debugging in my development environment, at one point, I started to see the hanging behaviour there, on one of the new logging statements(!). The application uses log4net.
I captured a stack trace:
log4net.dll!log4net.Appender.AppenderSkeleton.DoAppend(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Util.AppenderAttachedImpl.AppendLoopOnAppenders(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Repository.Hierarchy.Logger.CallAppenders(log4net.Core.LoggingEvent loggingEvent)
log4net.dll!log4net.Repository.Hierarchy.Logger.Log(System.Type callerStackBoundaryDeclaringType, log4net.Core.Level level, object message, System.Exception exception)
log4net.dll!log4net.Core.LogImpl.DebugFormat(string format, object arg0)
The DoAppend method uses lock(this), which may very well cause hangs.
I also found out that the config setting log4net.Internal.Debug was set to true, which I do not want under normal circumstances and this may be related. I did not attempt to understand the log4net code, but I remember that logging initially did not work, in the acceptance environment, so the setting may very well have been set to true then, causing the issue to start.
Another indication that this is happening with log4net is that when the issue last occurred, I realized that logging at the standard level only occurs in some POST requests. I found a POST request that does not log, and requests to it were handled normally, while the other POST requests still hung.
For now, I have set log4net.Internal.Debug to false and will wait to see what happens.
The fact that recycling IIS fixes the issue does not mean that this is an IIS issue, because all ASP.NET applications run in the .NET runtime. It is only an IIS issue if the request is shown to hang in an IIS module.
So you may need to wait for the issue to happen again, then create a Failed Request Tracing rule on time-taken. That will tell us whether the request hangs in an IIS pipeline module or in the .NET runtime.
If all requests hang in the .NET runtime, you may have to capture a hang dump and do a deep analysis with WinDbg and the MEX extension. That will tell us what is happening there.

Debugging requests which are 'stuck' in an IIS worker process

In case of TL;DR - I basically need guidance regarding what tools are available to debug requests which are issued to IIS and which stall inside a module.
I have a problem with an old ASP 2.0 app at the moment whereby it will periodically become unavailable and recycling the app pool (horrible as that may be) doesn't bring it back up 100% of the time.
So first of all it presents itself as requests entering the app pool and being trapped in state 'BeginRequest' in RewriteModule.
It is not a specific request which is always the first to experience this issue. The issue cannot be easily recreated either.
Eventually requests join this backlog and when it becomes 70+ deep the app pool fails to respond to pings from WAS and it forcibly recycles. Predictably it doesn't stop on-time and the old app pool is forced to stop. When the new app pool comes up it either works just fine or it instantly experiences the same issue as the outgoing one and requests begin to queue.
In issues like this all the official guidance is understandably focussed around looking at why the RewriteModule may choke.
I have validated my redirections and though complex there are no obvious issues with syntax (XML validates).
Likewise in inetmgr loading up the URL Rewrite Module seems to parse the configs fine and show them visually.
Basic stuff like permissions is all fine.
When the app is working normally I also used Failed Request Tracing/Logging to look at the request pipeline for a sample URL which stalled and I can confirm that there is no circular logic or weird errors presenting - the request seems to be handled just fine. This also showed me how high up the rewritemodule is invoked and from this I really don't see how the issue could be app-related as .NET isn't invoked at this point.
Annoyingly, when an app pool is experiencing this issue and I can throw in requests which just stall, Failed Request Tracing is no good, because a request actually needs to get to the end of its journey and fail; otherwise it refuses to log anything.
I resorted to taking process dumps of affected w3wp.exe's and running them through DebugDiag. Unfortunately the only thing I see is that threads are open accessing the rewritemodule but precious little about what they are stuck on.
As anyone else would do I've tried to track the start of the issue back to any recently installed patches or code changes but nothing matches. Likewise this is happening on 3x servers otherwise I would try reinstalling the rewritemodule. Other sites on the same server which invoke rewritemodule are unaffected.
Has anyone else experienced issues like this - the net seems to have relatively little info in this case. Perhaps you can recommend further debugging tools or approaches for IIS which I can adapt to this scenario? This is sort of a cry for help from someone more used to Apache/Nginx - sorry for the long post.

Strange Code first Migration behavior, or IIS issue?

Ok, so the background is this.
I have created a hardware controller for a fingerprint reader, and a web application that allows users who have scanned in to do things in the web application. The web application was created using Code First, and the communication is done through SignalR 2.0. The problem I am having is this: everything works beautifully for about a day. It used to be about half a day, but in IIS 7.0 I changed the idle timeout on the application pool to 200 minutes, which extended the time it stays running. I am still getting an error at random times on the web server, though. What confuses me, and why I cannot seem to get a handle on what is happening, is that when it does go down:
A) I do not know why? (I am leaning towards a timeout somewhere)
B) The error message is the same one you get when you make a change to the database structure and forget to run Update-Database from the Package Manager Console, yet no one is changing the database.
C) If you leave it alone it will fix itself, and I do not know why or how.
Has anyone seen behavior like this? and if so what caused it and how did you fix it? Or can anyone offer how I can debug this?
Thanks so much for any help!
Kelso
If the exception is "The model backing the 'YourContext' context has changed since the database was created. Consider using Code First Migrations to update the database" you could try to catch that exception and log the content of the following method and compare it to the return value of the method in Application_Start or whenever it worked for you.
((System.Data.Entity.DbContext)(context)).InternalContext.QueryForModel(0)
The method gives you an XML representation of your database schema.
Just to update on this issue: it turns out that the IIS server had been set to only a single CPU and a single thread (a VMware setting). That thread was getting hung, and the server could not create a new thread to continue processing. Once we increased the number of CPUs and raised the thread count to 5, everything worked like a dream.

sporadic ASP.NET data error: "Cannot find table 0"

Having deployed a new build of an ASP.NET site in a production environment, I am logging dozens of data errors every second, almost always with the error "Cannot find table 0." We use datasets and frequently refer to Tables[0], and while I understand the defensive coding practice of checking the dataset for tables before accessing Tables[0], it's never been a problem in the past. A certain page will load fine one second, and then be missing one of its data-driven components the next. Just seeing if this rings a bell for anyone.
More detail: I used a different build server this time, and while I imagine the compiler settings are the same on both, I have a hard time thinking that there's a switch that makes 50% of my database calls come back with no tables. I also switched the project to VS 2008, but then reverted all of those changes when I switched back to VS 2005. I notice that the built assembly has a new MyLibrary.XmlSerializers.dll, where it didn't used to, but I also can't imagine that that's causing all the trouble. (It also doesn't fall down on calls to MyLibrary, or at least no more than any other time.)
Updated to add: I've discovered that the troublesome build is a "Release" build, where the working build was compiled as "Debug". Could that explain it?
Rolling back to the build before these changes fixed it. (Rebooting the SQL Server, the step we tried before that, did not.)
The trouble also seems to be load-based - this cruised through our integration and QA environments without a problem, and even our smoke test environment - the one that points to production data - is fine under light load.
Does this have the distinguishing characteristics of anything you might have seen in the past?
Bumping this old question because we have encountered the same issue and perhaps our solution would give more insight in what causes this.
Essentially this problem occurs in a production environment that is under very heavy load in a Windows service that uses multiple threads to process several jobs simultaneously (100 users use the same DB via ASP.NET web app and there are about 60 transactions/second on older hardware with SQL Server 2000).
No variables are shared, that is connections are opened anew, transaction is started, operations executed, transaction committed and connection closes.
Under heavy load sometimes one of the following exceptions occurs:
NullReferenceException: Object reference not set to an instance of an object.
at System.Data.SqlClient.SqlInternalConnectionTds.get_IsLockedForBulkCopy()
or
System.Data.SqlClient.SqlException:
The server failed to resume the transaction. Desc:3400000178
or
New request is not allowed to start because it should come with valid transaction descriptor
or
This SqlTransaction has completed; it is no longer usable
It seems somehow the connection that is within the pool becomes corrupted and remains associated with previously used transactions. Furthermore, if such connection is retrieved from pool then sqlAdapter.Fill(dataset) results in an empty dataset, causing "Cannot find table 0". Because our service would retry the operation (reading job list) on failure and it would always get the same corrupt connection from the pool it would fail with this error until restarted.
We removed the issue by calling SqlConnection.ClearPool(connection) on exception, to make sure the connection is discarded from the pool, and by restructuring the application so that fewer threads access the same resources simultaneously.
I have no clue what exactly caused this issue, so I am not sure we have really fixed it; maybe we just made it so rare that it has not occurred again yet.
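A minimal sketch of that workaround, assuming the classic `System.Data.SqlClient` provider. The `JobReader` class, the `ReadJobs` method, and the empty-result check are hypothetical and only illustrate where `SqlConnection.ClearPool` could be called. The answer's real code is not shown.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class JobReader
{
    // Hypothetical helper: fills a DataSet and, on any failure,
    // clears the connection's pool so a retry gets a fresh connection.
    public static DataSet ReadJobs(string connectionString, string sql)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            try
            {
                connection.Open();
                using (var adapter = new SqlDataAdapter(sql, connection))
                {
                    var dataSet = new DataSet();
                    adapter.Fill(dataSet);
                    if (dataSet.Tables.Count == 0)
                    {
                        // Treat an unexpectedly empty result as a failure
                        // instead of letting Tables[0] fail later.
                        throw new InvalidOperationException("The query returned no tables.");
                    }
                    return dataSet;
                }
            }
            catch (Exception)
            {
                // Discard the pooled connections associated with this
                // connection, then let the caller see the error.
                SqlConnection.ClearPool(connection);
                throw;
            }
        }
    }
}
```

Rethrowing after clearing the pool keeps the original error visible, so a retry loop in the caller can log it before trying again.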
I've fought precisely this error message before. The key is that an underlying data method is swallowing a timeout exception.
You're probably doing something like this:
var table = GetEmployeeDataSet().Tables[0];
GetEmployeeDataSet is swallowing an exception, probably a timeout exception, which is why it only happens sporadically - it happens under load. You need to do the following to fix it:
Modify the underlying code to not swallow the exception, but rather let it bubble up to the next level so you can identify it properly.
Identify the query(s) causing the problem, and then rewrite, reindex, denormalize or throw hardware at the problem. See this for more info: System.Data.SqlClient.SqlException: Timeout expired
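To make the first step concrete, here is a minimal sketch of the two shapes. `EmployeeData`, `LoadSwallowing`, `LoadPropagating`, and the `Func<DataSet>` parameter are hypothetical stand-ins for whatever `GetEmployeeDataSet` does internally.

```csharp
using System;
using System.Data;

public static class EmployeeData
{
    // Anti-pattern: any failure (for example a timeout) is hidden and an
    // empty DataSet is returned, so the caller later fails on Tables[0].
    public static DataSet LoadSwallowing(Func<DataSet> query)
    {
        try
        {
            return query();
        }
        catch (Exception)
        {
            return new DataSet();
        }
    }

    // Alternative: add context and keep the original exception as the
    // InnerException so the real cause shows up in the logs.
    public static DataSet LoadPropagating(Func<DataSet> query)
    {
        try
        {
            return query();
        }
        catch (Exception ex)
        {
            throw new InvalidOperationException("Loading employee data failed.", ex);
        }
    }
}
```

With the first version, a timeout surfaces only as an index error on `Tables[0]`. With the second, the log entry carries the timeout itself.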
I've seen something similar. I believe our problem had to do with failed sessions being re-used (once the session object failed it went into a poor state and could not recover.) We fixed it by increasing the memory for the session pool and increasing the frequency of the web application recycling.
It also was "caused" by a new version that at first blush did not seem to have any change to cause such an effect. However, eventually it became clear that the logic of the program was opening and closing a lot more connections (maybe 20% more) than it used to. This small change pushed the limit of our prior configuration.
You might check the SQL Server logs for errors. Or, the Web server event log. It sounds like your connection pool could be out of open connections or your db could be out.
Which database calls changed between versions?
The error is obviously telling you one of your database calls isn't returning any data on occasion; I can't think of any cases where a code/assembly issue would cause it.
I have seen something like this when doing something with nHibernate Sessions in a non-thread-safe manner. That would explain why you only see it under load. Would need to see your code to guess at what isn't thread-safe though.

ASP.NET Lifecycle and long process

I know we need a better solution, but we need to get this done this way for right now. We have a long import process that's fired when you click the start import button on an .aspx web page. It takes a long time, sometimes several hours. I changed the timeout and that's fine, but I keep getting a connection server reset error after about an hour. I'm thinking it's the ASP.NET lifecycle, and I'd like to know if there are settings in IIS I can change to make this lifecycle last longer.
You should almost certainly do the long-running work in a separate process (not just a separate thread).
Write a standalone program to do the import. Have it set a flag somewhere (a column in a database, for example) when it's done, and put lines into a logfile or database table to show progress.
That way your page just gets the job started. Afterwards, it can self-refresh once every few minutes, until the 'completed' flag is set. You can display the log table if you want to be sure it's still running and hasn't died.
This is pretty straightforward stuff, but if you need code examples they can be supplied.
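As one possible shape for this, here is a minimal sketch that uses a status file as the flag rather than a database column, which is an assumption made to keep the example self-contained. `ImportJob`, its methods, the importer executable, and the status-file convention ("running" / "completed") are all hypothetical.

```csharp
using System.Diagnostics;
using System.IO;

public static class ImportJob
{
    // Called from the page: marks the job as running and launches the
    // standalone importer, passing it the status file path as its argument.
    public static void Start(string importerExePath, string statusFilePath)
    {
        File.WriteAllText(statusFilePath, "running");
        var startInfo = new ProcessStartInfo(importerExePath, "\"" + statusFilePath + "\"")
        {
            UseShellExecute = false
        };
        Process.Start(startInfo);
    }

    // Called on each page refresh: the importer is expected to overwrite
    // the status file with "completed" when it finishes.
    public static bool IsCompleted(string statusFilePath)
    {
        return File.Exists(statusFilePath)
            && File.ReadAllText(statusFilePath).Trim() == "completed";
    }
}
```

The page calls `Start` once and returns immediately. Each later refresh only calls `IsCompleted`, so no request stays open for the duration of the import.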
One other point to consider, which might explain the behaviour, is that aspnet_wp.exe recycles if too much memory is being consumed (do not confuse this with the page life cycle).
If your long process is taking up too much memory, ASP.NET will launch a new process and reassign all existing requests. I would suggest checking for this. You can do so by looking at aspnet_wp in Task Manager and checking the memory size being used. If the size suddenly goes back down, it has recycled.
You can change the memory limit in machine.config:
<system.web>
<processModel autoConfig="true"/>
</system.web>
Use memoryLimit to specify the maximum allowed memory size, as a percentage of total system memory that the worker process can consume before ASP.NET launches a new process and reassigns existing requests. (The default is 60)
<system.web>
<processModel autoConfig="true" memoryLimit="10"/>
</system.web>
If this is what is causing a problem for you, the only solution might be to have a separate process for your long operation. You will need to setup IIS accordingly to allow your other EXE the relevant permissions.
You can try running the process in a new thread. This means that the page will start the task and then finish the page's processing, but the separate thread will still run in the background. You won't be able to have any visual feedback, though, so you may want to log progress to a database and display that in a separate page instead.
You can also try running this as an ajax call instead of a postback which has different limitations...
Since you recognize this is not the way to do this, I won't list alternatives. I'm sure you know what they are :)
Extending the timeout is definitely not the way to do it. Response times should be kept to an absolute minimum. If at all possible, I would try to shift this long-running task out of the ASP.NET application entirely and have it run as a separate process.
After that it's up to you how you want to proceed. You might want the process to dump its results into a file that the ASP application can poll for, either via AJAX or having the user hit F5.
If it's taking hours I would suggest a separate thread for this and perhaps email a notification when it is ready to download the result from the server (i.e. send a link to the finished result)
Or, if it is important to have a UI in the client's browser (if they are going to be hanging around for n hours), then you could have a WebMethod which is called from the client (JavaScript) using setInterval to periodically check if it's done.
