I am sorry if the question is too vague. I am not supposed to copy any logs or other information from my work place to public. But here is the question:
In my organization, we have a CA APM team to monitor application performance. One of our applications that is idle (No users as of now as the official release is in next couple of months) is showing the w3wp:%Time in GC as 89 which is higher than the set threshold of 80. From developers perspective the code is not executed but the CA APM tells that this is from our app pool and the server is dedicated for our app alone. Can an idle asp.net application cause a problem like this ? Infrastructure team simply pushes it on to developers and developers are clueless because in their perspective their code is not executed. Any advice, insight in to this topic is highly appreciated.
I know that this post is almost month old but still would like to know the reason. My best 2 guesses are:
If there's completely no traffic going to website and GC handles are
still high there might be a scheduled job that's leaking memory.
If not there might be other apps on same machine that are fighting
over memory and system forces gc also on your idle app and since your
app is idle and is gc is being constantly forced the GC% ratio is
high.
Related
Server Specks
Microsoft Windows Server 2003 Enterprise Edition SP2
IIS 6
.net4
Intel(R) Xeon(R) CPU
X5680 # 3.33GHz, 2.00GB of RAM
Physical Address Extension
I am having trouble finding the cause of our server's random downtime. Our clients inform us that their website goes down for hours at a time. Sometimes users are able to log in however the site is extremely slow/unstable and unusable. Sometimes users are not able to log in at all. When users are able to log in not all images are displayed (they get the image not found image).
We upgraded their website from .net1 to .net4 because we thought the cause of their downtime and random user log out was due to them running their website on .net1. The website was running fine with no issues for a few months.
The first time the server started to go down after that was due to the drive with which the website resided on running out of disk space. There was 40GB partitioned to this drive and 20GB was added. This didn't resolve the issue for very long.
The second time the server would randomly go down, I noticed in the Event viewer, that the web worker associated with the app pool used by the website would periodically require to be recylcled. That is, in the Security tab of the Event Viewer I would periodically see an event with ID 1074 reading 'A worker process with process id of '1540' serving application pool 'Net4' has requested a recycle because the worker process reached its allowed processing time limit.'. I then went into this app pool's properties and saw that the app pool would be recycled every 29 hours, which is the default. I modified this to have the app pool recycle every day at 3:00am. Since that we have not seen this event in the Event Viewer. We were able to catch the website during one of its downtimes before this was changed and recycled the app pool manually. This resolved the issue in this one instance.
This did not permanently fix the issue however, as we are still receiving emails from our client informing us that the website is down for hours at a time.
I then set up a performance monitor counter log. We have managed to monitor the server's performance during many of these downtimes. It does not appear to be a problem with memory as there is plenty of space on the drive. It does not appear to be a memory leak or related to excessive paging as there are no running processes which take up an excessive amount of % Processor Time and the Pages/Second Memory counter does not peak at an excessive amount during most of the downtime (I'll explain why excessive paging occurs later). The total IO Data Bytes/sec and IO Other Data Bytes/sec Process counter does not appear to be usually high or low during downtime. The total Thread Count and Handle Count Process counter do not exhibit any abnormal spikes or drops during this time. The total thread count, at a given time, seems to be between 600 and 900, give or take. The total handle count, at a given time, seems to be between 15,000 and 23,00, give or take. The % Time in Jit .NET CLR jit counter for instance w3wp is 0 for about half of the time and will randomly peak at almost 100 the other half, most of the time peaking for just a moment but rarely peaking for about 10 minutes, unrelated to downtime.
There are random times throughout the day where the process dsmcsvc takes up most, if not all, of the % Processor Time. This is a process run by the Symantec Antivirus software. When this process takes up the % Processor Time there is a corresponding event in the Event Viewer signifying that a new virus definition file has been uploaded that is, an Application event with ID 7 'New virus definition file loaded. Version: #version number#'. When this event occurs, the Pages/Sec counter spikes. Sometimes it spikes to only 200-300 but will at times peak over 10,000. This event seems to be completely unrelated to website downtime. I have researched the Symantec Antivirus software and found that there is a known memory leak in old versions of this software. I have found that this software is known to cause high memory usage when the link to a process called NavLogon.exe is broken/does not exist. This process does not appear to exist on the server so I currently have no way of restoring the link to it. I also found that this software uses Crypt32.dll and that old versions of Crypt32.dll have a known memory leak. The Crypt32.dll which exists on the server was last updated in 2007.
The Performance Monitor log monitors the total Sessions Active ASP.Net Applications counter. During downtime, the total number of sessions does not exhibit any abnormal behavior, there are a normal amount of active sessions during this time. Active sessions at a given time can be between 0 and 200. I was informed that the time when the most users are active is during 1st shift, however during about 10pm and 2am every day, this number peaks.
The site runs JavaScript client side, and Visual Basic.net server side. All users have about 10-15 session variables almost all of the time.
When the site goes down there are no events which seem to correspond to its downtime in the Event Viewer.
I also have set up a W3C Extended Log File Format log for this site. During downtime there seems be an excessive amount of GET requests for a Telerik.RadUploadProgressHandler.ashx.
I have seriously run out of ideas at this point and have extensively searched the web for solutions and come up empty. Any feedback as to why this may be occurring would be great.
It does not appear to be a problem with memory as there is plenty of space on the drive.
Really? Memory and hard drive space are two completely different things. 2GB of RAM was okay a decade ago, when that server was new, but today it's laughably small.
But don't bother upgrading or adding RAM. This server is old enough, the problem is probably just that the hardware is reaching the end of it's useful life. Additionally, the operating system is also nearing it's end of life. Server 2003 is scheduled for end of life on July 14, 2015. After that date, there will be no new patches of any kind produced for Server 2003... not even critical security patches. That will make Server 2003 completely unsuitable as a web server.
This seems like a good time to execute a transition to a completely new server.
I have a strange case where one of my applications is causing the IIS (7.0) request queue to fill up. Requests are not being terminated after 30s as they should be. This then takes all DB connections from the pool and renders that app useless (other apps are unaffected).
I have no idea a) Why they are stalling in the first place, and b) why IIS is letting them sit there stalled rather than killing them. I would guess my app is locking something, perhaps something the GC is trying to reclaim.
My question is where do I start on debugging such an issue? I have no idea. It's currently happening only in production, but reasonably regularly (maybe once every 4 hours) on all web servers.
PS: There is potentially an argument that this question is better on serverfault than on SO, but given that I think this is a development problem with the app rather than an admin one, I have started on SO for now. I am however happy to re-post there if needed.
For reference using WinDbg was the solution I used. I attached WinDbg to w3wp process for the app pool once requests had queued. I could then view the call stack in each process, and although different most of them were sat waiting on a lock inside ResourceManager.
I still don't know why it was locking there, I thought ResourceManager was thread safe. I re-wrote some code to cache the ouput of ResourceManager in another class and that seems to be avoid the lock.
I'm trying out Windows Azure free trial - as I understand, you can host up to 10 websites on the free account.
My question is: is there any way of hosting a website along with some kind of background processing or scheduled task with the free account on Azure? I'm almost sure that it's not because Web roles support that and not Web sites.
Is there any other alternative to host an ASP.NET MVC website with some kind of background processing on Azure for free? Everything would be purely for educational or personal purposes.
The free sites in Windows Azure Web Sites can technically run some background operations because you can spin up a background thread in the application start up; however, there are a several issues with this approach:
Idle sites will be shut down. This means that if the site isn't seeing a lot of traffic the process can be shut down. I'm not sure that the background processing would keep it alive, or even if it did how reliable that would be kicking off. It will depend heavily on the type of background processing I would think.
The web sites at the free level have CPU and memory quotas. Running something a lot in the background may cause you to hit these quota more often than if the site is more idle. Hitting the quota will shut down the site until a specific time period has past. Be very aware of these quotas if you are using the free or shared levels. If you were planning to have this background processing working a queue for instance this likely won't work out well.
You could use something like the free level of the Scheduler in the Windows Store app to kick off some work by having it call in to your web site and asynchronously kick off the back ground work. This might work and avoid the CPU quota if the amount of background work is pretty small and is completed quickly and with little resources used. Note that there is also a scheduler available with Mobile Services. For educational purposes this may just fine.
Don't forget that with the free trial you get $200 worth of Azure for 30 days. If you are really just trying things out you can easily spin up full cloud services, VMs or even shared web sites during the trial. If you shut them down when not actively working on them the $200 can give you a decent amount of time to try things out.
I have a web app that will run forever (at least for a few days) on my local machine using the technique (hack?) described in Jeff Atwood's post: https://blog.stackoverflow.com/2008/07/easy-background-tasks-in-aspnet/
However when I run it on App Harbor my app doesn't run for more than an hour or so (I'm not sure when it dies) as long as I hit the site it stays up so I'm assuming it is being killed after an idle period, but I'm not sure why.
My app doesn't save any state or persist anything. It makes web service calls and survives errors in any calls.
I turned on a ping service to keep my app alive but I'm curious why this works on my local machine but not on App Harbor?
The guys behind App Harbor pays for EC2 instances for all running apps, so they naturally want to limit the cpu usage as much as possible. One way to achieve this is to shut down unused applications very fast and only restart them when someone actually try to access them. Paid hosting should not be limited in this way.
(As far as I have been informed they are able to host around 100k sites on less than twenty medium instances which is certainly quite impressive and calls for a very economic use of resources.)
To overcome the limitation you would need a cron job to ping your app harbour site. But this of course a quite recursive problem since you need app harbour to act as a cron job ;)
AppHarbor recycles the Application Pool frequently to keep sleeping websites from using idle CPU time. This is simply the price you pay of using a shared website hosting plan.
If you really want to run a background job then you should be using AppHarbor's background workers, since this is exactly the type of task they were built to run.
http://support.appharbor.com/kb/getting-started/background-workers
Simply build a new console application that runs your logic and include it in your solution. When you push the code the workers will be started automatically. If you happen to already have other exe's in your solution make sure to edit the app.config and set the 'deploy background worker' value to false.
I have inherited a somewhat complex system (and problem) that I need help with.
I have a webserver w/ the following specs:
Hardware:
Server 2003 32bit
IIS 6
8 cores (16 w/ hyperthreading)
12gb RAM
ASP.NET site
3 app pools, so 3 instances of w3wp.exe running.
This system serves a large number of people and bandwidth is fairly constant during business hours reaching ~ 68,000kbit/s
There are moments when the system "comes down" - site gets very slow which generates a lot of phone calls. Things usually slow down for 60 seconds, but has varied greatly in length. Sometimes only a few seconds and sometimes 3 minutes or more.
I have my app pools set to recycle somewhere about 600mb of consumed memory. That's not exact but they recycle on their own with much success. At times I recycle the "main" pool manually to clear the problem I'm describing.
This is what I know is going on when things are running slow.
Network bandwidth takes a considerable dip.
Requests Queued in the ASP.NET performance counters goes up.
In tandem w/ the Requests Queued rising page latency increases. (I employ a simple ASP page that makes a SQL call and just says "The system is live" - this page is monitored for latency)
Overall CPU usage rises.
Overall memory consumption of w3wp.exe rises.
In my mind here is what I imagine is happening.
Someone asks the system to generate a report or glob of data. This spins up a process that consumes a large number of threads (ie, all available threads) This causes all other requests to the system to wait in the ASP.NET que pool which essentially kills the site. The lack of activity causes the network traffic to dip.
I have read many articles about thread queues, thread pools, etc. This is a good example: http://williablog.net/williablog/post/2008/12/02/Increase-ASPNET-Scalability-Instantly.aspx and does what I believe is a clue to help me solve my problem... but I'm not sure. My "Machine.config" file for the version of asp.net that I am using does not specify any of the thread values listed in the article so we are default for everything which I believe is incorrect given our situation.
If you were me; What would you do next? Where do you think the problem is?
edit: Here is a screenshot. It should be obvious when the problem is happening.
http://i.imgur.com/5BJlq.png
edit:
I want to change these values for our setup. A few questions first:
1) After making the changes, what needs to be restarted for them to take effect?
2) How do these settings look for a system with 8 physical cores?
maxconnection = 96
maxIoThreads = 100
maxWorkerThreads = 100
minFreeThreads = 704
minLocalRequestFreeThreads = 608
Not fun.
Many root causes share common symptoms which makes it difficult to diagnose without getting dirty with the application. :) Pardon if some of these steps were implied.
Some next steps might be:
Review the IIS logs of each site looking for things like:
HTTP response codes (5xx,4xx,3xx)
Request response times
Review Windows Event Logs
How often are application pools cycling?
Application errors, etc.
Verify processModel settings as suggested by #vinayc to make sure predecessor didn't get 'tricky'
Install DebugDiag, its a surprisingly good tool for some basic analysis of memory and crash related problems.
This can also help you capture memory snaps to diagnose later.
Tess Ferrandez blog can help make heads/tails of memory snap analysis.
Understand how many web applications are running in each AppPool.
Investigate using a 'web garden' to possibly help minimize number of users impacted by 'slow down'
Is a virus scanner enabled? Is it running? If so, verify exclusions.
Are application teams available to help troubleshoot? Identify if they have any custom application instrumentation that might help diagnose problem.
Is the behavior 'new'? Or has it always been there? If 'new', can you track down which deployment might have caused the new behavior?
Could the the description given of the 'slow down' behavior be attributed to an apppool recycle and resulting jitting of the application again? ala - the first request syndrome.
Reviewing the logs helps understand how the sites/applications are being used, which can be especially important if you don't own the codebase. Logparser is an excellent tool for doing some IIS log analysis (as well as other formats).
Good luck!
Z
The settings that your are talking are part of processModel element under system.web element from machine.config. For IIS6, following are applicable:
autoConfig
maxIoThreads
maxWorkerThreads
minIoThreads
minWorkerThreads
requestQueueLimit
responseDeadlockInterval
Typically, you will only find autoConfig="true" and not other elements. Auto-config sets the values as per your machine configuration - the tuning is done as per recommended values (see Threading Explained section from this article) which are same as sighted by the link that you have provided.
The article although dated, i excellent resource if you want to tune up these settings manually.
On the other hand, at the load that you are serving, I would recommend two things (if you haven't tried already)
Use output caching aggressively - even if the data is dynamic, caching for say 30-60 seconds can give a definite boost at your load
If you suspect certain requests are hogging too many threads then attempt to move those resources under different app-pool (you can use different web-site with different sub-domain or you can use different virtual directory/application and choose different app-pool)