ASP.NET Max Pool size error after adding cores to web server - asp.net

Last night I doubled the processor cores (CPUs) on my web server from 4 to 8 to speed up a download process from an API which worked perfectly.
This caused an error in a different API on the same server:
"Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached."
When checking for open connections in SQL Server I can see it slowly creep up to 100, closing a few connections along the way and then the error starts showing as a response from the API once it hits 100.
These 2 APIs have been running perfectly fine (slowly) on 4 cores for 3 years, why would adding 4 more cores cause this to happen?
I do see potential leaks: the code opens DataReaders and, if an error occurs, ends the request and sends a response to the client without closing them. But why would this not have happened with half the number of cores?
The only thing that makes the error go away is restarting the Application Pool in IIS.

Doubling the CPU count lets more requests run in parallel, so pooled resources are more likely to be exhausted. It is therefore no surprise to see the exception you describe.
If you cannot fix the code smells right now, a larger pool limit might help; see "Should I set max pool size in database connection string? What happens if I don't?"
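Both points above can be sketched in C#. This is a minimal sketch, assuming System.Data.SqlClient is referenced; the class and method names (PoolSafeDataAccess, WithMaxPoolSize, ReadNames) are hypothetical and not taken from the question. The first method raises the pool limit (SqlClient defaults to a Max Pool Size of 100, which matches the 100 connections seen in the question). The second shows the using pattern, which returns the reader and the connection to the pool even when an exception ends the request early.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class PoolSafeDataAccess
{
    // Returns a copy of the connection string with an explicit pool limit.
    // This only buys headroom; it does not fix a leak.
    public static string WithMaxPoolSize(string connectionString, int maxPoolSize)
    {
        var builder = new SqlConnectionStringBuilder(connectionString);
        builder.MaxPoolSize = maxPoolSize;
        return builder.ConnectionString;
    }

    // Hypothetical query helper: every 'using' block disposes its object even
    // if an exception is thrown, so the connection always returns to the pool.
    public static List<string> ReadNames(string connectionString, string sql)
    {
        var names = new List<string>();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    names.Add(reader.GetString(0));
                }
            }
        }
        return names;
    }
}
```

Raising the limit only delays the error if readers are still leaked on the error path, so the using pattern is the actual fix and the larger pool is a stopgap.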

Related

Connection pool is full

On my IIS server, I have many application pools (6 to 7) and there are many ASP.NET applications running in each of them (e.g. 25 applications per pool). They all connect to an Oracle database using ADO.NET.
All applications are working fine, but sometimes we get an error like:
Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
I know the possible causes, such as not closing our database connections properly. So here is my headache: I don't want to go through each and every project to see where we forgot to close connections; that would be a very time-consuming task for us.
So is there any way to identify which application is leaving connections open? Can we see it from IIS itself? Can we build some kind of utility to track which project's connections remain open?
I'm not sure that it's a problem with the database connections. I think that your applications are not disposing the context, so the garbage collector can't reclaim the memory. You can try reducing the recycling interval of your application pools and then check whether your memory usage decreases.

Troubleshooting an IIS .NET website outage

Last night one of the websites (.NET 4.0 forms) hosted on my Win 2008 R2 (IIS 7.5) Server started to time out throwing the following error for all connected users.
TYPE System.Web.HttpException
MESSAGE Request timed out.
DETAIL System.Web.HttpException (0x80004005): Request timed out.
The outage was confined to just one website within IIS, the others continued to work fine.
Unfortunately I was unable to identify why the website was timing out. Here are the steps I took:
First thing I did was look at the task manager which revealed normal CPU and memory usage. Network activity was also moderate.
I then opened IIS to look at the live connections under 'Worker Processes'. There were about 60 live connections, so it didn't look like anything DDoS related.
Checked database connectivity (hosted on a separate server), all fine!
I then reset the website in IIS. That didn't work.
I tried to then do a complete iisreset...still no luck :(
In the end (and under some duress) the only thing I could think to do to resolve this was to restart the server.
Restarting the server worked, but I am nervous not knowing why this happened in the first place. Can anyone recommend any checks that I failed to carry out? Is there an official checklist for working through these sorts of IIS problems? I have reviewed the IIS logs but don't see anything unusual in the run-up to the outage.
Any pointers or links to useful resources to help me understand and mitigate against this in future will be much appreciated.
EDIT
The only time I logged into the server that day was to add an additional web handler component (for remote deploy) to IIS Web Deploy. I'm doubtful this caused the outage as the server worked for 6 hours after.
Because iisreset didn't help and you had to restart the whole machine, I would suspect a global resource shortage, with the most used (or most resource-consuming) website being the one impacted. It could be caused by unavailable RAM or by network connection congestion due to malfunctioning calls (for example, a lot of CLOSE_WAIT sockets exhausting the connection pool; we've seen that in production because of a malfunctioning external service). It could also be a problem with one specific client, which was disconnected by the machine restart, so the problem eventually disappeared.
I would start with:
Historical analysis:
review Event Viewer for any errors/warnings from that period of time,
although you have already looked into the IIS logs, I would do it once again with the help of Log Parser Lizard to produce some statistics such as number of requests per client, network bandwidth per client, average response time per client and so on.
Monitoring:
continuously monitor Performance Counters:
\Processor(_Total)\% Processor Time,
\.NET CLR Exceptions(_Global_)\# of Exceps Thrown / sec,
\Memory\Available MBytes,
\Web Service(Default Web Site)\Current Connections (one per site name),
\ASP.NET v4.0.30319\Request Wait Time,
\ASP.NET v4.0.30319\Requests Current,
\ASP.NET v4.0.30319\Requests Queued,
\Process(XXX)\Working Set,
\Process(XXX)\% Processor Time (XXX for each w3wp process),
\Network Interface(XXX)\Bytes Total/sec
run the Performance Analysis of Logs (PAL) tool during the time of failure to make a very detailed analysis of the performance counter data,
run netstat -ano to analyze network traffic (or, even better, the TCPView tool).
If none of this leads you to a conclusion, create a Debug Diagnostic rule to capture a memory dump of the process for long-running requests and analyze it with WinDbg and the PSSCor extension for .NET debugging.

Scalability issue when using outgoing asynchronous web requests on IIS 7.5

A bit of a long description below, but it is a quite tricky problem. I have tried to cover what we do know about the problem in order to narrow down the search. The question is more of an ongoing investigation than a single-question based one but I think it may help others as well. But please add information in comments or correct me if you think I am wrong about some assumptions below.
UPDATE 19/2, 2013: We have cleared some question marks in this and I have a theory of what the main problem is which I'll update below. Not ready to write a "solved" response to it yet though.
UPDATE 24/4, 2013: Things have been stable in production (though I believe it is temporary) for a while now and I think it is due to two reasons: 1) the port increase, and 2) a reduced number of outgoing (forwarded) requests. I'll continue this update further down in the correct context.
We are currently doing an investigation in our production environment to determine why our IIS web server does not scale when too many outgoing asynchronous web service requests are being done (one incoming request may trigger multiple outgoing requests).
CPU is only at 20%, but we receive HTTP 503 errors on incoming requests and many outgoing web requests get the following exception: “SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full”. Clearly there is a scalability bottleneck somewhere and we need to find out what it is and whether it can be solved by configuration.
Application context:
We are running IIS v7.5 integrated managed pipeline using .NET 4.5 on Windows 2008 R2 64 bit operating system. We use only 1 worker process in IIS. Hardware varies slightly but the machine used for examining the error is an Intel Xeon 8 core (16 hyper threaded).
We use both asynchronous and synchronous web requests. Those that are asynchronous are using the new .NET async support to make each incoming request make multiple HTTP requests in the application to other servers on persisted TCP connections (keep-alive). Synchronous request execution time is low 0-32 ms (longer times occur due to thread context switching). For the asynchronous requests, execution time can be up to 120 ms before the requests are aborted.
Normally each server serves up to ~1000 incoming requests. Outgoing requests run from ~300 requests/sec up to ~600 requests/sec, which is when the problem starts to arise. Problems only occur when outgoing async requests are enabled on the server and we go above a certain level of outgoing requests (~600 req./s).
Possible solutions to the problem:
Searching the Internet for this problem reveals a plethora of possible solution candidates. However, they depend very much on the versions of .NET, IIS and the operating system, so it takes time to find something that applies in our context (anno 2013).
Below is a list of solution candidates and the conclusions we have come to so far with regard to our configuration context. I have categorised the detected problem areas so far into the following main categories:
Some queue(s) fill up
Problems with TCP connections and ports (UPDATE 19/2, 2013: This is the problem)
Too slow allocation of resources
Memory problems (UPDATE 19/2, 2013: This is most likely another problem)
1) Some queue(s) fill up
The outgoing asynchronous request exception message does indicate that some queue or buffer has been filled up. But it does not say which queue/buffer. Via the IIS forum (and blog post referenced there) I have been able to distinguish 4 of possibly 6 (or more) different types of queues in the request pipeline, labeled A-F below.
It should be stated, though, that of all the queues defined below, we see for certain that the 1.B) ThreadPool performance counter Requests Queued gets very full during the problematic load. So it is likely that the cause of the problem is at the .NET level and not below it (C-F).
1.A) .NET Framework level queue?
We use the .NET Framework class WebClient for issuing the asynchronous calls (async support) rather than HttpClient, which in our experience had the same issue but at a far lower req/s threshold. We do not know whether the .NET Framework implementation hides any internal queue(s) above the thread pool. We don’t think this is the case.
1.B) .NET Thread Pool
The Thread pool acts as a natural queue since the .NET Thread (default) Scheduler is picking threads from the thread pool to be executed.
Performance counter: [ASP.NET v4.0.30319].[Requests Queued].
Configuration possibilities:
(ApplicationPool) maxConcurrentRequestsPerCPU should be 5000 (instead of the previous 12). So in our case the limit should be 5000 × 16 = 80,000 concurrent requests, which should be sufficient in our scenario.
(processModel) autoConfig = true/false which allows some threadPool related configuration to be set according to machine configuration. We use true which is a potential error candidate since these values may be set erroneously for our (high) need.
1.C) Global, process wide, native queue (IIS integrated mode only)
If the Thread Pool is full, requests start to pile up in this native (unmanaged) queue.
Performance counter: [ASP.NET v4.0.30319].[Requests in Native Queue]
Configuration possibilities: ????
1.D) HTTP.sys queue
This queue is not the same queue as 1.C) above. Here’s an explanation as stated to me “The HTTP.sys kernel queue is essentially a completion port on which user-mode (IIS) receives requests from kernel-mode (HTTP.sys). It has a queue limit, and when that is exceeded you will receive a 503 status code. The HTTPErr log will also indicate that this happened by logging a 503 status and QueueFull“.
Performance counter: I have not been able to find any performance counter for this queue, but by enabling the IIS HTTPErr log, it should be possible to detect if this queue gets flooded.
Configuration possibilities: This is set in IIS on the application pool, advanced setting: Queue Length. Default value is 1000. I have seen recommendations to increase it to 10,000. Trying this increase has not solved our issue, though.
1.E) Operating System unknown queue(s)?
Although unlikely, I guess the OS could actually have a queue somewhere in between the network card buffer and the HTTP.sys queue.
1.F) Network card buffer:
As requests arrive at the network card, it is natural that they are placed in some buffer in order to be picked up by some OS kernel thread. Since this is kernel-level execution, and thus fast, it is not likely to be the culprit.
Windows Performance Counter: [Network Interface].[Packets Received Discarded] using the network card instance.
Configuration possibilities: ????
2) Problems with TCP connections and ports
This is a candidate that pops up here and there, though our outgoing (async) TCP requests are made on persisted (keep-alive) TCP connections. So as the traffic grows, the number of ephemeral ports in use should really only grow due to the incoming requests. And we know for sure that the problem only arises when we have outgoing requests enabled.
However, the problem may still arise because a port stays allocated for a longer part of the request's lifetime. An outgoing request may take as long as 120 ms to execute (before the .NET Task (thread) is canceled), which might mean that ports remain allocated for a longer time period. Analyzing the Windows Performance Counter verifies this assumption, since TCPv4.[Connections Established] goes from the normal 2,000-3,000 to peaks of almost 12,000 in total when the problem occurs.
We have verified that the configured maximum amount of TCP connections is set to the default of 16384. In this case, it may not be the problem, although we are dangerously close to the max limit.
When we try using netstat on the server it mostly returns without any output at all, also using TcpView shows very few items in the beginning. If we let TcpView run for a while it soon starts to show new (incoming) connections quite rapidly (say 25 connections/sec). Almost all connections are in TIME_WAIT state from the beginning, suggesting that they have already completed and waiting for clean up. Do those connections use ephemeral ports? The local port is always 80, and the remote port is increasing. We wanted to use TcpView in order to see the outgoing connections, but we can’t see them listed at all, which is very strange. Can’t these two tools handle the amount of connections we are having?
(To be continued.... But please fill in with info if you know it… )
Furthermore, as a side note: it was suggested in the blog post "ASP.NET Thread Usage on IIS 7.5, IIS 7.0, and IIS 6.0" that ServicePointManager.DefaultConnectionLimit should be set to int.MaxValue, since the limit could otherwise be a problem. But in .NET 4.5, this is already the default.
UPDATE 19/2, 2013:
It is reasonable to assume that we did in fact hit the max limit of 16,384 ports. We doubled the number of ports on all but one server, and only the old server would run into problems when we hit the old peak load of outgoing requests. So why did TCPv4.[Connections Established] never show us a higher number than ~12,000 at problem times? My theory: most likely, although not established as fact (yet), the Performance Counter TCPv4.[Connections Established] is not equivalent to the number of ports that are currently allocated. I have not had time to catch up on my TCP state studying yet, but I am guessing that there are more TCP states than the ones "Connections Established" counts that would render a port occupied. Since we cannot use the "Connections Established" performance counter to detect the danger of running out of ports, it is important that we find some other way of detecting when we approach the limit of the port range. And as described in the text above, we are not able to use either netstat or TCPView for this on our production servers. This is a problem! (I'll write more about it in an upcoming response to this post, I think.)
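One possible in-process way to see all TCP states, not only established connections, is the System.Net.NetworkInformation API. The following is a minimal sketch; the class and method names are hypothetical, and whether this call copes better than netstat or TCPView under the load described here is an untested assumption. CountDistinctLocalPorts excludes the listening port (80 in the question) as a rough approximation of the local ports consumed by outgoing connections.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.NetworkInformation;

public static class TcpStateSnapshot
{
    // Counts the machine's current TCP connections grouped by state
    // (Established, TimeWait, CloseWait, ...).
    public static Dictionary<TcpState, int> CountByState()
    {
        TcpConnectionInformation[] connections =
            IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();

        return connections
            .GroupBy(c => c.State)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Counts distinct local ports in use by connections in any state,
    // ignoring the given listening port (for example 80).
    public static int CountDistinctLocalPorts(int listeningPortToIgnore)
    {
        return IPGlobalProperties.GetIPGlobalProperties()
            .GetActiveTcpConnections()
            .Select(c => c.LocalEndPoint.Port)
            .Where(port => port != listeningPortToIgnore)
            .Distinct()
            .Count();
    }
}
```

Logging both values periodically would show whether states such as TimeWait account for the gap between the ~12,000 established connections and the configured port limit.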
The number of ports is restricted on Windows to a maximum of 65,535 (although the first ~1000 should probably not be used). But it should be possible to avoid running out of ports by decreasing the duration of the TCP state TIME_WAIT (240 seconds by default), as described in numerous places. That should free up ports faster. I was at first a bit hesitant about doing this since we use both long-running database queries and WCF calls over TCP, and I wouldn't like to decrease the time constraint for those. Although I have not caught up on my TCP state machine reading yet, I think it might not be a problem after all. The TIME_WAIT state, I think, is only there to allow for the handshake of a proper shutdown with the client. So the actual data transfer on an existing TCP connection should not time out due to this time limit. Worst case, the client is not shut down properly and instead needs to time out. I guess not all browsers implement this correctly, and it could possibly be a problem on the client side only. Though I am guessing a bit here...
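The effect of the TIME_WAIT duration on the port budget can be estimated with simple arithmetic. This is a rough sketch under a stated assumption: every connection that is opened and then closed locally parks its local port in TIME_WAIT for the full duration, and connections reused via keep-alive are not counted. The class name is hypothetical.

```csharp
using System;

public static class EphemeralPortBudget
{
    // Upper bound on newly opened connections per second that can be sustained
    // if each closed connection holds its local port for timeWaitSeconds.
    public static double MaxSustainedConnectionsPerSecond(int availablePorts, int timeWaitSeconds)
    {
        if (availablePorts <= 0) throw new ArgumentOutOfRangeException("availablePorts");
        if (timeWaitSeconds <= 0) throw new ArgumentOutOfRangeException("timeWaitSeconds");
        return (double)availablePorts / timeWaitSeconds;
    }
}
```

With the figures from the text, 16,384 ports and a 240-second TIME_WAIT allow roughly 68 new connections per second; the same port range with a 30-second TIME_WAIT allows roughly 546 per second.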
END UPDATE 19/2, 2013
UPDATE 24/4, 2013:
We have increased the number of ports to the maximum value. At the same time we do not get as many forwarded outgoing requests as earlier. These two in combination should be the reason why we have not had any incidents. However, it is only temporary since the number of outgoing requests is bound to increase again in the future on these servers. The problem thus lies, I think, in that the port for the incoming request has to remain open for as long as the forwarded requests take to respond. In our application, the cancellation limit for these forwarded requests is 120 ms, which can be compared with the normal <1 ms to handle a non-forwarded request. So in essence, I believe the limited number of ports is the major scalability bottleneck on the high-throughput servers (>1000 requests/sec on ~16-core machines) that we are using. This, in combination with the GC work on cache reload (see below), makes the server especially vulnerable.
END UPDATE 24/4
3) Too slow allocation of resources
Our performance counters show that the number of queued requests in the Thread Pool (1B) fluctuates a lot during the time of the problem. So potentially this means that we have a dynamic situation in which the queue length starts to oscillate due to changes in the environment. For instance, this would be the case if there are flooding protection mechanisms that are activated when traffic is flooding. As it is, we have a number of these mechanisms:
3.A) Web load balancer
When things go really bad and the server responds with an HTTP 503 error, the load balancer will automatically remove the web server from being active in production for a 15 second period. This means that the other servers will take the increased load during that time frame. During the “cooling period”, the server may finish serving its requests and it will automatically be reinstated when the load balancer does its next ping. Of course this is only good as long as all servers don’t have a problem at once. Luckily, so far, we have not been in this situation.
3.B) Application specific valve
In the web application, we have our own constructed valve (Yes. It is a "valve". Not a "value") triggered by a Windows Performance Counter for Queued Requests in the thread pool. There is a thread, started in Application_Start, that checks this performance counter value each second. And if the value exceeds 2000, all outgoing traffic ceases to be initiated. The next second, if the queue value is below 2000, outgoing traffic starts again.
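A valve of the kind described above could look roughly like the following. This is a minimal sketch and not the author's implementation: the class name and the injected readQueuedRequests delegate are hypothetical, chosen so the sampling interval is configurable and the counter source can be swapped out. In the real application the delegate would presumably wrap a PerformanceCounter for the ASP.NET "Requests Queued" counter.

```csharp
using System;
using System.Threading;

public sealed class OutgoingTrafficValve : IDisposable
{
    private readonly Func<float> _readQueuedRequests; // supplies the current queue length
    private readonly float _threshold;
    private readonly Timer _timer;
    private volatile bool _isOpen = true;

    public OutgoingTrafficValve(Func<float> readQueuedRequests, float threshold, TimeSpan interval)
    {
        if (readQueuedRequests == null) throw new ArgumentNullException("readQueuedRequests");
        _readQueuedRequests = readQueuedRequests;
        _threshold = threshold;
        // Re-evaluates the valve once per interval on a thread pool thread.
        _timer = new Timer(_ => Evaluate(), null, interval, interval);
    }

    // Callers check this before initiating an outgoing request.
    public bool IsOpen
    {
        get { return _isOpen; }
    }

    // Closes the valve while the sampled queue length exceeds the threshold
    // and reopens it as soon as a sample is at or below the threshold.
    public void Evaluate()
    {
        _isOpen = _readQueuedRequests() <= _threshold;
    }

    public void Dispose()
    {
        _timer.Dispose();
    }
}
```

Constructed with a threshold of 2000 and TimeSpan.FromSeconds(1), this matches the behaviour described; passing a shorter interval is one way to test whether the one-second check is simply too coarse.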
The strange thing here is that it has not kept us from reaching the error scenario, since we don’t have much logging of this occurring. It may mean that when traffic hits us hard, things go bad so quickly that the 1-second check interval is actually too long.
3.C) Thread pool slow increase (and decrease) of threads
There is another aspect of this as well. When there is a need for more threads in the application pool, these threads get allocated very slowly; from what I have read, 1-2 threads per second. This is because it is expensive to create threads, and since you don’t want too many threads anyway in order to avoid expensive context switching in the synchronous case, I think this is natural. However, it should also mean that if a sudden large burst of traffic hits us, the number of threads will not be nearly enough to satisfy the need in the asynchronous scenario, and queuing of requests will start. This is a very likely problem candidate, I think. One candidate solution may then be to increase the minimum number of threads in the ThreadPool. But I guess this may also affect the performance of the synchronously running requests.
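Raising the thread pool minimum can be done in code, for example from Application_Start. The following is a minimal sketch using the standard ThreadPool API; the class name is hypothetical and the right minimum for this workload is not known from the text, so any value passed in is an assumption to be tuned under load.

```csharp
using System;
using System.Threading;

public static class ThreadPoolWarmup
{
    // Raises the minimum worker and I/O completion thread counts to at least
    // the requested values, never lowering them and never exceeding the
    // configured maximums. Returns false if the runtime rejects the change.
    public static bool RaiseMinThreads(int minWorkerThreads, int minIoThreads)
    {
        int currentWorker, currentIo;
        ThreadPool.GetMinThreads(out currentWorker, out currentIo);

        int maxWorker, maxIo;
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);

        int worker = Math.Min(Math.Max(currentWorker, minWorkerThreads), maxWorker);
        int io = Math.Min(Math.Max(currentIo, minIoThreads), maxIo);
        return ThreadPool.SetMinThreads(worker, io);
    }
}
```

Threads up to the minimum are created on demand without the throttled ramp-up, which is what makes a burst less likely to queue; the cost is more context switching if the minimum is set far above what the cores can use.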
4) Memory problems
(Joey Reyes wrote about this here in a blog post)
Since objects get collected later for asynchronous requests (up to 120 ms later in our case), memory problems can arise because objects can be promoted to generation 1 and the memory will not be reclaimed as often as it should. The increased pressure on the Garbage Collector may very well cause extended thread context switching and further weaken the capacity of the server.
However, we see neither increased GC nor increased CPU usage during the time of the problem, so we don’t think the suggested CPU throttling mechanism is a solution for us.
UPDATE 19/2, 2013: We use a cache swap mechanism at regular intervals in which an (almost) full in-memory cache is reloaded into memory and the old cache can get garbage collected. At these times, the GC has to work harder and steals resources from the normal request handling. The Windows Performance Counter for thread context switching shows that the number of context switches decreases significantly from its normal high value at times of high GC usage. I think that during such cache reloads, the server is extra vulnerable to queueing up requests, and it is necessary to reduce the footprint of the GC. One potential fix would be to just refill the cache without allocating memory all the time. A bit more work, but it should be doable.
UPDATE 24/4, 2013:
I am still in the middle of the cache reload memory tweak in order to avoid having the GC run as much. But we normally have some 1000 queued requests temporarily when the GC runs. Since it runs on all threads, it is natural that it steals resources from the normal request handling. I'll update this status once this tweak has been deployed and we can see a difference.
END UPDATE 24/4
I have implemented a reverse proxy through an async HTTP handler for benchmarking purposes (as a part of my PhD thesis) and ran into the very same problems as you.
In order to scale it is mandatory to have processModel autoConfig set to false and to fine-tune the thread pools. I have found that, contrary to what the documentation regarding processModel defaults says, many of the thread pool settings are not properly configured when autoConfig is set to true. The maxconnection setting is also important, as it limits your scalability if the limit is set too low. See http://support.microsoft.com/default.aspx?scid=kb;en-us;821268
Regarding your app running out of ports because of the TIME_WAIT delay on the socket, I have also faced the same problem because I was injecting traffic from a limited set of machines with more than 64k requests in 240 seconds. I lowered the TIME_WAIT to 30 seconds without any problems.
I also mistakenly reused a proxy object to a Web Services endpoint in several threads. Although the proxy doesn't have any state, I found that the GC had a lot of problems collecting the memory associated with its internal buffers (String [] instances) and that caused my app to run out of memory.
Some interesting performance counters that you should monitor are the ones related to Queued requests, requests in execution and request time under the ASP.NET apps category. If you see queued requests or that the execution time is low but the clients see long request times, then you have some sort of contention in your server. Also monitor counters under the LocksAndThreads category looking for contention.
Since asynchronous requests hold on to the TCP sockets for longer, maybe you need to look at the maxconnection property within connectionManagement in your web.config?
Please refer to this link: http://support.microsoft.com/default.aspx?scid=kb;en-us;821268
We faced similar problem and tuned this parameter to fix our issue. Maybe this will help you.
Edit: Also, lots of TIME_WAITs indicate a connection leak within the code based on past experience. Possible causes: 1) Not disposing connections used. 2) Incorrect implementation of connection pooling.
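The maxconnection limit mentioned in both answers can also be set in code on .NET Framework through ServicePointManager.DefaultConnectionLimit, which is roughly the programmatic counterpart of a connectionManagement entry for address "*". This is a minimal sketch; the class name is hypothetical, and the per-CPU multiplier is a parameter because the right value depends on the workload (the linked KB article suggests a multiple of the CPU count as a starting point).

```csharp
using System;
using System.Net;

public static class OutgoingConnectionLimit
{
    // Sets the default number of concurrent outgoing HTTP connections per
    // remote endpoint to connectionsPerCpu times the logical CPU count.
    // Only affects ServicePoint objects created after this call.
    public static int Apply(int connectionsPerCpu)
    {
        if (connectionsPerCpu <= 0) throw new ArgumentOutOfRangeException("connectionsPerCpu");
        int limit = connectionsPerCpu * Environment.ProcessorCount;
        ServicePointManager.DefaultConnectionLimit = limit;
        return limit;
    }
}
```

Calling OutgoingConnectionLimit.Apply(12) early in application startup applies the limit before the first outgoing request creates its ServicePoint.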

Can low memory on IIS server cause SQL Timeouts (SQL Server on separate box)?

I have an IIS Web Server that hosts 400 web applications (distributed across 30 application pools). They are both ASP.NET applications and WCF Services end points. The server has 32GB of RAM and is usually running fast; although it's running at 95% memory usage. Worker processes each take between 500MB and 1.5GB of RAM.
I also have another box running SQL Server. That one has plenty of free memory.
Sometimes, the Web Server starts throwing SQL Timeout exceptions: a few per minute at first, rapidly increasing to hundreds per minute, effectively taking the server down. This problem affects applications in all pools. Some requests still complete but most of them don't. While this happens the CPU usage on the server is around 30% (which is the normal load on that box).
While this is happening, we can still use SQL Server Management Studio (from the IIS Server) to execute requests successfully (and fast).
The fix is to restart IIS. And then everything goes back to normal until the next time.
Because the server is running with very low memory, I feel like this is the cause. But I cannot explain the relationship between low memory and sudden bursts of SQL Timeout exceptions.
Any idea?
Memory pressure can trigger paging and garbage collection. Both introduce latency which would not be present otherwise.
GC'ing 32GB of data can take seconds. Why would all app processes GC at the same time? Because at about 95% memory utilization Windows sets a "low memory" event that the CLR listens to. It will try to release memory to help other processes.
If the applications get into a paging frenzy that would also explain huge delays in normal execution.
This is just guessing, though. You can try proving it by looking at the "Hard page faults/sec" counter. There also must be a counter for "full GC" or "Gen 2 GC".
The fix would be to run with more headroom below the physical memory limit.
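The full-GC guess can be checked from inside a worker process as well as through the ".NET CLR Memory\# Gen 2 Collections" performance counter. The following is a minimal sketch using GC.CollectionCount; the class name is hypothetical, and it only observes the process it runs in, so it would need to be logged from each application to correlate with the timeout bursts.

```csharp
using System;

public sealed class Gen2CollectionMonitor
{
    private int _lastCount = GC.CollectionCount(2);

    // Returns how many generation 2 (full) collections have occurred in this
    // process since the previous call, or since construction on first call.
    public int SampleDelta()
    {
        int current = GC.CollectionCount(2);
        int delta = current - _lastCount;
        _lastCount = current;
        return delta;
    }
}
```

If SampleDelta, logged once per minute, jumps at the same time the SQL timeouts start, that supports the memory-pressure explanation; if it stays flat, paging or the connection pool is the more likely suspect.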
The first problem is to discover where the timeout is happening. Can you tell from the stack trace if the timeout is happening when executing a request against the database, or when connecting to the database? (Or even connecting to the web server?)
Timeouts executing database requests can have a variety of causes. The problem might be in the database, with blocking processes, database maintenance (also locking), deadlocks, etc. When apps are running slowly, do you see a lot of entries in sys.dm_exec_requests, and if so, what are their wait_types?
Even if you can run SQL in the query window while the web server is timing out, that doesn't mean there isn't massive blocking or deadlocking going on.
If it is a timeout connecting to the database, then it is possible the ADO connection pools are overwhelmed and not getting cleaned up, or the database has a connection limit, and the web services are timing out waiting for a connection.
One of the best ways to find out what is going on is to capture a memory dump of the w3wp.exe process and analyze it. Even if you aren't adept at a debugger like WinDbg, Microsoft's DebugDiag tool can produce some nice reports with helpful information.
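To tell the two kinds of timeout apart in logs, the exception type is usually enough. The following is a minimal sketch, assuming System.Data.SqlClient; the enum and class names are hypothetical. It relies on pool exhaustion surfacing as an InvalidOperationException from SqlConnection.Open(), and on SqlClient reporting its own timeouts as SqlException with Number -2. The message check assumes English exception messages.

```csharp
using System;
using System.Data.SqlClient;

public enum TimeoutKind
{
    None,
    PoolExhausted,
    SqlTimeout
}

public static class TimeoutClassifier
{
    public static TimeoutKind Classify(Exception ex)
    {
        if (ex == null) return TimeoutKind.None;

        // Waiting for a free pooled connection timed out: the request never
        // reached the database server.
        if (ex is InvalidOperationException &&
            ex.Message.IndexOf("obtaining a connection from the pool",
                StringComparison.OrdinalIgnoreCase) >= 0)
        {
            return TimeoutKind.PoolExhausted;
        }

        // SqlClient timeout (error number -2), raised while connecting to or
        // executing against the server.
        var sqlEx = ex as SqlException;
        if (sqlEx != null && sqlEx.Number == -2)
        {
            return TimeoutKind.SqlTimeout;
        }

        return TimeoutKind.None;
    }
}
```

Counting each kind per application pool during an incident would show whether the web tier is starving for pooled connections or the database itself is slow to respond.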
SqlCommand.CommandTimeout
This property is the cumulative time-out for all network reads during command execution or processing of the results. A time-out can still occur after the first row is returned, and does not include user processing time, only network read time.
It is a client based time out. If stuff is getting queued due to memory constraints then that could cause a timeout.
Are you retrieving a lot of data from these queries?
If some queries return a lot of data consider breaking them up and give the user a next and prior button.
Have you considered async calls such as BeginExecuteReader?
The advantage is no timeout.
It does not block the calling thread.
isExecutingFTSindexWordOnce = true;
sqlCmdFTSindexWordOnce.BeginExecuteNonQuery(callbackFTSindexWordOnce, sqlCmdFTSindexWordOnce);
// isExecutingFTSindexWordOnce set to false in the callback
Debug.WriteLine("Calling thread active");
But I agree with your comment about how to respond to the request, as the answer does not come back on the calling thread.
Sorry, I am used to WPF, where I just update a public property in the callback.

SignalR - in production, after 8 hours doesn't respond

I am using SignalR in production. At startup everything works fine, but after 8-9 hours the service stops working, without any exception or any log information in the Event Logs.
Info:
Online Users (who uses this service in this 8-9 hours): 3000
Online Concurrent Users Max (at same time): 200
Hubs Count: 1 (try catch in every method for logging)
After the browser times out, it returns "404 not found".
Do you have any ideas?
What version of SignalR are you using? You should be using v0.5.2 as previous versions had issues with zombie connections which would shut down your app by causing either an OutOfMemoryException or exceed the allowed number of requests for the app pool.
Essentially what would happen is that the # of requests would get backed up (use the performance monitor to view ASP.NET/{Requests Current, Requests Queued, Requests Rejected} - See Performance Tuning) and/or you would max out your IIS requests and the service would shut down. You can manually override this and increase your Requests Current for ASP.NET. I increased this to about 20,000 on our production box.
If you're not maxing out your requests, your app pool may be shutting down due to increased memory usage or # of exceptions shutting down the app pool.
In the IIS manager, under Management, double-click on Configuration Editor.
Next, at the top, click on system.applicationHost/applicationPools then click on the RHS of the line that says "(Collection)". This will open up your application pool collection editor.
Select your SignalR app pool and check the Properties at the bottom. Here you can set the periodicRestart/memory threshold to whatever you want.
In our application I'm finding that we're good for about 45 minutes, due to the high traffic nature of the application.
Hopefully that helps.
