IIS holding requests in queue instead of processing them - asp.net

I'm executing a load test against an application hosted in Azure. It's a cloud service with 3 instances behind an internal load balancer (Hash based load balancing mode).
When I execute the load test, it queues requests even though the requests/sec and total current requests to IIS are quite low. I'm not sure what the problem could be.
Any suggestions?
Adding a few screenshots of performance counters which might help you decide.
Edit-1: Per the request from Rohit Rajan:
The cloud service has 2 instances (i.e. 2 VMs), each with 14 GB of RAM and 8 cores.
I'm executing a step load pattern starting with 100 users and adding 100-150 users every 5 minutes, over 4-5 hours, until the load reaches 10,000 VUs.
Any calls to external systems are written as async. Database calls are synchronous.

There is no straightforward answer to your question, but there are additional investigation options you can explore.
Based on your explanation, there seems to be a bottleneck within the application which is causing the requests to queue up.
In order to investigate this, collect a memory dump when you see the requests queuing up and then use DebugDiag to run a hang analysis on it.
There are several ways to gather the memory dump.
Task Manager
Procdump.exe
Debug Diagnostics
Process Explorer
Once you have the memory dump, you can install Debug Diagnostics and run an analysis on it. It will generate a report which can help you get started.
Debug Diagnostics download: https://www.microsoft.com/en-us/download/details.aspx?id=49924
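If you go the Procdump route, a command along these lines captures a full user-mode dump (the worker process name w3wp.exe and the output folder are assumptions, adjust them to your setup):

procdump -accepteula -ma w3wp.exe C:\dumps

-ma writes a full memory dump; if several w3wp.exe worker processes are running, pass the PID of the hung one instead of the name.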

Related

JMeter Aggressivity

I am trying to stress test an IIS server running an ASP.NET Core app.
To do this I set up a Thread Group with 100 workers.
In the thread group I use a Loop Controller.
In the loop controller I use an Access Log Sampler in order to replay real GET requests obtained from an NCSA-formatted log file.
I am amazed to see that I obtain a total throughput of only 100 requests per second.
How can I check whether this is a limitation of JMeter or a limitation of my web app?
I would expect JMeter to blast my server and the target CPU to shoot to 100%. Or shall I increase the already high value of 100 threads?
Total throughput is 86 requests per second
100 users might not be enough to "blast" your IIS instance. I would rather recommend going for a stress test, i.e. start with 1 user and gradually increase the load until throughput starts decreasing or errors start occurring, whichever comes first. Moreover, it might be the case that the bottleneck is not CPU usage; it may be somewhere else. With incorrect configuration or inefficient algorithms the web application may not fully utilize the underlying OS and hardware resources.
Don't use GUI mode for the test execution; it's only for test development and debugging. When it comes to execution you should run JMeter tests in command-line non-GUI mode.
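For reference, a typical non-GUI invocation looks something like this (the file names are placeholders; -e and -o, which generate the HTML dashboard, need JMeter 3.0 or later):

jmeter -n -t test-plan.jmx -l results.jtl -e -o report-folder

-n selects non-GUI mode, -t points to the test plan and -l writes the raw results file.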
There are 464 errors; check the jmeter.log file for any suspicious entries.
I don't think you can really replay your access log. It can work only for something simple like static HTML pages; if there is authentication or dynamic parameters, it might be the case that all your requests are hitting the same login page. Try running your test with a View Results Tree listener and inspect the responses to ensure that the test is doing what it's supposed to be doing.

How to send 50.000 HTTP requests in a few seconds?

I want to create a load test for a feature of my app, which uses Google App Engine and a VM. The user sends HTTP requests to the App Engine. It's realistic that the engine gets thousands of requests in a few seconds, so I want to create a load test where I send 20.000-50.000 requests in a timeframe of 1-10 seconds.
How would you solve this problem?
I started by trying Google Cloud Tasks, because it seems perfect for this: you schedule HTTP requests for a specific point in time. The docs say there is a limit of 500 tasks per second per queue; if you need more tasks per second, you can split the tasks across multiple queues. I did this, but Google Cloud Tasks does not execute all the scheduled tasks at the given time. One queue needs 2-5 minutes to execute 500 requests that are all scheduled for the same second.
I also tried a TypeScript script running asynchronous node-fetch requests, but my MacBook needs 77 seconds for 5.000 requests.
I don't think you can get 50.000 HTTP requests "in a few seconds" from "your MacBook"; it's better to consider a dedicated load testing tool (which can be deployed onto a GCP virtual machine in order to minimize network latency and traffic costs).
The tool choice is up to you: either you need a machine type powerful enough to fire 50k requests "in a few seconds" from a single virtual machine, or the tool needs to support running in clustered mode so you can kick off several machines and have them send the requests together at the same moment.
Given you mention TypeScript, you might want to try the k6 tool (it doesn't scale across machines though), or check out Open Source Load Testing Tools: Which One Should You Use? to see what the other options are; none of the others provides a JavaScript API, but several don't require any programming knowledge at all.
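For illustration, a minimal k6 script along these lines ramps up virtual users (the target URL, user counts and durations are placeholder assumptions):

import http from 'k6/http';

export const options = {
  // ramp up, hold, then ramp down - the numbers are illustrative only
  stages: [
    { duration: '10s', target: 1000 },
    { duration: '10s', target: 1000 },
    { duration: '5s', target: 0 },
  ],
};

export default function () {
  // replace with the App Engine endpoint under test
  http.get('https://your-app.example.com/endpoint');
}

Run it with "k6 run script.js"; how many requests per second a single VM can actually sustain still depends on its CPU, memory and network limits.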
A tool you could consider using is siege.
It is Linux based, and running it from inside GCP avoids the additional cost of testing from a system outside GCP.
You could deploy siege on one relatively large machine or on a few machines inside GCP.
It is fairly simple to set up, but since you mention that you need 20-50k requests in a span of a few seconds, note that siege by default caps you at 255 concurrent connections. You can raise this limit, though, so it can fit your needs.
You would need to experiment with how many connections a machine can establish, since each machine has a limit based on CPU, memory and the number of network sockets. You can keep increasing the -c number until the machine gives an "Error: system resources exhausted" error or something similar. Experiment with what your virtual machine on GCP can handle.
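As a hedged example (the URL, concurrency and duration are placeholders):

siege -b -c 500 -t 30S https://your-app.example.com/endpoint

-c sets the number of concurrent simulated users, -t the test duration and -b benchmark mode (no delay between requests). To go above the default cap, raise the limit directive in siege's configuration file (siegerc / siege.conf, depending on the version).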

Can low memory on IIS server cause SQL Timeouts (SQL Server on separate box)?

I have an IIS web server that hosts 400 web applications (distributed across 30 application pools). They are a mix of ASP.NET applications and WCF service endpoints. The server has 32GB of RAM and usually runs fast, although it sits at 95% memory usage. Worker processes each take between 500MB and 1.5GB of RAM.
I also have another box running SQL Server. That one has plenty of free memory.
Sometimes the web server starts throwing SQL timeout exceptions: a few per minute at first, rapidly increasing to hundreds per minute, effectively taking the server down. The problem affects applications in all pools. Some requests still complete, but most of them don't. While this happens, the CPU usage on the server is around 30% (which is the normal load on that box).
While this is happening, we can still use SQL Server Management Studio (from the IIS Server) to execute requests successfully (and fast).
The fix is to restart IIS. And then everything goes back to normal until the next time.
Because the server is running with very low memory, I feel like this is the cause. But I cannot explain the relationship between low memory and sudden bursts of SQL Timeout exceptions.
Any idea?
Memory pressure can trigger paging and garbage collection. Both introduce latency which would not be present otherwise.
GC'ing 32GB of data can take seconds. Why would all app processes GC at the same time? Because at about 95% memory utilization Windows sets a "low memory" event that the CLR listens to. It will try to release memory to help other processes.
If the applications get into a paging frenzy that would also explain huge delays in normal execution.
This is just guessing, though. You can try to prove it by looking at the Memory\Pages/sec counter (hard page faults) and the .NET CLR Memory\# Gen 2 Collections counter.
The fix would be running with a bigger margin to the physical memory limit.
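A quick way to watch both from the command line (the counter paths assume the default English counter names):

typeperf "\Memory\Pages/sec" "\.NET CLR Memory(_Global_)\# Gen 2 Collections" -si 5

This samples every 5 seconds; a sustained spike in Pages/sec together with frequent Gen 2 collections during the timeout bursts would support the memory-pressure theory.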
The first problem is to discover where the timeout is happening. Can you tell from the stack trace if the timeout is happening when executing a request against the database, or when connecting to the database? (Or even connecting to the web server?)
Timeouts executing database requests can have a variety of causes. The problem might be in the database, with blocking processes, database maintenance (also locking), deadlocks, etc. When the apps are running slowly, do you see a lot of entries in sys.dm_exec_requests, and if so, what are their wait_types?
Even if you can run SQL in the query window while the web server is timing out, that doesn't mean there isn't massive blocking or deadlocking going on.
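For example, a quick check of what running requests are waiting on while the timeouts are happening (run it from Management Studio against the SQL box; the session_id filter just skips system sessions):

SELECT session_id, status, command, wait_type, wait_time, blocking_session_id
FROM sys.dm_exec_requests
WHERE session_id > 50;

Lots of rows with LCK_M_* wait types or non-zero blocking_session_id values would point to blocking inside the database rather than at the web server.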
If it is a timeout connecting to the database, then it is possible the ADO connection pools are overwhelmed and not getting cleaned up, or the database has a connection limit, and the web services are timing out waiting for a connection.
One of the best ways to find out what is going on is to capture a memory dump of the w3wp.exe process and analyze it. Even if you aren't adept at a debugger like WinDbg, Microsoft's DebugDiag tool can produce some nice reports with helpful information.
SqlCommand.CommandTimeout
This property is the cumulative time-out for all network reads during command execution or processing of the results. A time-out can still occur after the first row is returned, and does not include user processing time, only network read time.
It is a client-side timeout. If work is getting queued up due to memory constraints, that could cause a timeout.
Are you retrieving a lot of data from these queries?
If some queries return a lot of data, consider breaking them up and giving the user Next and Previous buttons.
Have you considered async execution, like BeginExecuteReader or BeginExecuteNonQuery?
The advantage is that no command timeout applies.
It does not block the calling thread.
// flag that an async command is in flight
isExecutingFTSindexWordOnce = true;
// start the command asynchronously, passing the command itself as the callback state
sqlCmdFTSindexWordOnce.BeginExecuteNonQuery(callbackFTSindexWordOnce, sqlCmdFTSindexWordOnce);
// isExecutingFTSindexWordOnce is set to false in the callback
// meanwhile the calling thread carries on immediately
Debug.WriteLine("Calling thread active");
But I agree with your comment about how to respond to the request, since the answer does not come back on the calling thread.
Sorry, I am used to WPF, where I just update a public property in the callback.
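For completeness, a minimal sketch of the modern task-based equivalent (types from System.Data.SqlClient; the connection string, stored procedure name and timeout are illustrative assumptions):

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.RebuildFtsIndex", conn) { CommandType = CommandType.StoredProcedure })
{
    cmd.CommandTimeout = 120;             // seconds; tune for the workload
    await conn.OpenAsync();
    await cmd.ExecuteNonQueryAsync();     // awaits the result without blocking the calling thread
}

Unlike the Begin/End pattern, the code after the await resumes in the request's context, so the web request can still build its response once the command finishes.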

Long running-time script in PHP causes NGINX server to get very busy

I'll try to be very specific on this - it won't be easy, so please try to follow.
We have a script that runs with PHP on NGINX - PHP-fpm FastCGI.
This script gets information from the user trying to access it and runs some algorithm in real time. It cannot be a scheduled process running in the background.
Sometimes it even takes the page between 5 and 12 seconds to load, and that's OK.
Generally, we collect data from the user, make several outgoing requests to third-party servers, collect the data, analyse it and return a response to the user.
The problem is,
There are many users running this script, and the server gets very busy, since they are all active connections on the server waiting for a response.
We have 2 servers running under 1 load balancer, and that's not enough.
Sometimes the servers have more than 1,500 active connections at a time. You can imagine how these servers respond at that point.
I'm looking for a solution.
We can add more and more servers to the LB, but it just sounds absurd that it's the only solution there is.
We have gone over that script and optimized it to the maximum, I can promise you that.
There is no real way to shorten the script's running time, since it depends on third-party servers that take time to respond to us on live traffic.
Is there a solution you can think of to keep this script as it is,
but somehow lower the impact of these active connections on the overall functioning of the servers?
Sometimes they just simply stop responding.
Thank you very much for reading!
A 3-month-old question, I know, but I can't help thinking the following:
If you're sure that the sum of the network work for all requests to the third-party servers, plus the corresponding processing of the responses inside your PHP script, is well below the limits of your hardware,
then your PHP script is likely busy-looping inefficiently until all the responses come back from the third-party servers.
If I were dealing with such an issue I'd do:
Stop using your custom external C++ curl thing, as the PHP script is busy-waiting for it anyway.
Google and read up on non-busy-looping usage of PHP's curl_multi implementation (a sketch follows below).
Hope this makes sense.
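For reference, a minimal non-busy-waiting curl_multi loop might look like this (the URLs and the timeout are placeholders):

$urls = ['https://third-party-a.example.com/api', 'https://third-party-b.example.com/api'];
$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);      // hard per-request timeout in seconds
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
// curl_multi_select() sleeps until a transfer has activity instead of spinning the CPU
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh, 1.0);
    }
} while ($running && $status === CURLM_OK);
$responses = [];
foreach ($handles as $ch) {
    $responses[] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

All requests run in parallel, so the wall-clock time is roughly the slowest single response rather than the sum of all of them.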
My advice is to set tight timeouts on the requests and to send each third-party request asynchronously.
For example, say your page has to display the results of 5 third-party requests. That means that inside the script you call cURL or file_get_contents 5 times, and the script freezes for each third-party timeout, one after another. So if you have to wait 10 seconds for each response, you'll wait 50 seconds in total.
User calls the script -> script waits to the end -> server is loaded for 50 seconds
Now, if each request to a third party is sent asynchronously, the script's load time drops to the single longest request delay. So you'll have a few smaller scripts that live shorter lives, and that decreases the load on the server.
User calls the script -> script is loaded -> requests are sent -> there are no scripts waiting for responses and consuming resources of your server
May the AJAX be with you! ;)
This is a very old question, but since I had a similar problem I can share my solution. Long-running scripts impact various parts of the system and cause stress on web servers (via active connections), PHP-FPM and MySQL/other databases. These tend to cause a number of knock-on effects, such as other requests starting to fail.
Firstly, make sure you have netdata (https://github.com/netdata/netdata) installed on the server. If you are running many instances, you might find a Grafana/Prometheus setup worth it too.
Next, make sure it can see the PHP-FPM processes, MySQL and Nginx. Netdata shows many, many things, but for this problem my key metrics were:
Connections (mysql_local.connections) - is the database full of connections
PHP-FPM Active Connections (phpfpm_local.connections) - is PHP failing to keep up
PHP-FPM Request Duration (phpfpm_local.request_duration) - is the time to process going through the roof?
Disk Utilization Time (disk_util.sda) - this shows if the disk cannot keep up (100% = bad under load)
Users Open Files (users.files)
Make sure that you have sufficient file handles (https://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/) and that the disk is not fully occupied. Either of these will stop things from working, so make them generous on the troublesome server.
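A hedged example of raising the limits (the user name and the numbers are assumptions to size for your stack):

# current per-process limit for the shell user
ulimit -n
# system-wide maximum, e.g. in /etc/sysctl.conf
fs.file-max = 500000
# per-user limits, e.g. in /etc/security/limits.conf
www-data soft nofile 65535
www-data hard nofile 65535

Note that services started by systemd (nginx, php-fpm) do not read limits.conf; set LimitNOFILE in their unit files instead.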
Next check Nginx has enough resources in nginx.conf:
worker_processes auto;
worker_rlimit_nofile 30000;
events {
worker_connections 768;
}
This will give you time to work out what is going wrong.
Next look at php-fpm (/etc/php/7.2/fpm/pool.d/www.conf):
set pm.max_spare_servers high, such as 100
set pm.max_requests = 500 - just in case you have a script that doesn't free its memory properly
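Putting those together, an illustrative www.conf pool sketch (the absolute numbers are assumptions; size them against the RAM each PHP worker uses):

pm = dynamic
pm.max_children = 200
pm.start_servers = 50
pm.min_spare_servers = 25
pm.max_spare_servers = 100
pm.max_requests = 500

Each worker holds its full memory footprint, so cap pm.max_children at roughly the available RAM divided by the memory per PHP worker.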
Then watch. The problem for me was that every request blocks an incoming connection; more requests to the same script block more connections. The machine can be operating fine, but a single slow script doing a curl hit or a slow SQL statement will hold that connection for its entire duration, so 30 seconds = 1 less PHP process available to handle incoming requests. Eventually you hit 500 and you run out. If you can, increase the number of FPM processes to match the frequency of slow-script requests multiplied by the number of seconds they run for. So if the script takes 2 seconds and gets hit twice a second, you will need a constant 4 additional FPM workers chewed up doing nothing.
If you can do that, stop there; the extra effort beyond that is probably not worth it. If it still feels like it will be an issue, create a second php-fpm instance on the box and send all requests for the slow script to that new instance (see the nginx sketch after the list below). This allows you to fail those requests discretely in the case of too much run time, and it gives you the power to do two important things:
Control the amount of resources devoted to the slow script
Mean that all other scripts are never blocked by the slow script and (assuming the OS limits are high enough) never affected by resource limits.
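A hedged nginx sketch of that routing (the script name, socket path and timeout are hypothetical):

location = /slow-script.php {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass unix:/run/php/php-fpm-slow.sock;   # the second, dedicated pool
    fastcgi_read_timeout 30s;
}

The dedicated pool gets its own pm.max_children budget, so however badly the slow script behaves, the main pool keeps serving everything else.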
Hope that helps someone struggling under load!

Slow BizTalk File Receive

I have an application with a file receive location. After the host instance has been running for a few hours the receive location fails to identify new files dropped into the folder that it is monitoring. It doesn't forget about them altogether, it's just that performance grinds to a crawl. The receive location is configured to poll the target folder every 60 seconds but after host instance has been running for an hour or so, then it seems that the target folder is being polled only every thirty minutes. If I restart the host instance then the files waiting in the target folder are collected right away and performance is fine for the next hour or so.
The same application runs fine in a different environment.
There are no obvious entries in the event log related to the problem.
All the BizTalk SQL jobs are running fine except for Backup BizTalk Server (BizTalkMgmtDb).
Any suggestions gratefully received.
Thanks
Rob
Here are some additional tools which may help you identify and diagnose BizTalk database issues.
BizTalk MsgBox Viewer
Here is a tool to repair identified errors:
Terminator
Use at your own risk... read the blogs and docs. Start with the MsgBox Viewer and let us know your results.
Without more details, the biggest tell is that your Backup job is failing. If the backup job is failing, it may not be properly configured. If it is properly configured and still failing, then you've got other issues. Can you give us some more information about your BizTalk install?
What version are you running?
What are your database sizes?
What are your purge and archive settings like?
Are there any long-running blocks in your SQL Server DB coming from BizTalk?
Another thing to consider is the user accounts the send, receive and orchestration hosts are running under. Please check the BizTalk Administration Console. If they are all running under the same account, the orchestrations can sometimes starve the send and receive processes of CPU time. I believe priority is given to orchestrations, then receive, then send. Even if you are just developing, it is useful to use separate accounts for these. This also improves security.
The Wrox BizTalk Server 2006 book will also supply tuning advice.
What other things are going on with the server? Is BizTalk pegged otherwise or is it idle?
You mention that the solution does not have any problems in another environment, so it's likely that there is a configuration problem.
Check the following:
On SQL Server, set an upper memory limit for SQL Server. By default, SQL Server uses whatever memory it can get and then hangs onto it, so set a reasonable limit so that your system can operate without spending a lot of time paging memory onto and from your hard drive(s).
Ensure that you have available disk space - maybe you are running low - this can lead to all kinds of strange problems.
Try to split the system's paging file among its physical drives (if you have more than one drive on the system). Also consider using a faster drive, or, if you have lots of cash lying around, get a SAN.
In BizTalk, is tracking enabled? If so, are you also tracking message bodies? Disable tracking or message body tracking and see if there is a difference.
Start Performance Monitor and watch the following counters when running your solution:
Object: BizTalk Messaging
Instance: (select the receiving host) %%
Counter: Documents Received/Sec
Object: BizTalk Messaging
Instance: (select the transmitting host) %%
Counter: Documents Sent/Sec
Object: XLANG/s Orchestrations
Instance: (select the processing host) %%
Counter: Orchestrations Completed/Sec.
%% You may have only one host, so just use it. Since BizTalk configurations vary, I am using generic names for hosts.
The preceding counters monitor the most basic aspects of your server, but may help to narrow down places to look further. You can, of course, add CPU and Memory too. If you have time (days...maybe weeks) you could monitor for processes that allocate memory and never release it. Use the following counter...
Object: Memory
Counter: Pool Nonpaged Bytes
Slow decline of this counter indicates that a process is not releasing memory, which affects everything on the system.
Let us know how things turn out!
I had the same problem: when my orchestration had been idle for some time, it took a long time to process the first message. An article by EvYoung helped me solve this problem.
"This is caused by application domain unloading within the BizTalk host process. If an AppDomain is shutdown after idle, the next message that comes needs to wait for the Orchestration to compile again. Depending on the complexity of your design, this can be a noticeable wait. To prevent this in low latency requirement scenario, you can modify the BTSNTSVC.EXE.config file and set SecondsIdleBeforeShutdown property to -1. This will prevent AppDomain shutdown due to idle."
You can find the article in here:
http://blogs.msdn.com/b/biztalkcpr/archive/2008/05/08/thoughts-on-orchestration-performance.aspx
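For reference, a rough sketch of the relevant fragment of BTSNTSVC.EXE.config (the exact section registration and element names should be verified against the documentation for your BizTalk version):

<xlangs>
  <Configuration>
    <AppDomains>
      <!-- -1 disables idle shutdown of the orchestration AppDomain -->
      <DefaultSpec SecondsIdleBeforeShutdown="-1" />
    </AppDomains>
  </Configuration>
</xlangs>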
It took me too long to respond, but I thought it might help someone. Cheers :)
Some good suggestions from the others. I will add:
Do you have any custom receive pipeline components on the receive location? If so, perhaps one is leaking memory or calling some external component (e.g. a database) which is taking a long time?
How big are the files you are receiving?
On the File transport properties of your receive location, turn "file renaming" on; do the files get renamed within 60 seconds?
