We have a high-traffic website that generates a lot of I/O. Within 10 minutes it reads over 10 GB of data (the w3wp process in question, as seen in Task Manager). For memory issues and application hangs I have been using WinDbg with success, but I don't know how to find the object(s)/method(s) within a process that are responsible for the most I/O.
Is this even possible?
Edit
The question is: is there a way to profile I/O operations in a .NET assembly, for example a list of threads sorted by highest disk I/O (or something similar that would help me figure out where to look)?
ANTS Performance Profiler
I have used this tool with great success to track down the specific instructions that were chewing up ~512 GB of memory within 5-10 minutes on a high-volume web farm. That sounds like a very similar situation to yours.
Now, to be realistic - it's not going to magically solve your problem. It still requires a lot of setup, thorough analysis and detective work. But this tool definitely took the problem from "practically unsolvable" to "solvable within days".
Update:
As I mentioned in the comments (and Ben Emmett echoed), you can use ANTS to monitor memory, file system handles - pretty much any resource consumption - and drill down the call stack to see the effect of specific routines.
I came across a tool called AppDynamics Lite, which displays your application's call costs and performance visually. It might help you find out which functions are making the most costly I/O operations.
Quoting:
Understand the health of your CLR with key metrics like response time, throughput, exception rate, and garbage collection time as well as key system resource like CPU, memory and disk I/O.
Worth giving it a shot, as it is free to try for 30 days. Hope it helps.
PS: I'm not affiliated with AppDynamics in any way.
You can use the (free) Windows Performance Toolkit from Windows 8, which also runs on Windows Vista and later. There you can turn on system-wide profiling to see what is going on in all processes at once, with no instrumentation necessary. Only one reboot is required to set an arcane registry key, which WPRUI.exe does automatically.
With XPerf you can enable I/O Init stack walking so that a call stack is captured for every I/O that is started. The only issue is that the stacks will be broken for 64-bit processes, which means you will only see the first method of your own code above the BCL methods, because of a bug in the stack-walking capabilities of Windows 7.
A workaround is to NGen your assemblies, move to Server 2012, or switch to x86 for profiling, in order to see deeper call stacks.
Even without any call stacks you will see all file I/O and CPU activity, along with the file names and how long the hard disk was in use. That should give you good information about which part of your app is causing the disk I/O. From the partial call stacks you should be able to pinpoint your issue even without full stacks.
The tool will give you much more insight than any commercially available profiler, at the expense of having to learn how to use it. Since the call stacks do not end at your code or in user mode but continue into the kernel, you can also determine whether, for example, the virus scanner is causing significant I/O delays. But you do need to know how your processor works. This toolset was originally aimed at kernel developers, which explains why you see so many seemingly useless columns.
In the picture below you can see file I/O and CPU consumption stacked. When you select your high-I/O file in the disk I/O graph, all related call stacks that were taken while that I/O was active are highlighted in the CPU consumption graph. This way you can directly navigate from the I/O to your potentially blocked threads.
Related
I'm a researcher in statistical pattern recognition, and I often run simulations that run for many days. I'm running Ubuntu 12.04 with Linux 3.2.0-24-generic, which, as I understand, supports multicore and hyper-threading. With my Intel Core i7 Sandy Bridge Quadcore with HTT, I often run 4 simulations (programs that take a long time) at the same time. Before I ask my question, here are the things that I already (think I) know.
My OS (Ubuntu 12.04) detects 8 CPUs due to hyper-threading.
The scheduler in my OS is clever enough never to schedule two programs to run on two logical (virtual) cores belonging to the same physical core, because the OS is aware of SMT (simultaneous multi-threading, i.e. hyper-threading).
I have read the Wikipedia page on Hyper-Threading.
I have read the HowStuffWorks page on Sandy Bridge.
OK, my question is as follows. When I run 4 simulations (programs) on my computer at the same time, they each run on a separate physical core. However, due to hyper-threading, each physical core is split into two logical cores. Therefore, is it true that each of the physical cores is only using half of its full capacity to run each of my simulations?
Thank you very much in advance. If any part of my question is not clear, please let me know.
This answer is probably late, but I see that nobody offered an accurate description of what's going on under the hood.
To answer your question: no, one thread will not use only half a core.
Only one thread can execute inside the core at a time, but that one thread can saturate the core's whole processing power.
Assume thread 1 and thread 2 belong to core #0. Thread 1 can saturate the whole core's processing power while thread 2 waits for it to finish executing. It is serialized execution, not parallel.
At a glance, it looks like the extra thread is useless. I mean, the core can only process one thread at a time, right?
Correct, but there are situations in which the cores are actually idling because of 2 important factors:
cache miss
branch misprediction
Cache miss
When it receives a task, the CPU searches inside its own cache for the memory addresses it needs to work with. In many scenarios the memory data is so scattered that it is physically impossible to keep all the required address ranges inside the cache (since the cache does have a limited capacity).
When the CPU doesn't find what it needs inside the cache, it has to access the RAM. The RAM itself is fast, but it pales compared to the CPU's on-die cache. The RAM's latency is the main issue here.
While the RAM is being accessed, the core is stalled - it is not doing anything. A single stall is not noticeable, because all these components work at ridiculous speeds anyway and you wouldn't see it in CPU-load software, but the stalls add up. One cache miss after another hampers the overall performance quite noticeably.
This is where the second thread comes into play. While the core is stalled waiting for data, the second thread moves in to keep the core busy. Thus, you mostly negate the performance impact of core stalls.
I say mostly because the second thread can also stall the core if another cache miss happens, but the likelihood of 2 threads missing the cache in a row instead of 1 thread is much lower.
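To make the cache-miss point concrete, here is a rough, self-contained C# micro-benchmark (my own illustration, not the answerer's code; the array size, step count and worker counts are arbitrary assumptions). Every worker does nothing but dependent, cache-missing loads, so on a 4-core/8-thread CPU the 8-worker run typically finishes the same total amount of work noticeably faster than the 4-worker run:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class HyperThreadingDemo
{
    // 64M ints (256 MB) - far larger than any CPU cache, so most loads miss.
    const int N = 64 * 1024 * 1024;
    static readonly int[] next = new int[N];

    static void Main()
    {
        // Build one big random cycle so every load depends on the previous one
        // and almost always misses the cache (classic pointer chasing).
        var rnd = new Random(42);
        var order = new int[N];
        for (int i = 0; i < N; i++) order[i] = i;
        for (int i = N - 1; i > 0; i--)
        {
            int j = rnd.Next(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (int i = 0; i < N; i++) next[order[i]] = order[(i + 1) % N];

        const int totalSteps = 200000000;   // fixed total work, split across the workers
        foreach (int workers in new[] { 4, 8 })
        {
            var sw = Stopwatch.StartNew();
            Parallel.For(0, workers, w => Chase(w, totalSteps / workers));
            Console.WriteLine(workers + " workers: " + sw.ElapsedMilliseconds + " ms");
        }
    }

    static void Chase(int start, int steps)
    {
        int p = start;
        for (int i = 0; i < steps; i++) p = next[p];   // dependent, cache-missing loads
        if (p == -1) Console.WriteLine(p);             // keeps the JIT from removing the loop
    }
}

A purely compute-bound loop (arithmetic that stays in registers) run the same way would typically show little benefit from the extra workers, which is exactly the contrast described above.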
Branch misprediction
Branching is when a code path has more than one possible outcome; the most basic branching construct is an if statement. Branch prediction is the CPU guessing, ahead of time, which way such a branch will go.
Modern CPUs have branch prediction logic built into the core which tries to predict the execution path of a piece of code. These predictors are actually quite sophisticated, and although I don't have solid data on prediction rates, I do recall reading some articles a while back stating that Intel's Sandy Bridge architecture has an average successful branch prediction rate of over 90%.
When the CPU hits a piece of branching code, it effectively chooses one path (the path which the predictor thinks is the right one) and executes it. Meanwhile, another part of the core evaluates the branching expression to see whether the predictor was indeed right. This is called speculative execution.
This works similarly to 2 different threads: one evaluates the expression, and the other executes one of the possible paths in advance.
From here we have 2 possible scenarios:
The predictor was correct. Execution continues normally from the speculative branch which was already being executed while the code path was being decided upon.
The predictor was wrong. The entire pipeline which was processing the wrong branch has to be flushed and start over from the correct branch.
Or, if another hardware thread is ready, it can step in and simply execute while the mess caused by the misprediction is cleaned up. This is the second benefit of hyper-threading.
Branch prediction on average speeds up execution considerably since it has a very high rate of success. But performance does incur quite a penalty when the prediction is wrong.
Branch misprediction is usually not a major factor in performance degradation since, as I said, the correct prediction rate is quite high.
But cache misses are a problem and will continue to be a problem in certain scenarios.
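If you want to see misprediction cost for yourself, the classic sorted-versus-unsorted experiment is easy to reproduce. The sketch below is my own illustration (not from the answer above): the loop body is identical in both runs and the data values are the same, but once the array is sorted the v >= 128 branch becomes almost perfectly predictable and the loop typically runs several times faster:

using System;
using System.Diagnostics;

class BranchDemo
{
    static void Main()
    {
        var rnd = new Random(1);
        var data = new int[32 * 1024];           // small enough to stay in cache, so the branch dominates
        for (int i = 0; i < data.Length; i++) data[i] = rnd.Next(256);

        Console.WriteLine("unsorted: " + Time(data) + " ms");
        Array.Sort(data);                        // same values, but now the branch is predictable
        Console.WriteLine("sorted:   " + Time(data) + " ms");
    }

    static long Time(int[] data)
    {
        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int pass = 0; pass < 2000; pass++)
            foreach (int v in data)
                if (v >= 128) sum += v;          // taken essentially at random when unsorted
        sw.Stop();
        if (sum == 0) Console.WriteLine(sum);    // keeps the JIT from discarding the work
        return sw.ElapsedMilliseconds;
    }
}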
From my experience hyperthreading does help out quite a bit with 3D rendering (which I do as a hobby). I've noticed improvements of 20-30% depending on the size of the scenes and materials/textures required. Huge scenes use huge amounts of RAM making cache misses far more likely. Hyperthreading helps a lot in overcoming these misses.
Since you are running on a Linux kernel you are in luck, because the scheduler is smart enough to make sure your tasks are divided between your physical cores.
Linux became hyper-threading aware in kernel 2.4.17 (ref: http://kerneltrap.org/node/391).
Note that the reference describes the old O(1) scheduler. Linux now uses the CFS scheduler, which was introduced in kernel 2.6.23 and should be even better.
But as already suggested, you can experiment by disabling hyper-threading in the BIOS and seeing whether your particular workload runs faster or slower with or without it. If you start 8 tasks instead of 4, you will probably find that the total execution time for 8 tasks with hyper-threading is shorter than for two separate runs of 4 tasks, but again, the best thing to do is to experiment. Good luck!
If you really want just 4 dedicated cores, you should be able to disable hyper-threading in your BIOS settings. Also, and this part I'm less clear on, I believe the processor is smart enough to do more work on a single thread when its second logical core is idle.
No, it's not exactly true. A hyperthreaded core is not two cores. Some things can run in parallel, but not as much as on two separate cores.
I have been developing a quite large application, and I uploaded it to my server some days ago. Now I have found out it has several memory leaks - Uh oh.
My server is running Windows Server 2008 with 1 GB of RAM. When I have 0 people online, only 550-600 MB is used. When one person comes online the memory starts skyrocketing, and if 3-4 people are online all 1 GB of RAM is used.
The application is made in ASP.NET with AJAX. It has many UpdatePanels which refresh every second and quite a lot of JavaScript. It uses 5-7 sessions at all times. I use LINQ to SQL for database communication.
I tried perfmon.exe on my server, and I found:
- Gen 0 collections go from 0% to 100% within minutes
- Gen 1 collections go from 0% to 50% within 5 minutes
- Gen 2 is very close to 0% at all times
- Total heap bytes goes up to 100% very fast
I also ran an analysis of my program with Visual Studio. 8% of my total runtime is spent in .ToList() methods, which is probably caused by LINQ to SQL.
My theories....
(1) Linq to SQL dataContext
This might be a crazy thing to do, but: In my data access layer I have a load of methods:
AddSomethingToDatabase();
AddSomethingElseToDatabase();
DeleteSomethingFromDatabase();
Each of these has the following initialization:
GameDataContext db = new GameDataContext();
Which means the above statement runs nearly every second or even more often.
(2) No objects implement IDisposable
I have to be honest: I have never worked with IDisposable. As far as I have read, this might be a problem.
Also, if this is the leak, which classes should implement it? I do not have any I/O work or the like, only the DataContext.
(3) Loads of UpdatePanels and jQuery
I am somewhat afraid that lots of UpdatePanels can cause performance problems, but I do not know how to check this.
So my question is: Any ideas on what the memory leak could be? Any ideas on how to find the memory leak? And any ideas on how to solve it?
I would love to hear from someone who has experience with the situation above!
Thanks,
Lars
I am not at all sure that there is a problem here. All the suggestions for memory-leak troubleshooting seem like really bad advice when you have not yet established that you actually have a memory leak - and with so little memory on the server, that cannot be established.
So here are my 2 cents - some might not like it, but as long as it points you in the right direction, I do not mind the downvotes.
It seems that you have a very tight memory budget. 1 GB of RAM for Windows Server 2008 gives it just about enough to do its OS-related job; this is well below the recommended RAM for it, which, if I am not wrong, is a minimum of 2 GB. The overhead of just running w3wp.exe and IIS would be around 200-300 MB. The fact that generation 2 is always around 0% is the best evidence that all looks good and your server is probably just being starved of memory.
My suggestion is to give your server at least 2 GB of RAM (4 GB would be better) and then monitor the memory usage to see whether it keeps going up. If so, post another question with your findings and we should be able to help.
You absolutely must ensure that Dispose is called on IDisposable objects when you are done with them. The simplest way to do this is with a using block:
using (GameDataContext db = new GameDataContext())
{
// code that uses 'db' goes in here
}
// Dispose called when 'using' scope ends
If you still have problems after doing this throughout, then profiling is needed, but fix this first since it's a no-brainer.
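Applied to the data access layer from the question, that might look roughly like this (the entity name Something and the table property are placeholders I made up; the using pattern is the point):

public void AddSomethingToDatabase(Something item)
{
    using (GameDataContext db = new GameDataContext())
    {
        db.Somethings.InsertOnSubmit(item);   // queue the insert on this context
        db.SubmitChanges();                   // write it to the database
    }                                         // context disposed here, connection returned to the pool
}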
Your own objects usually only need to implement IDisposable if they encapsulate unmanaged resources for which you wish to guarantee deterministic release back to the OS, so that those resources - file handles, sockets, and so on - are not sitting around waiting for GC for an interval of time you cannot rely on.
I don't have an answer for your question 3), sorry.
I'd recommend you use a memory profiler. Red Gate's ANTS is very good; it can give you a breakdown of which objects are in memory at a given time.
I am no expert at this, but trying ANTS Memory Profiler might help you figure out where the problem is.
SciTech's memory profiler found our leaks and gives good advice.
I'm profiling a classic ASP web service. The web service makes database calls, reads/writes files, and processes XML. On a Windows Server 2003 box (2.7 GHz, 4 cores, 4 GB RAM), how many requests per second should I be able to handle before things start to fail?
I'm building a tool to test this, but I'm looking for a number of requests per second to shoot for.
I know this is fairly vague, but please give the best estimate you can. If you need more information, please ask.
95% of the performance of any data-driven app is dependent on the database: 1) the way you do your calls, 2) the indexes, 3) the hardware under the database (disk subsystem in particular).
I have seen a machine like the one you are describing handle 40 requests per second (2,500/minute), but numbers like 10 per second (600/minute) are more common. I would expect even lower if you are running your DB on the same machine, and lower still if that DB is SQL Express or MS Access.
Also, at capacity your app will probably not fail outright; once IIS is saturated it will queue requests, and it may time out some of those requests if it can't service them before the timeout expires.
By the way, instead of building a tool to test your app, you may want to look into using an existing load-testing tool such as Microsoft WCAT. It is pretty smooth and easy to use.
How fast should it be? Fast enough.
How fast is fast enough? That's a question that only you and your users can answer. If your service is horrifically inefficient and keeps up with demand, it's fast enough. If your service is assembly-optimized, lightning-fast, and overwhelmed with requests, it's not fast enough.
If the server is handling its actual workload, then don't worry about how fast it "should" be. When the server is having trouble, or when you anticipate that it soon will, then you should look at improving the code or upgrading the hardware. Remember Knuth's law: premature optimization is the root of all evil. Any work you do now to make it faster may never pay off, and you may be forced to make compromises with flexibility or maintainability. Remember, too, an older adage: if it ain't broke, don't fix it.
Yes, I would also say 10 per second is a good benchmark. For a high-performance app you would want more than this, but if you have no specific goal you should generally be able to get at least 10 requests per second for a typical web page that runs a bunch of database queries.
I am studying various ASP.NET deployment approaches, and I have a basic question: is there any rule of thumb for defining the environment? What could be called a 'good' setup if I have to support 1000 concurrent users (requests)?
I understand that there are many factors, such as how the application is designed. But assuming everything else is great, what configuration should I look for: which processor, how much RAM, and so on?
Also, how many concurrent users should the configuration below be able to support?
CPU: Dual 3.40 GHz Intel Xeon (Hyper-Threaded)
Memory : 3GB
OS: Windows Server 2003 SP2
Thanks for the help
Having been on both sides of the equation (web developer and hardware engineer), my current opinion is that the answer involves both of those sides as well.
Your hardware needs to be not only sufficient for general usage, but it also has to cope with reasonable unexpected peaks and failures - which means that it needs to be redundant, and in excess of your capacity planning.
Your software needs to be designed so it's easily made redundant - there's no point in speccing a tiered hardware architecture (now or for future planning) if the software is going to require a significant amount of changes to handle it.
Your software also needs to be designed so that sudden, unexpected peaks in resource usage don't become a regular occurrence without an external reason (e.g. a marketing campaign).
I know that you say you understand the non-hardware factors, but the real answer to your question is that there is no real way to answer it without knowing the other factors - each situation and circumstance is unique, and requires a unique solution.
However, in an effort to add generalised recommendations, try these:
CPU - choose something with a lot of cache, and with individual cache per core as well. This will do wonders to speed up the system. I typically go for dual-core, dual-processor at a minimum (for a total of 4 cores on two separate physical CPUs). Processor speed ratings don't really matter as much as you might think these days.
Memory - fast memory, a minimum of 8 GB of it. Use the smallest DIMMs possible for the server.
Harddisk - SAS 15K RPM at a minimum, RAID 6 for the data partition on one controller, RAID 1 or 6 for the system partition on another controller. Choose a good quality controller backed by a good support or warranty package - your controller is no good if it dies in 3 years time and you can't get a replacement.
But above all, don't just install the OS and the app and leave it be. Profile the setup as much as possible, and don't be afraid of making changes to optimise for your individual setup (within reason). Move your ASP.NET temporary files to a fast disk (or a RAM disk - if they are going to be rebuilt anyway, there is no point worrying about losing them). Move the database to a second server, with a crossover 1 Gbit link between the two. Turn off disk maintenance in the OS, and turn off services you do not need.
Good luck!
How much traffic can one web server handle? What's the best way to see if we're beyond that?
I have an ASP.Net application that has a couple hundred users. Aspects of it are fairly processor intensive, but thus far we have done fine with only one server to run both SqlServer and the site. It's running Windows Server 2003, 3.4 GHz with 3.5 GB of RAM.
But lately I've started to notice slowdowns at various times, and I was wondering what's the best way to determine whether the server is overloaded by the application's usage or whether I need to fix something in the application (I don't really want to spend a lot of time hunting down little optimizations if I'm just expecting too much from the box).
What you need is some info on capacity planning.
Capacity planning is the process of planning for growth and forecasting peak usage periods in order to meet system and application capacity requirements. It involves extensive performance testing to establish the application's resource utilization and transaction throughput under load. First, you measure the number of visitors the site currently receives and how much demand each user places on the server, and then you calculate the computing resources (CPU, RAM, disk space, and network bandwidth) that are necessary to support current and future usage levels.
If you have access to some profiling tools (such as those in the Team Suite edition of Visual Studio), you can try setting up a test server, running some synthetic requests against it, and seeing whether any specific part of the code takes unreasonably long to run.
You should probably check some graphs of CPU and memory usage over time before doing this, to see whether that is even the problem. (A number akin to the UNIX "load average" could be a useful metric; I don't know if Windows has anything quite like it. Basically, it is the average number of threads that want CPU time in every time slice.)
Also check the obvious, that you aren't running out of bandwidth.
Measure, measure, measure. Rico Mariani always says this, and he's right.
Measure req/sec, RAM, CPU, Sessions, etc.
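If you want to capture a few of those numbers from code rather than watching perfmon, something along these lines works (a rough sketch of my own; the counter and instance names shown are the common ones, but check them in perfmon on your box):

using System;
using System.Diagnostics;
using System.Threading;

class CounterSample
{
    static void Main()
    {
        var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
        var mem = new PerformanceCounter("Memory", "Available MBytes");
        var rps = new PerformanceCounter("ASP.NET Applications", "Requests/Sec", "__Total__");

        // Rate counters need two samples; the first NextValue() call returns 0.
        cpu.NextValue(); rps.NextValue();
        Thread.Sleep(1000);

        Console.WriteLine("CPU:          " + cpu.NextValue().ToString("F1") + " %");
        Console.WriteLine("Free memory:  " + mem.NextValue().ToString("F0") + " MB");
        Console.WriteLine("Requests/sec: " + rps.NextValue().ToString("F1"));
    }
}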
You may want to come up with a caching strategy (output caching, data caching, cache dependencies, and so on).
Also look at how your SQL Server is doing... indexes are a good place to start, but not the only thing to look at.
On that hardware, a .NET application should be able to serve about 200-400 requests per second. If you have only a few hundred users, I doubt you are seeing even 2 requests per second, so I think you have a lot of capacity on that box, even with SQL Server running.
Without knowing all of the details, I would say no, you will not see any performance improvement by adding servers.
By the way, if you're not using the Output Cache, I would start there.
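For a WebForms page, output caching can be as simple as the declarative <%@ OutputCache Duration="60" VaryByParam="None" %> directive at the top of the .aspx, or, if you prefer code, something like this sketch (the page name here is just an example I made up):

using System;
using System.Web;
using System.Web.UI;

public partial class ReportPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        Response.Cache.SetExpires(DateTime.Now.AddSeconds(60));   // cache the rendered page for 60 seconds
        Response.Cache.SetCacheability(HttpCacheability.Server);  // keep it in server memory only
        Response.Cache.SetValidUntilExpires(true);                // ignore client cache-control headers
    }
}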