Does high CPU usage indicate a software module is designed wrong?

We have a process that is computationally intensive. When it runs, it typically uses 99% of the available CPU. It is configured to take advantage of all available processors, so I believe this is OK. However, one of our customers is complaining because the high CPU utilization sets off alarms on the server where the process runs. I think there is nothing wrong with high CPU utilization per se. The CPU drops back to normal when the process stops running, and the process does run to completion (no infinite loops, etc.). I'm just wondering if I am on solid ground when I say that high CPU usage is not a problem per se.
Thank you,
Elliott

"if I am on solid ground when I say that high CPU usage is not a problem per se"
You are on solid ground.
"We have a process that is computationally intensive"
Then I'd expect high CPU usage.
"The CPU drops back to normal when the process stops running"
Sounds good so far.
Chances are that the systems your client is using are configured to send a notification when CPU usage goes over a certain limit, as this is sometimes indicative of a problem (and sustained high usage can cause overheating and associated problems).
If this is expected behavior, your client needs to adjust their monitoring - but you need to ensure that the behavior is as expected on their systems and that it is not likely to cause problems (ensure that high CPU usage is not sustained).
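One way to reconcile an alarm with an expected burst of work is to alert only on sustained load. The following is a minimal Python sketch of that idea, not the client's actual monitoring configuration; the function name, the 90% threshold, and the window of 5 samples are all hypothetical.

```python
def sustained_high(samples, threshold=90.0, min_consecutive=5):
    """Return True if `samples` contains at least `min_consecutive`
    readings in a row that are at or above `threshold` (percent CPU)."""
    run = 0
    for value in samples:
        # Extend the current run of high readings, or reset it.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A short burst does not trip the alarm; a long plateau does.
burst = [20, 99, 99, 99, 25, 20, 18]
plateau = [20, 99, 99, 99, 99, 99, 99]
print(sustained_high(burst))    # False
print(sustained_high(plateau))  # True
```

With a rule like this, a job that pegs the CPU and then finishes stays quiet, while a process stuck at 100% still raises the alarm.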

An alarm alone is not evidence of poor design. The real concern may be that the process starves other tasks on the system. Modern OSes usually take care of this by lowering the dynamic priority of the CPU-hungry process, so that tasks that demand less CPU time get higher priority. You could tell the customer to "nice" the process to start with, since you probably don't care whether it runs for 10 minutes or 12 minutes. Just my 2 cents :)
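As a rough illustration of "nicing" a job, here is a small Python sketch for Unix-like systems. The helper name run_niced and the stand-in child command are hypothetical; os.nice is the standard library call that raises the niceness of the calling process.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10):
    """Run `cmd` with its niceness raised by `increment` (Unix only).

    A higher niceness means a lower scheduling priority, so other
    processes win when they compete for the CPU.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


# Hypothetical stand-in for the real job: the child reports its own niceness.
child = [sys.executable, "-c", "import os; print(os.nice(0))"]
result = run_niced(child, increment=10)
print("child niceness:", result.stdout.strip())
```

The job still uses every idle cycle, so it finishes about as fast on a quiet machine, but it yields more readily when something else needs the CPU.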

Related

Limit CPU & Memory for *nix Process

Is it possible to limit CPU and memory for a *nix process?
The CPU limit might look like "use no more than 10% of one core".
The memory limit might look like "use no more than 100 MB". The OS may either enforce the limit or kill the process if it tries to exceed it; both are fine.
Any *nix that can do this would be fine.
It seems possible to implement this with virtual machines, but that is not acceptable because the overhead is too high.
If you happen to use Solaris, the ability to limit resource usage is a native feature.
Memory (RAM) usage can be capped with the rcap.max-rss attribute, while CPU usage can be limited per project with the project.cpu-cap resource control.
Note that Solaris also allows OS level virtualization (a.k.a. zones) which have no significant overhead, if any, compared to a bare metal OS instance.
Resource capping is part of Solaris zones configuration.
Try CPULimit
cpulimit is a simple program which attempts to limit the cpu usage of a process (expressed in percentage, not in cpu time). This is useful to control batch jobs, when you don't want them to eat too much cpu. It does not act on the nice value or other scheduling priority stuff, but on the real cpu usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly.
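Neither the Solaris features nor cpulimit is shown here. As a different, portable technique, POSIX resource limits can cap a child process's address space and its total CPU seconds. Note that RLIMIT_CPU limits cumulative CPU time, not a percentage of a core, so it only partly matches the question. The sketch below assumes Linux and Python's standard resource module; run_limited and the two sample workloads are hypothetical names.

```python
import resource
import subprocess
import sys


def run_limited(cmd, max_bytes, max_cpu_seconds):
    """Run `cmd` with caps on address space and total CPU time (Unix only)."""

    def apply_limits():
        # Runs in the child just before exec. Only the soft limit is
        # lowered, and it is never set above the existing hard limit.
        for which, wanted in (
            (resource.RLIMIT_AS, max_bytes),
            (resource.RLIMIT_CPU, max_cpu_seconds),
        ):
            _, hard = resource.getrlimit(which)
            soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
            resource.setrlimit(which, (soft, hard))

    return subprocess.run(cmd, preexec_fn=apply_limits, capture_output=True, text=True)


# Hypothetical workloads: one small, one that asks for 2 GiB under a 1 GiB cap.
small = [sys.executable, "-c", "print('ok')"]
greedy = [sys.executable, "-c", "x = bytearray(2 * 1024**3)"]

print(run_limited(small, 1024**3, 10).returncode)   # 0
print(run_limited(greedy, 1024**3, 10).returncode)  # non-zero on Linux: the allocation fails
```

When the memory cap is hit, the allocation fails inside the process; when the CPU cap is hit, the kernel sends the process SIGXCPU.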

Single-CPU programs running on Hyper-Threading-enabled quadcore CPU

I'm a researcher in statistical pattern recognition, and I often run simulations that run for many days. I'm running Ubuntu 12.04 with Linux 3.2.0-24-generic, which, as I understand, supports multicore and hyper-threading. With my Intel Core i7 Sandy Bridge Quadcore with HTT, I often run 4 simulations (programs that take a long time) at the same time. Before I ask my question, here are the things that I already (think I) know.
My OS (Ubuntu 12.04) detects 8 CPUs due to hyper-threading.
The scheduler in my OS is clever enough never to schedule two programs on two logical (virtual) cores belonging to the same physical core, because the OS supports SMT (Simultaneous Multi-Threading).
I have read the Wikipedia page on Hyper-Threading.
I have read the HowStuffWorks page on Sandy Bridge.
OK, my question is as follows. When I run 4 simulations (programs) on my computer at the same time, they each run on a separate physical core. However, due to hyper-threading, each physical core is split into two logical cores. Therefore, is it true that each of the physical cores is only using half of its full capacity to run each of my simulations?
Thank you very much in advance. If any part of my question is not clear, please let me know.
This answer is probably late, but I see that nobody offered an accurate description of what's going on under the hood.
To answer your question, no, one thread will not use half a core.
When only one thread is running on a core, that one thread can saturate the whole core's processing power.
Assume thread 1 and thread 2 belong to core #0. Thread 1 can saturate the whole core's processing power while thread 2 waits for execution resources to become free. As a simplified model, think of it as serialized execution rather than fully parallel execution.
At a glance, it looks like that extra thread is useless. I mean, the core can process 1 thread at once, right?
Correct, but there are situations in which the cores are actually idling because of 2 important factors:
cache miss
branch misprediction
Cache miss
When it receives a task, the CPU searches inside its own cache for the memory addresses it needs to work with. In many scenarios the memory data is so scattered that it is physically impossible to keep all the required address ranges inside the cache (since the cache does have a limited capacity).
When the CPU doesn't find what it needs inside the cache, it has to access the RAM. The RAM itself is fast, but it pales compared to the CPU's on-die cache. The RAM's latency is the main issue here.
While the RAM is being accessed, the core is stalled. It's not doing anything. This is not noticeable because all these components work at a ridiculous speed anyway and you wouldn't notice it through some CPU load software, but it stacks additively. One cache miss after another and another hampers the overall performance quite noticeably.
This is where the second thread comes into play. While the core is stalled waiting for data, the second thread moves in to keep the core busy. Thus, you mostly negate the performance impact of core stalls.
I say mostly because the second thread can also stall the core if another cache miss happens, but the likelihood of 2 threads missing the cache in a row instead of 1 thread is much lower.
Branch misprediction
Branching is when a code path has more than one possible outcome. The most basic branching code is an if statement.
Modern CPUs have branch predictors built into the hardware that try to predict the execution path of a piece of code. These predictors are quite sophisticated, and although I don't have solid data on prediction rates, I do recall reading some articles a while back stating that Intel's Sandy Bridge architecture has an average successful branch prediction rate of over 90%.
When the CPU hits a piece of branching code, it chooses one path (the path the predictor thinks is the right one) and executes it. Meanwhile, another part of the core evaluates the branching expression to see whether the branch predictor was right. This is called speculative execution.
This works similarly to 2 different threads: one evaluates the expression, and the other executes one of the possible paths in advance.
From here we have 2 possible scenarios:
The predictor was correct. Execution continues normally from the speculative branch which was already being executed while the code path was being decided upon.
The predictor was wrong. The entire pipeline that was processing the wrong branch has to be flushed, and execution starts over from the correct branch.
OR, the readily available thread can come in and simply execute while the mess caused by the misprediction is resolved. This is the second use of hyperthreading.
Branch prediction on average speeds up execution considerably since it has a very high rate of success. But performance does incur quite a penalty when the prediction is wrong.
Branch prediction is not a major factor of performance degradation since, like I said, the correct prediction rate is quite high.
But cache misses are a problem and will continue to be a problem in certain scenarios.
From my experience hyperthreading does help out quite a bit with 3D rendering (which I do as a hobby). I've noticed improvements of 20-30% depending on the size of the scenes and materials/textures required. Huge scenes use huge amounts of RAM making cache misses far more likely. Hyperthreading helps a lot in overcoming these misses.
Since you are running a Linux kernel, you are in luck, because the scheduler is smart enough to make sure your tasks are divided between your physical cores.
Linux became hyper-threading aware in kernel 2.4.17 (ref: http://kerneltrap.org/node/391).
Note that the reference is from the old O(1) scheduler. Linux now uses the CFS scheduling algorithm, which was introduced in kernel 2.6.23 and should be even better.
But as already suggested, you can experiment by disabling hyper-threading in the BIOS to see whether your particular workload runs faster or slower with it enabled. If you start 8 tasks instead of 4, you will probably find that the total execution time for 8 tasks with hyper-threading is shorter than two separate runs of 4 tasks, but again, the best thing to do is to experiment. Good luck!
If you really want just 4 dedicated cores, you should be able to disable hyper-threading in your BIOS settings. Also, and this part I'm less clear on, I believe that the processor is smart enough to do more work on a single thread if its second logical core is idle.
No, it's not exactly true. A hyperthreaded core is not two cores. Some things can run in parallel, but not as much as on two separate cores.
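If you would rather check the core layout yourself than rely on the scheduler, Linux exposes which logical CPUs share a physical core under sysfs. The sketch below is a minimal Python example, assuming Linux and the thread_siblings_list files under /sys/devices/system/cpu; the function names are hypothetical.

```python
import glob
import re


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-1" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(siblings_by_cpu):
    """Given {logical_cpu: siblings_text}, return one logical CPU per physical core."""
    cores = {tuple(parse_cpu_list(text)) for text in siblings_by_cpu.values()}
    return sorted(group[0] for group in cores)


def read_sibling_lists():
    """Read sibling lists from sysfs (Linux only; returns {} if none are found)."""
    result = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


siblings = read_sibling_lists()
if siblings:
    print("one logical CPU per physical core:", one_cpu_per_core(siblings))
```

On Linux, each simulation could then pin itself to one of the returned CPUs with os.sched_setaffinity(0, {cpu}), so that no two simulations share a physical core.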

"System Idle Process" eats CPU on a high threading application

I have a multi-threaded web application with about 1000~2000 threads in a production environment.
I expect CPU usage on w3wp.exe but System Idle Process eats CPU. Why?
The Idle process isn't a real process, and it doesn't "eat" your CPU time. The %CPU you see next to it is actually unused CPU (more or less).
The poor performance of your application is most likely due to your 2000 threads. Windows (or indeed any operating system) was never meant to run so many threads at a time. You're wasting most of the time just context switching between them, with each thread getting a couple of milliseconds of processing time every ~30 seconds (15 ms * 2000 = 30 sec!).
Rethink your application.
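The 30-second figure above is a worst-case estimate. It assumes that all 2000 threads are runnable at once, that each uses a full quantum of about 15 ms, and that they share a single core in simple round-robin order. A small Python sketch of that arithmetic, with a hypothetical function name:

```python
def full_rotation_seconds(runnable_threads, quantum_ms, cores=1):
    """Worst-case time for every runnable thread to get one full quantum,
    assuming simple round-robin scheduling and no thread yielding early."""
    return runnable_threads * quantum_ms / 1000 / cores


print(full_rotation_seconds(2000, 15))           # 30.0
print(full_rotation_seconds(2000, 15, cores=8))  # 3.75
```

Threads that are blocked on I/O or locks do not take a turn, so the real wait is usually shorter, but the estimate shows how quickly latency grows with the number of runnable threads.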
The idle process simply holds processor time until a program needs it; it's not actually eating any cycles at all. You can think of the System Idle time as "available CPU".
System Idle Process is not a real process, it represents unused processor time.
This means that your app doesn't utilize the processor completely. It may be memory-bound or I/O-bound; possibly the threads are waiting for each other, or for external resources. Context switching overhead could also be a culprit: unless you have 2000 cores, the threads are not all running at the same time but are assigned time slices by the task scheduler, and this also takes some time.
You have not provided a lot of details, so I can only speculate at this point. I would say it is likely that most of those threads are doing nothing. The ones that are doing something are probably IO bound, meaning that they spend most of their time waiting for an external resource to respond.
Now let's talk about the "1000~2000 threads". There are very few cases (maybe none) where having that many threads is a good idea. I think your current issue is a perfect example of why. Most of those threads are (apparently, anyway) doing nothing but wasting resources. If you want to process multiple tasks in parallel, especially if they are IO bound, then it is better to take advantage of pooled resources such as the ThreadPool or the Task Parallel Library.
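The answer refers to .NET's ThreadPool and Task Parallel Library. The same pooling idea can be sketched in Python with concurrent.futures.ThreadPoolExecutor; this is an analogy, not the .NET API. The task function, the 1 ms sleep standing in for an I/O wait, and the pool size of 16 are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id):
    """Hypothetical I/O-bound task: the sleep stands in for a network or disk wait."""
    time.sleep(0.001)
    return request_id, threading.current_thread().name


# 2000 tasks share 16 worker threads instead of getting one thread each.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(handle_request, range(2000)))

workers_used = {name for _, name in results}
print(len(results), "tasks handled by", len(workers_used), "threads")
```

The work is queued and a small, fixed set of threads drains the queue, so the scheduler has far fewer threads to switch between.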

How to go from CPU time to CPU utilization?

I'm trying to detect runaway threads in my own application and shut them down before they render the machine inaccessible.
However, I can only get the CPU time for a thread; that is a limitation of the API I'm using. Is there any way to estimate CPU utilization from that data?
I was thinking about comparing it to real time: if the two are close, then that thread is loading the CPU too much. What do you think about that heuristic? Will it work?
CPU time divided by real time will give you CPU utilization.
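In practice the ratio is most useful over a recent sampling interval rather than over the thread's whole lifetime, so that a thread that has only just started spinning is still caught. Here is a minimal Python sketch of that calculation; the helper names are hypothetical, and time.thread_time stands in for whatever per-thread CPU-time call your API provides.

```python
import time


def cpu_utilization(cpu_start, cpu_end, wall_start, wall_end):
    """Fraction of wall-clock time spent on the CPU over one sampling interval."""
    wall = wall_end - wall_start
    if wall <= 0:
        return 0.0
    return (cpu_end - cpu_start) / wall


def measure(func):
    """Run `func` in the current thread and return its CPU utilization."""
    wall_start, cpu_start = time.perf_counter(), time.thread_time()
    func()
    wall_end, cpu_end = time.perf_counter(), time.thread_time()
    return cpu_utilization(cpu_start, cpu_end, wall_start, wall_end)


# A sleeping thread should score near 0; a busy loop scores near 1 on an idle machine.
print("sleeping:", round(measure(lambda: time.sleep(0.2)), 2))
print("spinning:", round(measure(lambda: sum(range(2_000_000))), 2))
```

For a single thread the ratio tops out near 1.0, which means one fully used core. If you sum CPU time over a whole multi-threaded process, the ratio can exceed 1.0 on a multi-core machine.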

Should non-blocking code push the CPU to 100%?

I've been thinking today about NodeJS and its attitude towards blocking, and it got me thinking: if a block of code is purely non-blocking, say calculating some really long algorithm with all variables present on the stack, etc., should it push a single non-hyperthreaded core to 100% CPU, as Windows Task Manager defines it, as it aims to complete the task as quickly as possible? Say that this is a calculation that can generally take minutes.
Yes, it should. The algorithm should run as fast as it can. It's the operating system's job to schedule time to other processes if necessary.
If your non-blocking computation intensive code doesn't use 100% of the CPU then you are wasting cycles in the idle task. It always irritates me to see the idle task using 99% of the CPU.
As long as the CPU is "given" to other processes when they need it for their calculations, I suppose it's OK: why not use the CPU if it's available and there is some work to do?
As RAM can be paged out to disk, all applications are potentially blocking. This would happen if the algorithm uses more RAM than available on the system. As a result, it won't hit 100%.

Resources