I am running Dhrystone 2.1 on a Freescale i.MX6 Quad processor at 1 GHz. Below are the things I tried:
1. First, I ran Dhrystone by itself.
2. Then I ran Dhrystone again with an application running in the background.
In both cases I get the same DMIPS value, which I do not understand; in the second case the DMIPS should drop. Please let me know why.
You should think about why you expect the Dhrystone benchmark to perform worse with another program running in the background. If you want them to fight for CPU time, you need to make sure they are both scheduled on the same core: if they are scheduled on different cores, they will each receive 100% CPU time. Other reasons Dhrystone could run slower with an app in the background would be shared cache collisions or memory bandwidth contention. I would rule both of those out for this discussion, though, because Dhrystone is an extremely simple benchmark that doesn't require much memory bandwidth or cache space. So your best way of slowing it down is to schedule the other app on the same core and restrict both processes so they can't be scheduled elsewhere (for example with taskset, as sketched below). For more info on how to perform Dhrystone benchmarking on ARM, refer to this document:
http://infocenter.arm.com/help/topic/com.arm.doc.dai0273a/DAI0273A_dhrystone_benchmarking.pdf
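To actually pin both processes to the same core on Linux, taskset is one option. A minimal sketch, assuming taskset is available and using placeholder binary names:
$> taskset -c 0 ./dhrystone &
$> taskset -c 0 ./background_app &
With both restricted to core 0 they have to share it, and the DMIPS figure should finally drop.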
I'd like to know the optimal number of cores needed to build a project with GNU make.
I can use --max-load to tune for an existing system, but I want to know if doubling or tripling the core count and memory would improve build wall clock times.
If I could collect statistics on how many recipes make holds waiting for a free core to execute and how long they occupy the core, this could be turned into a standard job scheduling problem.
I don't think there's any way to answer your question, really. Maybe you can be more specific about what you'd like to know.
Obviously the more cores you have, assuming sufficient memory to support them, then the more recipes make can invoke in parallel without crushing your system.
If you have 2 cores and you run make -j300 then make will dutifully invoke 300 jobs at once and your operating system will dutifully attempt to run all of them at the same time. Most likely, your system will be swapping and context switching so much that it will make very little progress and it would take less wall clock time to run make -j2 instead.
On the other hand, if you have 256 cores then make -j300 is probably quite reasonable... assuming you have enough memory to ensure that all those jobs don't wait swapping memory out.
And of course, at some point (but probably far away from any reasonable number of cores you have unless you have a lot of money to spend) you will run into disk IO issues with so many compiler processes running at the same time trying to read source from the disk to compile.
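If you are worried about crushing a shared machine, GNU make can also throttle on load average. A sketch combining a job cap with the --max-load limit mentioned in the question, assuming GNU coreutils' nproc is available:
$> make -j$(nproc) -l $(nproc)
Here -l is the short form of --max-load: make stops spawning new jobs while the load average exceeds the given value.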
My go-to number is number of CPUs + 1. This is based on a lot of informal benchmarks and is usually very close to the optimal number: -j9 on a hyper-threaded four-core laptop, and -j49 on my usual production build server.
The + 1 means that make keeps all the CPUs occupied, even as jobs are being retired, and is usually a teensy-weensy bit faster than without the increment.
It also means that other users can use the same multiplier without melting the machine.
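If nproc is available, that rule can be written straight into the invocation (a sketch; note that nproc reports logical CPUs, including hyper-threads):
$> make -j$(($(nproc) + 1))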
Be aware though, that although -j49 ensures there are only 49 processes actually running, the parent make will potentially have many more child processes than that. For instance, a single compile may mean the shell is called, which calls a shell script, which calls the compiler driver, which calls the correct compiler stage. On some toolchains my -j49 builds have a peak of 245 child processes. A bit annoying when my ulimit max user processes is only 512.
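In bash you can check that ceiling and, hard limit permitting, raise it (a sketch):
$> ulimit -u          # show the current max user processes
$> ulimit -u 2048     # raise the soft limit for this shell, if allowed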
I'm benchmarking an MPI program with different compiler setups.
Right now I use Linux's time to do so:
$> $(which time) mpirun -v [executable]
The values I get look ok in terms of what I expected.
Are there any reasons why I should not be using time for this?
Measuring the needed CPU time is of main interest here.
I'm aware that benchmarking on a single machine is not necessarily going to be consistent with what's happening on a cluster, but this is out of scope.
You should not use time to measure the CPU time of an MPI program.
Firstly, that is not going to work in a distributed setup. It is not clear from your question whether you target a single node or a cluster, but that doesn't even matter: an MPI implementation may use whatever mechanism it likes for launching processes, even on a single node, so time may or may not include the CPU time of the actual application processes.
But there are more conceptual issues: what does CPU time even mean for an MPI program? It would be the sum of the CPU time of all processes, and that is a bad metric for benchmarking: it does not quantify improvement, and it does not correlate with overall runtime. For instance, a very imbalanced version of your code may use less CPU time but more wall time than a balanced one. Or enabling busy waiting instead of blocking may improve overall runtime but also increase the CPU time used. To really understand what is happening, and which process uses what kind of resources, you should resort to a proper parallel performance analysis tool.
In HPC, you are not going to be budgeted by CPU time but rather by reserved CPUs * walltime. So if you must use a one-dimensional metric, walltime is the way to go. You can use time mpirun ... to get that, although the accuracy won't be great for short-running applications.
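For example, a sketch using GNU time's verbose mode (the binary name and process count are placeholders; look for the "Elapsed (wall clock) time" line in the output):
$> /usr/bin/time -v mpirun -np 4 ./my_mpi_app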
We have a high-traffic website that generates a lot of I/O: within 10 minutes it has read over 10 GB of data (the w3wp process in question, as seen in Task Manager). For memory issues and application hangs I have been using WinDbg with success, but I don't know how to find the object(s)/method(s) within a process that are responsible for the highest I/O.
Is this even possible?
Edit
The question is: is there a way to profile I/O operations in a .NET assembly, e.g. a list of threads sorted by highest disk I/O (or something similar that would point me at where to look)?
ANTS Performance Profiler
I have used this tool with great success, tracking down the specific instructions that were chewing up ~512 GB of memory on a high-volume web farm within 5-10 minutes. Sounds like a very similar situation to yours.
Now, to be realistic - it's not going to magically solve your problem. It still requires a lot of setup, thorough analysis and detective work. But this tool definitely took the problem from "practically unsolvable" to "solvable within days".
Update:
As I mentioned in the comments (and Ben Emmett echoed), we can use ANTS to monitor memory, file system handles - pretty much any resource consumption and drill down the call stack to see the effects of specific routines.
I came across the tool AppDynamics Lite, which displays your application's call costs and performance in a visual way. It might help you find out which functions are making the most costly I/O operations.
Quoting:
Understand the health of your CLR with key metrics like response time, throughput, exception rate, and garbage collection time as well as key system resource like CPU, memory and disk I/O.
Worth giving it a shot, as there is a free 30-day trial. Hope it helps.
Ps: I'm not affiliated with AppDynamics in any way.
You can use the (free) Windows Performance Toolkit from Windows 8, which also runs on Windows Vista and later. There you can turn on system-wide profiling to see what is going on in all processes at once, with no instrumentation necessary. Only one reboot is required, to set an arcane registry key, which WPRUI.exe does automatically.
With XPerf you can enable I/O init stack walking so that a call stack is taken for every I/O that is started. The only issue is that the stacks will be broken for 64-bit processes, which means you will see only the first method above the BCL methods of your code, because of a Windows 7 bug in the stack-walking capabilities of the OS.
A workaround is to NGen your assemblies, move to Server 2012, or switch to x86 for profiling to see deeper call stacks.
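For reference, a minimal xperf session for disk I/O with init-time stacks might look like this (a sketch; the kernel flags and stackwalk event names are documented in the WPT help, and the trace file name is arbitrary):
> xperf -on PROC_THREAD+LOADER+DISK_IO+DISK_IO_INIT -stackwalk DiskReadInit+DiskWriteInit
(reproduce the heavy I/O here)
> xperf -d highio.etl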
You will see all file I/O and CPU activity even without any call stacks, along with the file names and how long the hard disk was in use. That should give you a good idea of which part of your app is causing the disk I/O; from the partial call stacks you should be able to pinpoint your issue even without full stacks.
The toolset will give you much more insight than any commercially available profiler, at the expense of having to learn how to use it. Since the call stacks do not end at your code or in user mode but in the kernel, you can also determine if e.g. the virus scanner is causing significant I/O delays. But you need to know how your processor works; this toolset was originally aimed at kernel devs, which explains why you see so many useless columns.
In the picture below you see file I/O and CPU consumption stacked. When you select your high-I/O file in the disk I/O graph, it highlights in the CPU consumption graph all related call stacks which were taken while the I/O was active. This way you can directly navigate from the I/O to your potentially blocked threads.
I'm a researcher in statistical pattern recognition, and I often run simulations that run for many days. I'm running Ubuntu 12.04 with Linux 3.2.0-24-generic, which, as I understand, supports multicore and hyper-threading. With my Intel Core i7 Sandy Bridge Quadcore with HTT, I often run 4 simulations (programs that take a long time) at the same time. Before I ask my question, here are the things that I already (think I) know.
My OS (Ubuntu 12.04) detects 8 CPUs due to hyper-threading.
The scheduler in my OS is clever enough never to schedule two programs onto two logical (virtual) cores belonging to the same physical core, because the OS supports SMT (simultaneous multithreading) scheduling.
I have read the Wikipedia page on Hyper-Threading.
I have read the HowStuffWorks page on Sandy Bridge.
OK, my question is as follows. When I run 4 simulations (programs) on my computer at the same time, they each run on a separate physical core. However, due to hyper-threading, each physical core is split into two logical cores. Therefore, is it true that each of the physical cores is only using half of its full capacity to run each of my simulations?
Thank you very much in advance. If any part of my question is not clear, please let me know.
This answer is probably late, but I see that nobody offered an accurate description of what's going on under the hood.
To answer your question, no, one thread will not use half a core.
Only one thread can execute inside the core at a time, but that one thread can saturate the whole core's processing power.
Assume thread 1 and thread 2 belong to core #0. Thread 1 can saturate the whole core's processing power while thread 2 waits for it to finish its execution. It's serialized execution, not parallel.
At a glance, that extra thread looks useless. I mean, the core can only process one thread at a time, right?
Correct, but there are situations in which the cores are actually idling because of 2 important factors:
cache miss
branch misprediction
Cache miss
When it receives a task, the CPU searches inside its own cache for the memory addresses it needs to work with. In many scenarios the memory data is so scattered that it is physically impossible to keep all the required address ranges inside the cache (since the cache does have a limited capacity).
When the CPU doesn't find what it needs inside the cache, it has to access the RAM. The RAM itself is fast, but it pales compared to the CPU's on-die cache. The RAM's latency is the main issue here.
While the RAM is being accessed, the core is stalled: it's not doing anything. This isn't noticeable because all these components work at ridiculous speeds anyway and you wouldn't spot it in CPU-load software, but it stacks up: one cache miss after another hampers overall performance quite noticeably.
This is where the second thread comes into play. While the core is stalled waiting for data, the second thread moves in to keep the core busy. Thus, you mostly negate the performance impact of core stalls.
I say mostly because the second thread can also stall the core if another cache miss happens, but the likelihood of 2 threads missing the cache in a row instead of 1 thread is much lower.
Branch misprediction
Branch prediction comes into play where a code path has more than one possible continuation. The most basic branching code would be an if statement.
Modern CPUs have branch prediction algorithms embedded into their microcode which try to predict the execution path of a piece of code. These predictors are actually quite sophisticated and although I don't have solid data on prediction rate, I do recall reading some articles a while back stating that Intel's Sandy Bridge architecture has an average successful branch prediction rate of over 90%.
When the CPU hits a piece of branching code, it practically chooses one path (path which the predictor thinks is the right one) and executes it. Meanwhile, another part of the core evaluates the branching expression to see if the branch predictor was indeed right or not. This is called speculative execution.
This works similarly to 2 different threads: one evaluates the expression, and the other executes one of the possible paths in advance.
From here we have 2 possible scenarios:
The predictor was correct. Execution continues normally from the speculative branch which was already being executed while the code path was being decided upon.
The predictor was wrong. The entire pipeline which was processing the wrong branch has to be flushed and start over from the correct branch.
OR, the readily available thread can come in and simply execute while the mess caused by the misprediction is resolved. This is the second use of hyperthreading.
Branch prediction on average speeds up execution considerably since it has a very high rate of success. But performance does incur quite a penalty when the prediction is wrong.
Branch prediction is not a major factor of performance degradation since, like I said, the correct prediction rate is quite high.
But cache misses are a problem and will continue to be a problem in certain scenarios.
From my experience, hyperthreading does help out quite a bit with 3D rendering (which I do as a hobby). I've noticed improvements of 20-30%, depending on the size of the scenes and the materials/textures required. Huge scenes use huge amounts of RAM, making cache misses far more likely, and hyperthreading helps a lot in overcoming them.
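If you want to see how often your own workload misses the cache or mispredicts branches, Linux's perf tool can count both. A sketch, with ./simulation as a placeholder binary (the generic event names may map differently on your CPU):
$> perf stat -e cycles,instructions,cache-misses,branch-misses ./simulation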
Since you are running a Linux kernel you are in luck, because the scheduler is smart enough to make sure your tasks are divided between your physical cores.
Linux became hyperthreading-aware in kernel 2.4.17 (ref: http://kerneltrap.org/node/391).
Note that the reference is from the old O(1) scheduler. Linux now uses the CFS scheduling algorithm which was introduced in kernel 2.6.23 and should be even better.
But as already suggested, you can experiment by disabling hyperthreading in the BIOS and seeing whether your particular workload runs faster or slower with it on or off. If you start 8 tasks instead of 4, you will probably find that the total execution time for 8 tasks with hyperthreading is shorter than two separate runs of 4 tasks, but again, the best thing to do is to experiment. Good luck!
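To see which logical CPUs share a physical core before experimenting, sysfs exposes the topology (a sketch; the pairing differs per machine, e.g. an output of 0,4 would mean logical CPUs 0 and 4 are siblings):
$> cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list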
If you really want just 4 dedicated cores, you should be able to disable hyperthreading in your BIOS page. Also, and this part I'm less clear on, I believe the processor is smart enough to do more work on a single thread when its second logical core is idle.
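As an alternative to a BIOS reboot, Linux can take the sibling logical CPUs offline at runtime. A sketch with a hypothetical CPU number (check thread_siblings_list first to find the actual siblings):
$> echo 0 | sudo tee /sys/devices/system/cpu/cpu4/online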
No, it's not exactly true. A hyperthreaded core is not two cores. Some things can run in parallel, but not as much as on two separate cores.
I've been thinking today about NodeJS and its attitude towards blocking, and it got me thinking: if a block of code is purely non-blocking, say running some really long algorithm with all its variables on the stack, should it push a single non-hyperthreaded core to 100% CPU (as Windows Task Manager reports it) as it aims to complete the task as quickly as possible? Say this is a calculation that can take minutes.
Yes, it should. The algorithm should run as fast as it can. It's the operating system's job to schedule time to other processes if necessary.
If your non-blocking computation intensive code doesn't use 100% of the CPU then you are wasting cycles in the idle task. It always irritates me to see the idle task using 99% of the CPU.
As long as the CPU is "given" to other processes when some need it for their own calculations, I suppose it's OK: why not use the CPU if it's available and there is work to do?
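If you are worried about starving interactive work, you can also lower the computation's priority so it only soaks up otherwise-idle cycles. A sketch, with ./long_computation as a placeholder:
$> nice -n 19 ./long_computation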
Since RAM can be paged out to disk, all applications are potentially blocking: this happens when the algorithm uses more RAM than is available on the system, and as a result it won't hit 100%.