Confusion about the performance counter

Confusion about the performance counter - opencl

I am working on a program written in OpenCL and running on Fusion APU (CPU+GPU on one die). I wan to get some performance counters such as instructions number, branch number and so on. I have two tools on hand: AMD APP Profiler and CodeAnalyst. When I use the APP Profiler, I found that it seems can only provide instructions counter for GPU, cannot for CPU. Then I use CodeAnalyst, but then three confusions occurred.
On App Profiler, it can give the number of ALUInsts (i.e. the number of executed ALU instructions per work-item) is about 70000. The whole thread space on GPU has 8192 threads, so I intuitively think there are 70000 * 8192 instructions executed by GPU. Is that right?
When I use CodeAnalyst to measure the instructions for the same program on CPU part, it just gave "Ret inst", "Ret branch" such kind of counters, but I am not sure about one thing: this program runs on both CPU and GPU at the same time, what are these counters for? For CPU only, for GPU only? or the sum?
No matter what these counters for, I found that the value of Ret Inst (i.e. retired instructions) is about 40000, it seems too small for the whole program, I guess the instructions for a program should be at order of billions, how it could be only 4w? The attached pic shows the results.
Is there any people can help me resolve these confusions, I am just a tyro here, wish kind help from all of you. Thanks!

Related

OpenCL shared memory optimisation

I am solving a 2d Laplace equation using OpenCL.
The global memory access version runs faster than the one using shared memory.
The algorithm used for shared memory is same as that in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.

If your global-memory really runs faster than your local-memory version (assuming both are equally optimized depending on the memory space you're using), maybe this paper could answer your question.
Here's a summary of what it says:
Usage of local memory in a kernel add another constraint to the number of concurrent workgroups that can be run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVidia-parlance, each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall-time to proceed, but your GPU will be completely busy because it is running more of them concurrently.

No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that global accesses in your kernel are being coalesced which yields better performance.

Using shared memory (memory shared with CPU) isn't always going to be faster. Using a modern graphics card It would only be faster in the situation that the GPU/CPU are both performing oepratoins on the same data, and needed to share information with each-other, as memory wouldn't have to be copied from the card to the system and vice-versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in local memory (GDDR5) exclusively since the GPU's memory will not only likely be much faster than your systems, there will not be any latency caused by reading memory over the PCI-E lane.
Think of the Graphics Card's memory as a type of "l3 cache" and your system's memory a resource shared by the entire system, you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer, I've never even written Hello World in these applications. I've only read a few white papers, it's just common sense (or maybe my Computer Science degree is useful after all).

Single-CPU programs running on Hyper-Threading-enabled quadcore CPU

I'm a researcher in statistical pattern recognition, and I often run simulations that run for many days. I'm running Ubuntu 12.04 with Linux 3.2.0-24-generic, which, as I understand, supports multicore and hyper-threading. With my Intel Core i7 Sandy Bridge Quadcore with HTT, I often run 4 simulations (programs that take a long time) at the same time. Before I ask my question, here are the things that I already (think I) know.
My OS (Ubuntu 12.04) detects 8 CPUs due to hyper-threading.
The scheduler in my OS is clever enough never to schedule two programs to run on two logical (virtual) cores belonging to the same physical core, because the OS supports SMP (Simultaneous Multi-Threading).
I have read the Wikipedia page on Hyper-Threading.
I have read the HowStuffWorks page on Sandy Bridge.
OK, my question is as follows. When I run 4 simulations (programs) on my computer at the same time, they each run on a separate physical core. However, due to hyper-threading, each physical core is split into two logical cores. Therefore, is it true that each of the physical cores is only using half of its full capacity to run each of my simulations?
Thank you very much in advance. If any part of my question is not clear, please let me know.

This answer is probably late, but I see that nobody offered an accurate description of what's going on under the hood.
To answer your question, no, one thread will not use half a core.
One thread can work inside the core at a time, but that one thread can saturate the whole core processing power.
Assume thread 1 and thread 2 belong to core #0. Thread 1 can saturate the whole core's processing power, while thread 2 waits for the other thread to end its execution. It's a serialized execution, not parallel.
At a glance, it looks like that extra thread is useless. I mean the core can process 1 thread at once right?
Correct, but there are situations in which the cores are actually idling because of 2 important factors:
cache miss
branch misprediction
Cache miss
When it receives a task, the CPU searches inside its own cache for the memory addresses it needs to work with. In many scenarios the memory data is so scattered that it is physically impossible to keep all the required address ranges inside the cache (since the cache does have a limited capacity).
When the CPU doesn't find what it needs inside the cache, it has to access the RAM. The RAM itself is fast, but it pales compared to the CPU's on-die cache. The RAM's latency is the main issue here.
While the RAM is being accessed, the core is stalled. It's not doing anything. This is not noticeable because all these components work at a ridiculous speed anyway and you wouldn't notice it through some CPU load software, but it stacks additively. One cache miss after another and another hampers the overall performance quite noticeably.
This is where the second thread comes into play. While the core is stalled waiting for data, the second thread moves in to keep the core busy. Thus, you mostly negate the performance impact of core stalls.
I say mostly because the second thread can also stall the core if another cache miss happens, but the likelihood of 2 threads missing the cache in a row instead of 1 thread is much lower.
Branch misprediction
Branch prediction is when you have a code path with more than one possible result. The most basic branching code would be an if statement.
Modern CPUs have branch prediction algorithms embedded into their microcode which try to predict the execution path of a piece of code. These predictors are actually quite sophisticated and although I don't have solid data on prediction rate, I do recall reading some articles a while back stating that Intel's Sandy Bridge architecture has an average successful branch prediction rate of over 90%.
When the CPU hits a piece of branching code, it practically chooses one path (path which the predictor thinks is the right one) and executes it. Meanwhile, another part of the core evaluates the branching expression to see if the branch predictor was indeed right or not. This is called speculative execution.
This works similarly to 2 different threads: one evaluates the expression, and the other executes one of the possible paths in advance.
From here we have 2 possible scenarios:
The predictor was correct. Execution continues normally from the speculative branch which was already being executed while the code path was being decided upon.
The predictor was wrong. The entire pipeline which was processing the wrong branch has to be flushed and start over from the correct branch.
OR, the readily available thread can come in and simply execute while the mess caused by the misprediction is resolved. This is the second use of hyperthreading.
Branch prediction on average speeds up execution considerably since it has a very high rate of success. But performance does incur quite a penalty when the prediction is wrong.
Branch prediction is not a major factor of performance degradation since, like I said, the correct prediction rate is quite high.
But cache misses are a problem and will continue to be a problem in certain scenarios.
From my experience hyperthreading does help out quite a bit with 3D rendering (which I do as a hobby). I've noticed improvements of 20-30% depending on the size of the scenes and materials/textures required. Huge scenes use huge amounts of RAM making cache misses far more likely. Hyperthreading helps a lot in overcoming these misses.

Since you are running on a Linux kernel you are in luck because the scheduler is smart enough to make sure your tasks is divided on between your physical cores.
Linux became hyperthredding aware in kernel 2.4.17 ( ref: http://kerneltrap.org/node/391 )
Note that the reference is from the old O(1) scheduler. Linux now uses the CFS scheduling algorithm which was introduced in kernel 2.6.23 and should be even better.
But as already suggested you can experiment by disabling hyper threading in bios and see if your particular workload runs faster or slower with or without hyperthreading enabled. If you start 8 tasks instead of 4 you will probably find that the total executing time for 8 tasks on hyperthreading is faster than two separate runs with 4 tasks but again the best thing to do is to experiment. Good luck!

If you are really want just 4 dedicated cores, you should be able to disable hyperthreading in your BIOS page. Also, and this part I'm less clear on, I believe that the processor is smart enough to do more work on a single thread if its second logical core is idle.

No, it's not exactly true. A hyperthreaded core is not two cores. Some things can run in parallel, but not as much as on two separate cores.

OpenCL and multiple video cards

My understanding of the differences between CPUs and GPUs is that the GPUs are not general purpose processors such that if a video card contains 10 GPUs, each GPU actual share the same program pointer and to optimize parallelism on the GPU I need to ensure each GPU is actually running the same code.
Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
My question is, how does this work on multiple cards? At the speed at which they operate at, doesn't the hardware make a slight difference in execution times such that a calculation on one GPU on one card may end quicker or slower than the same calculation on another GPU on another card?
thanks

Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
This is not true. Different threads on a GPU may complete at different times due to differences in memory access latency, for example. That is why there are synchronization primitives in OpenCL such as the barrier command. You can never assume that your threads are running precisely in parallel.
The same is true for multiple GPUs. There is no guarantee that they are in sync, so you will need to rely on API calls such as clFinish to explicitly synchronize their work.

I think you may be confused about how threads work on a GPU. First to address the issue of multiple GPUs. Multiple GPUs NEVER share the program pointer, so they will almost never complete a kernel at the same time.
On a single GPU, only threads that are executing ON THE SAME COMPUTE UNIT (or SM in NVIDIA parlance) AND are part of the same warp/wavefront are guaranteed to execute in sync.
You can never really count on this, but for some devices the compiler can determine that will be the case (I am specifically thinking about some AMD devices, as long as the worgroup size is hardcoded to 64).
In any case, as #vocaro pointed out, that's why you need to use a barrier for local memory.
To emphasize, even on the same GPU, threads are not executing in parallel across the whole device - only within each compute unit.

OpenCL- waste of host computing power

I am new to OpenCL, please tell me that the host cpu can be used only for allocating memory to the device, or we can use it can as an openCL device. (Because after the allocation is done, the host cpu will be idle).

You can use a cpu as a compute device. Opencl even allows multicore/processor systems to segment cores into separate compute units. I like to use this feature to divide the cpus on my system into groups based on NUMA nodes. It is possible to divide a cpu into compute devices which all share the same level of cache memory (L1, L2, L3 or L4).
You need a platform that supports it, such as AMD's SDK. I know there are ways to have Nvidia and AMD platforms on the same machine, but I have never had to do so myself.
Also, the opencl event/callback system allows you to use your cpu as you normally would while the gpu kernels are executing. In this way, you can use openmp or any other code on the host while you wait for the gpu kernel to finish.

There's no reason the CPU has to be idle, but it needs a separate job to do. Once you've submitted work to OpenCL you can:
Get on with something else, like preparing the next set of work, or performing calculation on something completely different.
Have the CPU set up as another compute device, and so submit a piece of work to it.
Personally I tend to find myself needing the first case more often as it's rare I find myself with two tasks that are independent and lend themselves to OpenCL style. The trick is keeping things balanced so you're not waiting a long time for the GPU task to finish, or having the GPU idle while the CPU is getting on with other work.
It's the same problem OpenGL coders had to conquer. Avoiding being CPU or GPU bound, and balancing between the two for best performance.

What is the highest number of threads that is reasonable to simultaneously run in Jmeter?

I want to use the highest possible number of threads (to use less computers) but without making the bottleneck to be in the client.

JMeter can simulate a very High Load provided you use it right.
Don't listen to Urban Legends that say JMeter cannot handle high load.
Now as for answer, it depends on:
your machine power
your jvm 32 bits or 64 bits
your jvm allocated memory -Xmx
your test plan ( lot of beanshell, post processor, xpath ... Means lots of cpu)
your os configuration (tunable)
Gui / non gui mode
So there is no theorical answer but following Best Practices will ensure JMeter performs well.
Note that with jmeter you can distribute load through remote testing, read:
Remote Testing > 15.4 Using a different sample sender
And finally use cloud based testing if it's not enough.
Read this for tuning tips:
http://www.ubik-ingenierie.com/blog/jmeter_performance_tuning_tips/
Read this book for doing load testing and using JMeter correctly.

I have used JMeter a fair bit and found it is not great at generating really high load. On a 2Ghz Core2 Duo with 2Gb memory you can reasonably expect about 100 threads.
That being said, it is best to run it on your hardware so that the CPU of the PC does not peak at 100% - a stable 80%-90% is best otherwise the results are affected.
I have also tried WAPT 5 - it successfully ran 1000+ threads from the same PC. It is not free but it is more useable than JMeter but doesn't have all of the features.
Outdated answer since at least version 2.6 see https://stackoverflow.com/a/11922239/460802 for a more up to date one.

The JMeter Wiki reports cases where JMeter was used with as much as 1000 threads. I have used it with at most 100 threads, but the Links in the Wiki suggest resource reductions I never tried.

One of the issues we had with running JMeter on Windows XP was the Windows XP TCP Connection Limit. Limit should be removed in order to run use the JMeter to workstation’s full potential
More info here. AFAIK, does not apply to other OS.

I used JMeter since 2004 and i launched lot of load tests.
With PC Windows 7 64 bits 4Go RAM iCore5.
I think JMeter can support 300 to 400 concurrent threads for Http (Sampler) protocol with only one "Aggregate Report Listener" who writes in the log file results and timers between call pages.
For a big load test you could configure JMeter with slaves (load generators) like this
http://jmeter-plugins.org/wiki/HttpSimpleTableServer/
I have already done tests with 11 PC slaves to simulate 5000 threads.

I have not used JMeter, but the answer probably depends on your hardware. Best bet might be to establish metrics of performance, guess at the number of threads and then run a binary search as follows.
Source was Wikipedia.
Number guessing game...
This rather simple game begins something like "I'm thinking of an integer between forty and sixty inclusive, and to your guesses I'll respond 'High', 'Low', or 'Yes!' as might be the case." Supposing that N is the number of possible values (here, twenty-one as "inclusive" was stated), then at most questions are required to determine the number, since each question halves the search space. Note that one less question (iteration) is required than for the general algorithm, since the number is already constrained to be within a particular range.
Even if the number we're guessing can be arbitrarily large, in which case there is no upper bound N, we can still find the number in at most steps (where k is the (unknown) selected number) by first finding an upper bound by repeated doubling. For example, if the number were 11, we could use the following sequence of guesses to find it: 1, 2, 4, 8, 16, 12, 10, 11
One could also extend the technique to include negative numbers; for example the following guesses could be used to find −13: 0, −1, −2, −4, −8, −16, −12, −14, −13

It is more dependent on the kind of performance testing you do(load, spike, endurance etc) on a specific server (a little on hardware dependency)
Keep in mind around these parameters
- the client machine on which you are targeting the run of jmeter, there will be a certain amount of heap memory allocated, ensure to have a healthy allocation so that the script does not error out. The highest i had run on jmeter was 1500 on a local environment ( client - server arch), On a Web arch, the highest i had a run was based upon Non- functional requirement were limited to 250 threads,
so it ideally depends on the kinds of performance testing and deployment style and so on..

There is not standard number for this. The maximum number of threads that you can generate from one computer depends completely on the computer's hardware and the OS. The OS by default occupies certain amount of CPU and the RAM.
To find out the maximum threads your computer can handle you can prepare a sample test and run it with only a few threads. Then with each cycle of test run increase the number of threads gradually. During this you also need to monitor the CPU, RAM, Disk I/O and Network I/O of your computer. The moment any of these reach near or beyond 80% (Again for you to decide if near is okay for you or beyond), that is the maximum number of threads your computer can handle. To be on the safer side I would stop at the number when the resource utilization reaches 70%.

It'll depend on the hardware you run on as well as the underlying script. I've always felt that this fuzziness is the biggest problem with traditional load testing tools. If you've got a small budget ($200 or so gets you a LOT of testing), check out my company's load testing service, BrowserMob.
Besides our Real Browser Users (RBUs) which control thousands on actual browsers for the purpose of performance and load testing, we also have traditional virtual users (VUs). Scripts are written in JavaScript and can make various HTTP calls.
The reason I bring it up is that I always felt that the game of trying to figure out how many VUs you can fit on your load gen hardware is dangerous. It's so easy to get bad results without realizing it.
To solve that for BrowserMob, we took an extremely conservative approach on the number of VUs and RBUs per CPU core: no more than 1 browser or 50 threads per CPU core, and sometimes much less. In the world of cloud computing, CPU cycles are so cheap that it just doesn't make sense to try to overload machines.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex