OpenCL work-items and streaming processors

What is the relation between a work-item and a streaming processor (CUDA core)? I read somewhere that the number of work-items SHOULD greatly exceed the number of cores, otherwise there is no performance improvement. But why is this so? I thought 1 core represents 1 work-item. Can someone help me to understand this?
Thanks

GPUs and most other hardware tend to do arithmetic much faster than they can access most of their available memory. Having many more work-items than you have processors lets the scheduler stagger the memory accesses: while some work-items are still waiting on their data, those that have already read theirs keep the ALU hardware busy doing the processing.
Here is a good page about optimization in OpenCL. Scroll down to "2.4. Removing 'Costly' Global GPU Memory Access", where it goes into this concept.

The reason is mainly scheduling - a single core/processor/unit can usually run multiple threads and switch between them to hide memory latency (SMT). So it's generally a good idea for each core to have multiple threads queued up for it.
A thread will usually correspond to at least one work item, although depending on driver and hardware, multiple work-items might be combined into one thread, to make use of SIMD/vector capabilities of a core.
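To make the over-subscription concrete, here is a small host-side sketch in C (the device selection is simplified, and the factor of 1024 is an illustrative assumption, not a tuned value):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_uint compute_units;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Number of compute units (SMs on NVIDIA, CUs on AMD) -- note this is NOT
     * the number of "CUDA cores"; each compute unit contains many ALUs. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);

    /* Oversubscribe: launch many work-items per compute unit so that while
     * some wait on global memory, others keep the ALUs busy. */
    size_t global_size = (size_t)compute_units * 1024;
    printf("compute units: %u, suggested global size: %zu\n",
           compute_units, global_size);
    return 0;
}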

Related

OpenCL local_work_size NULL

When enqueueing an OpenCL kernel, local_work_size can be set to NULL, in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances.
Automatically calculating the local_work_size seems like a great feature (better than guessing a multiple of 64).
Does OpenCL's work group size choice tend to be optimal? Are there cases where it would be better to manually specify local_work_size?
This depends on how your kernel is written. Often, to get the best performance, your kernels need to make assumptions based on the local work size. For example, in convolution you want to use as much local memory as you can to avoid extra reads back to global memory, so you will want to handle as many threads as possible given the incoming kernel sizes and how much local memory your device has. Configuring your local work size based on incoming parameters such as the convolution kernel size can make the difference between a major speed-up and a marginal one. This is one reason why a language such as Renderscript Compute will never be able to provide performance close to optimized OpenCL/CUDA, which let the developer be aware of the hardware they are running on.
Also, you are not just guessing at the size. You can certainly make general assumptions, but you can achieve better performance by looking at the architecture you are running on (check the AMD/NVIDIA/Intel guides for each device) and optimizing for it. You can adjust this at runtime by tweaking your code to modify the OpenCL kernel source at runtime (since it is just a string), or you can ship multiple kernels and select the best one at runtime.
That said, using NULL for the work-group size is a great way to not worry about optimization and just test out acceleration on the GPU with little effort. You will almost certainly get much better performance if you are aware of the hardware, make better choices, and write your kernels with knowledge of the local work-group size.
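For reference, both variants go through the same clEnqueueNDRangeKernel call; a minimal host-side sketch (the queue, kernel, and problem size are assumed to exist already):

#include <CL/cl.h>

/* Enqueue a 1-D kernel over n items. Pass local == 0 to let the OpenCL
 * implementation choose the work-group size (local_work_size = NULL),
 * or a nonzero value to force an explicit work-group size. */
cl_int enqueue_1d(cl_command_queue queue, cl_kernel kernel,
                  size_t n, size_t local)
{
    const size_t global = n;   /* must be a multiple of local if local != 0 */
    return clEnqueueNDRangeKernel(queue, kernel,
                                  1,                     /* work_dim           */
                                  NULL,                  /* no global offset   */
                                  &global,
                                  local ? &local : NULL, /* NULL: runtime decides */
                                  0, NULL, NULL);        /* no wait list/event */
}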

CPU/Intel OpenCL performance issues, implementation questions

I have some questions that have been hanging in the air without an answer for a few days now. The questions arose because I have an OpenMP and an OpenCL implementation of the same problem. The OpenCL version runs perfectly on the GPU but has 50% less performance when run on the CPU (compared to the OpenMP implementation). A post already deals with the difference between OpenMP and OpenCL performance, but it doesn't answer my questions. At the moment I face these questions:
1) Is it really that important to have "vectorized kernel" (in terms of the Intel Offline Compiler)?
There is a similar post, but I think my question is more general.
As I understand it, the Intel compiler reporting a kernel as "vectorized" is not the same as the compiled binary containing vector/SIMD instructions; I checked the assembly code of my kernels, and there are a bunch of SIMD instructions either way. A vectorized kernel means that, by using SIMD instructions, you can execute 4 (SSE) or 8 (AVX) OpenCL "logical" threads in one CPU thread. This can only be achieved if ALL your data is stored consecutively in memory. But who has such perfectly sorted data?
So my question would be: Is it really that important to have your kernel "vectorized" in this sense?
Of course it gives a performance improvement, but if most of the computation-intensive parts of the kernel are already done by vector instructions, then you might get near-"optimal" performance anyway. I think the answer to my question lies in memory bandwidth: vector registers are probably a better fit for efficient memory access, and in that case the kernel arguments (pointers) have to be vectorized.
2) If I allocate data in local memory on a CPU, where will it be allocated? OpenCL shows the L1 cache as local memory, but it is clearly not the same kind of memory as a GPU's local memory. If it is stored in RAM/global memory, then there is no sense in copying data into it. If it were in the cache, some other process might flush it out... so that doesn't make sense either.
3) How are "logical" OpenCL threads mapped to real CPU software/hardware (Intel HTT) threads? Because if I have short-running kernels and the kernels are forked as in TBB (Threading Building Blocks) or OpenMP, then the fork overhead will dominate.
4) What is the thread fork overhead? Are new CPU threads forked for every "logical" OpenCL thread, or are the CPU threads forked once and reused for more "logical" OpenCL threads?
I hope I'm not the only one who is interested in these tiny things, and some of you might know bits of these problems. Thank you in advance!
UPDATE
3) At the moment the OpenCL overhead is more significant than OpenMP's, so heavy kernels are required for efficient runtime execution. In Intel OpenCL a work-group is mapped to a TBB thread, so one virtual CPU core executes a whole work-group (or thread block). A work-group is implemented as 3 nested for loops, where the innermost loop is vectorized, if possible. So you can imagine it as something like:
#pragma omp parallel for
for (wg = 0; wg < get_num_groups(2) * get_num_groups(1) * get_num_groups(0); wg++) {
    for (k = 0; k < get_local_size(2); k++) {
        for (j = 0; j < get_local_size(1); j++) {
            #pragma simd
            for (i = 0; i < get_local_size(0); i++) {
                ... work-load ...
            }
        }
    }
}
If the innermost loop can be vectorized, it advances in SIMD-sized steps:
for (i = 0; i < get_local_size(0); i += SIMD) {
4) Every TBB thread is forked once during OpenCL execution, and the threads are reused. Every TBB thread is tied to a virtual core, i.e. there is no thread migration during the computation.
I also accept natchouf's answer.
I may have a few hints to your questions. In my little experience, a good OpenCL implementation tuned for the CPU can't beat a good OpenMP implementation. If it does, you could probably improve the OpenMP code to beat the OpenCL one.
1) It is very important to have vectorized kernels. This is linked to your questions 3 and 4. If you have a kernel that handles 4 or 8 input values, you'll have far fewer work items (threads), and hence much less overhead. I recommend using the vector instructions and types provided by OpenCL (like float4, float8, float16) instead of relying on auto-vectorization.
Do not hesitate to use float16 (or double16): it will be mapped to 4 SSE or 2 AVX vectors and will divide the number of required work items by 16 (which is good for the CPU, but not always for the GPU: I use 2 different kernels for CPU and GPU); see the kernel sketch after this answer.
2) Local memory on a CPU is just RAM. Don't use it in a CPU kernel.
3 and 4) I don't really know, it will depend on the implementation, but the fork overhead seems significant to me.
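To illustrate the float16 point, a hypothetical pair of kernels (the names and the scaling operation are made up for the example):

/* Scalar version: one work-item per element. */
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           float a)
{
    size_t i = get_global_id(0);      /* global size = N            */
    out[i] = a * in[i];
}

/* float16 version: 1/16th as many work-items; each one carries a full
 * vector, which maps to 4 SSE or 2 AVX registers on a CPU. */
__kernel void scale_vec16(__global const float16 *in,
                          __global float16 *out,
                          float a)
{
    size_t i = get_global_id(0);      /* global size = N / 16       */
    out[i] = a * in[i];               /* one wide multiply per item */
}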
For question 3:
Intel groups logical OpenCL threads into one hardware thread, and the group size can vary between 4, 8, and 16.
A logical OpenCL thread maps to one SIMD lane of an execution unit; one execution unit has two SIMD engines with a width of 4.
Please refer to the following document for further details:
https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

MPI vs OpenMP for shared memory

Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding, I think all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to keep the code general so that it works in both distributed and shared settings? Also, if I use MPI for a shared-memory setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) heavily depends on the type of application you are running, and whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but rather about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
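As a minimal sketch of what "a few statements here and there" can look like (the function and loop are illustrative, not taken from the answer):

#include <omp.h>

/* One pragma in front of the hot loop is often all it takes.
 * Compile with, e.g., gcc -fopenmp. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}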
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, once you know you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
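A minimal sketch of that hybrid pattern (the loop body and data partitioning are placeholders, not a real workload; compile with, e.g., mpicc -fopenmp):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* Ask for an MPI library that tolerates threaded ranks. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP feeds the cores of this node... */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i + rank);   /* placeholder work */

    /* ...and MPI communicates between the nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}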
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for the bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology by using the same address space. Because MPI may have the same data replicated in memory (since processes can't share data), it may waste cache capacity on those duplicates.
On the other hand, if you partition your data correctly and each processor has a private cache, you may reach the point where your problem fits completely in cache. In this case you get superlinear speedups.
Speaking of caches: there are very different cache topologies on recent processors, so, as always: IT DEPENDS...

I need a short C program that runs slower on a processor with HyperThreading than on one without it

I want to write a paper on compiler optimizations for HyperThreading. The first step is to investigate why a processor with HyperThreading (simultaneous multithreading) can deliver poorer performance than a processor without this technology. The next step is to find an application that does better without HyperThreading, so I can run some hardware performance counters on it. Any suggestions on how or where I could find one?
So, to summarize: I know that the HyperThreading effect is somewhere between -10% and +30%. I need a C application that falls into the 10% performance-penalty range.
Thanks.
Probably the main drawback of hyperthreading is the effective halving of cache sizes. Each thread will be populating the cache, and so each, in effect, has half the cache.
To create a programme which runs worse with hyperthreading than without, create a single-threaded programme which performs a task that just fits inside the L1 cache. Then add a second thread, which shares the workload but works from "the other end" of the data, as it were. You will find performance falls through the floor - this is because both threads now must access the L2 cache.
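A sketch of that experiment (the buffer size, pass count, and use of two separate per-thread buffers are illustrative assumptions; pinning both threads onto the two hardware threads of one physical core is left to taskset or similar):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_BYTES (32 * 1024)     /* roughly one L1 data cache       */
#define PASSES    200000          /* enough repetitions to measure   */

/* Repeatedly walk a buffer, touching one byte per cache line. */
static void *walk(void *arg)
{
    volatile uint8_t *buf = arg;
    uint64_t sum = 0;
    for (int p = 0; p < PASSES; p++)
        for (size_t i = 0; i < BUF_BYTES; i += 64)
            sum += buf[i];
    return (void *)(uintptr_t)sum;
}

int main(void)   /* compile with: cc -O2 -pthread thrash.c */
{
    /* Two separate working sets: each fits in L1 alone, but together they
     * exceed it, so two hyperthreads sharing a core evict each other. */
    uint8_t *a = calloc(1, BUF_BYTES), *b = calloc(1, BUF_BYTES);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, walk, a);
    pthread_create(&t2, NULL, walk, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    free(a); free(b);
    puts("done");
    return 0;
}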
Hyperthreading can dramatically improve or worsen performance. It is completely dependent on use. None of this -10%/+30% stuff - that's ridiculous.
I'm not familiar with compiler optimizations for HT, nor with the differences between the i7's HT and the P4's, as David pointed out. However, you can expect some general behaviors.
Context switching is very expensive. So if you have one core and run two threads on it simultaneously, switching back and forth between the threads always carries a performance penalty. However, threads do not use the core all the time. For example, if a thread reads or writes memory, it just waits for the memory access to finish without using the core, usually for more than 100 cycles. There are many other cases where a thread needs to stall like this, e.g., I/O operations, data dependencies, etc. Here HT helps, because it can set aside the waiting (or blocked) thread and execute another thread instead.
Therefore, you can see that if all threads are really unlikely to be blocked, then the context switching only causes overhead. Think of a very compute-bound application working on a small set of data.

Hardware Assisted Garbage Collection

I was thinking on ways that functional languages could be more tied directly to their hardware and was wondering about any hardware implementations of garbage collection.
This would speed things up significantly as the hardware itself would implicitly handle all collection, rather than the runtime of some environment.
Is this what LISP Machines did? Has there been any further research into this idea? Is this too domain specific?
Thoughts? Objections? Discuss.
Because of Generational Collection, I'd have to say that tracing and copying are not huge bottlenecks to GC.
What would help, is hardware-assisted READ barriers which take away the need for 'stop the world' pauses when doing stack scans and marking the heap.
Azul Systems has done this: http://www.azulsystems.com/products/compute_appliance.htm
They gave a presentation at JavaOne on how their hardware modifications allowed for completely pauseless GC.
Another improvement would be hardware assisted write barriers for keeping track of remembered sets.
Generational GCs, and even more so G1 (Garbage First), reduce the amount of heap they have to scan by scanning only one partition at a time and keeping remembered sets for cross-partition pointers.
The problem is that this means ANY time the mutator sets a pointer, it also has to put an entry in the appropriate remembered set. So you have (small) overhead even when you're not GCing. If you can reduce this, you reduce both the pause times necessary for GCing and the constant overhead on overall program performance.
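For reference, a conceptual software sketch of that per-store bookkeeping (a card-marking write barrier); the card size and names are illustrative, and this is exactly the small constant cost that hardware assistance would aim to remove:

#include <stddef.h>
#include <stdint.h>

#define HEAP_BYTES (1 << 20)                 /* toy 1 MB heap               */
#define CARD_SHIFT 9                         /* 512-byte cards              */
#define NUM_CARDS  (HEAP_BYTES >> CARD_SHIFT)

static uint8_t heap[HEAP_BYTES];             /* toy heap                    */
static uint8_t card_table[NUM_CARDS];        /* one "dirty" byte per card   */

typedef struct Object { struct Object *field; } Object;

/* Every pointer store goes through this: do the store, then mark the card
 * that holds the updated object so the collector only rescans dirty cards. */
static inline void write_barrier(Object *holder, Object **slot, Object *value)
{
    *slot = value;
    size_t card = (size_t)((uintptr_t)holder - (uintptr_t)heap) >> CARD_SHIFT;
    card_table[card] = 1;
}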
One obvious solution was to have memory pointers which are larger than your available RAM, for example, 34-bit pointers on a 32-bit machine. Or use the uppermost 8 bits of a 32-bit machine when you have only 16 MB of RAM (2^24). The Oberon machines at ETH Zurich used such a scheme with a lot of success until RAM became too cheap. That was around 1994, so the idea is quite old.
This gives you a couple of bits where you can store object state (like "this is a new object" and "I just touched this object"). When doing the GC, prefer objects with "this is new" and avoid "just touched".
This might actually see a renaissance because no one has 2^64 bytes of RAM (= 2^67 bits; there are about 10^80 ~ 2^266 atoms in the universe, so it might not be possible to have that much RAM ever). This means you could use a couple of bits in today's machines if the VM can tell the OS how to map the memory.
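A minimal sketch of that tagging idea in C; the bit positions are illustrative, and the tag has to be stripped before dereferencing (or the aliased addresses mapped by the VM/OS, as suggested above):

#include <stdint.h>

#define TAG_NEW     (1ULL << 62)             /* "this is a new object"       */
#define TAG_TOUCHED (1ULL << 61)             /* "I just touched this object" */
#define TAG_MASK    (TAG_NEW | TAG_TOUCHED)

static inline uint64_t tag_new(uint64_t p)      { return p | TAG_NEW; }
static inline uint64_t tag_touched(uint64_t p)  { return p | TAG_TOUCHED; }
static inline int      is_new(uint64_t p)       { return (p & TAG_NEW) != 0; }
static inline void    *strip(uint64_t p)        { return (void *)(uintptr_t)(p & ~TAG_MASK); }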
Yes. Look at the related work sections of these 2 papers:
https://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/index.htm
http://www.filpizlo.com/papers/pizlo-ismm2007-stopless.pdf
Or at this one:
http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon12StallFree.pdf
There was an article on Lambda the Ultimate describing how you need a GC-aware virtual memory manager to have a really efficient GC, and VM mapping is done mostly by hardware these days. Here you are :)
You're a grad student, sounds like a good topic for a research grant to me.
Look into FPGA design and computer architecture; there are plenty of free processor designs available on http://www.opencores.org/
Garbage collection can be implemented as a background task; it's already intended for parallel operation.
Pete
I'm pretty sure that some prototypes exist. But developing and producing hardware-specific features is very expensive. It took a very long time to get the MMU or TLB implemented at the hardware level, and those are quite easy to implement.
GC is too big to be implemented efficiently at the hardware level.
Older SPARC systems had tagged memory (33 bits) which was usable to mark addresses.
Not fitted today?
This came from their LISP heritage, IIRC.
One of my friends built a generational GC that tagged by stealing a bit from primitives. It worked better.
So, yes, it can be done, but nobody bothers tagging things anymore.
runT1mes' comments about hardware assisted generational GC are worth reading.
Given a sufficiently big gate array (a Virtex-III springs to mind), an enhanced MMU that supported GC activities would be possible.
Probably the most relevant piece of data needed here is, how much time (percentage of CPU time) is presently being spent on garbage collection? It may be a non-problem.
If we do go after this, we have to consider that the hardware is fooling with memory "behind our back". I guess this would be "another thread", in modern parlance. Some memory might be unavailable if it were being examined (maybe solvable with dual-port memory), and certainly if the HWGC is going to move stuff around, then it would have to lock out the other processes so it didn't interfere with them. And do it in a way that fits into the architecture and language(s) in use. So, yeah, domain specific.
Look what just appeared... in another post... Looking at java's GC log.
I gather the biggest problem is CPU registers and the stack. One of the things you have to do during GC is traverse all the pointers in your system, which means knowing what those pointers are. If one of those pointers is currently in a CPU register then you have to traverse that as well. Similarly if you have a pointer on the stack. So every stack frame has to have some sort of map saying what is a pointer and what isn't, and before you do any GC traversing you have to get any pointers out into memory.
You also run into problems with closures and continuations, because suddenly your stack stops being a simple LIFO structure.
The obvious way is to never hold pointers on the CPU stack or in registers. Instead you have each stack frame as an object pointing to its predecessor. But that kills performance.
Several great minds at MIT back in the 80s created the SCHEME-79 chip, which directly interpreted a dialect of Scheme, was designed with LISP DSLs, and had hardware GC built in.
Why would it "speed things up"? What exactly should the hardware be doing?
It still has to traverse all references between objects, which means it has to run through a big chunk of data in main memory, which is the main reason why it takes time. What would you gain by this? Which particular operation is it that you think could be done faster with hardware support? Most of the work in a garbage collector is just following pointers/references, and occasionally copying live objects from one heap to another. How is this different from the instructions a CPU already supports?
With that said, why do you think we need faster garbage collection? A modern GC can already be very fast and efficient, to the point where it's basically a solved problem.
