What is a warp in the context of OpenCL?

I am fairly certain that a warp is only defined in CUDA. But maybe I'm wrong. What is a warp in terms of OpenCL?
It's not the same as a work-group, is it?
Any relevant feedback is highly appreciated. Thanks!

It isn't defined in the OpenCL standard. A warp is a thread as executed by the hardware: CUDA threads are not really hardware threads, but map onto a warp as separate SIMD lanes through some clever hardware/software mapping. In OpenCL terms, a warp is a collection of work-items, and there can be multiple warps in a work-group.
An OpenCL subgroup was designed to be compatible with a hardware thread, so a subgroup can represent a warp in an OpenCL kernel. It is, however, entirely up to NVIDIA whether to implement subgroups, and a subgroup cannot expose every feature NVIDIA exposes for warps: subgroups are part of a cross-vendor standard, while NVIDIA can do anything they like on their own devices.
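For illustration, here is a minimal kernel sketch using the subgroup built-ins from the cl_khr_subgroups extension; the kernel name, buffer names, and the reduction it performs are made up for this example, and it assumes a device and compiler that actually support the extension (OpenCL 2.x):

// Minimal sketch: assumes cl_khr_subgroups is supported by the device/compiler.
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void subgroup_demo(__global const float* in, __global float* out)
{
    size_t gid = get_global_id(0);

    // Work-items in the same subgroup typically map onto one hardware
    // thread (e.g. one warp on NVIDIA hardware).
    uint lane = get_sub_group_local_id();   // lane within the subgroup

    // Reduce across the subgroup without using local memory or a barrier.
    float sum = sub_group_reduce_add(in[gid]);

    // Let one lane per subgroup write the result.
    if (lane == 0)
        out[get_group_id(0) * get_num_sub_groups() + get_sub_group_id()] = sum;
}

Whether a subgroup is exactly one warp, and what size it has, is the implementation's choice; the sketch only relies on what the extension itself guarantees.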

Related

OpenCL deadlock possibility

I'm using global atomics to synchronize between work groups in OpenCL.
So the kernel uses code like
... global volatile uint* counter;
if(get_local_id(0) == 0) {
    while(*counter != expected_value);
}
barrier(0);
to wait until counter becomes expected_value.
And at another place it does
if(get_local_id(0) == 0) atomic_inc(counter);
Theoretically the algorithm is such that this should always work, if all work groups are running concurrently. But if one work group starts only after another has completely finished, then the kernel can deadlock.
On the CPU and on the GPU (NVIDIA CUDA platform), it seems to always work, with a large number of work groups (over 8000).
For the algorithm this seems to be the most efficient implementation. (It computes a prefix sum over each line in a 2D buffer.)
Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?
Does OpenCL and/or NVIDIA's OpenCL implementation guarantee that this always works?
As far as the OpenCL standard is concerned, this is not guaranteed (the same is true for CUDA). In practice it may very well work on your specific OpenCL implementation, but since the standard does not guarantee it, make sure you understand your implementation's execution model before relying on it, and be aware that such code won't necessarily be portable across other conforming implementations.
Theoretically the algorithm is such that this should always work, if all work groups are running concurrently
OpenCL states that work-groups can run in any order, and not necessarily in parallel or even concurrently. CUDA has similar wording, although CUDA 9 does support a form of grid-wide synchronization.
OpenCL spec, 3.2.2 Execution Model: Execution of kernel-instances:
A conforming implementation may choose to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of work-groups since once in the work-pool, they can execute in any order.
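As an illustration of what a portable restructuring can look like (only a sketch; the kernel names pass1/pass2, the queue, and the size variable are hypothetical): instead of spin-waiting inside the kernel, the dependency between the two phases is expressed as two separate enqueues on an in-order command queue, so the queue ordering, rather than a cross-work-group spin lock, provides the synchronization.

/* Sketch: an in-order queue guarantees pass1 finishes before pass2 starts,
   replacing the cross-work-group spin-wait. Names are placeholders. */
size_t global = total_work_items;

clEnqueueNDRangeKernel(queue, pass1, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, pass2, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(queue);   /* or use cl_event dependencies for finer control */

This costs an extra kernel launch, but it is guaranteed to work on every conforming implementation.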

OpenCL work-items and streaming processors

What is the relation between a work-item and a streaming processor (CUDA core)? I read somewhere that the number of work-items SHOULD greatly exceed the number of cores, otherwise there is no performance improvement. But why is this so? I thought 1 core represents 1 work-item. Can someone help me understand this?
Thanks
GPUs and most other hardware tend to do arithmetic much faster than they can access most of their available memory. Having many more work items than you have processors lets the scheduler stagger the memory use, while those work items which have already read their data are using the ALU hardware to do the processing.
Here is a good page about optimization in OpenCL. Scroll down to "2.4. Removing 'Costly' Global GPU Memory Access", where it goes into this concept.
The reason is mainly scheduling - a single core/processor/unit can usually run multiple threads and switch between them to hide memory latency (SMT). So it's generally a good idea for each core to have multiple threads queued up for it.
A thread will usually correspond to at least one work item, although depending on driver and hardware, multiple work-items might be combined into one thread, to make use of SIMD/vector capabilities of a core.
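As a rough sketch of what "greatly exceed" means in practice (the multiplier below is an arbitrary illustration, not a tuned value; in a real program the global size is dictated by the problem size):

/* Sketch: oversubscribe the device so the scheduler can hide memory latency. */
cl_uint compute_units;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);

/* Launching only 'compute_units' work-items would leave the hardware mostly
   idle while memory requests are in flight; launch many times more instead. */
size_t global_size = (size_t)compute_units * 64 * 256;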

OpenCL local_work_size NULL

When enqueueing an OpenCL kernel, local_work_size can be set to NULL, in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances.
Automatically calculating the local_work_size seems like a great feature (better than guessing a multiple of 64).
Does OpenCL's work group size choice tend to be optimal? Are there cases where it would be better to manually specify local_work_size?
This depends on how your kernel is written. Often, to get the best performance, your kernels need to make assumptions based on the local work size. For example, in convolution you want to use as much local memory as you can to avoid extra reads back to global memory, and you will want to handle as many work-items as possible given the incoming kernel sizes and how much local memory your device has. Configuring your local work size based on incoming parameters such as the convolution kernel size can be the difference between major speed-ups and merely small ones. This is one reason why a language such as Renderscript Compute will never be able to provide performance close to optimized OpenCL/CUDA, which let the developer be aware of the hardware they are running on.
Also, you are not just guessing about the size. You certainly can make general assumptions, but you can achieve better performance by looking at the architecture you are running on (check the AMD/NVIDIA/Intel guides for each device) and optimizing for it. You may change the size at runtime by tweaking your code to modify the OpenCL kernel source at runtime (since it is just a string), or you could have multiple kernels and select the best one at runtime.
That said, using NULL for the work-group size is a great way to not worry about optimization and just test out acceleration on the GPU with little effort. You will almost certainly get much better performance if you are aware of the hardware, make better choices, and write your kernels with knowledge of the local work-group size.
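As a hedged sketch of the "be aware of the hardware" approach, you can also ask the implementation for per-kernel hints instead of hard-coding a size (the variable names here are illustrative):

/* Sketch: query per-kernel limits/hints before picking a local size. */
size_t max_wg_size, preferred_multiple;

clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(preferred_multiple), &preferred_multiple, NULL);

/* One simple heuristic: the largest multiple of the preferred size that fits.
   Remember that in OpenCL 1.x the global size must be a multiple of the
   local size you end up choosing. */
size_t local = (max_wg_size / preferred_multiple) * preferred_multiple;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
/* Passing NULL instead of &local hands the choice back to the implementation. */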

MPI vs OpenMP for shared memory

Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding I think that all processors share the same memory in this case. Now, is it better to directly use OpenMP, or to use MPI to make it general so that the code could work in both distributed and shared settings? Also, if I use MPI for a shared setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) heavily depends on the type of application you are running, and whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but more about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
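For example (a sketch with placeholder names; result, data and expensive_function are not from the question), a single pragma is often all that is needed to parallelize a hot loop:

/* Sketch: OpenMP parallelizes the loop across the cores of one node. */
#pragma omp parallel for
for (int i = 0; i < n; i++)
    result[i] = expensive_function(data[i]);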
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
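To make "hybrid" concrete, here is a minimal sketch (the workload in the loop is a placeholder; compile with something like mpicc -fopenmp): MPI splits the problem across nodes, and OpenMP fills the cores within each node.

/* Sketch: hybrid MPI + OpenMP. Typically one MPI process per node,
   with OpenMP threads using that node's cores. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;

    /* Request a threading level compatible with OpenMP threads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP spreads this process's share of the work across its cores. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += nprocs)
        local_sum += 1.0 / (i + 1.0);   /* placeholder workload */

    /* MPI combines the per-process results across nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}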
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for the bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology by using the same address space. Because MPI processes cannot share data, the same data may end up replicated in memory, which can hurt cache utilization.
On the other hand, if you partition your data correctly and each processor has a private cache, you might reach the point where your problem fits completely in cache. In this case you get super-linear speedups.
Speaking of cache: recent processors have very different cache topologies, so as always: IT DEPENDS...

Developing with OpenCL on ATI and NVIDIA at the same time

Our workgroup is slowly trying a little bit of OpenCL in a side project. So far 'everybody' is working on an NVIDIA Quadro FX 580. Now we are planning to buy new computers for new colleagues, and instead of the FX 580 we could buy an ATI FirePro V4800, which costs only 15 EUR more and gives us 1 GB instead of 512 MB of RAM, which will be beneficial for our data-intensive tasks.
So, how much trouble is it to develop OpenCL code on NVIDIA and ATI at the same time?
I read the following SO question, Running OpenCL on hardware from mixed vendors, which was very pessimistic about developing on/for different vendors. On the other side, the question is already a year old.
What do you recommend?
I have previously worked extensively with the CUDA programming language.
I have been planning to start developing apps using OpenCL. As you mentioned, one of the best features of OpenCL is running on hardware from many vendors (Intel, AMD and NVIDIA).
One project that I came across that uses OpenCL extensively for large-scale development is http://sourceforge.net/projects/hypgad/. It might be a good idea to look at the source code from this group and understand how they have developed their application for so much different hardware, including the Sony Cell processor.
Another approach would be to use PyOpenCL, which provides a higher level of abstraction than OpenCL and can significantly reduce the coding effort.
Do you need the code to run unchanged on both bits of hardware? If so you may have to develop for a limited subset of common functions.
If you can run slightly different code on each, you will probably get better performance - in CUDA/OpenCL you generally have to tune the algorithms for the amount of RAM and the number of GPU engines anyway, so it shouldn't be much more work to also tweak for NVIDIA/AMD.
The biggest problem is work-group sizes. Some ATI cards I have used crash with sizes above 64, but then it may be the Apple OS X 10.6 drivers I am using.
Developing for both ATI and NVIDIA is actually not too difficult so long as you avoid using any part of either vendor's SDK. Stick to OpenCL as it is defined in the OpenCL spec (www.khronos.org/opencl) and your code will stay syntax-portable. Due to differences in the underlying architectures, performance portability may be an issue: local and global work sizes really have to be determined independently for each card to maximize performance. Another thing to pay attention to is the types being used. Vector types (float2, float4) are especially useful on ATI cards, as each processing element actually contains 4 execution units (one for each RGB color channel, plus alpha).
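To illustrate that last point, here is a small kernel sketch using float4 (the kernel and buffer names are made up); on those VLIW-style ATI cards one float4 operation maps naturally onto the 4 execution units of a processing element:

// Sketch: vectorized scale-and-add. Buffers are assumed to hold a
// multiple of 4 floats, viewed here as float4 elements.
__kernel void saxpy_vec4(__global const float4* x,
                         __global float4* y,
                         const float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];   // one float4 operation = 4 scalar operations
}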
