OpenCL deadlock possibility

I'm using global atomics to synchronize between work groups in OpenCL.
So the kernel uses code like
... global volatile uint* counter;
if(get_local_id(0) == 0) {
    while(*counter != expected_value);
}
barrier(0);
To wait until counter becomes expected_value.
And at another place it does
if(get_local_id(0) == 0) atomic_inc(counter);
Theoretically the algorithm is designed so that this should always work, provided all work groups are running concurrently. But if one work group only starts after another has completely finished, the kernel can deadlock.
On CPU and on GPU (NVidia CUDA platform), it seems to always work, with a large number of work groups (over 8000).
For the algorithm this seems to be the most efficient implementation. (It computes prefix sums over each line of a 2D buffer.)
Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?

Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?
As far as the OpenCL standard is concerned, this is not guaranteed (and similarly for CUDA). In practice it may very well work on your specific OpenCL implementation, but bear in mind that the standard does not guarantee it, so make sure you understand your implementation's execution model before relying on it, and be aware that such code won't necessarily be portable across other conforming implementations.
Theoretically the algorithm is designed so that this should always work, provided all work groups are running concurrently
OpenCL states that work groups can run in any order, and not necessarily in parallel or even concurrently. CUDA has similar wording, although CUDA 9 does support a form of grid-wide synchronization.
OpenCL spec, 3.2.2 Execution Model: Execution of kernel-instances:
A conforming implementation may choose to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of work-groups since once in the work-pool, they can execute in any order.
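If portability matters, the usual workaround is to let the host provide the synchronization point by splitting the algorithm into separate kernel enqueues instead of spinning on a global counter. A minimal host-side sketch, where phase1 and phase2 are hypothetical kernels standing for the two halves of the prefix-sum pass:

#include <CL/cl.h>

/* Hypothetical sketch: on an in-order command queue, every work-group of
 * phase1 has finished before phase2 starts, on any conforming
 * implementation, regardless of how work-groups are scheduled. */
cl_int run_two_phases(cl_command_queue queue,
                      cl_kernel phase1, cl_kernel phase2,
                      size_t global_size, size_t local_size)
{
    cl_int err = clEnqueueNDRangeKernel(queue, phase1, 1, NULL,
                                        &global_size, &local_size,
                                        0, NULL, NULL);
    if (err != CL_SUCCESS) return err;
    err = clEnqueueNDRangeKernel(queue, phase2, 1, NULL,
                                 &global_size, &local_size,
                                 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;
    return clFinish(queue);   /* host-side barrier across all work-groups */
}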

Related

OpenCL as a general purpose C runtime compiler (Why is kernel on CPU slower than direct C?)

I have been playing with OpenCL as a general-purpose means to execute run-time C code. While I'm interested in eventually getting code to run on the GPU, I am currently looking at the overhead of OpenCL on the CPU compared to running straight compiled C.
Obviously there is overhead in the prepping and compilation of the kernel in OpenCL. But even when I just time the final execution and buffer read:
clEnqueueNDRangeKernel(...);
clFinish();
clEnqueueReadBuffer(...);
The overhead for a simple calculation is significant compared to the straight C code (a factor of 30), even with 1000 loops, which gives it an opportunity to parallelize. Obviously there is overhead in reading a buffer back from the OpenCL code... but it is sitting on the same CPU, so it can't be that large an overhead.
Does this sound right?
The call to clEnqueueReadBuffer() is most likely performing a memcpy-like operation which is costly and non-optimal.
If you want to avoid this, you should allocate your data on the host normally (e.g. with malloc() or new) and pass the CL_MEM_USE_HOST_PTR flag to clCreateBuffer() when you create the OpenCL buffer object. Then use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() to pass the data to/from OpenCL without actually performing a copy. This should be close to optimal w.r.t. minimizing OpenCL overhead.
For a higher-level abstraction implementing this technique, take a look at the mapped_view<T> class in Boost.Compute (simple example here).
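A minimal host-side sketch of that zero-copy pattern, assuming a context, queue, and element count N already exist (error handling omitted):

#include <CL/cl.h>
#include <stdlib.h>

/* Hypothetical sketch of the CL_MEM_USE_HOST_PTR + map/unmap pattern:
 * the buffer wraps the host allocation, so reading results back should
 * not require a memcpy on a CPU device. */
void zero_copy_example(cl_context context, cl_command_queue queue, size_t N)
{
    cl_int err;
    float *host_data = malloc(N * sizeof(float));

    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                N * sizeof(float), host_data, &err);

    /* ... set kernel args and clEnqueueNDRangeKernel(...) here ... */

    /* Map instead of clEnqueueReadBuffer to access results without a copy. */
    float *mapped = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                                CL_MAP_READ, 0,
                                                N * sizeof(float),
                                                0, NULL, NULL, &err);
    /* ... use mapped[0..N-1] on the host ... */
    clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);

    clReleaseMemObject(buf);
    free(host_data);
}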

Shipping reliable OpenCL applications - Tools/Techniques/Tips?

I want to ship OpenCL code that should work on all OpenCL 1.1 compatible GPUs. Rather than buying a bunch of GPUs and testing on them, are there any tools that can help ensure reliability?
If anyone has experience shipping OpenCL applications to a wide hardware base, I'd be interested in knowing about any other methods for testing reliability.
I have a bit of experience with this. Unfortunately, the answer is: it depends on what the kernel is doing.
My biggest gripe is with NVIDIA and OpenCL, since they don't seem to support vectors (float2, float4, etc.) or global offsets. Kind of obnoxious. Intel and ATI are both solid, but even then vector sizes can differ. None of this really matters if you are doing image convolution.
It does matter if you want to run the AMD FFT on an NVIDIA card, are doing matrix math, etc. To address the vector issue, you can write multiple kernels that each use a different vector size and call the right one, e.g. MatrixMult_float4(...).
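One way to avoid duplicating the kernel source by hand is to compile the same source several times with a different build-time define; a minimal sketch, where VEC_T and the kernel name scale are made-up placeholders and the macro is passed via clBuildProgram options such as "-DVEC_T=float4":

/* Hypothetical sketch: one kernel source, built once per vector width. */
#ifndef VEC_T
#define VEC_T float4          /* default width if no option is passed */
#endif

__kernel void scale(__global VEC_T *data, float factor)
{
    size_t i = get_global_id(0);
    data[i] = data[i] * factor;   /* valid for scalar float and any floatN */
}

On the host, build the program with "-DVEC_T=float2" (or plain float) for devices where wider vectors don't help, and adjust the global work size accordingly.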
You can check whether your code compiles by using the AMD KernelAnalyzer2, although this needs some component of the Catalyst drivers, so it only works for me on PCs with AMD GPUs. There is also the Intel Kernel Builder, which works for devices with Intel OpenCL SDK support. Nvidia's implementation has bugs in it, especially on newer GPUs in my experience, so there the best approach is to test one GPU from each generation.
To avoid extensions and validate CL language versions, one could try to test-compile the code using LLVM, or just get the grammar for validation, e.g. as a BNF.
There's a promising open source project, which probably contains useful stuff: http://bazaar.launchpad.net/~pocl/pocl/master/files/head:/lib/CL/
However, the problems I encountered were:
Newline characters (CR, LF, CRLF) in OpenCL source files caused build breaks on certain implementations. Specifying one of these as the only valid line ending would be just stupid, and if one is editing source files on different platforms in conjunction with an SCM, it can get inconvenient. So I remove comments and clean up line breaks before compilation (see the sketch after this list).
Performance: feeding the GPU efficiently using multithreading; different hardware configurations have different bottlenecks. Here I needed a client-side pipeline with multiple dispatcher threads. Of course, the amount of work that remains for the CPU depends on the task and on the capabilities, number, and resources of the compute devices. Things that needed serialized execution or dynamic loop counts were such candidates.
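A minimal sketch of the line-ending cleanup step, assuming the kernel source has already been read into a mutable C string (the comment stripping is omitted):

#include <stddef.h>

/* Hypothetical sketch: normalize CR and CRLF line endings to plain LF
 * in place before handing the source to clCreateProgramWithSource. */
void normalize_newlines(char *src)
{
    size_t w = 0;
    for (size_t r = 0; src[r] != '\0'; ++r) {
        if (src[r] == '\r') {
            src[w++] = '\n';              /* lone CR or CRLF becomes LF */
            if (src[r + 1] == '\n')
                ++r;                      /* skip the LF of a CRLF pair */
        } else {
            src[w++] = src[r];
        }
    }
    src[w] = '\0';
}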

OpenCL local_work_size NULL

When enqueueing an OpenCL kernel, local_work_size can be set to NULL, in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances.
Automatically calculating the local_work_size seems like a great feature (better than guessing a multiple of 64).
Does OpenCL's work group size choice tend to be optimal? Are there cases where it would be better to manually specify local_work_size?
This depends on how your kernel is written. Often, to get the best performance, your kernels need to make assumptions based on the local work size. For example, in convolution you want to use as much local memory as you can to prevent extra reads back to global memory, and you will want to handle as many threads as possible given the incoming kernel sizes and how much local memory your device has. Configuring your local work size based on incoming parameters such as the kernel size can be the difference between major speedups and only small gains. This is one reason why a language such as Renderscript Compute will never be able to provide performance close to optimized OpenCL/CUDA, which lets the developer be aware of the hardware they are running on.
Also, you are not simply guessing about the size. You can certainly make general assumptions, but you achieve better performance by looking at the architecture you are running on (check the AMD/NVIDIA/Intel guides for each device) and optimizing for it. You can modify your OpenCL kernel at runtime (since it is just a string), or you can keep multiple kernels and select the best one at runtime.
That said, using NULL for the work-group size is a great way to not worry about optimization and to just test out acceleration on the GPU with little effort. You will almost certainly get much better performance if you are aware of the hardware, make better choices, and write your kernels with knowledge of the local work-group size.
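If you do want to choose the size yourself, a reasonable starting point is to ask the implementation what it prefers for that particular kernel and device. A minimal sketch (no error checking), using the standard clGetKernelWorkGroupInfo queries:

#include <CL/cl.h>

/* Hypothetical sketch: derive a 1D local size from what the implementation
 * reports for this kernel/device pair, instead of hard-coding 64. */
size_t pick_local_size(cl_kernel kernel, cl_device_id device)
{
    size_t preferred = 1, max_wg = 1;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    /* Largest multiple of the preferred size that still fits the kernel. */
    size_t local = (max_wg / preferred) * preferred;
    return local > 0 ? local : max_wg;
}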

CPU/Intel OpenCL performance issues, implementation questions

I have had some questions hanging in the air without an answer for a few days now. The questions arose because I have an OpenMP and an OpenCL implementation of the same problem. The OpenCL version runs perfectly on the GPU, but has 50% less performance when run on the CPU (compared to the OpenMP implementation). A post already deals with the difference between OpenMP and OpenCL performance, but it doesn't answer my questions. At the moment I face these questions:
1) Is it really that important to have "vectorized kernel" (in terms of the Intel Offline Compiler)?
There is a similar post, but I think my question is more general.
As I understand it, whether a kernel counts as "vectorized" is not simply a question of whether vector/SIMD instructions appear in the compiled binary; I checked the assembly code of my kernels, and there are plenty of SIMD instructions. A vectorized kernel means that, by using SIMD instructions, 4 (SSE) or 8 (AVX) OpenCL "logical" threads are executed in one CPU thread. This can only be achieved if ALL your data is stored consecutively in memory. But who has such perfectly arranged data?
So my question would be: Is it really that important to have your kernel "vectorized" in this sense?
Of course it gives a performance improvement, but if most of the computation-intensive parts of the kernel are already done by vector instructions, then you might get near-"optimal" performance anyway. I think the answer to my question lies in memory bandwidth: vector registers probably lend themselves better to efficient memory access, and in that case the kernel arguments (pointers) have to be vectorized.
2) If I allocate data in local memory on a CPU, where will it be allocated? OpenCL reports the L1 cache as local memory, but it is clearly not the same type of memory as a GPU's local memory. If it's stored in RAM/global memory, then there is no sense in copying data into it. If it were in the cache, some other process might flush it out... so that doesn't make sense either.
3) How are "logical" OpenCL threads mapped to real CPU software/hardware (Intel HTT) threads? Because if I have short-running kernels and the kernels are forked like in TBB (Threading Building Blocks) or OpenMP, then the fork overhead will dominate.
4) What is the thread fork overhead? Are new CPU threads forked for every "logical" OpenCL thread, or are the CPU threads forked once and reused for many "logical" OpenCL threads?
I hope I'm not the only one who is interested in these tiny things, and some of you might know bits of these problems. Thank you in advance!
UPDATE
3) At the moment the OpenCL overhead is more significant than OpenMP's, so heavy kernels are required for efficient runtime execution. In Intel OpenCL a work-group is mapped to a TBB thread, so one virtual CPU core executes a whole work-group (or thread block). A work-group is implemented with 3 nested for loops, where the innermost loop is vectorized, if possible. So you could imagine it as something like:
#pragma omp parallel for
for(wg = 0; wg < get_num_groups(2)*get_num_groups(1)*get_num_groups(0); wg++) {
    for(k = 0; k < get_local_size(2); k++) {
        for(j = 0; j < get_local_size(1); j++) {
            #pragma simd
            for(i = 0; i < get_local_size(0); i++) {
                ... work-load ...
            }
        }
    }
}
If the innermost loop can be vectorized, it steps with a SIMD stride:
for(i=0; i<get_local_size(0); i+=SIMD) {
4) Every TBB thread is forked once during the OpenCL execution and then reused. Every TBB thread is tied to a virtual core, i.e. there is no thread migration during the computation.
I also accept natchouf's answer.
I may have a few hints for your questions. In my limited experience, a good OpenCL implementation tuned for the CPU can't beat a good OpenMP implementation. If it does, you could probably improve the OpenMP code until it beats the OpenCL one.
1) It is very important to have vectorized kernels. This is linked to your questions 3 and 4. If you have a kernel that handles 4 or 8 input values, you'll have many fewer work items (threads), and hence much less overhead. I recommend using the vector instructions and data types provided by OpenCL (like float4, float8, float16) instead of relying on auto-vectorization.
Do not hesitate to use float16 (or double16): this will be mapped to 4 SSE or 2 AVX vectors and will divide the number of work items required by 16 (which is good for the CPU, but not always for the GPU; I use 2 different kernels for CPU and GPU). A short kernel sketch follows after these points.
2) Local memory on the CPU is just RAM. Don't use it in a CPU kernel.
3 and 4) I don't really know; it will depend on the implementation, but the fork overhead seems important to me.
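As a hedged illustration of point 1, here is a scalar kernel and an explicitly vectorized float8 variant of the same operation (the names and the SAXPY-style body are made up for the example):

/* One work item per element. */
__kernel void saxpy_scalar(__global const float *x, __global float *y, float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

/* One work item per 8 consecutive elements: enqueue with a global size of
 * N/8 instead of N, which cuts the number of work items (and the per-item
 * overhead) by 8. */
__kernel void saxpy_float8(__global const float8 *x, __global float8 *y, float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}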
For question 3: Intel groups logical OpenCL threads into one hardware thread, and the group size can vary between 4, 8, and 16. A logical OpenCL thread maps to one SIMD lane of an execution unit; one execution unit has two SIMD engines with a width of 4. Please refer to the following document for further details:
https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf

MPI vs OpenMP for shared memory

Let's say there is a computer with 4 CPUs, each having 2 cores, so 8 cores in total. With my limited understanding, I think that all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to keep the code general so that it works in both distributed and shared settings? Also, if I use MPI for a shared-memory setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) heavily depends on the type of application you are running, and whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but rather about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), it will become necessary to use a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time. In this case, hybrid parallelization will be necessary to make full use of your computational resources, but it is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, once you know that you've exhausted all reasonable optimization options for MPI.
With most distributed-memory platforms nowadays consisting of SMP or NUMA nodes, it just makes no sense not to use OpenMP. OpenMP and MPI can work together perfectly: OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago, but now it is becoming mainstream in High Performance Computing.
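A minimal hybrid sketch in C, using standard MPI and OpenMP calls (the loop body is just a placeholder workload):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hypothetical sketch of hybrid parallelism: one MPI process per node,
 * OpenMP threads feeding the cores inside each node. */
int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* Threads share this process's memory; no MPI traffic inside the node. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; ++i)
        local_sum += 1.0 / (1.0 + i + rank);

    /* MPI only communicates between processes/nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}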
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared-memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP might be faster is because a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
For the bigger picture: hybrid programming has become popular because OpenMP benefits from the cache topology by using the same address space. Because MPI processes cannot share data, the same data may end up replicated across memory, so MPI can suffer from wasted cache capacity.
On the other hand, if you partition your data correctly and each processor has a private cache, the problem may come to fit completely in cache. In that case you get superlinear speedups.
Speaking of caches, recent processors have very different cache topologies, so as always: IT DEPENDS...

Resources