OpenCL - communication between workgroups

Can I make a work-group communicate with another work-group without using global memory? If so, how?
Using local memory is at least 10 times faster, and using registers is even 50 times faster, than global memory. But I guess those memories don't reach outside the work-group.
Thanks

You can't make work-groups communicate with each other; each work-group is an isolated unit of computation that runs in parallel with all the others.
The only way to pass data between work-groups is to split the kernel into two kernels, save the output of kernel 1 in global memory, and then use it as the input to kernel 2 to continue the processing.
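For illustration, a minimal host-side sketch of that two-kernel pattern (kernel1, kernel2 and the buffer size are placeholders, not something from the question; error checking is omitted):

```c
#include <CL/cl.h>

/* Sketch: run stage 1, keep its output in a global-memory buffer,
 * then feed that buffer to stage 2. Assumes the context, queue and
 * both kernels were created elsewhere. */
void run_two_stage(cl_context ctx, cl_command_queue queue,
                   cl_kernel kernel1, cl_kernel kernel2, size_t n_items)
{
    cl_int err;
    cl_mem intermediate = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                         n_items * sizeof(float), NULL, &err);

    /* Stage 1 writes its results to the intermediate buffer. */
    clSetKernelArg(kernel1, 0, sizeof(cl_mem), &intermediate);
    clEnqueueNDRangeKernel(queue, kernel1, 1, NULL, &n_items, NULL, 0, NULL, NULL);

    /* Stage 2 reads the same buffer; an in-order queue guarantees it
     * sees everything stage 1 wrote. */
    clSetKernelArg(kernel2, 0, sizeof(cl_mem), &intermediate);
    clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, &n_items, NULL, 0, NULL, NULL);

    clFinish(queue);
    clReleaseMemObject(intermediate);
}
```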

Related

OpenCL kernel queueing delays

I have a gigantic pile of data, 100 GB, but only 1 GB of video memory. I need to enqueue my kernel many times with MaxWorkgroupSize-sized chunks. That's going to be ~10,000 kernel enqueues and ~100 memory transfers. How badly will this affect my execution time? Also, is there a faster way of processing so much data? Would I be better off just running on my CPU with 8 threads, since then there are no data transfers and no kernel launch delays? I'm asking before I code the thing because I want to make sure I have the right approach.
It depends on the nature of the work. GPUs are SIMD machines. If you are typically doing the same thing for each item (e.g. branches normally go the same way for each work-item), that bodes well for a GPU. Even so, an 8-thread CPU has OpenCL implementations available for it as well. Also, in environments like Intel's integrated GPU (AMD too?), you should consider the CL_MEM_USE_HOST_PTR flag on the memory buffer. You can use it to get zero-copy behavior.
Enqueueing the same kernel multiple times doesn't impose any per-enqueue performance penalty compared to a single kernel run. If anything, it gets a little faster due to caching.
Also, you can run your code on the CPU & GPU simultaneously, as both are OpenCL-compatible devices.
Your device can use memory objects allocated from the host's RAM (the CL_MEM_ALLOC_HOST_PTR & CL_MEM_USE_HOST_PTR flags in clCreateBuffer()). Either way, memory transfers may not be the bottleneck.
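As a hedged sketch of that zero-copy idea (the function and variable names here are illustrative): CL_MEM_USE_HOST_PTR wraps an existing host allocation, which mainly pays off on devices that share physical memory with the host.

```c
#include <CL/cl.h>

/* Sketch: wrap an existing host allocation in a buffer object so a
 * device that shares memory with the host can use it without an
 * explicit copy. On discrete GPUs the runtime may still copy. */
cl_mem wrap_host_data(cl_context ctx, float *host_data, size_t n)
{
    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          n * sizeof(float), host_data, &err);
}
```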

OpenCL and multiple video cards

My understanding of the difference between CPUs and GPUs is that GPUs are not general-purpose processors, so that if a video card contains 10 GPUs, each GPU actually shares the same program pointer, and to optimize parallelism on the GPU I need to ensure each GPU is actually running the same code.
Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
My question is, how does this work with multiple cards? At the speed they operate at, doesn't the hardware introduce slight differences in execution times, such that a calculation on one GPU on one card may finish sooner or later than the same calculation on another GPU on another card?
thanks
Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
This is not true. Different threads on a GPU may complete at different times due to differences in memory access latency, for example. That is why there are synchronization primitives in OpenCL such as the barrier command. You can never assume that your threads are running precisely in parallel.
The same is true for multiple GPUs. There is no guarantee that they are in sync, so you will need to rely on API calls such as clFinish to explicitly synchronize their work.
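A rough sketch of that pattern, assuming one in-order queue per GPU and kernels set up elsewhere (the names here are placeholders):

```c
#include <CL/cl.h>

/* Sketch: enqueue work to two GPUs, then block on each queue.
 * Neither card is guaranteed to finish first, so wait for both
 * explicitly before touching their results. */
void run_on_two_gpus(cl_command_queue q0, cl_command_queue q1,
                     cl_kernel k0, cl_kernel k1, size_t global_size)
{
    clEnqueueNDRangeKernel(q0, k0, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, k1, 1, NULL, &global_size, NULL, 0, NULL, NULL);

    clFinish(q0);   /* blocks until GPU 0 has finished everything enqueued */
    clFinish(q1);   /* same for GPU 1 */
}
```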
I think you may be confused about how threads work on a GPU. First, to address the issue of multiple GPUs: multiple GPUs NEVER share the program pointer, so they will almost never complete a kernel at the same time.
On a single GPU, only threads that are executing ON THE SAME COMPUTE UNIT (or SM in NVIDIA parlance) AND are part of the same warp/wavefront are guaranteed to execute in sync.
You can never really count on this, but for some devices the compiler can determine that this will be the case (I am specifically thinking of some AMD devices, as long as the workgroup size is hardcoded to 64).
In any case, as @vocaro pointed out, that's why you need to use a barrier for local memory.
To emphasize, even on the same GPU, threads are not executing in parallel across the whole device - only within each compute unit.
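To illustrate the barrier point, a small kernel sketch (not from the question) where the barrier is what makes the local-memory exchange safe:

```c
// Each work-item copies one element into local memory; the barrier
// ensures every write has happened before any work-item reads an
// element written by a neighbour, since work-items are not
// guaranteed to run in lockstep.
__kernel void reverse_in_group(__global const float *in,
                               __global float *out,
                               __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);
    size_t gid = get_global_id(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);        // synchronize the work-group

    out[gid] = scratch[lsz - 1 - lid];   // safe only after the barrier
}
```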

Parallelism in OpenCL on 1 cpu device

Is it possible to achieve the same level of parallelism with a multi-core CPU device as with multiple heterogeneous devices (like a GPU and a CPU) in OpenCL?
I have an Intel i5 and am looking to optimise my code. When I query the platform for devices, I get only one device returned: the CPU. I was wondering how I could optimise my code using this.
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices or does it have to be done manually by the programmer?
Can a cpu device achieve the same level of parallelism as a gpu? Pretty much always no.
The number of compute units in a gpu is almost always more than in a cpu. For example, $50 can get you a video card with 10 compute units (Radeon 6450). The cheapest 8-core cpus on newegg are going for $189 (desktop cpu) and $269 (server).
The compute units of a cpu will run faster due to clock speed, and execute branching code much better than a gpu. You want a cpu if your workload has a lot of conditional statements.
A gpu will execute the same instructions on many pieces of data. The 6450 gpu has 16 'stream processors' per compute unit to make this happen. Gpus are great when you have to do the same (small/medium) task many times. Matrix multiplication, n-body computations, reduction operations, and some sorting algorithms run much better on gpu/accelerator hardware than on a cpu.
I answered a similar question with more detail a few weeks ago. (This one)
Getting back to your question about the "same level of parallelism": cpus don't have the same level of parallelism as gpus, except in cases where the gpu underperforms on the execution of the actual kernel.
On your i5 system, there would be only one cpu device. This represents the entire cpu. When you query for the number of compute units, opencl will return the number of cores you have. If you want to use all cores, you just run the kernel on your device, and opencl will use all of the compute units (cores) for you.
Short answer: yes, it will run in parallel and no, no need to do it manually.
Long answer:
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices [...]
Either you need to revise your OpenCL vocabulary or I didn't understand your question. You only have one device and core != device!
One CPU, regardless of how many cores it has, is one device. The same goes for a GPU: one GPU, which has hundreds of cores, is only one device. You send jobs to the device through the queue and the device's driver. Your jobs can (and will) be split up into work-items. Then, some work-items (how many depends on the device/driver) are executed in parallel. On the GPU as well as on the CPU, each work-item executes one instance of the kernel. (This might not be completely true, but it is a very helpful abstraction.)
If you enqueue several kernels in one queue (without connecting them through a wait event!), the driver may or may not run them in parallel.
It is the very goal of OpenCL to allow you to compute work-items in parallel, regardless of whether it is using several devices' cores or only a single device's cores.
If this confuses you, watch these really good (and long) videos: http://macresearch.org/opencl
How are you determining the OpenCL device count? I have an Intel i3 laptop that gives me 2 OpenCL compute units; it has 2 cores.
According to Intel's spec, an i5-2300 has 4 cores and supports 4 threads; it isn't hyper-threaded. I would expect an OpenCL query for the number of compute units to give you a count of 4.
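If it helps, a small sketch of how that compute-unit count can be queried (error checking omitted; the device array size of 8 is arbitrary):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: print CL_DEVICE_MAX_COMPUTE_UNITS for each device on the
 * first platform. For a CPU device this typically matches the number
 * of hardware threads the implementation exposes. */
int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint num_devices, i;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);
    if (num_devices > 8)
        num_devices = 8;   /* only the first 8 were returned */

    for (i = 0; i < num_devices; ++i) {
        cl_uint cu = 0;
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cu), &cu, NULL);
        printf("device %u: %u compute units\n", i, cu);
    }
    return 0;
}
```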

OpenCL - waste of host computing power

I am new to OpenCL. Please tell me: can the host CPU be used only for allocating memory to the device, or can we also use it as an OpenCL device? (Because after the allocation is done, the host CPU will be idle.)
You can use a cpu as a compute device. OpenCL even allows a multicore/multiprocessor system to be partitioned into separate sub-devices (device fission). I like to use this feature to divide the cpus on my system into groups based on NUMA nodes. It is possible to divide a cpu into sub-devices which all share the same level of cache memory (L1, L2, L3 or L4).
You need a platform that supports this, such as AMD's SDK. I know there are ways to have Nvidia and AMD platforms on the same machine, but I have never had to do it myself.
Also, the OpenCL event/callback system allows you to use your cpu as you normally would while the gpu kernels are executing. That way, you can use OpenMP or any other code on the host while you wait for the gpu kernel to finish.
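For the NUMA splitting mentioned above, a hedged sketch using clCreateSubDevices (OpenCL 1.2; the older cl_ext_device_fission extension offers the same idea). The function name and lack of error handling are illustrative only:

```c
#include <CL/cl.h>

/* Sketch: partition a CPU device into sub-devices along NUMA
 * boundaries. Assumes cpu_device came from clGetDeviceIDs with
 * CL_DEVICE_TYPE_CPU; error checking omitted. */
cl_uint split_by_numa(cl_device_id cpu_device,
                      cl_device_id *sub_devices, cl_uint max_sub_devices)
{
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
        CL_DEVICE_AFFINITY_DOMAIN_NUMA,
        0
    };
    cl_uint count = 0;
    clCreateSubDevices(cpu_device, props, max_sub_devices, sub_devices, &count);
    return count;   /* roughly one sub-device per NUMA node */
}
```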
There's no reason the CPU has to be idle, but it needs a separate job to do. Once you've submitted work to OpenCL you can:
Get on with something else, like preparing the next set of work, or performing calculation on something completely different.
Have the CPU set up as another compute device, and so submit a piece of work to it.
Personally I tend to find myself needing the first case more often, as it's rare that I have two independent tasks that both lend themselves to the OpenCL style. The trick is keeping things balanced so you're not waiting a long time for the GPU task to finish, or leaving the GPU idle while the CPU gets on with other work.
It's the same problem OpenGL coders had to conquer: avoiding being CPU- or GPU-bound, and balancing between the two for best performance.
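A quick sketch of the first option (overlapping host work with a running kernel); prepare_next_batch is a stand-in for whatever CPU-side work you have available:

```c
#include <CL/cl.h>

/* Sketch: the enqueue returns immediately, the host does useful work,
 * and only then blocks on the kernel's completion event. */
void overlap_cpu_and_gpu(cl_command_queue gpu_queue, cl_kernel kernel,
                         size_t global_size, void (*prepare_next_batch)(void))
{
    cl_event done;
    clEnqueueNDRangeKernel(gpu_queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, &done);
    clFlush(gpu_queue);          /* make sure the work is actually submitted */

    prepare_next_batch();        /* CPU stays busy while the GPU runs */

    clWaitForEvents(1, &done);   /* only now wait for the GPU */
    clReleaseEvent(done);
}
```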

OpenCL Execution model multiple queued kernels

I was curious as to how the GPU executes the same kernel multiple times.
I have a kernel which is being enqueued hundreds (possibly thousands) of times in a row, and using the AMD APP Profiler I noticed that it executes clusters of kernels extremely fast, then, like clockwork, every so often a kernel "hangs" (i.e. takes orders of magnitude longer to execute). I think it's every 64th kernel that hangs.
This is odd because each time through the kernel performs the exact same operations with the same local and global sizes. I'm even re-using the same buffers.
Is there something about the execution model that I'm missing (perhaps other programs/the OS accessing the GPU, or the timing frequency of the GPU memory)? I'm testing this on an ATI HD 5650 card under Windows 7 (64-bit), with the AMD APP SDK 2.5 and in-order queue execution.
As a side note, if I don't have any global memory accesses in my kernel (a rather impractical prospect), the profiler shows gaps between the quick-executing kernels, and where the slow-executing kernels used to be there is now a large empty gap in which none of my kernels are executing.
As a follow-up question, is there anything that can be done to fix this?
It's probable you're seeing the effects of your GPU's maximum number of concurrent tasks. Each task enqueued is assigned to one or more multiprocessors, which are frequently capable of running hundreds of work-items at a time - of the same kernel, enqueued in the same call. Perhaps what you're seeing is the OpenCL runtime waiting for one of the multiprocessors to free up. This relates most directly to the occupancy issue: if the work size can't keep the multiprocessor busy, through memory latencies and all, it has idle cycles. The limit here depends on how many registers (local or private memory) your kernel requires. In summary, you want to write your kernel to operate on multiple pieces of data rather than enqueueing it many times.
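As an illustration of that last point, a kernel sketch (not the asker's code) in which each work-item strides through several elements, so one enqueue covers a whole batch instead of hundreds of tiny launches:

```c
// Instead of enqueueing the same kernel once per chunk, let each
// work-item loop over multiple elements; a single enqueue then
// processes the whole data set.
__kernel void scale_all(__global float *data, const uint n, const float factor)
{
    for (size_t i = get_global_id(0); i < n; i += get_global_size(0))
        data[i] *= factor;
}
```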
Did your measurement include reading back results from the apparently fast executions?