OpenCL Parallel Dispatch - opencl

I am using the beta support for OpenCL 2.0 on NVIDIA and targeting high-end GPUs such as the GTX 1080 Ti. In my compute pipeline, I sometimes need to dispatch independent image-processing work on relatively small images. In theory, I think these images should be able to be processed in parallel on a single GPU, because the number of work groups for a single image won't saturate all the compute units of the GPU.
Is this possible in OpenCL? Does this have a name in OpenCL?
If it is possible, is using multiple queues for a single device the only way to do this? Or will the driver look at the "waitEventList" and decide which kernels can be processed in parallel?
Do I need CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE?

1- Yes, this is one of the ways to achieve high occupancy of the compute units. A general name for it could be "pipelining" (with the help of asynchronous enqueueing and/or dynamic parallelism). There are different ways to do it. One is to use three queues coordinated with wait events: reads on one queue, writes on another, and compute on a third. A second is to have M queues, each doing a different image's read-compute-write work, without events.
2- You can even use a single queue, as long as it is an out-of-order queue, so that kernels are dispatched independently. At least on some AMD cards, even an in-order queue can run independent kernels concurrently (according to AMD's CodeXL); this may be outside the OpenCL spec. Wait events can act as a constraint that prevents this kind of driver-side optimization (again, at least on AMD).
From OpenCL 2.x onwards there is device-side enqueueing, so you can enqueue one kernel from the host and that kernel can enqueue N kernels without host intervention (if all the data is already uploaded to the card). This may not hide latency as well as multiple host-side queues do if data still has to be moved from host to device.
3- Vendors are not required to support out-of-order execution, so this may not work.
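As a rough illustration of what wait lists permit on an out-of-order queue, the following Python sketch groups commands into "waves" whose members have no dependency on each other. It is only a model of the event graph: the command names and the concurrency_waves helper are made up for this example, and a real driver is free to run eligible commands one after another anyway.

```python
def concurrency_waves(wait_lists):
    """Group commands into waves; commands in one wave do not wait on each other.

    wait_lists maps a command name to the names of the commands it waits on
    (the analogue of an event wait list).
    """
    remaining = dict(wait_lists)
    done = set()
    waves = []
    while remaining:
        # A command is eligible once everything in its wait list has completed.
        ready = sorted(name for name, deps in remaining.items()
                       if set(deps) <= done)
        if not ready:
            raise ValueError("wait lists contain a cycle or an unknown command")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Two small images, each with its own upload -> filter -> read chain.
commands = {
    "upload_a": [],
    "upload_b": [],
    "filter_a": ["upload_a"],
    "filter_b": ["upload_b"],
    "read_a": ["filter_a"],
    "read_b": ["filter_b"],
}
waves = concurrency_waves(commands)
```

Here the two chains never reference each other, so each stage of image A is eligible to overlap with the same stage of image B.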

Related

Can opencl chain multiple passes without returning to CPU?

I want to auto-scale some data. So, I want to pass through all the data and find its maximum extents. Then I want to go through the data again, do calculations, and send the results to OpenGL for rendering. Is this kind of multipass processing possible in OpenCL? Or does the CPU have to direct the "find extents" calculation, get the results, and then direct the other calculation with them?
It sounds like you would need two OpenCL kernels, one for calculating the min and max and the other to actually scale the data. Using OpenCL command queues and events you can queue up these two kernels in order and store the results from the first in global memory, reading those results in the second kernel. The semantics of OpenCL command queues and events (assuming you don't have out-of-order execution enabled) will ensure that one completes before the other without any interaction from your host application (see clEnqueueNDRangeKernel).
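The two-kernel structure can be modelled on the host in plain Python. This is only a sketch of the data flow, not OpenCL code: find_extents and scale_data stand in for the two kernels, group_size is an assumed work-group size, and the extents list plays the role of the global-memory buffer that the second kernel reads.

```python
def find_extents(data, group_size=4):
    """Pass 1: each 'work group' reduces its slice, then the partials are combined."""
    partials = [
        (min(data[i:i + group_size]), max(data[i:i + group_size]))
        for i in range(0, len(data), group_size)
    ]
    lo = min(p[0] for p in partials)
    hi = max(p[1] for p in partials)
    return [lo, hi]


def scale_data(data, extents):
    """Pass 2: map every value into [0, 1] using the stored extents."""
    lo, hi = extents
    span = hi - lo
    if span == 0:
        # All values are equal; avoid dividing by zero.
        return [0.0 for _ in data]
    return [(x - lo) / span for x in data]


samples = [2.0, 10.0, 4.0, 6.0, 8.0]
extents = find_extents(samples)        # stays "on the device" between passes
scaled = scale_data(samples, extents)  # second pass reads the stored extents
```

In the real version the host would enqueue both kernels back to back and never read the extents buffer itself.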

Translating C code to OpenCL

I am trying to translate a small program written in C into OpenCL. I am supposed to transfer some input data to the GPU and then perform ALL calculations on the device using successive kernel calls.
However, I am facing difficulties with parts of the code that are not suitable for parallelization since I must avoid transferring data back and forth between CPU and GPU because of the amount of data used.
Is there a way to execute some kernels without the parallel processing so I can replace these parts of code with them? Is this achieved by setting global work size to 1?
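Yes, you can execute code serially on OpenCL devices. To do this, write your kernel code the same way you would in C and then execute it with the clEnqueueTask() function. This is equivalent to calling clEnqueueNDRangeKernel with a global work size of 1 and a local work size of 1, so setting the global work size to 1 achieves the same thing. Note that clEnqueueTask is deprecated as of OpenCL 2.0.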
You could manage two devices:
the GPU for highly parallelized code
the CPU for sequential code
This is a bit complex, as you must manage one command queue per device to schedule each kernel on the appropriate device.
If the devices are part of the same platform (typically AMD) you can use the same context, otherwise you will have to create one more context for the CPU.
Moreover, if you want more fine-grained task parallelism on the CPU, you could use device fission if your CPU supports it.

Write multiple kernels or a Single kernel

Suppose that I have two big functions. Is it better to write them as separate kernels and call them sequentially, or is it better to write only one kernel? (I don't want to read the data back and forth between host and device in between.) What about the speed-up if I want to call the kernel many times?
One thing to consider is the effect of register pressure on hardware utilization and performance.
As a general rule, big kernels have big register footprints. Typical OpenCL devices (i.e. GPUs) have very finite register file sizes, and large kernels can result in lower concurrency (fewer concurrent warps/wavefronts), fewer opportunities for latency hiding, and poorer overall performance. On the other hand, kernel launch overheads are pretty low on most platforms, so if your algorithm doesn't have an enormous amount of state to save between "phases" of execution, the penalty of using multiple kernels can be rather low.
Using multiple kernels also has another side benefit -- you get implicit synchronization between all work units for free. Often that can eliminate the need for atomic memory operations and synchronization primitives which can have a negative impact on code performance.
The ultimate guide should be measured performance. There is no universal rule of thumb for this sort of thing. Benchmarking is the only way to know for sure.
In general this is a question of (maybe) slightly better performance vs. readability of your code. Copying buffers is no issue as long as you keep them within the same context. E.g. you could set one output buffer of a kernel to be an input buffer of the next kernel, which would not involve any copying.
The proper way to code in OpenCL is to separate your code into parallel tasks, each of which is a kernel. That is, each "for loop" should be a kernel. Sometimes a single CPU function can turn into a four-kernel implementation in OpenCL.
If you need to store data between kernel executions, just use OpenCL buffers and do not copy to the host (this avoids the DEVICE<->HOST bottleneck).
If both functions act on different data you could probably write a single kernel, but that depends on the complexity of the operation being run.

Are OpenCL work items executed in parallel?

I know that work items are grouped into the work groups, and you cannot synchronize outside of a work group.
Does it mean that work items are executed in parallel?
If so, is it possible/efficient to make 1 work group with 128 work items?
The work items within a group will be scheduled together, and may run together. It is up to the hardware and/or drivers to choose how parallel the execution actually is. There are different reasons for this, but one very good one is to hide memory latency.
On my AMD card, the 'compute units' are divided into 16 4-wide SIMD units. This means that 16 work items can technically be run at the same time in the group. It is recommended that we use multiples of 64 work items in a group, to hide memory latency. Clearly they cannot all be run at exactly the same time. This is not a problem, because most kernels are in fact memory bound, so the scheduler (hardware) will swap out the work items that are waiting on the memory controller while the 'ready' items get their compute time. The actual number of work items in the group is set by the host program, and limited by CL_DEVICE_MAX_WORK_GROUP_SIZE. You will need to experiment to find the optimal work group size for your kernel.
The CPU implementation is 'worse' when it comes to simultaneous work items. There are only ever as many work items running as you have cores available to run them on. They behave more sequentially on the CPU.
So do work items run at exactly the same time? Almost never, really. This is why we need to use barriers when we want to be sure they pause at a given point.
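The arithmetic behind that recommendation can be sketched as follows. This is a deliberately simplified model that assumes one work item per SIMD lane per pass; the function name is hypothetical and the lane count of 16 is taken from the description above rather than from any vendor documentation.

```python
import math


def passes_per_group(work_items, lanes=16):
    """Number of passes needed to issue every work item of a group once."""
    return math.ceil(work_items / lanes)


passes_64 = passes_per_group(64)     # a group of 64 needs several passes
passes_128 = passes_per_group(128)   # the 128-item group from the question
```

Under this model a 128-item group is simply issued in more passes than a 64-item one, which gives the scheduler more work to switch to while other items wait on memory.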
In the (abstract) OpenCL execution model, yes, all work items execute in parallel, and there can be millions of them.
Inside a GPU, all work items of the same work group must be executed on a single "core". This puts a physical restriction on the number of work items per work group (256 or 512 is the max, but it can be smaller for large kernels using a lot of registers). All work groups are then scheduled on the (usually 2 to 16) cores of the GPU.
You can synchronize threads (work items) inside a work group, because they all are resident in the same core, but you can't synchronize threads from different work groups, since they may not be scheduled at the same time, and could be executed on different cores.
Yes, it is possible to have 128 work items inside a work group, unless it consumes too many resources. To reach maximum performance, you usually want to have the largest possible number of threads in a work group (at least 64 are required to hide memory latency, see Vasily Volkov's presentations on this subject).
The idea is that they can be executed in parallel if possible (whether they actually are executed in parallel depends on the hardware and the driver).
Yes, work items are executed in parallel.
To get the maximal possible number of work items, use clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE. It depends on the hardware.
Whether it's efficient or not primarily depends on the task you want to implement. If you need a lot of synchronization, it may be that OpenCL does not fit your task. I can't say much more without knowing what you actually want to do.
The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

How many tasks can be executed simultaneously on GPU device?

I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single-core processor can execute only one task at any one moment, a dual-core can execute 2 tasks... How many tasks can my GPU execute at one moment? Or rephrased: How many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
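Spelling out the arithmetic from this answer (the variable names are only for illustration; the figures are the ones quoted above for the RV770):

```python
simd_cores = 10              # matches CL_DEVICE_MAX_COMPUTE_UNITS in the question
shader_cores_per_simd = 16
alus_per_shader_core = 5     # VLIW5

shader_cores = simd_cores * shader_cores_per_simd    # the "tasks" figure above
total_alus = shader_cores * alus_per_shader_core     # ALUs working in parallel
```

Neither number equals the 2560 guessed in the question, which multiplies the compute units by a limit on work-group dimensions rather than by a count of hardware units.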
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
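In OpenCL 1.x the global size must be evenly divisible by the work-group size, so a common pattern is to round the global size up and have the kernel skip the padding work items. A minimal sketch, assuming a preferred multiple of 64 (in real code the value would come from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE); round_up and the problem size are hypothetical:

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


preferred_multiple = 64                # assumed; query the kernel in real code
n_items = 1000                         # hypothetical problem size
local_size = preferred_multiple * 2    # one candidate to benchmark
global_size = round_up(n_items, local_size)
padding = global_size - n_items        # work items the kernel should skip
```

The kernel would then compare its global id against n_items and return early for the padded items.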
GPU work ends up becoming wavefronts/warps.
Using a GPU for UI and compute is effectively using it for many programs without being aware of it: many for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instances of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything that ensures you won't have starvation (long delays before a work item actually begins executing). Some work items may sit for a very long time (tens of seconds or more in bad cases) while the GPU does other work items. I speculate that items are dispatched to pipelines based on data availability and that little or nothing is there to prevent starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue almost (or sometimes completely) to empty, which reduces starvation delays for the GUI drawing work items.
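One way to limit how far ahead work is enqueued is a bounded window of outstanding submissions. The sketch below is a host-side model only: run_throttled, fake_submit and fake_wait are hypothetical names, and in an OpenCL program the submit step would be something like clEnqueueNDRangeKernel returning an event, with clWaitForEvents as the wait step. It assumes max_in_flight is at least 1.

```python
from collections import deque


def run_throttled(jobs, submit, wait, max_in_flight=2):
    """Submit jobs, but never keep more than max_in_flight outstanding."""
    in_flight = deque()
    peak = 0
    for job in jobs:
        if len(in_flight) == max_in_flight:
            wait(in_flight.popleft())      # block on the oldest submission
        in_flight.append(submit(job))
        peak = max(peak, len(in_flight))
    while in_flight:                       # drain whatever is still outstanding
        wait(in_flight.popleft())
    return peak


log = []


def fake_submit(job):
    log.append(("submit", job))
    return job                             # the "event" handle is just the job id


def fake_wait(handle):
    log.append(("wait", handle))


peak = run_throttled(range(5), fake_submit, fake_wait, max_in_flight=2)
```

The right window size is something to tune by experiment: a smaller window leaves more gaps for GUI drawing, a larger one keeps the GPU busier.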
