CUDA MPS for OpenCL? - opencl

CUDA MPS allows you to run multiple processes in parallel on the GPU, so that workloads which cannot saturate the GPU on their own can still utilize it fully. Is there an equivalent for OpenCL? Or is there a different approach in OpenCL?

If you use multiple OpenCL command queues that don't have event interdependencies, an OpenCL runtime could keep the GPU cores busy with varied work from each queue. It's really up to the implementation as to whether this actually happens. You'd need to check each vendor's OpenCL guide to see if they support concurrent GPU kernels.

Related

Vectorized Code on GPU

I am using OpenCL to execute a procedure on different GPUs and CPUs simultaneously to get high-performance results. The Intel OpenCL compiler always shows a message that the kernel is not vectorized, so it will run on different cores but will not use SIMD instructions. My question is: if I rewrite the code so that SIMD instructions can be exploited by the OpenCL code, will it also increase GPU performance?
Yes, but beware that this is not necessary for good performance on AMD GCN-based APUs/GPUs or on Nvidia Fermi or newer GPUs: they execute scalar operations with high utilization. CPUs and Intel's GPUs, however, can benefit greatly from SIMD instructions, which is what the vector operations boil down to.

OpenCL AMD S10000 dual GPU execution

I have the AMD S10000, which has 2 GPUs inside. When I run clinfo, the output suggests that these are treated as separate GPUs. To run my kernel across both of these GPUs, do I need to create 2 separate OpenCL queues and partition my work-groups? Do these two GPUs share memory?
Yes, you will need to create separate command queues for each GPU and manually partition the workload between them. The GPUs do not share memory, so you will also have to make sure data is transferred to both GPUs as necessary. If you create a single context containing both GPUs, the implementation will automatically deal with moving buffers between the GPUs as and when needed. However, in my experience it is often better to do this explicitly, as sometimes the implementation will generate false dependencies between kernels that both use the same buffer and will serialise kernel execution.
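To illustrate the manual partitioning mentioned above, here is a minimal sketch in plain C that splits a 1-D range of work-groups across several devices. The names work_slice and partition_work are hypothetical helpers made up for this example, not part of the OpenCL API, and the sketch assumes all devices are equally fast, so it divides the work-groups evenly.

```c
#include <assert.h>
#include <stddef.h>

/* One device's share of a 1-D NDRange (hypothetical helper type). */
typedef struct {
    size_t offset; /* first global work-item index handled by this device */
    size_t count;  /* number of work-items handled by this device         */
} work_slice;

/*
 * Split total_groups work-groups of local_size work-items each across
 * num_devices devices as evenly as possible. Slices are always whole
 * multiples of local_size, so no work-group is cut in half.
 * Assumption: every device has the same throughput.
 */
void partition_work(size_t total_groups, size_t local_size,
                    size_t num_devices, work_slice *out)
{
    assert(num_devices > 0 && local_size > 0 && out != NULL);

    size_t base = total_groups / num_devices;  /* groups every device gets   */
    size_t extra = total_groups % num_devices; /* leftover groups to spread  */
    size_t next_group = 0;                     /* first group of next slice  */

    for (size_t d = 0; d < num_devices; ++d) {
        /* The first `extra` devices take one additional work-group. */
        size_t groups = base + (d < extra ? 1 : 0);
        out[d].offset = next_group * local_size;
        out[d].count = groups * local_size;
        next_group += groups;
    }
}
```

Each slice's offset and count would then be passed as the global_work_offset and global_work_size arguments of clEnqueueNDRangeKernel on that device's own command queue. If the two GPUs are not equally fast, the even split would need to be replaced by a weighted one.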

HyperQ support in OpenCL

I want to run heterogeneous kernels that execute asynchronously on a single GPU. I think this is possible on the Nvidia Kepler K20 (or any device with compute capability 3.5+) by launching each of these kernels in a different stream; the runtime system then maps them to different hardware queues based on resource availability.
Is this feature accessible in OpenCL?
If it is so, what is the equivalent of a CUDA 'Stream' in OpenCL?
Do Nvidia drivers support such an execution on their K20 cards through OpenCL?
Is there any AMD GPU that has a similar feature (or is anything in development)?
An answer to any of these questions would help me a lot.
In principle, you can use OpenCL command queues to achieve CKE (Concurrent Kernel Execution). You can launch them from different CPU threads. Here are a few links that might help you get started:
How do I know if the kernels are executing concurrently?
http://devgurus.amd.com/thread/142485
I am not sure how it would work with NVIDIA Kepler GPUs, as we are having strange issues using OpenCL on a K20 GPU.

Can a GPU be the host of an OpenCL program?

Little disclaimer: This is more the kind of theoretical / academic question than an actual problem I've got.
The usual way of setting up a parallel program in OpenCL is to write a C/C++ program, which sets up the devices (GPU and/or other CPUs), kernel and data buffers for executing the kernel on the device.
This program gets launched from the host, which is usually a CPU.
Would it be possible to write an OpenCL program where the host is a GPU and the devices are other GPUs and/or CPUs?
What would be the prerequisites for such a scenario?
Does one need a special GPU, or would it be possible to use any OpenCL-capable GPU?
Are you looking for a complete host or just a kernel launcher?
The upcoming CUDA release (v5.0) introduces a feature for launching a kernel from inside a kernel (dynamic parallelism). A device can therefore be used to launch a kernel on itself. Maybe this feature will be supported by OpenCL too in the near future.

Sharing the GPU between OpenCL capable programs

Is there a method to share the GPU between two separate OpenCL capable programs, or more specifically between two separate processes that simultaneously both require the GPU to execute OpenCL kernels? If so, how is this done?
It depends what you call sharing.
In general, you can create 2 processes that each create an OpenCL context on the same GPU. It's then the driver/OS/GPU's responsibility to make sure things just work.
That said, most implementations will time-slice the GPU execution to make that happen (just like it happens for graphics).
I sense this is not exactly what you're after, though. Can you expand your question with a use case?
Current GPUs (except Nvidia's Fermi) do not support simultaneous execution of more than one kernel. Moreover, to date GPUs do not support preemptive multitasking; it's completely cooperative! A kernel's execution cannot be suspended and resumed later. So the granularity of any time-based GPU sharing depends on the kernels' execution times.
If you have multiple programs running that require GPU access, you should therefore make sure that your kernels have short runtimes (< 100 ms is a rule of thumb), so that GPU time can be time-sliced among the kernels that want GPU cycles. This is also important because otherwise the host system's graphics will become very unresponsive, as they need GPU access too. This can go so far that a kernel stuck in an endless or very long loop will appear to crash the system.
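One way to follow the short-runtime rule of thumb is to split a large job into several smaller launches. The sketch below, in plain C, estimates how many work-items fit into a time budget. The function names items_per_launch and launches_needed are hypothetical, and the per-item cost (ns_per_item) is assumed to come from your own measurement, for example from OpenCL event profiling on a small test launch. It also assumes runtime grows roughly linearly with the number of work-items, which will not hold for every kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Estimate the largest number of work-items per launch that should finish
 * within budget_ms, given a measured cost of ns_per_item nanoseconds per
 * work-item. The result is rounded down to a whole number of work-groups
 * of local_size items, but is never smaller than one work-group.
 * Assumption: runtime scales linearly with the number of work-items.
 */
size_t items_per_launch(double ns_per_item, double budget_ms, size_t local_size)
{
    assert(ns_per_item > 0.0 && budget_ms > 0.0 && local_size > 0);

    double max_items = (budget_ms * 1.0e6) / ns_per_item; /* ms -> ns */
    size_t groups = (size_t)(max_items / (double)local_size);
    if (groups == 0) {
        groups = 1; /* a launch needs at least one work-group */
    }
    return groups * local_size;
}

/* Number of launches needed to cover total_items with chunks of size chunk. */
size_t launches_needed(size_t total_items, size_t chunk)
{
    assert(chunk > 0);
    return (total_items + chunk - 1) / chunk; /* ceiling division */
}
```

For example, with a measured cost of 50 ns per work-item, a 100 ms budget and work-groups of 256 items, each launch would cover 1,999,872 work-items, and a job of 10,000,000 work-items would take 6 launches. Each launch would then use a different global work offset so that together the launches cover the whole range.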
