I want to run heterogeneous kernels asynchronously on a single GPU. I think this is possible on the Nvidia Kepler K20 (or any device with compute capability 3.5+) by launching each kernel into a different stream, with the runtime system mapping the streams to different hardware queues based on resource availability.
Is this feature accessible in OpenCL?
If so, what is the equivalent of a CUDA 'Stream' in OpenCL?
Do Nvidia drivers support such an execution on their K20 cards through OpenCL?
Is there any AMD GPU with a similar feature (or is anything in development)?
An answer to any of these questions would help me a lot.
In principle, you can use OpenCL command queues to achieve CKE (Concurrent Kernel Execution). You can enqueue kernels to them from different CPU threads. Here are a few links that might help you get started:
How do I know if the kernels are executing concurrently?
http://devgurus.amd.com/thread/142485
I am not sure how it would work with NVIDIA Kepler GPUs, as we are having strange issues using OpenCL on a K20 GPU.
Related
CUDA MPS allows you to run multiple processes in parallel on the GPU, thus fully utilizing the GPU for workloads that don't saturate it on their own. Is there an equivalent for OpenCL? Or is there a different approach in OpenCL?
If you use multiple OpenCL command queues that don't have event interdependencies, an OpenCL runtime could keep the GPU cores busy with varied work from each queue. It's really up to the implementation as to whether this actually happens. You'd need to check each vendor's OpenCL guide to see if they support concurrent GPU kernels.
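One way to check whether a given implementation actually overlaps the kernels is to compare their profiling timestamps. The following is only a minimal sketch: it assumes the queues were created with CL_QUEUE_PROFILING_ENABLE, that both kernels ran on the same device, and that the start/end times were already read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, in nanoseconds). The names kernel_span, spans_overlap and count_overlapping_pairs are hypothetical helpers, not part of OpenCL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Device-side start/end time of one kernel, in nanoseconds.
   Assumed to come from clGetEventProfilingInfo on that kernel's event. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} kernel_span;

/* Two half-open intervals [start, end) overlap if and only if
   each one starts before the other one ends. */
bool spans_overlap(kernel_span a, kernel_span b)
{
    return a.start_ns < b.end_ns && b.start_ns < a.end_ns;
}

/* Count how many pairs among n kernels overlapped in time. */
size_t count_overlapping_pairs(const kernel_span *spans, size_t n)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if (spans_overlap(spans[i], spans[j]))
                ++pairs;
    return pairs;
}
```

If count_overlapping_pairs returns 0 for kernels enqueued on independent queues, the runtime most likely serialized them. A non-zero count suggests, but does not strictly prove, concurrent execution.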
I am using OpenCL to execute a procedure on different GPUs and CPUs simultaneously to get high-performance results. The Intel OpenCL compiler always shows a message that the kernel is not vectorized, so it will run on different cores but will not use SIMD instructions. My question is: if I rewrite the code so that SIMD instructions can be exploited, will it also increase GPU performance?
Yes, but beware that this is not necessary for good performance on AMD GCN-based APUs/GPUs or on Nvidia Fermi or later GPUs: they execute scalar operations with high utilization. CPUs and Intel GPUs, however, can benefit greatly from SIMD instructions, which is what vector operations boil down to.
When we have a multi-core CPU, OpenCL treats it as a single device with multiple compute units, and for every device we can create some command queues. How can the CPU, as the host, create a command queue on itself? I think in this situation it becomes multithreading rather than parallel computing.
Some devices, including most CPU devices, can be segmented into sub-devices using the extension "cl_ext_device_fission". When you use device fission, you are still getting parallel processing in that the host thread can do other tasks while the kernel is running on some CPU cores.
When not using device fission, the CPU device will essentially block the host program while a kernel is running. Even if some OpenCL implementations are non-blocking during kernel execution, the performance hit to the host would be too great to allow much work to be done by the host thread.
So it's still parallel computation, but I guess the host application's core is technically multi-threading during kernel execution.
It's parallel computing using multithreading. When you use an OpenCL CPU driver and enqueue kernels for the CPU device, the driver uses threads to execute the kernel, in order to fully leverage all of the CPU's cores (and also typically uses vector instructions like SSE to fully utilize each core).
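To make the idea concrete, here is a purely conceptual sketch of how a CPU runtime might split a one-dimensional range of work-items across worker threads. It is not how any particular driver is implemented; scale_item, chunk_t and run_ndrange are hypothetical names, POSIX threads are assumed, and the "kernel" is just a function that doubles one element.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical kernel body: the work done for one global id. */
static void scale_item(size_t gid, float *data)
{
    data[gid] *= 2.0f;
}

/* Contiguous range of work-items [begin, end) handled by one thread. */
typedef struct {
    size_t begin;
    size_t end;
    float *data;
} chunk_t;

static void *worker(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    for (size_t gid = c->begin; gid < c->end; ++gid)
        scale_item(gid, c->data);
    return NULL;
}

/* Run global_size work-items on num_threads threads.
   Returns 0 on success, -1 on bad arguments or thread creation failure. */
int run_ndrange(float *data, size_t global_size, unsigned num_threads)
{
    enum { MAX_THREADS = 64 };
    if (num_threads == 0 || num_threads > MAX_THREADS)
        return -1;

    pthread_t threads[MAX_THREADS];
    chunk_t chunks[MAX_THREADS];
    size_t per_thread = (global_size + num_threads - 1) / num_threads;
    unsigned started = 0;
    int rc = 0;

    for (unsigned t = 0; t < num_threads; ++t) {
        size_t begin = (size_t)t * per_thread;
        if (begin > global_size)
            begin = global_size;
        size_t end = begin + per_thread;
        if (end > global_size)
            end = global_size;
        chunks[t] = (chunk_t){ begin, end, data };
        if (pthread_create(&threads[t], NULL, worker, &chunks[t]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }

    /* Wait for every thread that was actually started. */
    for (unsigned t = 0; t < started; ++t)
        pthread_join(threads[t], NULL);
    return rc;
}
```

Calling run_ndrange(buffer, n, 4) would process n elements on four threads. A real CPU driver would additionally vectorize the per-item work and reuse a thread pool rather than creating threads per launch.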
With CUDA, Nsight gives us detailed timelines for each kernel.
Is there a similar tool for Intel OpenCL? Basically, I want to see whether my three kernels are running concurrently or not.
Little disclaimer: This is more of a theoretical/academic question than an actual problem I have.
The usual way of setting up a parallel program in OpenCL is to write a C/C++ program, which sets up the devices (GPU and/or other CPUs), kernel and data buffers for executing the kernel on the device.
This program gets launched from the host, which is usually a CPU.
Would it be possible to write an OpenCL program where the host is a GPU and the devices are other GPUs and/or CPUs?
What would be the prerequisites for such a scenario?
Does one need a special GPU, or would it be possible to use any OpenCL-capable GPU?
Are you looking for a complete host or just a kernel launcher?
The upcoming CUDA release (v5.0) introduces a feature to launch a kernel from inside a kernel, so a device can launch a kernel on itself. Maybe OpenCL will support this feature too in the near future.