How to launch multiple kernels in OpenCL from inside a program? - opencl

I'm trying to measure the performance of the OpenCL programming model on GPUs. To test it, I launch a kernel with clEnqueueNDRangeKernel(), and I'm calling this function multiple times so that I can see how it performs when two or four concurrent kernels are launched.
I observe that the program takes the same amount of time as launching one kernel, so I'm assuming it is only running the kernel once; there is no way it takes the same amount of time to run two or four concurrent kernels.
Now I want to know how to launch multiple kernels on one GPU.
For example, I want to launch something like:
clEnqueueNDRangeKernel()
clEnqueueNDRangeKernel()
How can I do this?

First of all, check whether your device supports concurrent kernel execution. The latest AMD and Nvidia cards do.
Then, create multiple command queues. If you enqueue kernels into the same in-order queue, they will be executed sequentially, one after another.
Finally, check that the kernels were indeed executed in parallel. Use the profilers from your SDK, or OpenCL events, to gather profiling info. A host-side sketch follows.
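As a minimal sketch of that setup, assuming context, device, and kernel have already been created and with error handling omitted, the host code might look like this:

cl_command_queue q0 = clCreateCommandQueue(context, device,
                                           CL_QUEUE_PROFILING_ENABLE, NULL);
cl_command_queue q1 = clCreateCommandQueue(context, device,
                                           CL_QUEUE_PROFILING_ENABLE, NULL);

size_t global = 1024 * 1024;  /* example global work size */
cl_event evs[2];
clEnqueueNDRangeKernel(q0, kernel, 1, NULL, &global, NULL, 0, NULL, &evs[0]);
clEnqueueNDRangeKernel(q1, kernel, 1, NULL, &global, NULL, 0, NULL, &evs[1]);
clFlush(q0);  /* push the work to the device */
clFlush(q1);
clWaitForEvents(2, evs);

/* Compare the start/end timestamps of the two events to verify overlap. */
cl_ulong s0, e0, s1, e1;
clGetEventProfilingInfo(evs[0], CL_PROFILING_COMMAND_START, sizeof(s0), &s0, NULL);
clGetEventProfilingInfo(evs[0], CL_PROFILING_COMMAND_END,   sizeof(e0), &e0, NULL);
clGetEventProfilingInfo(evs[1], CL_PROFILING_COMMAND_START, sizeof(s1), &s1, NULL);
clGetEventProfilingInfo(evs[1], CL_PROFILING_COMMAND_END,   sizeof(e1), &e1, NULL);

If the [s1, e1] interval overlaps [s0, e0], the kernels really did run concurrently.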

Related

User mode and kernel mode: different programs at the same time

Is it possible that one process is running in kernel mode and another in user mode at the same time?
I know it's not a coding question, but please guide me if you know the answer.
For two processes to actually be running at the same time, you must have multiple CPUs. And indeed, when you have multiple CPUs, what runs on the different CPUs is very loosely coupled: you can definitely have one process running user code on one CPU while another process runs kernel code (e.g., doing some work inside a system call) on another CPU.
If you are asking about just one CPU, then you can't have two running processes at the same time. But you can have two runnable processes, i.e., two processes that are both ready to run but, since there is just one CPU, only one of them can actually run. One of the runnable processes might be in user mode - e.g., consider a long-running tight loop that was preempted after its time quota was over. Another runnable process might be in kernel mode - e.g., consider a process that did a read() system call from disk: the kernel sent the read request to the disk, and when the request completed, the process became ready to run again in kernel mode to finish the read() call.
Yes, it is possible. Even multiple processes can be in the kernel mode at the same time.
It's just that a single process cannot be in both modes at the same time.
Correct me if I'm wrong, but I suppose there are no processes in kernel mode, only threads.

Using the same GPU memory object

Suppose you create two threads and have both of them enter a loop in which each starts the same kernel, using the same OpenCL memory object (a Buffer from cl.hpp in my case). Will it work properly? Does OpenCL allow different kernels to run at the same time with the same memory object?
(I am using the OpenCL C++ wrapper cl.hpp and Beignet, Intel's open source library.)
If both threads are using the same in-order command queue, it will work just fine; it just becomes a race as to which thread enqueues its work first. From the OpenCL runtime's point of view, it's just commands in a queue.
OpenCL 1.1 (and newer) is thread-safe, except for clSetKernelArg and clEnqueueNDRangeKernel on a given kernel; you'll need to lock around those, for example as sketched below.
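A minimal sketch of that locking, assuming a POSIX-threads host program and previously created queue and kernel objects (the helper name is hypothetical):

#include <CL/cl.h>
#include <pthread.h>

static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize clSetKernelArg + clEnqueueNDRangeKernel for a shared cl_kernel. */
void enqueue_with_arg(cl_command_queue queue, cl_kernel kernel,
                      cl_mem buf, size_t global)
{
    pthread_mutex_lock(&kernel_lock);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    pthread_mutex_unlock(&kernel_lock);
}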
If, however, your threads are using two different command queues, then you shouldn't use the same memory object without synchronizing through OpenCL event objects - unless the object is read-only; that should be fine. An event-based sketch follows.
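For the two-queue case, a hedged sketch of event-based synchronization, assuming context and device exist and two kernels writeKernel and readKernel (hypothetical names) both reference the same buffer:

cl_command_queue qa = clCreateCommandQueue(context, device, 0, NULL);
cl_command_queue qb = clCreateCommandQueue(context, device, 0, NULL);

size_t global = 4096;
cl_event writeDone;
clEnqueueNDRangeKernel(qa, writeKernel, 1, NULL, &global, NULL,
                       0, NULL, &writeDone);
/* The second queue waits on the event before touching the shared buffer. */
clEnqueueNDRangeKernel(qb, readKernel, 1, NULL, &global, NULL,
                       1, &writeDone, NULL);
clFinish(qb);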
Read operations on the same OpenCL memory object by concurrent kernels won't cause any functionality issues. Concurrent write operations, however, certainly will.
What is the objective of running multiple kernels concurrently? Please check this answer to a similar question.

How to implement a program in OpenCL using MPI on a single-CPU machine

I'm new to GPU programming. I have a laptop without a graphics card, and I want to develop a matrix multiplication program with Intel's OpenCL and then implement the application using MPI.
Any guidelines and helpful links would be appreciated.
I'm confused about the MPI part: do we have to write code for MPI ourselves, or do we use some already-developed MPI implementation to run our application?
This is the project proposal of what I want to do:
GPU cluster computation (C++, OpenCL and MPI)
Study MPI for distributing the problem
Implement OpenCL apps on a single machine (matrix multiplication / 2D image processing)
Implement apps with MPI (e.g. large 2D image processing)
So the thing to understand is that MPI and OpenCL are, for your purposes, completely orthogonal. MPI is for communicating between your GPU nodes; OpenCL is for accelerating the local computation on a single node by using the GPU (or multiple CPU cores). For any of these problems, you'd start by writing a serial C++ version of the code. The next step would be to (in either order) work on an OpenCL implementation for a single node, and work on an MPI version that decomposes the problem (you don't want to use master-slave for any of the problems listed above) onto multiple processes, with each process doing its local part of the computation that contributes to the global solution. Once both of those parts are done, you'd merge the two and have a distributed-memory (the MPI part) GPU (the OpenCL part) version of a code that solves this problem.
It won't quite be that easy, of course, and combining the two will take a fair bit of work, but that's the basic approach to keep in mind. Start with one problem, get it working on a single processor in C++, then try it with one or the other. Don't try to do everything at once or you'll never get anywhere.
For problems like matrix multiplication, there are many, many examples on the internet of both GPU and MPI implementations to learn from; a rough sketch of the MPI side is shown below.
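As a rough sketch of only the MPI side, assuming a square N×N matrix whose rows divide evenly among the ranks (the inner loop is a plain-C stand-in for where the single-node OpenCL kernel would go; no error handling):

#include <mpi.h>
#include <stdlib.h>

#define N 512   /* assumed matrix dimension, divisible by the process count */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;
    float *A = NULL, *C = NULL;
    float *B      = malloc(N * N * sizeof(float));
    float *Ablock = malloc(rows * N * sizeof(float));
    float *Cblock = malloc(rows * N * sizeof(float));
    if (rank == 0) {
        A = malloc(N * N * sizeof(float));
        C = malloc(N * N * sizeof(float));
        /* ... fill A and B with input data ... */
    }

    /* Root scatters row blocks of A and broadcasts all of B. */
    MPI_Scatter(A, rows * N, MPI_FLOAT, Ablock, rows * N, MPI_FLOAT,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Local part: this loop is where the OpenCL kernel would be invoked. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += Ablock[i * N + k] * B[k * N + j];
            Cblock[i * N + j] = sum;
        }

    /* Root gathers the result blocks back into C. */
    MPI_Gather(Cblock, rows * N, MPI_FLOAT, C, rows * N, MPI_FLOAT,
               0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}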
Simplified:
MPI is a library for communication between processes, but also a platform for running applications on a cluster. You write a program that uses the MPI library, and that program is then executed with MPI. MPI launches N instances of the application in the cluster and lets those instances communicate through messages.
Which tasks the instances perform, whether they are identical or specialized workers, and the topology are all up to you.
I can think of three ways to combine OpenCL and MPI:
MPI starts (K+1) instances: one master and K slaves. The master splits the data into chunks and the slaves process the data on the GPUs using OpenCL. All slaves are the same.
MPI starts (K+1) instances: one master and K slaves. Each slave computes a specialized problem (slave 1 matrix multiplication, slave 2 block compression, etc.) and the master directs the data through a workflow kind of arrangement.
MPI starts (K+1) instances: one master and K slaves. Same as case 1, but the master also sends the slaves the OpenCL program used to process the data.

OpenCL Execution model multiple queued kernels

I was curious as to how the GPU executes the same kernel multiple times.
I have a kernel which is being queued hundreds (possibly thousands) of times in a row, and using the AMD App Profiler I noticed that it would execute clusters of kernels extremely fast, then like clockwork every so often a kernel would "hang" (i.e. take orders of magnitude longer to execute). I think it's every 64th kernel that hangs.
This is odd because each time through the kernel performs the exact same operations with the same local and global sizes. I'm even re-using the same buffers.
Is there something about the execution model that I'm missing (perhaps other programs or the OS accessing the GPU, or the timing frequency of the GPU memory)? I'm testing this on an ATI HD5650 card under Windows 7 (64-bit), with AMD App SDK 2.5 and in-order queue execution.
As a side note, if I don't have any global memory accesses in my kernel (a rather impractical scenario), the profiler puts a gap between the quickly executing kernels, and where the slowly executing kernels used to be there is now a large empty gap where none of my kernels are being executed.
As a follow-up question, is there anything that can be done to fix this?
It's probable you're seeing the effects of your GPU's maximum number of concurrent tasks. Each task enqueued is assigned to one or more multiprocessors, which are frequently capable of running hundreds of work-items at a time - of the same kernel, enqueued in the same call. Perhaps what you're seeing is the OpenCL runtime waiting for one of the multiprocessors to free up. This relates most directly to the occupancy issue: if the work size can't keep a multiprocessor busy, through memory latencies and all, it has idle cycles. The limit here depends on how many registers (local or private memory) your kernel requires. In summary, you want to write your kernel to operate on multiple pieces of data rather than enqueueing it many times, as sketched below.
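As an illustrative host-side sketch (hypothetical sizes, assuming queue and kernel already exist and the kernel indexes its data with get_global_id(0)), folding the batch into the global work size looks like this:

/* Before: many small launches that may leave multiprocessors idle. */
size_t small = 4096;
for (int i = 0; i < 64; i++)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &small, NULL, 0, NULL, NULL);

/* After: one launch covering the same total amount of work. */
size_t large = 64 * 4096;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &large, NULL, 0, NULL, NULL);
clFinish(queue);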
Did your measurement include reading back results from the apparently fast executions?

Sharing the GPU between OpenCL capable programs

Is there a method to share the GPU between two separate OpenCL capable programs, or more specifically between two separate processes that simultaneously both require the GPU to execute OpenCL kernels? If so, how is this done?
It depends what you call sharing.
In general, you can create 2 processes that both create an OpenCL device, on the same GPU. It's then the driver/OS/GPU's responsibility to make sure things just work.
That said, most implementations will time-slice the GPU execution to make that happen (just like it happens for graphics).
I sense this is not exactly what you're after though. Can you expand your question with a use case ?
Current GPUs (except Nvidia's Fermi) do not support simultaneous execution of more than one kernel. Moreover, to date GPUs do not support preemptive multitasking; it's completely cooperative! A kernel's execution cannot be suspended and continued later on. So the granularity of any time-based GPU sharing depends on the kernels' execution times.
If you have multiple programs running that require GPU access, you should therefore make sure that your kernels have short runtimes (< 100 ms is a rule of thumb), so that GPU time can be time-sliced among the kernels that want GPU cycles. This is also important because otherwise the host system's graphics will become very unresponsive, as they need GPU access too. It can go as far as a kernel in an endless or long loop apparently crashing the system. One way to keep launches short is to split a large NDRange into chunks, as sketched below.
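A hedged sketch of that chunking, assuming OpenCL 1.1 or newer (a non-NULL global work offset is not allowed in 1.0) and existing queue and kernel objects:

size_t total = 1 << 24;
size_t chunk = 1 << 18;  /* tune so each launch stays well under ~100 ms */
for (size_t off = 0; off < total; off += chunk) {
    clEnqueueNDRangeKernel(queue, kernel, 1, &off, &chunk, NULL, 0, NULL, NULL);
    clFinish(queue);     /* yield the GPU between chunks */
}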
