Repeated calling of enqueueNDRangeKernel in OpenCL - opencl

What other OpenCL functions should be called when enqueueNDRangeKernel is called repeatedly?
I have not been able to find a tutorial that shows the use of enqueueNDRangeKernel in this fashion and my coding attempts have unfortunately resulted in an unhandled exception error. A similar question has been asked before but the responses don't seem to apply to my situation.
I currently have a loop in which I call the OpenCL functions in the following sequence:
setArg
enqueueNDRangeKernel
enqueueMapBuffer
enqueueUnmapMemObject
I am calling setArg because the input to the kernel changes before each call to enqueueNDRangeKernel. I am calling enqueueMapBuffer and enqueueUnmapMemObject since the output from the kernel is used in the host code. The kernel runs ok the first time (the output is correct) but during the second pass through the loop I get an unhandled exception error when calling enqueueMapBuffer.
I am using the following set-up:
Intel OpenCL SDK with CL_DEVICE_TYPE_CPU (on an Intel i7 CPU)
Visual Studio 2010 IDE on Windows 7
Host Code is written in C++ with the OpenCL C++ bindings.
Thanks.

Problem Solved ... It turns out that I was using the correct sequence of OpenCL function calls. There was a problem in my kernel that only showed up after the first iteration of the loop.

I am trying to do the same thing as you, but I am stuck at one point. I managed to create an OpenCL program and kernel, both working, but when I try to loop it several times, it works only when I loop the whole code, from creating and assigning the device to deallocating all mem_...

Related

What is the difference between kernel and program object?

I've been through several resources: the OpenCL Khronos book, GATech tutorial, NYU tutorial, and I could go through more. But I still don't understand fully. What is the difference between a kernel and a program object?
So far the best explanation is this for me, but this is not enough for me to fully understand:
PROGRAM OBJECT: A program object encapsulates some source code (with potentially several kernel functions) and its last successful build.
KERNEL: A kernel object encapsulates the values of the kernel’s arguments used when the kernel is executed.
Maybe a program object is the code? And the kernel is the compiled executable? Is that it? Because I could understand something like that.
Thanks in advance!
A program is a collection of one or more kernels plus optionally supporting functions. A program could be created from source or from several types of binaries (e.g. SPIR, SPIR-V, native). Some program objects (created from source or from intermediate binaries) need to be built for one or more devices (with clBuildProgram or clCompileProgram and clLinkProgram) prior to selecting kernels from them. The easiest way to think about programs is that they are like DLLs and export kernels for use by the programmer.
A kernel is an executable entity (not necessarily compiled, since there are built-in kernels that represent a piece of hardware, e.g. Video Motion Estimation kernels on Intel hardware); you can bind its arguments and submit it to various queues for execution.
For an OpenCL context, we can create multiple Program objects. First, I will describe the uses of program objects in the OpenCL application.
To facilitate the compilation of the kernels for the devices to which the program is attached
To provide facilities for determining build errors and querying the program for information
An OpenCL application uses kernel objects to execute a function in parallel on the device. Kernel objects are created from program objects. A program object can have multiple kernel objects.
As we know, to execute a kernel we need to pass arguments to it. Holding those arguments is the primary purpose of kernel objects.
To make this clearer, here is an analogy from the book "OpenCL Programming Guide" by Aaftab Munshi et al.:
An analogy that may be helpful in understanding the distinction between kernel objects and program objects is that the program object is like a dynamic library in that it holds a collection of kernel functions. The kernel object is like a handle to a function within the dynamic library. The program object is created from either source code (OpenCL C) or a compiled program binary (more on this later). The program gets built for any of the devices to which the program object is attached. The kernel object is then used to access properties of the compiled kernel function, enqueue calls to it, and set its arguments.
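The dynamic-library analogy above can be sketched in plain C++. This is only a toy model: ToyProgram, ToyKernel and make_toy_program are hypothetical names invented for illustration and are not part of the OpenCL API. The "program" is a named collection of functions, and the "kernel" is a handle to one of them plus the argument values bound to it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT OpenCL types.
using KernelFn = std::function<int(const std::vector<int>&)>;

// "Program": like a dynamic library, a named collection of built functions.
struct ToyProgram {
    std::map<std::string, KernelFn> exported;
};

// "Kernel": a handle to one exported function plus its bound arguments.
class ToyKernel {
public:
    ToyKernel(const ToyProgram& program, const std::string& name) {
        auto it = program.exported.find(name);
        if (it == program.exported.end())
            throw std::invalid_argument("no such kernel: " + name);
        fn_ = it->second;
    }

    // Plays the role of setting a kernel argument by index.
    void set_arg(std::size_t index, int value) {
        if (index >= args_.size()) args_.resize(index + 1, 0);
        args_[index] = value;
    }

    // Plays the role of enqueueing the kernel: runs it with the
    // arguments that are bound at the moment of the call.
    int run() const {
        assert(fn_ && "kernel handle must refer to a function");
        return fn_(args_);
    }

private:
    KernelFn fn_;
    std::vector<int> args_;
};

// Builds a toy program that exports two two-argument functions.
inline ToyProgram make_toy_program() {
    ToyProgram p;
    p.exported["add"] = [](const std::vector<int>& a) { return a.at(0) + a.at(1); };
    p.exported["mul"] = [](const std::vector<int>& a) { return a.at(0) * a.at(1); };
    return p;
}
```

In this model, two ToyKernel objects created from the same ToyProgram keep their arguments separately, and changing one argument of a kernel leaves its other bound arguments in place for the next run, which mirrors how arguments stay attached to a kernel object between enqueues.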

Using the same GPU memory object

Suppose you create two threads and have both of them enter a loop where each starts the same kernel, which uses the same OpenCL memory object (Buffer in cl.hpp in my case). Will it work properly? Does OpenCL allow different kernels to run at the same time with the same memory object?
(I am using opencl C++ wrapper cl.hpp and beignet Intel open source library.)
If both threads are using the same in-order command queue, it will work just fine; it just becomes a race as to which thread enqueues their work first. From the OpenCL runtime point of view, it's just commands in a queue.
OpenCL 1.1 (and newer) is threadsafe except for clSetKernelArg and clEnqueueNDRangeKernel for a given kernel; you'll need to lock around that.
If however your threads are using two different command queues then you shouldn't be using the same memory object without then using OpenCL Event objects to synchronize. Unless it is read-only; that should be fine.
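The "lock around that" advice can be illustrated without any OpenCL dependency. The sketch below is a toy model under stated assumptions: SharedKernel, RecordingQueue, set_arg_and_enqueue and run_two_threads are hypothetical names, and the two assignments inside the lock only stand in for clSetKernelArg and clEnqueueNDRangeKernel. The point it shows is that setting the argument and enqueueing must form one critical section, otherwise another thread could overwrite the argument in between.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for a kernel object shared between threads (NOT an OpenCL type).
// Like a real kernel object, it remembers the most recently set argument.
struct SharedKernel {
    int arg = 0;
};

// Toy stand-in for an in-order command queue: records (thread id, argument).
struct RecordingQueue {
    std::vector<std::pair<int, int>> commands;
};

// Sets the argument and enqueues as one critical section, so another
// thread cannot change the argument between the two steps.
inline void set_arg_and_enqueue(std::mutex& m, SharedKernel& k, RecordingQueue& q,
                                int thread_id, int value) {
    std::lock_guard<std::mutex> lock(m);
    k.arg = value;                              // stands in for clSetKernelArg
    q.commands.emplace_back(thread_id, k.arg);  // stands in for clEnqueueNDRangeKernel
}

// Runs two threads that each enqueue `iterations` commands; the argument
// encodes the owning thread id as value / 1000.
inline RecordingQueue run_two_threads(int iterations) {
    assert(iterations < 1000 && "argument encoding assumes fewer than 1000 iterations");
    std::mutex m;
    SharedKernel kernel;
    RecordingQueue queue;
    auto worker = [&](int id) {
        for (int i = 0; i < iterations; ++i)
            set_arg_and_enqueue(m, kernel, queue, id, id * 1000 + i);
    };
    std::thread a(worker, 1);
    std::thread b(worker, 2);
    a.join();
    b.join();
    return queue;
}
```

The order in which the two threads' commands land in the queue is still a race, as the answer says, but every recorded command carries the argument that its own thread set.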
Read operations on the same OpenCL memory object by concurrent kernels won't cause any functional issues. Write operations, on the other hand, certainly will.
What is the objective of running multiple kernels concurrently? Please check this answer to a similar question.

How to launch multiple kernels in OpenCL, inside the program?

I'm trying to evaluate the performance of the OpenCL programming model on GPUs. As part of the test I launch the kernel with clEnqueueNDRangeKernel(), and I call this function multiple times so that I can see how it performs when two or four concurrent kernels are launched.
I observe that the program takes the same amount of time as when launching one kernel, so I'm assuming that it is running the kernel only once, because there is no way it takes the same amount of time to run two or four concurrent kernels.
Now I want to know how to launch multiple kernels on one GPU.
E.g. I want to launch something like:
clEnqueueNDRangeKernel()
clEnqueueNDRangeKernel()
How can I do this?
First of all, check if your Device supports concurrent kernel execution. Latest AMD & Nvidia cards do.
Then, create multiple command queues. If you enqueue kernels into the same in-order queue (the default), they will be executed consecutively, one after another.
Finally, check that kernels were indeed executed in parallel. Use profilers from SDK or OpenCL events to gather profiling info.
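One way to do the final check with OpenCL events is to compare the start and end timestamps of the two kernel commands. A minimal sketch follows, assuming the timestamps were already read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, in nanoseconds) from queues on the same device that were created with CL_QUEUE_PROFILING_ENABLE. CommandTiming and ran_concurrently are hypothetical helper names, not part of OpenCL.

```cpp
#include <cassert>
#include <cstdint>

// Start/end timestamps of one enqueued command, in nanoseconds.
// Hypothetical helper struct; the values are assumed to come from
// clGetEventProfilingInfo on the command's event.
struct CommandTiming {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Returns true if the half-open execution intervals [start, end)
// of the two commands overlap, i.e. both were running at some instant.
inline bool ran_concurrently(const CommandTiming& a, const CommandTiming& b) {
    assert(a.start_ns <= a.end_ns && b.start_ns <= b.end_ns);
    return a.start_ns < b.end_ns && b.start_ns < a.end_ns;
}
```

If the second kernel's start timestamp is never earlier than the first kernel's end timestamp, the device (or the driver) serialized them, regardless of how many queues were used.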

Call OpenCL CPU Kernel via function pointer

I want to use OpenCL as a simple C runtime JIT on the CPU. Because the kernels are ASCII, I can modify them at runtime, and compile/execute the code. This part is straightforward enough.
However, I'd like to have function pointer access to the resulting compiled kernel, so that it can be called conventionally from C code, rather than having to access the kernel through the OpenCL API.
Obviously this only works on the CPU where the memory is shared.
It seems this should be possible, any thoughts?
No, it can't be done. You need to use clEnqueueTask. If you were somehow able to get the address of the CPU kernel and reverse engineer the parameters passed, it would be subject to change with a driver update.
If you need runtime compilation look at linking to LLVM or similar.

which signal does gdb send when attaching to a process?

Which signal does gdb send when attaching to a process? Does this work the same way on different UNIXes, e.g. Linux and Mac OS X?
So far I have only found out that SIGTRAP is used to implement breakpoints. Is it used for attaching as well?
AFAIK gdb does not need to send any signal itself in order to attach. It just suspends the "inferior" by calling ptrace; on Linux, the PTRACE_ATTACH request makes the kernel deliver a SIGSTOP to the target process. gdb also reads the debugged process's memory and registers using these calls, and it can request instruction single-stepping (provided it's implemented on that port of Linux), etc.
Software breakpoints are implemented by placing, at the right location, an instruction that triggers a "trap" or something similar when reached, but the debugged process can run at full speed until then.
Also (in addition to reading man ptrace, as already mentioned) see the ptrace explanation on Wikipedia.
