What is the difference between a kernel and a program object? - opencl

I've been through several resources: the OpenCL Khronos book, GATech tutorial, NYU tutorial, and I could go through more. But I still don't understand fully. What is the difference between a kernel and a program object?
So far this is the best explanation I've found, but it's not enough for me to fully understand:
PROGRAM OBJECT: A program object encapsulates some source code (with potentially several kernel functions) and its last successful build.
KERNEL: A kernel object encapsulates the values of the kernel's arguments used when the kernel is executed.
Maybe a program object is the code? And the kernel is the compiled executable? Is that it? Because I could understand something like that.
Thanks in advance!

A program is a collection of one or more kernels plus, optionally, supporting functions. A program can be created from source or from several types of binaries (e.g. SPIR, SPIR-V, native). Some program objects (those created from source or from intermediate binaries) need to be built for one or more devices (with clBuildProgram, or clCompileProgram and clLinkProgram) before kernels can be selected from them. The easiest way to think about programs is that they are like DLLs: they export kernels for use by the programmer.
A kernel is an executable entity. It is not necessarily compiled code, since you can have built-in kernels that represent a piece of hardware (e.g. Video Motion Estimation kernels on Intel hardware). You can bind its arguments and submit it to various queues for execution.
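A minimal sketch of that lifecycle in host code (error checking omitted; the context, device, queue, buffers, and the kernel name "my_add" are illustrative assumptions, not taken from the question):

    /* Sketch: program vs. kernel lifecycle. Error checks omitted.
       ctx, dev, queue, the buffers, and the kernel name "my_add" are
       illustrative assumptions. */
    #include <CL/cl.h>

    void run(cl_context ctx, cl_device_id dev, cl_command_queue queue,
             const char *src, cl_mem a, cl_mem b, cl_mem out, size_t n)
    {
        /* Program object: holds the source and, once built, the binaries. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* build for dev */

        /* Kernel object: a handle to one entry point the program exports. */
        cl_kernel k = clCreateKernel(prog, "my_add", NULL);

        /* The kernel object carries its argument bindings... */
        clSetKernelArg(k, 0, sizeof(cl_mem), &a);
        clSetKernelArg(k, 1, sizeof(cl_mem), &b);
        clSetKernelArg(k, 2, sizeof(cl_mem), &out);

        /* ...and is what you actually submit for execution. */
        clEnqueueNDRangeKernel(queue, k, 1, NULL, &n, NULL, 0, NULL, NULL);

        clReleaseKernel(k);
        clReleaseProgram(prog);
    }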

For an OpenCL context, we can create multiple program objects. First, I will describe the uses of program objects in an OpenCL application:
- To facilitate the compilation of the kernels for the devices to which the program is attached
- To provide facilities for determining build errors and querying the program for information (see the sketch below)
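For the second point, a common pattern is to fetch the build log after a failed clBuildProgram. A minimal sketch (error checks trimmed; prog and dev are assumed to exist):

    /* Sketch: querying the build log with clGetProgramBuildInfo. */
    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    void print_build_log(cl_program prog, cl_device_id dev)
    {
        size_t log_size = 0;
        /* First call: ask how big the log is. */
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = malloc(log_size);
        /* Second call: fetch the log itself. */
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        fprintf(stderr, "build log:\n%s\n", log);
        free(log);
    }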
An OpenCL application uses kernel objects to execute a function in parallel on the device. Kernel objects are created from program objects, and a program object can have multiple kernel objects.
As we know, to execute a kernel we need to pass arguments to it; holding those argument bindings is the primary purpose of kernel objects.
To make this clearer, here is an analogy given in the book "OpenCL Programming Guide" by Aaftab Munshi et al.:
An analogy that may be helpful in understanding the distinction between kernel objects and program objects is that the program object is like a dynamic library in that it holds a collection of kernel functions. The kernel object is like a handle to a function within the dynamic library. The program object is created from either source code (OpenCL C) or a compiled program binary (more on this later). The program gets built for any of the devices to which the program object is attached. The kernel object is then used to access properties of the compiled kernel function, enqueue calls to it, and set its arguments.

Related

Using the same GPU memory object

Suppose you create two threads and have both of them enter a loop in which they launch the same kernel, which uses the same OpenCL memory object (a Buffer from cl.hpp in my case). Will it work properly? Does OpenCL allow different kernels to run at the same time with the same memory object?
(I am using the OpenCL C++ wrapper cl.hpp and Beignet, Intel's open source library.)
If both threads are using the same in-order command queue, it will work just fine; it just becomes a race as to which thread enqueues its work first. From the OpenCL runtime's point of view, they're just commands in a queue.
OpenCL 1.1 (and newer) is thread-safe, except for clSetKernelArg and clEnqueueNDRangeKernel on a given kernel; you'll need to lock around those.
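A sketch of that locking with POSIX threads (the names k_mutex, queue, k, buf, and n are illustrative):

    /* Sketch: serialize clSetKernelArg + clEnqueueNDRangeKernel when two
       threads share one cl_kernel. All names here are illustrative. */
    #include <CL/cl.h>
    #include <pthread.h>

    pthread_mutex_t k_mutex = PTHREAD_MUTEX_INITIALIZER;

    void launch(cl_command_queue queue, cl_kernel k, cl_mem buf, size_t n)
    {
        pthread_mutex_lock(&k_mutex);
        /* Argument bindings live in the kernel object, so setting args and
           enqueueing must be atomic with respect to the other thread. */
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        pthread_mutex_unlock(&k_mutex);
    }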
If, however, your threads use two different command queues, then you shouldn't use the same memory object without OpenCL event objects to synchronize access - unless the object is read-only; that should be fine.
Concurrent kernels reading the same OpenCL memory object won't cause any functional issues. Concurrent writes, however, certainly will.
What is the objective of running multiple kernels concurrently? Please check this answer to a similar question.

Call OpenCL CPU Kernel via function pointer

I want to use OpenCL as a simple C runtime JIT on the CPU. Because the kernels are ASCII, I can modify them at runtime and compile/execute the code. This part is straightforward enough.
However, I'd like to have function-pointer access to the resulting compiled kernel, so that it can be called conventionally from C code rather than through the OpenCL API.
Obviously this only works on the CPU where the memory is shared.
It seems this should be possible, any thoughts?
No, it can't be done. You need to use clEnqueueTask. Even if you were somehow able to get the address of the CPU kernel and reverse engineer the parameter-passing convention, it would be subject to change with a driver update.
If you need runtime compilation, look at linking against LLVM or similar.
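For completeness, a minimal clEnqueueTask call looks like this (a sketch; clEnqueueTask runs the kernel as a single work-item and was later deprecated in favor of clEnqueueNDRangeKernel):

    /* Sketch: run a kernel as a single work-item and wait for it. */
    #include <CL/cl.h>

    cl_int run_task(cl_command_queue queue, cl_kernel k)
    {
        cl_event done;
        cl_int err = clEnqueueTask(queue, k, 0, NULL, &done);
        if (err != CL_SUCCESS)
            return err;
        err = clWaitForEvents(1, &done);  /* block until the kernel finishes */
        clReleaseEvent(done);
        return err;
    }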

OpenCL: Sending same cl_mem to multiple devices

I am writing a multi-GPU parallel algorithm. One of the issues I am facing is finding out what would happen if I push one cl_mem to multiple devices and let them run the same kernel at the same time. The kernel will make changes to the memory passed to the device.
It is very time consuming to code and debug OpenCL code. So before I start, I want to take some advice from fellow Stack Overflow users - I want to know the consequences of doing such a thing in both of the scenarios below (e.g. will any exception be raised during execution? Is the data synchronized? When CL_MEM_COPY_HOST_PTR is used, does the same region of memory pointed to by this cl_mem get properly copied to the device? etc.):
The memory is created with CL_MEM_COPY_HOST_PTR
The memory is created with CL_MEM_USE_HOST_PTR
I don't see anything explicit in the OpenCL specifications that guarantees that data will be synchronised across devices. I don't see how the OpenCL implementation would know how to distribute a buffer across multiple devices and how to aggregate those buffers again later.
The approach I've adopted is to create a separate context and separate read, write, and kernel-execution queues for each device. I then create separate buffers on each device and enqueue writes/reads to move data to/from the devices. Hence I explicitly handle all of that myself.
I'd like a better solution, but at least the above method works and doesn't rely on anything that is implementation specific.
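A sketch of that per-device approach (error checks and cleanup omitted; devs, host_chunks, and bytes are illustrative):

    /* Sketch: one context/queue/buffer per device; the host moves all
       data explicitly, so nothing relies on cross-device sharing. */
    #include <CL/cl.h>

    void scatter(cl_device_id *devs, int ndev,
                 void **host_chunks, size_t bytes)
    {
        for (int i = 0; i < ndev; i++) {
            cl_context ctx = clCreateContext(NULL, 1, &devs[i],
                                             NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueue(ctx, devs[i], 0, NULL);
            cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                        bytes, NULL, NULL);

            /* Explicit copy of this device's chunk; nothing is shared. */
            clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes,
                                 host_chunks[i], 0, NULL, NULL);
            /* ... enqueue kernels on q, then clEnqueueReadBuffer the
               results back, and release buf/q/ctx when done ... */
        }
    }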
Appendix A of the OpenCL Specification explains the required synchronization for objects shared between different command queues.
Basically it says you should use OpenCL events and clFlush to synchronize execution between the command queues. The OpenCL implementation will synchronize the contents of the memory objects between the different devices of the OpenCL context. USE/COPY _HOST_PTR makes no difference to correctness, but CL_MEM_USE_HOST_PTR avoids a couple of extra copies of the data in host memory. Use clEnqueueMapBuffer to synchronize the data with the host at the end.
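A sketch of that event-based ordering with two in-order queues in one shared context (error checks omitted; all names illustrative):

    /* Sketch: kernel k1 on device 1 waits for kernel k0 on device 0;
       the runtime migrates shared buffer contents between the devices. */
    #include <CL/cl.h>

    void chain(cl_command_queue q_dev0, cl_command_queue q_dev1,
               cl_kernel k0, cl_kernel k1, size_t n)
    {
        cl_event k0_done;

        clEnqueueNDRangeKernel(q_dev0, k0, 1, NULL, &n, NULL,
                               0, NULL, &k0_done);
        clFlush(q_dev0);  /* make sure the command is actually submitted */

        clEnqueueNDRangeKernel(q_dev1, k1, 1, NULL, &n, NULL,
                               1, &k0_done, NULL);
        clFlush(q_dev1);

        clReleaseEvent(k0_done);
    }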

Can I use external OpenCL libraries?

I want to use an external library (http://trac.osgeo.org/geos/) to perform some analytical tasks on geometry objects (GIS). I want to perform these tasks using OpenCL on CUDA so that I can use the parallel power of the GPU to run them on large data sets. So my questions are:
Can I write kernels using these libraries?
Also, how can I pass objects of these libraries' complex data structures as arguments to the kernel? (Specifically, how can I create a buffer of these complex objects?)
An OpenCL program mostly consists of two parts
Host code - This is regular C/C++ code that calls functions in the OpenCL runtime and works just like any other code. This code needs to interface with any third-party libraries that may provide your program with (complex) data. It will also need to translate these complex data types into the set of simple data types (scalar, vector, other) that can be processed by part 2.
Kernel code - For this, the OpenCL platform provides a compiler that converts a text/binary representation of a restricted kernel language (based on C99) into object code that can run on the target device. This language and compiler have many restrictions, including the fact that you cannot include or link in external libraries (it may be possible with a native kernel that runs on the host CPU).
It is up to your host code to compile/set up the kernel, fetch/set up the data from any library/source, translate it into the appropriate scalar, vector, or other data types permissible in an OpenCL kernel, run the kernel(s) that process the data, get the results back from the compute device to the host (if necessary), and then translate those simple data types back into whatever form the rest of the code requires.
So no - you cannot directly use a regular C++ library from inside the kernel. But you can do whatever you want in the host code.
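A sketch of that translation step (the point_t type and field names are hypothetical illustrations, not the GEOS API):

    /* Sketch: flatten a hypothetical host-side geometry type into a plain
       double array a kernel can consume. point_t is illustrative only. */
    #include <CL/cl.h>
    #include <stdlib.h>

    typedef struct { double x, y; } point_t;  /* hypothetical host type */

    cl_mem upload_points(cl_context ctx, const point_t *pts, size_t n)
    {
        /* Repack into a flat array: x0, y0, x1, y1, ... */
        double *flat = malloc(2 * n * sizeof(double));
        for (size_t i = 0; i < n; i++) {
            flat[2 * i]     = pts[i].x;
            flat[2 * i + 1] = pts[i].y;
        }
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    2 * n * sizeof(double), flat, NULL);
        free(flat);  /* COPY_HOST_PTR copied the data at creation time */
        return buf;
    }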
No, you can't use external libraries in OpenCL kernels. Remember, every kernel has to be compiled when the OpenCL application runs, because it can't know beforehand what platform it will be running on.

POSIX Threads: are pthread_cond_wait() and others system calls?

The POSIX standard defines several routines for thread synchronization, based on concepts like mutexes and condition variables.
My question is: are these (e.g. pthread_cond_init(), pthread_mutex_init(), pthread_mutex_lock(), and so on) system calls or just library calls? I know they are included via "pthread.h", but do they ultimately result in a system call and are therefore implemented in the kernel of the operating system?
On Linux a pthread mutex makes a "futex" system call, but only if the lock is contended. That means that taking a lock no other thread wants is almost free.
In a similar way, sending a condition signal is only expensive when there is someone waiting for it.
So I believe that your answer is that pthread functions are library calls that sometimes result in a system call.
Whenever possible, the library avoids trapping into the kernel for performance reasons. If you already have some code that uses these calls, you may want to look at the output from running your program under strace to better understand how often it actually makes system calls.
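A tiny demo you can run under strace to see this for yourself (compile with gcc demo.c -lpthread; the sleeping thread forces contention, so the second lock shows up as a futex() call):

    /* Demo: an uncontended lock/unlock usually makes no system call;
       the contended lock in main() traps into the kernel via futex(). */
    #include <pthread.h>
    #include <unistd.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *holder(void *arg)
    {
        pthread_mutex_lock(&m);
        sleep(1);                 /* hold the lock so main() must wait */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, holder, NULL);
        usleep(100000);           /* let the holder grab the lock first */
        pthread_mutex_lock(&m);   /* contended: results in futex() */
        pthread_mutex_unlock(&m);
        pthread_join(t, NULL);
        return 0;
    }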
I never looked into all those library calls, but as far as I understand they all involve kernel operations, since they are supposed to provide synchronization between processes and/or threads globally - I mean at the OS level.
For a mutex, for instance, the kernel needs to maintain a list of threads that are currently sleeping, waiting for a locked mutex to be released. When the thread that currently owns the mutex calls pthread_mutex_unlock(), the kernel system call browses that list for the highest-priority thread waiting on the mutex, records the new owner in the mutex's kernel structure, and then hands the CPU (a context switch) to the newly owning thread, whose pthread_mutex_lock() library call then returns.
I especially see cooperation with the kernel when IPC between processes is involved (I am not talking about threads within a single process). Therefore I expect those library calls to invoke the kernel.
When you compile a program on Linux that uses pthreads, you have to add -lpthread to the compiler options. By doing this, you tell the linker to link libpthread. So, on Linux, they are calls to a library.
