Call OpenCL CPU Kernel via function pointer - opencl

I want to use OpenCL as a simple C runtime JIT on the CPU. Because the kernels are ASCII, i can modify them at runtime, and compile/execute the code. This part is straightforward enough.
However, I'd like to have function pointer access to the resulting compiled kernel, so that it can be called conventionally from C code, rather then having to access the kernel through openCL API.
Obviously this only works on the CPU where the memory is shared.
It seems this should be possible, any thoughts?

No, it can't be done. You need to use clEnqueueTask. If you were somehow able to get the address of the CPU kernel and reverse engineer the parameters passed, it would be subject to change with a driver update.
If you need runtime compilation look at linking to LLVM or similar.

Related

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU the instructions to run. e.g. I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level what instructions are called in the host process to tell the GPU "here's what you're going to run." What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in detail explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
The buffer is stored entirely in System RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
The enqueue the kernel means ask an OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which will add the dispatch compute workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger HW to process commands recorded into a command queue (which holds all actual commands in the format which particular HW understands). HW might have several queues and process them in parallel. Anyway after the workload from a queue is processed, HW will inform the KMD driver via an interrupt, and KMD is responsible to propagate this update to OpenCL driver via OpenCL supported event mechanism, which allows user to track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get better idea how the OpenCL driver interacts with a HW you could take a look into the opensource implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c

What is the difference between kernel and program object?

I've been through several resources: the OpenCL Khronos book, GATech tutorial, NYU tutorial, and I could go through more. But I still don't understand fully. What is the difference between a kernel and a program object?
So far the best explanation is this for me, but this is not enough for me to fully understand:
PROGRAM OBJECT: A program object encapsulates some source code (with potentially several kernel functions) and its last successful build.
KERNEL: A kernel object encapsulates the values of the kernel’s
arguments used when the kernel is executed.
Maybe a program object is the code? And the kernel is the compiled executable? Is that it? Because I could understand something like that.
Thanks in advance!
A program is a collection of one or more kernels plus optionally supporting functions. A program could be created from source or from several types of binaries (e.g. SPIR, SPIR-V, native). Some program objects (created from source or from intermediate binaries) need to be built for one or more devices (with clBuildProgram or clCompileProgram and clLinkProgram) prior to selecting kernels from them. The easiest way to think about programs is that they are like DLLs and export kernels for use by the programmer.
Kernel is an executable entity (not necessarily compiled, since you can have built-in kernels that represent piece of hardware (e.g. Video Motion Estimation kernels on Intel hardware)), you can bind its arguments and submit them to various queues for execution.
For an OpenCL context, we can create multiple Program objects. First, I will describe the uses of program objects in the OpenCL application.
To facilitate the compilation of the kernels for the devices to which the program is
attached
To provide facilities for determining build errors and querying the program for information
An OpenCL application uses kernel objects to execute a function parallelly on the device. Kernel objects are created from program objects. A program object can have multiple kernel objects.
As we know, to execute kernel we need to pass arguments to it. The primary purpose of kernel objects are this.
To get more clear about it here is an analogy which is given in the book "OpenCL Programming Guide" by Aaftab Munshi et al
An analogy that may be helpful in understanding the distinction between kernel objects and program objects is that the program object is like a dynamic library in that it holds a collection of kernel functions. The kernel object is like a handle to a function within the dynamic library. The program object is created from either source code (OpenCL C) or a compiled program binary (more on this later). The program gets built for any of the devices to which the program object is attached. The kernel object is then used to access properties of the compiled kernel function, enqueue calls to it, and set its arguments.

Using of the same GPU memeoy object

Suppose you create two threads and making both of them entering a loop there both of them start the same kernel which uses same opencl memory object (Buffer in cl.hpp in my case). Will it work properly? Do opencl allow to run in the same time different kernels with the same memory object?
(I am using opencl C++ wrapper cl.hpp and beignet Intel open source library.)
If both threads are using the same in-order command queue, it will work just fine; it just becomes a race as to which thread enqueues their work first. From the OpenCL runtime point of view, it's just commands in a queue.
OpenCL 1.1 (and newer) is threadsafe except for clSetKernelArg and clEnqueueNDRangeKernel for a given kernel; you'll need to lock around that.
If however your threads are using two different command queues then you shouldn't be using the same memory object without then using OpenCL Event objects to synchronize. Unless it is read-only; that should be fine.
Read operation on same OpenCL memory objects, by concurrent kernels, wouldn't cause any functionality issue. In case of write operation, it sure will cause functionality issues.
What is the objective of running multiple kernels concurrently? Please check this answer to similar question.

accessing file system using cpu device in opencl

I am a newbie to opencl. I have a doubt about opencl functioning when kernel is running on a cpu device.Suppose we have a kernel running on a cpu device, can it read from a file on the disk. If yes,then how ? If no , then why not ?
Can you please suggest a source for detailed information ??
thanks in advance.
It can't. Simply because not every OpenCL device has a file system, or a disk respectively.
You can't. OpenCL is trying to unite access to computing power and file system is depending on OS. If you want this feature, there are threads (C++11 thread, pthread,...) or OpenMP should be able to handle this, because it's CPU-only thing.
It doesn't make sense to allow device kernels to access the filesystem, because most of the semantics of filesystem access are essentially incompatible with the massively parallel nature of device kernels.
There are two ways to work around this, considering you're only asking about CPU.
if you intend to use OpenCL as a way to do multithreading on CPU, consider using what OpenCL calls “native kernels”, which are essentially just plain C functions, called within an OpenCL context;
a more general approach that might work on GPU too would be to mmap the files you want to operate on, and pass the resulting pointers to clCreateBuffer with CL_USE_HOST_PTR flags.

Memory Object Assignation to Context Mechanism In OpenCL

I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copies the data to all of the devices which are associated with the context?
I'd be thankful if you help me understand this issue :-)
Generally and typically the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel Ingrated Graphics GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.

Resources