OpenCL creates new threads upon first function call, why?

When I call an OpenCL function that I wouldn't expect to create new threads (in this case, simply getting the platform IDs), my program creates 8 new threads.
cl_platform_id platforms[10] = {0};
cl_uint numberofplatforms = 0;
clGetPlatformIDs(10, platforms, &numberofplatforms);//this creates 8 threads
I haven't created a context; I'm simply asking for platform IDs to see what is available, so why does this function create all these threads? I'm using Windows 7 64-bit on an i7 920 with HT (I suspect it is creating 8 threads because I have 8 logical cores), with both the Intel and Nvidia SDKs installed (I have a GTS 250 and a GTX 560), and I'm linking against the Nvidia OpenCL libraries and using its headers.
This isn't a big concern, but what if I decide not to use OpenCL after analyzing the devices, only to have 8 useless threads lying around? Does anyone know why this happens?

A lot of the OpenCL functions are non-blocking, meaning that they issue commands to the device in the form of a queue, and I'm pretty sure the threads are used to control the device while the host program continues to run the rest of the code.
To illustrate: when you call clEnqueueNDRangeKernel, the kernel is not necessarily run at once; the program continues to run the code after the clEnqueueNDRangeKernel call. So I guess this function passes some information to separate threads that control the compute device and make sure that the kernel eventually runs.
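To make that concrete, here is a minimal sketch in plain C (not code from the question; `queue`, `kernel` and `global_size` are assumed to have been created already):

#include <CL/cl.h>

void run_async(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
    /* Returns almost immediately; the kernel has only been queued. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    /* The host is free to do unrelated work here while the runtime's
     * worker threads drive the device... */

    /* ...and only blocks when it actually needs the queue drained. */
    clFinish(queue);
}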

Related

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You create the cl_buffers for the arguments and then bind the two together with clSetKernelArg.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU which instructions to run? For example, I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are called in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to an in-depth explanation of all of this, I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
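In terms of the OpenCL host API, that sequence corresponds roughly to the calls below. This is a simplified sketch, not a definitive implementation: error checking is omitted, and `context`, `device`, `queue`, `buf` and the kernel name "my_kernel" are illustrative placeholders.

#include <CL/cl.h>

void submit_kernel(cl_context context, cl_device_id device,
                   cl_command_queue queue, const char *source,
                   cl_mem buf, size_t global_size)
{
    /* The user-space part of the driver compiles OpenCL C into a
     * GPU-model-specific binary. */
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    /* The kernel object references that binary; the driver uploads it
     * to the device when needed. */
    cl_kernel kernel = clCreateKernel(program, "my_kernel", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* This adds a "run that binary" request to the device's command
     * queue; the kernel-mode part of the driver submits it to the HW. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
    clFinish(queue);
}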
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
1. A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
2. The buffer is stored entirely in system RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
3. Copies of the buffer are stored both in VRAM and in system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has modified it, DMA is used to copy the newer version across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
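Which of these strategies a given driver picks is ultimately its own decision, but the flags passed to clCreateBuffer let the application hint at it. A rough, implementation-dependent sketch (the mapping to methods 1-3 is approximate, not something the spec guarantees):

#include <CL/cl.h>

void buffer_hints(cl_context ctx, size_t bytes, void *host_data)
{
    /* On a discrete GPU this typically ends up in VRAM (closest to method 1). */
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    /* Asks the driver to use the application's own host allocation, so the
     * GPU may reach it via DMA (closest to method 2), though drivers are
     * free to keep a shadow copy in VRAM as in method 3. */
    cl_mem host_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                     bytes, host_data, NULL);

    clReleaseMemObject(dev_buf);
    clReleaseMemObject(host_buf);
}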
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
Enqueueing the kernel means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which will add the dispatch-compute-workload command to the command queue (cl_command_queue).
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger the HW to process the commands recorded in the command queue (which holds all the actual commands in the format the particular HW understands). The HW might have several queues and process them in parallel. In any case, after the workload from a queue is processed, the HW informs the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which allows the user to track workload execution status; see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation; see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c
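As a small illustration of the event mechanism described above (a sketch only; `queue`, `kernel` and `global` are assumed to exist already):

#include <CL/cl.h>

void dispatch_and_wait(cl_command_queue queue, cl_kernel kernel, size_t global)
{
    cl_event done;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &done);

    /* Blocks until the driver, notified through the HW interrupt path
     * described above, marks the command CL_COMPLETE. */
    clWaitForEvents(1, &done);
    clReleaseEvent(done);
}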

How to create read-only memory buffer across multiple devices in OpenCL?

This is really an issue just for NVIDIA devices. With AMD cards, multiple command queues within a single context can be executed simultaneously; NVIDIA's OpenCL implementation, however, does not support this. One has to create multiple context objects in multiple threads in order to run a kernel on multiple devices simultaneously.
The downside of creating multiple contexts is that all cl_mem objects have to be created multiple times, one for each context, something like:
/* one buffer handle per device/context */
gmedia = (cl_mem *)malloc(workdevice * sizeof(cl_mem));
for (i = 0; i < workdevice; i++) {
    /* the same read-only data is allocated and copied once per context */
    OCL_ASSERT(((gmedia[i] = clCreateBuffer(mcxcontext[i], RO_MEM,
                 sizeof(cl_uint) * (dimxyz), media, &status), status)));
    ...
}
This becomes quite time- and memory-consuming if one has a large array to copy. In my case, the memory-copying overhead became dominant and consumed many more gigabytes of memory before launching the kernel. When I try to launch this kernel on 8 GPUs (I have 11 GPUs in total), the code crashes due to the memory limit.
I am wondering if there is a way to effectively share read-only buffers across multiple devices in OpenCL?
Alternatively, is there a way to launch simultaneous executions on multiple devices with NVIDIA's OpenCL?
thanks

Using the same GPU memory object

Suppose you create two threads and have both of them enter a loop in which each launches the same kernel, which uses the same OpenCL memory object (a Buffer from cl.hpp in my case). Will it work properly? Does OpenCL allow different kernels to run at the same time with the same memory object?
(I am using the OpenCL C++ wrapper cl.hpp and Beignet, Intel's open-source library.)
If both threads are using the same in-order command queue, it will work just fine; it just becomes a race as to which thread enqueues its work first. From the OpenCL runtime's point of view, it's just commands in a queue.
OpenCL 1.1 (and newer) is thread-safe except for clSetKernelArg and clEnqueueNDRangeKernel on a given kernel; you'll need to lock around those.
If, however, your threads are using two different command queues, then you shouldn't be using the same memory object without using OpenCL event objects to synchronize. Unless it is read-only; that should be fine.
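For example, if the two threads share one cl_kernel object, the set-arg/enqueue pair can be guarded with a mutex. A sketch in plain C with pthreads (the cl.hpp equivalent is analogous; the names here are illustrative):

#include <CL/cl.h>
#include <pthread.h>

static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

/* clSetKernelArg on a shared kernel object is not safe to call from two
 * threads at once, so the set-arg + enqueue pair is locked; the other
 * API calls used here are thread-safe on their own. */
void enqueue_from_thread(cl_command_queue queue, cl_kernel kernel,
                         cl_mem shared_buf, size_t global)
{
    pthread_mutex_lock(&kernel_lock);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &shared_buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    pthread_mutex_unlock(&kernel_lock);
}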
Read operations on the same OpenCL memory objects by concurrent kernels won't cause any functional issues. Write operations, however, certainly will.
What is the objective of running multiple kernels concurrently? Please check this answer to a similar question.

OpenCL. How to identify which compute device is free and submit jobs accordingly?

I am running my OpenCL C code on our institution's GPU cluster, which has 8 nodes, each with an 8-core Intel Xeon processor and 3 NVIDIA Tesla M2070 GPUs (24 GPUs in total). I need a way, from my host code, to identify which GPUs are already occupied and which are free, and to submit my jobs to the available GPUs. The closest answers that I could find were:
How to programmatically discover specific GPU on platform with multiple GPUs (OpenCL 1.1)?
How to match OpenCL devices with a specific GPU given PCI vendor, device and bus IDs in a multi-GPU system?
Can anyone help me out with how to choose a node, and a GPU on it, that is free for computation? I am writing in OpenCL C.
Gerald
Unfortunately, there is no standard way to do such a thing.
If you want to squeeze the full power out of the GPUs and your problem is not a memory hog, I can suggest using two contexts per device: while kernels in the first one finish their computation, kernels in the second one are still working, and you have time to fill the buffers with data and start the next task in the first context, and vice versa. In my case (AMD GPU, OpenCL 1.2) it saves from 0 to 20% of the computation time. Three contexts sometimes give slower execution and sometimes faster, so I do not recommend that as a standard technique, but you can try it. Four or more contexts are useless, in my experience.
Have a command queue for each device, then use OpenCL events with each kernel submission, and check their state before submitting a new kernel for execution. Whichever command queue has the fewest unfinished kernels is the one you should enqueue to.
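A sketch of that idea (illustrative only; real code would track several outstanding events per queue rather than just the last one):

#include <CL/cl.h>
#include <stddef.h>

/* last_event[i] holds the event of the most recent kernel submitted to
 * device i, or NULL if nothing has been submitted yet. */
int pick_free_device(cl_event *last_event, int num_devices)
{
    for (int i = 0; i < num_devices; ++i) {
        cl_int status;
        if (last_event[i] == NULL)
            return i;               /* nothing submitted yet */
        clGetEventInfo(last_event[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        if (status == CL_COMPLETE)
            return i;               /* this device has finished its work */
    }
    return 0;                       /* all busy: fall back to device 0 */
}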

Use GPU and CPU wisely

I'm a newbie to OpenCL and have just started learning. I wanted to know whether it is possible to execute a few threads on the GPU and the remaining threads on the CPU. In other words, if I launch 100 threads and assume I have an 8-core CPU, is it possible that 8 of the 100 threads will execute on the CPU while the remaining 92 threads run on the GPU? Can OpenCL help me do this job smoothly?
I wanted to know whether it is possible to execute a few threads on the GPU and the remaining threads on the CPU?
Yes
In other words, if I launch 100 threads and assume I have an 8-core CPU, is it possible that 8 of the 100 threads will execute on the CPU while the remaining 92 threads run on the GPU?
No. That description suggests that you'd be viewing the GPU & CPU as a single compute resource. You can't do that.
That doesn't mean you can't have both working on the same task.
The GPU and CPU will be considered to be separate OpenCL devices.
You can write code that can talk to multiple devices.
You can compile the same kernel for multiple devices.
You can ask for multiple devices to do work at the same time.
...but...
None of this is automatic.
OpenCL won't split a single NDRange (or equivalent) call between multiple devices.
This means you'd have to schedule tasks between the two devices yourself (a sketch of a manual split follows below).
There's going to be quite a large disparity in speed, so keeping the split optimal will require more than "92 here, 8 there".
What I've found works better is having the CPU work on a different task while the GPU is working: maybe preparing the next piece of work for the GPU, or post-processing the results from the GPU. Sometimes this is normal code, sometimes it's OpenCL.
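For the manual scheduling mentioned above, a minimal way to split one logical range between the two devices is to give each queue a slice, using a global work offset for the second slice. A sketch (the 92/8 split just mirrors the question; `gpu_kernel` and `cpu_kernel` would be built from the same source, one per device or context):

#include <CL/cl.h>

void split_range(cl_command_queue gpu_q, cl_command_queue cpu_q,
                 cl_kernel gpu_kernel, cl_kernel cpu_kernel,
                 size_t total_items, size_t gpu_items)
{
    size_t cpu_items = total_items - gpu_items;
    size_t offset = gpu_items;   /* CPU work-items start where the GPU's end */

    /* e.g. total_items = 100, gpu_items = 92, cpu_items = 8 */
    clEnqueueNDRangeKernel(gpu_q, gpu_kernel, 1, NULL, &gpu_items, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(cpu_q, cpu_kernel, 1, &offset, &cpu_items, NULL, 0, NULL, NULL);

    clFinish(gpu_q);
    clFinish(cpu_q);
}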
You can use multiple OpenCL devices to work on your algorithm, but the workload needs to be partitioned carefully enough that the work across devices is balanced properly, or else the overhead may make your runtime worse.
The AMD OpenCL Programming Guide, section 4.7, covers using multiple OpenCL devices, so my answer is: yes, you can divide the work across multiple devices, smoothly, if and only if your scheduling algorithm is smart enough to balance the whole thing.
OpenCL code is compiled at run time for the selected device (CPU, or a specific model of GPU).
You can switch which target you use for different tasks, but you can't (with any implementation I know of) split the same task between the CPU and the GPU.
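For what it's worth, if a single platform exposes both a CPU and a GPU device (as some AMD and Intel stacks do; with different vendors you would repeat this per platform), building the same source for both looks roughly like this sketch (error checking omitted):

#include <CL/cl.h>

void build_for_cpu_and_gpu(cl_platform_id platform, const char *source)
{
    cl_device_id cpu, gpu;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &gpu, NULL);

    cl_device_id devs[2] = { cpu, gpu };
    cl_context ctx = clCreateContext(NULL, 2, devs, NULL, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
    /* One device-specific binary is produced per device in the list. */
    clBuildProgram(prog, 2, devs, NULL, NULL, NULL);
}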

Resources