Using the same GPU memory object - OpenCL

Suppose you create two threads and have both of them enter a loop in which each launches the same kernel, which uses the same OpenCL memory object (a Buffer from cl.hpp in my case). Will it work properly? Does OpenCL allow different kernels to run at the same time with the same memory object?
(I am using the OpenCL C++ wrapper cl.hpp and Beignet, Intel's open source library.)

If both threads are using the same in-order command queue, it will work just fine; it simply becomes a race as to which thread enqueues its work first. From the OpenCL runtime's point of view, they are just commands in a queue.
OpenCL 1.1 (and newer) is thread-safe except for clSetKernelArg and clEnqueueNDRangeKernel on a given kernel; you'll need to lock around those calls, as sketched below.
If, however, your threads are using two different command queues, then you shouldn't be using the same memory object without also using OpenCL event objects to synchronize. Unless it is read-only; that should be fine.
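Here is a minimal sketch of that locking, assuming both threads share one in-order queue and one kernel object; the function and variable names are placeholders, and error checking is omitted for brevity:

    /* Serialize the clSetKernelArg + clEnqueueNDRangeKernel pair per kernel. */
    #include <CL/cl.h>
    #include <pthread.h>

    static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

    void enqueue_step(cl_command_queue queue, cl_kernel kernel,
                      cl_mem buf, size_t global_size)
    {
        /* clSetKernelArg is not thread-safe for a given kernel, so the
           arg-set + enqueue pair must not interleave between threads. */
        pthread_mutex_lock(&kernel_lock);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
        pthread_mutex_unlock(&kernel_lock);
    }

Each thread calls enqueue_step; the in-order queue then serializes the two launches.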

Read operations on the same OpenCL memory object by concurrent kernels won't cause any functional issues. Concurrent write operations, however, certainly will.
What is the objective of running multiple kernels concurrently? Please check this answer to a similar question.

Related

Accessing file system using CPU device in OpenCL

I am a newbie to OpenCL. I have a doubt about how OpenCL functions when a kernel is running on a CPU device. Suppose we have a kernel running on a CPU device: can it read from a file on the disk? If yes, then how? If no, then why not?
Can you please suggest a source for detailed information?
Thanks in advance.
It can't, simply because not every OpenCL device has a file system, or a disk for that matter.
You can't. OpenCL tries to provide uniform access to computing power, while file system access depends on the OS. If you want this feature, threads (C++11 threads, pthreads, ...) or OpenMP should be able to handle it, since it's a CPU-only thing.
It doesn't make sense to allow device kernels to access the filesystem, because most of the semantics of filesystem access are essentially incompatible with the massively parallel nature of device kernels.
There are two ways to work around this, considering you're only asking about the CPU:
if you intend to use OpenCL as a way to do multithreading on the CPU, consider using what OpenCL calls "native kernels" (clEnqueueNativeKernel), which are essentially just plain C functions called within an OpenCL context;
a more general approach that might also work on GPUs would be to mmap the files you want to operate on and pass the resulting pointers to clCreateBuffer with the CL_MEM_USE_HOST_PTR flag, as sketched below.
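A rough sketch of the mmap route; error checking is omitted and all names are illustrative:

    /* Wrap a memory-mapped file in an OpenCL buffer (POSIX only). */
    #include <CL/cl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    cl_mem buffer_from_file(cl_context ctx, const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        *len = (size_t)st.st_size;

        /* Map the file into host memory (the mapping stays valid after close). */
        void *ptr = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);

        /* With CL_MEM_USE_HOST_PTR a CPU device can read the mapping in
           place; the mapping must outlive the buffer, so don't munmap
           until the cl_mem has been released. */
        cl_int err;
        return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                              *len, ptr, &err);
    }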

Call OpenCL CPU Kernel via function pointer

I want to use OpenCL as a simple C runtime JIT on the CPU. Because the kernels are plain text, I can modify them at runtime and compile/execute the code. This part is straightforward enough.
However, I'd like function-pointer access to the resulting compiled kernel, so that it can be called conventionally from C code rather than having to go through the OpenCL API.
Obviously this only works on the CPU, where the memory is shared.
It seems this should be possible. Any thoughts?
No, it can't be done. You need to use clEnqueueTask. Even if you were somehow able to get the address of the compiled CPU kernel and reverse engineer the parameter passing, it would all be subject to change with a driver update.
If you need runtime compilation, look at linking to LLVM or similar.
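For reference, a hedged sketch of the supported path: compile the runtime-generated source and launch it through the API rather than a function pointer. The kernel name "entry" and all handles are placeholders:

    #include <CL/cl.h>

    void jit_and_run(cl_context ctx, cl_device_id device,
                     cl_command_queue queue, const char *src)
    {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

        /* "entry" is whatever __kernel function the generated source defines. */
        cl_kernel kernel = clCreateKernel(prog, "entry", &err);

        /* clEnqueueTask runs the kernel as a single work-item; the runtime,
           not the caller, owns the compiled code's address. */
        clEnqueueTask(queue, kernel, 0, NULL, NULL);
        clFinish(queue);

        clReleaseKernel(kernel);
        clReleaseProgram(prog);
    }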

Memory Object Assignment to Context Mechanism in OpenCL

I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copy the data to all of the devices associated with the context?
I'd be thankful if you could help me understand this issue :-)
Generally, the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel Integrated Graphics GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.
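A small sketch of the common flag combinations and when the data plausibly moves for each; the exact timing is implementation-defined, as noted above, and all names here are illustrative:

    #include <CL/cl.h>

    void flag_variants(cl_context ctx, cl_command_queue queue,
                       float *host_data, size_t n)
    {
        size_t bytes = n * sizeof(float);
        cl_int err;

        /* 1. Plain buffer: no copy yet; data moves on the explicit write. */
        cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
        clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, bytes, host_data,
                             0, NULL, NULL);

        /* 2. COPY_HOST_PTR: the runtime snapshots host_data at creation; it
           may push it to a device immediately or defer until a kernel needs it. */
        cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                  bytes, host_data, &err);

        /* 3. USE_HOST_PTR: the runtime may use host_data in place (zero copy
           on shared-memory devices) or cache it on the device. */
        cl_mem c = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                  bytes, host_data, &err);

        clReleaseMemObject(a);
        clReleaseMemObject(b);
        clReleaseMemObject(c);
    }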

OpenCL: Sending same cl_mem to multiple devices

I am writing a multi-GPU parallel algorithm. One of the issues I am facing is finding out what would happen if I push one cl_mem to multiple devices and let them run the same kernel at the same time. The kernel will make changes to the memory passed to the device.
It is very time-consuming to code and debug OpenCL, so before I start I want to get some advice from fellow Stack Overflow users. I want to know the consequences of doing such a thing in both of the scenarios below (e.g. will an exception be raised during execution? Is the data synchronized? When CL_MEM_COPY_HOST_PTR is used, does the region of memory pointed to by this cl_mem get properly copied to each device? etc.):
The memory is created with CL_MEM_COPY_HOST_PTR
The memory is created with CL_MEM_USE_HOST_PTR
I don't see anything explicit in the OpenCL specification that guarantees data will be synchronized across devices. I don't see how an OpenCL implementation would know how to distribute a buffer across multiple devices and how to aggregate those buffers again later.
The approach I've adopted is to create a separate context and separate read, write, and kernel-execution queues for each device. I then create separate buffers on each device and enqueue writes/reads to move data to/from the devices, as sketched below. Hence I explicitly handle all of that myself.
I'd like a better solution, but at least the above method works and doesn't rely on anything that is implementation specific.
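A sketch of that per-device setup, assuming the data has already been split into per-device chunks; the struct and the split policy are illustrative only:

    #include <CL/cl.h>

    typedef struct {
        cl_context       ctx;
        cl_command_queue queue;
        cl_mem           buf;
    } per_device;

    void setup_device(per_device *d, cl_device_id dev, size_t bytes,
                      const float *host_chunk)
    {
        cl_int err;
        d->ctx   = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        d->queue = clCreateCommandQueue(d->ctx, dev, 0, &err);
        d->buf   = clCreateBuffer(d->ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

        /* Explicitly copy this device's slice of the data; a matching
           clEnqueueReadBuffer gathers the results later. */
        clEnqueueWriteBuffer(d->queue, d->buf, CL_FALSE, 0, bytes,
                             host_chunk, 0, NULL, NULL);
    }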
Appendix A of the OpenCL Specification explains the required synchronization for objects shared between different command queues.
Basically, it says you should use OpenCL events and clFlush to synchronize execution between the command queues. The OpenCL implementation will synchronize the contents of memory objects between the different devices of an OpenCL context. CL_MEM_USE_HOST_PTR vs. CL_MEM_COPY_HOST_PTR makes no difference to correctness, but USE_HOST_PTR will avoid a couple of extra copies of the data in host memory. Use clEnqueueMapBuffer to synchronize the bits with the host at the end.
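A minimal sketch of that event-plus-clFlush pattern, with one shared buffer already set as an argument on both kernels and two queues in the same context; all names are placeholders:

    #include <CL/cl.h>

    void run_shared(cl_command_queue q1, cl_command_queue q2,
                    cl_kernel kernA, cl_kernel kernB, size_t global)
    {
        cl_event done_a;

        /* Device 1 writes the shared buffer. */
        clEnqueueNDRangeKernel(q1, kernA, 1, NULL, &global, NULL,
                               0, NULL, &done_a);
        /* Flush so the command is actually submitted before another
           queue starts waiting on its event. */
        clFlush(q1);

        /* Device 2 waits on the event; the runtime migrates the buffer
           contents between devices as needed. */
        clEnqueueNDRangeKernel(q2, kernB, 1, NULL, &global, NULL,
                               1, &done_a, NULL);
        clFlush(q2);

        clReleaseEvent(done_a);
    }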

Sharing the GPU between OpenCL-capable programs

Is there a method to share the GPU between two separate OpenCL-capable programs, or more specifically, between two separate processes that simultaneously require the GPU to execute OpenCL kernels? If so, how is this done?
It depends what you call sharing.
In general, you can create two processes that each create an OpenCL device on the same GPU. It's then the driver/OS/GPU's responsibility to make sure things just work.
That said, most implementations will time-slice the GPU execution to make that happen (just like it happens for graphics).
I sense this is not exactly what you're after, though. Can you expand your question with a use case?
Current GPUs (except NVIDIA's Fermi) do not support simultaneous execution of more than one kernel. Moreover, to date GPUs do not support preemptive multitasking; it's completely cooperative! A kernel's execution cannot be suspended and resumed later. So the granularity of any time-based GPU sharing depends on the kernels' execution times.
If you have multiple programs running that require GPU access, you should therefore make sure your kernels have short runtimes (< 100 ms is a rule of thumb), so that GPU time can be time-sliced among the kernels that want GPU cycles. This is also important because otherwise the host system's graphics become very unresponsive, as they need GPU access too. It can go so far that a kernel in an endless or very long loop will apparently crash the system.
