Checking if GPU is integrated or not - OpenCL

I couldn't find any query that tells me whether a device is integrated/embedded in the CPU, i.e. whether it uses system RAM or its own dedicated GDDR memory. I could benchmark mapping/unmapping versus reading/writing and draw a conclusion from that, but the device could be under load at the time and give misleading results, and it would add complexity to the already complex load-balancing algorithm I'm using.
Is there a simple way to check whether a GPU uses the same memory as the CPU, so I can choose mapping/unmapping directly instead of reading/writing?
Edit: there is CL_DEVICE_LOCAL_MEM_TYPE, which returns CL_GLOBAL or CL_LOCAL.
Is this an indication of whether the device is integrated?

OpenCL 1.x has the device query CL_DEVICE_HOST_UNIFIED_MEMORY:
Is CL_TRUE if the device and the host have a unified memory subsystem
and is CL_FALSE otherwise.
This query is deprecated as of OpenCL 2.0, but should probably still work on OpenCL 2.x platforms for now. Otherwise, you may be able to produce a heuristic from the result of CL_DEVICE_SVM_CAPABILITIES instead.
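For example, a minimal sketch of that query (assuming `device` is a valid cl_device_id obtained elsewhere, e.g. from clGetDeviceIDs):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Returns 1 if the device shares a memory subsystem with the host.
       CL_DEVICE_HOST_UNIFIED_MEMORY is deprecated in OpenCL 2.0 but is
       still reported by most 2.x implementations. */
    static int has_unified_memory(cl_device_id device)
    {
        cl_bool unified = CL_FALSE;
        cl_int err = clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY,
                                     sizeof(unified), &unified, NULL);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "clGetDeviceInfo failed: %d\n", err);
            return 0;
        }
        return unified == CL_TRUE;
    }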

Related

How does the host send OpenCL kernels and arguments to the GPU at the assembly level?

So you get a kernel and compile it. You set the cl_buffers for the arguments and then clSetKernelArg the two together.
You then enqueue the kernel to run and read back the buffer.
Now, how does the host program tell the GPU which instructions to run? For example, I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are issued in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to a detailed explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
With regard to OpenCL memory buffers, there are essentially three ways I can think of that this can be implemented:
1. A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
2. The buffer is stored entirely in system RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
3. Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
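From the host's point of view these strategies hide behind the same API. As a hedged sketch of the two access patterns involved (mapping versus explicit read-back), assuming `ctx` and `queue` are a valid context and command queue:

    #include <CL/cl.h>

    /* Sketch: two host-side access patterns for the same cl_mem.
       On UMA devices the map is typically zero-copy; on discrete GPUs the
       read-back usually becomes a DMA copy from VRAM to system RAM. */
    static void access_patterns(cl_context ctx, cl_command_queue queue)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    4096, NULL, &err);

        /* Pattern A: map the buffer into the host address space. */
        void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                       0, 4096, 0, NULL, NULL, &err);
        /* ... inspect ptr ... */
        clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

        /* Pattern B: explicit blocking read-back into host memory. */
        char host_copy[4096];
        err = clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host_copy),
                                  host_copy, 0, NULL, NULL);
        clReleaseMemObject(buf);
    }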
"Enqueueing the kernel" means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNativeKernel API, which adds a dispatch-compute-workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API will trigger the HW to process the commands recorded into a command queue (which holds the actual commands in the format that the particular HW understands). The HW might have several queues and process them in parallel. In any case, after the workload from a queue is processed, the HW informs the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which lets the user track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation, see:
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNativeKernel.c
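As an illustration, a minimal host-side sketch of that flow (assuming `queue` and `kernel` are already created and the kernel arguments are already set):

    #include <CL/cl.h>

    /* Sketch: submit an NDRange kernel and track completion via an event.
       The driver records the dispatch command, the HW executes it, and the
       interrupt/KMD path eventually signals the event we wait on. */
    static void run_and_wait(cl_command_queue queue, cl_kernel kernel)
    {
        cl_event done;
        size_t global_size = 1024;
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global_size, NULL, 0, NULL, &done);
        if (err == CL_SUCCESS) {
            clWaitForEvents(1, &done);   /* blocks until the HW reports completion */
            clReleaseEvent(done);
        }
    }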

OpenCL: Writing to pointer in main memory

Is it possible, using OpenCL's DMA capabilities, to write to a main memory address that is passed into the cl program? I understand doing so would likely break the program, but the intent here is to run a GPU process and then overwrite the address space of the CPU program used to run it, so breakage is expected.
Thanks!
Which version of the OpenCL API are you targeting?
In OpenCL 2.0 and above you can use Shared Virtual Memory (SVM) to share addresses between the host and device(s) on platforms that support it.
You can get more information about it in the Intel OpenCL SVM overview.
If you are using previous versions, or your hardware does not support it, you can use pinned memory by passing the appropriate flags to clCreateBuffer, in particular CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR; see clCreateBuffer in the Khronos documentation.
Note that CL_MEM_USE_HOST_PTR comes with some alignment restrictions.
In general, in OpenCL, when and how the DMA is used depends on the hardware platform, so you should refer to the vendor documentation for details.
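For illustration, a hedged sketch of the coarse-grained SVM path (OpenCL 2.0+, assuming `ctx`, `queue`, and `kernel` already exist and the device reports SVM support):

    #include <CL/cl.h>

    /* Sketch: coarse-grained SVM buffer shared between host and device. */
    static void svm_example(cl_context ctx, cl_command_queue queue, cl_kernel kernel)
    {
        float *shared = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                            1024 * sizeof(float), 0);
        if (!shared)
            return;

        clSetKernelArgSVMPointer(kernel, 0, shared);
        /* ... enqueue the kernel here; with coarse-grained SVM, synchronize
           (e.g. clFinish) before touching `shared` on the host again. */
        clFinish(queue);
        clSVMFree(ctx, shared);
    }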

OpenCL 2.x pipes - how do they actually work?

I've read this description of the OpenCL 2.x pipe API and leafed through the Pipe API pages at khronos.org. I felt kind of jealous, working in CUDA almost exclusively, of this nifty feature available only in OpenCL (and sorry that CUDA functionality has not been properly subsumed by OpenCL, but that's a different issue), so I thought I'd ask "How come CUDA doesn't have a pipe mechanism?" But then I realized I don't even know what that would mean exactly. So, instead, I'll ask:
How do OpenCL pipes work on AMD discrete GPUs / APUs? ...
What info gets written where?
How is the scheduling of kernel workgroups to cores affected by the use of pipes?
Do piped kernels get compiled together (say, their SPIR forms)?
Does the use of pipes allow passing data between different kernels via the core-specific cache ("local memory" in OpenCL parlance, "shared memory" in CUDA parlance)? That would be awesome.
Is there a way pipes are "supposed" to work on a GPU, generally? i.e. something the API authors envisioned or even put in writing?
How do OpenCL pipes work in CPU-based OpenCL implementations?
OpenCL pipes were introduced with OpenCL 2.0. On GPUs, an OpenCL pipe is essentially a global memory buffer with controlled access, i.e. you can limit the number of workgroups that are allowed to write/read to/from a pipe simultaneously. This lets us re-use the same buffer or pipe without worrying about conflicting reads or writes from multiple workgroups.
As far as I know, OpenCL pipes do not use GPU local memory. But if you carefully adjust the size of the pipe, you can increase the cache hit rate and thus achieve better overall performance. There is no general rule as to when pipes should be used; I use pipes to pass data between two concurrently running kernels so that my program achieves better overall performance due to a better cache hit ratio.
This is also how OpenCL pipes work on CPUs (the pipe is just a global buffer, which may fit in the system cache if it is small enough). But on devices like FPGAs they work differently: there, pipes use local memory instead of global memory and hence achieve considerably higher performance than a global memory buffer.
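For illustration, a hedged sketch of the API surface involved (OpenCL 2.0+; the kernel and variable names here are made up for the example):

    #include <CL/cl.h>

    /* Device side (OpenCL C 2.0): one kernel writes packets, another reads them. */
    static const char *pipe_src =
        "kernel void producer(write_only pipe int out) {               \n"
        "    int v = get_global_id(0);                                 \n"
        "    write_pipe(out, &v);        /* nonzero result == full */  \n"
        "}                                                             \n"
        "kernel void consumer(read_only pipe int in, global int *dst) {\n"
        "    int v;                                                    \n"
        "    if (read_pipe(in, &v) == 0) /* 0 == success */            \n"
        "        dst[get_global_id(0)] = v;                            \n"
        "}                                                             \n";

    /* Host side: the pipe is just another cl_mem object, passed to both
       kernels with clSetKernelArg. Here: 1024 packets of sizeof(cl_int). */
    static cl_mem make_pipe(cl_context ctx, cl_int *err)
    {
        return clCreatePipe(ctx, 0, sizeof(cl_int), 1024, NULL, err);
    }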

Accessing the file system using a CPU device in OpenCL

I am a newbie to OpenCL. I have a doubt about how OpenCL works when a kernel is running on a CPU device. Suppose we have a kernel running on a CPU device: can it read from a file on the disk? If yes, then how? If no, then why not?
Can you please suggest a source for detailed information?
Thanks in advance.
It can't, simply because not every OpenCL device has a file system, or even a disk.
You can't. OpenCL tries to provide uniform access to computing power, while file system access depends on the OS. If you want this feature, threads (C++11 threads, pthreads, ...) or OpenMP should be able to handle it, because it's a CPU-only thing.
It doesn't make sense to allow device kernels to access the filesystem, because most of the semantics of filesystem access are essentially incompatible with the massively parallel nature of device kernels.
There are two ways to work around this, considering you're only asking about CPU:
1. If you intend to use OpenCL as a way to do multithreading on the CPU, consider using what OpenCL calls "native kernels", which are essentially just plain C functions called within an OpenCL context.
2. A more general approach that might work on GPUs too would be to mmap the files you want to operate on, and pass the resulting pointers to clCreateBuffer with the CL_MEM_USE_HOST_PTR flag.
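A minimal sketch of the second approach (POSIX-only, assuming `ctx` is a valid context; whether the driver actually avoids a copy with CL_MEM_USE_HOST_PTR is implementation-dependent, and error handling is omitted for brevity):

    #include <CL/cl.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Sketch: memory-map a file and wrap it in an OpenCL buffer. */
    static cl_mem buffer_from_file(cl_context ctx, const char *path, cl_int *err)
    {
        int fd = open(path, O_RDWR);
        struct stat st;
        fstat(fd, &st);
        void *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        close(fd);  /* the mapping stays valid after closing the descriptor */

        return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              st.st_size, data, err);
    }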

Sharing the GPU between OpenCL capable programs

Is there a method to share the GPU between two separate OpenCL capable programs, or more specifically between two separate processes that simultaneously both require the GPU to execute OpenCL kernels? If so, how is this done?
It depends what you call sharing.
In general, you can create 2 processes that both create an OpenCL device on the same GPU. It's then the driver/OS/GPU's responsibility to make sure things just work.
That said, most implementations will time-slice the GPU execution to make that happen (just like it happens for graphics).
I sense this is not exactly what you're after though. Can you expand your question with a use case ?
Current GPUs (except NVidia's Fermi) do not support simultaneous execution of more than one kernel. Moreover, to date GPUs do not support preemptive multitasking; it's completely cooperative! A kernel's execution cannot be suspended and resumed later. So the granularity of any time-based GPU sharing depends on the kernels' execution times.
If you have multiple programs running that require GPU access, you should therefore make sure that your kernels have short runtimes (< 100 ms is a rule of thumb), so that GPU time can be time-sliced among the kernels that want GPU cycles. It's also important to do that because otherwise the host system's graphics will become very unresponsive, as they need GPU access too. This can go as far as a kernel in an endless or very long loop apparently crashing the system.

Resources