Is it possible, using OpenCL's DMA capabilities, to write to a main memory address that is passed into the cl program? I understand doing so would likely break the program, but the intent here is to run a GPU process and then overwrite the address space of the CPU program used to run it, so breakage is expected.
Thanks!
Which version of the OpenCL API are you targeting?
In OpenCL 2.0 and above you can use Shared Virtual Memory (SVM) to share an address space between the host and device(s), on platforms that support it.
You can get more information about it in the Intel OpenCL SVM overview.
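A minimal sketch of coarse-grained SVM use (assuming an OpenCL 2.0 `context`, `queue`, and `kernel` already exist; error handling omitted):

    // Allocate memory visible to both host and device at the same address.
    float *ptr = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                                     1024 * sizeof(float), 0);

    // Coarse-grained SVM: bracket host access with SVM map/unmap.
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE,
                    ptr, 1024 * sizeof(float), 0, NULL, NULL);
    ptr[0] = 42.0f;
    clEnqueueSVMUnmap(queue, ptr, 0, NULL, NULL);

    // Pass the same pointer to the kernel; no clCreateBuffer needed.
    clSetKernelArgSVMPointer(kernel, 0, ptr);
    size_t gws = 1024;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(queue);  // device writes are now visible at the same address
    clSVMFree(context, ptr);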
If you are using previous versions, or your hardware does not support it, you can use pinned memory with the appropriate flags to clCreateBuffer. In particular, CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR, see clCreateBuffer in Khronos.
Note that CL_MEM_USE_HOST_PTR comes with some alignment restrictions.
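For illustration, a sketch of both flags (assuming `context`, `queue`, `size`, and `err` exist; error handling omitted; the 4096-byte alignment is only a guess, query CL_DEVICE_MEM_BASE_ADDR_ALIGN, which reports the real requirement in bits):

    // Option 1: wrap an existing, suitably aligned host allocation.
    void *host_ptr = NULL;
    posix_memalign(&host_ptr, 4096, size);  // assumed alignment; query the device
    cl_mem buf1 = clCreateBuffer(context,
                                 CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                 size, host_ptr, &err);

    // Option 2: let the runtime allocate pinned, host-accessible memory,
    // then map it to obtain a host pointer.
    cl_mem buf2 = clCreateBuffer(context,
                                 CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                 size, NULL, &err);
    void *mapped = clEnqueueMapBuffer(queue, buf2, CL_TRUE, CL_MAP_WRITE,
                                      0, size, 0, NULL, NULL, &err);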
In general, in OpenCL, when and how the DMA is used depends on the hardware platform, so you should refer to the vendor documentation for details.
So you get a kernel and compile it. You create cl_buffers for the arguments and then bind them to the kernel with clSetKernelArg.
You then enqueue the kernel to run and read back the buffer.
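For concreteness, here is roughly the flow described above (a sketch; names like my_kernel and the surrounding handles are illustrative, error checks omitted):

    cl_program prog = clCreateProgramWithSource(context, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "my_kernel", &err);
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    size_t gws = 1024;
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, out, 0, NULL, NULL);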
Now, how does the host program tell the GPU which instructions to run? E.g., I'm on a 2017 MBP with a Radeon Pro 460. At the assembly level, what instructions are issued in the host process to tell the GPU "here's what you're going to run"? What mechanism lets the cl_buffers be read by the GPU?
In fact, if you can point me to a detailed explanation of all of this I'd be quite pleased. I'm a toolchain engineer and I'm curious about the toolchain aspects of GPU programming, but I'm finding it incredibly hard to find good resources on it.
It pretty much all runs through the GPU driver. The kernel/shader compiler, etc. tend to live in a user space component, but when it comes down to issuing DMAs, memory-mapping, and responding to interrupts (GPU events), that part is at least to some extent covered by the kernel-based component of the GPU driver.
A very simple explanation is that the kernel compiler generates a GPU-model-specific code binary, this gets uploaded to VRAM via DMA, and then a request is added to the GPU's command queue to run a kernel with reference to the VRAM address where that kernel is stored.
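Incidentally, you can inspect that device-specific binary from the host side with clGetProgramInfo (a sketch, assuming a cl_program `prog` built for a single device; error handling omitted):

    size_t bin_size;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                     sizeof(bin_size), &bin_size, NULL);
    unsigned char *bin = (unsigned char *)malloc(bin_size);
    unsigned char *bins[] = { bin };
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
    // `bin` now holds whatever the driver considers the "binary" - on many
    // implementations this is GPU ISA or an intermediate representation.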
With regard to OpenCL memory buffers, there are essentially 3 ways I can think of that this can be implemented:
1. A buffer is stored in VRAM, and when the CPU needs access to it, that range of VRAM is mapped onto a PCI BAR, which can then be memory-mapped by the CPU for direct access.
2. The buffer is stored entirely in system RAM, and when the GPU accesses it, it uses DMA to perform read and write operations.
3. Copies of the buffer are stored both in VRAM and system RAM; the GPU uses the VRAM copy and the CPU uses the system RAM copy. Whenever one processor needs to access the buffer after the other has made modifications to it, DMA is used to copy the newer copy across.
On GPUs with UMA (Intel IGP, AMD APUs, most mobile platforms, etc.) VRAM and system RAM are the same thing, so they can essentially use the best bits of methods 1 & 2.
If you want to take a deep dive on this, I'd say look into the open source GPU drivers on Linux.
"Enqueue the kernel" means asking the OpenCL driver to submit work to dedicated HW for execution. In OpenCL, for example, you would call the clEnqueueNDRangeKernel API, which adds a dispatch-compute-workload command to the command queue - cl_command_queue.
From the spec:
The command-queue can be used to queue a set of operations (referred to as commands) in order.
https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_command_queues
Next, the implementation of this API triggers the HW to process the commands recorded in a command queue (which holds the actual commands in the format that the particular HW understands). The HW might have several queues and process them in parallel. In any case, after the workload from a queue is processed, the HW informs the KMD (kernel-mode driver) via an interrupt, and the KMD is responsible for propagating this update to the OpenCL driver via the OpenCL event mechanism, which lets the user track workload execution status - see https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#clWaitForEvents.
To get a better idea of how the OpenCL driver interacts with the HW, you could take a look at an open-source implementation, e.g.:
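A sketch of that event mechanism from the application's side (handles assumed to exist; error handling omitted):

    cl_event done;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &done);
    // Blocks until the HW interrupt has propagated KMD -> OpenCL driver -> event.
    clWaitForEvents(1, &done);
    cl_int status;
    clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);  // CL_COMPLETE on success
    clReleaseEvent(done);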
https://github.com/pocl/pocl/blob/master/lib/CL/clEnqueueNDRangeKernel.c
I couldn't find any query for whether a device is integrated/embedded in the CPU, or whether it uses system RAM or its own dedicated GDDR memory. I could benchmark mapping/unmapping versus reading/writing to reach a conclusion, but the device may be under load at that time and behave unpredictably, and that would add complexity to the already complex load-balancing algorithm I'm using.
Is there a simple way to check whether a GPU uses the same memory as the CPU, so I can choose mapping/unmapping directly instead of reading/writing?
Edit: there is CL_DEVICE_LOCAL_MEM_TYPE, which returns CL_GLOBAL or CL_LOCAL. Is this an indication of integratedness?
OpenCL 1.x has the device query CL_DEVICE_HOST_UNIFIED_MEMORY:
Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise.
This query is deprecated as of OpenCL 2.0, but should probably still work on OpenCL 2.x platforms for now. Otherwise, you may be able to produce a heuristic from the result of CL_DEVICE_SVM_CAPABILITIES instead.
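A sketch of the query (a `device` handle assumed to exist; error handling omitted):

    cl_bool unified = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY,
                    sizeof(unified), &unified, NULL);
    if (unified) {
        // Same physical memory: prefer clEnqueueMapBuffer/clEnqueueUnmapMemObject.
    } else {
        // Discrete memory: clEnqueueReadBuffer/clEnqueueWriteBuffer may be cheaper.
    }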
I am a newbie to OpenCL. I have a question about how OpenCL works when a kernel runs on a CPU device. Suppose we have a kernel running on a CPU device: can it read from a file on disk? If yes, then how? If no, then why not?
Can you please suggest a source of detailed information?
Thanks in advance.
It can't, simply because not every OpenCL device has a file system, or even a disk.
You can't. OpenCL tries to provide uniform access to computing power, whereas file systems depend on the OS. If you want this capability, threads (C++11 std::thread, pthreads, ...) or OpenMP should be able to handle it, because it's a CPU-only thing.
It doesn't make sense to allow device kernels to access the filesystem, because most of the semantics of filesystem access are essentially incompatible with the massively parallel nature of device kernels.
There are two ways to work around this, considering you're only asking about the CPU:
if you intend to use OpenCL as a way to do multithreading on the CPU, consider using what OpenCL calls "native kernels", which are essentially just plain C functions called within an OpenCL context (see the sketch after this list);
a more general approach, which might work on GPUs too, would be to mmap the files you want to operate on and pass the resulting pointers to clCreateBuffer with the CL_MEM_USE_HOST_PTR flag.
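As an illustration of the first option, a native kernel is just a C function handed to the runtime. This is a sketch: the struct, function, and file path are hypothetical, and it assumes a `queue` on a CPU device that reports CL_EXEC_NATIVE_KERNEL; error handling omitted.

    #include <stdio.h>
    #include <CL/cl.h>

    typedef struct { const char *path; char data[4096]; } file_args;

    static void CL_CALLBACK read_file_native(void *args) {
        file_args *fa = (file_args *)args;
        FILE *f = fopen(fa->path, "rb");  // ordinary file I/O: this runs on the CPU
        if (f) {
            fread(fa->data, 1, sizeof(fa->data), f);
            fclose(f);
        }
    }

    // Later, inside some function with `queue` in scope:
    file_args fa = { "/tmp/input.bin", {0} };  // hypothetical path
    clEnqueueNativeKernel(queue, read_file_native, &fa, sizeof(fa),
                          0, NULL, NULL, 0, NULL, NULL);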
I got some problems with clCreateBuffer in OpenCL. I am working with an AMD Fusion processor (A10-5800K), so both devices (CPU and GPU) should be able to work on each other's memory.
For the read and result buffer I do:
bufRead = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size, data, &err);
bufWrite = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, result, &err);
When I call my kernel, the "result" array doesn't change. I know that normal GPUs would copy the data to the device memory and work on that. Would normal GPUs copy the data back afterwards?
However, I did hope that the Fusion GPU does not copy the data, because it can work on the same pointer. Unfortunately, I don't see any change in the "result" array. When I read "bufWrite" with clEnqueueReadBuffer I see the changes. (I do clFinish before reading "result", so the data should be written)
Does anyone know how to truly work on the same array with CPU and GPU? I really want to avoid clEnqueueReadBuffer.
Thanks,
Tomas
OK, I searched quite a while for an answer. It is possible but only under certain circumstances.
You need a GPU that has VM (virtual memory) enabled. You can check this with clinfo: look for "VM" in the driver version string, e.g.,
Driver version: CAL 1.4.1695 (VM)
I have a fairly new APU, and under Linux VM is not supported; it seems not to be supported for any GPU under Linux. I will try Windows next. This is plausible, because the driver needs to interact with the OS for this. I hope Linux support will come in the future.
Anyway, to use it, you need to:
Create your buffers with CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR.
Access the buffer from the host with clEnqueueMapBuffer and release it after reading/writing with clEnqueueUnmapMemObject (see the sketch below).
When VM is enabled, nothing is copied and you have direct access; without VM it works as well, but it will copy the data.
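The map/unmap pattern looks roughly like this (a sketch; `context`, `queue`, `size`, and `err` assumed to exist, error handling omitted):

    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);
    // ... enqueue kernels operating on buf ...
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                           CL_MAP_READ | CL_MAP_WRITE,
                                           0, size, 0, NULL, NULL, &err);
    // Host reads/writes through p; with VM enabled this is zero-copy.
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);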
Check out the AMD APP OpenCL Programming Guide, Section 4.5.2 - Placement.
I'm not sure I understand you. In OpenCL (for any target platform type, CPU or GPU), a call to clCreateBuffer will allocate some memory on the device and copy data from the host pointer to the newly allocated memory (although this copy may be done only when a kernel is invoked with this buffer as an argument). I do not think it is possible for a host and a device to work on the same memory without "synchronization" (aka clEnqueueReadBuffer).
On some platforms/devices, a call to clFinish is enough to synchronize host memory with device memory. A call to clEnqueueReadBuffer or clEnqueueMapBuffer is required in the general case. The pointer returned by clEnqueueMapBuffer should be related to the host pointer you provided when creating the buffer.
Little disclaimer: this is more of a theoretical/academic question than an actual problem I've got.
The usual way of setting up a parallel program in OpenCL is to write a C/C++ program, which sets up the devices (GPU and/or other CPUs), kernel and data buffers for executing the kernel on the device.
This program is launched from the host, which traditionally is a CPU.
Would it be possible to write an OpenCL program where the host is a GPU and the devices are other GPUs and/or CPUs?
What would be the prerequisites for such a scenario?
Does one need a special GPU, or would it be possible to use any OpenCL-capable GPU?
Are you looking for a complete host or just a kernel launcher?
The upcoming CUDA (v5.0) introduces a feature to launch a kernel from inside a kernel (dynamic parallelism). Therefore, a device can be used to launch a kernel on itself. Maybe this feature will be supported by OpenCL too in the near future.