OpenCL local memory and workgroup size

I am trying to use local memory in my OpenCL kernel.
The following is the related information.
Device info
GPU: Qualcomm Adreno 420
local memory size: 32768 bytes = 32 KB
max work group size: 1024
Kernel info A (without local memory usage)
CL_KERNEL_WORK_GROUP_SIZE = 1024
CL_KERNEL_LOCAL_MEM_SIZE = 0 bytes
Kernel info B (with local memory usage)
CL_KERNEL_WORK_GROUP_SIZE = 224
CL_KERNEL_LOCAL_MEM_SIZE = 2048 bytes
The only difference between kernel A and kernel B is the use of local memory.
What causes this situation?
If this were a register problem, then the CL_KERNEL_WORK_GROUP_SIZE of kernel A should also be lower than 1024.
I really want to check register usage, but I can't.
I believe that I have enough local memory and global memory.
Please help.
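For reference, the two per-kernel values quoted above come from clGetKernelWorkGroupInfo. A minimal sketch of the query, assuming an already-built cl_kernel named kernel and a cl_device_id named device:

#include <CL/cl.h>
#include <stdio.h>

void print_kernel_limits(cl_kernel kernel, cl_device_id device)
{
    size_t wg_size = 0;
    cl_ulong local_mem = 0;

    /* Maximum work-group size the implementation allows for this kernel. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wg_size), &wg_size, NULL);

    /* Local memory used by the kernel (statically declared __local variables
       plus kernel arguments with the __local qualifier). */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);

    printf("CL_KERNEL_WORK_GROUP_SIZE = %zu\n", wg_size);
    printf("CL_KERNEL_LOCAL_MEM_SIZE  = %llu bytes\n",
           (unsigned long long)local_mem);
}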

Related

Where does OpenCL context physically lie?

From my understanding, the OpenCL context is an abstraction layer. But where does it physically lie (host RAM, GPU memory... or somewhere in the air)?
I just want to understand how multiple GPUs using the same OpenCL platform can access the same memory buffer objects without having to explicitly transfer the data between them through the host.
The OpenCL cl::Context object, like all objects of the OpenCL headers, is located in CPU RAM.
The cl::Buffer object itself also resides in CPU RAM, but allocates a region of memory in GPU VRAM and stores a pointer to it to allow host<->device data transfer over PCIe.
Multiple GPUs within a platform can - if supported by extensions - access each other's memory through remote direct memory access (RDMA). If supported, a GPU then knows the pointer to the array in the other GPU's memory.
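To make the split between the host-side handle and the device-side allocation concrete, here is a minimal sketch using the C API (error handling omitted; a cl_device_id named device is assumed to have been obtained with clGetDeviceIDs):

cl_int err;
/* The cl_context handle lives in host RAM. */
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

float host_data[1024] = {0};
/* The cl_mem handle also lives in host RAM; the backing storage is
   typically allocated in the device's VRAM. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(host_data), NULL, &err);

/* Explicit host -> device copy (over PCIe on a discrete GPU). */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(host_data), host_data,
                     0, NULL, NULL);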

Why does clCreateBuffer with CL_MEM_ALLOC_HOST_PTR use discrete device memory?

I have a piece of code in which I use clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag, and I realised that this allocates memory on the device. Is that correct, or am I missing something from the standard?
CL_MEM_ALLOC_HOST_PTR: This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.
Personally, I understood that the buffer should be a host-side buffer that, later on, can be mapped using clEnqueueMapBuffer.
Here is some info about the device I'm using:
Device: Tesla K40c
Hardware version: OpenCL 1.2 CUDA
Software version: 352.63
OpenCL C version: OpenCL C 1.2
It is described in https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html as:
OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.
That description is for CL_MEM_USE_HOST_PTR, but it differs from CL_MEM_ALLOC_HOST_PTR only in its allocator: USE uses the host-given pointer, while ALLOC uses the return value of the OpenCL implementation's own allocator.
The caching is not doable for some integrated-GPU types, so it's not always true.
The key phrase from the spec is host accessible:
This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.
It doesn't say it'll be allocated in host memory: it says it'll be accessible by the host.
This includes any memory that can be mapped into CPU-visible memory addresses. Typically some, if not all VRAM in a discrete graphics device will be available through a PCI memory range exposed in one of the BARs - these get mapped into the CPU's physical memory address space by firmware or the OS. They can be used similarly to system memory in page tables and thus made available to user processes by mapping them to virtual memory addresses.
The spec even goes on to mention this possibility, at least in combination with another flag:
CL_MEM_COPY_HOST_PTR can be used with CL_MEM_ALLOC_HOST_PTR to initialize the contents of the cl_mem object allocated using host-accessible (e.g. PCIe) memory.
If you definitely want to use system memory for a buffer (which may be a good choice if GPU access to it is sparse or less frequent than CPU access), allocate it yourself and wrap it in a buffer with CL_MEM_USE_HOST_PTR. (Which may still end up being cached in VRAM, depending on the implementation.)
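A minimal sketch of the two patterns contrasted above, assuming an existing cl_context named ctx and cl_command_queue named queue (error handling omitted):

#include <CL/cl.h>
#include <stdlib.h>

void host_accessible_demo(cl_context ctx, cl_command_queue queue)
{
    cl_int err;
    size_t bytes = 1024 * sizeof(float);

    /* Pattern 1: let the implementation allocate host-accessible (often pinned)
       memory, then map it into the host address space to fill it. */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                   bytes, NULL, &err);
    float *mapped = (float *)clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                                0, bytes, 0, NULL, NULL, &err);
    /* ... fill mapped[] on the host ... */
    clEnqueueUnmapMemObject(queue, pinned, mapped, 0, NULL, NULL);

    /* Pattern 2: allocate system memory yourself and wrap it with
       CL_MEM_USE_HOST_PTR (the implementation may still cache it in VRAM). */
    float *host_mem = (float *)malloc(bytes);
    cl_mem wrapped = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE,
                                    bytes, host_mem, &err);

    clReleaseMemObject(pinned);
    clReleaseMemObject(wrapped);
    free(host_mem);
}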

OpenCL memory allocation limit on GPU

When memory is allocated in OpenCL using clCreateBuffer and written with clEnqueueWriteBuffer, how is it decided which memory to allocate (CPU memory or GPU memory)?
If GPU memory is being allocated, will the program fail if the allocation is greater than the memory limit, or will there be something like paging?
clCreateBuffer() will return a null buffer and set the error code to CL_INVALID_BUFFER_SIZE if the requested buffer size is greater than the device's CL_DEVICE_MAX_MEM_ALLOC_SIZE (which can be queried with the clGetDeviceInfo() function).
See the documentation for clCreateBuffer() for more information.
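A minimal sketch of that check, assuming an existing cl_device_id named device, a cl_context named ctx, and a hypothetical requested_size value:

#include <CL/cl.h>
#include <stdio.h>

cl_mem try_alloc(cl_context ctx, cl_device_id device, size_t requested_size)
{
    /* Query the largest single allocation the device supports. */
    cl_ulong max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    printf("CL_DEVICE_MAX_MEM_ALLOC_SIZE = %llu bytes\n",
           (unsigned long long)max_alloc);

    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, requested_size, NULL, &err);
    if (err == CL_INVALID_BUFFER_SIZE) {
        /* requested_size exceeded the device's maximum single allocation. */
        return NULL;
    }
    return buf;
}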

Memory transfer between host and device in OpenCL?

Consider the following code, which creates a buffer memory object from an array of doubles with size elements:
coef_mem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, (sizeof(double) * size), arr, &err);
Consider that it is passed as an argument to a kernel. There are 2 possibilities depending on the device on which the kernel is running:
The device is the same as the host device
The device is other than the host device
Here are my questions for both possibilities:
At what step is the memory transferred from the host to the device?
How do I measure the time required for transferring the memory from host to device?
How do I measure the time required for transferring the memory from the device's global memory to private memory?
Is the memory still transferred if the device is the same as the host device?
Will the time required to transfer from host to device be greater than the time required for transferring from the device's global memory to private memory?
At what step is the memory transferred to the device from the host?
The only guarantee you have is that the data will be on the device by the time the kernel begins execution. The OpenCL specification deliberately doesn't mandate when these data transfers should happen, in order to allow different OpenCL implementations to make decisions that are suitable for their own hardware. If you only have a single device in the context, the transfer could be performed as soon as you create the buffer. In my experience, these transfers usually happen when the kernel is enqueued (or soon after), because that is when the implementation knows that it really needs the buffer on a particular device. But it really is completely up to the implementation.
How do I measure the time required for transferring the memory from host to device?
Use a profiler, which usually shows when these transfers happen and how long they take. If you transfer the data with clEnqueueWriteBuffer instead, you could use the OpenCL event profiling system.
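A minimal sketch of timing such a transfer with event profiling, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and a device buffer buf of the right size already exists:

cl_event evt;
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(double) * size, arr,
                     0, NULL, &evt);

/* Timestamps are reported in nanoseconds. */
cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("Host -> device transfer took %f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);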
How do I measure the time required for transferring the memory from device's global memory to private memory?
Again, use a profiler. Most profilers will have a metric for the achieved bandwidth when reading from global memory, or something similar. It's not really an explicit transfer from global to private memory though.
Is the memory still transferred if the device is same as host device?
With CL_MEM_COPY_HOST_PTR, yes. If you don't want a transfer to happen, use CL_MEM_USE_HOST_PTR instead. With unified memory architectures (e.g. integrated GPU), the typical recommendation is to use CL_MEM_ALLOC_HOST_PTR to allocate a device buffer in host-accessible memory (usually pinned), and access it with clEnqueueMapBuffer.
Will the time required to transfer from host to device be greater than the time required for transferring from device's global memory to private memory?
Probably, but this will depend on the architecture, whether you have a unified memory system, and how you actually access the data in kernel (memory access patterns and caches will have a big effect).

Benchmark of CL_MEM_USE_HOST_PTR and CL_MEM_COPY_HOST_PTR in OpenCL

I have a vector on the host, and I want to halve it and send it to the device. A benchmark shows that CL_MEM_ALLOC_HOST_PTR is faster than CL_MEM_USE_HOST_PTR and much faster than CL_MEM_COPY_HOST_PTR. Also, memory analysis on the device doesn't show any difference in the size of the buffer created on the device. This differs from the documentation of the mentioned flags in the Khronos clCreateBuffer reference. Does anyone know what's going on?
The answer by Pompei 2 is incorrect. The specification makes no guarantee as to where the memory is allocated, only how it is allocated. CL_MEM_ALLOC_HOST_PTR makes clCreateBuffer allocate the host-side memory for you. You can then map this into a host pointer using clEnqueueMapBuffer. CL_MEM_USE_HOST_PTR will cause the runtime to scoop up the data you give it into an OpenCL buffer.
Pinned memory is achieved through the use of CL_MEM_ALLOC_HOST_PTR: the runtime is free to allocate the memory however it sees fit.
All this performance is implementation dependent. Reading section 3.1.1 more carefully will show that in one of the calls (with no CL_MEM flag) NVIDIA is able to preallocate a device-side buffer, whilst the other calls merely get the pinned data mapped into a host pointer ready for writing to the device.
First off, if I understand you correctly, clCreateSubBuffer is probably not what you want, as it creates a sub-buffer from an existing OpenCL buffer object. The documentation you linked also tells us that:
The CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR and CL_MEM_COPY_HOST_PTR values cannot be specified in flags but are inherited from the corresponding memory access qualifiers associated with buffer.
You said you have a vector on the host and want to send half of it to the device. For this, I would use a regular buffer of half the vector's size (in bytes) on the device.
Then, with a regular buffer, the performance you see is expected.
CL_MEM_ALLOC_HOST_PTR only allocates memory on the host, which does not incur any transfer at all: it is like doing a malloc and not filling the memory.
CL_MEM_COPY_HOST_PTR will allocate a buffer on the device, most probably the RAM on GPUs, and then copy your whole host buffer over to the device memory.
On GPUs, CL_MEM_USE_HOST_PTR most likely allocates so-called page-locked or pinned memory. This kind of memory is the fastest for host->GPU memory transfer and this is the recommended way to do the copy.
To read how to correctly use pinned memory on NVidia devices, refer to chapter 3.1.1 of NVidia's OpenCL best practices guide. Note that if you use too much pinned memory, performance may drop below that of host-copied memory.
The reason why pinned memory is faster than copied device memory is well explained in this SO question as well as the forum thread it points to.
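A minimal sketch of the half-vector suggestion above, assuming an existing context ctx, queue queue, and a host array host_vec of n floats:

/* Send only the first half of the host vector using a regular device buffer
   that is half the vector's size in bytes. */
size_t half_bytes = (n / 2) * sizeof(float);
cl_int err;
cl_mem half_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, half_bytes, NULL, &err);
clEnqueueWriteBuffer(queue, half_buf, CL_TRUE, 0, half_bytes, host_vec,
                     0, NULL, NULL);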
Pompei2, you say that CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR allocate memory on the device, while the OpenCL 1.1 specification says that with CL_MEM_ALLOC_HOST_PTR the memory will be allocated in host memory, and with CL_MEM_USE_HOST_PTR the memory will be used from host memory. I'm a newbie in OpenCL, but I want to know which is true.
