OpenCL - change buffer flags - opencl

Is there a way to change the flags of a opencl buffer once allocated?
My use case is the following:
1) create data on device
2) do large amounts of work on device with said data
I want to mark the data as CL_MEM_READ_ONLY to enable possible optimisations during 2, but of course it can't be read-only when it's being created in 1.
It would be acceptable to copy the data to a new read-only buffer, but I can't see any way of doing that without going via host memory.

As pointed out in the the other answers, I also believe there not likely to be any significant performance gains to be had from using CL_MEM_READ_ONLY, as opposed to simply marking the buffer as const (or putting it in the constant address space, if small enough) inside your kernel.
However, you can achieve this using sub-buffers. If you create your buffer with CL_MEM_READ_WRITE, you can then create a sub-buffer that has the CL_MEM_READ_ONLY flag set.
cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
cl_buffer_region = {0, size};
cl_mem robuffer = clCreateSubBuffer(buffer, CL_MEM_READ_ONLY,
CL_BUFFER_CREATE_TYPE_REGION,
(const void*)&region, &err);

You can't mutate the flags of an existing buffer. However, I think you can create two buffers that wrap the same host memory. If you are on an integrated graphics platform like Intel or AMD and use CL_MEM_USE_HOST_PTR, you can create a read-write buffer that wraps a piece of host memory. (The usual constraints apply: has to be page-aligned and even cacheline length on Intel, not sure about AMD's). You can create a second buffer wrapping the same region with different options (read only) and use it separately.
It's definitely illegal to use overlapped regions in different enqueues at the same time.
The result of OpenCL commands that operate on multiple buffer objects created with the same host_ptr or overlapping host regions is considered to be undefined.
(from CreateBuffer) But barring that, it should work.
However, in the end, I strongly suspect you won't really gain anything. Implementations are free to ignore these flags. And I suspect that the overlap case above will force the implementation to ignore them (set the page access to the least restrictive combination of buffers mapping it). Integrated GPUs almost certainly will ignore those flags (I think Intel does).
What sort of optimizations were you hoping for?

My feeling is that it should depend on how you allocate the buffer initially. For some flags, you may reuse (you can try with alloc_host). Some may not allow you to do so.

Is there a way to change the flags of a opencl buffer once allocated?
No, it is not. You will have to create another buffer and call a copybuffer from one to another.
However I really doubt of the need of this. The memory flags affect (mainly) how the sync operation between host and device is performed. But when the memory is in the device, I doubt any optimizations can be done at all. (unless the memory consist of just some KB of data).
Even if optimizations are possible, the compiler should be clever enough to do it as well if the memory is declared in the kernel as constant or read_only. Regardless of the flags set to the memory buffer.

Related

OpenCL vector data type usage

I'm using a GPU driver that is optimized to work with 16-element vector data type.
However, I'm not sure how to use it properly.
Should I declare it as, for example, cl_float16 on host with a size 16 times less than the original array?
What is the better way to access this type on the OpenCL kernel?
Thanks in advance.
In host code you can use cl_float16 host type. Access it like an array (e.g., value.s[5]). Pass as kernel argument. In kernel, access like value.s5.
How you declare it on the host is pretty much irrelevant. What matters is how you allocate it, and even that only if plan on creating the buffer with CL_MEM_USE_HOST_PTR and your GPU uses system memory. This is because your memory needs to be properly aligned for GPU zero-copy, otherwise the driver will create a background copy. If your GPU doesn't use system memory for buffers, or you don't use CL_MEM_USE_HOST_PTR, then it doesn't matter - the driver will allocate a proper buffer on the GPU.
Your bigger issue is that your GPU needs to work with 16-element vectors. You will have to vectorize every kernel you want to run on it. IOW every part of our algorithms need to work with float16 types. If you just use simple floats, or you declare the buffer as global float16* X but then use element access (X.s0, X.w and such) and work with those, the performance will be the same as if you declared the buffer global float* X - very likely crap.

Got completely confused on how to OpenCL data transfer

I'm learning OpenCL and attempt to utilize it on some low-latency scenario, so I'm really concerned with the memory transferring delay.
According to NVidia's OpenCL Best Practices Guide, and also by many other places, direct read/write on buffer object should be avoided. Instead, we should use map/unmap utility. In that guide, a demonstrative code is given like this:
cl_mem cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL);
cl_mem cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL);
unsigned char* cDataIn = (unsigned char*) clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL);
for(unsigned int i = 0; i < memSize; i++)
{
cDataIn[i] = (unsigned char)(i & 0xff);
}
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, szBuffBytes, cDataIn , 0, NULL, NULL);
In this code snippet, two buffer objects are generated explicitly, and a write-to-device operation is also explicitly called.
If my understanding is correct, when you call clCreateBuffer with CL_MEM_ALLOC_HOST_PTR OR CL_MEM_USE_HOST_PTR, the storage of buffer object is created in on host side, probably in DMA memory, and no storage is allocated on device side. So the above code actually creates two separated storage. If so:
What would happen if I call map buffer on cmDevBufIn, which do not have host side memory?
For CPU-integrated GPUs, there is no separate graphics memory. Especially, for new version of AMD APUs, the memory address is also homologus. So it seems create two buffer objects is not good. What is the best practice for integrated platforms?
Is there any way to write single lines of memory transfer code for different platforms? Or I must write several different suits of memory transfer codes to achieve best performance for Nvidia, AMD separate GPU, AMD old APU, AMD new APU and Intel HD graphics......
Unfortunately, it's different for each vendor.
NVIDIA claims their best bandwidth is when you use read/write buffer where the host memory is "pinned", which can be achieved by creating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it (I think your example is that). You should also compare that to just mapping and unmapping the device memory; their more recent drivers have gotten better at that.
With AMD you can just map/unmap the device buffer to get full speed. They also have a bunch of vendor-specific buffer flags which can make certain scenarios faster; you should study them but more importantly create benchmarks that try out everything to see what actually works best for your task.
With both discrete devices you should use separate command queues for the transfer operations so they can overlap with other (non-dependent) compute operations (look up various compute overlap examples). Furthermore, some higher end discrete GPUs can be downloading one buffer at the same time they are uploading another (using dual DMA engines), so you could be uploading one batch of work while you're computing another while you're downloading the result of a third. When written elegantly, this isn't even much more code than the strictly sequential version, but you have to use OpenCL events to synchronize between command queues. NVIDIA has a GTC talk you can watch that shows how to do this for video frames every 16 ms.
With AMD's APU and with Intel's Integrated Graphics, the map/unmap of the "device" buffer is "free" since it is in main memory. Don't use read/write buffer here or you'll be paying for unneeded transfers.

How to allocate memory for OpenCL result data?

What is the best way (in any sense) of allocating memory for OpenCL output data? Is there a solution what works reasonably with both discrete and integrated graphics?
As a super-simplified example, consider the following C++ (host) code:
std::vector<float> generate_stuff(size_t num_elements) {
std::vector<float> result(num_elements);
for(int i = 0; i < num_elements; ++i)
result[i] = i;
return result;
}
This can be implemented using an OpenCL kernel:
__kernel void gen_stuff(float *result) {
result[get_global_id(0)] = get_global_id(0);
}
The most straightforward solution is to allocate an array on both the device and host, then copy after kernel finished:
std::vector<float> generate_stuff(size_t num_elements) {
//global context/kernel/queue objects set up appropriately
cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY, num_elements*sizeof(float) );
clSetKernelArg(kernel, 0, sizeof(cl_mem), result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
std::vector<float> result(num_elements);
clEnqueueReadBuffer( queue, result_dev, CL_TRUE, 0, num_elements*sizeof(float), result_host.data(), 0, nullptr, nullptr );
return result;
}
This works reasonably with discrete cards. But with shared memory graphics, this means allocating double and an extra copy. How can one avoid this? One thing for sure, one should drop clEnqueuReadBuffer and use clEnqueueMapBuffer/clUnmapMemObject instead.
Some alternative scenarios:
Deal with an extra memory copy. Acceptable if memory bandwidth is not an issue.
Allocate a normal memory array on host, use CL_MEM_USE_HOST_PTR when creating the buffer. Should allocate with device-specific alignment - it is 4k with Intel HD Graphics: https://software.intel.com/en-us/node/531272 I am not aware if this is possible to query from the OpenCL environment. Results should be mapped (with CL_MAP_READ) after kernel finishes to flush caches. But when is it possible to unmap? Immediately after mapping is finished (it seems that does not work with AMD discrete graphics)? Deallocation of the array also requires modification of client code on Windows (due to _aligned_free being different from free).
Allocate using CL_MEM_ALLOCATE_HOST_PTR and map after kernel finishes. The cl_mem object has to be kept alive till the buffer is used (and probably even mapped?), so it requires polluting client code. Also this keeps the array in a pinned memory, what might be undesirable.
Allocate on device without CL_MEM_*_HOST_PTR, and map it after kernel finishes. This is the same thing as option 2 from deallocation's perspective, it's just avoiding pinned memory. (Actually, not sure if memory that is mapped isn't pinned.)
???
How are you dealing with this problem? Is there any vendor-specific solution?
You can do it with a single buffer, for both discrete and integrated hardware:
Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOCATE_HOST_PTR or vendor-specific (e.g., AMD) flags if it helps performance on certain platforms (read the vendor guidance and do benchmarking).
Enqueue your kernel that writes to the buffer.
clEnqueueMapBuffer with CL_MAP_READ and blocking. On discrete hardware this will copy over PCIe; on integrated hardware it's "free".
Use the results on the CPU using the returned pointer.
clEnqueueUnmapMemObject.
Depends on the use case:
For minimal memory footprint and IO efficiency: (Dithermaster's answer)
Create with CL_MEM_WRITE_ONLY flags, or maybe CL_MEM_ALLOCATE_HOST_PTR (depending on platforms). Blocking map for reading, use it, un-map it. This option requires that the data handler (consumer), knows about the CL existance, and unmaps it using CL calls.
For situations where you have to provide a buffer data to a third party (ie: libraries that need a C pointer, or class buffer, agnostic to CL):
In this case it may not be good to use mapped memory. Mapped memory access time is typically longer compared to normal CPU memory. So, instead of mapping, then memcpy() and the unmap; it is easier to directly perform a clEnqueueReadBuffer() to the CPU address where the output should be copied. In some vendor cases, this does not provide pinned memory and the copy is slow, so is better to revert to the option "1". But for some other cases where there is no pinned memory I found it faster.
Any other different condition for reading the kernel output? I think not...

how does clEnqueueMapBuffer work

Could anybody talk about the function clEnqueueMapBuffer work mechanism. Actually I mainly concern what benefits on speed I can get from this function over clEnqueueRead/WriteBuffer.
PS:
Does clEnqueueMapBuffer/clEnqueueMapImage also alloc a buffer from the CPU automatically?
If yes.
I want to manage my CPU buffer. I mean I malloc a big buffer first. Then if I need buffer. I can allocate it from the big buffer which I allocate first. How to make the clEnqueueMapBuffer/clEnqueueMapImage allocate buffer from the big buffer.
clEnqueueMapBuffer/clEnqueueMapImage
OpenCL mechanism for accessing memory objects instead of using clEnqueueRead/Write. we can map a memory object on a device to a memory region on host. Once we have mapped the object we can read/write or modify anyway we like.
One more difference between Read/Write buffer and clEnqueueMapBuffer is the map_flags argument. If map_flags is set to CL_MAP_READ, the mapped memory will be read only, and if it is set as CL_MAP_WRITE the mapped memory will be write only, if you want both read + write then make the flag CL_MAP_READ | CL_MAP_WRITE.
Compared to read/write fns, memory mapping requires three step process>
Map the memory using clEnqueueMapBuffer.
transfer the memory from device to/from host via memcpy.
Unmap using clEnqueueUnmapObject.
It is common consensus that memory mapping gives significant improvement in performance compared to regular read/write, see here: what's faster - AMD devgurus forum link
If you want to copy a image or rectangular region of image then you can make use of clEnqueueMapImage call as well.
References:
OpenCL in Action
Heterogeneous computing with OpenCL
Devgurus forum
No, the map functions don't allocate memory. You'd do that in your call to clCreateBuffer.
If you allocate memory on the CPU and then try to use it, it will need to be copied to GPU accessible memory. To get memory accessible by both it's best to use CL_MEM_ALLOC_HOST_PTR
clCreateBuffer(context, flags, size, host_ptr, &error);
context - Context for the device you're using.
flags - CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE
size - Size of the buffer in bytes, usually N * sizeof(data type)
host_ptr - Can be NULL or 0 meaning we have no existing data. You could add CL_MEM_COPY_HOST_PTR to flags and pass in a pointer to the values you want copied to the buffer. This would save you having to copy via the mapped pointer. Beneficial if the values won't change.

OpenCL allcoation flag CL_MEM_USE_HOST_PTR usage not referencing my pointer

I was trying to use the flag CL_MEM_USE_HOST_PTR with the OpenCL function clCreateBuffer() in order to avoid multiple memory allocation. After a little research (reverse engineering), I found that the framework calls the operating system allocation function no matter what flag I use.
Maybe my concept is wrong? But from documentation it's supposed to use DMA to access the host memory instead of allocating new memory.
I am using opencl 1.2 on an Intel device (HD5500)
On Intel GPUs ensure the allocated host pointer is page aligned and page length*. In fact I think the buffer size can actually be an even number of cache lines, but I always round up.
Use something like:
void *host_ptr = _aligned_malloc(align_to(size,4096),4096));
Here's a good article for this:
In the "Key Takeaways".
If you already have the data and want to load the data into an OpenCL
buffer object, then use CL_MEM_USE_HOST_PTR with a buffer allocated at
a 4096 byte boundary (aligned to a page and cache line boundary) and a
total size that is a multiple of 64 bytes (cache line size).
You can also use CL_MEM_ALLOC_HOST_PTR and let the driver handle the allocation. But to get at the pointer you'll have to map and unmap it (but at no copy cost).

Resources