What is the difference between a buffer object and an image buffer object in OpenCL? It is evident that an image buffer is faster, but to what extent? Where should each be used?
An OpenCL Buffer is a 1D, 2D, or 3D array in global memory. It is an abstract object that can be addressed through a pointer. Buffers are read-only, write-only, or read-write.
An Image Buffer represents GPU texture memory. It is an array of pixels that can be accessed via built-in functions taking pixel x,y,z coordinates. There is no pointer access to image pixels on the GPU.
The hardware treats these two types of buffers differently. An OpenCL Buffer lives in either host RAM or GPU RAM and is transferred between the two. An OpenCL Image Buffer has analogous characteristics, but with differences: an Image Buffer is either read-only or write-only within a kernel. For read-only Image Buffers, the GPU can cache copies of the image pixels in every compute unit (= 32 or 64 ALUs). Typically the cache size is 8K (bytes or pixels?).
Also, since image pixels cannot be accessed via a pointer on the GPU, their mapping from x,y,z coordinates to physical addresses can be implemented in several ways. One way is Z-ordering, which clusters pixels in two dimensions so that pixels neighboring each other in the x and y directions are stored near each other in memory. This helps speed up access to neighboring pixels in image filters.
OpenCL Buffers are used for general arrays and especially for arrays that are read-write,
or double precision.
OpenCL Image Buffers are used for image processing or other signal-processing algorithms where the input image/signal can be treated as read-only.
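To make the access-pattern difference concrete, here is a sketch of OpenCL C device code (kernel names and arguments are illustrative, edge handling is omitted, and it needs an OpenCL runtime to compile): a buffer is indexed through a pointer, while an image is sampled with read_imagef by coordinate.

```c
/* Buffer version: global memory dereferenced through a pointer. */
__kernel void blur_buffer(__global const float *src,
                          __global float *dst, int w) {
    int x = get_global_id(0), y = get_global_id(1);
    dst[y * w + x] = 0.5f * (src[y * w + x] + src[y * w + x + 1]);
}

/* Image version: no pointer; pixels are fetched by coordinate
   through a sampler, which is what enables the texture cache. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void blur_image(__read_only image2d_t src,
                         __write_only image2d_t dst) {
    int2 p = (int2)(get_global_id(0), get_global_id(1));
    float4 v = 0.5f * (read_imagef(src, smp, p) +
                       read_imagef(src, smp, p + (int2)(1, 0)));
    write_imagef(dst, p, v);
}
```

Note how the image kernel never sees an address: the x,y-to-memory mapping is entirely the hardware's business, which is what allows layouts like Z-ordering.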
I am implementing a solution using OpenCL, and I want to do the following: say you have a large array of data that you want to copy to the GPU once, and then have many kernels process batches of it and store the results in their own output buffers.
The actual question: which way is faster? Enqueue each kernel with just the portion of the array it needs, or pass the whole array up front and let each kernel (in the same context) process its required batch? They would share the same address space, so each could map the array concurrently. The array is read-only, but it is not constant, as it changes every time I execute the kernel(s)... (so I could cache it using a global memory buffer).
Also, if the second way is actually faster, could you point me in the direction of how this could be implemented? I haven't found anything concrete yet (although I am still searching :)).
Cheers.
I normally use the second method. Sharing the memory is easy: just pass the same buffer to each kernel. I do this in my real-time ray tracer. I render with one kernel and post-process (image process) with another.
Using the C++ bindings it looks something like this
cl_input_mem = cl::Buffer(context, CL_MEM_READ_WRITE, sizeof(cl_uchar4)*npixels, NULL, &err); // read-write: the render kernel writes it, the post-process kernel reads it
kernel_render.setArg(0, cl_input_mem);
kernel_postprocess.setArg(0, cl_input_mem);
If you want one kernel to operate on a different segment of the array/memory you can pass an offset value to the kernel arguments and add that to e.g. the global memory pointer for each kernel.
I would use the first method if the array (actually the sum of all buffers, including output) does not fit in device memory. Another reason to use the first method is if you're running on multiple devices. In my ray tracer I use the first method when I render on multiple devices. For example, I have one GTX 580 render the upper half of the screen and the other GTX 580 render the lower half (actually I do this dynamically, so one device may render 30% while the other renders 70%, but that's beside the point). I have each device render only its fraction of the output, and then I assemble the output on the CPU. With PCIe 3.0 the transfer back and forth between CPU and GPU (multiple times) has a negligible effect on the frame rate, even for 1920x1080 images.
To get full-speed transfers with OpenCL, it is necessary to use pinned memory on the host side. Such memory will never be paged out, and it can be obtained by calling clCreateBuffer() with the CL_MEM_ALLOC_HOST_PTR flag and then clEnqueueMapBuffer().
But one may know an object is already in pinned memory (because it was created with, for example, those functions but in another context) and therefore want to use clEnqueueReadBuffer()/clEnqueueWriteBuffer() at full speed. Unfortunately, if the memory was not pinned in the current context, the object is not considered as pinned and the data-rate is not maximum.
How can I tell OpenCL that an object is already in pinned memory?
My conclusion on this question is that the OpenCL SDKs must maintain their own internal flags recording whether they allocated a buffer themselves, and therefore whether it is safe to assume it is pinned. They seem to conservatively assume that an externally allocated buffer is neither pinned nor aligned.
I tried to match the benchmark's bandwidth for buffers allocated with clCreateBuffer against buffers using externally allocated memory (either from clCreateBuffer in a different context, or manually pinned and aligned memory), and the former always seems to perform better on both AMD and NVIDIA.
Could anybody explain how the function clEnqueueMapBuffer works? Mainly I am interested in what speed benefits I can get from this function compared to clEnqueueRead/WriteBuffer.
PS:
Do clEnqueueMapBuffer/clEnqueueMapImage also allocate a buffer on the CPU automatically?
If yes: I want to manage my own CPU memory. That is, I malloc one big buffer first, and whenever I need a buffer I sub-allocate it from that big buffer. How can I make clEnqueueMapBuffer/clEnqueueMapImage allocate from the big buffer?
clEnqueueMapBuffer/clEnqueueMapImage
This is the OpenCL mechanism for accessing memory objects directly, instead of using clEnqueueRead/WriteBuffer. We can map a memory object on a device to a memory region on the host. Once we have mapped the object, we can read, write, or modify it any way we like.
One more difference between Read/Write buffer and clEnqueueMapBuffer is the map_flags argument. If map_flags is set to CL_MAP_READ, the mapped memory will be read only, and if it is set as CL_MAP_WRITE the mapped memory will be write only, if you want both read + write then make the flag CL_MAP_READ | CL_MAP_WRITE.
Compared to the read/write functions, memory mapping is a three-step process:
Map the memory using clEnqueueMapBuffer.
Transfer the data to or from the host via memcpy.
Unmap using clEnqueueUnmapMemObject.
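Put together, the three steps might look roughly like this on the host (a sketch only: it requires an OpenCL runtime to actually run, error handling is minimal, and queue, buffer, host_data, and size are assumed to come from earlier setup):

```c
#include <CL/cl.h>
#include <string.h>

/* Sketch: write host_data into an existing OpenCL buffer via mapping. */
void write_via_map(cl_command_queue queue, cl_mem buffer,
                   const void *host_data, size_t size) {
    cl_int err;
    /* 1. Map the buffer into host address space (blocking). */
    void *mapped = clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                      CL_MAP_WRITE, 0, size,
                                      0, NULL, NULL, &err);
    if (err != CL_SUCCESS || mapped == NULL)
        return;
    /* 2. Transfer the data with an ordinary memcpy. */
    memcpy(mapped, host_data, size);
    /* 3. Unmap so the device sees the update. */
    clEnqueueUnmapMemObject(queue, buffer, mapped, 0, NULL, NULL);
}
```

Reading goes the same way with CL_MAP_READ and the memcpy direction reversed.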
The general consensus is that memory mapping gives a significant performance improvement over regular read/write; see here: what's faster - AMD devgurus forum link
If you want to copy an image or a rectangular region of an image, you can use the clEnqueueMapImage call as well.
References:
OpenCL in Action
Heterogeneous computing with OpenCL
Devgurus forum
No, the map functions don't allocate memory. You'd do that in your call to clCreateBuffer.
If you allocate memory on the CPU and then try to use it, it will need to be copied to GPU-accessible memory. To get memory accessible by both, it's best to use CL_MEM_ALLOC_HOST_PTR:
clCreateBuffer(context, flags, size, host_ptr, &error);
context - Context for the device you're using.
flags - CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE
size - Size of the buffer in bytes, usually N * sizeof(data type)
host_ptr - Can be NULL or 0 meaning we have no existing data. You could add CL_MEM_COPY_HOST_PTR to flags and pass in a pointer to the values you want copied to the buffer. This would save you having to copy via the mapped pointer. Beneficial if the values won't change.
I have an application which manipulates high resolution images (something around 100+ megapixels), and I'm having some memory issues. When the BitmapData object is created, it allocates memory to store this image. The problem, is that I already have a ByteArray with this image's pixels (which have something around 400+ MB), so when the BitmapData is created, it allocates memory to store the same data that I have on the ByteArray.
After its creation, I can set the pixels from the ByteArray to the BitmapData and free the ByteArray. But this memory peak is, sometimes, causing the runtime to raise an exception, telling that the system is out of memory.
Is there any way to tell the BitmapData to use my own ByteArray? Or any other solution that I don't have to use double the memory that I need?
In case anyone needs this, here's what I did:
I get the ByteArray, which contains the pixels of the image, from a socket. I read these pixels from the socket in small parts, so instead of waiting for the whole image to be loaded, I put the small parts directly into the BitmapData as they arrive. This prevents the application from allocating double the memory I actually need.
I was trying to use the flag CL_MEM_USE_HOST_PTR with the OpenCL function clCreateBuffer() in order to avoid multiple memory allocation. After a little research (reverse engineering), I found that the framework calls the operating system allocation function no matter what flag I use.
Maybe my concept is wrong? But from documentation it's supposed to use DMA to access the host memory instead of allocating new memory.
I am using OpenCL 1.2 on an Intel device (HD 5500).
On Intel GPUs, ensure the allocated host pointer is page aligned and the allocation is a whole number of pages in length*. In fact I think the buffer size can actually be an even number of cache lines, but I always round up to a full page.
Use something like:
void *host_ptr = _aligned_malloc(align_to(size, 4096), 4096);
Here's a good article for this:
In the "Key Takeaways".
If you already have the data and want to load the data into an OpenCL buffer object, then use CL_MEM_USE_HOST_PTR with a buffer allocated at a 4096 byte boundary (aligned to a page and cache line boundary) and a total size that is a multiple of 64 bytes (cache line size).
You can also use CL_MEM_ALLOC_HOST_PTR and let the driver handle the allocation. But to get at the pointer you'll have to map and unmap it (but at no copy cost).