Can access to image memory be coalesced, like buffer memory? - opencl

I am thinking of converting my kernel from a buffer to a 2D image.
Suppose 16 threads in a workgroup access 16 consecutive pixels
in one row of an image. Is this access coalesced?
Also, what is the best access pattern to read in an (n x m)
rectangular strip, where m is 8 or 16?

On a GPU, OpenCL images are read through the texture cache. Details are implementation-dependent and not usually documented, but typically they are stored in tiles for locality of reference. So if adjacent work items are accessing nearby pixels, you have a good chance the read will be fast.
Because image reads go through the texture cache, the term "coalesced" applies only to buffer reads.
Compared to coalesced buffer reads, images may be slightly slower; however, compared to un-coalesced buffer reads that still have some amount of locality, they can be faster.
A good example is a Gaussian blur decomposed into a vertical pass and a horizontal pass. With buffers, the vertical pass in columns gets coalesced reads, but the horizontal pass does not, so it is very slow. It is slow enough that the usual examples add a transpose step, which uses shared local memory with coalesced reads and writes, so that the vertical-pass kernel can be reused for the horizontal pass, followed by a transpose back. With images you can skip the transpose, because the vertical and horizontal passes run at the same speed (slightly slower than coalesced buffer reads, but much faster than uncoalesced buffer reads). Overall it is faster because the two transpose kernels are gone.
I hope the part about tiles, texture caching, and locality of reference helps answer your question about access patterns.
Caveat: There are ways of creating an image from a buffer, but the memory layout is then linear and not tiled so the above is out the window (you can expect horizontally adjacent reads to be cached but not vertically adjacent reads).
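To make the difference between a linear and a tiled layout concrete, here is a small CPU-side C sketch that counts how many cache lines a block of pixel reads would touch under each layout. The 4x4 tile shape and the 16-pixel cache line are assumptions chosen only for illustration; real GPUs generally do not document these, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameters, chosen only for illustration: real GPUs do not
   document their tile shape or cache-line size. */
enum { TILE_W = 4, TILE_H = 4, PIXELS_PER_LINE = 16, MAX_LINES = 4096 };

/* Element index of pixel (x, y) in a plain row-major (linear) layout. */
static size_t linear_index(size_t x, size_t y, size_t width) {
    return y * width + x;
}

/* Element index of pixel (x, y) when the image is stored as row-major
   TILE_W x TILE_H tiles, each tile contiguous in memory.
   Assumes width is a multiple of TILE_W. */
static size_t tiled_index(size_t x, size_t y, size_t width) {
    size_t tiles_per_row = width / TILE_W;
    size_t tile = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    size_t within = (y % TILE_H) * TILE_W + (x % TILE_W);
    return tile * (TILE_W * TILE_H) + within;
}

/* Count the distinct cache lines touched when reading a w x h block of
   pixels whose top-left corner is (x0, y0). */
static int lines_touched(int tiled, size_t x0, size_t y0,
                         size_t w, size_t h, size_t width) {
    unsigned char seen[MAX_LINES] = {0};
    int count = 0;
    for (size_t y = y0; y < y0 + h; ++y) {
        for (size_t x = x0; x < x0 + w; ++x) {
            size_t idx = tiled ? tiled_index(x, y, width)
                               : linear_index(x, y, width);
            size_t line = idx / PIXELS_PER_LINE;
            assert(line < MAX_LINES);   /* keep the demo image small */
            if (!seen[line]) { seen[line] = 1; ++count; }
        }
    }
    return count;
}
```

Under these assumed parameters and a 64-pixel-wide image, 16 pixels in a row touch 1 line when linear and 4 when tiled, while 16 pixels in a column touch 16 lines when linear and only 4 when tiled. That is the pattern described above: tiling gives up a little on the best case in exchange for similar cost in both directions.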

Related

OpenCL doesn't allow late initialization of a variable in constant space

I want to generate a matrix that will be read by many threads after it is generated, so I declared it at program scope. It has to be constant, so I assign the values only once.
1) Why does OpenCL require initialization at the point of declaration?
2) How can I fix this issue?
1) Because you can't tell the GPU which elements are written by which threads. Constants are prepared by the preprocessor using the scalar engine, not the parallel one. The parallel engine would need N x N synchronizations to achieve that, where N is the number of threads participating in building the constant buffer.
2-a) If you want to work with constant memory, prepare a plain buffer (__global, not __constant) in one kernel, then pass it as a constant buffer to the next kernel (the engine puts it in the constant memory space). Constant space is small, so the matrix should be small. This needs two kernels, which means extra kernel-launch overhead.
2-b) If cache performance is enough, just use a buffer. Then it can all be done in a single kernel: the first thread group prepares the matrix, and the remaining ones compute with it, not starting until the first group signals them using atomic functions.
2-c) If local memory is bigger than constant memory, you can use local memory and have each compute unit build the matrix for itself. That should take the same number of cycles (maybe fewer if you use all cores) and is probably faster than constant memory. It needs no communication between thread groups, so it would be fast.
2-d) If the matrix is big and you need most of the bandwidth, distribute it across all memory spaces. Example: put 1/4 of the matrix in constant memory (5x bandwidth), 1/4 in local memory (10x bandwidth), 1/4 in global memory (2x from cache performance), and the remaining data in instruction space (the instructions themselves), so that multiple threads work on 4 different places concurrently, using all of the bandwidth (constant + local + cache + instruction cache).

Append OpenCL result to list / Reduce solution room

I have an OpenCL kernel with multiple work items. Let's assume for discussion that I have a 2-D workspace with x*y elements working on an equally sized, but sparse, array of input elements. Few of these input elements produce a result that I want to keep; most don't. I want to enqueue another kernel that takes only the kept results as input.
Is it possible in OpenCL to append results to some kind of list to pass them as input to another Kernel or is there a better idea to reduce the volume of the solution space? Furthermore: Is this even a good question to ask with the programming model of OpenCL in mind?
What I would do if the amount of result data is a small percentage (i.e. 0-10%) is use local atomics and global atomics, with a global counter.
Data interface between kernel 1 and kernel 2:
int counter; // used by the atomics to know where to write
data_type results[counter]; // used to store the results
Kernel 1:
Create a kernel function that does the operation on the data.
Work items that do produce a result:
Save the result to local memory, using local atomics on a local counter to ensure no data races occur.
Use work-item 0 of each work-group to save all the local results back to global memory using global atomics.
Kernel 2:
Work items whose ID is lower than "counter" do work; the others just return.
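The counter-and-list interface above can be sketched on the CPU with C11 atomics. This is a simplified illustration, not the exact scheme from the answer: it reserves slots directly on the global counter and leaves out the local-memory staging step. The names ResultList, produces_result, kernel1_work_item, kernel2_work_item and run_kernel1 are hypothetical, and the loop in run_kernel1 only stands in for the work-items that a real device would run in parallel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared between the two stages: an append counter and the result list. */
typedef struct {
    atomic_uint counter;   /* number of results appended so far */
    int *results;          /* storage for the kept results */
    size_t capacity;       /* number of elements results[] can hold */
} ResultList;

/* Hypothetical predicate standing in for "this input produces a result". */
static int produces_result(int value) { return value != 0; }

/* Body of one "kernel 1" work-item: reserve a slot with an atomic
   increment, then write the result into that slot. */
static void kernel1_work_item(size_t gid, const int *input, ResultList *out) {
    if (!produces_result(input[gid])) return;
    unsigned slot = atomic_fetch_add(&out->counter, 1u);
    assert(slot < out->capacity);
    out->results[slot] = input[gid];
}

/* Body of one "kernel 2" work-item: IDs at or beyond counter return early. */
static int kernel2_work_item(size_t gid, ResultList *list) {
    if (gid >= atomic_load(&list->counter)) return 0;
    return list->results[gid] * 2;   /* placeholder for the real work */
}

/* Serial driver that visits every work-item ID once and returns the
   number of results that were appended. */
static unsigned run_kernel1(const int *input, size_t n, ResultList *out) {
    atomic_store(&out->counter, 0u);
    for (size_t gid = 0; gid < n; ++gid)
        kernel1_work_item(gid, input, out);
    return atomic_load(&out->counter);
}
```

On a real device the order of the appended results is not deterministic, because the work-items race for slots; only the count and the set of values are. Staging through local memory first, as described above, would reduce the number of global atomic operations to one per work-group.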

Reduce 1024 images to one

I read the paper about reducing a 1D array to one value in OpenCL ( http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/ ) and I understood the concept of associative operators. Extending this concept to ONE 2D array should also be possible.
But my problem is somewhat different: I have ~1000 images of 256x256 pixels with 16 bits each, and I would like to sum all these images to get the average image of them all. The usual GPU should have enough memory (~130 MB) to perform this task, but I don't really see how to implement the kernel.
Just as the 1D problem extends to 2D, it can also extend to 3D (which is what you have: 1000x256x256).
Exactly the same principles would apply:
1. Try to do as much work in parallel as you can without contention with other work groups.
2. Do the reduction in stages so each can be parallel.
You're likely going to be bandwidth limited, churning through 131 MB of memory, but that's not really a problem. Just write the kernels to do coalesced reads for maximum performance.
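One arrangement that follows these principles, shown here as a CPU-side C sketch rather than an actual kernel, is to give each work-item one pixel position and let it walk through all the images. The 65,536 pixel positions are independent, so there is no contention, and adjacent work-items read adjacent pixels of the same image. The function name average_images and the image-after-image memory layout are assumptions for illustration; a staged reduction across images is equally possible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average n_images images of n_pixels 16-bit pixels each.
   The images are stored one after another: images[i * n_pixels + p].
   Each iteration of the outer loop corresponds to one work-item. */
static void average_images(const uint16_t *images, size_t n_images,
                           size_t n_pixels, uint16_t *average) {
    /* With at most 65536 images, a sum of 16-bit values fits in 32 bits. */
    assert(n_images > 0 && n_images <= 65536);
    for (size_t p = 0; p < n_pixels; ++p) {
        uint32_t sum = 0;
        for (size_t i = 0; i < n_images; ++i)
            sum += images[i * n_pixels + p];
        average[p] = (uint16_t)(sum / n_images);   /* rounds down */
    }
}
```

The accumulator has to be wider than the 16-bit pixels; summing 1000 such values in a 16-bit variable would overflow.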

OpenCL clEnqueueCopyImageToBuffer with stride

I have an OpenCL buffer containing a 2D image.
This image has a stride bigger than its width.
I need to make an OpenCL image from this buffer.
The problem is that the function clEnqueueCopyImageToBuffer does not take a stride as an input parameter.
Is it possible to make an OpenCL image from an OpenCL buffer (with a stride bigger than the width) with only one copy, or faster?
One way to solve this problem is to write my own kernel, but maybe there is a neater solution?
Unfortunately, there is no method in the OpenCL specification which allows you to directly create an image from a buffer when the buffer data has a stride not equal to the image width. The most efficient solution would probably be to write your own kernel to do this.
The simplest solution that doesn't involve writing your own kernel would be to copy one line at a time with clEnqueueCopyBufferToImage. If your image is big enough, it might be that the performance of this technique would be reasonably comparable to the hand-written kernel, but you would have to try it out to see.
I didn't include the clEnqueueCopyBufferRect approach in my original answer because my first instinct was that the extra copy would kill performance. However, the comments above got me thinking about it further, and I was interested enough to implement all three approaches to see what the performance was actually like.
As I suspected, the fastest approach was to implement a kernel to do this directly. However, copying the data over line-by-line was significantly slower than I had anticipated. Copying the buffer into an intermediate buffer with clEnqueueCopyBufferRect is actually a pretty good compromise between performance and simplicity, although it is still a couple of times slower than the kernel implementation.
The source code for this little experiment can be found here. I was copying a 1020x1020 image with a stride of 1024, and the timings are averaged over 8 runs.
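The index arithmetic that a hand-written copy kernel has to perform is small. The following is a CPU-side C sketch of it, not the code from the experiment: copy_pixel stands in for the body of one work-item, and copy_strided for the 2D launch over the image. Both names are hypothetical, and a real kernel would write to the image with a function such as write_imageui instead of storing into a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Work of a single work-item at (x, y): read from the padded source using
   its row stride, and write to the tightly packed destination. */
static void copy_pixel(const uint8_t *src, size_t src_stride,
                       uint8_t *dst, size_t width, size_t x, size_t y) {
    dst[y * width + x] = src[y * src_stride + x];
}

/* Serial stand-in for launching one work-item per pixel. */
static void copy_strided(const uint8_t *src, size_t src_stride,
                         uint8_t *dst, size_t width, size_t height) {
    assert(src_stride >= width);   /* stride counts elements per source row */
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            copy_pixel(src, src_stride, dst, width, x, y);
}
```

For the 1020x1020 image with a stride of 1024 mentioned above, the only difference from a plain copy is that the source row offset uses 1024 while the destination uses 1020.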

OpenCL image histogram

I'm trying to write a histogram kernel in OpenCL to compute 256 bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:
const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;
__kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
                               __global int* gOutput, __global int* bOutput)
{
    int2 coords = {get_global_id(0), get_global_id(1)};
    float4 sample = read_imagef(input, mSampler, coords);
    uchar rbin = floor(sample.x * 255.0f);
    uchar gbin = floor(sample.y * 255.0f);
    uchar bbin = floor(sample.z * 255.0f);
    rOutput[rbin]++;
    gOutput[gbin]++;
    bOutput[bbin]++;
}
When I run it on a 2100 x 894 image (1,877,400 pixels) I tend to see only around 1,870,000 total values recorded when I sum up the histogram values for each channel. It's also a different number each time. I did expect this, since once in a while two work-items probably grab the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).
The 1,870,000 output is for a {1,1} workgroup size (which is what seems to get set by default if I don't specify otherwise). If I force a larger workgroup size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in workgroup size). This seemed strange to me, but I'm guessing what happens is that all of the work items in the group increment the output array value at the same time, and so it just counts as a single increment?
Anyways, I've read in the spec that OpenCL has no global memory synchronization, only synchronization within local workgroups using their __local memory. The histogram example by nVidia breaks up the histogram workload into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram afterwards. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...
Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).
Thanks!
As stated before, you are writing to shared memory in an unsynchronized, non-atomic way. This leads to errors. If the picture is big enough, I have a suggestion:
Make your work group one-dimensional, over columns or rows. Have each work-item sum up the histogram for its column or row, and afterwards add it to the global histogram with atomic operations such as atom_inc. This does most of the summing in private memory, which is much faster, and reduces the number of atomic ops.
If you work in two dimensions, you can do it on parts of the picture.
[EDIT:]
I think, I have a better answer: ;-)
Have a look at: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram
They have an interesting implementation there...
Yes, you're writing to shared memory from many work-items at the same time, so you will lose elements if you don't do the updates in a safe way (or worse; just don't do it). The increase in group size actually increases the utilization of your compute device, which in turn increases the likelihood of conflicts. So you end up losing more updates.
However, you seem to be confusing synchronization (ordering thread execution) and shared memory updates (which typically require either atomic operations, or code synchronization and memory barriers, to make sure the memory updates are visible to other threads that are synchronized).
The synchronization+barrier approach is not particularly useful for your case (and, as you noted, is not available for global synchronization anyway; the reason is that two thread-groups may never run concurrently, so trying to synchronize them is nonsensical). It's typically used when all threads start by generating a common data-set, and then all consume that data-set with a different access pattern.
In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168). However, note that updating a highly contended memory address (say, because you have thousands of threads trying all to write to only 256 ints) is likely to yield poor performance. All the hoops typical histogram code goes through are there to reduce the contention on the histogram data.
You can check:
The histogram example from AMD Accelerated Parallel Processing (APP) SDK.
Chapter 14 - Image Histogram of OpenCL Programming Guide book (ISBN-13: 978-0-321-74964-2).
GPU Histogram - Sample code from Apple
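The partial-histogram idea that these samples share can be sketched on the CPU with C11 atomics. This is an illustration under assumptions, not the code from any of the samples above: each call to histogram_group plays the role of one work-group that counts into its own partial histogram and then merges it into the global one with atomic adds. The names bin_of, histogram_group and histogram are hypothetical. The last group is simply shorter, so the input size does not need to be padded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { BINS = 256 };

/* Map a normalized channel value to a bin, clamping to [0, 255]. */
static unsigned bin_of(float v) {
    if (!(v > 0.0f)) return 0;        /* also catches NaN */
    if (v >= 1.0f) return BINS - 1;
    return (unsigned)(v * 255.0f);
}

/* One "work-group": count into a partial histogram that no other group
   touches, then merge it into the global histogram with atomic adds. */
static void histogram_group(const float *values, size_t count,
                            atomic_uint *global_hist) {
    unsigned partial[BINS] = {0};
    for (size_t i = 0; i < count; ++i)
        ++partial[bin_of(values[i])];
    for (unsigned b = 0; b < BINS; ++b)
        if (partial[b])               /* skip empty bins to save atomics */
            atomic_fetch_add(&global_hist[b], partial[b]);
}

/* Serial stand-in for launching the groups; global_hist must start zeroed. */
static void histogram(const float *values, size_t n, size_t group_size,
                      atomic_uint *global_hist) {
    assert(group_size > 0);
    for (size_t start = 0; start < n; start += group_size) {
        size_t left = n - start;
        size_t count = left < group_size ? left : group_size;
        histogram_group(values + start, count, global_hist);
    }
}
```

Each group issues at most 256 atomic operations on the global histogram, regardless of how many pixels it processed, which is what keeps contention on the 256 shared counters low.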

Resources