Reduce 1024 images to one - OpenCL

I read the paper about reducing a 1D array to one value in OpenCL ( http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/ ) and I understood the concept of associative operators. Extending this concept to one 2D array should also be possible.
But my problem is somewhat different: I have ~1000 images of 256x256 pixels with 16 bits each, and I would like to sum all these images to finally obtain the average image of them all. The usual GPU should have enough memory (~130 MB) to perform this task, but I don't really see how to implement the kernel.

Just as the 1D problem extends to 2D, it can also extend to 3D (which is what you have: 1000x256x256).
Exactly the same principles would apply:
1. Try to do as much work in parallel as you can without contention with other work groups.
2. Do the reduction in stages so each can be parallel.
You're likely going to be bandwidth-limited, churning through 131 MB of memory, but that's not really a problem. Just write the kernels to do coalesced reads for maximum performance.
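A minimal sketch of such a kernel, assuming the images are packed back-to-back in a single buffer (the kernel name, argument names, and layout are my assumptions, not from the question):

    // Sketch only: one work item per output pixel; each work item reduces
    // over the ~1000 images for its pixel. Assumes all images are packed
    // contiguously in one buffer of numImages * pixelsPerImage ushorts.
    __kernel void averageImages(__global const ushort *images,
                                __global float *average,
                                const int numImages,        // ~1000
                                const int pixelsPerImage)   // 256 * 256
    {
        int pixel = get_global_id(0);   // flat index into the 256x256 output
        float sum = 0.0f;
        for (int i = 0; i < numImages; i++) {
            // Consecutive work items read consecutive addresses: coalesced.
            sum += (float)images[i * pixelsPerImage + pixel];
        }
        average[pixel] = sum / (float)numImages;
    }

Launched with a global size of 65536, this already gives plenty of work groups; if that isn't enough parallelism, split the 1000 images into chunks and merge the partial sums in a second stage, exactly as in the staged 1D reduction.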

Related

OpenCL clEnqueueCopyImageToBuffer with stride

I have an OpenCL buffer containing a 2D image.
This image has a stride bigger than its width.
I need to make an OpenCL image from this buffer.
The problem is that the function clEnqueueCopyBufferToImage does not take a stride as an input parameter.
Is it possible to make an OpenCL image from an OpenCL buffer (with stride bigger than width) with only one copy, or faster?
One way to solve this problem is to write my own kernel, but maybe there is a neater solution?
Unfortunately, there is no method in the OpenCL specification which allows you to directly create an image from a buffer when the buffer data has a stride not equal to the image width. The most efficient solution would probably be to write your own kernel to do this.
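A minimal sketch of such a kernel, assuming a float4/RGBA-style pixel layout (the kernel and parameter names are hypothetical):

    // Copy one pixel per work item from the strided buffer into the image.
    __kernel void stridedBufferToImage(__global const float4 *src,
                                       write_only image2d_t dst,
                                       const int strideInPixels)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        // Read from the strided buffer, write to the tightly packed image.
        float4 pixel = src[y * strideInPixels + x];
        write_imagef(dst, (int2)(x, y), pixel);
    }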
The simplest solution that doesn't involve writing your own kernel would be to copy one line at a time with clEnqueueCopyBufferToImage. If your image is big enough, it might be that the performance of this technique would be reasonably comparable to the hand-written kernel, but you would have to try it out to see.
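A sketch of the line-by-line approach (the host-side variable names are mine; queue, buffer, and image are assumed to exist, with width/height/stride in pixels):

    // Copy each row of the strided buffer into the matching image row.
    for (size_t row = 0; row < height; ++row) {
        size_t srcOffset    = row * stride * bytesPerPixel;
        size_t dstOrigin[3] = { 0, row, 0 };
        size_t region[3]    = { width, 1, 1 };   // one row of pixels
        clEnqueueCopyBufferToImage(queue, buffer, image, srcOffset,
                                   dstOrigin, region, 0, NULL, NULL);
    }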
I didn't include the clEnqueueCopyBufferRect approach in my original answer because my first instinct was that the extra copy would kill performance. However, the comments above got me thinking about it further, and I was interested enough to implement all three approaches to see what the performance was actually like.
As I suspected, the fastest approach was to implement a kernel to do this directly. However, copying the data over line-by-line was significantly slower than I had anticipated. Copying the buffer into an intermediate buffer with clEnqueueCopyBufferRect is actually a pretty good compromise between performance and simplicity, although it is still a couple of times slower than the kernel implementation.
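The intermediate-copy approach might look roughly like this (again a sketch with assumed names; note that clEnqueueCopyBufferRect pitches and the x component of the region are in bytes):

    // 1) Un-stride the data into a tightly packed intermediate buffer.
    size_t srcOrigin[3] = { 0, 0, 0 }, dstOrigin[3] = { 0, 0, 0 };
    size_t region[3]    = { width * bytesPerPixel, height, 1 };
    clEnqueueCopyBufferRect(queue, stridedBuf, packedBuf,
                            srcOrigin, dstOrigin, region,
                            stride * bytesPerPixel, 0,  // src row/slice pitch
                            width * bytesPerPixel, 0,   // dst row/slice pitch
                            0, NULL, NULL);
    // 2) Then create the image from packedBuf with clEnqueueCopyBufferToImage.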
The source code for this little experiment can be found here. I was copying a 1020x1020 image with a stride of 1024, and the timings are averaged over 8 runs.

OpenCl and power iteration method (eigendecomposition)

I'm new to OpenCL and I'm trying to implement the power iteration method (described over here) for matrix sizes over 100000x100000!
Actually, I have no idea how to implement this.
It's because work groups have the restriction CL_DEVICE_MAX_WORK_GROUP_SIZE (so I can't make one work group with 1000000 work items).
But on each step of the iteration I need to synchronize and normalize the vector.
1) So is it possible to do all the calculations inside one kernel? (I think the answer is no if the matrix size is larger than CL_DEVICE_MAX_WORK_GROUP_SIZE)
2) Can I make a "while" loop in the host code? And is it still profitable to use the GPU in this case?
something like:
while (condition)
{
    kernel calling
    synchronization
}
2: Yes, you can make a while loop in host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining two kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (matrix multiplication and vector normalization) I'd personally start with two different kernels that are invoked from the host in a while loop, as sketched below.
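A host-side sketch of that loop, with hypothetical kernel and buffer names (reading back a small convergence metric is just one possible stopping scheme):

    // One power iteration step per pass: y = A*x, then normalize x = y/||y||.
    while (!converged) {
        clEnqueueNDRangeKernel(queue, matVecKernel,    1, NULL,
                               &globalSize, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, normalizeKernel, 1, NULL,
                               &globalSize, NULL, 0, NULL, NULL);
        // Blocking read of a small scalar synchronizes the host with the GPU.
        clEnqueueReadBuffer(queue, deltaBuf, CL_TRUE, 0,
                            sizeof(float), &delta, 0, NULL, NULL);
        converged = (delta < tolerance);
    }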
1: Since a 100000x100000 matrix of float values will take at least 40 GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ), because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices. But if you want to go the whole way on your own: good luck ;-) and ... note that CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work items) in OpenCL is virtually unlimited. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256 and you want to handle 10000000000 elements, then you create 10000000000/256 work groups and let OpenCL take care of how they are actually dispatched and executed (see the sketch below). For matrix operations, CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): the size of the work groups implicitly defines how large your chunks of local memory may be.
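To illustrate that last point with a sketch (variable names are mine): the global size can vastly exceed the work-group size; you only round it up to a multiple of the local size and guard the tail inside the kernel.

    size_t n      = /* total number of elements, may be billions */;
    size_t local  = 256;  // must not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE
    size_t global = ((n + local - 1) / local) * local;  // round up
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);
    // Inside the kernel: if (get_global_id(0) >= n) return;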

returning one result from a pyopencl kernel

My pyopencl kernel program is started with a global size of (512,512); I assume it will run 512x512 = 262,144 times. I want to find the minimum value of a function over my 512x512 image, but I don't want to return 262,144 floats to my CPU to calculate the min. I want to run another kernel (possibly waiting in the queue) to find the min value of all 262,144 pixels and then just send that one float to the CPU. I think this would be faster. Should my waiting kernel's global size be (1,1)? I hope the large 262,144-float buffer that I created using mf.COPY_HOST_PTR will not cross the GPU/CPU bus before I call the next kernel.
Thanks
Tim
Andreas is right: reduction is the solution. Here is a nice article from AMD explaining how to implement simple reductions. It discusses different approaches and the performance gains they bring. The example in the article sums all the elements rather than finding the minimum, but it's fairly trivial to modify the given code.
BTW, maybe I don't fully understand your first sentence, but a kernel launched with a global size of (512, 512) will not run 262,144 times; it runs only once, with 262,144 work items scheduled.
Use a reduction kernel to find the minimum.
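For reference, a sketch of one stage of such a min-reduction in OpenCL C (the names are hypothetical); each work group produces one partial minimum, and the kernel is re-run on the partial results (or the last few values are finished on the host):

    __kernel void reduceMin(__global const float *input,
                            __global float *partialMins,
                            __local float *scratch,
                            const int n)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        // Out-of-range work items contribute the identity element for min.
        scratch[lid] = (gid < n) ? input[gid] : INFINITY;
        barrier(CLK_LOCAL_MEM_FENCE);
        // Tree reduction within the work group.
        for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
            if (lid < offset)
                scratch[lid] = fmin(scratch[lid], scratch[lid + offset]);
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        // One partial minimum per work group.
        if (lid == 0)
            partialMins[get_group_id(0)] = scratch[0];
    }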

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All went well, and I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code for the main is at http://pastebin.com/i4A6kPfn, and for the kernel at http://pastebin.com/Wefrqifh. I start measuring time after segmentPunten(segmentArray, begin, eind); has returned, and I stop measuring after the last clEnqueueReadBuffer.
Computation time on an Nvidia GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there for this?
What it is supposed to do: (short version) calculate how much noise load is produced by an airplane passing by.
Long version: there are several Observer Points (OPs), which are the points at which the sound of a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives the OPs coordinates. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is then saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for (int i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that you read from a specific global memory location only once, saving it in local memory. After that, each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL keywords like global and kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it kernel compilation, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compilation by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through the memory, which is a real killer for GPU performance. You'd ideally want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which would make accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookups.
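In other words, a sketch (NUM_THREADS stands for the total number of work items, and the transposed layout of 'afstanden' is assumed to be produced by the first kernel):

    // Uncoalesced (current layout): each work item strides through memory.
    //   double distance = afstanden[threadID * aantalSegmenten + i];
    // Coalesced (transposed layout): neighboring work items read
    // neighboring addresses on every loop iteration.
    for (int i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[i * NUM_THREADS + threadID];
        // ...
    }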
First of all, you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg call to the last clEnqueueReadBuffer).
If the execution time is still too high, then you can use some kind of profiler (such as the Visual Profiler from NVIDIA) and read the OpenCL Best Practices guide, which is included in the CUDA Toolkit documentation.
As to the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double precision calculations are artificially slowed down on consumer-grade NVIDIA cards.

OpenCL image histogram

I'm trying to write a histogram kernel in OpenCL to compute 256-bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:
const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

__kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
                               __global int* gOutput, __global int* bOutput)
{
    int2 coords = {get_global_id(0), get_global_id(1)};
    float4 sample = read_imagef(input, mSampler, coords);
    uchar rbin = floor(sample.x * 255.0f);
    uchar gbin = floor(sample.y * 255.0f);
    uchar bbin = floor(sample.z * 255.0f);
    rOutput[rbin]++;
    gOutput[gbin]++;
    bOutput[bbin]++;
}
When I run it on a 2100 x 894 image (1,877,400 pixels), I tend to only see around 1,870,000 total values recorded when I sum up the histogram values for each channel. It's also a different number each time. I did expect this, since once in a while two kernels probably grab the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).
The 1,870,000 output is for a {1,1} workgroup size (which is what seems to get set by default if I don't specify otherwise). If I force a larger workgroup size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in workgroup size). This seemed strange to me, but I'm guessing what happens is that all of the work items in the group increment the output array value at the same time, so it just counts as a single increment?
Anyways, I've read in the spec that OpenCL has no global memory synchronization, only synchronization within local workgroups using their __local memory. The histogram example by NVIDIA breaks the histogram workload up into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram afterward. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...
Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).
Thanks!
As stated before, you are writing into shared memory unsynchronized and non-atomically, which leads to errors. If the picture is big enough, I have a suggestion:
Split the work into a one-dimensional range over columns or rows. Let each work item sum up the histogram for its column or row in private memory, and afterwards merge it into the global histogram with atomic operations (atom_inc). This keeps most of the summing in private memory, which is much faster, and reduces the number of atomic operations; a sketch follows below.
If you work in two dimensions, you can do it on parts of the picture.
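A sketch of that idea for one channel (the kernel and parameter names are mine; 32-bit global atomics are core since OpenCL 1.1, and need the cl_khr_global_int32_base_atomics extension with the atom_* names on 1.0):

    __kernel void rowHistogramRed(read_only image2d_t input,
                                  __global int *rOutput,
                                  const int width)
    {
        const sampler_t s = CLK_NORMALIZED_COORDS_FALSE |
                            CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
        int y = get_global_id(0);        // one work item per row
        int priv[256];                   // private per-row histogram
        for (int b = 0; b < 256; b++)
            priv[b] = 0;
        for (int x = 0; x < width; x++) {
            float4 px = read_imagef(input, s, (int2)(x, y));
            priv[(int)(px.x * 255.0f)]++;
        }
        // Merge once per row: at most 256 atomics per work item.
        for (int b = 0; b < 256; b++)
            if (priv[b] > 0)
                atomic_add(&rOutput[b], priv[b]);
    }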
[EDIT:]
I think I have a better answer ;-)
Have a look to: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram
They have an interesting implementation there...
Yes, you're writing to shared memory from many work items at the same time, so you will lose elements if you don't do the updates in a safe way (or worse: just don't do it). The increase in group size actually increases the utilization of your compute device, which in turn increases the likelihood of conflicts. So you end up losing more updates.
However, you seem to be confusing synchronization (ordering thread execution) with shared memory updates (which typically require either atomic operations, or code synchronization and memory barriers, to make sure the memory updates are visible to other threads that are synchronized).
The synchronization + barrier is not particularly useful for your case (and, as you noted, is not available for global synchronization anyway; the reason is that two thread groups may never run concurrently, so trying to synchronize them is nonsensical). It's typically used when all threads start by generating a common data set and then all start to consume that data set with a different access pattern.
In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168). However, note that updating a highly contended memory address (say, because you have thousands of threads all trying to write to only 256 ints) is likely to yield poor performance. All the hoops typical histogram code jumps through are there to reduce the contention on the histogram data.
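The minimal (but contended) fix to the kernel from the question would be to replace the bare increments with atomics (atomic_inc is the OpenCL 1.1 core name for atom_inc):

    // Correct counts, but all work items hammer the same 256 addresses,
    // so expect a significant slowdown compared to a staged histogram.
    atomic_inc(&rOutput[rbin]);
    atomic_inc(&gOutput[gbin]);
    atomic_inc(&bOutput[bbin]);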
You can check:
1. The histogram example from the AMD Accelerated Parallel Processing (APP) SDK.
2. Chapter 14 - Image Histogram of the OpenCL Programming Guide book (ISBN-13: 978-0-321-74964-2).
3. The GPU Histogram sample code from Apple.
