OpenCL image histogram - opencl

I'm trying to write a histogram kernel in OpenCL to compute 256 bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:
const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
CLK_ADDRESS_CLAMP|
CLK_FILTER_NEAREST;
__kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
__global int* gOutput, __global int* bOutput)
{
int2 coords = {get_global_id(0), get_global_id(1)};
float4 sample = read_imagef(input, mSampler, coords);
uchar rbin = floor(sample.x * 255.0f);
uchar gbin = floor(sample.y * 255.0f);
uchar bbin = floor(sample.z * 255.0f);
rOutput[rbin]++;
gOutput[gbin]++;
bOutput[bbin]++;
}
When I run it on an 2100 x 894 image (1,877,400 pixels) i tend to only see in or around 1,870,000 total values being recorded when I sum up the histogram values for each channel. It's also a different number each time. I did expect this since once in a while two kernels probably grab the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).
The 1,870,000 output is for a {1,1} workgroup size (which is what seems to get set by default if I don't specify otherwise). If I force a larger workgroup size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in workgroup size). This seemed strange to me, but I'm guessing what happens is that all of the work items in the group increment the output array value at the same time, and so it just counts as a single increment?
Anyways, I've read in the spec that OpenCL has no global memory syncronization, only syncronization within local workgroups using their __local memory. The histogram example by nVidia breaks up the histogram workload into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram after. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...
Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).
Thanks!

As stated before, you write into a shared memory unsynchronized and non atomic. This leads to errors. If the picture is big enough, I have a suggestion:
Split your work group into a one dimensional one for cols or rows. Use each kernel to sum up the histogram for the col or row and afterwards sum it globally with atomic atom_inc. This brings the most sum ups in private memory which is much faster and reduces atomic ops.
If you work in two dimensions you can do it on parts of the picture.
[EDIT:]
I think, I have a better answer: ;-)
Have a look to: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram
They have an interesting implementation there...

Yes, you're writing to a shared memory from many work-items at the same time, so you will lose elements if you don't do the updates in a safe way (or worse ? Just don't do it). The increase in group size actually increases the utilization of your compute device, which in turn increases the likelihood of conflicts. So you end up losing more updates.
However, you seem to be confusing synchronization (ordering thread execution order) and shared memory updates (which typically require either atomic operations, or code synchronization and memory barriers, to make sure the memory updates are visible to other threads that are synchronized).
the synchronization+barrier is not particularly useful for your case (and as you noted is not available for global synchronization anyways. Reason is, 2 thread-groups may never run concurrently so trying to synchronize them is nonsensical). It's typically used when all threads start working on generating a common data-set, and then all start to consume that data-set with a different access pattern.
In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168). However, note that updating a highly contended memory address (say, because you have thousands of threads trying all to write to only 256 ints) is likely to yield poor performance. All the hoops typical histogram code goes through are there to reduce the contention on the histogram data.

You can check
The histogram example from AMD Accelerated Parallel Processing (APP) SDK.
Chapter 14 - Image Histogram of OpenCL Programming Guide book (ISBN-13: 978-0-321-74964-2).

GPU Histogram - Sample code from Apple

Related

Operate only on a subset of buffer in OpenCL kernel

Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate over most of the functionality I need in to OpenCL kernels. My question is on how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method (for example) for point i accesses cells at [i-2:i+2], so if i=1, I'll run off the end of the array. So - I add some extra points that serve to prevent this, and then just tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm a little more unclear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
gid = get_global_id(0);
if (gid < nGhostCells || gid > nPts-nGhostCells) {
retVal[gid] = 0;
}
// Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
Branching is same for nPts-nGhostCells*2 number of points and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, sufficiently large nPts vs nGhostCells (1024 vs 3) should not be distinctively slower than zero-branching version, except the latency of "or" operation. Even that "or" latency must be hidden behind array access latency, thanks to thread level parallelism.
At those "break" points, mostly 16 or 32 threads would lose some performance and only for several clock cycles because of the lock-step running of SIMD-like architectures.
If you happen to code some chaotic branching, like data-driven code path, then you should split them into different kernels(for different regions) or sort them before the kernel so that average branching between neighboring threads are minimized.

OpenCL doesn't allowes late initializtion of variable in constant space

I want to generate a matrix which will be read by many thread after its generation so I declared it with program scope. It has to be constant so I am just assigning values once so
1) why openCl asking for initialization while declaration only?
2) How can I fix this issue?
1) Because you can't tell the gpu which elements are written by which threads. Constants are prepared by preprocessor using scalar engine, not parallel one. Parallel engine would need N x N times synchronizations to achieve that, where N is number of threads participating in building constant buffer.
2-a) If you want to work with constant memory, prepare a simple(__global, not constant) buffer in a kernel, use it as constant buffer in the next kernel(engine puts it in constant memory space). But constant space is small so the matrix should be small. This needs 2 kernels, means kernel overhead.
2-b) If cache performance is enough, just use a buffer. So it can be in a single kernel(first thread group prepares matrix, remaining ones compute using it, not starting until first group gives signal using atomic functions)
2-c) If local memory is bigger than constant memory, you can use local memory and build that matrix for each compute unit by themselves so it should take same amount of cycles(maybe even less if you use all cores) and probably faster than constant memory. This doesn't need communication between thread groups so would be fast.
2-d)If matrix is big and you need most of bandwidth, distribute it to all memory spaces. Example: put 1/4 of matrix to constant memory (5x bandwidth), put 1/4 of matrix to local memory (10x bandwidth), put 1/4 of matrix to global memory(2x from cache performance), put remaining data to instruction space(instructions themselves) so multiple threads would be working on 4 different places concurrently, using all bandwidth (constant + local + cache + instruction cache).

Iterating results back into an OpenCL kernel

I have written an openCL kernel that takes 25million points and checks them relative to two lines, (A & B). It then outputs two lists; i.e. set A of all of the points found to be beyond line A, and vice versa.
I'd like to run the kernel repeatedly, updating the input points with each of the line results sets in turn (and also updating the checking line). I'm guessing that reading the two result sets out of the kernel, forming them into arrays and then passing them back in one at a time as inputs is quite a slow solution.
As an alternative, I've tested keeping a global index in the kernel that logs which points relate to which line. This is updated at each line checking cycle. During each iteration, the index for each point in the overall set is switched to 0 (no line), A or B or so forth (i.e. the related line id). In subsequent iterations only points with an index that matches the 'live' set being checked in that cycle (i.e. tagged with A for set A) are tested further.
The problem is that, in each iteration, the kernels still have to check through the full index (i.e. all 25m points) to discover wether or not they are in the 'live' set. As a result, the speed of each cycle does not significantly improve as the size of the results set decrease over time. Again, this seems a slow solution; whilst avoiding passing too much information between GPU and CPU it instead means that a large number of the work items aren't doing very much work at all.
Is there an alternative solution to what I am trying to do here?
You could use atomics to sort the outputs into two arrays. Ie if we're in A then get my position by incrementing the A counter and put me into A, and do the same for B
Using global atomics on everything might be horribly slow (fast on amd, slow on nvidia, no idea about other devices) - instead you can use a local atomic_inc in a 0'd local integer to do exactly the same thing (but for only the local set of x work-items), and then at the end do an atomic_add to both global counters based on your local counters
To put this more clearly in code (my explanation is not great)
int id;
if(is_a)
id = atomic_inc(&local_a);
else
id = atomic_inc(&local_b);
barrier(CLK_LOCAL_MEM_FENCE);
__local int a_base, b_base;
int lid = get_local_id(0);
if(lid == 0)
{
a_base = atomic_add(a_counter, local_a);
b_base = atomic_add(b_counter, local_b);
}
barrier(CLK_LOCAL_MEM_FENCE);
if(is_a)
a_buffer[id + a_base] = data;
else
b_buffer[id + b_base] = data;
This involves faffing around with atomics which are inherently slow, but depending on how quickly your dataset reduces it might be much faster. Additionally if B data is not considered live, you can omit getting the b ids and all the atomics involving b, as well as the write back

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code is for the main: http://pastebin.com/i4A6kPfn, and for the kernel: http://pastebin.com/Wefrqifh I'm starting to measure time after segmentPunten(segmentArray, begin, eind); has returned, and I end measuring time after the last clEnqueueReadBuffer.
Computation time on a Nvidia GT440 is 38.6 seconds, on a GT555M 35.5, on a Athlon II X4 5.6 seconds, and on a Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times are so high, and what solutions are there for this?
What is it supposed to do: (short version) to calculate how much noiseload there is made by an airplane that is passing by.
long version: there are several Observer Points (OP) wich are the points in wich sound is measured from an airplane thas is passing by. The flightpath is being segmented in 10.000 segments, this is done at the function segmentPunten. The double for loop in the main gives OPs a coordinate. There are two kernels. The first one calculates the distance from a single OP to a single segment. This is then saved in the array "afstanden". The second kernel calculates the sound load in an OP, from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
const int aantalSegmenten)
{
// ...
for(i = 0; i < aantalSegmenten; i++) {
double distance = afstanden[threadID * aantalSegmenten + i];
// ...
}
// ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each
kernel that accesses global memory 1000 times. Without crawling though the code,
I'm guessing that many of these accesses overlap when considering your
computation as a whole. It this the case? Will two work items access the same
global memory? If this is the case, you will see a potentially huge win on the
GPU from rewriting your algorithm to partition the work such that you can read
from a specific global memory only once, saving it in local memory. After that,
each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL
keywords like global and kernel. I don't think many newcomers to CL realize
that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With other, you may need to do the caching manually. Here's instructions for the caching.
If the performance bottle neck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently when a block of threads performs a read from the memory, the addresses are spread out through the memory, which is a real killer for GPU performance. You'd ideally want to index array with something like afstanden[ndx*NUM_THREADS + threadID], which would make accesses from a work group to load a contiguous block of memory. This is much faster than the current, essentially random, memory lookup.
First of all you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To do a fair comparison measure the computation time from the first "non-static" part of your program. (For example from between the first clSetKernelArgs to the last clEnqueueReadBuffer.)
If the execution time is still too high, then you can use some kind of profiler (such as VisualProfiler from NVidia), and read the OpenCL Best Practices guid which is included in the CUDA Toolkit documentation.
To the raw kernel execution time: Consider (and measure) that do you really need the double precision for your calculation, because the double precision calculations are artificially slowed down on the consumer grade NVidia cards.

Detecting and fixing overflows

we have a particle detector hard-wired to use 16-bit and 8-bit buffers. Every now and then, there are certain [predicted] peaks of particle fluxes passing through it; that's okay. What is not okay is that these fluxes usually reach magnitudes above the capacity of the buffers to store them; thus, overflows occur. On a chart, they look like the flux suddenly drops and begins growing again. Can you propose a [mostly] accurate method of detecting points of data suffering from an overflow?
P.S. The detector is physically inaccessible, so fixing it the 'right way' by replacing the buffers doesn't seem to be an option.
Update: Some clarifications as requested. We use python at the data processing facility; the technology used in the detector itself is pretty obscure (treat it as if it was developed by a completely unrelated third party), but it is definitely unsophisticated, i.e. not running a 'real' OS, just some low-level stuff to record the detector readings and to respond to remote commands like power cycle. Memory corruption and other problems are not an issue right now. The overflows occur simply because the designer of the detector used 16-bit buffers for counting the particle flux, and sometimes the flux exceeds 65535 particles per second.
Update 2: As several readers have pointed out, the intended solution would have something to do with analyzing the flux profile to detect sharp declines (e.g. by an order of magnitude) in an attempt to separate them from normal fluctuations. Another problem arises: can restorations (points where the original flux drops below the overflowing level) be detected by simply running the correction program against the reverted (by the x axis) flux profile?
int32[] unwrap(int16[] x)
{
// this is pseudocode
int32[] y = new int32[x.length];
y[0] = x[0];
for (i = 1:x.length-1)
{
y[i] = y[i-1] + sign_extend(x[i]-x[i-1]);
// works fine as long as the "real" value of x[i] and x[i-1]
// differ by less than 1/2 of the span of allowable values
// of x's storage type (=32768 in the case of int16)
// Otherwise there is ambiguity.
}
return y;
}
int32 sign_extend(int16 x)
{
return (int32)x; // works properly in Java and in most C compilers
}
// exercise for the reader to write similar code to unwrap 8-bit arrays
// to a 16-bit or 32-bit array
Of course, ideally you'd fix the detector software to max out at 65535 to prevent wraparound of the sort that is causing your grief. I understand that this isn't always possible, or at least isn't always possible to do quickly.
When the particle flux exceeds 65535, does it do so quickly, or does the flux gradually increase and then gradually decrease? This makes a difference in what algorithm you might use to detect this. For example, if the flux goes up slowly enough:
true flux measurement
5000 5000
10000 10000
30000 30000
50000 50000
70000 4465
90000 24465
60000 60000
30000 30000
10000 10000
then you'll tend to have a large negative drop at times when you have overflowed. A much larger negative drop than you'll have at any other time. This can serve as a signal that you've overflowed. To find the end of the overflow time period, you could look for a large jump to a value not too far from 65535.
All of this depends on the maximum true flux that is possible and on how rapidly the flux rises and falls. For example, is it possible to get more than 128k counts in one measurement period? Is it possible for one measurement to be 5000 and the next measurement to be 50000? If the data is not well-behaved enough, you may be able to make only statistical judgment about when you have overflowed.
Your question needs to provide more information about your implementation - what language/framework are you using?
Data overflows in software (which is what I think you're talking about) are bad practice and should be avoided. While you are seeing (strange data output) is only one side effect that is possible when experiencing data overflows, but it is merely the tip of the iceberg of the sorts of issues you can see.
You could quite easily experience more serious issues like memory corruption, which can cause programs to crash loudly, or worse, obscurely.
Is there any validation you can do to prevent the overflows from occurring in the first place?
I really don't think you can fix it without fixing the underlying buffers. How are you supposed to tell the difference between the sequences of values (0, 1, 2, 1, 0) and (0, 1, 65538, 1, 0)? You can't.
How about using an HMM where the hidden state is whether you are in an overflow and the emissions are observed particle flux?
The tricky part would be coming up with the probability models for the transitions (which will basically encode the time-scale of peaks) and for the emissions (which you can build if you know how the flux behaves and how overflow affects measurement). These are domain-specific questions, so there probably aren't ready-made solutions out there.
But one you have the model, everything else---fitting your data, quantifying uncertainty, simulation, etc.---is routine.
You can only do this if the actual jumps between successive values are much smaller than 65536. Otherwise, an overflow-induced valley artifact is indistinguishable from a real valley, you can only guess. You can try to match overflows to corresponding restorations, by simultaneously analysing a signal from the right and the left (assuming that there is a recognizable base line).
Other than that, all you can do is to adjust your experiment by repeating it with different original particle flows, so that real valleys will not move, but artifact ones move to the point of overflow.

Resources