OpenCL histogram with many bins

I am using the code presented in Chapter 14 of the OpenCL Programming Guide to calculate a histogram. It works fine for 256 bins, but unfortunately I need 65536 bins for my application. With this approach, the local array gets too big:
local uint tmp_histogram[256 * 256];
As a result, the program is not built (CL_BUILD_PROGRAM_FAILURE).
Do you have any ideas how this issue can be solved? I thought of using multiple kernels to compute the values for the different bins (i.e. to split the histogram, so that I first compute the values for the bins 0-255, then for 256-511, etc.). However, in this case I will have to check if a value is within that range before incrementing, which means that I will need conditionals...

Using global memory would solve the problem, but would not result in a very fast kernel. I suggest creating multiple work groups, and using each group to count a range of values only.
#define RANGE_SIZE 8192
kernel void histo(__global const uint *data, int dataSize) {
    int wid = get_local_id(0);
    int wSize = get_local_size(0);
    int gid = get_group_id(0);
    // Each work group counts the RANGE_SIZE values starting at rangeStart.
    uint rangeStart = gid * RANGE_SIZE;
    uint rangeEnd = rangeStart + RANGE_SIZE;
    local uint tmp_histogram[RANGE_SIZE];
    // Zero the local bins before counting.
    for (int i = wid; i < RANGE_SIZE; i += wSize) {
        tmp_histogram[i] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    uint value;
    for (int i = wid; i < dataSize; i += wSize) {
        value = data[i];
        if (value >= rangeStart && value < rangeEnd) {
            atomic_inc(&tmp_histogram[value - rangeStart]);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // use the local data here
}
This assumes 32 KB of local memory is available (8192 counters of 4 bytes each). If you reduce RANGE_SIZE, it does not have to be a power of two, but you do need to make sure you are calling the kernel with enough work groups to hit all values up to 64k.
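The sizes involved can be sanity-checked on the host with a few lines of plain C. This is only a sketch: the helper names (local_bytes, groups_needed, owning_group) are made up for illustration, counters are assumed to be 4 bytes wide, and each work group is assumed to own RANGE_SIZE consecutive bin values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of local memory used by a histogram of `bins` 4-byte counters. */
size_t local_bytes(size_t bins) {
    return bins * sizeof(uint32_t);
}

/* Number of work groups needed so that groups * range_size covers all bins. */
size_t groups_needed(size_t total_bins, size_t range_size) {
    assert(range_size > 0);
    return (total_bins + range_size - 1) / range_size;
}

/* Index of the work group whose range contains `value`. */
size_t owning_group(uint32_t value, size_t range_size) {
    assert(range_size > 0);
    return value / range_size;
}
```

For 65536 bins, local_bytes gives 256 KiB, which is why the single local array fails to build on a device with 32 KB of local memory, while groups_needed(65536, 8192) gives the 8 work groups the kernel has to be launched with.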

Move your histogram to global storage.
Another option is to use unsigned short counters, if no bin count in your application exceeds 65535.
Finally, you could run your code twice: the first time for the lower 32768 values, the second time for the upper half.


Understanding Performance Behavior of Random Writes to Global Memory

I'm running experiments aiming to understand the behavior of random read and write access to global memory.
The following kernel reads from an input vector (groupColumn) with a coalesced access pattern and reads random entries from a hash table in global memory.
struct Entry {
    uint group;
    uint payload;
};
typedef struct Entry Entry;
__kernel void global_random_write_access(__global const uint* restrict groupColumn,
                                         __global Entry* globalHashTable,
                                         __const uint HASH_TABLE_SIZE,
                                         __const uint HASH_TABLE_SIZE_BITS,
                                         __const uint BATCH,
                                         __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);
    uint end = BATCH * STRIDE;
    uint sum = 0;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        // hash keys are pre-computed
        uint hash_key = groupColumn[idx]; // coalesced read access
        __global Entry* entry = &globalHashTable[hash_key]; // pointer arithmetic
        sum += entry->payload; // random read
    }
    if (local_id < HASH_TABLE_SIZE) {
        globalHashTable[local_id].payload = sum; // rare coalesced write
    }
}
I ran this kernel on an NVIDIA V100 card with multiple iterations. The variance of the results is very low, so I only plot one dot per group configuration. The input data size is 1 GiB and each thread processes 128 entries (BATCH = 128). Here are the results:
So far so good. The V100 has a max memory bandwidth of roughly 840GiB/sec and the measurements are close enough, given the fact that there are random memory reads involved.
Now I'm testing random writes to global memory with the following kernel:
__kernel void global_random_write_access(__global const uint* restrict groupColumn,
                                         __global Entry* globalHashTable,
                                         __const uint HASH_TABLE_SIZE,
                                         __const uint HASH_TABLE_SIZE_BITS,
                                         __const uint BATCH,
                                         __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);
    uint end = BATCH * STRIDE;
    uint sum = 0;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        // hash keys are pre-computed
        uint hash_key = groupColumn[idx]; // coalesced read access
        __global Entry* entry = &globalHashTable[hash_key]; // pointer arithmetic
        sum += i;
        entry->payload = sum; // random write
    }
    if (local_id < HASH_TABLE_SIZE) {
        globalHashTable[local_id].payload = sum; // rare coalesced write
    }
}
Godbolt: OpenCL -> PTX
The performance drops significantly to a few GiB/sec for few groups.
I can't make any sense of the behavior. As soon as the hash table reaches the size of L1 the performance seems to be limited by L2. For fewer groups the performance is way lower. I don't really understand what the limiting factors are.
The CUDA documentation doesn't say much about how store instructions are handled internally. The only thing I could find is that the st.wb PTX instruction (Cache Operations) might cause a hit on stale L1 cache data if another thread tries to read the same address. However, there are no reads to the hash table involved here.
Any hints or links to understanding the performance behavior are much appreciated.
I actually found a bug in my code: it didn't pre-compute the hash keys. The access to global memory wasn't random, but actually coalesced due to how I generated the values. I further simplified my experiments by removing the hash table. Now I only have one integer input column and one integer output column. Again, I want to see how the writes to global memory actually behave for different memory ranges. Ultimately, I want to understand which hardware properties influence the performance of writes to global memory and see if I can predict from the code what performance to expect.
I tested this with two kernels that do the following:
Read from input, write to output
Read from input, read from output and write to output
I also applied two different access patterns, by generating the values in the group column:
SEQUENTIAL: sequentially increasing numbers until current group's size is reached. This pattern leads to a coalesced memory access when reading and writing from the output column.
RANDOM: uniformly distributed random numbers within the current group's size. This pattern leads to a misaligned memory access when reading and writing from the output column.
(1) Read & Write
__kernel void global_write_access(__global const uint* restrict groupColumn,
                                  __global uint *restrict output,
                                  __const uint BATCH,
                                  __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);
    uint end = BATCH * STRIDE;
    uint sum = 0;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        uint group = groupColumn[idx]; // coalesced read access
        sum += i;
        output[group] = sum; // write (coalesced | random)
    }
}
PTX Code:
(2) Read, Read & Write
__kernel void global_read_write_access(__global const uint* restrict groupColumn,
                                       __global uint *restrict output,
                                       __const uint BATCH,
                                       __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);
    uint end = BATCH * STRIDE;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        uint group = groupColumn[idx]; // coalesced read access
        output[group] += 1; // read & write (coalesced | random)
    }
}
PTX Code:
As ProjectPhysX pointed out, the access pattern makes a huge difference. However, for small groups the performance is quite similar for both access patterns. In general, I would like to better understand the shape of the curves and which hardware properties, architectural features etc. influence this shape.
From the CUDA Programming Guide I learned that global memory accesses are conducted via 32-, 64-, or 128-byte transactions. Accesses to L2 are done via 32-byte transactions, so up to 8 integer words can be accessed in a single transaction. This might explain the plateau with a bump at 8 groups at the beginning of the curve. After that, more transactions are needed and performance drops.
One cache line is 128 bytes long (both on L1 and L2), hence 32 integers fit into a single cache line. For more groups, more cache lines are required, which can potentially be processed in parallel by more memory controllers. That might be the reason for the performance to increase here. 8 controllers are available on the V100, so I would expect the performance to peak at 256 groups. It doesn't, though. Instead, performance increases steadily until reaching 4096 groups and plateaus there at roughly 750 GiB/sec.
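The transaction arithmetic can be made concrete with a small host-side model in C. This is a deliberately simplified sketch, not a description of what the hardware actually does: it assumes a warp of 32 threads, 4-byte elements, a buffer aligned to 32 bytes, and it simply counts the distinct 32-byte sectors one warp-wide access touches. The name sectors_touched is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_BYTES 32u  /* L2 transaction granularity quoted above */
#define WARP_SIZE 32u     /* threads that issue one memory instruction together */

/* Count the distinct 32-byte sectors touched when each of the WARP_SIZE
   threads accesses the 4-byte element at index idx[t]. */
size_t sectors_touched(const uint32_t idx[WARP_SIZE]) {
    assert(idx != NULL);
    uint64_t seen[WARP_SIZE];
    size_t count = 0;
    for (size_t t = 0; t < WARP_SIZE; t++) {
        uint64_t sector = ((uint64_t)idx[t] * sizeof(uint32_t)) / SECTOR_BYTES;
        int found = 0;
        for (size_t s = 0; s < count; s++) {
            if (seen[s] == sector) { found = 1; break; }
        }
        if (!found) seen[count++] = sector;
    }
    return count;
}
```

In this model, 32 consecutive indices touch 4 sectors (one 128-byte line), indices that all fall within the first 8 elements touch a single sector, and indices spread at least 8 elements apart touch 32 sectors, which is the worst case for one warp.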
The plateauing in your second performance plot is GPU saturation: for only a few work groups, the GPU is partly idle and the latencies involved in launching the kernel significantly reduce performance. Above 8192 groups, the GPU fully saturates its memory bandwidth. The plateau is only at ~520GB/s because of the misaligned writes (which have low performance on the V100) and also the "rare coalesced write" in the if-block, which happens at least once per group. For branching within the group, all other threads have to wait for the single write operation to finish. Also, this write is not coalesced, because it is not happening for every thread in the group. On the V100, misaligned write performance is very poor at max. ~120GB/s, see the benchmark here.
Note that if you comment out the if-part, the compiler sees that you do not do anything with sum and optimizes everything out, leaving you with a blank kernel in PTX.
The first performance graph is a bit more confusing to me. The only difference between the first kernel and the second is that the random write in the loop is replaced by a random read. Generally, read performance on the V100 is much better (~840GB/s, regardless of coalesced/misaligned) than misaligned write performance, so performance is expected to be much better overall, and indeed it is. However, I can't make sense of the performance dropping for more groups, where saturation should theoretically be better. But the performance drop isn't really that significant at ~760GB/s vs. 730GB/s.
To summarize, you are observing that the performance penalty for misaligned writes (~120GB/s vs. ~900GB/s for coalesced writes) is much larger than for reads, where performance is about the same for coalesced/misaligned at ~840GB/s. This is a common thing for GPUs, with some variance of course between microarchitectures. Typically there is at least some performance penalty for misaligned reads, but not as large as for misaligned writes.

OpenCL: 3D array processing - Global size limit

I'm working with a 3D array of dimension xdim=49, ydim=1024 and zdim=64. My DEVICE_MAX_WORK_ITEM_SIZES is only 512/512/512. If I declare my
size_t global_work_size[3] = {xdim, ydim, zdim}; and launch a 3D kernel,
I'm getting wrong results since my ydim > 512. If all my dimensions are below 512, I'm getting the expected results. Please let me know if there's an alternative for this.
CL_DEVICE_MAX_WORK_ITEM_SIZES only limits the size of work groups, not the global work item size (yes, it's a terrible name for the constant). You are much more tightly restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE, which is the total number of items allowed in a work group (you'd typically hit this far sooner than CL_DEVICE_MAX_WORK_ITEM_SIZES because of multiplication).
So go ahead and launch your global work size of 49, 1024, 64. It should work. If it doesn't, you're using get_local_id instead of get_global_id or have some other bug. We regularly launch 2D kernels with 4096 x 4096 global work size.
See also Questions about global and local work size
If you don't use shared local memory, you don't need to worry about local work group sizes. In fact, you can pass NULL instead of a pointer to an array of sizes for local_work_size and let the runtime pick something (it helps if your global dimensions are easily divisible by small numbers).
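If you do pick a local size yourself, the two limits can be checked together. Here is a minimal sketch in plain C; the function name is made up, the per-dimension limits are the 512/512/512 from the question, and the total limit of 512 used in the usage note is only an assumed example value for CL_DEVICE_MAX_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a 3D local work size respects both device limits:
   each dimension is at most max_item_sizes[d] (CL_DEVICE_MAX_WORK_ITEM_SIZES)
   and the product is at most max_group_size (CL_DEVICE_MAX_WORK_GROUP_SIZE). */
int local_size_is_valid(const size_t local[3], const size_t max_item_sizes[3],
                        size_t max_group_size) {
    assert(local != NULL && max_item_sizes != NULL);
    size_t total = 1;
    for (int d = 0; d < 3; d++) {
        if (local[d] == 0 || local[d] > max_item_sizes[d]) return 0;
        total *= local[d];
    }
    return total <= max_group_size;
}
```

With those limits, a local size of {7, 16, 4} (448 work-items) passes, while {512, 512, 1} fails on the total even though every dimension is within its own limit. The global size of 49 x 1024 x 64 is not subject to either check.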
Assuming the dimensions you provided are the size of your data, you can decrease the global work size by making each GPU thread calculate more data. What I mean is, every thread in your case does one calculation, and if you change your kernels to do, let's say, 2 calculations in the y dimension, then you could cut the number of threads you are launching in half. The global_work_size decides how many threads you are executing in each direction. Let me give you an example:
Let's assume you have an array you want to do some calculations with and the array size you have is 2048. If you write your kernel in the following way, you are going to need 2048 as the global_work_size:
__kernel void calc (__global int *A, __global int *B)
{
    int i = get_global_id(0);
    B[i] = A[i] * 5;
}
The global work size in this case will be:
size_t global_work_size[3] = {2048, 1, 1};
However, if you change your kernel into the following one, you can lower your global work size as well:
__kernel void new_calc (__global int *A, __global int *B)
{
    int i = get_global_id(0);
    for (int ind = 0; ind < 8; ind++)
        B[i*8 + ind] = A[i*8 + ind] * 5;
}
Then this way, you can use global size as:
size_t global_work_size[3] = {256, 1, 1};
Also with the second kernel, each of your threads will execute more work, resulting in more utilisation.
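To convince yourself that 256 work-items really cover all 2048 elements, you can emulate the launch serially on the host. This is just a sketch in plain C: the global id is passed in as a parameter instead of coming from get_global_id, and the function names and the ITEMS_PER_THREAD constant are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ITEMS_PER_THREAD 8

/* Body of new_calc for one work-item, with the global id passed explicitly. */
void new_calc_item(size_t gid, const int *A, int *B) {
    for (size_t ind = 0; ind < ITEMS_PER_THREAD; ind++) {
        B[gid * ITEMS_PER_THREAD + ind] = A[gid * ITEMS_PER_THREAD + ind] * 5;
    }
}

/* Serial stand-in for the NDRange launch; returns the global size used. */
size_t run_new_calc(const int *A, int *B, size_t n) {
    assert(n % ITEMS_PER_THREAD == 0);
    size_t global_work_size = n / ITEMS_PER_THREAD;
    for (size_t gid = 0; gid < global_work_size; gid++) {
        new_calc_item(gid, A, B);
    }
    return global_work_size;
}
```

For n = 2048, run_new_calc returns 256 and every element of B ends up as A[i] * 5. Note that the data size has to be a multiple of the per-thread item count, otherwise the last work-item would need a bounds check.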

Random NaN and incorrect results with OpenCL kernel

I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.
The Kernel
I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that allocation of local memory as working space is now dynamic. Below is the kernel I have written:
__kernel void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
                            __global const float* A, __global const float* B, const float beta,
                            __global float* C, __local float* Asub, __local float* Bsub) {
    const uint row = get_local_id(0);
    const uint col = get_local_id(1);
    const uint TS = get_local_size(0); // Tile size
    const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
    const uint globalCol = TS * get_group_id(1) + col; // Col ID of C (0..N)
    // Initialise the accumulation register
    float acc = 0.0f;
    // Loop over all tiles
    const int numtiles = K / TS;
    for (int t = 0; t < numtiles; t++) {
        const int tiledRow = TS * t + row;
        const int tiledCol = TS * t + col;
        Asub[col * TS + row] = A[tiledCol * M + globalRow];
        Bsub[col * TS + row] = B[globalCol * K + tiledRow];
        for(int k = 0; k < TS; k++) {
            acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
        }
    }
    C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}
Tile Size (TS) is now a value defined in the calling code, which looks like this:
// A, B and C are 2D matrices, their cl::Buffers have already been set up
// and values appropriately set.
kernel.setArg(0, (cl_int)nrowA);
kernel.setArg(1, (cl_int)ncolB);
kernel.setArg(2, (cl_int)ncolA);
kernel.setArg(3, alpha);
kernel.setArg(4, A_buffer);
kernel.setArg(5, B_buffer);
kernel.setArg(6, beta);
kernel.setArg(7, C_buffer);
kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(nrowA, ncolB);
status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);
The Problem
The problem I am encountering is, unit tests (written with Google's gtest) I have written will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass tests 100% of the time)
I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.
However, I can get this correct result maybe just 75% of the time.
Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
What am I doing wrong here?
Things I have tried
I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.
Other points to note
I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.
This looks like a complicated one. There are several things to address and they won't fit into comments, so I'll post all this as an answer even though it does not solve your problem (yet).
I am baffled because if I have made an algorithmic or mathematical
mistake, the mistake should be consistent; instead I am getting
incorrect results only randomly.
Such a behavior is a typical indicator of race conditions.
I have tried to initialize both __local memory to 0.0, but this causes
all results to become wrong (but frankly, I'm not really sure how to
initialize it properly)
Actually this is a good thing. Finally we have some consistency.
Initializing local memory
Initializing local memory can be done using the work items, e.g. if you have a 1D workgroup of 16 items and your local memory consists of 16 floats, just do this:
local float* ptr = ... // your pointer to local memory
int idx = get_local_id(0); // get the index for the current work-item
ptr[idx] = 0.f; // init with value 0
barrier(CLK_LOCAL_MEM_FENCE); // synchronize local memory access within workgroup
If your local memory is larger, e.g. 64 floats, you will have to use a loop where each work item initializes 4 values, at least that is the most efficient way. However, no one will stop you from using every work item to initialize every value in the local memory, even though that is complete nonsense since you're essentially initializing it multiple times.
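The loop for the larger case can be sketched as follows. This is a serial emulation in plain C, not kernel code: the local id and group size are passed in as parameters instead of coming from get_local_id(0) and get_local_size(0), and the function names are made up for illustration. In a real kernel, the loop would be followed by barrier(CLK_LOCAL_MEM_FENCE).

```c
#include <assert.h>
#include <stddef.h>

/* What a single work-item with local id `lid` would do: zero every
   `group_size`-th element, starting at its own id. */
void init_local_item(float *ptr, size_t n, size_t lid, size_t group_size) {
    assert(group_size > 0);
    for (size_t i = lid; i < n; i += group_size) {
        ptr[i] = 0.0f;
    }
}

/* Serial stand-in for the whole work-group running init_local_item. */
void init_local_group(float *ptr, size_t n, size_t group_size) {
    for (size_t lid = 0; lid < group_size; lid++) {
        init_local_item(ptr, n, lid, group_size);
    }
}
```

With 64 floats and 16 work-items, each work-item touches exactly 4 elements (for local id 3: indices 3, 19, 35 and 51), and together they cover the whole array once.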
Your changes
The original algorithm looks like it is specifically designed to use square tiles.
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
Not only that but the size of local memory matches the workgroup size, in their example 32x32.
When I look at your kernel parameters for local memory, I can see that you use parameters that are defined as M and N in the original algorithm. This doesn't seem correct.
Update 1
Since you have not described if the original algorithm works for you, this is what you should do to find your error:
Create a set of test data. Make sure you only use data sizes that are actually supported by the original algorithm (e.g. minimum size, multiples of x, etc.). Also, use large data sets since some errors only show if multiple workgroups are dispatched.
Use the original, unaltered algorithm with your test data sets and verify the results.
Change the algorithm only so that dynamic local memory is used instead of fixed-size local memory, but make sure it has the same size as in the fixed-size approach. This is what you tried, but I think it failed due to what I have described under "Your changes".
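For verifying results in these steps, a plain host-side reference is handy. The following is a minimal sketch in C, not part of the original algorithm: the name gemm_ref is made up, and the column-major layout (A is M x K, B is K x N, C is M x N) is an assumption read off the indexing in the kernel from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Reference C = alpha*A*B + beta*C on the host.
   Assumed layout: column-major, A is M x K, B is K x N, C is M x N. */
void gemm_ref(size_t M, size_t N, size_t K, float alpha,
              const float *A, const float *B, float beta, float *C) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t col = 0; col < N; col++) {
        for (size_t row = 0; row < M; row++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++) {
                acc += A[k * M + row] * B[col * K + k];
            }
            C[col * M + row] = alpha * acc + beta * C[col * M + row];
        }
    }
}
```

For the 1x4 by 4x1 test from the question with alpha = 1 and beta = 0, this gives 14.0. Comparing the kernel output against such a reference for many sizes makes it easier to see which sizes break.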

OpenCL: are out-of-bounds checks important in kernels?

I have seen solutions like this:
kernel void dp_square (global const float *a,
                       global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * a[id];
}
and like this:
kernel void dp_square (global const float *a,
                       global float *result, const unsigned int count)
{
    int id = get_global_id(0);
    if(id < count)
        result[id] = a[id] * a[id];
}
Is the check for id < count important? What happens if a work-item tries to process an item that is not available?
Could the reason for it not being there in the first example be that the programmer simply ensures that the global size equals the number of elements to be processed (is this normal)?
This is often done for two reasons --
To ensure that a developer-error doesn't kill the code or read bad memory
Because sometimes it is optimal to run more work-items than there are data points. For example, if the optimal work-group size for my device is 32 (not uncommon), and I have an array of 61 pieces of data, I'll run 64 work-items, and the last three will simply "play dead."
In order to not include this check, you'd have to use a work-group size that divides the total number of work-items. In this case, that would leave you with a work-group size of 1 (as 61 is prime), which would be very slow!
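On the host side, the padded global size is usually computed by rounding the element count up to the next multiple of the work-group size. A minimal sketch in plain C (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= count. */
size_t round_up_global_size(size_t count, size_t local_size) {
    assert(local_size > 0);
    return ((count + local_size - 1) / local_size) * local_size;
}
```

For 61 elements and a work-group size of 32, this gives 64, so the three extra work-items are exactly the ones the id < count check has to filter out.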

Strange behaviour using local memory in OpenCL

I'm currently working on a project using OpenCL on an NVIDIA Tesla C1060 (driver version 195.17). However, I'm getting some strange behaviour I can't really explain. Here is the code which puzzles me (reduced for clarity and testing purposes):
kernel void TestKernel(global const int* groupOffsets, global float* result,
                       local int* tmpData, const int itemcount)
{
    unsigned int groupid = get_group_id(0);
    unsigned int globalsize = get_global_size(0);
    unsigned int groupcount = get_num_groups(0);
    for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)
    {
        if(get_local_id(0) == 0)
            tmpData[0] = groupOffsets[groupid];
        barrier(CLK_LOCAL_MEM_FENCE);
        int offset = tmpData[0];
        result[id] = (float) offset;
    }
}
This code should load the offset for each workgroup into local memory and then read it back and write it into the corresponding output vector entry. For most workitems this is working, but for each workgroup the workitems with local ids 1 to 31 read an incorrect value.
My output vector (for workgroupsize=128) is as follows:
index 0: 0
index 1- 31: 470400
index 32-127: 0
index 128: 640
index 129-159: 471040
index 160-255: 640
index 256: 1280
index 257-287: 471680
index 288-511: 1280
The output I expected would be
index 0-127: 0
index 128-255: 640
index 256-511: 1280
The strange thing is: the problem only occurs when I use fewer than itemcount workitems (so it works as expected when globalsize>=itemcount, meaning that every workitem processes only one entry). So I'm guessing it has something to do with the loop.
Does anyone know what I'm doing wrong and how to fix it?
I found out that it seems to work if I change
if(get_local_id(0) == 0)
    tmpData[0] = groupOffsets[groupid];
to
if(get_local_id(0) < 32)
    tmpData[0] = groupOffsets[groupid];
This astonishes me even more, so while it might fix the problem, I don't feel comfortable fixing it this way (as it might break some other time).
Besides I would rather avoid losing performance when running on Geforce 8xxx class hardware due to additional (uncoalesced for that hardware as far as I understand) memory accesses.
So the question still remains.
Firstly, and importantly, you need to be careful that itemcount is a multiple of the local work size to avoid divergence when executing the barrier.
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
You could implement this as follows:
unsigned int itemcountrounded = get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));
for(unsigned int id = get_global_id(0); id < itemcountrounded; id += globalsize, groupid += groupcount)
{
    // ...
    if (id < itemcount)
        result[id] = (float) offset;
}
You said the code was reduced for simplicity, what happens if you run what you posted? Just wondering whether you need to put the barrier on global memory as well.
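To see why the rounding matters, you can count on the host how many loop iterations each work-item of a group would execute. This is a rough sketch in plain C with made-up function names; it assumes the global size is a multiple of the local size and only models the loop bounds, not the kernel itself.

```c
#include <assert.h>
#include <stddef.h>

/* Number of iterations a work-item starting at `first_id` executes
   for the loop `for (id = first_id; id < limit; id += globalsize)`. */
size_t iterations(size_t first_id, size_t limit, size_t globalsize) {
    assert(globalsize > 0);
    if (first_id >= limit) return 0;
    return (limit - first_id + globalsize - 1) / globalsize;
}

/* Returns 1 if every work-item of work-group `group` runs the same number
   of iterations, i.e. all of them reach each barrier together. */
int group_is_uniform(size_t group, size_t local_size, size_t limit,
                     size_t globalsize) {
    size_t base = group * local_size;
    size_t expected = iterations(base, limit, globalsize);
    for (size_t lid = 1; lid < local_size; lid++) {
        if (iterations(base + lid, limit, globalsize) != expected) return 0;
    }
    return 1;
}
```

As an example, with itemcount = 300, a local size of 128 and a global size of 256, the first work-group is not uniform: local ids 0 to 43 run two iterations and the rest only one, so some work-items skip a barrier that others wait at. With the limit rounded up to 384, both work-groups are uniform.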
