Strange behaviour using local memory in OpenCL

I'm currently working on a project using OpenCL on an NVIDIA Tesla C1060 (driver version 195.17). However, I'm getting some strange behaviour I can't really explain. Here is the code that puzzles me (reduced for clarity and testing purposes):
kernel void TestKernel(global const int* groupOffsets, global float* result,
local int* tmpData, const int itemcount)
{
unsigned int groupid = get_group_id(0);
unsigned int globalsize = get_global_size(0);
unsigned int groupcount = get_num_groups(0);
for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)
{
barrier(CLK_LOCAL_MEM_FENCE);
if(get_local_id(0) == 0)
tmpData[0] = groupOffsets[groupid];
barrier(CLK_LOCAL_MEM_FENCE);
int offset = tmpData[0];
result[id] = (float) offset;
}
}
This code should load the offset for each work-group into local memory, then read it back and write it into the corresponding entry of the output vector. For most work-items this works, but in each work-group the work-items with local IDs 1 to 31 read an incorrect value.
My output vector (for a work-group size of 128) is as follows:
index 0: 0
index 1- 31: 470400
index 32-127: 0
index 128: 640
index 129-159: 471040
index 160-255: 640
index 256: 1280
index 257-287: 471680
index 288-511: 1280
...
The output I expected would be:
index 0-127: 0
index 128-255: 640
index 256-511: 1280
...
Strange thing is: the problem only occurs when I use fewer than itemcount work-items (so it works as expected when globalsize >= itemcount, meaning that every work-item processes only one entry). So I'm guessing it has something to do with the loop.
Does anyone know what I'm doing wrong and how to fix it?
Update:
I found out that it seems to work if I change
if(get_local_id(0) == 0)
tmpData[0] = groupOffsets[groupid];
to
if(get_local_id(0) < 32)
tmpData[0] = groupOffsets[groupid];
Which astonishes me even more. While it might fix the problem, I don't feel comfortable fixing it this way (it might break some other time).
Besides, I would rather avoid losing performance on GeForce 8xxx class hardware due to the additional memory accesses (uncoalesced on that hardware, as far as I understand).
So the question still remains.

Firstly, and importantly, you need to make sure that itemcount is a multiple of the local work size. Otherwise work-items in the same work-group can leave the loop at different iterations and diverge at the barrier. The OpenCL specification says of barrier:
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
If itemcount is not such a multiple, you can round the loop bound up and guard the write:
unsigned int itemcountrounded = get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));
for(unsigned int id = get_global_id(0); id < itemcountrounded; id += globalsize, groupid += groupcount)
{
// ...
if (id < itemcount)
result[id] = (float) offset;
}
You said the code was reduced for simplicity. What happens if you run exactly what you posted? I'm also wondering whether the barrier needs to fence global memory as well (CLK_GLOBAL_MEM_FENCE).
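To see why the rounding matters, here is a minimal host-side sketch in plain C (not OpenCL). It counts how often each work-item of one group would go around the grid-stride loop. The helper names trip_count, round_up and group_is_uniform are made up for illustration, and the sizes in the usage note are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Number of iterations of: for (id = first_id; id < limit; id += stride). */
static size_t trip_count(size_t first_id, size_t limit, size_t stride)
{
    assert(stride > 0);
    if (first_id >= limit)
        return 0;
    return (limit - first_id + stride - 1) / stride;
}

/* Round count up to the next multiple of local_size (the itemcountrounded idea). */
static size_t round_up(size_t count, size_t local_size)
{
    assert(local_size > 0);
    return local_size * ((count + local_size - 1) / local_size);
}

/* Returns 1 if every work-item of the given group runs the loop the same
   number of times, i.e. all of them reach every barrier inside the loop. */
static int group_is_uniform(size_t group, size_t local_size,
                            size_t global_size, size_t limit)
{
    size_t first = trip_count(group * local_size, limit, global_size);
    for (size_t lid = 1; lid < local_size; ++lid) {
        if (trip_count(group * local_size + lid, limit, global_size) != first)
            return 0;
    }
    return 1;
}
```

With itemcount = 1000, a local size of 128 and a global size of 256, group 1 is not uniform: the work-item with global ID 128 loops four times, the one with global ID 255 only three. Using round_up(1000, 128) = 1024 as the loop bound makes every work-item in the group loop four times.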

Related

OpenCL: 3D array processing - Global size limit

I'm working with a 3D array of dimensions xdim=49, ydim=1024 and zdim=64. My CL_DEVICE_MAX_WORK_ITEM_SIZES is only 512/512/512. If I declare my
size_t global_work_size[3] = {xdim, ydim, zdim}; and launch a 3D kernel,
I get wrong results since my ydim > 512. If all my dimensions are below 512, I get the expected results. Please let me know if there's an alternative for this.
CL_DEVICE_MAX_WORK_ITEM_SIZES only limits the size of work-groups, not the global work size (yes, it's a terrible name for the constant). You are much more tightly restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE, which is the total number of work-items allowed in a work-group (you'd typically hit this far sooner than CL_DEVICE_MAX_WORK_ITEM_SIZES, because the per-dimension sizes are multiplied together).
So go ahead and launch your global work size of 49, 1024, 64. It should work. If it doesn't, you're using get_local_id instead of get_global_id or have some other bug. We regularly launch 2D kernels with a 4096 x 4096 global work size.
See also Questions about global and local work size
If you don't use shared local memory, you don't need to worry about local work group sizes. In fact, you can pass NULL instead of a pointer to an array of sizes for local_work_size and let the runtime pick something (it helps if your global dimensions are easily divisible by small numbers).
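The two limits can be checked on the host before a launch. The following is only a sketch: local_size_fits is a hypothetical helper, and the limit values passed to it are assumed to have been read from the device beforehand (max_item_sizes from CL_DEVICE_MAX_WORK_ITEM_SIZES, max_group_size from CL_DEVICE_MAX_WORK_GROUP_SIZE).

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a 3D local work size respects both device limits.
   max_item_sizes stands in for the CL_DEVICE_MAX_WORK_ITEM_SIZES values and
   max_group_size for CL_DEVICE_MAX_WORK_GROUP_SIZE. The global work size is
   deliberately not an input: neither limit applies to it. */
static int local_size_fits(const size_t local[3],
                           const size_t max_item_sizes[3],
                           size_t max_group_size)
{
    size_t total = 1;
    for (int d = 0; d < 3; ++d) {
        assert(local[d] > 0);
        if (local[d] > max_item_sizes[d])
            return 0;               /* per-dimension limit exceeded */
        total *= local[d];
    }
    return total <= max_group_size; /* limit on the product of all three */
}
```

Assuming a device that reports 512/512/512 and a maximum work-group size of 512, a local size of 8 x 8 x 8 passes, while 16 x 16 x 16 fails on the product even though each dimension is well below 512.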
Assuming the dimensions you provided are the size of your data, you can decrease the global work size by making each GPU thread process more data. At the moment every thread does one calculation; if you change your kernel to do, say, 2 calculations in the y dimension, then you can cut the number of threads you launch in that dimension in half. The global_work_size decides how many threads you execute in each direction. Let me give you an example:
Let's assume you have an array you want to do some calculations with and the array size you have is 2048. If you write your kernel in the following way, you are going to need 2048 as the global_work_size:
__kernel void calc (__global int *A, __global int *B)
{
int i = get_global_id(0);
B[i] = A[i] * 5;
}
The global work size in this case will be:
size_t global_work_size[3] = {2048, 1, 1};
However, if you change your kernel into the following one, where each work-item processes 8 consecutive elements, you can lower your global work size as well:
__kernel void new_calc (__global int *A, __global int *B)
{
int i = get_global_id(0);
for (int ind = 0; ind < 8; ind++)
B[i*8 + ind] = A[i*8 + ind] * 5;
}
This way, you can use a global size of:
size_t global_work_size[3] = {256, 1, 1};
Also, with the second kernel, each of your threads executes more work, which can improve utilisation.
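As a sanity check, the two kernels can be emulated serially on the host. This is a plain C sketch, not OpenCL: calc_item and new_calc_item are hypothetical stand-ins for the bodies of calc and new_calc, and each loop iteration plays the role of one work-item.

```c
#include <assert.h>
#include <stddef.h>

#define N 2048        /* array length from the example */
#define PER_ITEM 8    /* elements handled by each work-item in new_calc */

/* Body of calc for the work-item with global id i. */
static void calc_item(size_t i, const int *A, int *B)
{
    B[i] = A[i] * 5;
}

/* Body of new_calc for the work-item with global id i. */
static void new_calc_item(size_t i, const int *A, int *B)
{
    for (size_t ind = 0; ind < PER_ITEM; ++ind)
        B[i * PER_ITEM + ind] = A[i * PER_ITEM + ind] * 5;
}

/* Runs both versions and returns 1 if they produce the same output. */
static int both_versions_agree(void)
{
    static int A[N], B1[N], B2[N];
    assert(N % PER_ITEM == 0);  /* otherwise the last work-item would overrun */
    for (size_t i = 0; i < N; ++i)
        A[i] = (int)i;
    for (size_t i = 0; i < N; ++i)             /* global size 2048 */
        calc_item(i, A, B1);
    for (size_t i = 0; i < N / PER_ITEM; ++i)  /* global size 256 */
        new_calc_item(i, A, B2);
    for (size_t i = 0; i < N; ++i)
        if (B1[i] != B2[i])
            return 0;
    return 1;
}
```

The assert documents the one condition this trick relies on: the array length has to be a multiple of the per-thread element count, or the kernel needs a bounds check.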

Use Comment to avoid OpenCL Error on NVIDIA

I wrote the following code to test on my NVIDIA and AMD GPUs:
kernel void computeLayerOutput_Rolled(
global Layer* layers,
global float* weights,
global float* output,
constant int* restrict netSpec,
int layer)
{
const int n = get_global_size(0);
const int nodeNumber = get_global_id(0); //There will be an offset depending on the layer we are operating on
int numberOfWeights;
float t;
//getPosition(i, netSpec, &layer, &nodeNumber);
numberOfWeights = layers[layer].nodes[nodeNumber].numberOfWeights;
//if (sizeof(Layer) > 60000) // This is the extra code add for nvidia
// exit(0);
t = 0;
for (unsigned int j = 0; j != numberOfWeights; ++j)
t += threeD_access(weights, layer, nodeNumber, j, MAXSIZE, MAXSIZE) *
twoD_access(output, layer-1, j, MAXSIZE);
twoD_access(output, layer, nodeNumber, MAXSIZE) = sigmoid(t);
}
At the beginning, I did not have the code that checks the size of Layer. It worked on an AMD Kalindi GPU, but crashed with error code -36 (CL_INVALID_COMMAND_QUEUE) on an NVIDIA Tesla C2075.
Since I had previously rewritten the struct type Layer and reduced its size a lot, I decided to check the size of Layer to determine whether the struct was defined correctly in the kernel code. So I added this code:
if (sizeof(Layer) > 60000)
exit(0);
Then it was OK on NVIDIA. However, the strange thing is that when I put // in front of it, as in the code given above, it still works. (I believe I do not need to run make clean && make when I change something in kernel code, but I did it anyway.) Nevertheless, when I roll back to the version that does not contain this comment, it fails and error code -36 appears again. It really puzzles me. I think the two versions of my code are identical, aren't they?

OpenCL histogram with many bins

I am using the code presented in Chapter 14 of the OpenCL Programming Guide to calculate a histogram. It works fine for 256 bins, but unfortunately I need 65536 bins for my application. With this approach, the local array then gets too big:
local uint tmp_histogram[256 * 256];
As a result, the program is not built (CL_BUILD_PROGRAM_FAILURE).
Do you have any ideas how this issue can be solved? I thought of using multiple kernels to compute the values for the different bins (i.e. to split the histogram, so that I first compute the values for the bins 0-255, then for 256-511, etc.). However, in this case I will have to check if a value is within that range before incrementing, which means that I will need conditionals...
Using global memory would solve the problem, but would not result in a very fast kernel. I suggest creating multiple work groups, and using each group to count a range of values only.
#define RANGE_SIZE 8192
kernel void histo(global const uint* data, const int dataSize){
int wid = get_local_id(0);
int wSize = get_local_size(0);
uint gid = get_group_id(0);
// each work-group counts RANGE_SIZE consecutive bins
uint rangeStart = gid * RANGE_SIZE;
uint rangeEnd = rangeStart + RANGE_SIZE;
local uint tmp_histogram[RANGE_SIZE];
// local memory starts out uninitialised, so clear it first
for(int i=wid; i< RANGE_SIZE; i+= wSize){
tmp_histogram[i] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
uint value;
for(int i=wid; i< dataSize; i+= wSize){
value = data[i];
if(value >= rangeStart && value < rangeEnd){
atomic_inc(&tmp_histogram[value - rangeStart]);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
//use the local data here
}
This assumes 32 KB of local memory is available (8192 bins of 4 bytes each). If you reduce RANGE_SIZE, it does not have to be a power of two, but you do need to make sure you launch the kernel with enough work-groups to cover all values up to 64k.
Move your histogram to global storage.
Another option is to use unsigned short counters, if the counts in your application fit in 16 bits.
Finally, you could run your code twice: first for the lower half of the bins, then for the upper half.
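The range-splitting idea from the first answer can be tried out serially on the host. This is a sketch in plain C, not a tuned implementation: histogram_group is a hypothetical stand-in for one work-group, and it assumes the group with index g owns bins g * RANGE_SIZE up to (g + 1) * RANGE_SIZE - 1.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BINS   65536u  /* bins required by the question */
#define RANGE_SIZE 8192u   /* bins owned by one (emulated) work-group */

/* Emulates one work-group: scans all data, counts only the values that fall
   into this group's slice, then copies the slice into the full histogram. */
static void histogram_group(unsigned group, const unsigned *data, size_t n,
                            unsigned *hist)
{
    unsigned local_hist[RANGE_SIZE] = {0};  /* plays the role of local memory */
    unsigned start = group * RANGE_SIZE;
    unsigned end = start + RANGE_SIZE;
    for (size_t i = 0; i < n; ++i)
        if (data[i] >= start && data[i] < end)
            local_hist[data[i] - start]++;
    for (unsigned b = 0; b < RANGE_SIZE; ++b)
        hist[start + b] = local_hist[b];
}

/* Runs enough groups to cover every bin exactly once. */
static void histogram_all(const unsigned *data, size_t n, unsigned *hist)
{
    assert(NUM_BINS % RANGE_SIZE == 0);
    for (unsigned g = 0; g < NUM_BINS / RANGE_SIZE; ++g)
        histogram_group(g, data, n, hist);
}
```

With these sizes, 8 groups cover all 65536 bins. The price is that every group reads the whole input, so the data is scanned 8 times.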

OpenCL: are out-of-bounds checks important in kernels?

I have seen solutions like this:
kernel void dp_square (global const float *a,
global float *result)
{
int id = get_global_id(0);
result[id] = a[id] * a[id];
}
and
kernel void dp_square (global const float *a,
global float *result, const unsigned int count)
{
int id = get_global_id(0);
if(id < count)
result[id] = a[id] * a[id];
}
Is the check id < count important? What happens if a work-item tries to process an element that is not available?
Could the reason it is missing from the first example be that the programmer simply ensures that the global size equals the number of elements to be processed (is this normal)?
This is often done for two reasons:
To ensure that a developer error doesn't crash the code or read bad memory.
Because sometimes it is optimal to run more work-items than there are data points. For example, if the optimal work-group size for my device is 32 (not uncommon), and I have an array of 61 pieces of data, I'll run 64 work-items, and the last three will simply "play dead."
To leave this check out, you'd have to use a work-group size that divides the number of data points. In this case, that would leave you with a work-group size of 1 (as 61 is prime), which would be very slow!
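The padding arithmetic from that example can be sketched on the host like this. It is plain C rather than OpenCL: padded_global_size, dp_square_item and run_dp_square are hypothetical names, and the loop stands in for the work-items the runtime would launch.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of group_size that is >= count. */
static size_t padded_global_size(size_t count, size_t group_size)
{
    assert(group_size > 0);
    return ((count + group_size - 1) / group_size) * group_size;
}

/* Serial stand-in for the guarded dp_square kernel: one call per work-item. */
static void dp_square_item(size_t id, const float *a, float *result,
                           size_t count)
{
    if (id < count)  /* surplus work-items do nothing */
        result[id] = a[id] * a[id];
}

/* Emulates a launch with the padded global size and returns how many
   work-items were surplus. */
static size_t run_dp_square(const float *a, float *result,
                            size_t count, size_t group_size)
{
    size_t global = padded_global_size(count, group_size);
    for (size_t id = 0; id < global; ++id)
        dp_square_item(id, a, result, count);
    return global - count;
}
```

For 61 elements and a work-group size of 32 this launches 64 work-items, three of which fail the guard and touch no memory. Without the guard those three would read and write past the end of both arrays.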

OpenCL kernel work-group size restriction

So I keep running into strange errors when I call my kernels; the stated max kernel work-group size is one, while the max work-group size of my device (my MacBook) is decidedly higher than that. What possible causes could there be for the kernels restricting the code to a work-group size of one? Here's one of my kernels:
__kernel
void termination_kernel(const int Elements,
__global float* c_I,
__global float* c_Ihat,
__global float* c_rI,
__local float* s_a)
{
const int bdim = 128;
int n = get_global_id(0);
const int tx = get_local_id(0); // thread index in thread-block (0-indexed)
const int bx = get_group_id(0); // block index (0-indexed)
const int gx = get_num_groups(0);
// is thread in range for the addition
float d = 0.f;
while(n < Elements){
d += pow(c_I[n] - c_Ihat[n], 2);
n += gx * bdim;
}
// assume bdim is a power of 2
int alive = bdim / 2;
s_a[tx] = d;
barrier(CLK_LOCAL_MEM_FENCE);
while(alive > 1){
if(tx < alive)
s_a[tx] += s_a[tx + alive];
alive /= 2;
barrier(CLK_LOCAL_MEM_FENCE);
}
if(tx == 0)
c_rI[bx] = s_a[0] + s_a[1];
}
and the error returned is
OpenCL Error (via pfn_notify): [CL_INVALID_WORK_GROUP_SIZE] : OpenCL Error : clEnqueueNDRangeKernel
failed: total work group size (128) is greater than the device can support (1)
OpenCL Error: 'clEnqueueNDRangeKernel(queue, kernel_N, dim, NULL, global_N, local_N, 0, NULL, NULL)'
I know it says the restriction is on the device, but debugging shows that
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
and
CL_KERNEL_WORK_GROUP_SIZE = 1
The kernel construction is called by
char *KernelSource_T = readSource("Includes/termination_kernel.cl");
cl_program program_T = clCreateProgramWithSource(context, 1, (const char **) &KernelSource_T, NULL, &err);
clBuildProgram(program_T, 1, &device, flags, NULL, NULL);
cl_kernel kernel_T = clCreateKernel(program_T, "termination_kernel", &err);
I'd include the calling function, but I'm not sure if it's relevant; my intuition is that it's something in the kernel code that's forcing the restriction. Any ideas? Thanks in advance for the help!
Apple OpenCL doesn't support work-groups larger than [1, 1, 1] on the CPU. I have no idea why, but that's how it's been at least up to OS X 10.9.2. Larger work-groups are fine on the GPU, though.
CL_KERNEL_WORK_GROUP_SIZE tells you how large the work-group can be for this particular kernel. OpenCL's runtime determines that by inspecting the kernel code. CL_KERNEL_WORK_GROUP_SIZE will be a number less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.
Perhaps the amount of local memory available is too small for that work-group size. Could you show the kernel arguments? You can also try reducing the work-group size: test 2, 4, 8, 16, 32, 64, 128 and so on, making sure it is a power of 2.
Time has passed since Tomi's answer, and it seems that Apple has become slightly more flexible in this respect. On my OS X 10.12.3 (still OpenCL 1.2), it is possible to use up to CL_DEVICE_MAX_WORK_GROUP_SIZE in the first dimension.
According to the specification, you can also query the maximum number of work-items in each dimension of a work-group through CL_DEVICE_MAX_WORK_ITEM_SIZES.
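One defensive pattern that follows from the answers above is to clamp the requested local size to the per-kernel limit before enqueueing. The sketch below is plain C with hypothetical helper names. It assumes kernel_limit was obtained beforehand from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE, and it keeps the result a power of two because the reduction in the question needs one.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power of two that is <= limit (limit must be at least 1). */
static size_t largest_pow2_at_most(size_t limit)
{
    assert(limit >= 1);
    size_t p = 1;
    while (p <= limit / 2)
        p *= 2;
    return p;
}

/* Picks a power-of-two local size no larger than either the requested size
   or the per-kernel limit. kernel_limit stands in for the value reported
   for CL_KERNEL_WORK_GROUP_SIZE. */
static size_t choose_local_size(size_t requested, size_t kernel_limit)
{
    assert(requested >= 1 && kernel_limit >= 1);
    size_t cap = requested < kernel_limit ? requested : kernel_limit;
    return largest_pow2_at_most(cap);
}
```

On the CPU device from the question this would yield a local size of 1 instead of failing with CL_INVALID_WORK_GROUP_SIZE. Note that the kernel as posted hard-codes bdim = 128, so it would have to use get_local_size(0) instead before a smaller local size is safe.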

Resources