OpenCL undefined behavior in parallel reduction algorithm

I am working on a simple parallel reduction algorithm to find the minimum value in an array, and I am running into some interesting undefined behavior. I am running Intel's OpenCL 1.2 on Ubuntu 16.04.
The following kernel is what I am trying to run; it currently gives me the wrong answer:
__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){
    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}
When I put in the array {0,-1,-2,-3,-4,-5,-6,-7,-8}, it ends up returning -2.
Here is where the undefined behavior comes in. When I run the following kernel, which only adds a printf statement before the barrier, I get the right answer (-8):
__kernel void Find_Min(int arraySize, __global double* scratch_arr, __global double* value_arr, __global double* min_arr){
    const int index = get_global_id(0);
    int length = (int)sqrt((double)arraySize);
    int start = index*length;
    double min_val = INFINITY;
    for(int i=start; i<start+length && i < arraySize; i++){
        if(value_arr[i] < min_val)
            min_val = value_arr[i];
    }
    scratch_arr[index] = min_val;
    printf("setting scratch[%i] to %f\n", index, min_val);
    barrier(CLK_GLOBAL_MEM_FENCE);
    if(index == 0){
        double totalMin = min_val;
        for(int i=1; i<length; i++){
            if(scratch_arr[i] < totalMin)
                totalMin = scratch_arr[i];
        }
        min_arr[0] = totalMin;
    }
}
The only thing I can think of is that I am using the barrier command incorrectly, and all the printf is doing is delaying the kernel just enough that the work-items happen to finish their partial minimums before the final reduction step. Without the printf, work-item 0 executes the final reduction before the other work-items have finished.
Does anyone else have any suggestions or tips on how to debug this issue?
Thanks in advance!!

The problem was that the kernel was being launched with one work-item per work-group, and barriers only synchronize within a work-group. See this response to a similar question: Open CL no synchronization despite barrier
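For illustration, a minimal host-side sketch of a corrected launch (the queue and kernel names here are placeholders, not from the original post): putting all work-items into a single work-group means the barrier() actually covers all of them.

size_t global_size = 3;           // sqrt(9) chunks for the 9-element example above
size_t local_size  = global_size; // one work-group, so barrier() spans every item
// The original bug: an effective local size of 1 puts each work-item in its
// own group, and barrier() then synchronizes nothing across them.
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size,
                                    0, NULL, NULL);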

Related

Locking not working with OpenCL

I'm stuck on an issue in my OpenCL code where I try to synchronize inside a kernel:
__kernel void pdiffs (__global const long2 *inData, __global const long2 *inData2, __global long2 *outData) {
    long2 diffSum = 0;
    uint idx0 = get_local_size(0)*get_group_id(0);
    for (uint idx=idx0; idx<idx0+get_local_size(0); idx += 1) {
        diffSum += inData[idx] - inData2[idx];
    }
    outData[get_group_id(0)] = diffSum;
    printf("%d %d %d %d/%d\n", get_group_id(0), get_num_groups(0), get_local_size(0), diffSum.x, diffSum.y);
    barrier(CLK_GLOBAL_MEM_FENCE|CLK_LOCAL_MEM_FENCE);
    if (get_group_id(0) == 0) {
        for (size_t i = 1; i < get_num_groups(0); i++){
            outData[0] += outData[i];
            printf("v(%d): %d/%d\n", i, outData[i].x, outData[i].y);
        }
    }
}
(I know that this piece of code is simply bad...)
I just thought that the barrier would synchronize the individual groups, so that the values in outData would be defined. But my trace shows that some differences are never calculated and come out as zero (I set up my data so that every group should return the value 1, but some print as 0). Furthermore, it makes a difference whether I have the printf statements or not; without printf, even more differences seem to be incorrect.
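(This is the same pitfall as in the accepted answer above: barrier() only synchronizes the work-items inside one work-group, never across groups, so group 0 can read outData entries that other groups have not written yet. A minimal sketch of the usual workaround, finishing the sum on the host after the kernel completes; the buffer and NUM_GROUPS names are placeholders:)

// Sketch (placeholder names): read back the per-group partial sums with a
// blocking read, then finish the reduction on the host.
cl_long2 partial[NUM_GROUPS];          // NUM_GROUPS = global_size / local_size
clEnqueueReadBuffer(queue, outDataBuf, CL_TRUE, 0,
                    sizeof(partial), partial, 0, NULL, NULL);
cl_long sum_x = 0, sum_y = 0;
for (size_t i = 0; i < NUM_GROUPS; i++) {
    sum_x += partial[i].s[0];          // .s[] is the portable vector accessor
    sum_y += partial[i].s[1];
}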

get_global_size(0) crashes the controller

Hi everyone,
I got this kernel:
__kernel void FuncionCL(__global char* in, __global char* out, __global int* S2)
{
    __private int op1, op2, op3;
    __private int C;
    __private uint WorkDim, C2;
    op1 = 1;
    op2 = 2;
    WorkDim = get_global_size(0);
    __private int ID;
    ID = get_global_id(0);
    for(C = 0; C < 1000000; C++)
    {
        for(C2 = ID; C2 < 1000; C2 += WorkDim)
        {
            op3 = op1 + op2;
        }
    }
    out[0] = 90;
    out[1] = 89;
    *S2 = (int) WorkDim;
}
It crashes not only the application but the graphics driver too. If I change the loop increment to the constant value 16 (which is what get_global_size() returns here), the code runs fine. What's the problem?
If I run the code with:
WorkDim = 16;
on line 8 instead of:
WorkDim = get_global_size(0);
the code runs 400 times faster. That's the problem: why, if the value is the same?
**EDIT:** Well, now I know why the code is so slow; there are multiple reasons:
1. The occupancy.
2. All the threads were doing the same iterations in the outer loop. The corrected code looks like this:
__kernel void FuncionCL(__global char* in, __global char* out, __global int* S2)
{
    __private int op1, op2, op3;
    __private int C;
    __private uint WorkDim, C2;
    op1 = 1;
    op2 = 2;
    WorkDim = get_global_size(0);
    __private int ID;
    ID = get_global_id(0);
    for(C = ID; C < 1000000; C += WorkDim)
    {
        for(C2 = ID; C2 < 1000; C2 += WorkDim)
        {
            op3 = op1 + op2;
        }
    }
    out[0] = 90;
    out[1] = 89;
    *S2 = (int) WorkDim;
}
Now my code runs 6.1 times faster on the GPU than on the CPU.
Each work-item there is doing 1000000*1000 = 1G operations. That is just too much; it takes so long that the driver resets the GPU. (I am guessing the global size is 1 in your example.)
It is a total waste of resources to run a CL kernel with so few work-items; it makes the GPU do almost-serial computation and takes far too long.
At least 1024 global work-items are needed on recent GPUs to fully use their resources.
EDIT: The loop is probably optimized away by the compiler when the bound is a compile-time constant, hence the "amazing" speedup.
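To illustrate (a hypothetical minimal fragment, not the original kernel): with a literal stride the compiler knows the trip count at compile time and, since op3 is never read, may unroll or remove the loop entirely; with get_global_size(0) the stride is only known at enqueue time, so the loop generally survives.

// Variant A: compile-time constant stride; the trip count is static, so the
// compiler may unroll the loop or delete it outright (the body is dead code).
for (uint C2 = ID; C2 < 1000; C2 += 16)
    op3 = op1 + op2;

// Variant B: runtime stride; the compiler cannot compute the trip count
// statically, so the loop and its bounds test are usually kept.
for (uint C2 = ID; C2 < 1000; C2 += get_global_size(0))
    op3 = op1 + op2;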

OpenCL: Inserting local atomic_inc to reduction kernel

I am trying to include a local atomic similar to the one described by DarkZeros here within a working reduction kernel. The kernel finds the largest value within a set of points; the aim of the local atomic is to let me filter selected point IDs into an output array without any gaps.
At present, when I use the local atomic to increment the index into a local array, the kernel runs but produces the wrong overall highest point. If the atomic line is commented out, a correct result returns.
What is going on here and how do I fix it?
Simplified kernel code:
__kernel void reduce(__global const float4* dataSet, __global const int* input, const unsigned int items, //points and index
                     __global int* output, __local float4* shared, const unsigned int n, //finding highest
                     __global int* filtered, __global const float2* tri_input, const unsigned int pass, //finding filtered
                     __global int* global_count //global count
){
    //set everything up
    const unsigned int group_id = get_global_id(0) / get_local_size(0);
    const unsigned int local_id = get_local_id(0);
    const unsigned int group_size = items;
    const unsigned int group_stride = 2 * group_size;
    const int local_stride = group_stride * group_size;
    __local float4 *zeroIt = &shared[local_id];
    zeroIt->x = 0; zeroIt->y = 0; zeroIt->z = 0; zeroIt->w = 0;
    volatile __local int local_count_set_1;
    volatile __local int global_val_set_1;
    volatile __local int filter_local[64];
    if(local_id==0){
        local_count_set_1 = 0;
        global_val_set_1 = -1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    int i = group_id * group_stride + local_id;
    while (i < n){
        //load up a pair of points using the index to locate them within a massive dataSet
        int ia = input[i];
        float4 a = dataSet[ia-1];
        int ib = input[i + group_size];
        float4 b = dataSet[ib-1];
        //on the first pass kernel increment a local count
        if(pass == 0){
            filter_local[atomic_inc(&local_count_set_1)] = 1; //including this line causes an erroneous highest point result
            //filter_local[local_id] = 1; //but including this line does not
            //atomic_inc(&local_count_set_1); //and neither does this one
        }
        //find the highest of the pair
        float4 result;
        if(a.z>b.z) result = a;
        else result = b;
        //load up the previous highest result locally
        float4 s = shared[local_id];
        //if the previous highest beat this, stick, else twist
        if(s.z>result.z){ result = s; }
        shared[local_id] = result;
        i += local_stride;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    if (group_size >= 512){
        if (local_id < 256) {
            __local float4 *a = &shared[local_id];
            __local float4 *b = &shared[local_id+256];
            if(b->z>a->z){ shared[local_id] = shared[local_id+256]; }
        }
    }
    //repeat barrier ops in increments down to group_size>=2 - this filters the highest result in shared
    //finally, return the filtered highest result of shared to the global level
    barrier(CLK_LOCAL_MEM_FENCE);
    if(local_id == 0){
        __local float4 *v = &shared[0];
        int send = v->w;
        output[group_id] = send+1;
    }
}
[UPDATE]: When the atomic_inc line is included, the 'wrong' highest-point result is always a point near the end of the test dataset. I'm guessing this means the atomic_inc is affecting a later comparison, but I'm not sure exactly what or where yet.
[UPDATE]: Edited the code to simplify/clarify/update with debugging tweaks. Still not working, and it is driving me loopy.
Total face-palm moment. In the setup phase of the kernel there are the lines:
if(local_id==0){
    local_count_set_1 = 0;
    global_val_set_1 = -1;
}
barrier(CLK_LOCAL_MEM_FENCE);
When these are split and the local_count_set_1 reset is moved inside the while loop, the error does not occur, i.e.:
if(local_id==0) global_val_set_1 = -1;
barrier(CLK_LOCAL_MEM_FENCE);
while (i < n){
    if(local_id==0) local_count_set_1 = 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    ....
    if(pass == 0){
        filter_local[atomic_inc(&local_count_set_1)] = 1;
    }
    ....
I'm hoping this fixes the issue // will update if not. (My guess at why: without the per-iteration reset, the counter keeps climbing across while-loop iterations, so atomic_inc eventually indexes past the end of filter_local[64] and corrupts the adjacent shared array.)
Aaaand that's a weekend I'll never get back.

Parallel reduction using local memory in OpenCL

I implemented a reduce kernel in OpenCL to sum up all entries in an input vector of size N. For easier testing I initialize the input vector with 1.0f, so the result should be N. But it is not!
Here is my reduce-kernel:
kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
    const uint local_id = get_local_id(0);
    const uint global_id = get_global_id(0);
    const uint local_size = get_local_size(0);

    cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
        if (local_id < s) {
            cache[local_id] += cache[local_id + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_id == 0) output[local_size] = cache[0];
}
And here is the setting for OpenCL:
const uint N = 8196;
cl_float a[N];
cl_float b[N];
for (uint i=0; i<N; i++) {
    a[i] = 1.0f;
    b[i] = 0.0f;
}
cl::Buffer inputBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*N);
cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_float)*N);
queue.enqueueWriteBuffer(inputBuffer, CL_TRUE, 0, sizeof(cl_float)*N, a);
queue.enqueueWriteBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b);

cl::Kernel addVectorKernel = cl::Kernel(program, "reduce");
size_t localSize = addVectorKernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device); // e.g. => 512
size_t globalSize = roundUp(localSize, N); // rounds up to a multiple of localSize

addVectorKernel.setArg(0, inputBuffer);
addVectorKernel.setArg(1, resultBuffer);
addVectorKernel.setArg(2, N);
addVectorKernel.setArg(3, (sizeof(cl_float) * localSize), NULL);

queue.enqueueNDRangeKernel(
    addVectorKernel,
    cl::NullRange,
    cl::NDRange(globalSize),
    cl::NDRange(localSize)
);
queue.finish(); // wait for ending
queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b); // e.g. => 1024
The result depends on the workgroup size. What am I doing wrong? Is it the kernel itself or is it the settings for OpenCL?
You should be using the group's id when writing the sum back to global memory.
if (local_id == 0) output[local_size] = cache[0];
That line will write to output[512] repeatedly. You need each work group to write to a dedicated location in the output.
kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
    const uint local_id = get_local_id(0);
    const uint global_id = get_global_id(0);
    const uint group_id = get_group_id(0);
    const uint local_size = get_local_size(0);

    cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
        if (local_id < s) {
            cache[local_id] += cache[local_id + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_id == 0) output[group_id] = cache[0];
}
Then you need to sum the values from the output on the host. Note that 'b' in the host code does not need to hold N elements. Only one element for each work group will be used.
//replace (globalSize/localSize) with the pre-calculated/known number of work groups
for (uint i=1; i<(globalSize/localSize); i++) {
    b[0] += b[i];
}
Now b[0] is your grand total.
In the reduction for loop, you need this:
for(unsigned int s = localSize >> 1; s > 0; s >>= 1)
You are shifting one more bit than you should when initializing s.
After that's fixed, let's look at what your kernel is doing. The host code executes it with a globalSize of 8192 and a localSize of 512, which results in 16 work groups. Inside the kernel you first sum the data from the two consecutive memory locations at index 2*global_id. For the work group with id 15, work item 0, that will be at indices 15*512*2 = 15,360 and 15,361, which are outside the bounds of your input array. I am surprised you don't get a crash. At the same time, this explains why you get double the values you expect.
To fix it, you can do this:
cache[localID] = input[globalID];
Or specify a global size that's half of the current one.
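For instance, for the kernel variant this answer describes (one that loads two elements per work-item), the enqueue might look like this (a sketch reusing the asker's variable names; roundUp is the asker's helper):

// Sketch: if each work-item consumed two input elements, you would launch
// half as many work-items as there are elements, rounded up to the group size.
size_t globalSize = roundUp(localSize, N / 2);
queue.enqueueNDRangeKernel(
    addVectorKernel,
    cl::NullRange,
    cl::NDRange(globalSize),
    cl::NDRange(localSize)
);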

Optimal workgroup size for sum reduction in OpenCL

I am using the following kernel for sum reduction.
__kernel void reduce(__global float* input, __global float* output, __local float* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);
    unsigned int stride = gid * 2;
    sdata[tid] = input[stride] + input[stride + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize >> 2; s > 0; s >>= 1)
    {
        if(tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}
It works fine, but I don't know how to choose the optimal workgroup size or number of workgroups if I need more than one workgroup (for example if I want to calculate the sum of 1048576 elements). As far as I understand, the more workgroups I use, the more subresults I will get, which also means that I will need more global reductions at the end.
I've seen the answers to the general workgroup size question here. Are there any recommendations that concern reduction operations specifically?
This question is a possible duplicate of one I answered a while back:
What is the algorithm to determine optimal work group size and number of workgroup.
Experimentation will be the best way to know for sure for any given device.
Update:
I think you can safely stick to 1-dimensional work groups, as you have done in your sample code. On the host, you can try out the best values.
For each device:
1) Query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
2) Loop over a few multiples and run the kernel with each group size; save the execution time for each test.
3) When you think you have an optimal value, hard-code it into a new kernel for use with that specific device. This will give a further boost to performance. You can also eliminate your sdata parameter in the device-specific kernel.
//define your own context, kernel, queue here
cl_int err;
size_t global_size; //set this somewhere to match your test data size
size_t preferred_size;
size_t max_group_size;

err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_size, NULL);
//check err
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &max_group_size, NULL);
//check err

size_t test_size;
//your vars for hi-res timer go here
for (size_t i=preferred_size; i<=max_group_size; i+=preferred_size){
    //reset timer
    test_size = i;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &test_size, 0, NULL, NULL);
    if(err){
        fail("Unable to enqueue kernel"); //implement your own fail function somewhere..
    }else{
        clFinish(queue);
        //stop timer, save value
        //output timer value and test_size
    }
}
The device-specific kernel can look like this, except the first line should have your optimal value substituted:
#define LOCAL_SIZE 32
__kernel void reduce(__global float* input, __global float* output)
{
    unsigned int tid = get_local_id(0);
    unsigned int stride = get_global_id(0) * 2;

    __local float sdata[LOCAL_SIZE];
    sdata[tid] = input[stride] + input[stride + 1];
    barrier(CLK_LOCAL_MEM_FENCE);

    for(unsigned int s = LOCAL_SIZE >> 2; s > 0; s >>= 1){
        if(tid < s){
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(tid == 0) output[get_group_id(0)] = sdata[0];
}
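As for the sub-results the question mentions: a common pattern is simply to relaunch the same reduction kernel on the partial sums until a single value remains. A rough host-side sketch, assuming the kernel above; the buffer and object names (input_buf, output_buf, kernel, queue) are placeholders, and the source buffer is assumed to be zero-padded to a multiple of 2*LOCAL_SIZE:

// Sketch: iterative multi-pass reduction. Each pass shrinks the problem by a
// factor of 2*LOCAL_SIZE, since every work-item loads two elements.
size_t n = 1048576;                       // element count for the first pass
size_t local_size = 32;                   // must match LOCAL_SIZE in the kernel
cl_mem src = input_buf, dst = output_buf; // assumed pre-created buffers
while (n > 1) {
    size_t groups = (n / 2 + local_size - 1) / local_size;
    size_t global_size = groups * local_size;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &src);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dst);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    n = groups;                           // one partial sum per work group
    cl_mem tmp = src; src = dst; dst = tmp; // ping-pong buffers for the next pass
}
// The final sum ends up in element 0 of 'src' (note that both buffers are
// overwritten along the way).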
