OpenCL random kernel behaviour when certain system size is exceeded - opencl

I am having a problem like this. Basically, I have a 2D grid allocated on host:
double* grid = (double*)malloc(sizeof(double)*(ny*nx) * 9);
Folllowing normal openCL procedure to put it on the openCL device:
cl_mem cl_grid = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, sizeof(double) * (ny*nx) * 9, grid, &error);
And Enqueue and launch:
clEnqueueNDRangeKernel(queue, foo, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
In the kernel function, simple arithmetic is performed on the 1st column of the grid:
__kernel void foo(__constant ocl_param* params, __global double* grid)
{
const int ii = get_global_id(0);
int jj;
jj=0;
if (ii < params->ny) {
grid[getIndexUsingMacro(ii,jj)] += params->someNumber;
}
}
And finally read back the buffer and check values.
clEnqueueReadBuffer(queue, cl_grid, CL_TRUE, 0, sizeof(double) * 9 * nx * ny, checkGrid, 0, NULL, NULL);
The problem is when the grid size (i.e. nx * ny * 9) exceeds 16384 * 9 * 8 bytes = 1152KB (* 8 since double precision is used).
if using openCL on CPU, an error CL_OUT_OF_RESOURCES is thrown when launching the kernel no matter what I set for global_ws and local_ws (I set them to 1 and the error is still thrown). The CPU is an Intel i5 2415m with 8GB of RAM and 3MB cache.
If using openCL on the GPU (NVIDIA TESLA M2050), no error is thrown. However, when reading back the value from the buffer, the grid is not changed at all. It means it returns the grid whose values are exactly the same as before it is sent to the kernel function.
For e.g. When I set nx = 30, ny = 546, nx*ny = 16380, everything runs fine. The grid returned with the results changed as expected. But when ny = 547, nx* ny = 16410, the problem occurs both on CPU and GPU as described above. The problem is the same if I swap nx and ny, hence, if nx = 547, ny = 30, it still happens. Can you guys suggest what might be the problem here ?
Many thanks

It looks like a synchronization issue. grid[index] += value with the same index value may be executed concurrently by several work items. This operation is not atomic, and all these work items will load grid[index], add their value, and store it back, possibly losing some increments in the process.
To solve this, you can synchronize these work items using barrier if they are in a single work group, or enqueuing more kernels otherwise.
Another possibility is to ensure only one work item is able to modify a given element of the grid (usually the best solution).
If several work items need to work on a common subset of the grid, using local memory and local memory barriers may be useful.

Related

OpenCL: 3D array processing - Globale size limit

I'm working with an 3D array of dimension xdim=49, ydim=1024 and zdim=64. my DEVICE_MAX_WORK_ITEM_SIZES is only 512/512/512. If I declare my
size_t global_work_size = {xdim, ydim, zdim}; and launch an 3D kernel,
I'm getting wrong results since my ydim > 512. If all my dimensions are below 512, I'm getting the expected results. Please let me know if there's an alternative for this?
CL_DEVICE_MAX_WORK_ITEM_SIZES only limits the size of work groups, not the global work item size (yea, it's a terrible name for the constant). You are much more tightly restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE which is the total number of items allowed in a work group (you'd typically hit this far sooner than CL_DEVICE_MAX_WORK_ITEM_SIZES because of multiplication.
So go ahead an launch your global work size of 49, 1024, 64. It should work. If it's not, you're using get_local_id instead of get_global_id or have some other bug. We regularly launch 2D kernels with 4096 x 4096 global work size.
See also Questions about global and local work size
If you don't use shared local memory, you don't need to worry about local work group sizes. In fact, you can pass NULL instead of a pointer to an array of sizes for local_work_size and let the runtime pick something (it helps if your global dimensions are easily divisible by small numbers).
Assuming the dimensions you provided are the size of your data, you can decrease the global work size by making each GPU thread calculate more data. What I mean is, every thread in your case will do one calculation and if you change your kernels to do let's say 2 calculations in y dimension, than you could cut the number of threads you are firing into half. The global_work_size decides how many threads in each direction you are executing. Let me give you an example:
Let's assume you have an array you want to do some calculations with and the array size you have is 2048. If you write your kernel in the following way, you are going to need 2048 as the global_work_size:
__kernel void calc (__global int *A, __global int *B)
{
int i = get_global_id(0);
B[i] = A[i] * 5;
}
The global work size in this case will be:
size_t global_work_size = {2048, 1, 1};
However, if you change your kernel into the following kernel, you can lower your global work size as well: ()
__kernel void new_calc (__global int *A, __global int *B)
{
int i = get_global_id(0);
for (int ind = 0; ind < 8; ind++)
B[i*8 + ind] = A[i*8 + ind] * 5;
}
Then this way, you can use global size as:
size_t global_work_size = {256, 1, 1};
Also with the second kernel, each of your threads will execute more work, resulting in more utilisation.

Random NaN and incorrect results with OpenCL kernel

I am trying to implement a general matrix-matrix multiplication OpenCL kernel, one that conforms to C = α*A*B + β*C.
The Kernel
I did some research online and decided to use a modified kernel from this website as a starting point. The main modification I have made is that allocation of local memory as working space is now dynamic. Below is the kernel I have written:
__kernel
void clkernel_gemm(const uint M, const uint N, const uint K, const float alpha,
__global const float* A, __global const float* B, const float beta,
__global float* C, __local float* Asub, __local float* Bsub) {
const uint row = get_local_id(0);
const uint col = get_local_id(1);
const uint TS = get_local_size(0); // Tile size
const uint globalRow = TS * get_group_id(0) + row; // Row ID of C (0..M)
const uint globalCol = TS * get_group_id(1) + col; // Row ID of C (0..N)
// Initialise the accumulation register
float acc = 0.0f;
// Loop over all tiles
const int numtiles = K / TS;
for (int t = 0; t < numtiles; t++) {
const int tiledRow = TS * t + row;
const int tiledCol = TS * t + col;
Asub[col * TS + row] = A[tiledCol * M + globalRow];
Bsub[col * TS + row] = B[globalCol * K + tiledRow];
barrier(CLK_LOCAL_MEM_FENCE);
for(int k = 0; k < TS; k++) {
acc += Asub[k * TS + row] * Bsub[col * TS + k] * alpha;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
C[globalCol * M + globalRow] = fma(beta, C[globalCol * M + globalRow], acc);
}
Tile Size (TS) is now a value defined in the calling code, which looks like this:
// A, B and C are 2D matrices, their cl::Buffers have already been set up
// and values appropriately set.
kernel.setArg(0, (cl_int)nrowA);
kernel.setArg(1, (cl_int)ncolB);
kernel.setArg(2, (cl_int)ncolA);
kernel.setArg(3, alpha);
kernel.setArg(4, A_buffer);
kernel.setArg(5, B_buffer);
kernel.setArg(6, beta);
kernel.setArg(7, C_buffer);
kernel.setArg(8, cl::Local(sizeof(float) * nrowA * ncolB));
kernel.setArg(9, cl::Local(sizeof(float) * nrowA * ncolB));
cl::NDRange global(nrowA, ncolB);
cl::NDRange local(nrowA, ncolB);
status = cmdq.enqueueNDRangeKernel(kernel, cl::NDRange(0), global, local);
The Problem
The problem I am encountering is, unit tests (written with Google's gtest) I have written will randomly fail, but only for this particular kernel. (I have 20 other kernels in the same .cl source file that pass tests 100% of the time)
I have a test that multiplies a 1x4 float matrix {0.0, 1.0, 2.0, 3.0} with a transposed version of itself {{0.0}, {1.0}, {2.0}, {3.0}}. The expected output is {14.0}.
However, I can get this correct result maybe just 75% of the time.
Sometimes, I can get 23.0 (GTX 970), 17.01 (GTX 750) or just -nan and 0.0 (all 3 devices). The curious part is, the respective incorrect results seem to be unique to the devices; I cannot seem to, for example, get 23.0 on the Intel CPU or the GTX 750.
I am baffled because if I have made an algorithmic or mathematical mistake, the mistake should be consistent; instead I am getting incorrect results only randomly.
What am I doing wrong here?
Things I have tried
I have verified that the data going into the kernels are correct.
I have tried to initialize both __local memory to 0.0, but this causes all results to become wrong (but frankly, I'm not really sure how to initialize it properly)
I have written a test program that only executes this kernel to rule out any race conditions interacting with the rest of my program, but the bug still happens.
Other points to note
I am using the C++ wrapper retrieved directly from the Github page.
To use the wrapper, I have defined CL_HPP_MINIMUM_OPENCL_VERSION 120 and CL_HPP_TARGET_OPENCL_VERSION 120.
I am compiling the kernels with the -cl-std=CL1.2 flag.
All cl::Buffers are created with only the CL_MEM_READ_WRITE flag.
I am testing this on Ubuntu 16.04, Ubuntu 14.04, and Debian 8.
I have tested this on Intel CPUs with the Intel OpenCL Runtime 16.1 for Ubuntu installed. The runtime reports that it supports up to OpenCL 1.2
I have tested this on both Nvidia GTX 760 and 970. Nvidia only supports up to OpenCL 1.2.
All 3 platforms exhibit the same problem with varying frequency.
This looks like a complicated one. There are several things to address and they won't fit into comments, so I'll post all this as an answer even though it does not solve your problem (yet).
I am baffled because if I have made an algorithmic or mathematical
mistake, the mistake should be consistent; instead I am getting
incorrect results only randomly.
Such a behavior is a typical indicator of race conditions.
I have tried to initialize both __local memory to 0.0, but this causes
all results to become wrong (but frankly, I'm not really sure how to
initialize it properly)
Actually this is a good thing. Finally we have some consistency.
Initializing local memory
Initializing local memory can be done using the work items, e.g. if you have a 1D workgroup of 16 items and your local memory consists of 16 floats, just do this:
local float* ptr = ... // your pointer to local memory
int idx = get_local_id(0); // get the index for the current work-item
ptr[idx] = 0.f; // init with value 0
barrier(CLK_LOCAL_MEM_FENCE); // synchronize local memory access within workgroup
If your local memory is larger, e.g. 64 floats, you will have to use a loop where each work item initializes 4 values, at least that is the most efficient way. However, no one will stop you from using every work item to initialize every value in the local memory, even though that is complete nonsense since you're essentially initializing it multiple times.
Your changes
The original algorithm looks like it is especially designed to use quadratic tiles.
__local float Asub[TS][TS];
__local float Bsub[TS][TS];
Not only that but the size of local memory matches the workgroup size, in their example 32x32.
When I look at your kernel parameters for local memory, I can see that you use parameters that are defined as M and N in the original algorithm. This doesn't seem correct.
Update 1
Since you have not described if the original algorithm works for you, this is what you should do to find your error:
Create a set of testdata. Make sure you only use data sizes that are actually supported by the original algorithm (e.g. minimum size, mulitples of x, etc.). Also, use large data sets since some errors only show if multiple workgroups are dispatched.
Use the original, unaltered algorithm with your testdata sets and verify the results.
Change the algorithm only that instead of fixed size local memory, dynamic local memory size is used, but make sure it has the same size as the fixed size approach. This is what you tried but I think it failed due to what I have described under "Your changes".

Memory object allocation in Opencl for dynamic array in structure

I have created following structure 'data' in C
typedef struct data
{
double *dattr;
int d_id;
int bestCent;
}Data;
The 'dattr' is an array in above structure which is kept dynamic.
Suppose I have to create 10 objects of above structure. i.e.
dataNode = (Data *)malloc (sizeof(Data) * 10);
and for every object of this structure I have to reallocate the memory in C for array 'dattr' using:
for(i=0; i<10; i++)
dataNode[i].dattr = (double *)malloc(sizeof(double) * 3);
What should do to implement the same in OpenCL? How to allocate the memory for array 'dattr' once I allocate the memory for structure objects?
Memory allocation in OpenCL devices (for example, a GPU) must be performed in the host thread using clCreateBuffer (or clCreateImage2D/3D if you wish to use texture memory). These functions allow you automatically copy host data (created with malloc for example) to the device, but I usually prefer to explicitly use clEnqueueWriteBuffer/clEnqueueMapBuffer (or clEnqueueWriteImage/clEnqueueMapImage if using texture memory), so that I can profile the data transfers. Here's an example:
#define DATA_SIZE 1000
typedef struct data {
cl_uint id;
cl_uint x;
cl_uint y;
} Data;
...
// Allocate data array in host
size_t dataSizeInBytes = DATA_SIZE * sizeof(Data);
DATA * dataArrayHost = (DATA *) malloc(dataSizeInBytes);
// Initialize data
...
// Create data array in device
cl_mem dataArrayDevice = clCreateBuffer(context, CL_MEM_READ_ONLY, dataSizeInBytes, NULL, &status );
// Copy data array to device
status = clEnqueueWriteBuffer(queue, dataArrayDevice, CL_TRUE, 0, dataSizeInBytes, &dataArrayHost, 0, NULL, NULL );
// Make sure to pass dataArrayDevice as kernel parameter
// Run kernel
...
What you need to consider is that you need to know the memory requirements of an OpenCL kernel before you execute it. As such memory allocation can be dynamic if performed before kernel execution (i.e. in host). Nothing stops you from calling the kernel several times, and in each of those times adjusting (allocating) the kernel memory requirements.
Having this into account, I advise you to rethink the way your approaching the problem. To begin, it is simpler (but not necessarily more efficient) to work with arrays of structures, than with structures of arrays (in which case, the arrays would have to have a fixed size anyway).
This is just to give you an idea of how OpenCL works. Take a look at Khronos OpenCL resource page, it has plenty of OpenCL tutorials and examples, and Khronos OpenCL page, which has the official OpenCL references, man pages and quick references cards.
As suggested by Faken if you are concern with dynamic memory allocation and you are eager to change the algorithm a little bit, here is some hint:
The following code dynamically allocates local memory space and passes it as the 8th argument to the OpenCL kernel:
int N; //Number_of_data_points, which will keep on changing as per your requirement
size_t localMemSize = ( N* sizeof(int));
...
// Dynamically allocate local memory (allocated per workgroup)
clSetKernelArg(kernel, 8, localMemSize, NULL);

enqueueWriteImage fail on GPU

I am developing some kernels which works with image buffers. The problem is that when I create my Image2D by directly copying the data of the image, everything works well.
If I try to enqueue a write to my image buffer, it won't works for my GPU.
Here is a basic kernel :
__kernel void myKernel(__read_only image2d_t in, __write_only image2d_t out) {
const int x = get_global_id(0);
const int y = get_global_id(1);
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
uint4 pixel = read_imageui(in, sampler, (int2)(x, y));
write_imageui(out, (int2)(x, y), pixel);
}
Well, that simple kernel give me a black image on my GPU, but works well on my CPU.
To make it works, I have to do release the buffer image and creating a new one by directly passing data using CL_MEM_COPY_HOST_PTR.
I use the good data format : CL_RGBA, CL_UNSIGNED_INT8 and the size of my image is good.
The problem has been encountered with JOCL and the C++ binding of the API. (I didn't test the C API).
Finally, it runs by recreating the buffer, but is it a good idea ? Is it just normal ? Which actions can I perform to avoid it ?
By the way, I'm running on Intel SDK for OpenCL (Intel Core I7) and ATI AMD APP SDK (HD6800).
[edit]
Here is the code I use to write in my buffers.
First, the allocation part :
cl_image_format imageFormat = new cl_image_format();
imageFormat.image_channel_order = CL_RGBA;
imageFormat.image_channel_data_type = CL_UNSIGNED_INT8;
inputImageMem = clCreateImage2D(
context, CL_MEM_READ_ONLY,
new cl_image_format[]{imageFormat}, imageSizeX, imageSizeY,
0, null, null);
And when running, called for each frame, the part which doesn't work on GPU :
clEnqueueWriteImage(commandQueue, inputImageMem, CL_TRUE, new long[]{0, 0, 0},
new long[]{imageSizeX, imageSizeY, 1}, 0, 0,
Pointer.to(data), 0, null, null);
The part which works on both GPU and CPU but force me to recreate the buffer :
clReleaseMemObject(inputImageMem);
cl_image_format imageFormat = new cl_image_format();
imageFormat.image_channel_order = CL_RGBA;
imageFormat.image_channel_data_type = CL_UNSIGNED_INT8;
inputImageMem = clCreateImage2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, new cl_image_format[]{imageFormat}, imageSizeX, imageSizeY, 0, Pointer.to(data), null);
The data sent is an array of int of size imageSizeX*imageSizeY. I get it by this code :
DataBufferInt dataBuffer = (DataBufferInt)image.getRaster().getDataBuffer();
int data[] = dataBuffer.getData();
The code above is in java using JOCL, the same problem appear in an another C++ program using the C++ OpenCL Wrapper. The only differences are that in Java the virtual machine crash (after 3~4 frames) and in C++ the result is a black image.
Well, I found the problem. That was my drivers acting weird.
I was using the 12.4 version (the version I installed when I began to work with OpenCL) and I just installed the 12.6 version and the problem just disappeared.
So, keep your drivers up to date !

OpenCL kernel work-group size restriction

So I keep running into strange errors when I call my kernels; the stated max kernel work-group size is one, while the work group size of my device (my Macbook) is decidedly higher than that. What possible causes could there be for the kernels restricting the code to a single work group? Here's one of my kernels:
__kernel
void termination_kernel(const int Elements,
__global float* c_I,
__global float* c_Ihat,
__global float* c_rI,
__local float* s_a)
{
const int bdim = 128;
int n = get_global_id(0);
const int tx = get_local_id(0); // thread index in thread-block (0-indexed)
const int bx = get_group_id(0); // block index (0-indexed)
const int gx = get_num_groups(0);
// is thread in range for the addition
float d = 0.f;
while(n < Elements){
d += pow(c_I[n] - c_Ihat[n], 2);
n += gx * bdim;
}
// assume bx power of 2
int alive = bdim / 2;
s_a[tx] = d;
barrier(CLK_LOCAL_MEM_FENCE);
while(alive > 1){
if(tx < alive)
s_a[tx] += s_a[tx + alive];
alive /= 2;
barrier(CLK_LOCAL_MEM_FENCE);
}
if(tx == 0)
c_rI[bx] = s_a[0] + s_a[1];
}
and the error returned is
OpenCL Error (via pfn_notify): [CL_INVALID_WORK_GROUP_SIZE] : OpenCL Error : clEnqueueNDRangeKernel
failed: total work group size (128) is greater than the device can support (1)
OpenCL Error: 'clEnqueueNDRangeKernel(queue, kernel_N, dim, NULL, global_N, local_N, 0, NULL, NULL)'
I know it says the restriction is on the device, but debugging shows that
CL_DEVICE_MAX_WORK_GROUP_SIZE = 1024
and
CL_KERNEL_WORK_GROUP_SIZE = 1
The kernel construction is called by
char *KernelSource_T = readSource("Includes/termination_kernel.cl");
cl_program program_T = clCreateProgramWithSource(context, 1, (const char **) &KernelSource_T, NULL, &err);
clBuildProgram(program_T, 1, &device, flags, NULL, NULL);
cl_kernel kernel_T = clCreateKernel(program_T, "termination_kernel", &err);
I'd include the calling function, but I'm not sure if it's relevant; my intuition is that it's something in the kernel code that's forcing the restriction. Any ideas? Thanks in advance for the help!
Apple OpenCL doesn't support work-groups larger than [1, 1, 1] on the CPU. I have no idea why, but that's how it's been at least up to OSX 10.9.2. Larger work-groups are fine on the GPU, though.
CL_KERNEL_WORK_GROUP_SIZE tells you how large the maximum work group size can be for this particular kernel. OpenCL's runtime determines that by inspecting the kernel code. CL_KERNEL_WORK_GROUP_SIZE will be a number less or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.
Hope the amount of local memory avilable is less for that work group size . Please can you show the arguments? . You can try by reducing the work group size , start with 2,4,8,16,32,64,128 so on make sure its power of 2.
Time has passed since the answer of Tomi and it seems that Apple has become slightly more flexible on this aspect. On my OS X 10.12.3 (still OpenCL 1.2), it is possible to use up to CL_DEVICE_MAX_WORK_GROUP_SIZE in the first dimension.
According to the specification, it is also possible to get the maximum number of work-groups for each dimension through CL_DEVICE_MAX_WORK_ITEM_SIZES according to the documentation

Resources