Local work size for two dimensions - opencl

I am trying to implement a nested for loop in an OpenCL application, which I am treating as a two-dimensional problem. The global work size is not a multiple of the block dimension, so I am declaring the sizes as follows:
size_t global_work_size[2] = {length1,length2};
size_t local_work_size[2] = {NULL,NULL};
err = clEnqueueNDRangeKernel(commands, Kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
I am getting a CL_INVALID_WORK_SIZE error. What should I change my local work size to?

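Just call the kernel like this, passing NULL as the local_work_size argument so that the OpenCL implementation chooses the work-group size itself:
err = clEnqueueNDRangeKernel(commands, Kernel, 2, NULL, global_work_size, NULL, 0, NULL, NULL);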
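If an explicit work-group size is wanted even though the problem size is not a multiple of it, a common approach is to round each global dimension up to the next multiple of the local size. The following is a minimal host-side sketch in plain C; the helper names and the 16x16 local size used in the usage note are hypothetical and not part of the question:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of `local` that is >= `global`.
   Assumes local > 0. */
static size_t round_up_to_multiple(size_t global, size_t local)
{
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

/* Hypothetical helper: fills the two size arrays that would be passed to
   clEnqueueNDRangeKernel for a 2D range of length1 x length2 work-items. */
static void padded_sizes_2d(size_t length1, size_t length2,
                            size_t local0, size_t local1,
                            size_t global_out[2], size_t local_out[2])
{
    local_out[0] = local0;
    local_out[1] = local1;
    global_out[0] = round_up_to_multiple(length1, local0);
    global_out[1] = round_up_to_multiple(length2, local1);
}
```

For example, padded_sizes_2d(1000, 30, 16, 16, global, local) yields a global size of 1008 x 32. With padding like this, the kernel needs the real lengths as arguments and must return early when get_global_id(0) >= length1 or get_global_id(1) >= length2, so the extra work-items do not touch memory outside the buffers.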


OpenCL kernel math outputs incorrect results

I am currently trying to implement an OpenCL kernel. The kernel is supposed to output a number of previously calculated elements divided by the total number of elements remapped to a value from 0 to 255.
The kernel runs in a single work group with 256 work items where LX is the local ID:
#define LX get_local_id(0)
kernel void reduceStatistic(global int *inout, int nr_workgroups, int nr_pixels)
{
    int i = 1;
    for (; i < nr_workgroups; i++)
        inout[LX] += inout[LX + i * 256];
    inout[LX] = (int)floor(((float)inout[LX] / (float)nr_pixels) * 256.0f);
}
The calculation before the remapping operation is for clean up after a previous calculation on the same buffer.
The first item of inout[LX] after the cleanup is 17176, the nr_pixels is 160000 so this should result in a value of 27 using the calculation above. The code, however, returns 6.
The relevant host-side code is as follows:
// nr_workgroups is of type int
cl_mem outputBuffer = clCreateBuffer(mgr->context, CL_MEM_READ_WRITE, nr_workgroups * 256 * sizeof(cl_int), NULL, NULL);
// another kernel writes into outputBuffer
// set kernel arguments
clSetKernelArg(mgr->reduceStatisticKernel, 0, sizeof(outputBuffer), &outputBuffer);
clSetKernelArg(mgr->reduceStatisticKernel, 1, sizeof(cl_int), &nr_workgroups);
clSetKernelArg(mgr->reduceStatisticKernel, 2, sizeof(cl_int), &imgSeqSize);
size_t global_work_size_statistics[1] = { 256 };
size_t local_work_size_statistics[1] = { 256 };
// run the kernel
clEnqueueNDRangeKernel(mgr->commandQueue, mgr->reduceStatisticKernel, 1, NULL, global_work_size_statistics, local_work_size_statistics, 0, NULL, NULL);
// read result
cl_int *reducedResult = new cl_int[256];
clEnqueueReadBuffer(mgr->commandQueue, outputBuffer, CL_TRUE, 0, 256 * sizeof(cl_int), reducedResult, 0, NULL, NULL);
Help much appreciated! (:
We established in the comments that the global buffer index calculation is wrong:
inout[LX] += inout[LX + i * 265];
The multiplier should be 256, not 265.
Going out of range on a buffer leads to undefined behaviour, so this is always one of the prime culprits to look for.
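One way to check the expected numbers independently of the GPU is to replay the kernel's arithmetic on the host. The sketch below is plain C, not OpenCL; reduce_statistic_host is a hypothetical stand-in that loops over the 256 local IDs instead of running them in parallel, and it uses an integer cast in place of floor, which gives the same result for non-negative values:

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 256

/* Hypothetical host-side replay of reduceStatistic: each loop iteration
   plays the role of one work-item with local ID lx. */
static void reduce_statistic_host(int *inout, int nr_workgroups, int nr_pixels)
{
    for (int lx = 0; lx < GROUP_SIZE; lx++) {
        /* Sum the partial results of all work-groups into the first 256 slots.
           The stride must equal the group size (256), otherwise the index
           runs past the end of the buffer. */
        for (int i = 1; i < nr_workgroups; i++) {
            inout[lx] += inout[lx + i * GROUP_SIZE];
        }
        /* Remap the count to the range 0..255; the cast truncates,
           which matches floor for non-negative values. */
        inout[lx] = (int)(((float)inout[lx] / (float)nr_pixels) * 256.0f);
    }
}
```

With a summed count of 17176 in slot 0 and nr_pixels set to 160000, this replay produces 27, which is the value the question expects from the kernel.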

OpenCL: Call parameter type does not match function signature

I'm using #pragma OPENCL EXTENSION cl_khr_fp16 : enable on a GPU that supports it, with OpenCL 1.2. I wanted to check the performance improvement from changing the float precision from 32 to 16 bits. In my device kernel, I converted all float to half as shown below:
__kernel void copy_kernel(int N, __global half *X, __global half *Y)
int i = get_global_id(0);
if(i < N) Y[i] = X[i];
In my host side, I made cl_mem point to array of cl_half. Host program looks as shown below:
void copy(int N, cl_mem X, cl_mem Y)
cl_kernel kernel = get_copy_kernel();
cl_command_queue queue = cl.queue;
cl_uint i = 0;
cl.error = clSetKernelArg(kernel, i++, sizeof(N), (void*) &N);
cl.error = clSetKernelArg(kernel, i++, sizeof(X), (void*) &X);
cl.error = clSetKernelArg(kernel, i++, sizeof(Y), (void*) &Y);
size_t gsize = N;
cl.error = clEnqueueNDRangeKernel(queue, kernel, 1, 0, &gsize, 0, 0, 0, NULL);
But while compiling the kernel, I get the below error:
Call parameter type does not match function signature!
%32 = load half addrspace(1)* %31, align 2
float %33 = call float #llvm.nvvm.mul.rn.f(half %32, half %19)
Broken module found, compilation terminated!
You are passing a half variable to the kernel, but the kernel expects a pointer to an array of halves.
If you want to pass an array of halves to the GPU, you still have to use cl_mem objects that contain the array of halves.
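On the host, cl_half is a plain 16-bit integer type, so the data copied into such a cl_mem object has to be converted from float to half bit patterns first. The following is a minimal sketch of that conversion in plain C. The function names are hypothetical, and the conversion makes simplifying assumptions: it truncates the mantissa instead of rounding to nearest, flushes values too small for a normal half to zero, and maps overflow to infinity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: converts one 32-bit float to IEEE 754 half bits.
   Simplifications: truncating rounding, subnormal results flushed to zero. */
static uint16_t float_to_half_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);              /* reinterpret safely */

    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    uint32_t exp_field = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x007FFFFFu;
    int32_t exponent = (int32_t)exp_field - 127 + 15;  /* rebias 8-bit to 5-bit */

    if (exp_field == 0xFFu) {                    /* infinity or NaN */
        return (uint16_t)(sign | 0x7C00u | (mantissa ? 0x0200u : 0u));
    }
    if (exponent >= 31) {                        /* too large: infinity */
        return (uint16_t)(sign | 0x7C00u);
    }
    if (exponent <= 0) {                         /* too small: signed zero */
        return sign;
    }
    return (uint16_t)(sign | ((uint32_t)exponent << 10) | (mantissa >> 13));
}

/* Hypothetical helper: fills dst with the half bit patterns of src, ready to
   be copied into a buffer of n * sizeof(uint16_t) bytes. */
static void pack_halves(const float *src, uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = float_to_half_bits(src[i]);
    }
}
```

The packed array can then be written to a buffer created with a size of n * sizeof(cl_half) bytes and passed to the kernel as the __global half * argument.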

OpenCL global_work_size misunderstanding

I'm trying to understand a simple OpenCL example, which is vector addition. The kernel is the following:
__kernel void addVec(__global double* a, __global double* b, __global double* c)
size_t id = get_global_id(0);
c[id] = a[id] + b[id];
For example, my input arrays have a size of 1 million elements each.
In my host program, I set global_work_size to exactly the size of the input arrays (1 million).
But when I set it to a smaller value, for example 1000, it also works with this kernel!
I don't understand why the global_work_size can be smaller than the problem dimension while the OpenCL program still computes every element of the input arrays.
Could someone clarify on this?
EDIT: here is the code where I copy the data:
size_t arraySize = 1000000;
const size_t global_work_size[1] = {512};
double *host_a = malloc(arraySize*sizeof(double));
double *host_b = malloc(arraySize*sizeof(double));
double *host_c = calloc(arraySize, sizeof(double));
// Create the input and output arrays in device memory for our calculation
device_a = clCreateBuffer(context, CL_MEM_READ_ONLY, arraySize*sizeof(double), NULL, NULL);
device_b = clCreateBuffer(context, CL_MEM_READ_ONLY, arraySize*sizeof(double), NULL, NULL);
device_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, arraySize*sizeof(double), NULL, NULL);
// Copy data set into the input array in device memory. [host --> device]
status = clEnqueueWriteBuffer(command_queue, device_a, CL_TRUE, 0, arraySize*sizeof(double), host_a, 0, NULL, NULL);
status |= clEnqueueWriteBuffer(command_queue, device_b, CL_TRUE, 0, arraySize*sizeof(double), host_b, 0, NULL, NULL);
// Copy-back the results from the device [host <-- device]
clEnqueueReadBuffer(command_queue, device_c, CL_TRUE, 0, arraySize*sizeof(double), host_c, 0, NULL, NULL );
printf("checking result validity ...\n");
for (size_t i=0; i<arraySize; ++i)
if(host_c[i] - 1 > 1e-6) // the array is supposed to be 1 everywhere
printf("*** ERROR! Invalid results ! host_c[%zi]=%.9lf\n", i, host_c[i]);
Your validity check is too weak: it reports an error only when a value is greater than 1, so any value below 1 (including 0) passes silently. It should test both sides, like this:
for (size_t i=0; i<arraySize; ++i){
cl_double val = host_c[i] - 1; // the array is supposed to be 1 everywhere
if((val > 1e-6) || (val < -1e-6))
printf("*** ERROR! Invalid results ! host_c[%zi]=%.9lf\n", i, host_c[i]);
}
Uninitialized values in GPU memory are likely to be 0, which passes your original check.
Additionally, remember that if you run the program once with the full size, later runs may still read the correctly processed data (even if you close and reopen the application), since GPU memory is not cleared when a buffer is created or destroyed.
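The effect can be reproduced on the host without a GPU. The sketch below is plain C with hypothetical function names: add_vec_simulated stands in for the addVec kernel by running one loop iteration per work-item, and the two counting functions compare the original one-sided check with the two-sided one. It assumes the output array starts out zeroed, as calloc does for host_c in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the addVec kernel: one iteration per work-item,
   so only the first global_work_size elements of c are written. */
static void add_vec_simulated(const double *a, const double *b, double *c,
                              size_t global_work_size)
{
    for (size_t id = 0; id < global_work_size; ++id) {
        c[id] = a[id] + b[id];
    }
}

/* The check from the question: flags only values greater than 1. */
static size_t count_errors_one_sided(const double *c, size_t n)
{
    size_t errors = 0;
    for (size_t i = 0; i < n; ++i) {
        if (c[i] - 1 > 1e-6) {
            ++errors;
        }
    }
    return errors;
}

/* The corrected check: flags values that differ from 1 in either direction. */
static size_t count_errors_two_sided(const double *c, size_t n)
{
    size_t errors = 0;
    for (size_t i = 0; i < n; ++i) {
        double diff = c[i] - 1;
        if (diff > 1e-6 || diff < -1e-6) {
            ++errors;
        }
    }
    return errors;
}
```

With 1000 elements whose inputs sum to 1 and a simulated global size of 512, the one-sided check reports no errors, while the two-sided check flags the 488 elements that were never computed.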

Optimising Host to GPU transfer

I am offloading work to GPU using OpenCL (a variant of matrix multiplication). The matrix code itself works fantastically well, but the cost of moving data to GPU is prohibitive.
I've moved from using clEnqueueReadBuffer/clEnqueueWriteBuffer to memory-mapped buffers as follows:
d_a = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_ALLOC_HOST_PTR,
sizeof(char) * queryVector_size, NULL, &err);
checkErr(err,"Buf A");
d_b = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_ALLOC_HOST_PTR,
sizeof(char) * segment_size, NULL, &err);
checkErr(err,"Buf B");
err = clSetKernelArg(ko_smat, 0, sizeof(cl_mem), &d_c);
checkErr(err,"Compute Kernel");
err = clSetKernelArg(ko_smat, 1, sizeof(cl_mem), &d_a);
checkErr(err,"Compute Kernel");
err = clSetKernelArg(ko_smat, 2, sizeof(cl_mem), &d_b);
checkErr(err,"Compute Kernel");
query_vector = (char*) clEnqueueMapBuffer(commands, d_a, CL_TRUE,CL_MAP_READ, 0, sizeof(char) * queryVector_size, 0, NULL, NULL, &err);
checkErr(err,"Write A");
segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE,CL_MAP_READ, 0, sizeof(char) * segment_size, 0, NULL, NULL, &err);
checkErr(err,"Write B");
// code which initialises buffers using ptrs (segment_data and queryV)
err = clEnqueueUnmapMemObject(commands, d_a, query_vector, 0, NULL, NULL);
checkErr(err,"Unmap Buffer");
err = clEnqueueUnmapMemObject(commands, d_b, segment_data, 0, NULL, NULL);
checkErr(err,"Unmap Buff");
err = clEnqueueNDRangeKernel(commands, ko_smat, 2, NULL, globalWorkItems, localWorkItems, 0, NULL, NULL);
err = clFinish(commands);
checkErr(err, "Execute Kernel");
result = (char*) clEnqueueMapBuffer(commands, d_c, CL_TRUE,CL_MAP_WRITE, 0, sizeof(char) * result_size, 0, NULL, NULL, &err);
checkErr(err,"Write C");
printMatrix(result, result_row, result_col);
This code works fine when I use the clEnqueueReadBuffer/clEnqueueWriteBuffer functions and initialise d_a, d_b and d_c through them, but when I use the mapped buffers, the result is 0 because d_a and d_b are null when the kernel runs.
What is the appropriate way to map/unmap buffers?
The core problem seems to be here:
segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE,CL_MAP_READ, 0, sizeof(char) * segment_width * segment_length, 0, NULL, NULL, &err);
printMatrix(segment_data, segment_length, segment_width);
err = clEnqueueUnmapMemObject(commands, d_b, segment_data, 0, NULL, NULL);
checkErr(err,"Unmap Buff");
segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE,CL_MAP_READ, 0, sizeof(char) * segment_width * segment_length, 0, NULL, NULL, &err);
printMatrix(segment_data, segment_length, segment_width);
// ALL ZEROs again
The first printMatrix() prints the correct output, but once I unmap the buffer and remap it, segment_data becomes all 0s (its initial value). I suspect I'm using an incorrect flag somewhere, but I can't figure out where.
query_vector = (char*) clEnqueueMapBuffer(commands, d_a, CL_TRUE,CL_MAP_READ, 0, sizeof(char) * queryVector_size, 0, NULL, NULL, &err);
checkErr(err,"Write A");
segment_data = (char*) clEnqueueMapBuffer(commands, d_b, CL_TRUE,CL_MAP_READ, 0, sizeof(char) * segment_size, 0, NULL, NULL, &err);
checkErr(err,"Write B");
The buffers are mapped with CL_MAP_READ, but you are writing to them. Unlike the buffer creation flags, the map flags describe the host's view of the memory, not the device's. Buffers that the host fills should therefore be mapped with the CL_MAP_WRITE flag; otherwise any changes may simply be discarded when the buffer is unmapped.
From the OpenCL 1.2 spec:
5.4.3 Accessing mapped regions of a memory object
If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
So, you need to map the results buffer after you've enqueued the kernel. Similarly, you need to unmap the input buffers before you enqueue the kernel. The timeline for mapping/unmapping buffers should be roughly as follows:
Create input buffers
Create output buffers
Map input buffers
Write input data
Unmap input buffers
Enqueue kernel
Map output buffers
Read output data
Unmap output buffers
Mapped buffers are usually a good way to speed up this kind of transfer. If you create the buffers with CL_MEM_ALLOC_HOST_PTR, the runtime allocates host-accessible memory for them, which typically takes some of the transfer burden off the CPU because the data can be moved by DMA.
Here is an example of using mapped buffers:
// pointer to hold the result
int * host_ptr = malloc(size * sizeof(int));
d_mem = clCreateBuffer(context,CL_MEM_READ_WRITE|CL_MEM_ALLOC_HOST_PTR,
size*sizeof(cl_int), NULL, &ret);
// map the buffer for writing from the host
int * map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_WRITE,
0,size*sizeof(cl_int),0,NULL,NULL,&ret);
// initialize data
for (i=0; i<size;i++) {
map_ptr[i] = i;
}
// unmap before the kernel uses the buffer
ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL);
//Set OpenCL Kernel Parameters
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_mem);
size_t global_work[1] = { size };
//Execute OpenCL Kernel
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
global_work, NULL, 0, NULL, NULL);
// map the buffer again, this time for reading the result on the host
map_ptr = clEnqueueMapBuffer(command_queue,d_mem,CL_TRUE,CL_MAP_READ,
0,size*sizeof(cl_int),0,NULL,NULL,&ret);
// copy the data to result array
for (i=0; i<size;i++){
host_ptr[i] = map_ptr[i];
}
ret = clEnqueueUnmapMemObject(command_queue,d_mem,map_ptr,0,NULL,NULL);
// cl finish etc
It is taken from this post.

Passing parameters to a kernel function in OpenCL

I have the following kernel function:
private static String programSource =
"__kernel void sampleKernel(__global float *Y, __global float *param) "
+ "{ int index = get_global_id(0); "
+ " Y[index]=param[0]-Y[index]/param[1]-param[2]; "
+ "} ";
The first argument, "Y", works perfectly, but the second parameter, "param", does not work correctly: the values I receive are all null. The second parameter must be an array of 3 elements.
Here is the code fragment that passes the parameters:
float[] arr_params = new float[3];
arr_params[0] = (float) h_c;
arr_params[1] = (float) sy;
arr_params[2] = (float) dy;
Pointer Pvy = Pointer.to(vy);
Pointer Parr_params = Pointer.to(arr_params);
cl_mem memObjects[] = new cl_mem[2];
memObjects[0] = clCreateBuffer(context,
Sizeof.cl_float * vy.length, Pvy, null);
memObjects[1] = clCreateBuffer(context,
Sizeof.cl_float * arr_params.length, Parr_params, null);
// Set the arguments for the kernel
clSetKernelArg(kernel, 0,
Sizeof.cl_mem, Pointer.to(memObjects[0]));
clSetKernelArg(kernel, 1,
Sizeof.cl_mem, Pointer.to(memObjects[1]));
// Set the work-item dimensions
long global_work_size[] = new long[]{vy.length};
long local_work_size[] = new long[]{1};
// Execute the kernel
clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
global_work_size, local_work_size, 0, null, null);
// Read the output data
clEnqueueReadBuffer(commandQueue, memObjects[0], CL_TRUE, 0, vy.length * Sizeof.cl_float, Pvy, 0, null, null);
// Release kernel, program, and memory objects
The second buffer is all zeros because, in the clCreateBuffer call, you haven't told OpenCL where to get the data. Add CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR to the buffer flags so that the contents of the host pointer are actually used.
