How to allocate Local Work Item sizes in OpenCL

I've set up a convolution kernel in OpenCL to convolve a 228x228x3 image with 11x11x3x96 weights and produce a 55x55x96 output.
My code works perfectly when I leave localWorkSize unset, but as soon as I set it explicitly I start getting errors.
My questions are therefore:
1) How many threads are being launched when I set localWorkSize to NULL? I'm guessing the implementation picks something implicitly, but is there any way to get those numbers?
2) How should I set localWorkSize to avoid errors?
// When localWorkSize is NULL
size_t globalWorkSize[3] = {55, 55, 96};

// Passing NULL for the localWorkSize argument
errNum = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, globalWorkSize, NULL, 0, NULL, &event);
// WORKS PERFECTLY

// When I set localWorkSize
size_t globalWorkSize[3] = {55, 55, 96};
size_t localWorkSize[3] = {1, 1, 1};
errNum = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, &event);
// ERROR CONTEXT CODE 999
I'm just trying to understand how many threads are created when localWorkSize is NULL and globalWorkSize is specified.
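For what it's worth: with localWorkSize set to NULL, the runtime still launches one work item per point of the global range, i.e. 55*55*96 = 290,400 in total; NULL only means the implementation chooses the work-group shape itself. The chosen shape isn't reported back, but the limits any explicit localWorkSize must respect can be queried. A sketch (assumes valid device and kernel handles and a 3-dimensional device; error checks omitted):

// upper bound on localWorkSize[0]*localWorkSize[1]*localWorkSize[2] for the device
size_t maxGroupSize;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxGroupSize), &maxGroupSize, NULL);

// per-dimension upper bounds
size_t maxItemSizes[3];
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(maxItemSizes), maxItemSizes, NULL);

// tighter, kernel-specific bound (register/local-memory pressure)
size_t kernelGroupSize;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(kernelGroupSize), &kernelGroupSize, NULL);

In OpenCL 1.x each localWorkSize[i] must also divide globalWorkSize[i] evenly; {1,1,1} satisfies both that and the limits above, so an error with {1,1,1} usually points at something other than the sizes themselves.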

Related

Copying global on-device pointer address back and forth between device and host

I created a buffer on the OpenCL device (a GPU), and from the host I need to know its global on-device pointer address, so that I can store that address in a second buffer. The kernel can then read the address from that second buffer and use it to access the contents of the first.
If that's confusing, here's what I'm trying to do: I create a generic float-containing buffer representing a 2D image, then from the host I build a todo list of all the things my kernel needs to draw - which lines, which circles, which images. From that list the kernel has to know where to find each image, but a reference to the image cannot be passed as a kernel argument, because the kernel might draw no image at all, or a thousand different ones, all depending on what the list says. So the image has to be referenced inside the buffer that serves as the kernel's todo list.
The awkward way I've done it so far:
To do so, I made a function that calls a kernel after the creation of the image buffer: the kernel takes the buffer and returns the global on-device address as a ulong in another buffer, and the host then stores that value in a 64-bit integer, like this:
uint64_t get_clmem_device_address(clctx_t *clctx, cl_mem buf)
{
    const char kernel_source[] =
        "kernel void get_global_ptr_address(global void *ptr, global ulong *devaddr)\n"
        "{\n"
        "    *devaddr = (ulong) ptr;\n"
        "}\n";
    cl_int ret;
    static int init = 1;
    static cl_program program;
    static cl_kernel kernel;
    static cl_mem ret_buffer;
    size_t global_work_size[1];
    uint64_t devaddr;

    // Build the program, kernel and result buffer only on the first call
    if (init)
    {
        init = 0;
        ret = build_cl_program(clctx, &program, kernel_source);
        ret = create_cl_kernel(clctx, program, &kernel, "get_global_ptr_address");
        ret_buffer = clCreateBuffer(clctx->context, CL_MEM_WRITE_ONLY, 1*sizeof(uint64_t), NULL, &ret);
    }

    if (kernel == NULL)
        return 0;

    // Run the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), &ret_buffer);
    global_work_size[0] = 1;
    ret = clEnqueueNDRangeKernel(clctx->command_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL); // enqueue the kernel
    ret = clEnqueueReadBuffer(clctx->command_queue, ret_buffer, CL_FALSE, 0, 1*sizeof(uint64_t), &devaddr, 0, NULL, NULL); // copy the value
    ret = clFlush(clctx->command_queue);
    clFinish(clctx->command_queue); // the non-blocking read is only safe to use after this

    return devaddr;
}
Apparently this works (it does return a number, although it's hard to know whether it's correct). I then put this devaddr (a 64-bit integer on the host) in the todo-list buffer that the kernel uses to know what to do, and, if the list calls for it, the kernel calls the function below, where le is a pointer to the relevant entry in the todo list and the 64-bit address is its first element:
float4 blit_sprite(global uint *le, float4 pv)
{
    const int2 p = (int2) (get_global_id(0), get_global_id(1));
    ulong devaddr;
    global float4 *im;
    int2 im_dim;

    devaddr = ((global ulong *) le)[0]; // global address for the start of the image, as a ulong
    im_dim.x = le[2];
    im_dim.y = le[3];
    im = (global float4 *) devaddr; // the ulong is turned back into a proper global pointer

    if (p.x < im_dim.x)
        if (p.y < im_dim.y)
            pv += im[p.y * im_dim.x + p.x]; // this gave me a CL_OUT_OF_RESOURCES error, even when changed to im[0]

    return pv;
}
But, big surprise, this didn't work: it gave me a CL_OUT_OF_RESOURCES error, which I assumed meant my im pointer wasn't valid. Edit: it actually works; it only failed when I was using two different contexts. But it's still pretty unwieldy.
Is there a less weird way to do what I want to do?
The OpenCL standard doesn't guarantee that memory objects won't be physically reallocated between kernel calls, so the original device-side address is valid only within a single kernel NDRange. That's one of the reasons OpenCL memory objects are represented on the host side as opaque structure pointers.
You can, however, save the offset to the memory object's first byte in the first kernel and pass it to the second kernel. Every time you launch your kernel, obtain the actual device-side base address within the kernel and add the saved offset. That would be perfectly "legal".
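A minimal kernel-side sketch of that offset technique (hypothetical names; it assumes all image data lives in one big pool buffer that is passed to every launch, with the todo list storing element offsets into it rather than raw device addresses):

kernel void draw(global float4 *pool, global uint *todo)
{
    // The offset was stored by an earlier kernel as a ulong element index.
    ulong off = ((global ulong *) todo)[0];

    // Rebase against this launch's pool pointer, so the address stays valid
    // even if the runtime physically moved the buffer between launches.
    global float4 *im = pool + off;

    // ... draw using im ...
}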

Reading an external kernel in OpenCL

I have the following lines of code, which I use first to determine the file size of the .cl file I am reading from (and to load it into a buffer), and then to build my program and kernel from that buffer. Assume calculate.cl contains a simple vector addition kernel.
//get size of kernel source
FILE *f = fopen("calculate.cl", "r");
fseek(f, 0, SEEK_END);
size_t programSize = ftell(f);
rewind(f);
//load kernel into buffer
char *programBuffer = (char*)malloc(programSize + 1);
programBuffer[programSize] = '\0';
fread(programBuffer, sizeof(char), programSize, f);
fclose(f);
//create program from buffer
cl_program program = clCreateProgramWithSource(context, 1, (const char**) &programBuffer, &programSize, &status);
//build program for devices
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
//create the kernel
cl_kernel calculate = clCreateKernel(program, "calculate", &status);
However, when I run my program, the output produced is zero instead of the intended vector addition results. I've verified that the problem is not with the kernel itself (I used a different method of loading the external kernel, which worked and gave the intended results), but I'm still curious why this initial method did not work.
Any help?
The problem's been solved.
Following bl0z0's suggestion and looking up the error, I found the solution here:
OpenCL: Expected identifier in kernel
Thanks everyone :D I really appreciate it!
I believe this gives the program size in terms of the number of chars:
size_t programSize = ftell(f);
and here you need to allocate in terms of bytes:
char *programBuffer = (char*)malloc(programSize + 1);
so I think that previous line should be
char *programBuffer = (char*)malloc(programSize * sizeof(char) + 1);
Double-check this by just printing programBuffer.
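For what it's worth, one likely culprit with this exact pattern (an assumption on my part, since the linked answer isn't reproduced here): opening the file in text mode ("r") lets the C runtime translate line endings on some platforms, so fread can return fewer bytes than ftell reported, leaving stale bytes between the real end of the source and the terminator. A defensive sketch:

// open in binary mode so ftell and fread agree on the byte count
FILE *f = fopen("calculate.cl", "rb");
fseek(f, 0, SEEK_END);
size_t programSize = ftell(f);
rewind(f);

// null-terminate at the count fread actually returned
char *programBuffer = (char*)malloc(programSize + 1);
size_t bytesRead = fread(programBuffer, sizeof(char), programSize, f);
programBuffer[bytesRead] = '\0';
fclose(f);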

OpenCL - adding to a single global value

I'm fighting a bug related to adding to a single global value from an OpenCL kernel.
Consider this (oversimplified) example:
__kernel void some_kernel(__global unsigned int *ops) {
    unsigned int somevalue = ...; // a non-zero value is assigned here
    *ops += somevalue;
}
I pass in an argument initialized as zero through clCreateBuffer and clEnqueueWriteBuffer. I assumed that after adding to the value, letting the queue finish and reading it back, I'd get a non-zero value.
Then I figured this might be some weird conflict, so I tried to do an atomic operation:
__kernel void some_kernel(__global unsigned int *ops) {
    unsigned int somevalue = ...; // a non-zero value is assigned here
    atomic_add(ops, somevalue);
}
Alas, no dice - after reading the value back to a host pointer, it's still zero. I've already verified that somevalue has non-zero values in kernel executions, and am at a loss.
By request, the code for creating the memory:
unsigned int *cpu_ops = new unsigned int;
*cpu_ops = 0;
cl_mem_flags flags = CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR;
cl_int error;

cl_mem buffer = clCreateBuffer(context, flags, sizeof(unsigned int), (void*)cpu_ops, &error);
// error code check snipped

error = clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, sizeof(unsigned int), (void*)cpu_ops, 0, NULL, NULL);
// error code check snipped

// snip: program setup - it checks out, no errors

cl_kernel some_kernel = clCreateKernel(program, "some_kernel", &error);
// error code check snipped

error = clSetKernelArg(some_kernel, 0, sizeof(cl_mem), &buffer);
// error code check snipped

// global_work_size and local_work_size set elsewhere
error = clEnqueueNDRangeKernel(queue, some_kernel, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
// error code check snipped

clFinish(queue);

error = clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, sizeof(unsigned int), (void*)cpu_ops, 0, NULL, NULL);
// error code check snipped

// at this point, cpu_ops still has its initial value (whatever that value might have been set to)
I've skipped the error checking code since it does not error out. I'm actually using a bunch of custom helper functions for sending and receiving data, setting up the platform and context, compiling the program and so on, so the above is constructed of the bodies of the appropriate helpers with the parameters' names changed to make sense.
I'm fairly sure that this is a slip-up or lack of understanding on my part, but desperately need input on this.
Never mind. I was confused about my memory handles - just a stupid error. The code is probably fine.
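For anyone hitting a similar symptom, it can help to sanity-check with a minimal kernel whose expected result is known exactly (a hypothetical kernel, not from the question): every work item adds 1, so after launching N work items over a zero-initialized buffer, reading the counter back should yield exactly N. If it does, the host-side buffer handling, not the kernel, is where to look.

__kernel void count_items(__global unsigned int *counter) {
    // Each of the N work items adds 1; the final value should be exactly N.
    atomic_add(counter, 1u);
}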

clGetProgramInfo CL_PROGRAM_BINARY_SIZES Incorrect Results?

I am trying to cache a program in a file so that it does not need to be compiled to assembly each time. Consequently, I am trying to dump the binaries. The issue is that the returned program binary alternately has garbage data at the end.
Error checking omitted for clarity (no errors occur, though):
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARY_SIZES, 0, NULL, &n);
n /= sizeof(size_t);
size_t* sizes = new size_t[n];
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARY_SIZES, n*sizeof(size_t), sizes, NULL);
I have confirmed that kernel->program is identical between runs. In the above code, "n" is invariably 1, but sizes[0] alternates between 2296 and 2312 on successive runs.
The problem is that the 2296 number appears to be more accurate--after the final closing brace in the output, there are three newlines and then three spaces.
For the 2312 number, after the final closing brace in the output, there are the three newlines, a line of garbage data, and then the three spaces.
Naturally, the line of garbage data is problematic. I'm not sure how to get rid of it, and I'm pretty sure it's not an error on my part.
NVIDIA GeForce GTX 580M, with driver 305.60 on Windows 7.
Update: I have changed the code to the following:
//Get how many devices there are
size_t n;
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, 0, NULL, &n);

//Get the list of binary sizes
size_t* sizes = new size_t[n];
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARY_SIZES, n*sizeof(size_t), sizes, NULL);

//Get the binaries
unsigned char** binaries = new unsigned char*[n];
for (int i = 0; i < (int)n; ++i) {
    binaries[i] = new unsigned char[sizes[i]];
}
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARIES, n*sizeof(unsigned char*), binaries, NULL);
Now, the code has n = 4, but only sizes[0] contains meaningful information (so the alloc of sizes[1] fails in the loop). Thoughts?
I get the number of devices with the following line:
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, sizeof(cl_uint), &n, NULL);
So your line
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, 0, NULL, &n);
needs to be:
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, sizeof(cl_uint), &n, NULL);
with n declared as a cl_uint: CL_PROGRAM_NUM_DEVICES returns a cl_uint, so querying with 0/NULL just writes the size of that return type (4 bytes) into your variable, which is where your n = 4 comes from.
clGetProgramInfo with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES needs a pointer to an array, not just to a single variable, because it creates one binary for each device that you supplied when building the program; n in the second example should therefore be the number of devices.
Not sure why the sizes differ from run to run, though... are you sure you are building for the same device each time?
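Putting the pieces together, a corrected sketch of the dump (error checks omitted; cleanup of the allocations left out for brevity):

//query the device count into a cl_uint, the actual return type of CL_PROGRAM_NUM_DEVICES
cl_uint numDevices;
clGetProgramInfo(kernel->program, CL_PROGRAM_NUM_DEVICES, sizeof(numDevices), &numDevices, NULL);

//one size entry per device
size_t* sizes = new size_t[numDevices];
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARY_SIZES, numDevices*sizeof(size_t), sizes, NULL);

//one binary buffer per device
unsigned char** binaries = new unsigned char*[numDevices];
for (cl_uint i = 0; i < numDevices; ++i) {
    binaries[i] = new unsigned char[sizes[i]];
}
clGetProgramInfo(kernel->program, CL_PROGRAM_BINARIES, numDevices*sizeof(unsigned char*), binaries, NULL);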

OpenCL to search array and set a flag

I'm brand new to using OpenCL, and this seems like it should be very simple, so bear with me.
I'm writing a simple kernel to scan an array and look for a particular value. If that value is found anywhere in the array, I'd like a flag to be set. If the value is not found, the flag should remain 0.
Currently I'm creating a cl_mem object to hold an int
cl_mem outputFlag = clCreateBuffer(mCLContext, CL_MEM_WRITE_ONLY, sizeof(cl_int), NULL, NULL);
setting it as a kernel argument
clSetKernelArg(mCLKernels[1], 1, sizeof(cl_mem), &outputFlag);
and executing my kernel which looks like:
__kernel void checkForHole(__global uchar *image, __global int found, uchar holeValue)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    uchar sample = image[i*j];
    if (sample == holeValue) {
        found = 1;
    }
}
Note that my array is 2D, though it shouldn't matter.
When I put a printf statement inside my found condition, it does get called (the value is found). But when I read back my value via:
cl_int result;
errorCode = clEnqueueReadBuffer(mCLCommandQueue, outputFlag, CL_TRUE,
                                0, sizeof(cl_int), &result, 0, NULL, NULL);
I get 0. Is there a proper way to set a flag in OpenCL? It would also be nice if there were a way to halt the entire execution and just return my value as soon as it's found.
Can I write a kernel with a bool return type and just return true?
Thanks!
In the kernel, the output flag should be a pointer to an int.
Change the kernel parameter to __global int *found (and the write to *found = 1;).
I always seem to figure out my issues just by writing them here....
If anyone knows a way to halt the execution though, or if it's even possible, I'd still be interested :)
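For completeness, a sketch of the corrected kernel (same names as the question). Two caveats, both assumptions beyond what the answer above states: the flag buffer needs to actually be initialized to 0 on the host (clCreateBuffer alone leaves its contents undefined), and there is no way to abort an NDRange once it has been enqueued - the usual approximation is to let each work item read the flag and return early (the read is racy, but the value only ever goes from 0 to 1):

__kernel void checkForHole(__global uchar *image, __global int *found, uchar holeValue)
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    // Best-effort early exit: once any work item has set the flag,
    // later work items skip their sampling work.
    if (*found)
        return;

    uchar sample = image[i*j]; // indexing kept as in the question
    if (sample == holeValue) {
        *found = 1;
    }
}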
