Is there any reason why clEnqueueNDRangeKernel may block? - opencl

I am developing an application that uses OpenCL, targeting version 1.2. I use DX11 interoperability to display the kernel results. I have tried my code on Intel (iGPU) and Nvidia platforms, and I see the same behaviour on both.
My call to clEnqueueNDRangeKernel is blocking the CPU thread. I have checked the documentation and cannot find a statement describing the situations in which a kernel call may block. I have read in some forums that this sometimes happens with some OpenCL implementations. The code works properly and outputs valid results. The API does not return an error at any point; everything seems to run smoothly.
I cannot paste the full source, but here is the in-loop part:
size_t local = 64;
size_t global = ctx->dec_in_host->horizontal_blocks * ctx->dec_in_host->vertical_blocks * local;
print_if_error(clEnqueueWriteBuffer(ctx->queue, ctx->blocks_gpu, CL_TRUE, 0, sizeof(block_input) * TOTAL_BLOCKS, ctx->blocks_host, 0, NULL, &ctx->blocks_copy_status), "copying data");
print_if_error(clEnqueueWriteBuffer(ctx->queue, ctx->dec_in_gpu, CL_TRUE, 0, sizeof(decoder_input), ctx->dec_in_host, 0, NULL, &ctx->frame_copy_status), "copying data");
if (ctx->mode == nv_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueAcquireD3D11ObjectsNV(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Acquiring texture");
else if (ctx->mode == khr_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueAcquireD3D11ObjectsKHR(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Acquiring texture");
t1 = clock();
print_if_error(clEnqueueNDRangeKernel(ctx->queue, ctx->kernel, 1, NULL, &global, &local, 0, NULL, &ctx->kernel_status), "kernel launch");
t2 = clock();
if (ctx->mode == nv_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueReleaseD3D11ObjectsNV(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Releasing texture");
else if (ctx->mode == khr_d3d11_sharing)
print_if_error(ctx->fp_clEnqueueReleaseD3D11ObjectsKHR(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Releasing texture");
printf("Elapsed time %lf ms\n", (double)(t2 - t1)*1000 / CLOCKS_PER_SEC);
So my questions are:
Do you know any reason why clEnqueueNDRangeKernel would block?
Do you know if the DX11 interop might cause this?
Do you know if some OpenCL configuration can make a kernel launch synchronous?
Thank you :)
EDIT 1:
Thanks to doqtor's comment I realized that when I comment out parts of the kernel, the kernel launch becomes asynchronous. The result is then not correct, but it gives me a hint to work out the answer.

Related

Why would JOCL CL.clEnqueueReadBuffer never return?

Currently I have a kernel that is supposed to take a flattened array of bytes and render an image from it. I have all of this implemented; however, I have a line of code that never returns:
ret = CL.clEnqueueReadBuffer(this.commandQueue, this.finalImageBuffer, CL.CL_TRUE, 0, this.outputSize, Pointer.to(this.outArray), 0, null, null);
Why would it never return? That's my question.
Thanks in advance

Is it possible in OpenCL to wait for an event that has not been returned by clEnqueueNDRangeKernel yet?

Let's say there are two Command Queues and I want to synchronize them using events. It can be done like this:
cl_event event1;
clEnqueueNDRangeKernel(queue1, <..params..>, 0, NULL, &event1);
cl_event event2;
clEnqueueNDRangeKernel(queue2, <..params..>, 0, NULL, &event2);
clEnqueueNDRangeKernel(queue1, <..params..>, 1, &event2, NULL);
clEnqueueNDRangeKernel(queue2, <..params..>, 1, &event1, NULL);
Is it possible to achieve similar results with a different order of clEnqueueNDRangeKernel calls? For example:
cl_event event1;
cl_event event2;
clEnqueueNDRangeKernel(queue1, <..params..>, 0, NULL, &event1);
clEnqueueNDRangeKernel(queue1, <..params..>, 1, &event2, NULL); //it fails here because event2 does not exist
clEnqueueNDRangeKernel(queue2, <..params..>, 0, NULL, &event2);
clEnqueueNDRangeKernel(queue2, <..params..>, 1, &event1, NULL);
Is it possible in OpenCL to wait for an event that has not been returned by clEnqueueNDRangeKernel yet?
Not really, but there's a different approach possible.
You could create a user event (clCreateUserEvent) and use the returned userEvent instead of event2 argument in the enqueue call. Then, after enqueueing a kernel that creates event2, you add a callback (clSetEventCallback) on event2, and from that callback you call clSetUserEventStatus(userEvent, CL_COMPLETE).
There are only two problems with this: 1) even if the most common OpenCL implementations weren't horrible with regard to user events, you would be introducing unnecessary userspace trips (= slowdown); 2) they are horrible with regard to user events. By which I mean, the callback will be called... at some point. It's not unusual to see it called with a 10-200 ms delay after the event has actually finished.
You could get more useful answers if you said what problem you're trying to solve.

How do I pass an array to an OpenCL kernel?

I have an array which I want to pass to an OpenCL kernel. Part of my code is:
cl_mem arr_cl;
unsigned int arr[4] = { 0 };
arr_cl = clCreateBuffer(ocl.context, CL_MEM_ALLOC_HOST_PTR, 4*sizeof(unsigned int), NULL, &status);
arr = (unsigned int*)clEnqueueMapBuffer(ocl.command_queue, arr_cl, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, 4*sizeof(unsigned int), 0, NULL, NULL, NULL);
status |= clSetKernelArg(ocl.kernel, 0, sizeof(cl_mem), &(arr_cl));
The above code compiles but crashes during run time. Please let me know what I'm doing wrong here.
I am using OpenCL 2.0.
source: https://www.khronos.org/registry/OpenCL/sdk/2.1/docs/man/xhtml/clEnqueueMapBuffer.html
CL_MAP_READ or CL_MAP_WRITE and CL_MAP_WRITE_INVALIDATE_REGION are
mutually exclusive.
You should only read or only write within a mapping if the OpenCL version is >= 1.2.
Also, when changing CL_MEM_ALLOC_HOST_PTR to CL_MEM_USE_HOST_PTR, the array should be aligned to the value returned by the CL_DEVICE_MEM_BASE_ADDR_ALIGN query.
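For the host-pointer case, here is a minimal sketch of allocating a suitably aligned host array in plain C11. It assumes the alignment has already been queried with clGetDeviceInfo and CL_DEVICE_MEM_BASE_ADDR_ALIGN, which reports the value in bits, and that aligned_alloc is available (Microsoft's C runtime does not provide it; _aligned_malloc would be the counterpart there). The helper names alloc_host_aligned and is_aligned are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate at least `bytes` of host memory aligned for the device.
 * align_bits is the CL_DEVICE_MEM_BASE_ADDR_ALIGN value, in bits.
 * Release the returned pointer with free(). */
void *alloc_host_aligned(size_t bytes, unsigned align_bits)
{
    size_t align = align_bits / 8;                    /* bits -> bytes */
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two  */
    if (align < sizeof(void *))
        align = sizeof(void *);

    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t padded = (bytes + align - 1) / align * align;
    if (padded == 0)
        padded = align;
    return aligned_alloc(align, padded);
}

/* Returns 1 if p is a multiple of `align` bytes, 0 otherwise. */
int is_aligned(const void *p, size_t align)
{
    return ((size_t)p % align) == 0;
}
```

For the four-element array in the question, something like alloc_host_aligned(4 * sizeof(unsigned int), align_bits) would replace the stack array before the pointer is handed to clCreateBuffer.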

Issues with clEnqueueMapBuffer in OpenCL

I'm developing a program that implements a recursive ray tracing in OpenCL.
To run the kernel I have two options of devices: the integrated Intel GPU and an Nvidia GeForce graphics card.
When I run the project with the first device there's no problem; it runs correctly and shows the result of the algorithm just fine.
But when I try to run it with the Nvidia device, it crashes in the callback function that has the synchronous buffer map.
The part of the code where it crashes is the following:
clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);
// 7. Look at the results via synchronous buffer map.
cl_float4 *ptr = (cl_float4 *) clEnqueueMapBuffer( queue, buffer, CL_TRUE, CL_MAP_READ, 0, kWidth * kHeight * sizeof(cl_float4), 0, NULL, NULL, NULL );
cl_float *viewTransformPtr = (cl_float *) clEnqueueMapBuffer( queue, viewTransform, CL_TRUE, CL_MAP_WRITE, 0, 16 * sizeof(cl_float), 0, NULL, NULL, NULL );
cl_float *worldTransformsPtr = (cl_float *) clEnqueueMapBuffer( queue, worldTransforms, CL_TRUE, CL_MAP_WRITE, 0, 16 * sizeof(cl_float), 0, NULL, NULL, NULL );
memcpy(viewTransformPtr, viewMatrix, sizeof(float)*16);
memcpy(worldTransformsPtr, sphereTransforms, sizeof(float)*16);
clEnqueueUnmapMemObject(queue, viewTransform, viewTransformPtr, 0, 0, 0);
clEnqueueUnmapMemObject(queue, worldTransforms, worldTransformsPtr, 0, 0, 0);
unsigned char* pixels = new unsigned char[kWidth*kHeight*4];
for(int i=0; i < kWidth * kHeight; i++){
pixels[i*4] = ptr[i].s[0]*255;
pixels[i*4+1] = ptr[i].s[1]*255;
pixels[i*4+2] = ptr[i].s[2]*255;
pixels[i*4+3] = 1;
}
glBindTexture(GL_TEXTURE_2D, 1);
glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
glTexImage2D(GL_TEXTURE_2D, 0, 4, kWidth, kHeight, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
delete [] pixels;
The last two calls to clEnqueueMapBuffer return error -5, which corresponds to CL_OUT_OF_RESOURCES, but I believe that the sizes of the buffers are correct.
According to the CL spec, making blocking CL calls from a callback is undefined. Your code is likely correct, but you can't use it from a callback. On the Intel platform with integrated memory, the maps are no-ops and therefore do not fail.
CL spec: clSetEventCallback
The behavior of calling expensive system routines, OpenCL API calls to
create contexts or command-queues, or blocking OpenCL operations from
the following list below, in a callback is undefined.
clFinish
clWaitForEvents
blocking calls to clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueWriteBuffer, and clEnqueueWriteBufferRect
blocking calls to clEnqueueReadImage and clEnqueueWriteImage
blocking calls to clEnqueueMapBuffer and clEnqueueMapImage
blocking calls to clBuildProgram
If an application needs to wait for completion of a routine from the
above list in a callback, please use the non-blocking form of the
function, and assign a completion callback to it to do the remainder
of your work. Note that when a callback (or other code) enqueues
commands to a command-queue, the commands are not required to begin
execution until the queue is flushed. In standard usage, blocking
enqueue calls serve this role by implicitly flushing the queue. Since
blocking calls are not permitted in callbacks, those callbacks that
enqueue commands on a command queue should either call clFlush on the
queue before returning or arrange for clFlush to be called later on
another thread.

Incorrect copying of memory in OpenCL

I am getting some unusual behaviour in my OpenCL program.
In the host part of the program I create an array of double and set all elements to zero. That array is copied to the GPU using:
memObjects[4] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(double) * I_numel, I, NULL);
Inside the kernel some elements are set to 1 depending on some condition and then I read it back to the host with:
errNum = clEnqueueReadBuffer(commandQueue, memObjects[4], CL_TRUE, 0,
I_numel * sizeof(double), I, 0, NULL, NULL);
However, some of the elements that were supposed to be zero have changed to very small (6.953267903e-310) or very large (2.0002319483e+161) numbers!
I've tried changing double to float, but the results are similar. I am using the NVIDIA implementation of OpenCL, version 1.1. Does anyone know what the problem is?
I suspect there's something wrong with your kernel code. What happens if you do just the clEnqueueReadBuffer without running the kernel at all? Do you then get all zeros? How about if you drop CL_MEM_COPY_HOST_PTR and clear the buffer with clEnqueueWriteBuffer instead?
I tried to reproduce the issue with this simplified kernel, but the output was just alternating zeros and ones, as expected:
kernel void enqueueReadBuffer(global float* outputValueArray) {
int gid = get_global_id(0);
if (gid % 2 == 0) {
outputValueArray[gid] = 1.0f;
}
}
I ran this on three different OpenCL drivers on Windows 7, including NVIDIA Quadro FX4800 (R307.45), and got the correct result on all of them.
Try replacing the shown code with this and then post the error numbers:
cl_int err;
memObjects[4] = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(double) * I_numel, I, &err);
printf("Buffer creation error no = %d\n", err);
And for the copy back:
cl_int err2;
err2= clEnqueueReadBuffer(commandQueue, memObjects[4], CL_TRUE, 0,
I_numel * sizeof(double), I, 0, NULL, NULL);
printf("Copy back error no = %d\n", err2);
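One more thing that may help when inspecting the values that come back: the tiny number quoted in the question, 6.953267903e-310, is below the smallest normal double (about 2.2e-308), i.e. it is a subnormal. A subnormal is what a 64-bit integer smaller than 2^52 (for example a typical user-space address) looks like when its bits are read as a double, so such values are a hint, not proof, that uninitialized or non-double data ended up in the buffer. A minimal sketch for flagging them, assuming IEEE 754 doubles; the helper names are hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double without changing the bits. */
double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Returns 1 if v is a nonzero subnormal, which is rarely the result of
 * ordinary arithmetic on values such as 0.0 and 1.0. */
int looks_like_raw_bits(double v)
{
    return fpclassify(v) == FP_SUBNORMAL;
}
```

Running looks_like_raw_bits over the array read back with clEnqueueReadBuffer gives a quick way to count the suspicious elements and see which indices they fall on.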
