Issues with clEnqueueMapBuffer in OpenCL

I'm developing a program that implements recursive ray tracing in OpenCL.
To run the kernel I have two device options: the Intel GPU integrated with the system and the Nvidia GeForce graphics card.
When I run the project on the first device there's no problem; it runs correctly and shows the result of the algorithm just fine.
But when I try to run it on the Nvidia device, it crashes in the callback function that performs the synchronous buffer map.
The part of the code where it crashes is the following:
clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);
// 7. Look at the results via synchronous buffer map.
cl_float4 *ptr = (cl_float4 *) clEnqueueMapBuffer( queue, buffer, CL_TRUE, CL_MAP_READ, 0, kWidth * kHeight * sizeof(cl_float4), 0, NULL, NULL, NULL );
cl_float *viewTransformPtr = (cl_float *) clEnqueueMapBuffer( queue, viewTransform, CL_TRUE, CL_MAP_WRITE, 0, 16 * sizeof(cl_float), 0, NULL, NULL, NULL );
cl_float *worldTransformsPtr = (cl_float *) clEnqueueMapBuffer( queue, worldTransforms, CL_TRUE, CL_MAP_WRITE, 0, 16 * sizeof(cl_float), 0, NULL, NULL, NULL );
memcpy(viewTransformPtr, viewMatrix, sizeof(float)*16);
memcpy(worldTransformsPtr, sphereTransforms, sizeof(float)*16);
clEnqueueUnmapMemObject(queue, viewTransform, viewTransformPtr, 0, 0, 0);
clEnqueueUnmapMemObject(queue, worldTransforms, worldTransformsPtr, 0, 0, 0);
unsigned char* pixels = new unsigned char[kWidth*kHeight*4];
for(int i=0; i < kWidth * kHeight; i++){
    pixels[i*4] = ptr[i].s[0]*255;
    pixels[i*4+1] = ptr[i].s[1]*255;
    pixels[i*4+2] = ptr[i].s[2]*255;
    pixels[i*4+3] = 1;
}
glBindTexture(GL_TEXTURE_2D, 1);
glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
glTexParameterf( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
glTexImage2D(GL_TEXTURE_2D, 0, 4, kWidth, kHeight, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
delete [] pixels;
The last two calls to clEnqueueMapBuffer return error -5, which matches CL_OUT_OF_RESOURCES, but I believe the sizes of the buffers are correct.

According to the CL spec, making blocking CL calls from a callback is undefined. Your code is likely correct, but you can't run it from a callback. On the Intel platform with integrated memory the maps are no-ops, so they don't fail.
CL spec, clSetEventCallback:
The behavior of calling expensive system routines, OpenCL API calls to
create contexts or command-queues, or blocking OpenCL operations from
the following list below, in a callback is undefined.
clFinish
clWaitForEvents
blocking calls to clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueWriteBuffer, and clEnqueueWriteBufferRect
blocking calls to clEnqueueReadImage and clEnqueueWriteImage
blocking calls to clEnqueueMapBuffer and clEnqueueMapImage
blocking calls to clBuildProgram
If an application needs to wait for completion of a routine from the
above list in a callback, please use the non-blocking form of the
function, and assign a completion callback to it to do the remainder
of your work. Note that when a callback (or other code) enqueues
commands to a command-queue, the commands are not required to begin
execution until the queue is flushed. In standard usage, blocking
enqueue calls serve this role by implicitly flushing the queue. Since
blocking calls are not permitted in callbacks, those callbacks that
enqueue commands on a command queue should either call clFlush on the
queue before returning or arrange for clFlush to be called later on
another thread.
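To make that concrete, here is a rough sketch (not from the original question) of how the callback could be restructured along those lines: issue a non-blocking map, hang a completion callback on its event, and call clFlush before returning. The render_state struct and its fields are assumptions made up for illustration.
#include <CL/cl.h>

// Hypothetical state passed through user_data; the names are assumptions.
typedef struct {
    cl_command_queue queue;
    cl_mem buffer;          // the result buffer from the question
    size_t num_pixels;      // kWidth * kHeight
    cl_float4 *mapped;      // set once the non-blocking map is enqueued
} render_state;

static void CL_CALLBACK map_done(cl_event evt, cl_int status, void *user_data)
{
    render_state *st = (render_state *)user_data;
    // st->mapped is now safe to read on the host; copy the pixels,
    // then enqueue an unmap and clFlush again.
}

static void CL_CALLBACK kernel_done(cl_event evt, cl_int status, void *user_data)
{
    render_state *st = (render_state *)user_data;
    cl_event map_evt;
    // A non-blocking map (CL_FALSE) is permitted inside a callback.
    st->mapped = (cl_float4 *)clEnqueueMapBuffer(st->queue, st->buffer, CL_FALSE, CL_MAP_READ,
                                                 0, st->num_pixels * sizeof(cl_float4),
                                                 0, NULL, &map_evt, NULL);
    clSetEventCallback(map_evt, CL_COMPLETE, map_done, st);
    // Callbacks must flush explicitly; the blocking call used to do this implicitly.
    clFlush(st->queue);
}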

Related

Is it possible in OpenCL to wait for an event that has not been returned by clEnqueueNDRangeKernel yet?

Let's say there are two Command Queues and I want to synchronize them using events. It can be done like this:
cl_event event1;
clEnqueueNDRangeKernel(queue1, <..params..>, 0, NULL, &event1);
cl_event event2;
clEnqueueNDRangeKernel(queue2, <..params..>, 0, NULL, &event2);
clEnqueueNDRangeKernel(queue1, <..params..>, 1, &event2, NULL);
clEnqueueNDRangeKernel(queue2, <..params..>, 1, &event1, NULL);
Is there a possibility to achieve similar results but with a different order of clEnqueueNDRangeKernel calls? For example:
cl_event event1;
cl_event event2;
clEnqueueNDRangeKernel(queue1, <..params..>, 0, NULL, &event1);
clEnqueueNDRangeKernel(queue1, <..params..>, 1, &event2, NULL); //it fails here because event2 does not exist
clEnqueueNDRangeKernel(queue2, <..params..>, 0, NULL, &event2);
clEnqueueNDRangeKernel(queue2, <..params..>, 1, &event1, NULL);
Is it possible in OpenCL to wait for an event that has not been returned by clEnqueueNDRangeKernel yet?
Not really, but there's a different approach possible.
You could create a user event (clCreateUserEvent) and use the returned userEvent instead of event2 argument in the enqueue call. Then, after enqueueing a kernel that creates event2, you add a callback (clSetEventCallback) on event2, and from that callback you call clSetUserEventStatus(userEvent, CL_COMPLETE).
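A minimal sketch of that pattern (the kernel names, global size, and context handle are placeholders, not from the question):
// Callback that completes the user event once the real event2 has finished.
static void CL_CALLBACK forward_completion(cl_event evt, cl_int status, void *user_data)
{
    clSetUserEventStatus((cl_event)user_data, CL_COMPLETE);
}

// ...
cl_event event1, event2;
cl_event userEvent = clCreateUserEvent(context, NULL);

clEnqueueNDRangeKernel(queue1, kernelA, 1, NULL, &gws, NULL, 0, NULL, &event1);
// Wait on the user event as a stand-in for the not-yet-existing event2.
clEnqueueNDRangeKernel(queue1, kernelB, 1, NULL, &gws, NULL, 1, &userEvent, NULL);

clEnqueueNDRangeKernel(queue2, kernelC, 1, NULL, &gws, NULL, 0, NULL, &event2);
// When event2 completes, the callback releases the user event so kernelB can run.
clSetEventCallback(event2, CL_COMPLETE, forward_completion, userEvent);
clEnqueueNDRangeKernel(queue2, kernelD, 1, NULL, &gws, NULL, 1, &event1, NULL);
clFlush(queue2);   // make sure the work that will complete event2 actually starts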
There are only two problems with this: 1) even if the most common OpenCL implementations weren't horrible with respect to user events, you'd be introducing unnecessary round trips through user space (= slowdown), and 2) they are horrible with respect to user events. By which I mean, the callback will be called... at some point. It's not unusual to see it called with a 10-200 ms delay after the event has actually finished.
You could get more useful answers if you said what problem you're trying to solve.

Is there any reason why clEnqueueNDRangeKernel may block?

I am developing an application using OpenCL, targeting version 1.2. I use DX11 interoperability to display the kernel results. I have tried my code on the Intel (iGPU) and Nvidia platforms, and I see the same behaviour on both.
My call to clEnqueueNDRangeKernel is blocking the CPU thread. I have checked the documentation and I cannot find a statement declaring the situations in which a kernel call may block. I have read in some forums that this sometimes happens with some OpenCL implementations. The code is working properly and outputting valid results. The API does not return any error at any point; all seems smooth.
I can't paste the full source, but here is the in-loop part:
size_t local = 64;
size_t global = ctx->dec_in_host->horizontal_blocks * ctx->dec_in_host->vertical_blocks * local;
print_if_error(clEnqueueWriteBuffer(ctx->queue, ctx->blocks_gpu, CL_TRUE, 0, sizeof(block_input) * TOTAL_BLOCKS, ctx->blocks_host, 0, NULL, &ctx->blocks_copy_status), "copying data");
print_if_error(clEnqueueWriteBuffer(ctx->queue, ctx->dec_in_gpu, CL_TRUE, 0, sizeof(decoder_input), ctx->dec_in_host, 0, NULL, &ctx->frame_copy_status), "copying data");
if (ctx->mode == nv_d3d11_sharing)
    print_if_error(ctx->fp_clEnqueueAcquireD3D11ObjectsNV(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Acquiring texture");
else if (ctx->mode == khr_d3d11_sharing)
    print_if_error(ctx->fp_clEnqueueAcquireD3D11ObjectsKHR(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Acquiring texture");
t1 = clock();
print_if_error(clEnqueueNDRangeKernel(ctx->queue, ctx->kernel, 1, NULL, &global, &local, 0, NULL, &ctx->kernel_status), "kernel launch");
t2 = clock();
if (ctx->mode == nv_d3d11_sharing)
    print_if_error(ctx->fp_clEnqueueReleaseD3D11ObjectsNV(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Releasing texture");
else if (ctx->mode == khr_d3d11_sharing)
    print_if_error(ctx->fp_clEnqueueReleaseD3D11ObjectsKHR(ctx->queue, 1, &(ctx->image_gpu), 0, NULL, NULL), "Releasing texture");
printf("Elapsed time %lf ms\n", (double)(t2 - t1)*1000 / CLOCKS_PER_SEC);
So my questions are:
Do you know any reason why clEnqueueNDRangeKernel would block?
Do you know if the DX11 interop might cause this?
Do you know if some OpenCL configuration can create a synchronous kernel launch?
Thank you :)
EDIT 1:
Thanks to doqtor's comment I realized that when I comment out parts of the kernel, the kernel launch becomes asynchronous. The result is not OK, but I have some hints to work out the answer.
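For what it's worth, one way to separate the enqueue latency from the actual kernel execution time is OpenCL event profiling. A rough sketch, assuming ctx->queue was created with CL_QUEUE_PROFILING_ENABLE (the variable names follow the question's code):
cl_event kevt;
cl_ulong t_start = 0, t_end = 0;

print_if_error(clEnqueueNDRangeKernel(ctx->queue, ctx->kernel, 1, NULL, &global, &local, 0, NULL, &kevt), "kernel launch");
clWaitForEvents(1, &kevt);   // only for measurement; drop this to keep the launch asynchronous

clGetEventProfilingInfo(kevt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(kevt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
printf("Kernel execution: %f ms\n", (double)(t_end - t_start) * 1e-6);   // timestamps are in nanoseconds
clReleaseEvent(kevt);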

Non blocking kernel launches in OpenCL intel implementation

I have the following skeleton code
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
&global_item_size,NULL,0, NULL, NULL);
printf("print immediately\n ");
I thought, and read somewhere, that clEnqueueNDRangeKernel is a non-blocking call and the CPU continues its execution immediately after enqueuing the kernel.
But I see different behaviour: the printf statement executes after the kernel completes execution. Why am I seeing this behaviour? How do I make kernel calls non-blocking?
Yes, clEnqueueNDRangeKernel() is supposed to be non-blocking. However, the code you show is not enough to conclude definitively that the kernel finishes before the printf statement. There are several possibilities:
The kernel is not enqueued properly or fails to run. You need to check whether the return value ret is CL_SUCCESS, and if not, fix whatever needs fixing and try again.
The kernel runs fast and the thread on which it runs is likely to be given priority, so the printf statement ends up being executed after the kernel finishes.
The kernel is actually still running during the printf statement, since nothing in your code allows you to conclude otherwise. To check whether the kernel is running or finished, you need to use an event. For example:
cl_event evt = NULL;
cl_int ret, evt_status;
// ...
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
&global_item_size, NULL, 0, NULL, &evt);
// Check if it's finished or not
if (ret == CL_SUCCESS)
{
    clGetEventInfo(evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(cl_int), (void*) &evt_status, NULL);
    if (evt_status == CL_COMPLETE)
        printf("Kernel is finished\n");
    else
        printf("Kernel is NOT finished\n");
}
else
{
    printf("Something's wrong: %d\n", ret);
}

How can I make data transfer between host and device faster?

I am writing a program and I need to repeatedly transfer data between host and device. I will try to minimize the transfers as best I can, but is there any faster way to perform them? Here, the array copied to the device changes on each iteration, so the device needs to be updated with the new array values. Any suggestions/pointers/help will be appreciated.
for (i = 0; i <= SEVERALCALLS; i++) {
    wrtBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(double) * num, NULL, &ret);
    if (ret != 0) {
        printf("clCreateBuffer wrtBuffer error: %d. couldn't load\n", ret);
        exit(1);
    }
    // update cti array
    ret = clEnqueueWriteBuffer(command_queue, wrtBuffer, CL_TRUE, 0, sizeof(double) * num, cti, 0, NULL, NULL);
    if (ret != 0) {
        printf("clEnqueueWriteBuffer wrtBuffer error: %d. couldn't load\n", ret);
        exit(1);
    }
    // NDRange kernel call
    ret = clEnqueueReadBuffer(command_queue, readBuffer, CL_TRUE, 0, sizeof(double) * num, newcti, 0, NULL, NULL);
    if (ret != 0) {
        printf("clEnqueueReadBuffer readBuffer error: %d. couldn't load\n", ret);
        exit(1);
    }
}
Three ways to optimize this:
On an integrated GPU (like Intel or an AMD APU), use "zero copy" buffers so you don't pay for any transfers at all.
On NVIDIA, use pinned host memory as the host-side source for clEnqueueWriteBuffer or as the receive buffer for clEnqueueReadBuffer. This is faster than ordinary malloc'd memory and won't block (a sketch of this approach follows the list).
Overlap transfer and compute. Use three command queues, one for upload, one for compute, and one for download, and use events to enforce dependencies. See NVIDIA's example oclCopyComputeOverlap (although it is suboptimal; it can go slightly faster than they say it can).
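A rough sketch of the pinned-memory idea from the second point, reusing num, cti, context and command_queue from the question (everything else is an assumption): allocate a staging buffer with CL_MEM_ALLOC_HOST_PTR once, map it to obtain a page-locked host pointer, and use that pointer as the source for the repeated writes.
// Created once, outside the transfer loop.
cl_int ret;
cl_mem pinned = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                               sizeof(double) * num, NULL, &ret);
double *staging = (double *)clEnqueueMapBuffer(command_queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                               0, sizeof(double) * num, 0, NULL, NULL, &ret);
cl_mem wrtBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  sizeof(double) * num, NULL, &ret);

for (int i = 0; i <= SEVERALCALLS; i++) {
    memcpy(staging, cti, sizeof(double) * num);          // refresh the host-side copy
    ret = clEnqueueWriteBuffer(command_queue, wrtBuffer, CL_FALSE, 0,
                               sizeof(double) * num, staging, 0, NULL, NULL);
    // ... NDRange kernel and read-back as before ...
    clFinish(command_queue);                             // or use events to overlap iterations
}

clEnqueueUnmapMemObject(command_queue, pinned, staging, 0, NULL, NULL);
clReleaseMemObject(pinned);
clReleaseMemObject(wrtBuffer);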

Base Address of Memory Object OpenCL

I want to traverse a tree on the GPU with OpenCL, so I assemble the tree into a contiguous block on the host and rewrite all of its pointers so that they are consistent on the device, as follows:
TreeAddressDevice = (size_t)BaseAddressDevice + ((size_t)TreeAddressHost - (size_t)BaseAddressHost);
I want the base address of the memory buffer. On the host I allocate memory for the buffer as follows:
cl_mem tree_d = clCreateBuffer(...);
The problem is that cl_mem objects track an internal representation of the data. Technically they're pointers to an object, but they are not pointers to the data. The only way to access a cl_mem from within a kernel is to pass it in as an argument via clSetKernelArg.
Here http://www.proxya.net/browse.php?u=%3A%2F%2Fwww.khronos.org%2Fmessage_boards%2Fviewtopic.php%3Ff%3D37%26amp%3Bt%3D2900&b=28 I found the following solution, but it does not work:
__kernel void getPtr( __global void *ptr, __global void *out )
{
*out = ptr;
}
which can be invoked as follows:
...
cl_mem auxBuf = clCreateBuffer( context, CL_MEM_READ_WRITE, sizeof(void*), NULL, NULL );
void *gpuPtr;
clSetKernelArg( getterKernel, 0, sizeof(cl_mem), &myBuf );
clSetKernelArg( getterKernel, 1, sizeof(cl_mem), &auxBuf );
clEnqueueTask( commandQueue, getterKernel, 0, NULL, NULL );
clEnqueueReadBuffer( commandQueue, auxBuf, CL_TRUE, 0, sizeof(void*), &gpuPtr, 0, NULL, NULL );
clReleaseMemObject(auxBuf);
...
Now "gpuPtr" should contain the address of the beginning of "myBuf" in GPU memory space.
The solution seems like it should be obvious, but I can't find it. How can I get a pointer to device memory back when creating buffers?
It's because in the OpenCL model, host memory and device memory are disjoint. A pointer in device memory will have no meaning on the host.
You can map a device buffer to host memory using clEnqueueMapBuffer. The mapping will synchronize device to host, and unmapping will synchronize back host to device.
Update. As you explain in the comments, you want to send a tree structure to the GPU. One solution would be to store all tree nodes inside an array, replacing pointers to nodes with indices in the array.
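A minimal sketch of that index-based layout (the node fields are invented for illustration, and it assumes the struct layout matches between host and kernel code):
// All nodes live in one flat array; children are referenced by index, not pointer,
// so the same bytes remain valid on both host and device. -1 means "no child".
typedef struct {
    cl_float value;
    cl_int   left;    // index of the left child in the node array, or -1
    cl_int   right;   // index of the right child in the node array, or -1
} TreeNode;

// Host side: fill nodes[0..NUM_NODES-1], then upload the whole array at once.
// cl_mem tree_d = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
//                                sizeof(TreeNode) * NUM_NODES, nodes, &ret);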
As Eric pointed out, there are two sets of memory to consider: host memory and device memory. Basically, OpenCL tries to hide the gritty details of this interaction by introducing the buffer object for us to interact with in our host program. Now, as you noted, the problem with this approach is that it hides away the details of the device when we want to do something trickier than the OpenCL developers intended or allowed for. The solution here is to remember that OpenCL kernels use C99, and that the language lets us work with pointers without any issue. With this in mind, we can simply store the pointer in an unsigned integer variable to be referenced later.
Your implementation was on the right track, but it needed a little bit more C syntax to finish up the transfer.
OpenCL Kernel:
// Kernel used to obtain the device pointer of a target buffer.
__kernel void mem_ptr(__global char *buffer, __global ulong *ptr)
{
    ptr[0] = (ulong)&buffer[0];
}

// Kernel demonstrating how to use that pointer again after we extract it.
__kernel void use_ptr(__global ulong *ptr)
{
    __global char *print_me = (__global char *)ptr[0];
    /* Code that uses all of our hard work */
    /* ... */
}
Host Program:
// Create the buffer that we want the device pointer from (target_buffer)
// and a place to store it (ptr_buffer).
cl_mem target_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                      MEM_SIZE * sizeof(char), NULL, &ret);
cl_mem ptr_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                   1 * sizeof(cl_ulong), NULL, &ret);
/* Setup the rest of our OpenCL program */
/* .... */
// Setup our kernel arguments from the host...
ret = clSetKernelArg(kernel_mem_ptr, 0, sizeof(cl_mem), (void *)&target_buffer);
ret = clSetKernelArg(kernel_mem_ptr, 1, sizeof(cl_mem), (void *)&ptr_buffer);
ret = clEnqueueTask(command_queue, kernel_mem_ptr, 0, NULL, NULL);
// Now it's just a matter of storing the pointer where we want to use it for later.
ret = clEnqueueCopyBuffer(command_queue, ptr_buffer, dst_buffer, 0, 1 * sizeof(cl_ulong),
                          sizeof(cl_ulong), 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue, ptr_buffer, CL_TRUE, 0,
                          1 * sizeof(cl_ulong), buffer_ptrs, 0, NULL, NULL);
There you have it. Keep in mind that you don't have to use the char type I used; this works for any type. However, I'd recommend cl_ulong for storing pointers. This doesn't matter much for devices with less than 4 GB of accessible memory, but for devices with a larger address space you have to use cl_ulong. If you absolutely need to save space on a device whose memory is larger than 4 GB, you might be able to create a struct that stores the lower 32 bits of the address in a uint, with the remaining high bits stored in a smaller type.
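If you did go down that route, a purely illustrative sketch of such a packed address (using OpenCL C types; it only works if device addresses actually fit in 48 bits):
// Packs a 64-bit device address into 32 low bits plus 16 high bits.
typedef struct {
    uint   lo;   // bits 0..31 of the address
    ushort hi;   // bits 32..47 of the address
} packed_ptr;

ulong unpack(packed_ptr p)
{
    return ((ulong)p.hi << 32) | (ulong)p.lo;
}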
