Non-blocking kernel launches in the OpenCL Intel implementation

I have the following skeleton code:
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, NULL, 0, NULL, NULL);
printf("print immediately\n");
I thought, and read somewhere, that clEnqueueNDRangeKernel is a non-blocking call and the CPU continues its execution immediately after enqueuing the kernel.
But I see different behaviour: the printf statement executes only after the kernel completes execution. Why am I seeing this behaviour? How do I make kernel calls non-blocking?

Yes, clEnqueueNDRangeKernel() is supposed to be non-blocking. However, the code you show does not let you conclude definitively that the kernel finishes before the printf statement. There are several possibilities:
1) The kernel is not enqueued properly or fails to run. You need to check whether the return value ret is CL_SUCCESS, and if not, fix whatever needs to be fixed and try again.
2) The kernel runs fast, and the thread on which the kernel runs is likely to be given priority, so the printf statement ends up being executed after the kernel finishes.
3) The kernel is actually still running during the printf statement, since nothing in your code lets you conclude otherwise. To check whether the kernel is running or finished, you need to use an event. For example:
cl_event evt = NULL;
cl_int ret, evt_status;
// ...
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, NULL, 0, NULL, &evt);
// Check whether the kernel has finished or not
if (ret == CL_SUCCESS)
{
    clGetEventInfo(evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(cl_int), (void*) &evt_status, NULL);
    if (evt_status == CL_COMPLETE)
        printf("Kernel is finished\n");
    else
        printf("Kernel is NOT finished\n");
}
else
{
    printf("Something's wrong: %d\n", ret);
}
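Note also that enqueuing is only a request: the runtime is not required to start the kernel until the queue is flushed (blocking calls and clFinish do this implicitly). To actually overlap host work with the kernel, a minimal sketch reusing the command_queue, kernel and evt names from above might look like this:
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                             &global_item_size, NULL, 0, NULL, &evt);

// Make sure the command is actually submitted to the device;
// otherwise it may sit in the queue while the host keeps working
clFlush(command_queue);

// ... do independent host work here ...

// Block only when the kernel's results are actually needed
clWaitForEvents(1, &evt);
clReleaseEvent(evt);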

Related

Wrong synchronization of RMA calls in MPI

I am trying to use MPI's RMA scheme with fences. In some cases it works fine, but for systems with multiple nodes I get the following error:
Error message: MPI failed with Error_code = 71950898
Wrong synchronization of RMA calls , error stack:
MPI_Rget(176): MPI_Rget(origin_addr=0x2ac7b10, origin_count=1, MPI_INTEGER, target_rank=0, target_disp=0, target_count=1, MPI_INTEGER, win=0xa0000000, request=0x7ffdc1efe634) failed
(unknown)(): Wrong synchronization of RMA calls
Error from PE:0/4
This is a schematic of how I set up the code:
call MPI_init(..)
CALL MPI_WIN_CREATE(..)

do i = 1, 10
    MPI_Win_fence(0, handle, err)
    calc_values()
    MPI_Put(values)
    MPI_Put(values)
    MPI_Put(values)
    MPI_Win_fence(0, handle, err)

    MPI_Rget(values, req)
    MPI_WAIT(req)
    do_something(values)

    MPI_Rget(values, req)
    MPI_WAIT(req)
    do_something(values)
enddo

call MPI_finalize()
I know that MPI_Put is non-blocking. Is it guaranteed that the MPI_Put is finished after MPI_Win_fence(0, handle, err), or do I have to use MPI_Rput?
What does this error even mean: Wrong synchronization of RMA calls?
How do I fix my communication scheme?
Make sure you add fence calls like the following wherever they are needed to ensure synchronization (and make sure your window(s) are created before you put data into them):
MPI_Win_fence(0, window);
Please look at the example below (source) and note that it makes two fence calls.
// Create the window
int window_buffer = 0;
MPI_Win window;
MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);

if (my_rank == 1)
{
    printf("[MPI process 1] Value in my window_buffer before MPI_Put: %d.\n", window_buffer);
}

MPI_Win_fence(0, window);

if (my_rank == 0)
{
    // Push my value into the first integer in MPI process 1 window
    int my_value = 12345;
    MPI_Put(&my_value, 1, MPI_INT, 1, 0, 1, MPI_INT, window);
    printf("[MPI process 0] I put data %d in MPI process 1 window via MPI_Put.\n", my_value);
}

// Wait for the MPI_Put issued to complete before going any further
MPI_Win_fence(0, window);

if (my_rank == 1)
{
    printf("[MPI process 1] Value in my window_buffer after MPI_Put: %d.\n", window_buffer);
}

// Destroy the window
MPI_Win_free(&window);
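To answer the first question directly: after MPI_Win_fence returns, every RMA operation issued on that window before the fence, including MPI_Put, has completed, so MPI_Rput is not required just to guarantee completion at the fence. Applied to the loop in the question, each round of puts and gets then lives in its own fence-delimited epoch. The following is only a rough C sketch of that structure (it uses a plain MPI_Get completed by a fence instead of MPI_Rget/MPI_Wait); calc_values() and do_something() are the placeholders from the schematic, and window, local_value, remote_value and target_rank are assumed to have been set up beforehand:
MPI_Win_fence(0, window);                      // open the first access epoch
for (int i = 0; i < 10; i++)
{
    calc_values(&local_value);                 // placeholder computation
    MPI_Put(&local_value, 1, MPI_INT, target_rank, 0, 1, MPI_INT, window);

    MPI_Win_fence(0, window);                  // all puts on 'window' have completed after this fence

    MPI_Get(&remote_value, 1, MPI_INT, target_rank, 0, 1, MPI_INT, window);

    MPI_Win_fence(0, window);                  // the get has completed locally after this fence

    do_something(remote_value);                // placeholder consumer
}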

How to allocate Local Work Item sizes in OpenCL

I've set up a convolution kernel in OpenCL to convolve a 228x228x3 image with 11x11x3x96 weights to produce a 55x55x96 output.
My code works perfectly without specifying localWorkSize, but when I do specify it, I start getting errors.
My questions are therefore:
1) How many threads are being launched when I set localWorkSize to NULL? I'm guessing it's implicit but is there any way to get those numbers?
2) How should I allot localWorkSize to avoid errors?
// When localWorkSize is NULL
size_t globalWorkSize[3] = {55, 55, 96};
// Passing NULL for the localWorkSize argument
errNum = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, globalWorkSize, NULL, 0, NULL, &event);
// WORKS PERFECTLY

// When I set localWorkSize
size_t globalWorkSize[3] = {55, 55, 96};
size_t localWorkSize[3] = {1, 1, 1};
errNum = clEnqueueNDRangeKernel(command_queue, kernel, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, &event);
// ERROR CONTEXT CODE 999
I'm just trying to understand how many threads are created when localWorkSize is NULL and globalWorkSize is specified.
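As a general pointer: when localWorkSize is NULL, the runtime picks the work-group size itself, and the total number of work-items is the product of the global sizes (55 x 55 x 96) either way. An explicit local size must (in OpenCL 1.x) divide the global size evenly in each dimension, and its product must not exceed the device and kernel work-group limits. Those limits can be queried with a sketch like the one below, where device and kernel are assumed to be the cl_device_id and cl_kernel already in use:
size_t max_device_wg = 0, max_kernel_wg = 0;

// Upper bound on work-items per work-group for the device as a whole
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_device_wg), &max_device_wg, NULL);

// Upper bound for this particular kernel (may be smaller, e.g. due to register pressure)
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_kernel_wg), &max_kernel_wg, NULL);

printf("device limit: %zu, kernel limit: %zu\n", max_device_wg, max_kernel_wg);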

Issues with clEnqueueMapBuffer in OpenCL

I'm developing a program that implements recursive ray tracing in OpenCL.
To run the kernel I have two device options: the Intel one that is integrated with the system, and the Nvidia GeForce graphics card.
When I run the project with the first device there's no problem; it runs correctly and shows the result of the algorithm just fine.
But when I try to run it with the Nvidia device, it crashes in the callback function that contains the synchronous buffer map.
The part of the code where it crashes is the following:
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL);

// 7. Look at the results via synchronous buffer map.
cl_float4 *ptr = (cl_float4 *) clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ, 0,
                                                  kWidth * kHeight * sizeof(cl_float4), 0, NULL, NULL, NULL);
cl_float *viewTransformPtr = (cl_float *) clEnqueueMapBuffer(queue, viewTransform, CL_TRUE, CL_MAP_WRITE, 0,
                                                             16 * sizeof(cl_float), 0, NULL, NULL, NULL);
cl_float *worldTransformsPtr = (cl_float *) clEnqueueMapBuffer(queue, worldTransforms, CL_TRUE, CL_MAP_WRITE, 0,
                                                               16 * sizeof(cl_float), 0, NULL, NULL, NULL);

memcpy(viewTransformPtr, viewMatrix, sizeof(float) * 16);
memcpy(worldTransformsPtr, sphereTransforms, sizeof(float) * 16);

clEnqueueUnmapMemObject(queue, viewTransform, viewTransformPtr, 0, 0, 0);
clEnqueueUnmapMemObject(queue, worldTransforms, worldTransformsPtr, 0, 0, 0);

unsigned char* pixels = new unsigned char[kWidth * kHeight * 4];
for (int i = 0; i < kWidth * kHeight; i++) {
    pixels[i*4]   = ptr[i].s[0] * 255;
    pixels[i*4+1] = ptr[i].s[1] * 255;
    pixels[i*4+2] = ptr[i].s[2] * 255;
    pixels[i*4+3] = 1;
}

glBindTexture(GL_TEXTURE_2D, 1);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, 4, kWidth, kHeight, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

delete [] pixels;
The last two calls to clEnqueueMapBuffer return error -5, which matches CL_OUT_OF_RESOURCES, but I believe the buffer sizes are correct.
According to the CL spec, making blocking CL calls from a callback is undefined. Your code is likely correct, but you can't use it from a callback. On the Intel platform with integrated memory, the maps are no-ops and thus don't fail there.
CL spec: clSetEventCallback

The behavior of calling expensive system routines, OpenCL API calls to create contexts or command-queues, or blocking OpenCL operations from the following list below, in a callback is undefined.

clFinish
clWaitForEvents
blocking calls to clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueWriteBuffer, and clEnqueueWriteBufferRect
blocking calls to clEnqueueReadImage and clEnqueueWriteImage
blocking calls to clEnqueueMapBuffer and clEnqueueMapImage
blocking calls to clBuildProgram

If an application needs to wait for completion of a routine from the above list in a callback, please use the non-blocking form of the function, and assign a completion callback to it to do the remainder of your work. Note that when a callback (or other code) enqueues commands to a command-queue, the commands are not required to begin execution until the queue is flushed. In standard usage, blocking enqueue calls serve this role by implicitly flushing the queue. Since blocking calls are not permitted in callbacks, those callbacks that enqueue commands on a command queue should either call clFlush on the queue before returning or arrange for clFlush to be called later on another thread.
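Put concretely, one way to follow that guidance is to switch the map in the callback to its non-blocking form, attach a completion callback to the map's event, and flush the queue before returning. The sketch below only illustrates the pattern; the callback names are made up, and queue, buffer, kWidth and kHeight are assumed to be reachable here (e.g. as globals), as in the question's code:
void CL_CALLBACK on_map_complete(cl_event evt, cl_int status, void *user_data)
{
    cl_float4 *ptr = (cl_float4 *) user_data;  // the mapped region is now safe to read
    // ... copy the pixels / continue the pipeline here ...
    clReleaseEvent(evt);
}

void CL_CALLBACK kernel_done_callback(cl_event evt, cl_int status, void *user_data)
{
    cl_event map_evt;
    // Non-blocking map: CL_FALSE instead of CL_TRUE
    cl_float4 *ptr = (cl_float4 *) clEnqueueMapBuffer(queue, buffer, CL_FALSE, CL_MAP_READ, 0,
                                                      kWidth * kHeight * sizeof(cl_float4),
                                                      0, NULL, &map_evt, NULL);
    // Continue only once the map command has actually completed
    clSetEventCallback(map_evt, CL_COMPLETE, on_map_complete, ptr);
    // Callbacks must not block, so flush explicitly to get the map submitted
    clFlush(queue);
}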

Enqueuing to device side queue in a loop

In my code I have kernelA and kernelB. kernelB depends on kernelA's results. I am iterating over these kernels thousands of times, and each iteration depends on the results of the previous iteration.
The host-side enqueue code snippet looks like this:
for (int x = 0; x < iterations; ++x)
{
    queue.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(3*256, 1), cl::NDRange(256, 1));
    queue.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(256, 1), cl::NDRange(256, 1));
}
queue.finish();
The above code is working perfectly fine.
Now I want to port the above code to use device-side enqueue, and I'm facing issues on an AMD GPU. The kernel code:
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelA(...){}

__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelB(...){}

__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void kernelLauncher(...)
{
    queue_t default_queue = get_default_queue();
    clk_event_t ev1, ev2;

    for (int x = 0; x < iterations; ++x)
    {
        void(^fnKernelA)(void) = ^{ kernelA(
            ... // kernel params come here
        ); };

        if (x == 0)
        {
            enqueue_kernel(default_queue,
                           CLK_ENQUEUE_FLAGS_NO_WAIT,
                           ndrange_1D(3 * 256, 256),
                           0, NULL, &ev1,
                           fnKernelA);
        }
        else
        {
            enqueue_kernel(default_queue,
                           CLK_ENQUEUE_FLAGS_NO_WAIT,
                           ndrange_1D(3 * 256, 256),
                           1, &ev2, &ev1, // ev2 sets dependency on kernelB here
                           fnKernelA);
        }

        void(^fnKernelB)(void) = ^{ kernelB(
            ... // kernel params come here
        ); };

        enqueue_kernel(default_queue,
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(256, 256),
                       1, &ev1, &ev2, // ev1 sets dependency on kernelA here
                       fnKernelB);
    }
}
The host code:
queue.enqueueNDRangeKernel(kernelLauncher, cl::NullRange, cl::NDRange(1, 1), cl::NDRange(1, 1));
The issue is that the results returned from the kernel are wrong when run on the AMD GPU. Sometimes the kernel also hangs, which may indicate that something is wrong with the kernel synchronization. The same code works fine on an Intel CPU; I'm not sure whether that is just luck or whether there is something wrong with the synchronization points in the kernel.
Update: enqueue_kernel is failing on the 1025th enqueue command with error -1. I tried to get a more detailed error (added -g during the build) but to no avail. I increased the device queue size to the maximum, but that didn't change anything (it still fails on the 1025th enqueue command). Removing the contents of kernelA and kernelB didn't change anything either. Any thoughts?
Answering an old question to hopefully save someone time in the future. If you query CL_DEVICE_MAX_ON_DEVICE_EVENTS on your device it will return 1024. That is the maximum number of events you can queue on-device, which is why it fails on the 1025th enqueue. If you run your OpenCL code on a different GPU (like Intel) you may be lucky enough to get a real error code back, which will be CLK_DEVICE_QUEUE_FULL, or -161. AMD ignores the -g option and never seems to give anything back but -1 on a failed on-device enqueue.
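The limit can be checked from the host with clGetDeviceInfo. A small sketch, assuming device is the cl_device_id already in use:
cl_uint max_on_device_events = 0;

// Query how many on-device events this device supports
clGetDeviceInfo(device, CL_DEVICE_MAX_ON_DEVICE_EVENTS,
                sizeof(max_on_device_events), &max_on_device_events, NULL);
printf("CL_DEVICE_MAX_ON_DEVICE_EVENTS = %u\n", max_on_device_events);

One possible mitigation, not confirmed by the answer above, is to call release_event() on clk_event_t handles from earlier iterations as soon as they are no longer needed, since every event produced by enqueue_kernel counts against this limit until it is released.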

Is it possible to build the same program twice in OpenCL with different preprocessor options?

Given the following code, where p is a cl_program loaded with some source code, what happens if I run:
*err = clBuildProgram(p,
                      1,
                      m_gpu_device_id,
                      str0,   // Compiler options, see the specification for more details
                      0,
                      0);

cl_kernel kernel0 = clCreateKernel(p,                  // The program where the kernel is
                                   "nn_feedforward",   // The name of the kernel function as declared in the code
                                   err);

*err = clBuildProgram(p, 1, m_gpu_device_id, str1, 0, 0);
cl_kernel kernel1 = clCreateKernel(p, "nn_feedforward", err);
Will kernel1 work with the str1 options while kernel0 keeps the str0 options, or will the first kernel get overwritten in some way?
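For what it's worth, the OpenCL spec states that clBuildProgram returns CL_INVALID_OPERATION if there are kernel objects attached to the program, so the second clBuildProgram call above would be expected to fail once kernel0 exists. One common way to get both variants is to create two separate program objects from the same source; the following is only a sketch, with context, device, source, str0 and str1 assumed to already exist:
cl_int err;

// Build one program object per set of compiler options
cl_program p0 = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(p0, 1, &device, str0, NULL, NULL);
cl_kernel kernel0 = clCreateKernel(p0, "nn_feedforward", &err);

cl_program p1 = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(p1, 1, &device, str1, NULL, NULL);
cl_kernel kernel1 = clCreateKernel(p1, "nn_feedforward", &err);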
