Enqueuing to device side queue in a loop - opencl

In my code I have kernelA and kernelB. kernelB depends on the results of kernelA. I iterate over these kernels thousands of times, and each iteration depends on the results of the previous iteration.
The host-side enqueue code snippet looks like this:
for(int x = 0; x < iterations; ++x)
{
queue.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(3*256, 1), cl::NDRange(256, 1));
queue.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(256, 1), cl::NDRange(256, 1));
}
queue.finish();
The above code is working perfectly fine.
Now I want to port the above code to use device side enqueue and I'm facing issues on AMD GPU. The kernel code:
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelA(...){}
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void kernelB(...){}
__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void kernelLauncher(...)
{
queue_t default_queue = get_default_queue();
clk_event_t ev1, ev2;
for (int x = 0; x < iterations; ++x)
{
void(^fnKernelA)(void) = ^{ kernelA(
... // kernel params come here
); };
if (x == 0)
{
enqueue_kernel(default_queue,
CLK_ENQUEUE_FLAGS_NO_WAIT,
ndrange_1D(3 * 256, 256),
0, NULL, &ev1,
fnKernelA);
}
else
{
enqueue_kernel(default_queue,
CLK_ENQUEUE_FLAGS_NO_WAIT,
ndrange_1D(3 * 256, 256),
1, &ev2, &ev1, // ev2 sets dependency on kernelB here
fnKernelA);
}
void(^fnKernelB)(void) = ^{ kernelB(
... // kernel params come here
); };
enqueue_kernel(default_queue,
CLK_ENQUEUE_FLAGS_NO_WAIT,
ndrange_1D(256, 256),
1, &ev1, &ev2, // ev1 sets dependency on kernelA here
fnKernelB);
}
}
The host code:
queue.enqueueNDRangeKernel(kernelLauncher, cl::NullRange, cl::NDRange(1, 1), cl::NDRange(1, 1));
The issue is that the results returned from the kernel are wrong when it runs on an AMD GPU. Sometimes the kernel also hangs, which suggests a problem with kernel synchronization. The same code works fine on an Intel CPU; I am not sure whether that is luck or whether the synchronization points in the kernel are wrong.
Update: enqueue_kernel is failing on 1025th enqueue command with error -1. I tried to get more detailed error (added -g during build) but to no avail. I increased the device queue size to maximum but that didn't change anything (still failing on 1025th enqueue command). Removing content of kernelA and kernelB didn't change anything either. Any thoughts?

Answering an old question to hopefully save someone time in the future. If you query CL_DEVICE_MAX_ON_DEVICE_EVENTS on your device, it will return 1024. That is the maximum number of events that can be in use "on device" at once (returned by an enqueue and not yet released), which is why the 1025th enqueue fails. If you run your OpenCL code on a different GPU (such as Intel) you may be lucky enough to get a real error code back, which will be CLK_DEVICE_QUEUE_FULL (-161). AMD ignores the -g option and never seems to return anything but -1 for a failed on-device enqueue.
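A minimal host-side C++ sketch of why the 1025th enqueue fails. It assumes the device keeps a fixed pool of CL_DEVICE_MAX_ON_DEVICE_EVENTS slots, that every enqueue returning an event takes one slot, and that a released slot becomes free immediately. EventPool and firstFailedEnqueue are hypothetical names used only for illustration. In the kernel itself, the OpenCL 2.0 built-in release_event() is the call that would hand an event back; whether that is enough on a particular driver would need to be verified.

```cpp
#include <cstddef>

// Hypothetical model of the on-device event pool (not an OpenCL API).
class EventPool {
public:
    explicit EventPool(std::size_t capacity) : capacity_(capacity), inUse_(0) {}

    // Models an enqueue that returns an event: fails when no slot is free.
    bool acquire() {
        if (inUse_ == capacity_) return false;
        ++inUse_;
        return true;
    }

    // Models release_event(): hands one slot back to the pool.
    void release() {
        if (inUse_ > 0) --inUse_;
    }

private:
    std::size_t capacity_;
    std::size_t inUse_;
};

// Returns the 1-based index of the first enqueue that fails, or 0 if all succeed.
// With releaseOld set, the event from two enqueues earlier is released before
// each new enqueue, so at most two events are alive at a time (like ev1/ev2).
std::size_t firstFailedEnqueue(std::size_t capacity, std::size_t enqueues, bool releaseOld) {
    EventPool pool(capacity);
    for (std::size_t n = 1; n <= enqueues; ++n) {
        if (releaseOld && n > 2) pool.release();
        if (!pool.acquire()) return n;
    }
    return 0;
}
```

With a capacity of 1024 and nothing released, the model fails on enqueue 1025, which matches the symptom in the question. With the old event released before it is overwritten, the pool never holds more than two events.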

Related

Non blocking kernel launches in OpenCL intel implementation

I have the following skeleton code
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
&global_item_size,NULL,0, NULL, NULL);
printf("print immediately\n ");
I thought, and read somewhere, that clEnqueueNDRangeKernel is a non-blocking call and that the CPU continues execution immediately after enqueuing the kernel.
But I see different behaviour: the printf statement executes after the kernel completes. Why am I seeing this behaviour? How can I make kernel calls non-blocking?
Yes, clEnqueueNDRangeKernel() is supposed to be non-blocking. However, the code you show does not let you conclude definitively that the kernel finishes before the printf statement. There are several possibilities:
1. The kernel is not enqueued properly or fails to run. You need to check whether the return value ret is CL_SUCCESS, and if not, fix whatever needs to be fixed and try again.
2. The kernel runs fast and the thread on which the kernel runs is likely to be given priority, so the printf statement ends up being executed after the kernel finishes.
3. The kernel is actually running during the printf statement, since nothing in your code allows you to conclude otherwise. To check whether the kernel is running or finished, you need to use an event. For example:
cl_event evt = NULL;
cl_int ret, evt_status;
// ...
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
&global_item_size, NULL, 0, NULL, &evt);
// Check if it's finished or not
if (ret == CL_SUCCESS)
{
clGetEventInfo(evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), (void*) &evt_status, NULL);
if (evt_status == CL_COMPLETE)
printf("Kernel is finished\n");
else
printf("Kernel is NOT finished\n");
clReleaseEvent(evt); // release the event once it is no longer needed
}
else
{
printf("Something's wrong: %d\n", ret);
}
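The check above treats every status other than CL_COMPLETE as "not finished", but the execution status can also be queued, submitted, or running, or it can be a negative error code. A small sketch of decoding it, with the four status values copied from CL/cl.h so that the snippet compiles without the OpenCL headers (describeStatus and the k-prefixed constants are hypothetical names, not part of OpenCL):

```cpp
#include <string>

// Values as defined in CL/cl.h, repeated here so the sketch is self-contained.
constexpr int kClComplete  = 0x0; // CL_COMPLETE
constexpr int kClRunning   = 0x1; // CL_RUNNING
constexpr int kClSubmitted = 0x2; // CL_SUBMITTED
constexpr int kClQueued    = 0x3; // CL_QUEUED

// Maps a value read via CL_EVENT_COMMAND_EXECUTION_STATUS to a short label.
std::string describeStatus(int status) {
    if (status < 0) return "error"; // negative: the command terminated abnormally
    switch (status) {
        case kClComplete:  return "complete";
        case kClRunning:   return "running";
        case kClSubmitted: return "submitted";
        case kClQueued:    return "queued";
        default:           return "unknown";
    }
}
```

Printing describeStatus(evt_status) instead of a yes/no message shows whether the kernel is still waiting in the queue or already executing when printf is reached.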

Calculating the average of Sensor Data (Capacitive Sensor)

So I am starting to mess around with capacitive sensors because it's some pretty cool stuff.
I have followed some tutorials online about how to set it up and use the CapSense library for Arduino, and I have a quick question about the code I wrote to get the average of that data.
void loop() {
long AvrNum;
int counter = 0;
AvrNum += cs_4_2.capacitiveSensor(30);
counter++;
if (counter = 10) {
long AvrCap = AvrNum/10;
Serial.println(AvrCap);
counter = 0;
}
}
This is my loop function, and on the Serial monitor it seems to be working, but the numbers look suspiciously low to me. I'm using a 10M resistor (brown, black, black, green, brown) and am touching a piece of foil that both the send and receive pins are attached to (with electrical tape), and I am getting numbers around 650, give or take 30.
Basically I'm asking if this code looks right and if these numbers make sense...?
The language used in the Arduino environment is really just an unenforced subset of C++ with the main() function hidden inside the framework code supplied by the IDE. Your code is a module that will be compiled and linked to the framework. When the framework starts running, it first initializes itself and then your module by calling the function setup(). Once initialized, the framework enters an infinite loop, calling your module's function loop() on each iteration.
Your code is using local variables in loop() and expecting that they will hold their values from call to call. While this might happen in practice (and likely does, since that part of the framework's main() is probably just while(1) loop();), this is invoking the demons of Undefined Behavior. C++ does not make any promises about the value of an uninitialized variable, and even reading it can cause anything to happen, including appearing to work.
To fix this, the accumulator AvrNum and the counter must be stored somewhere other than on loop()'s stack. They could be declared static, or moved outside the function to module (file) scope. Outside is better IMHO, especially in the constrained Arduino environment.
You also need to clear the accumulator after you finish an average. This is the simplest form of an averaging filter: you sum fixed-length blocks of N samples and produce one average after every Nth sample.
I believe this fragment (untested) will work for you:
long AvrNum;
int counter;
void setup() {
AvrNum = 0;
counter = 0;
}
void loop() {
AvrNum += cs_4_2.capacitiveSensor(30);
counter++;
if (counter == 10) {
long AvrCap = AvrNum/10;
Serial.println(AvrCap);
counter = 0;
AvrNum = 0;
}
}
I provided a setup(), although it is redundant with the C++ language's guarantee that the global variables begin life initialized to 0.
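The block-averaging filter described above can also be packaged so the accumulator and counter travel together. This is a sketch in plain C++ under the same assumptions as the fragment above; BlockAverager is a hypothetical helper and is not part of the CapSense library.

```cpp
// Hypothetical helper: sums blocks of N samples and reports one average per block.
class BlockAverager {
public:
    explicit BlockAverager(int blockSize)
        : blockSize_(blockSize), sum_(0), count_(0) {}

    // Adds one sample. When the block is full, writes the block average to
    // `average`, clears the accumulator and counter, and returns true.
    bool add(long sample, long& average) {
        sum_ += sample;
        ++count_;
        if (count_ == blockSize_) {
            average = sum_ / blockSize_;
            sum_ = 0;
            count_ = 0;
            return true;
        }
        return false;
    }

private:
    int blockSize_;
    long sum_;
    int count_;
};
```

In a sketch like the one above, a BlockAverager declared at file scope would be fed from loop() with the value of cs_4_2.capacitiveSensor(30), and Serial.println would be called only when add() returns true.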
Your line if (counter = 10) is wrong. It should be if (counter == 10).
The first assigns 10 to counter and will (of course) always evaluate to true.
The second tests whether counter equals 10 and will not evaluate to true until counter is, indeed, equal to 10.
Also, kaylum mentions the other problem: AvrNum is never initialized.
This is what I ended up with after spending some more time on it. After some manual calculation, I confirmed it picks up all the data.
long AvrArray[10]; // 10 samples, indices 0 to 9
for(int x = 0; x <= 10; x++){
if(x == 10){
long AvrMes = (AvrArray[0] + AvrArray[1] + AvrArray[2] + AvrArray[3] + AvrArray[4] + AvrArray[5] + AvrArray[6] + AvrArray[7] + AvrArray[8] + AvrArray[9]);
long AvrCap = AvrMes/x;
Serial.print("\t");
Serial.println(AvrCap);
x = 0;
}
AvrArray[x] = cs_4_2.capacitiveSensor(30);
Serial.println(AvrArray[x]);
delay(500);
}

OpenCL MultiGPU slower than single GPU

I am developing an application which performs some processing on video frame data. To accelerate it I use 2 graphics cards and process the data with OpenCL. My idea is to send one frame to the first card and another one to the second card. The devices use the same context, but different command queues, kernels and memory objects.
However, it seems to me that the computations are not executed in parallel, because the time required by the 2 cards is almost the same as the time required by a single graphics card.
Does anyone have a good example of using multiple devices on independent pieces of data simultaneously?
Thanks in advance.
EDIT:
Here is the resulting code after switching to 2 separate contexts. However, the execution time with 2 graphics cards is still the same as with 1 graphics card.
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
// Copy the input data to the device
commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);
// Set kernel arguments
kernel[i].setArg(0, inputDataBuffer[i]);
kernel[i].setArg(1, modulusBuffer[i]);
kernel[i].setArg(2, imagewidth);
}
for (int i = 0; i < numDevices; i++){
// Run kernel
commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
for (int i = 0; i < numDevices; i++){
// Read the modulus back to the host
float* modulus = new float[imageSize/4];
commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);
// Do something with the modulus;
}
Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: because you start an operation and then wait for it to finish, there is no parallelization at all (or very little). This is what you are doing at the moment:
Wr:-Copy1--Copy2--------------------
G1:---------------RUN1--------------
G2:---------------RUN2--------------
Re:-------------------Read1--Read2--
You should change your code to do it like this at least:
Wr:-Copy1-Copy2-----------
G1:------RUN1-------------
G2:------------RUN2-------
Re:----------Read1-Read2--
With this code:
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
// Set kernel arguments //YOU SHOULD DO THIS AT INIT STAGE, IT IS SLOW TO DO IT IN A LOOP
kernel[i].setArg(0, inputDataBuffer[i]);
kernel[i].setArg(1, modulusBuffer[i]);
kernel[i].setArg(2, imagewidth);
// Copy the input data to the device
commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
}
for (int i = 0; i < numDevices; i++){
// Run kernel
commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
float* modulus[numDevices];
for (int i = 0; i < numDevices; i++){
// Read the modulus back to the host
modulus[i] = new float[imageSize/4];
commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
}
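for (int i = 0; i < numDevices; i++){
// Wait until every command on this queue has finished
commandQueues[i].finish();
}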
// Do something with the modulus;
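The two timelines can be compared with a toy model. This is only a sketch in arbitrary time units, not a measurement: it assumes host-to-device copies are serialized, each device runs its kernel independently, and read-backs are serialized on a separate channel. The function makespan and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Toy timeline model (hypothetical; arbitrary time units, not a measurement).
// blockingWrites = true : every kernel starts only after all copies are done.
// blockingWrites = false: each kernel starts as soon as its own copy is done.
int makespan(int numDevices, int copyTime, int runTime, int readTime, bool blockingWrites) {
    // Copies are serialized in both variants.
    std::vector<int> writeEnd(numDevices);
    int t = 0;
    for (int i = 0; i < numDevices; ++i) {
        t += copyTime;
        writeEnd[i] = t;
    }
    // Read-backs happen one after another, each after its kernel has finished.
    int readEnd = 0;
    for (int i = 0; i < numDevices; ++i) {
        int kernelStart = blockingWrites ? writeEnd[numDevices - 1] : writeEnd[i];
        int kernelEnd = kernelStart + runTime;
        readEnd = std::max(readEnd, kernelEnd) + readTime;
    }
    return readEnd;
}
```

For two devices with copy = 2, run = 4 and read = 2, the model gives 12 time units with blocking writes and 10 without, because the first kernel already runs while the second copy is in flight.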
Regarding the comments about using multiple contexts: it depends on whether the GPUs will ever need to communicate. As long as each GPU only uses its own memory, there will be no copy overhead. But if you constantly set and unset kernel args, that will trigger copies to the other GPUs, so be careful with that.
The safer approach when the GPUs do not communicate is to use different contexts.
I suspect your main problem is the memory copy and not the kernel execution. It is highly likely that 1 GPU will fulfil your needs if you hide the memory latency:
Wr:-Copy1-Copy2-Copy3----------
G1:------RUN1--RUN2--RUN3------
Re:----------Read1-Read2-Read3-

OpenCL trying to use semaphore crashes drivers

While writing a simple OpenCL kernel I tried to use semaphores, and it crashed my GPU drivers (AMD 12.10). After checking some examples I found that the crash happens only when the local work size is not equal to 1.
This code is taken from an example:
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
void GetSemaphor(__global int * semaphor)
{
int occupied = atom_xchg(semaphor, 1);
while(occupied > 0)
{
occupied = atom_xchg(semaphor, 1);
}
}
void ReleaseSemaphor(__global int * semaphor)
{
int prevVal = atom_xchg(semaphor, 0);
}
__kernel void kernelNoAtomInc(__global int * num,
__global int * semaphor)
{
int i = get_global_id(0);
GetSemaphor(&semaphor[0]);
{
num[0]++;
}
ReleaseSemaphor(&semaphor[0]);
}
In the example the author uses
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 1 }, null);
where N = global_work_size and local_work_size = 1.
Now if I change 1 to null, 2, 4, or any other number I tried, the AMD drivers crash.
CQ.Execute(kernelNoAtomInc, null, new long[1] { N }, new long[1] { 2 }, null);
I do not have another PC to test on at the moment. However, it seems strange that the author deliberately left local_group_size = 1, which is why I think I am missing something here. Can someone please explain this to me? Also, as far as I understand, leaving local_group_size at 1 will greatly affect performance, won't it?
Thanks.
Host: Win8 x64, HD6870
Your problem is not reproducible, and I cannot find your source from the link, but here are a few ideas on why it could crash, which may still be helpful even nine years later.
It probably crashes because...
... the driver thinks you want the local version of the atom_xchg() function to be executed, when you actually want the global one.
... your loop slows down execution of the kernel so drastically on an old machine that an internal execution time limit is exceeded, causing the driver to terminate the kernel.
Possible fixes I can suggest:
Do not enable the local version of the atomic functions in your kernel.
Try running it on the CPU.
Beyond that, there is no way to fix this without access to your computer to debug on it.
You also asked why the author chose a local_group_size of one. This is because the global work size needs to be divisible by the local work size, so that the division results in a natural number. Dividing a natural number by one always results in a natural number, therefore it is perfect for experimenting. You are completely correct in saying that it will affect performance greatly. (It is also possible that the sizes did not divide evenly, so the kernel never started rather than crashed.)
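A common way to satisfy the divisibility rule with a local size larger than one is to pad the global size up to the next multiple of the local size. A minimal sketch follows; roundUpToMultiple is a hypothetical helper, not an OpenCL function, and the kernel would then need a bounds check so that the padding work-items do nothing.

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of localSize that is >= globalSize.
// Assumes localSize > 0.
std::size_t roundUpToMultiple(std::size_t globalSize, std::size_t localSize) {
    return ((globalSize + localSize - 1) / localSize) * localSize;
}
```

For example, 1000 work-items with a local size of 256 would be launched as 1024, with the last 24 work-items returning early in the kernel.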
Other notes:
To make the increment functionally correct, you should use atom_inc() on your num buffer. I don't see how the missing atomic could lead to a crash, but it definitely keeps your program from working as intended.
I would use the atomic functions from the OpenCL 2.0 standard, since they already provide semaphore-like functions: bool atomic_flag_test_and_set(volatile atomic_flag *object) and void atomic_flag_clear(volatile atomic_flag *object).
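The two suggestions can be compared in host-side C++, where std::atomic_flag and std::atomic<int> stand in for the OpenCL 2.0 atomic_flag and for atom_inc. This is an analogy under that assumption, not kernel code; SpinLock, incrementWithLock and incrementAtomic are hypothetical names.

```cpp
#include <atomic>

// Spinlock built on std::atomic_flag (test-and-set to acquire, clear to release).
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin until the previous holder clears the flag
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Variant 1: a plain counter guarded by the lock (the semaphore approach).
int incrementWithLock(SpinLock& lock, int& counter) {
    lock.lock();
    int value = ++counter;
    lock.unlock();
    return value;
}

// Variant 2: one atomic read-modify-write, the counterpart of atom_inc.
int incrementAtomic(std::atomic<int>& counter) {
    return counter.fetch_add(1) + 1;
}
```

The second variant needs no loop at all, which is why a single atomic increment is usually preferable to a hand-written semaphore for a shared counter.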

OpenCL random kernel behaviour when certain system size is exceeded

I am having a problem like this. Basically, I have a 2D grid allocated on host:
double* grid = (double*)malloc(sizeof(double)*(ny*nx) * 9);
Following the normal OpenCL procedure to put it on the OpenCL device:
cl_mem cl_grid = clCreateBuffer(context, CL_MEM_COPY_HOST_PTR, sizeof(double) * (ny*nx) * 9, grid, &error);
And Enqueue and launch:
clEnqueueNDRangeKernel(queue, foo, 1, NULL, &global_ws, &local_ws, 0, NULL, NULL);
In the kernel function, simple arithmetic is performed on the 1st column of the grid:
__kernel void foo(__constant ocl_param* params, __global double* grid)
{
const int ii = get_global_id(0);
int jj;
jj=0;
if (ii < params->ny) {
grid[getIndexUsingMacro(ii,jj)] += params->someNumber;
}
}
And finally read back the buffer and check values.
clEnqueueReadBuffer(queue, cl_grid, CL_TRUE, 0, sizeof(double) * 9 * nx * ny, checkGrid, 0, NULL, NULL);
The problem occurs when the grid size (i.e. nx * ny * 9) exceeds 16384 * 9 * 8 bytes = 1152KB (* 8 since double precision is used).
When using OpenCL on the CPU, a CL_OUT_OF_RESOURCES error is thrown when launching the kernel, no matter what I set for global_ws and local_ws (I set them to 1 and the error is still thrown). The CPU is an Intel i5 2415m with 8GB of RAM and 3MB cache.
When using OpenCL on the GPU (NVIDIA TESLA M2050), no error is thrown. However, when reading back the values from the buffer, the grid is not changed at all: it returns a grid whose values are exactly the same as before it was sent to the kernel function.
For example, when I set nx = 30, ny = 546, nx*ny = 16380, everything runs fine and the grid is returned with the results changed as expected. But when ny = 547, nx*ny = 16410, the problem occurs on both CPU and GPU as described above. The problem is the same if I swap nx and ny: with nx = 547, ny = 30, it still happens. Can you suggest what might be the problem here?
Many thanks
It looks like a synchronization issue. grid[index] += value with the same index value may be executed concurrently by several work items. This operation is not atomic, and all these work items will load grid[index], add their value, and store it back, possibly losing some increments in the process.
To solve this, you can synchronize these work items with a barrier if they are in a single work group, or by splitting the work across separately enqueued kernels otherwise.
Another possibility is to ensure only one work item is able to modify a given element of the grid (usually the best solution).
If several work items need to work on a common subset of the grid, using local memory and local memory barriers may be useful.
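The lost update described above can be made concrete with a deterministic model in which each grid[index] += value is split into its load, add and store steps. This is a sketch of one possible interleaving of two work-items, written sequentially in C++ for illustration; the function names are hypothetical.

```cpp
// Two work-items update the same cell, and both load before either stores.
double interleavedUpdate(double initial, double a, double b) {
    double cell = initial;
    double loadedByFirst = cell;   // work-item 1 loads the old value
    double loadedBySecond = cell;  // work-item 2 loads the same old value
    cell = loadedByFirst + a;      // work-item 1 stores its result
    cell = loadedBySecond + b;     // work-item 2 overwrites it: a is lost
    return cell;
}

// The same two updates when each one completes before the next begins.
double serializedUpdate(double initial, double a, double b) {
    double cell = initial;
    cell = cell + a;
    cell = cell + b;
    return cell;
}
```

Starting from 1.0 with increments 2.0 and 3.0, the interleaved version ends at 4.0 while the serialized version ends at 6.0. Giving each grid element exactly one writer removes the possibility of the first case.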
