Basic OpenCL Mutex Implementation (Currently Hanging)

I am trying to write a mutex for OpenCL. The idea is for each individual work item to be able to proceed through the critical section atomically. Currently, I believe the problem may be that thread warps are unable to proceed when one thread in a warp holds the lock.
My current simple kernel, which sums numbers, is below. "numbers" is the input array of floats, "sum" is a one-element array for the result, and "semaphore" is a one-element array holding the semaphore. I based it heavily on the example here.
void acquire(__global int* semaphore) {
    int occupied;
    do {
        occupied = atom_xchg(semaphore, 1);
    } while (occupied > 0);
}

void release(__global int* semaphore) {
    atom_xchg(semaphore, 0); //the previous value, which is returned, is ignored
}

__kernel void test_kernel(__global float* numbers, __global float* sum, __global int* semaphore) {
    int i = get_global_id(0);
    acquire(semaphore);
    *sum += numbers[i];
    release(semaphore);
}
I am calling the kernel effectively like:
int numof_dimensions = 1;
size_t offset_global[1] = {0};
size_t size_global[1] = {4000}; //the length of the numbers array
size_t* size_local = NULL;
clEnqueueNDRangeKernel(command_queue, kernel, numof_dimensions,offset_global,size_global,size_local, 0,NULL, NULL);
When running as above, the graphics card hangs and the driver restarts itself. How can I fix it so that it doesn't?

What you are trying to do is not possible because of the GPU execution model, where all threads on a "processor" share the instruction pointer, even in branches. Here is a post that explains the problem in detail: http://vansa.ic.cz/author/admin/.
BTW, the example code that you found has the exact same problem and would never work.

The answer might seem obvious in retrospect, but it isn't until you think of it.
Basically, when size_local is left NULL, the runtime picks a local work-group size (the size of a thread warp) greater than 1, and so the thread warps lock up. To fix it, you just need to specify the local size to be 1 (i.e. "size_t size_local[1] = {1};"). Doing this produces a correct result.
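For reference, a minimal host-side sketch of that fix, reusing the enqueue from the question (command_queue and kernel are whatever you already created):
int numof_dimensions = 1;
size_t offset_global[1] = {0};
size_t size_global[1] = {4000}; // the length of the numbers array
size_t size_local[1] = {1};     // force one work item per work group, as described above

clEnqueueNDRangeKernel(command_queue, kernel, numof_dimensions,
                       offset_global, size_global, size_local, 0, NULL, NULL);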

Related

OpenCL: 3D array processing - Global size limit

I'm working with a 3D array of dimensions xdim=49, ydim=1024 and zdim=64. My DEVICE_MAX_WORK_ITEM_SIZES is only 512/512/512. If I declare my
size_t global_work_size = {xdim, ydim, zdim}; and launch a 3D kernel,
I get wrong results since my ydim > 512. If all my dimensions are below 512, I get the expected results. Is there an alternative for this?
CL_DEVICE_MAX_WORK_ITEM_SIZES only limits the size of work groups, not the global work item size (yeah, it's a terrible name for the constant). You are much more tightly restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE, which is the total number of items allowed in a work group (you'd typically hit this far sooner than CL_DEVICE_MAX_WORK_ITEM_SIZES, because the per-dimension sizes multiply together).
So go ahead and launch your global work size of 49, 1024, 64. It should work. If it doesn't, you're probably using get_local_id instead of get_global_id, or you have some other bug. We regularly launch 2D kernels with a 4096 x 4096 global work size.
See also Questions about global and local work size
If you don't use shared local memory, you don't need to worry about local work group sizes. In fact, you can pass NULL instead of a pointer to an array of sizes for local_work_size and let the runtime pick something (it helps if your global dimensions are easily divisible by small numbers).
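Putting the two answers together, a minimal host-side sketch of such a launch (the queue and kernel variable names are placeholders):
size_t global_work_size[3] = {49, 1024, 64}; // xdim, ydim, zdim from the question
// NULL for local_work_size lets the runtime pick a work-group size on its own.
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                    global_work_size, NULL, 0, NULL, NULL);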
Assuming the dimensions you provided are the size of your data, you can decrease the global work size by making each GPU thread calculate more data. What I mean is: every thread in your case currently does one calculation, and if you change your kernel to do, let's say, 2 calculations in the y dimension, then you could cut the number of threads you are firing in half. The global_work_size decides how many threads in each direction you are executing. Let me give you an example:
Let's assume you have an array you want to do some calculations with and the array size you have is 2048. If you write your kernel in the following way, you are going to need 2048 as the global_work_size:
__kernel void calc (__global int *A, __global int *B)
{
    int i = get_global_id(0);
    B[i] = A[i] * 5;
}
The global work size in this case will be:
size_t global_work_size[3] = {2048, 1, 1};
However, if you change your kernel to the following, you can lower your global work size as well:
__kernel void new_calc (__global int *A, __global int *B)
{
    int i = get_global_id(0);
    for (int ind = 0; ind < 8; ind++)
        B[i*8 + ind] = A[i*8 + ind] * 5;
}
This way, you can use a global size of:
size_t global_work_size[3] = {256, 1, 1};
Also, with the second kernel, each of your threads executes more work, resulting in better utilisation.
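For completeness, the corresponding host-side launch for new_calc might look like this (the queue and kernel variable names are placeholders, and the 2048-element array is assumed to be divisible by 8):
size_t global_work_size[3] = {256, 1, 1}; // 2048 elements / 8 elements per work item
cl_int err = clEnqueueNDRangeKernel(queue, new_calc_kernel, 1, NULL,
                                    global_work_size, NULL, 0, NULL, NULL);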

Using async_work_group_copy() with pointer?

__kernel void kmp(__global char pattern[1*4], __global char* string, __global int failure[1*4],
                  __global int ret[1], int g_length, int l_length, int thread_num){
    int pattern_num = 1;
    int pattern_size = 4;
    int gid = get_group_id(0);
    int glid = get_global_id(0);
    int lid = get_local_id(0);
    int i, j, x = 0;
    __local char *tmp_string;
    event_t event;

    if(l_length < pattern_size){
        return;
    }

    event = async_work_group_copy(tmp_string, string + gid*g_length, g_length, 0);
    wait_group_events(1, &event);
That is part of my code.
I want to find the matched pattern in the text.
First, I initialize all my patterns and the string (I read the string from a text file and experimentally use only one pattern) on the CPU side.
Second, I transfer them to the kernel named kmp.
(The parameters l_length and g_length are the sizes of the string pieces that will be copied per lid and glid respectively. In other words, the pieces of the string.)
And lastly, I want to copy the divided string to local memory.
But there is a problem: I cannot get any valid result when I copy it using async_work_group_copy().
When I change __local char* tmp_string to an array, the problem still remains.
What I want to do is 1) divide the string, 2) copy the pieces to each thread, and 3) compute the number of matches.
I wonder what's wrong in this code. Thanks!
OpenCL spec has this:
The async copy is performed by all work-items in a work-group and this
built-in function must therefore be encountered by all work-items in a
work-group executing the kernel with the same argument values;
otherwise the results are undefined.
so you shouldn't return early for any work-items in a group. Early return is better suited to a CPU anyway. If this is a GPU, just compute the last, overflowing part using augmented/padded input and output buffers.
Otherwise, you can return early for a whole group (this should work since no work-item in that group hits the async copy instruction) and do the remaining work on the CPU, unless the device doesn't use work-items for the async copy operation (but a dedicated pipeline instead).
Maybe you can enqueue a second kernel (in another queue, concurrently) to compute the remaining last items with workgroup size = remaining_size, instead of having an extra buffer size or control logic.
tmp_string needs to be initialized/allocated if you are going to copy something to/from it. So you probably will need the array version of it.
async_work_group_copy is not a synchronization point, so it needs a barrier before it to make sure the latest bits of local memory are used for the async copy to global.
__kernel void foo(__global int *a, __global int *b)
{
    int i = get_global_id(0);
    int g = get_group_id(0);
    int l = get_local_id(0);
    int gs = get_local_size(0);
    __local int tmp[256];

    event_t evt = async_work_group_copy(tmp, &a[g*gs], gs, 0);
    // compute foobar here, asynchronously with the copy
    wait_group_events(1, &evt);

    tmp[l] = tmp[l] + 3; // compute foobar2 using local memory
    barrier(CLK_LOCAL_MEM_FENCE);

    event_t evt2 = async_work_group_copy(&b[g*gs], tmp, gs, 0);
    // compute foobar3 here, asynchronously with the copy
    wait_group_events(1, &evt2);
}
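Applied to the kernel in the question, the point about tmp_string means replacing the uninitialized __local char* with an actual local array. A minimal sketch, assuming 256 is a large enough upper bound for g_length:
__local char tmp_string[256];   // allocated local storage instead of an uninitialized pointer
event_t event = async_work_group_copy(tmp_string, string + gid*g_length, g_length, 0);
wait_group_events(1, &event);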

OpenCL usage of register value crashes program

I finished writing an OpenCL kernel for thermodynamics calculations and observed a really weird bug.
My kernel looks like this:
__kernel void energy(... float3 dest, int nlocal, ...){
    int i = get_global_id(0);
    float3 ev = {0.0f, 0.0f, 0.0f};
    for(...){
        //some thermo calculations, adding values to ev.x and ev.y
        ev.x += ...;
        ev.y += ...;
    }
    //Then I want to save the result in dest[i].
    //Program exits at the next two lines
    dest[i].x = ev.x;
    dest[i].y = ev.y;
}
I get an "unmapped Memory" and segfault error. I get the same error when trying to print out the value using printf. Seems like the program can't read the value. Writing to it works though!(Maybe because of some compiler optimizations)
Now if I use another float register value, I get the same error. But if I change the last lines to something like this (no use of ev.x or ev.y)
dest[i].x = i/nlocal*3.1f;
dest[i].y = ...;
everything is going as expected and I get no error.
This works too:
int i = ...
float3 ev = {0.0f, ...};
dest[i].x = ev.x;
But somehow, after the actual calculation, it is not possible anymore.
The program is running on a Nvidia K40m, Kepler architecture.
This looks suspicious in your code:
kernel(... __global int* neigh
__global int* neighs = neigh+i;
...
int j = neighs[k*n];
...
It seems like you are passing an array of pointers in neigh, then fetching one of those pointers and using it.
Passing pointers like that is not allowed in CL; if you pass pointers, you end up addressing outside of GPU memory, and therefore crashing.
It is also possible that your vectors are simply not sized properly; the sizes should be:
res, nneigh = GLOBAL_SIZE
neighs = max(nneigh[])*n
x = max(neighs[])
It is also possible that you created the buffers smaller than they should be (remember they are floats and float3s, which use 32 bits and 128 bits per element). CL API calls are defined in bytes (you should use sizeof()), not in elements.
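As a rough host-side illustration of sizing in bytes (the buffer names and element counts here are hypothetical, not taken from your code):
// dest holds one float3 result per work item; neigh holds num_neigh ints.
cl_mem dest_buf  = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  sizeof(cl_float3) * global_size, NULL, &err);
cl_mem neigh_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  sizeof(cl_int) * num_neigh, NULL, &err);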
Okay, I found the answer, and the code above is working. I changed the kernel parameters for better understanding and corrected the mistake unconsciously when I posted the code here.
int numneigh = nneigh[i] (which stands for "number of neighbors") is correct.
In the original code I did this:
int numneigh = neigh[i] (the neighbors)
Thanks for helping; your guess that something was wrong with neigh/nneigh was correct, even though the mistake was not in the code posted above :P

Use Comment to avoid OpenCL Error on NVIDIA

I wrote the following code to test on my NVIDIA and AMD GPUs:
kernel void computeLayerOutput_Rolled(
        global Layer* layers,
        global float* weights,
        global float* output,
        constant int* restrict netSpec,
        int layer)
{
    const int n = get_global_size(0);
    const int nodeNumber = get_global_id(0); //There will be an offset depending on the layer we are operating on
    int numberOfWeights;
    float t;

    //getPosition(i, netSpec, &layer, &nodeNumber);
    numberOfWeights = layers[layer].nodes[nodeNumber].numberOfWeights;

    //if (sizeof(Layer) > 60000) // This is the extra code added for nvidia
    //    exit(0);

    t = 0;
    for (unsigned int j = 0; j != numberOfWeights; ++j)
        t += threeD_access(weights, layer, nodeNumber, j, MAXSIZE, MAXSIZE) *
             twoD_access(output, layer-1, j, MAXSIZE);
    twoD_access(output, layer, nodeNumber, MAXSIZE) = sigmoid(t);
}
At the beginning, I did not add the code checking the size of Layer, and it worked on an AMD Kalindi GPU, but crashed and reported error code -36 on an NVIDIA Tesla C2075.
Since I had previously rewritten the struct type Layer and decreased its size a lot, I decided to check the size of Layer to determine whether this struct was defined properly in the kernel code. So I added this code:
if (sizeof(Layer) > 60000)
    exit(0);
Then it is OK on NVIDIA. However, the strange thing is that when I put // before this code, just as in the code given above, it still works. (I believe I do not need to run make clean && make when I change something in the kernel code, but I still did it.) Nevertheless, when I roll back to the version that does not contain this comment, it fails and error code -36 appears again. It really puzzles me. I think the two versions of my code are identical, aren't they?

OpenCL - is it possible to invoke another function from within a kernel?

I am following along with a tutorial located here: http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
The kernel they have listed is this, which computes the sum of two numbers and stores it in the output variable:
__kernel void vector_add_gpu (__global const float* src_a,
                              __global const float* src_b,
                              __global float* res,
                              const int num)
{
    /* get_global_id(0) returns the ID of the thread in execution.
       As many threads are launched at the same time, executing the same kernel,
       each one will receive a different ID, and consequently perform a different computation. */
    const int idx = get_global_id(0);

    /* Now each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is YES, the work-item performs the corresponding computation. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}
1) Say for example that the operation performed was much more complex than a summation - something that warrants its own function. Let's call it ComplexOp(in1, in2, out). How would I go about implementing this function such that vector_add_gpu() can call and use it? Can you give example code?
2) Now let's take the example to the extreme, and I now want to call a generic function that operates on the two numbers. How would I set it up so that the kernel can be passed a pointer to this function and call it as necessary?
Yes, it is possible. You just have to remember that OpenCL is based on C99 with some caveats. You can create other functions either inside the same kernel file or in a separate file and just include it at the beginning. Auxiliary functions do not need to be declared as inline; however, keep in mind that OpenCL will inline the functions when they are called. Function pointers, on the other hand, are not available, so you cannot pass a function into the kernel by pointer and call it from there.
Example
float4 hit(float4 ray_p0, float4 ray_p1, float4 tri_v1, float4 tri_v2, float4 tri_v3)
{
    //logic to detect if the ray intersects a triangle
}

__kernel void detection(__global float4* trilist, float4 ray_p0, float4 ray_p1)
{
    int gid = get_global_id(0);
    float4 hitlocation = hit(ray_p0, ray_p1, trilist[3*gid], trilist[3*gid+1], trilist[3*gid+2]);
}
You can have auxiliary functions for use in the kernel; see OpenCL user defined inline functions. You cannot pass function pointers into the kernel.
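For the ComplexOp(in1, in2, out) shape asked about in question 1, a minimal sketch might look like this (the helper body is just a placeholder computation):
// Hypothetical helper: writes its result through a pointer to private memory.
void ComplexOp(float in1, float in2, float* out)
{
    *out = in1 * in2 + in1; // placeholder computation
}

__kernel void vector_op_gpu(__global const float* src_a,
                            __global const float* src_b,
                            __global float* res,
                            const int num)
{
    const int idx = get_global_id(0);
    if (idx < num) {
        float r;
        ComplexOp(src_a[idx], src_b[idx], &r);
        res[idx] = r;
    }
}
Only function pointers are off-limits; passing ordinary data pointers (with the right address space) to a helper like this is fine.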
