Static variable in OpenCL C

I'm writing a renderer from scratch using OpenCL and I have a small compilation problem on my kernel, with the error:
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with Nvidia drivers) but doesn't work on my laptop (also with Nvidia drivers). I also have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong?
As a clarification, I'm coding a raymarcher whose kernel takes a list of objects "encoded" in a float array. That array is needed all over the program, which is why I want it accessible to the whole kernel.
Here is the kernel code, simplified:
float* objects;

float4 getDistCol(float3 position) {
    int arr_length = objects[0];
    float4 distCol = {INFINITY, 0, 0, 0};
    int index = 1;
    while (index < arr_length) {
        float objType = objects[index];
        if (compare(objType, SPHERE)) {
            // Treats this part of the buffer as a sphere
            index += SPHERE_ATR_LENGTH;
        } else if (compare(objType, PLANE)) {
            // Treats this part of the buffer as a plane
            index += PLANE_ATR_LENGTH;
        } else {
            float4 errCol = {500, 1, 0, 0};
            return errCol;
        }
    }
    return distCol;
}
__kernel void mkernel(__global int *image, __constant int *dimension,
                      __constant float *position, __constant float *aimDir,
                      __global float *objs) {
    objects = objs;
    // Gets ray direction and stuff
    // ...
    // ...
    float4 distCol = RayMarch(ro, rd);
    float3 impact = rd*distCol.x + ro;
    col = distCol.yzw * GetLight(impact);
    image[dimension[0]*dimension[1] - idx*dimension[1] + idy] = toInt(col);
}
Here, getDistCol(float3 position) gets called a lot, by a lot of functions, and I would like to avoid having to pass my float buffer to every function that needs to call getDistCol()...

There are no "static" variables allowed in OpenCL C that you can declare outside of kernels and use across kernels. Some compilers might still tolerate this; others might not. Nvidia has recently changed their OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have two different compilers on your desktop/laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
    int arr_length = objects[0]; // access objects normally, as you would in the kernel
    // ...
}
kernel void mkernel(__global int *image, __constant int *dimension, __constant float *position, __constant float *aimDir, __global float *objs) {
    // ...
    getDistCol(position, objs); // hand the global objs pointer over to the function
    // ...
}
Lonely variables out in the wild are only allowed in the constant address space, which is useful for large lookup tables. They are cached in L2$, so read-only access is potentially faster. Example:
constant float objects[1234] = {
    1.0f, 2.0f, ...
};

Related

Force all threads in a work group to execute the same if/else branch

I would like to use the local/shared memory optimization to reduce global memory access, so I basically have this function:
float __attribute__((always_inline)) test_unoptimized(const global float* data, ...) {
    // ...
    for(uint j=0; j<def_data_length; j++) {
        const float x = data[j];
        // do some computation with x, like finding the minimum value ...
    }
    // ...
    return x_min;
}
and do the usual local/shared memory optimization on it:
float __attribute__((always_inline)) test_optimized(const global float* data, ...) {
    // ...
    const uint lid = get_local_id(0); // shared memory optimization (only works with first ray)
    local float cache_x[def_ws];
    for(uint j=0; j<def_data_length; j+=def_ws) {
        cache_x[lid] = data[j+lid];
        barrier(CLK_LOCAL_MEM_FENCE);
        #pragma unroll
        for(uint k=0; k<min(def_ws, def_data_length-j); k++) {
            const float x = cache_x[k];
            // do some computation with x, like finding the minimum value ...
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // ...
    return x_min;
}
Now the difficulty is that test_optimized is called in the kernel only in one of two possible if/else branches. If only some threads in a workgroup execute the else-branch, all other threads must not choose the if-branch for the local memory optimization in test_optimized to work. So I created a workaround: The condition for each thread in the workgroup is atomic_or-ed into an integer and then the integer, which is the same for all threads, is checked for branching. This ensures that, if 1 or more threads in the thread block choose the else-branch, all the others do too.
kernel void test_kernel(const global float* data, global float* result, ...) {
    const uint n = get_global_id(0);
    // ...
    const bool condition = ...; // here I get some condition based on the thread ID n and global data
    local uint condition_any; // make sure all threads within a workgroup are in the if/else part
    condition_any = 0u;
    barrier(CLK_LOCAL_MEM_FENCE);
    atomic_or(&condition_any, condition);
    barrier(CLK_LOCAL_MEM_FENCE);
    if(condition_any==0u) {
        // if-part is very short
        result[n] = 0;
        return;
    } else {
        // else-part calls the test_optimized function
        const float x_min = test_optimized(data, ...);
        result[n] = condition ? x_min : 0;
    }
}
The above code works flawlessly and is about 25% faster than with the test_unoptimized function. But atomically jamming a bit into the same local memory location from all threads in the workgroup seems a bit like a hack to me, and it only runs efficiently for small workgroup sizes (def_ws) of 32, 64 or 128, but not 256 or greater.
Is this trick used in other codes and does it have a name?
If not: Is there a better way to do it?
With OpenCL 1.2 or older, I don't think there's a way to do this any faster. (I'm not aware of any relevant vendor extensions, but check your implementation's list for anything promising.)
With OpenCL 2.0+, you can use workgroup functions, in this case specifically work_group_any() for this sort of thing.
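With work_group_any(), the atomic_or workaround from the question collapses into a single call; a minimal sketch, assuming the same condition, result, n and test_optimized from the question, and an OpenCL 2.0 device:

```c
// work_group_any() returns a non-zero value in every work-item of the
// group if the predicate is non-zero for at least one of them, so all
// work-items take the same branch and the barriers inside
// test_optimized() remain safe
if(work_group_any(condition)) {
    const float x_min = test_optimized(data, ...);
    result[n] = condition ? x_min : 0;
} else {
    result[n] = 0; // the whole group agrees the short path suffices
}
```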

Using async_work_group_copy() with pointer?

__kernel void kmp(__global char pattern[1*4], __global char* string, __global int failure[1*4], __global int ret[1], int g_length, int l_length, int thread_num){
    int pattern_num = 1;
    int pattern_size = 4;
    int gid = get_group_id(0);
    int glid = get_global_id(0);
    int lid = get_local_id(0);
    int i, j, x = 0;
    __local char *tmp_string;
    event_t event;
    if(l_length < pattern_size){
        return;
    }
    event = async_work_group_copy(tmp_string, string+gid*g_length, g_length, 0);
    wait_group_events(1, &event);
Those are some parts of my code.
I want to find the matched pattern in the text.
First, I initialize all my patterns and the string (I read the string from a text and, experimentally, use only one pattern) on the CPU side.
Second, I transfer them to the kernel named kmp.
(The parameters l_length and g_length are the sizes of the string pieces that will be copied for each lid and glid respectively. In other words, the pieces of the string.)
And lastly, I want to copy the divided string to local memory.
But there is a problem: I cannot get any valid result when I copy them using async_work_group_copy().
When I change __local char* tmp_string to an array, the problem still remains.
What I want to do is 1) divide the string, 2) copy the pieces to each thread and 3) compute the matching number.
I wonder what's wrong in this code. Thanks!
The OpenCL spec has this:
The async copy is performed by all work-items in a work-group and this
built-in function must therefore be encountered by all work-items in a
work-group executing the kernel with the same argument values;
otherwise the results are undefined.
so you shouldn't return early for any work-items in a group. Early return is better suited to CPUs anyway; if this is a GPU, just compute the last overflowing part using augmented/padded input/output buffers.
Otherwise, you can early-return a whole group (this should work since then no work-item hits the async copy instruction) and do the remaining work on the CPU, unless the device doesn't use any work-items (but a dedicated secret pipeline) for the async copy operation.
Maybe you can enqueue a second kernel (in another queue, concurrently) to compute the remaining last items with workgroupsize=remaining_size, instead of having extra buffer size or control logic.
tmp_string needs to be initialized/allocated if you are going to copy something to/from it, so you will probably need the array version of it.
async_work_group_copy is not a synchronization point, so it needs a barrier before it to get the latest bits of local memory for an async copy to global memory.
__kernel void foo(__global int *a, __global int *b)
{
    int i=get_global_id(0);
    int g=get_group_id(0);
    int l=get_local_id(0);
    int gs=get_local_size(0);
    __local int tmp[256];
    event_t evt=async_work_group_copy(tmp,&a[g*gs],gs,0);
    // compute foobar here, in parallel with the async copy
    wait_group_events(1,&evt);
    tmp[l]=tmp[l]+3; // compute foobar2 using local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    event_t evt2=async_work_group_copy(&b[g*gs],tmp,gs,0);
    // compute foobar3 here, in parallel with the async copy
    wait_group_events(1,&evt2);
}

Use Comment to avoid OpenCL Error on NVIDIA

I wrote the following code for my tests on NVIDIA and AMD GPUs:
kernel void computeLayerOutput_Rolled(
    global Layer* layers,
    global float* weights,
    global float* output,
    constant int* restrict netSpec,
    int layer)
{
    const int n = get_global_size(0);
    const int nodeNumber = get_global_id(0); // there will be an offset depending on the layer we are operating on
    int numberOfWeights;
    float t;

    //getPosition(i, netSpec, &layer, &nodeNumber);
    numberOfWeights = layers[layer].nodes[nodeNumber].numberOfWeights;

    //if (sizeof(Layer) > 60000) // this is the extra code added for NVIDIA
    //    exit(0);

    t = 0;
    for (unsigned int j = 0; j != numberOfWeights; ++j)
        t += threeD_access(weights, layer, nodeNumber, j, MAXSIZE, MAXSIZE) *
             twoD_access(output, layer-1, j, MAXSIZE);
    twoD_access(output, layer, nodeNumber, MAXSIZE) = sigmoid(t);
}
At the beginning, I did not add the code checking the size of Layer, and it worked on an AMD Kalindi GPU, but crashed and reported error code -36 on an NVIDIA Tesla C2075.
Since I had rewritten the struct type Layer and decreased its size a lot before, I decided to check the size of Layer to determine whether this struct was defined well in the kernel code. Then I added this code:
if (sizeof(Layer) > 60000)
    exit(0);
Then it was OK on NVIDIA. However, the strange thing is that when I add // before this code, just as in the listing above, it still works. (I believe I do not need to make clean && make when I rewrite something in the kernel code, but I still did it.) Nevertheless, when I roll back to the version that does not contain this comment, it fails and error code -36 appears again. It really puzzles me. I think the two versions of my code are identical, aren't they?
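A host/device disagreement about sizeof(Layer) would explain this kind of failure, and one way to check it directly is to let the device report its own view of the struct size. A minimal diagnostic sketch (the layer_size kernel name is hypothetical; Layer is assumed to be the struct from the question):

```c
// diagnostic sketch: the device writes sizeof(Layer) into a buffer;
// the host reads it back and compares it against its own sizeof(Layer)
kernel void layer_size(global ulong* out)
{
    out[0] = (ulong)sizeof(Layer);
}
```

If the two values differ, the host and device compilers disagree about padding/alignment, and every buffer access past the first mismatched field reads garbage.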

OpenCL Array Indexing Seems Broken

I've got a kernel with a simple array declaration and initialization, and an extra function "get_smooth_vertex(...)", which I have changed so as to demonstrate a problem:
//More const __constant declarations
const __constant int edge_parents[12][2] = { {0,1}, {0,2}, {1,3}, {2,3}, {0,4}, {1,5}, {2,6}, {3,7}, {4,5}, {4,6}, {5,7}, {6,7} };
//More functions
float3 get_smooth_vertex(const int edge_index, const float* cube_potentials) {
    int i1 = edge_parents[edge_index][0];
    int i2 = edge_parents[edge_index][1];
    if (i1==i2) return (float3)(0);\n"
    return (float3)(1);\n"
}
__kernel void march(const __global float* potentials, __global float* vertices, __global float* normals, const __constant float4* points, const int numof_points) {
    //Lots of stuff.
    //Call get_smooth_vertex(...) a few times
    //More stuff.
}
The if path in "get_smooth_vertex(...)" always seems to get executed! Now, I can't imagine why this would be, because each pair in "edge_parents" is different. I checked "edge_index", and it is always >= 0 and always <= 11. Furthermore, none of the variables are aliased in global or local scope. The kernel (and host code, FWIW) compiles with no warnings or errors.
So, I can't figure out what's wrong--why would the indices equal each other? Alignment, maybe? Am I just completely forgetting how C works or something? Watch—it's going to be royal user error . . .
I checked your code and the comparison works just fine (after removing the trailing \n"). You have probably made a mistake when evaluating the return value of get_smooth_vertex(), but this is hard to tell without code that shows how it is called.

Avoiding data alignment in OpenCL

I need to pass a complex data type to OpenCL as a buffer and I want (if possible) to avoid the buffer alignment.
In OpenCL I need to use two structures to differentiate the data passed in the buffer, casting to them:
typedef struct
{
    char a;
    float2 position;
} s1;

typedef struct
{
    char a;
    float2 position;
    char b;
} s2;
I define the kernel in this way:
__kernel void
Foo(
    __global const void* bufferData,
    const int amountElements // in the buffer
)
{
    // Now I cast to one of the structs depending on an extra value
    __global s1* x = (__global s1*)bufferData;
}
And it works well only when I align the data passed in the buffer.
The question is: is there a way to use __attribute__((packed)) or __attribute__((aligned(1))) to avoid the alignment of the data passed in the buffer?
If padding the smaller structure is not an option, I suggest passing another parameter to let your kernel function know what the type is - maybe just the size of the elements.
Since you have data types that are 9 and 10 bytes, it may be worth a try padding them both out to 12 bytes depending on how many of them you read within your kernel.
Something else you may be interested in is the extension: cl_khr_byte_addressable_store
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html
update:
I didn't realize you were passing a mixed array; I thought it was uniform in type. If you want to track the type on a per-element basis, you should pass a list of the types (or codes). Using float2 on its own in bufferData would probably be faster as well:
__kernel void
Foo(
    __global const float2* bufferData,
    __global const char* bufferTypes,
    const int amountElements // in the buffer
)