I have a problem sending some data to my device.
The data type I want to receive in my kernel is:
typedef struct {
    uint3 nbCells;
    float3 worldOrigin;
    float3 cellSize;
    float3 gridSize;
    float radius;
} grid_t;
and the data type I send from the host is:
typedef struct {
    uint nbCells[3];
    float worldOrigin[3];
    float cellSize[3];
    float gridSize[3];
    float radius;
} grid_t;
but it doesn't work.
I send this:
8, 8, 8; 0, 0, 0; 1.03368e-06, 1.03368e-06, 1.03368e-06; 8.2694e-06, 8.2694e-06, 8.2694e-06; 3e-07
but in my kernel I receive this:
8, 8, 8; 0, 0 1.03368e-06; 1.03368e-06, 8.2694e-06, 8.2694e-06; 3e-07, 8.2694e-06, 0; 1.16428e-05
I know that float3 is actually treated like float4 in OpenCL, so I tried float4 on the device with an array of 4 floats on the host, but that didn't work either. When I receive the data into arrays of 3 floats instead of float3 fields, it works perfectly. It seems that, in OpenCL, a structure built from float3 fields doesn't have the same memory layout as one built from arrays of 3 floats.
With the same structure, but using double instead of float, everything works perfectly.
You should never try to match CL types to non-CL types, unless you really know they match.
If this is your kernel type:
typedef struct {
    uint3 nbCells;
    float3 worldOrigin;
    float3 cellSize;
    float3 gridSize;
    float radius;
} grid_t;
then this should be your host type:
typedef struct {
    cl_uint3 nbCells;
    cl_float3 worldOrigin;
    cl_float3 cellSize;
    cl_float3 gridSize;
    cl_float radius;
} grid_t;
That's the use case for the cl_ data types defined on the host side (when you include CL/cl.h).
Your error comes from using a float3 type (which has the same size and alignment as float4) and emulating it with a float[3] instead of a float[4]: every field after the first is then read at the wrong offset, which is exactly the shifted pattern you observe.
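For illustration, here is a minimal host-side sketch of filling and sending that struct. The queue and gridBuf handles are assumptions for the example, and the .s[] accessor is used because it works even where anonymous unions are unavailable:

#include <CL/cl.h>

typedef struct {
    cl_uint3 nbCells;
    cl_float3 worldOrigin;
    cl_float3 cellSize;
    cl_float3 gridSize;
    cl_float radius;
} grid_t;

static void send_grid(cl_command_queue queue, cl_mem gridBuf) {
    grid_t grid;
    for (int i = 0; i < 3; ++i) {
        grid.nbCells.s[i]     = 8;
        grid.worldOrigin.s[i] = 0.0f;
        grid.cellSize.s[i]    = 1.03368e-06f;
        grid.gridSize.s[i]    = 8.2694e-06f;
    }
    grid.radius = 3e-07f;

    /* sizeof(grid_t) now matches the kernel-side struct, because cl_float3
       occupies 16 bytes, exactly like float3 on the device */
    clEnqueueWriteBuffer(queue, gridBuf, CL_TRUE, 0, sizeof(grid_t),
                         &grid, 0, NULL, NULL);
}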
Related
I'm writing a renderer from scratch using OpenCL and I have a small compilation problem in my kernel, with the error:
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with Nvidia drivers) but doesn't work on my laptop (also with Nvidia drivers), and I have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong?
As a clarification, I'm coding a raymarcher whose kernel takes a list of objects "encoded" in a float array that is needed throughout the program, which is why I need it accessible to the whole kernel.
Here is the simplified kernel code:
float* objects;

float4 getDistCol(float3 position) {
    int arr_length = objects[0];
    float4 distCol = {INFINITY, 0, 0, 0};
    int index = 1;
    while (index < arr_length) {
        float objType = objects[index];
        if (compare(objType, SPHERE)) {
            // Treats the part of the buffer as a sphere
            index += SPHERE_ATR_LENGTH;
        } else if (compare(objType, PLANE)) {
            // Treats the part of the buffer as a plane
            index += PLANE_ATR_LENGTH;
        } else {
            float4 errCol = {500, 1, 0, 0};
            return errCol;
        }
    }
    return distCol;
}
__kernel void mkernel(__global int *image, __constant int *dimension,
                      __constant float *position, __constant float *aimDir,
                      __global float *objs) {
    objects = objs;
    // Gets ray direction and stuff
    // ...
    // ...
    float4 distCol = RayMarch(ro, rd);
    float3 impact = rd * distCol.x + ro;
    float3 col = distCol.yzw * GetLight(impact);
    image[dimension[0]*dimension[1] - idx*dimension[1] + idy] = toInt(col);
}
getDistCol(float3 position) gets called a lot by a lot of functions, and I would like to avoid having to pass my float buffer to every function that needs to call getDistCol()...
There are no "static" variables in OpenCL C that you can declare outside of kernels and use across kernels. Some compilers might still tolerate this, others might not. Nvidia recently changed their OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have two different compilers on your desktop and laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
    int arr_length = objects[0]; // access objects normally, as you would in the kernel
    // ...
}

kernel void mkernel(__global int *image, __constant int *dimension,
                    __constant float *position, __constant float *aimDir,
                    __global float *objs) {
    // ...
    // hand the global objs pointer over to the function; the float3 is
    // assembled from the kernel's own position parameter
    getDistCol((float3)(position[0], position[1], position[2]), objs);
    // ...
}
Lonely variables out in the wild are only allowed in the constant memory space, which is useful for large tables. They are cached in L2$, so read-only access is potentially faster. Example:
constant float objects[1234] = {
    1.0f, 2.0f, ...
};
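As a quick illustration of how such a table is then consumed, any kernel in the program can read it directly; the kernel name and output buffer here are made up for the sketch:

kernel void read_table(__global float *out) {
    int i = get_global_id(0);
    out[i] = objects[i % 1234]; // read-only access to the program-scope constant table
}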
I currently have something like this in my kernel code:
func(__global float2 *array, __global float *buffer) {
    float *vector[2];
    vector[0] = array.s0;
    vector[1] = array.s1;
So I can do something like this later in the code:
vector[vec_off][index] = buffer[i];
Basically, I want to be able to access the elements of a float2 in my code based on a calculated index. The point is to be able to easily expand it to a float4/float16 vector later on.
Currently I get a -11 error (CL_BUILD_PROGRAM_FAILURE) when I try to do vector[0] = array.x, which I guess means I'm not allowed to write it like that in OpenCL.
If it's not just a syntax error, I should be able to do this by accessing each element of array using an offset, so I would have:
array.s0 = array
array.s1 = array + offset
...
array.sf = array + 15 * offset
However, I do not know how a floatn is stored in memory. Is the .s1 part stored right after the .s0? If that is the case, then the offset would just be the size of array.s0, right?
Thank you.
To use a calculated index to access float2 elements, you can use a union or cast directly to float*:
1. Using union
Define the following union:
union float_type
{
    float2 data2;
    float data[2];
};
and then cast the float2 array on the fly and access elements using a calculated index:
func(__global float2 *array, __global float *buffer) {
    float foo = ((__global union float_type*)array)[1].data[1];
}
2. Cast to float*
func(__global float2 *array, __global float *buffer) {
    float foo = ((__global float*)&array[1])[1];
}
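Both work because vector components are laid out consecutively in memory (.s1 directly after .s0), so the same approach extends unchanged to float4 or float16. As a minimal sketch of the original goal, writing through the cast with a computed component index (the kernel name and body are assumptions):

kernel void scatter(__global float2 *array, __global float *buffer) {
    int index = get_global_id(0);
    int vec_off = index % 2; // calculated component index, 0 or 1 for a float2
    ((__global float*)&array[index])[vec_off] = buffer[index];
}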
I've been having trouble with a misaligned structure. Here are the structures involved:
struct Ray
{
    float4 origin;
    float4 dir;
    float len;
    float dummy[3];
};

struct RayStack
{
    struct Ray r[STACK_DEPTH];
    int depth[STACK_DEPTH];
    float refr[STACK_DEPTH];
    int top;
    float dummy[3];
};
Incidentally, STACK_DEPTH is a multiple of 4. I've been careful to make sure that all the structures are a multiple of 16 bytes in size and that the float4s within are on an aligned boundary.
The problem is that when I use it as a local variable, the structure RayStack is unaligned:
struct RayStack stack;
printf("stack: %p\n", &stack);
The stack address ends up ending in 8 rather than 0, as I would want for a 16-byte-aligned structure. This causes a crash on ATI cards (although Intel and Nvidia are not bothered by it). I've tried placing __attribute__((aligned(16))) in the structure (before and after) and on the local variable definition, and that doesn't change anything. Oddly, adding a printf statement fixed the problem, although I have no idea how.
Is there a way to ensure that the local variable stack is aligned on a 16-byte boundary and stop the crashing on ATI cards?
Thanks!
You do know that arrays in structures have to be aligned to a 16-byte boundary?
What's the "dummy" array for? Padding? If so, don't use arrays for padding.
From my experience with Nvidia, ATI, and Intel, the following is the safest method:
struct Ray
{
    float4 origin;
    float4 dir;
    float len;
    float padding1;
    float padding2;
    float padding3;
};

struct RayStack
{
    struct Ray r[STACK_DEPTH];
    int depth[STACK_DEPTH];
    float refr[STACK_DEPTH];
    int top;
    float padding1;
    float padding2;
    float padding3;
};
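A compile-time size check on the host can catch layout mismatches early. A minimal sketch in C11, assuming a matching host-side struct built from the cl_ types:

#include <assert.h>
#include <CL/cl.h>

struct Ray
{
    cl_float4 origin;
    cl_float4 dir;
    cl_float len;
    cl_float padding1;
    cl_float padding2;
    cl_float padding3;
};

/* 16 + 16 + 4*4 = 48 bytes, matching the kernel-side struct */
static_assert(sizeof(struct Ray) == 48, "Ray size must match the device layout");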
I would like to copy float2 values back to the CPU. The results are correct on the GPU side, but somehow the results are incorrect on the CPU side. Can someone please help me?
GPU code
#pragma OPENCL EXTENSION cl_amd_printf : enable
__kernel void matM(__global float* input, int width, int height, __global float2* output) {
    int X = get_global_id(0);
    float2 V;
    V.x = input[X];
    V.y = input[X];
    output[X] = V;
    printf("%f\t %f\n", output[X].x, output[X].y);
}
CPU code
output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float2) * wid * ht, NULL, NULL);
clEnqueueReadBuffer(commands, output, CL_TRUE, 0, sizeof(cl_float2) * wid * ht, results, 0, NULL, NULL);
The printf inside the GPU kernel prints correct results, but the host-side results are incorrect.
Thanks for helping
The cl_float2 datatype can be used on the host side to access float2 data, but my problem was something else: there was a mismatch in global IDs. I had two global IDs, and the line int X = get_global_id(0); should have been int X = get_global_id(0) + get_global_id(1).
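For reference, a minimal sketch of reading the buffer back as cl_float2 on the host; the allocation is assumed to match the buffer size used above:

cl_float2 *results = (cl_float2*)malloc(sizeof(cl_float2) * wid * ht);
clEnqueueReadBuffer(commands, output, CL_TRUE, 0, sizeof(cl_float2) * wid * ht, results, 0, NULL, NULL);

/* the components are accessible through the .s[] array */
printf("%f\t %f\n", results[0].s[0], results[0].s[1]);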
I need to pass a complex data type to OpenCL as a buffer, and I want (if possible) to avoid aligning the data in the buffer.
In OpenCL I need to use two structures to differentiate the data passed in the buffer, casting to one of them:
typedef struct
{
    char a;
    float2 position;
} s1;

typedef struct
{
    char a;
    float2 position;
    char b;
} s2;
I define the kernel in this way:
__kernel void Foo(
    __global const void* bufferData,
    const int amountElements // in the buffer
)
{
    // Now I cast to one of the structs depending on an extra value
    __global s1* x = (__global s1*)bufferData;
}
And it works well only when I align the data passed in the buffer.
The question is: is there a way to use __attribute__((packed)) or __attribute__((aligned(1))) to avoid the alignment of the data passed in the buffer?
If padding the smaller structure is not an option, I suggest passing another parameter to let your kernel function know what the type is - maybe just the size of the elements.
Since you have data types that are 9 and 10 bytes, it may be worth trying to pad them both out to 12 bytes, depending on how many of them you read within your kernel.
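One possible reading of that suggestion, assuming the device honors __attribute__((packed)); both structs become exactly 12 bytes, and the pad fields and _padded names are additions for illustration:

typedef struct __attribute__((packed))
{
    char a;
    float2 position; // 8 bytes, unaligned because of the packing
    char pad[3];     // pads the 9-byte struct to 12
} s1_padded;

typedef struct __attribute__((packed))
{
    char a;
    float2 position;
    char b;
    char pad[2];     // pads the 10-byte struct to 12
} s2_padded;

Note that unaligned vector loads can be slow or unsupported on some devices, so measure before committing to this layout.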
Something else you may be interested in is the extension: cl_khr_byte_addressable_store
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html
Update:
I didn't realize you were passing a mixed array; I thought it was uniform in type. If you want to track the type on a per-element basis, you should pass a list of the types (or codes). Using float2 on its own in bufferData would probably be faster as well:
__kernel void Foo(
    __global const float2* bufferData,
    __global const char* bufferTypes,
    const int amountElements // in the buffer
)
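{
    // A possible body, sketched with hypothetical type codes: each work item
    // reads its element plus its type code and branches on the code.
    int i = get_global_id(0);
    if (i >= amountElements) return;

    float2 position = bufferData[i]; // uniform float2 elements, naturally aligned
    if (bufferTypes[i] == 0) {
        // treat element i as an s1-style object (code 0 is an assumption)
    } else {
        // treat element i as an s2-style object
    }
}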