What's the point of __ in OpenCL kernels? - opencl

So let's say I have two OCL kernels:
__kernel void vdotprod(
__global int* x,
__global int* y,
__global int* z,
__global int* d,
const int npoints)
and
kernel void vdotprod(
global int* x,
global int* y,
global int* z,
global int* d,
const int npoints)
Assuming all other aspects of the code are the same (including the host code), does the __ affect anything? What is the purpose of the __?

The double underscore prefix does not affect the semantics of your OpenCL program.
All OpenCL-specific keywords can optionally use a double underscore prefix. Use of this prefix is entirely down to programmer preference. For example, some people prefer to use the underscores because they emphasize where OpenCL extends the standard C99 language. Others prefer to omit them for brevity.

Related

memcpy in opencl 1.1

Is there any mechanism like memcpy in OpenCL? I want to copy a struct of pointers to the GPU, and since we cannot copy them directly from host to device, I decided to create a separate buffer for each pointer and then put all of them together in one struct on the device. So I wanted a mechanism like memcpy to copy the data from one buffer to another. Is there anything like that?
struct Grid_gpu {
cl_uint3 dims;
cl_uint* elements_beg_index;
cl_uint n_element_cells;
cl_uint* element_cells;
};
So, to do that I defined a struct on my device as follows:
typedef struct {
uint3 dims;
__global uint* element_begIndices;
uint n_element_cells;
__global uint* element_cells;
} Grid;
Then I just used the following kernel to fill the memory which is of type Grid:
// a kernel to fill a grid acceleration structure.
__kernel void fill_grid_accel(
uint3 dims,
__global uint* element_begIndices,
uint n_element_cells,
__global uint* element_cells,
__global Grid* grid){
grid->dims.x = dims.x;
grid->dims.y = dims.y;
grid->dims.z = dims.z;
grid->element_begIndices = element_begIndices;
grid->n_element_cells = n_element_cells;
grid->element_cells = element_cells;
}
Now the grid memory, which contains the pointers, is filled with the required data.

OpenCL Array Indexing Seems Broken

I've got a kernel with a simple array declaration and initialization, and an extra function "get_smooth_vertex(...)", which I have changed so as to demonstrate a problem:
//More const __constant declarations
const __constant int edge_parents[12][2] = { {0,1}, {0,2}, {1,3}, {2,3}, {0,4}, {1,5}, {2,6}, {3,7}, {4,5}, {4,6}, {5,7}, {6,7} };
//More Functions
float3 get_smooth_vertex(const int edge_index, const float* cube_potentials) {
int i1 = edge_parents[edge_index][0];
int i2 = edge_parents[edge_index][1];
if (i1==i2) return (float3)(0);\n"
return (float3)(1);\n"
}
__kernel void march(const __global float* potentials, __global float* vertices, __global float* normals, const __constant float4* points, const int numof_points) {
//Lots of stuff.
//Call get_smooth_vertex(...) a few times
//More stuff.
}
The if path in "get_smooth_vertex(...)" always seems to get executed! Now, I can't imagine why this would be, because each pair in "edge_parents" is different. I checked "edge_index", and it is always >= 0 and always <= 11. Furthermore, none of the variables are aliased in global or local scope. The kernel (and host code, FWIW) compiles with no warnings or errors.
So, I can't figure out what's wrong--why would the indices equal each other? Alignment, maybe? Am I just completely forgetting how C works or something? Watch—it's going to be royal user error . . .
I checked your code and the comparison works just fine (after removing the trailing \n"). You have probably made a mistake when evaluating the return value of get_smooth_vertex(), but this is hard to tell without code that shows how it is called.
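One way to rule out the table itself is a host-side check in plain C. The sketch below copies edge_parents from the question and counts the pairs whose two entries are equal; the function name count_equal_pairs is hypothetical and only serves this illustration.

```c
#include <assert.h>

/* Same table as in the question, declared as ordinary C data. */
static const int edge_parents[12][2] = {
    {0,1}, {0,2}, {1,3}, {2,3}, {0,4}, {1,5},
    {2,6}, {3,7}, {4,5}, {4,6}, {5,7}, {6,7}
};

/* Hypothetical helper: number of edges whose two parent indices are equal. */
int count_equal_pairs(void) {
    int count = 0;
    for (int e = 0; e < 12; e++) {
        if (edge_parents[e][0] == edge_parents[e][1]) {
            count++;
        }
    }
    return count;
}
```

If this count is zero, the i1==i2 branch cannot be taken for any edge_index in 0..11, which points at the call site or at how the returned value is inspected.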

Avoiding data alignment in OpenCL

I need to pass a complex data type to OpenCL as a buffer and I want (if possible) to avoid the buffer alignment.
In OpenCL I need to use two structures to differentiate the data passed in the buffer casting to them:
typedef struct
{
char a;
float2 position;
} s1;
typedef struct
{
char a;
float2 position;
char b;
} s2;
I define the kernel in this way:
__kernel void
Foo(
__global const void* bufferData,
const int amountElements // in the buffer
)
{
// Now I cast to one of the structs depending on an extra value
__global s1* x = (__global s1*)bufferData;
}
And it works well only when I align the data passed in the buffer.
The question is: Is there a way to use __attribute__((packed)) or __attribute__((aligned(1))) to avoid the alignment of the data passed in the buffer?
If padding the smaller structure is not an option, I suggest passing another parameter to let your kernel function know what the type is - maybe just the size of the elements.
Since you have data types that are 9 and 10 bytes, it may be worth a try padding them both out to 12 bytes depending on how many of them you read within your kernel.
Something else you may be interested in is the extension: cl_khr_byte_addressable_store
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_byte_addressable_store.html
update:
I didn't realize you were passing a mixed array; I thought it was uniform in type. If you want to track the type on a per-element basis, you should pass a list of the types (or codes). Using float2 on its own in bufferData would probably be faster as well.
__kernel void
Foo(
__global const float2* bufferData,
__global const char* bufferTypes,
const int amountElements // in the buffer
)
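To see where the 9- and 10-byte figures come from, and what the naturally aligned layouts cost, here is a minimal host-side sketch in C11. It assumes a GCC or Clang compiler (for __attribute__((packed))), a 4-byte float, and emulates float2 as a two-float array forced to 8-byte alignment; the type names s1_natural, s2_natural, s1_packed and s2_packed and the helper wasted_bytes_per_s2 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Natural layout: position is forced to 8-byte alignment,
   mimicking the alignment OpenCL gives float2. */
typedef struct {
    char a;
    _Alignas(8) float position[2];
} s1_natural;

typedef struct {
    char a;
    _Alignas(8) float position[2];
    char b;
} s2_natural;

/* Packed layout: no padding between or after members (GCC/Clang extension). */
typedef struct __attribute__((packed)) {
    char a;
    float position[2];
} s1_packed;

typedef struct __attribute__((packed)) {
    char a;
    float position[2];
    char b;
} s2_packed;

/* Hypothetical helper: padding bytes spent per element of the larger struct. */
size_t wasted_bytes_per_s2(void) {
    return sizeof(s2_natural) - sizeof(s2_packed);
}
```

Under these assumptions the natural structs take 16 and 24 bytes while the packed ones take 9 and 10, so the host has to write whichever layout the kernel-side struct actually uses. Splitting the data into a float2 array plus a char array, as in the signature above, sidesteps the question entirely.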

OpenCL scalar vs vector

I have simple kernel:
__kernel void vecadd(__global const float *A,
__global const float *B,
__global float *C)
{
int idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
Why does the kernel run more than 30% slower when I change float to float4?
All the tutorials say that using vector types speeds up computation...
On the host side, the memory allocated for the float4 arguments is 16-byte aligned and the global_work_size for clEnqueueNDRangeKernel is 4 times smaller.
Kernel runs on AMD HD5770 GPU, AMD-APP-SDK-v2.6.
Device info for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.
EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
For smaller global_work_size (8196 for float / 2048 for float4), vectorized version is faster, but I would like to know, why?
I don't know which tutorials you are referring to, but they must be old.
Both ATI and NVIDIA have used scalar GPU architectures for at least half a decade now.
Nowadays, using vectors in your code is only a syntactic convenience; it brings no performance benefit over plain scalar code.
It turns out a scalar architecture is better for GPUs than a vector one: it is better at utilizing the hardware resources.
I am not sure why the vectors would be that much slower for you without knowing more about the workgroup and global sizes. I would expect at least the same performance.
If it is suitable for your kernel, can you start with C having the values in A? This would cut down memory access by 33%. Maybe this applies to your situation?
__kernel void vecadd(__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
C[idx] += B[idx];
}
Also, have you tried reading the values into a private vector and then adding? Or maybe both strategies.
__kernel void vecadd(__global const float4 *A,
__global const float4 *B,
__global float4 *C)
{
int idx = get_global_id(0);
float4 tmp = A[idx] + B[idx];
C[idx] = tmp;
}

OpenCL modulo of large numbers

I'm trying to calculate a mod b in OpenCL, where a is an array of ulong elements, and is twice the length of b.
__kernel void mod(__global ulong *a, __global ulong *b, ulong length) {
// length = len(a) = 2 * len(b)
...
}
What I want is something like a %= b, but with arrays. The arrays represent numbers of course, with their last element representing the least significant bits.
Is it possible to do this in-place (i.e. without allocating extra memory)? What is a good algorithm for calculating the modulus for large numbers?
Note that neither of the two numbers can be easily represented in another way (e.g. using exponents). Most of the time they will be pseudoprimes. Also, having some concurrency would be nice.
Pointers to any useful material on this are welcome.
EDIT: if that helps, length can be known at compile time.
EDIT: I'm sorry I wasn't clear here. I'm not working on an array of integers, I'm working on two big integers, for example a is 8Mb (a 67108864-bit number) and b is 4Mb (a 33554432-bit number). I work them in base 2^64, hence the arrays of ulong integers. Basically, those are just the digits of the number.
You just do:
__kernel void mod(__global ulong *a, __global ulong *b, ulong length) {
ulong id = get_global_id(0) ;
a[id] = a[id] % b[id];
}
I don't really understand your problem. Do the array sizes differ? Or maybe you want a more specialized calculation?
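Note that the element-wise kernel above treats each array entry as an independent number, which is not the same as reducing one multi-limb integer modulo another. As a point of comparison, here is a minimal serial sketch in host-side C of the schoolbook shift-and-subtract method for big integers stored in base 2^64 with the most significant limb first, as the question describes. The function names (bigmod, shl1, ge, sub) are hypothetical, the routine is not in-place (it writes to a separate remainder buffer), and it is far too slow for multi-megabit inputs; it is only meant as a reference to validate a faster implementation against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift the n-limb number x left by one bit, inserting bit_in at the bottom.
   Returns the bit shifted out of the top. Limbs are most significant first. */
static uint64_t shl1(uint64_t *x, size_t n, uint64_t bit_in) {
    uint64_t carry = bit_in;
    for (size_t i = n; i-- > 0;) {
        uint64_t out = x[i] >> 63;
        x[i] = (x[i] << 1) | carry;
        carry = out;
    }
    return carry;
}

/* Returns 1 if (carry * 2^(64n) + x) >= b, else 0. */
static int ge(const uint64_t *x, uint64_t carry, const uint64_t *b, size_t n) {
    if (carry) return 1;
    for (size_t i = 0; i < n; i++) {
        if (x[i] != b[i]) return x[i] > b[i];
    }
    return 1; /* equal */
}

/* x -= b over n limbs, modulo 2^(64n). */
static void sub(uint64_t *x, const uint64_t *b, size_t n) {
    uint64_t borrow = 0;
    for (size_t i = n; i-- > 0;) {
        uint64_t xi = x[i], bi = b[i];
        x[i] = xi - bi - borrow;
        borrow = (xi < bi) || (xi == bi && borrow);
    }
}

/* r = a mod b. a has na limbs; b and r have nb limbs; b must be non-zero. */
void bigmod(const uint64_t *a, size_t na,
            const uint64_t *b, size_t nb, uint64_t *r) {
    memset(r, 0, nb * sizeof *r);
    for (size_t i = 0; i < na; i++) {
        for (int bit = 63; bit >= 0; bit--) {
            /* r = 2*r + next bit of a; r < b held before, so r < 2*b now. */
            uint64_t carry = shl1(r, nb, (a[i] >> bit) & 1);
            if (ge(r, carry, b, nb)) sub(r, b, nb);
        }
    }
}
```

For example, calling bigmod with a = {1, 0} (the value 2^64) and b = {10} leaves 6 in r. Each step only needs a shift, a comparison and a subtraction, but the bit loop is inherently sequential, so any concurrency would have to come from within the limb-wide operations or from a different algorithm.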
