How to define functions in OpenCL?

How do I define functions in OpenCL? I tried to build one program for each function, but it didn't work.
float AddVectors(float a, float b)
{
    return a + b;
}

kernel void VectorAdd(
    global read_only float* a,
    global read_only float* b,
    global write_only float* c )
{
    int index = get_global_id(0);
    //c[index] = a[index] + b[index];
    c[index] = AddVectors(a[index], b[index]);
}

You don't need to create one program for each function. Instead, you create one program for a set of functions that are marked with __kernel (or kernel), plus any auxiliary functions (like your AddVectors function), using for example the clCreateProgramWithSource call.
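For example, here is a minimal host-side sketch (error handling omitted; the context and device handles are assumed to already exist). Note that the read_only/write_only qualifiers are meant for image types, so the buffer arguments below just use const:

// Both the helper and the kernel live in one source string; only the kernel
// is visible to the host by name.
const char* src =
    "float AddVectors(float a, float b) { return a + b; }   \n"
    "kernel void VectorAdd(global const float* a,            \n"
    "                      global const float* b,            \n"
    "                      global float* c) {                \n"
    "    int index = get_global_id(0);                       \n"
    "    c[index] = AddVectors(a[index], b[index]);          \n"
    "}                                                       \n";

cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "VectorAdd", &err);  // the kernel is looked up by name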
Check out the basic tutorials from Apple, AMD, and NVIDIA.

Related

Static variable in OpenCL C

I'm writing a renderer from scratch using OpenCL and I have a little compilation problem on my kernel, with the error:
CL_BUILD_PROGRAM : error: program scope variable must reside in constant address space static float* objects;
The problem is that this program compiles on my desktop (with Nvidia drivers) but doesn't compile on my laptop (also with Nvidia drivers). I also have the exact same kernel file in another project that works fine on both computers...
Does anyone have an idea what I could be doing wrong?
As a clarification, I'm coding a raymarcher whose kernel takes a list of objects "encoded" in a float array. That array is needed all over the program, which is why I want it accessible to the whole kernel.
Here is the kernel code, simplified:
float* objects;

float4 getDistCol(float3 position) {
    int arr_length = objects[0];
    float4 distCol = {INFINITY, 0, 0, 0};
    int index = 1;
    while (index < arr_length) {
        float objType = objects[index];
        if (compare(objType, SPHERE)) {
            // Treats the part of the buffer as a sphere
            index += SPHERE_ATR_LENGTH;
        } else if (compare(objType, PLANE)) {
            // Treats the part of the buffer as a plane
            index += PLANE_ATR_LENGTH;
        } else {
            float4 errCol = {500, 1, 0, 0};
            return errCol;
        }
    }
    return distCol;
}
__kernel void mkernel(__global int *image, __constant int *dimension,
                      __constant float *position, __constant float *aimDir,
                      __global float *objs) {
    objects = objs;
    // Gets ray direction and stuff
    // ...
    // ...
    float4 distCol = RayMarch(ro, rd);
    float3 impact = rd*distCol.x + ro;
    col = distCol.yzw * GetLight(impact);
    image[dimension[0]*dimension[1] - idx*dimension[1]+idy] = toInt(col);
}
Here getDistCol(float3 position) gets called a lot, by many different functions, and I would like to avoid having to pass my float buffer to every function that needs to call getDistCol()...
There are no "static" variables allowed in OpenCL C that you can declare outside of kernels and use across kernels. Some compilers might still tolerate this, others might not. Nvidia has recently changed their OpenCL compiler from LLVM 3.4 to NVVM 7 in a driver update, so you may have two different compilers on your desktop and laptop GPUs.
In your case, the solution is to hand the global kernel parameter pointer over to the function:
float4 getDistCol(float3 position, __global float *objects) {
    int arr_length = objects[0]; // access objects normally, as you would in the kernel
    // ...
}

kernel void mkernel(__global int *image, __constant int *dimension,
                    __constant float *position, __constant float *aimDir,
                    __global float *objs) {
    // ...
    getDistCol(position, objs); // hand the global objs pointer over to the function
    // ...
}
Lonely variables out in the wild (program-scope variables) are only allowed in the constant address space, which is useful for large lookup tables. They are cached in L2$, so read-only access is potentially faster. Example:
constant float objects[1234] = {
    1.0f, 2.0f, ...
};

Using async_work_group_copy with a custom data type

I need to copy some data from __global to __local in OpenCL using async_work_group_copy. The issue is, I'm not using a built-in data type.
The snippet of what I have tried is as follows:
typedef struct Y
{
    ...
} Y;

typedef struct X
{
    Y y[MAXSIZE];
} X;

kernel void krnl(global X* restrict x){
    global const Y* l = x[a].y;
    local Y* l2;
    size_t sol2 = sizeof(l);
    async_work_group_copy(l2, l, sol2, 0);
}
where 'a' is just a vector of int. This code does not work, specifically because the gentype is not a built-in one. The spec (1.2) says:
We use the generic type name gentype to indicate the built-in data
types ... as the type for the arguments unless otherwise stated.
so how do I "otherwise state" this data type?
OpenCL's async_work_group_copy() is designed to copy N elements of a basic data type. However, it does not really know what is being copied, so you can tell it to copy N bytes containing any type inside (including structs), similar to memcpy().
You can do:
kernel void krnl(global X* restrict x){
    global const Y* l = x[a].y;
    local Y l2;
    size_t sol2 = sizeof(Y);
    async_work_group_copy((local char *)&l2, (global char *)l, sol2, 0);
}
However, remember that you need to declare local memory explicitly inside the kernel, or dynamically from the API side and pass a pointer to local memory.
You can't just create a local pointer without any initialization and copy there (see my code).
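If you go the "dynamic" route, a sketch (the argument index and the use of MAXSIZE are assumptions) could look like this: the kernel takes a local pointer argument, and the host only passes a size.

kernel void krnl(global X* restrict x, local Y* scratch) {
    // copy MAXSIZE elements of Y, reinterpreted as bytes, into the local buffer
    event_t e = async_work_group_copy((local char*)scratch,
                                      (global const char*)x[0].y,
                                      MAXSIZE * sizeof(Y), 0);
    wait_group_events(1, &e);
    // ... use scratch ...
}

On the host side, the local buffer is sized with clSetKernelArg(kernel, 1, MAXSIZE * sizeof(Y), NULL); passing NULL as the argument value tells the runtime to allocate that many bytes of local memory for the kernel argument.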

Can this parallelism be implemented in OpenCL

This is my first post. I'll try to keep it short because I value your time. This community has been incredible to me.
I am learning OpenCL and want to extract a little bit of parallelism from the below algorithm. I will only show you the part that I am working on, which I've also simplified as much as I can.
1) Inputs: Two 1D arrays of length (n): A, B, and value of n. Also values C[0], D[0].
2) Outputs: Two 1D arrays of length (n): C, D.
C[i] = function1(C[i-1])
D[i] = function2(C[i-1],D[i-1])
So these are recursive definitions; however, the calculations of C and D for a given i can be done in parallel (the real functions are more complicated, so this makes sense). A naive thought would be to create two work items for the following kernel:
__kernel void test (__global float* A, __global float* B, __global float* C,
                    __global float* D, int n, float C0, float D0) {
    int i, j=get_global_id(0);
    if (j==0) {
        C[0] = C0;
        for (i=1;i<=n-1;i++) {
            C[i] = function1(C[i-1]);
            [WAIT FOR W.I. 1 TO FINISH CALCULATING D[i]];
        }
        return;
    }
    else {
        D[0] = D0;
        for (i=1;i<=n-1;i++) {
            D[i] = function2(C[i-1],D[i-1]);
            [WAIT FOR W.I. 0 TO FINISH CALCULATING C[i]];
        }
        return;
    }
}
Ideally each of the two work items (numbers 0 and 1) would do one initial comparison and then enter its respective loop, synchronizing on each iteration. Now, given the SIMD implementation of GPUs, I assume this will NOT work (work items would be waiting for all of the kernel code), but is it possible to assign this type of work to two CPU cores and have it work as expected? What would the barrier be in this case?
This can be implemented in OpenCL, but like the other answer says, you're going to be limited to 2 threads at best.
My version of your function should be called with a single work group having two work items.
__kernel void test (__global float* A, __global float* B, __global float* C,
                    __global float* D, int n, float C0, float D0)
{
    int i;
    int gid = get_global_id(0);
    local float prevC;
    local float prevD;

    if (gid == 0) {
        C[0] = prevC = C0;
        D[0] = prevD = D0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (i=1;i<=n-1;i++) {
        if (gid == 0) {
            C[i] = function1(prevC);
        } else if (gid == 1) {
            D[i] = function2(prevC, prevD);
        }
        // fence global memory as well, because C[i] and D[i] live in global memory
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
        prevC = C[i];
        prevD = D[i];
    }
}
This should run on any OpenCL hardware. If you don't care about saving all of the C and D values, you can simply return prevC and prevD as two floats rather than the entire lists. This would also make it much faster, because all reading and writing of the intermediate values stays at a lower cache level (i.e. local memory). The local memory boost should also apply to all OpenCL hardware.
So is there a point to running this on a GPU? Not for the parallelism; you are stuck with 2 threads. But if you don't need all values of C and D returned, you would probably see a significant speed-up because of the much faster memory of GPUs.
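A sketch of that variant (hypothetical out parameter; function1 and function2 as in the question) might look like this:

__kernel void test_last_only(__global float* out, int n, float C0, float D0)
{
    int gid = get_global_id(0);
    local float prevC, prevD;
    if (gid == 0) { prevC = C0; prevD = D0; }
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 1; i <= n - 1; i++) {
        float c = prevC, d = prevD;          // read the previous values first...
        barrier(CLK_LOCAL_MEM_FENCE);        // ...so nobody overwrites them early
        if (gid == 0)      prevC = function1(c);
        else if (gid == 1) prevD = function2(c, d);
        barrier(CLK_LOCAL_MEM_FENCE);        // both new values are now visible
    }
    if (gid == 0) { out[0] = prevC; out[1] = prevD; }
}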
All of this assumes that function1 and function2 aren't overly complex. If they are, just stick to CPUs, and probably to another multiprocessing technique such as OpenMP.
The dependency in your case is completely linear/recursive (i needs i-1), not even logarithmic like other problems (reduction, sum, sort, etc.), so this problem does not fit well on a SIMD device.
The best you can do is a two-thread approach on the CPU. Thread 1 will "produce" data (the C value) for thread 2.
A very naive approach, for example:
Thread 1:

for(){
    ProcessC(i);
    atomic_inc(counter); //This function should unlock
}

Thread 2:

for(){
    atomic_dec(counter); //This function should lock
    ProcessD(i);
}
Here atomic_inc and atomic_dec can be implemented with counting semaphores, for example.
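For illustration, here is a minimal CPU-side sketch of that producer/consumer idea using POSIX semaphores; the array size and the function1/function2 bodies are placeholders, not taken from the question:

#include <pthread.h>
#include <semaphore.h>

#define N 1024
static float C[N], D[N];
static sem_t c_ready;               // counts how many C[i] values are available

/* Placeholder updates, standing in for the real function1/function2. */
static float function1(float c)          { return c * 0.5f + 1.0f; }
static float function2(float c, float d) { return c + d * 0.5f; }

static void* produce_C(void* arg) {     // "Thread 1"
    (void)arg;
    for (int i = 1; i < N; i++) {
        C[i] = function1(C[i - 1]);
        sem_post(&c_ready);             // unlock: one more C value is ready
    }
    return NULL;
}

static void* produce_D(void* arg) {     // "Thread 2"
    (void)arg;
    for (int i = 1; i < N; i++) {
        sem_wait(&c_ready);             // lock: wait until C[i-1] is available
        D[i] = function2(C[i - 1], D[i - 1]);
    }
    return NULL;
}

int main(void) {
    C[0] = 1.0f; D[0] = 2.0f;           // the given C0 and D0
    sem_init(&c_ready, 0, 1);           // C[0] is already known, so start at 1
    pthread_t t1, t2;
    pthread_create(&t1, NULL, produce_C, NULL);
    pthread_create(&t2, NULL, produce_D, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}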

OpenCL Array Indexing Seems Broken

I've got a kernel with a simple array declaration and initialization, and an extra function "get_smooth_vertex(...)", which I have changed so as to demonstrate a problem:
//More const __constant declarations
const __constant int edge_parents[12][2] = { {0,1}, {0,2}, {1,3}, {2,3}, {0,4}, {1,5}, {2,6}, {3,7}, {4,5}, {4,6}, {5,7}, {6,7} };
//More Functions

float3 get_smooth_vertex(const int edge_index, const float* cube_potentials) {
    int i1 = edge_parents[edge_index][0];
    int i2 = edge_parents[edge_index][1];
    if (i1==i2) return (float3)(0);\n"
    return (float3)(1);\n"
}

__kernel void march(const __global float* potentials, __global float* vertices, __global float* normals,
                    const __constant float4* points, const int numof_points) {
    //Lots of stuff.
    //Call get_smooth_vertex(...) a few times
    //More stuff.
}
The if path in "get_smooth_vertex(...)" always seems to get executed! Now, I can't imagine why this would be, because each pair in "edge_parents" is different. I checked "edge_index", and it is always >= 0 and always <= 11. Furthermore, none of the variables are aliased in global or local scope. The kernel (and host code, FWIW) compiles with no warnings or errors.
So, I can't figure out what's wrong--why would the indices equal each other? Alignment, maybe? Am I just completely forgetting how C works or something? Watch—it's going to be royal user error . . .
I checked your code and the comparison works just fine (after removing the trailing \n"). You have probably made a mistake when evaluating the return value of get_smooth_vertex(), but this is hard to tell without code that shows how it is called.
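For example, one way to rule out a problem on the calling side is to dump the raw results into a debug buffer and read it back on the host. This is a hypothetical kernel, not part of the original code, and the 8 potentials per cube are an assumption:

__kernel void debug_smooth_vertex(__global float* debug_out,
                                  __global const float* potentials)
{
    // copy one cube's potentials to private memory, matching the (private)
    // pointer type that get_smooth_vertex expects
    float cube_pot[8];
    for (int i = 0; i < 8; i++) cube_pot[i] = potentials[i];

    // write one float3 per edge so the host can inspect the actual values
    for (int e = 0; e < 12; e++) {
        float3 v = get_smooth_vertex(e, cube_pot);
        vstore3(v, e, debug_out);   // debug_out must hold 12 * 3 floats
    }
}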

OpenCL - is it possible to invoke another function from within a kernel?

I am following along with a tutorial located here: http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
The kernel they have listed is this, which computes the sum of two numbers and stores it in the output variable:
__kernel void vector_add_gpu (__global const float* src_a,
                              __global const float* src_b,
                              __global float* res,
                              const int num)
{
    /* get_global_id(0) returns the ID of the thread in execution.
       As many threads are launched at the same time, executing the same kernel,
       each one will receive a different ID, and consequently perform a different computation. */
    const int idx = get_global_id(0);

    /* Now each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is YES, the work-item performs the corresponding computation. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}
1) Say for example that the operation performed was much more complex than a summation - something that warrants its own function. Let's call it ComplexOp(in1, in2, out). How would I go about implementing this function such that vector_add_gpu() can call and use it? Can you give example code?
2) Now let's take the example to the extreme, and I now want to call a generic function that operates on the two numbers. How would I set it up so that the kernel can be passed a pointer to this function and call it as necessary?
Yes, it is possible. You just have to remember that OpenCL is based on C99, with some caveats. You can create other functions either inside the same kernel file or in a separate file that you include at the beginning. Auxiliary functions do not need to be declared as inline; however, keep in mind that OpenCL will inline the functions when they are called. Function pointers are also not available, so you cannot call an auxiliary function through a pointer.
Example
float4 hit(float4 ray_p0, float4 ray_p1, float4 tri_v1, float4 tri_v2, float4 tri_v3)
{
    //logic to detect if the ray intersects a triangle
}

__kernel void detection(__global float4* trilist, float4 ray_p0, float4 ray_p1)
{
    int gid = get_global_id(0);
    float4 hitlocation = hit(ray_p0, ray_p1, trilist[3*gid], trilist[3*gid+1], trilist[3*gid+2]);
}
You can have auxiliary functions for use in the kernel; see OpenCL user defined inline functions. You cannot pass function pointers into the kernel.
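Since function pointers are off the table, one common workaround (a sketch, not the only option) is to pass an integer selector and branch inside a single dispatcher function:

// Hypothetical dispatcher: the operation is chosen by an integer the host passes in.
float apply_op(const int op, const float a, const float b)
{
    switch (op) {
        case 0:  return a + b;   // the original summation
        case 1:  return a * b;   // some other "generic" operation
        default: return 0.0f;
    }
}

__kernel void vector_op_gpu(__global const float* src_a,
                            __global const float* src_b,
                            __global float* res,
                            const int num,
                            const int op)
{
    const int idx = get_global_id(0);
    if (idx < num)
        res[idx] = apply_op(op, src_a[idx], src_b[idx]);
}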
