How can I read an array of structures (OpenCL kernel)

The requirement:
Let's say we have 1) five groups of colors, each group containing three colors (the colors are generated dynamically on the CPU), and 2) a list of 1000 cars, each car represented in the list by its color (picked from one of the groups).
We want to pass three arguments to an OpenCL kernel: 1) the group of generated colors, 2) a 1D array of car colors, and 3) a 1D integer array that records the result of testing each car color against the color group (a simple calculation).
The structures:
struct GeneratedColorGroup
{
float4 Color1; //16 =2^4
float4 Color2; //16 =2^4
float4 Color3; //16 =2^4
float4 Color4; //16 =2^4
}
struct ColorGroup
{
GeneratedColorGroup Colors[8]; //512 = 2^9
}
The kernel code:
__kernel void findCarColorRelation(
const __global ColorGroup *InColorGroups,
const __global float4* InCarColor,
const __global int* CarGroupIndicator
const int carsNumber)
{
int globalID = get_global_id( 0 );
if(globalID < carsNumber)
{
ColorGroup colorGroups;
float4 carColor;
colorGroups = InColorGroups[globalID];
carColor = InCarColor[globalID];
for(int groupIndex =0; groupIndex < 8; groupIndex++)
{
if(colorGroups[groupIndex].Color1 == carColor)
{
CarGroupIndicator[globalID] = groupIndex + 1 ;
break;
}
if(colorGroups[groupIndex].Color2 == carColor)
{
CarGroupIndicator[globalID] = groupIndex * 2 + 2;
break;
}
if(colorGroups[groupIndex].Color3 == carColor)
{
CarGroupIndicator[globalID] = groupIndex * 3 + 3;
break;
}
}
}
}
Now, we have 1000 items, which means the kernel is going to be executed 1000 times. That's OK.
The problem:
As you can see, the kernel takes a global ColorGroup as input; this global memory holds five items of type "GeneratedColorGroup".
I tried to access these items as shown in the code above, but I got unexpected results, and the execution is very slow.
What is wrong with my code?
Any help is highly appreciated.

When passing structs from a host to a device, make sure you declare the struct type with __attribute__ ((packed)) in both host and device code. Otherwise the host and device compilers may produce different memory layouts for the struct, i.e. they can use different amounts of padding.
Using packed structs may degrade performance: packed structs have no padding at all, so data within a struct may not be properly aligned, and unaligned access is usually slow. In that case, you have to either insert padding manually with char[] members, or use __attribute__ ((aligned (N))) on a struct field (or on the struct itself).
See the OpenCL C specification for details on the packed and aligned attributes:
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/attributes-types.html
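One way to catch a layout mismatch early is to assert the expected sizes and offsets on the host at compile time. The following is a minimal C11 sketch, not the asker's actual host code: HostFloat4 is a hypothetical stand-in for OpenCL's float4 (real host code would normally use cl_float4 from CL/cl.h), and the expected sizes are the ones noted in the comments of the question's structs.

```c
#include <stddef.h>
#include <stdalign.h>

/* Hypothetical host-side stand-in for the device type float4:
   four floats, aligned to 16 bytes. */
typedef struct { alignas(16) float s[4]; } HostFloat4;

typedef struct {
    HostFloat4 Color1;
    HostFloat4 Color2;
    HostFloat4 Color3;
    HostFloat4 Color4;
} GeneratedColorGroup;

typedef struct {
    GeneratedColorGroup Colors[8];
} ColorGroup;

/* Compilation fails if the host layout differs from the layout
   the kernel is assumed to see. */
_Static_assert(sizeof(HostFloat4) == 16, "float4 must be 16 bytes");
_Static_assert(offsetof(GeneratedColorGroup, Color2) == 16, "unexpected padding");
_Static_assert(sizeof(GeneratedColorGroup) == 64, "GeneratedColorGroup must be 64 bytes");
_Static_assert(sizeof(ColorGroup) == 512, "ColorGroup must be 512 bytes");
```

Because every member here is already a multiple of 16 bytes, no padding is needed and the packed attribute would not change the layout; the checks matter more once fields of mixed sizes are added.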

I'm wildly guessing the problem is
... CarGroupIndicator[globalID] = groupIndex + 1 ;
... CarGroupIndicator[globalID] = groupIndex * 2 + 2;
... CarGroupIndicator[globalID] = groupIndex * 3 + 3;
... which makes it impossible to tell from the result CarGroupIndicator[globalID] what was matched exactly. E.g. match on group 5 color 1 results in value 6, but so does group 2 color 2 and also group 1 color 3 result in value 6. What you want is something like this:
... CarGroupIndicator[globalID] = groupIndex;
... CarGroupIndicator[globalID] = groupIndex + 8;
... CarGroupIndicator[globalID] = groupIndex + 16;
... then 0-7 are color1, 8-15 color2, 16-23 color3.
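The encoding suggested above can be sketched in plain C together with its inverse. The function names and the slot numbering (0 for Color1, 1 for Color2, 2 for Color3) are illustrative assumptions, not part of the original kernel; with 8 groups the codes fall in 0-7, 8-15 and 16-23.

```c
enum { GROUP_COUNT = 8 };  /* number of groups tested by the kernel loop */

/* Pack a match into one integer: slot is 0 for Color1, 1 for Color2,
   2 for Color3 (hypothetical numbering). */
int encode_match(int groupIndex, int slot)
{
    return groupIndex + GROUP_COUNT * slot;
}

/* Recover the group index from an encoded match. */
int decoded_group(int code)
{
    return code % GROUP_COUNT;
}

/* Recover the color slot from an encoded match. */
int decoded_slot(int code)
{
    return code / GROUP_COUNT;
}
```

Since every (group, slot) pair maps to a distinct code, the host can tell exactly which color matched, which the original groupIndex * k + k scheme does not allow.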

Related

How to find the minimum value in an array using OpenCL

I am learning OpenCL for the first time, and I am currently modifying a shortest-path algorithm. I know that OpenCL usually solves problems through parallel computing, so I wonder whether I can apply the same parallel idea to finding the minimum value in an array and its position.
This is my previous attempt. I assumed that, as long as the smallest value wins, the result would be correct whether or not the operation is locked. Unfortunately, when I inspect the variables with printf, I can't get the correct results, even though the valid nodes have been evaluated.
__kernel void findWay(__global int* A, __global int* B, __global int* minNode, __global int* minDis, __global int* isFinish)
{
//A: weightMatrix , B: usedNode
//dijkstra algorithm , src node is 0
size_t dst = get_global_id(1);
size_t src = get_global_id(0);
size_t vCount = get_global_size(0);
int index = dst * vCount + src;
while(isFinish[0] != vCount){
if((src == minNode[0])&&(B[dst] == 0)&&(A[index] != INT_MAX)){
A[dst*vCount] = min(A[dst*vCount + 0],A[minNode[0]*vCount + 0] + A[index]);
}
minDis[0] = INT_MAX;
barrier(CLK_GLOBAL_MEM_FENCE);
//here is the bug
if((src == 0) &&(B[dst] == 0)){
if(minDis[0] > A[index]){
minDis[0] = A[index];
minNode[0] = dst;
}
}
//=========
barrier(CLK_GLOBAL_MEM_FENCE);
B[minNode[0]] = 1;
if(index == 0){
isFinish[0]++;
}
}
}
In the end, I can only use a normal way to achieve this operation.
if((src == 0) &&(dst == 0)){
for(int i = 0 ; i < vCount ;i++){
if(B[i] == 0 && minDis[0] > A[i *vCount]){
minDis[0] = A[i*vCount];
minNode[0] = i;
}
}
I would like to ask about this search process, can the looping step be omitted?
Horizontal operations on a parallelized array are difficult. The general approach is a series of binary-tree-like kernel passes. Start with the original array and make each GPU thread load 2 neighboring elements, choose the smaller one, and write it back to the same array at the position of the first of the two. The next kernel pass loads two elements from the list of every second element, compares them, and again writes the smaller one to the first position of the pair. Repeat until there is only one element left.
I will illustrate it below. Values that are no longer touched by the kernel are marked with *.
original array: 5|2|1|6|9|3|4|8
after 1st kernel pass: 2 *|1 *|3 *|4 *
after 2nd kernel pass: 1 * * *|3 * * *
after 3rd kernel pass: 1 * * * * * * *
smallest element is 1.
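The passes above can be simulated on the CPU to check the indexing before writing the kernel. This is a sketch in plain C, not OpenCL: each iteration of the inner loop stands in for one work-item, and the function names are made up for illustration. It also carries the original index along, since the question asks for the position as well as the value.

```c
#include <stddef.h>

/* One simulated kernel pass: the work-item for position i compares
   val[i] with val[i + stride] and keeps the smaller one at position i. */
static void reduce_pass(int *val, size_t *idx, size_t n, size_t stride)
{
    for (size_t i = 0; i + stride < n; i += 2 * stride) {
        if (val[i + stride] < val[i]) {
            val[i] = val[i + stride];
            idx[i] = idx[i + stride];
        }
    }
}

/* Runs passes with stride 1, 2, 4, ... until one candidate is left.
   Overwrites val (val[0] ends up holding the minimum) and returns the
   original index of the minimum. Assumes n >= 1. */
size_t argmin_by_passes(int *val, size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        idx[i] = i;
    for (size_t stride = 1; stride < n; stride *= 2)
        reduce_pass(val, idx, n, stride);
    return idx[0];
}
```

On a GPU each pass would be a separate kernel launch (or a barrier-separated step inside one work-group), because a pass must see the complete results of the previous one.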

Hough transform and OpenCL

I'm trying to implement the Hough transform for circles in OpenCL, but I've encountered a really weird problem. Every time I run the Hough kernel, I end up with a slightly different accumulator, even though the parameters are the same and the accumulator is always a freshly zeroed table (e.g. http://imgur.com/a/VcIw1). My kernel code is below:
#define BLOCK_LEN 256
__kernel void HoughCirclesKernel(
__global int* A,
__global int* imgData,
__global int* _width,
__global int* _height,
__global int* r
)
{
__local int imgBuff[BLOCK_LEN];
int localThreadIndex = get_local_id(0); //threadIdx.x
int globalThreadIndex = get_local_id(0) + get_group_id(0) * BLOCK_LEN; //threadIdx.x + blockIdx.x * Block_Len
int width = *_width; int height = *_height;
int radius = *r;
A[globalThreadIndex] = 0;
barrier(CLK_GLOBAL_MEM_FENCE);
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
barrier(CLK_LOCAL_MEM_FENCE);
if(imgBuff[localThreadIndex] > 0)
{
float s1, c1;
for(int i = 0; i<180; i++)
{
s1 = sincos(i, &c1);
int centerX = globalThreadIndex % width + radius * c1;
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
if(centerX < width && centerY < height)
atomic_inc(A + centerX + centerY * width);
}
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
Could this be the fault of how I am incrementing the accumulator?
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
barrier(CLK_LOCAL_MEM_FENCE);
...
}
This is undefined behaviour because the barrier is inside a branch.
Every work-item in a work-group must reach the same barrier; if some work-items skip it, the result is undefined.
Try this:
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
...
}
barrier(CLK_LOCAL_MEM_FENCE);
Also, there could be another issue if you are using multiple devices:
get_local_id(0) + get_group_id(0)
Here get_group_id(0) is the group id per device: it starts from 0 on every device, just as get_global_id does, so when using multiple devices you should pass proper offsets in the "ndrange" call. Also, even when different devices meet the same floating-point accuracy requirements, one of them may be more accurate than the other and give slightly different results. If it is a single device, try lowering the GPU frequencies, since the card may be defective or suffering side effects of an overclock.
I have managed to solve my problem by finding and correcting three issues.
First of all the kernel code, the line:
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
should be:
int centerY = (globalThreadIndex / width) + radius * s1;
The main change here was dividing by width, not height. This caused inaccuracy problems.
if(centerX < width && centerY < height)
The above condition was changed to:
if(x < width && x >= 0)
if(y < height && y >=0)
As for the accumulator problem, first I will post the code I used to create clBuffer (I am using OpenCL.net library for C#):
int[] a = new int[width*height]; //image size
ErrorCode error;
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite, (IntPtr)(a.Length * sizeof(int)), out error);
CheckErr(error, "Cl.CreateBuffer");
The fix here was simple and pretty much self-explanatory:
int[] a = Enumerable.Repeat(0, width * height).ToArray();
ErrorCode error;
GCHandle accHandle = GCHandle.Alloc(a, GCHandleType.Pinned);
IntPtr accPtr = accHandle.AddrOfPinnedObject();
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite | MemFlags.CopyHostPtr, (IntPtr)(a.Length * sizeof(int)), accPtr, out error);
CheckErr(error, "Cl.CreateBuffer");
I filled the accumulator table with zeros and then copied it to the device buffer each time I executed the kernel.
Together, the errors above made the accumulator look different and a bit malformed each time I executed the kernel.
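The corrected index computation and bounds check can be tried out on the CPU with a small C sketch. The function name and the convention of returning -1 for an out-of-range vote are assumptions made for illustration; c1 and s1 are the cosine and sine values the kernel obtains from sincos.

```c
/* Returns the accumulator index that receives a vote from the pixel at
   pixelIndex, or -1 when the candidate centre lies outside the image. */
int vote_index(int pixelIndex, int width, int height,
               int radius, float c1, float s1)
{
    /* Column is pixelIndex % width, row is pixelIndex / width. */
    int centerX = (int)(pixelIndex % width + radius * c1);
    int centerY = (int)(pixelIndex / width + radius * s1);

    /* Both lower and upper bounds must hold, otherwise the atomic
       increment would write outside the accumulator. */
    if (centerX < 0 || centerX >= width || centerY < 0 || centerY >= height)
        return -1;
    return centerX + centerY * width;
}
```

A kernel built on this would call atomic_inc only when the returned index is non-negative.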

pointer to arrays of struct

struct a{
double array[2][3];
};
struct b{
double array[3][4];
};
void main(){
a x = {{1,2,3,4,5,6}};
b y = {{1,2,3,4,5,6,7,8,9,10,11,12}};
}
I have two structs, each containing a two-dimensional array of a different size. I want to define a single function that can handle both x and y (one at a time), i.e. a function that accepts either x.array or y.array as its argument. How should I declare the parameter? I think I should use a pointer, but **x.array does not seem to work.
For example, I want to write a function PrintArray which can print the input array.
void PrintArray( ){}
What should I put inside the parentheses? double ** does not seem to work for me. (The dimensions can be passed to PrintArray as arguments as well, telling it that it is a 2*3 array.)
Write a function that takes three parameters: a pointer, the number of rows, and the number of columns. When you call the function, reduce the array to a pointer.
#include <stdio.h>
void PrintArray(const double *a, int rows, int cols) {
int r, c;
for (r = 0; r < rows; ++r) {
for (c = 0; c < cols; ++c) {
printf("%3.1f ", a[r * cols + c]);
}
printf("\n");
}
}
int main(){
struct a x = {{{1,2,3},{4,5,6}}};
struct b y = {{{1,2,3,4},{5,6,7,8},{9,10,11,12}}};
PrintArray(&x.array[0][0], 2, 3);
PrintArray(&y.array[0][0], 3, 4);
return 0;
}

Different values between local and global memory after copy

I'm working in a GPU Kernel and I have some problems copying data from global to local memory
here is my kernel function:
__kernel void nQueens( __global int * data, __global int * result, int board_size)
so I want to copy from __global int * data to __local int aux_data[OBJ_SIZE]
I tried to copy like a normal array:
for(int i = 0; i < OBJ_SIZE; ++i)
{
aux_data[stack_size*OBJ_SIZE + i] = data[index*OBJ_SIZE + i];
}
and also with the functions to copy:
event_t e = async_work_group_copy ( aux_data, (data + (index*OBJ_SIZE)), OBJ_SIZE, 0);
wait_group_events (1, e);
And in both situations I get different values between the global and local memory.
I don't know what I'm doing wrong...
One problem with your first attempt is that it writes to parts of the array that don't exist: aux_data holds only OBJ_SIZE elements, so aux_data[stack_size*OBJ_SIZE + i] overflows whenever stack_size >= 1.
The problem with the second attempt is probably that wait_group_events expects a pointer to the event (an array of events), not the event itself.
It is also important to understand what index refers to. My solutions assume that it is the group ID and not the thread ID. If it is in fact the thread ID, then we have other problems.
Possible Solution 1:
int gid = get_group_id(0);
int lid = get_local_id(0);
int l_s = get_local_size(0);
for(int i = lid; i < OBJ_SIZE; i += l_s)
{
aux_data[i] = data[gid*OBJ_SIZE + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
Possible Solution 2:
int gid = get_group_id(0);
event_t e = async_work_group_copy (aux_data, data + (gid*OBJ_SIZE), OBJ_SIZE, 0);
wait_group_events (1, &e);
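The strided loop of Solution 1 can be checked on the CPU with a short C sketch. It is only a simulation: the outer loop stands in for the work-items of one work-group, and the function and parameter names are hypothetical. The stride has to be the work-group size (what get_local_size(0) returns in a kernel), so that the work-items together cover every element exactly once.

```c
/* Simulates one work-group copying its object from data into aux_data.
   Work-item lid copies elements lid, lid + localSize, lid + 2*localSize, ... */
void copy_tile(int *aux_data, const int *data,
               int gid, int objSize, int localSize)
{
    for (int lid = 0; lid < localSize; ++lid) {       /* one iteration per work-item */
        for (int i = lid; i < objSize; i += localSize) {
            aux_data[i] = data[gid * objSize + i];
        }
    }
}
```

In the real kernel the outer loop disappears, since the work-items run concurrently, and the barrier after the copy guarantees that aux_data is complete before anyone reads it.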

OpenCL / try to understand Kernel Code

I am studying OpenCL code which simulates the N-body problem, taken from the following tutorial:
http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody-rev3.html
My main issue concerns this part of the kernel code:
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
float4 p2 = pblock[j]; /* Read a cached particle position */
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
}
barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
}
I don't understand what exactly happens during execution: the kernel code is executed n times, where n is the number of work-items (which is also the number of threads), but the part of the code above uses local memory for each work-group (there seem to be nb work-groups).
So, during execution, up to the first "barrier", do I fill the local pblock array with the global values of pos_old?
Still up to the first barrier, will the pblock array of another work-group contain the same values as the arrays of the other work-groups, since jb=0 before the barrier?
It seems that this is a way to share these arrays among all the work-groups, but this is not totally clear to me.
Any help is welcome.
Can you post the entire kernel code please? I have to make assumptions about the params and private variables.
It looks like there are nt work-items in the group, and ti identifies the current work-item. When the loop executes, each item in the group copies only a single element. Usually this copy is from a global data source. The first barrier forces the work-item to wait until the other items have made their copies. This is necessary because every work-item in the group needs to read the data copied by every other work-item. The values should not be the same, because ti should be different for each work-item. (jb*nt would still equal zero for the first iteration, though.)
Here is the entire kernel code :
__kernel
void
nbody_sim(
__global float4* pos ,
__global float4* vel,
int numBodies,
float deltaTime,
float epsSqr,
__local float4* localPos,
__global float4* newPosition,
__global float4* newVelocity)
{
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// Number of tiles we need to iterate
unsigned int numTiles = numBodies / localSize;
// position of this work-item
float4 myPos = pos[gid];
float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
for(int i = 0; i < numTiles; ++i)
{
// load one tile into local memory
int idx = i * localSize + tid;
localPos[tid] = pos[idx];
// Synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);
// calculate acceleration effect due to each body
// a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
for(int j = 0; j < localSize; ++j)
{
// Calculate acceleration caused by particle j on particle i
float4 r = localPos[j] - myPos;
float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
float invDist = 1.0f / sqrt(distSqr + epsSqr);
float invDistCube = invDist * invDist * invDist;
float s = localPos[j].w * invDistCube;
// accumulate effect of all particles
acc += s * r;
}
// Synchronize so that next tile can be loaded
barrier(CLK_LOCAL_MEM_FENCE);
}
float4 oldVel = vel[gid];
// updated position and velocity
float4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
newPos.w = myPos.w;
float4 newVel = oldVel + acc * deltaTime;
// write to global memory
newPosition[gid] = newPos;
newVelocity[gid] = newVel;
}
There are "numTiles" work-groups with "localSize" work-items for each work-group.
"gid" is the global index and "tid" is the local index.
Let's start at the first iteration of the loop "for(int i = 0; i < numTiles; ++i)" with "i=0":
If I take for example :
numTiles = 4, localSize = 25 and numBodies = 100 = number of work-items.
Then, during execution, if I have gid = 80, then tid = 5, idx = 5 and the first assignment will be: localPos[5] = pos[5]
Now, if I take gid = 5, then tid = 5 and idx = 5, and I get the same assignment: localPos[5] = pos[5]
So, from what I understand, in the first iteration and after the first "barrier", every work-item sees the same local array "localPos", i.e. the sub-array of the first global block, which is "pos[0:24]".
Is this a good explanation of what happens ?
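The arithmetic in this example can be checked with a small C sketch. It assumes a one-dimensional NDRange with no global offset, so that tid equals gid % localSize; the function name is made up for illustration.

```c
/* Index into pos that the work-item with global id gid loads into
   localPos[tid] during tile iteration i. Assumes a 1D range without a
   global offset, so that tid == gid % localSize. */
unsigned int tile_source_index(unsigned int gid, unsigned int localSize,
                               unsigned int i)
{
    unsigned int tid = gid % localSize;
    return i * localSize + tid;
}
```

The result depends on tid and i but not on the work-group that gid belongs to. Each work-group still has its own separate localPos array in local memory; the copies simply hold the same tile of pos during a given iteration.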

Resources