OpenCL template matching with larger template is slower than OpenCV CPU version - opencl

I am new to OpenCL and am working on optimising template matching with it. In experiments with smaller templates, my OpenCL implementation was faster than OpenCV's CPU version. In this particular case, however, the template is very large (2048x2048) and the original image is 3072x3072, and the OpenCV CPU implementation (137 seconds) is far ahead of my OpenCL one (2000 seconds). Kindly suggest some way to optimise my code, shown below.
void __kernel corrln(global const unsigned char* ref_image,
                     global const unsigned char* template,
                     global float* corrln )
{
    const uint Width = get_global_size(0);
    const int2 pos = {get_global_id(0), get_global_id(1)};
    float sum = 0;
    for(int y = pos.y; y < 2048; y++ )
    {
        for(int x = pos.x; x < 2048; x++ )
        {
            const int2 xy = { x, y };
            const int2 txy = { x - pos.x, y - pos.y };
            sum += ref_image[index(xy, Width)] * template[index(txy, 2048)];
        }
    }
    corrln[index(pos, Width)] = sum;
}

Assuming your ref_image is of a reasonable size smaller than 2048 (say, 1024x1024) and the ND range equals the ref_image size, every WI (work-item) does a different amount of work.
The WI with pos.x == 0 & pos.y == 0 does 2048 * 2048 = 4M calculations in its two loops, while the WI with pos.x == 1023 & pos.y == 1023 does about 1024 * 1024 = 1M calculations. That's too much work for a single WI.
Try to dice up the task so that every WI does a reasonable, fixed amount of work. Say, for the 1st column of ref_image, launch the kernel several times, each launch processing 16 columns to the right and accumulating into the corrln array, then move on to the 2nd column, and so on.
The kernel might look like this (just for illustration, not a drop-in replacement):
void __kernel corrln(
global const unsigned char* ref_image,
global const unsigned char* template,
global float* corrln )
{
const uint Width = get_global_size(0);
const int2 pos = {get_global_id(0), get_global_id(1)};
uchar16 ref = vload16(index(xy, Width), ref_image);
uchar16 tpl = vload16(index(xy, Width), template);
float sum = corrln[index(pos, Width)] + dot(ref, tpl);
corrln[index(pos, Width)]= sum;
}
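For checking any GPU variant against a known-good result, a plain C reference can help. The sketch below is one reading of the suggestion above, not the answerer's code: correlate_at sums over the whole template for one output position, and correlate_at_strips computes the same value in strips of template columns, the way several small launches would each add to corrln. The function names and the strip parameter are made up for illustration, and ref is assumed to be at least tpl_w + px wide and tpl_h + py high.

```c
#include <stddef.h>

/* Correlation score for one output position (px, py). Every position
 * sums over the full template, so each one costs the same amount of work. */
float correlate_at(const unsigned char *ref, int ref_w,
                   const unsigned char *tpl, int tpl_w, int tpl_h,
                   int px, int py)
{
    float sum = 0.0f;
    for (int ty = 0; ty < tpl_h; ty++) {
        for (int tx = 0; tx < tpl_w; tx++) {
            sum += (float)ref[(size_t)(py + ty) * ref_w + (px + tx)]
                 * (float)tpl[(size_t)ty * tpl_w + tx];
        }
    }
    return sum;
}

/* Same score, accumulated strip by strip. Each pass over the outer loop
 * stands in for one small kernel launch that adds its partial sum. */
float correlate_at_strips(const unsigned char *ref, int ref_w,
                          const unsigned char *tpl, int tpl_w, int tpl_h,
                          int px, int py, int strip)
{
    float acc = 0.0f;
    for (int x0 = 0; x0 < tpl_w; x0 += strip) {
        int x1 = (x0 + strip < tpl_w) ? x0 + strip : tpl_w;
        for (int ty = 0; ty < tpl_h; ty++) {
            for (int tx = x0; tx < x1; tx++) {
                acc += (float)ref[(size_t)(py + ty) * ref_w + (px + tx)]
                     * (float)tpl[(size_t)ty * tpl_w + tx];
            }
        }
    }
    return acc;
}
```

With a 3072x3072 image and a 2048x2048 template, the valid output positions would be 0..1024 in each direction, and each of them reads the full template.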

Related

Hough transform and OpenCL

I'm trying to implement the Hough transform for circles in OpenCL, but I've encountered a really weird problem. Every time I run the Hough kernel, I end up with a slightly different accumulator, even though the parameters are the same and the accumulator is always a freshly zeroed table (e.g. http://imgur.com/a/VcIw1). My kernel code is as below:
#define BLOCK_LEN 256
__kernel void HoughCirclesKernel(
__global int* A,
__global int* imgData,
__global int* _width,
__global int* _height,
__global int* r
)
{
__local int imgBuff[BLOCK_LEN];
int localThreadIndex = get_local_id(0); //threadIdx.x
int globalThreadIndex = get_local_id(0) + get_group_id(0) * BLOCK_LEN; //threadIdx.x + blockIdx.x * Block_Len
int width = *_width; int height = *_height;
int radius = *r;
A[globalThreadIndex] = 0;
barrier(CLK_GLOBAL_MEM_FENCE);
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
barrier(CLK_LOCAL_MEM_FENCE);
if(imgBuff[localThreadIndex] > 0)
{
float s1, c1;
for(int i = 0; i<180; i++)
{
s1 = sincos(i, &c1);
int centerX = globalThreadIndex % width + radius * c1;
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
if(centerX < width && centerY < height)
atomic_inc(A + centerX + centerY * width);
}
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
Could this be the fault of how I am incrementing the accumulator?
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
barrier(CLK_LOCAL_MEM_FENCE);
...
}
This is undefined behaviour, since there is a barrier inside a branch.
Every work-item in a work-group must reach the same barrier.
Try this:
if(globalThreadIndex < width*height)
{
imgBuff[localThreadIndex] = imgData[globalThreadIndex];
...
}
barrier(CLK_LOCAL_MEM_FENCE);
Also, there could be another issue if you are using multiple devices:
get_local_id(0) + get_group_id(0)
Here get_group_id(0) is the group id per device, and it starts from 0 on every device, just as get_global_id does; so you should pass proper offsets in the "ndrange" call when using multiple devices. Even when different devices meet the same floating-point accuracy requirements, one of them may be more accurate than the other and give slightly different results. If it is a single device, try lowering the GPU frequencies, as the card may have defects or be suffering side effects of an overclock.
I have managed to solve my problem by finding and correcting three issues.
First of all the kernel code, the line:
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
should be:
int centerY = (globalThreadIndex / width) + radius * s1;
The main change here was dividing by width, not height. This caused inaccuracy problems.
if(centerX < width && centerY < height)
The above condition was changed to:
if(x < width && x >= 0)
if(y < height && y >=0)
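To make the two corrections concrete, here is a small host-side C sketch of the per-pixel vote. It is only an illustration under my own naming: vote_index is a hypothetical helper, and c and s stand for the cosine and sine that the kernel gets from sincos.

```c
/* Returns the accumulator index that edge pixel `idx` votes for,
 * or -1 when the candidate centre falls outside the image.
 * c and s are the cosine and sine of the current angle. */
int vote_index(int idx, int width, int height, int radius, float c, float s)
{
    /* Column and row of the pixel: both come from dividing by width. */
    int centerX = (int)((float)(idx % width) + (float)radius * c);
    int centerY = (int)((float)(idx / width) + (float)radius * s);

    /* Reject negative coordinates as well as too-large ones. */
    if (centerX < 0 || centerX >= width)  return -1;
    if (centerY < 0 || centerY >= height) return -1;
    return centerX + centerY * width;
}
```

Without the lower-bound checks, a negative centerX or centerY can still produce an index inside (or before) the accumulator, so votes land in the wrong cell.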
As for the accumulator problem, first I will post the code I used to create clBuffer (I am using OpenCL.net library for C#):
int[] a = new int[width*height]; //image size
ErrorCode error;
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite, (IntPtr)(a.Length * sizeof(int)), out error);
CheckErr(error, "Cl.CreateBuffer");
The fix here was simple and pretty much self-explanatory:
int[] a = Enumerable.Repeat(0, width * height).ToArray();
ErrorCode error;
GCHandle accHandle = GCHandle.Alloc(a, GCHandleType.Pinned);
IntPtr accPtr = accHandle.AddrOfPinnedObject();
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite | MemFlags.CopyHostPtr, (IntPtr)(a.Length * sizeof(int)), accPtr, out error);
CheckErr(error, "Cl.CreateBuffer");
I filled the accumulator table with zeros and then copied it to the device buffer each time I executed the kernel.
The errors above caused the accumulator to look different and a bit malformed each time I executed the kernel.

OpenCL: Optimize matrix multiplication for uchar

I adapted the attached kernel from one of the NVIDIA OpenCL examples and compared performance to clblasSgemm, and found that they perform equally fast (at least on my setup). I am launching it with a {16, 16} local work size.
Now, assume matrices A and B are both uchar, and C accordingly uint. Is there any way to optimize the multiplication? Simply replacing the types degraded performance. I tried hand-vectorizing with uchar4 and uchar16, but that made it slower.
Any suggestions welcome! (I am new to GPU programming and OpenCL)
/*
* This software contains source code provided by NVIDIA Corporation.
*/
#define BLOCK_SIZE 16
__kernel void mat_mul(const __global float* A, const __global float* B,
__global float* C,
const int A_cols, const int B_cols) {
// Block index
const int bx = get_group_id(0);
const int by = get_group_id(1);
// Thread index
const int tx = get_local_id(0);
const int ty = get_local_id(1);
// Index of the first sub-matrix of A processed by the block
const int a0 = A_cols * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
const int a1 = a0 + A_cols - 1;
const int a_step = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
const int b0 = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
const int b_step = BLOCK_SIZE * B_cols;
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;
__local float As[BLOCK_SIZE][BLOCK_SIZE];
__local float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Loop over all the sub-matrices of A and B required to compute the
// block sub-matrix
for (int a=a0, b=b0; a<=a1; a+=a_step, b+=b_step) {
// Load the matrices from device memory to shared memory;
// each thread loads one element of each matrix
As[ty][tx] = A[a + A_cols * ty + tx];
Bs[ty][tx] = B[b + B_cols * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
// Multiply the two matrices together;
// each thread computes one element of the block sub-matrix
#pragma unroll
for (int k=0; k<BLOCK_SIZE; ++k) {
Csub += As[ty][k] * Bs[k][tx];
}
// Synchronize to make sure that the preceding computation is done
// before loading two new sub-matrices of A and B in the next
// iteration
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write the block sub-matrix to device memory;
// each thread writes one element
C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
There is a very simple way to measure whether your kernel is good. Calculate its OPS and bandwidth (how much matrix data you are processing per second), then compare them to the theoretical limits of your device. That tells you which factor is limiting performance. Usually, it's load-store operations.
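A rough sketch of that calculation in C, assuming an M x K by K x N product timed on the host. The helper names are made up, and the byte count is only the minimum traffic (each input read once, the output written once), so the real figure will be higher whenever tiles are re-read.

```c
#include <stddef.h>

/* Arithmetic throughput in Gop/s: one multiply and one add
 * per step of the inner product. */
double matmul_gops(long m, long n, long k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}

/* Lower bound on memory traffic in GB/s: A and B read once, C written once.
 * in_elem_bytes would be 1 for uchar inputs, out_elem_bytes 4 for uint output. */
double matmul_min_gbytes_per_s(long m, long n, long k,
                               size_t in_elem_bytes, size_t out_elem_bytes,
                               double seconds)
{
    double bytes = ((double)m * (double)k + (double)k * (double)n) * (double)in_elem_bytes
                 + (double)m * (double)n * (double)out_elem_bytes;
    return bytes / seconds / 1e9;
}
```

If the first number is far below the device's peak while the second is nowhere near its memory bandwidth either, the kernel is probably limited by local-memory loads and stores or by latency rather than by arithmetic.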

substitutions for cl_khr_int64_base_atomics

I have an ATI FirePro V4800 graphics card, which does not support cl_khr_int64_base_atomics. I am trying to adapt the RadixSort algorithm for long integers. The algorithm uses atomic_inc, whose 64-bit version is atom_inc, which I cannot use in the kernel. So my question is: is there a piece of code that performs the same function as atomic_inc and can be used instead? The relevant kernel code is given below:
__kernel void histogram(__global uint* unsortedData,
__global uint* buckets,
uint shiftCount,
__local uint* sharedArray)
{
size_t localId = get_local_id(0);
size_t globalId = get_global_id(0);
size_t groupId = get_group_id(0);
size_t groupSize = get_local_size(0);
uint numGroups = get_global_size(0) / get_local_size(0);
// Initialize shared array to zero //
sharedArray[localId] = 0;
barrier(CLK_LOCAL_MEM_FENCE);
// Calculate thread-histograms //
uint value = unsortedData[globalId];
value = value >> shiftCount & 0xFFU;
atomic_inc(sharedArray+value);
barrier(CLK_LOCAL_MEM_FENCE);
// Copy calculated histogram bin to global memory //
uint bucketPos = groupId * groupSize + localId ;
//uint bucketPos = localId * numGroups + groupId ;
buckets[bucketPos] = sharedArray[localId];
}
Any suggestions? Thank you.
Edit:
Another way to do the same is given in this blog post: http://suhorukov.blogspot.in/2011/12/opencl-11-atomic-operations-on-floating.html. It gives a very generic implementation of atomic increment.
You could try something like this:
void atomInc64 (__local uint *counter)
{
uint old, carry;
old = atomic_inc (&counter [0]);
carry = old == 0xFFFFFFFF;
atomic_add (&counter [1], carry);
}
Where counter is an array of two 32-bit integers. While the two halves don't increment at exactly the same time, the total should be correct when the program completes.
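The carry logic can be tried out on the host with C11 atomics. This is only a sketch that mirrors the snippet above: atom_inc64 and read64 are illustrative names, and read64 is only meaningful once no other thread is still incrementing.

```c
#include <stdatomic.h>
#include <stdint.h>

/* counter[0] is the low word, counter[1] the high word. */
void atom_inc64(_Atomic uint32_t counter[2])
{
    /* atomic_fetch_add returns the value before the increment. */
    uint32_t old = atomic_fetch_add(&counter[0], 1u);
    /* The low word wrapped around exactly when it was 0xFFFFFFFF. */
    uint32_t carry = (old == 0xFFFFFFFFu) ? 1u : 0u;
    atomic_fetch_add(&counter[1], carry);
}

/* Combine both halves into one 64-bit value (call after all increments). */
uint64_t read64(_Atomic uint32_t counter[2])
{
    uint64_t hi = atomic_load(&counter[1]);
    uint64_t lo = atomic_load(&counter[0]);
    return (hi << 32) | lo;
}
```

Starting from a low word of 0xFFFFFFFF and a high word of 0, one call leaves the pair at 0 and 1, which reads back as 0x100000000.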

Higher radix (or better) formulation for Stockham FFT

Background
I've implemented this algorithm from Microsoft Research for a radix-2 FFT (Stockham auto sort) using OpenCL.
I use floating point textures (256 cols X N rows) for input and output in the kernel, because I will need to sample at non-integral points and I thought it better to delegate that to the texture sampling hardware. Note that my FFTs are always of 256-point sequences (every row in my texture). At this point, my N is 16384 or 32768 depending on the GPU I'm using and the max 2D texture size allowed.
I also need to perform the FFT of 4 real-valued sequences at once, so the kernel performs the FFT(a, b, c, d) as FFT(a + ib, c + id) from which I can extract the 4 complex sequences out later using an O(n) algorithm. I can elaborate on this if someone wishes - but I don't believe it falls in the scope of this question.
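For readers who want the separation step spelled out: the poster's own routine is not shown, so the sketch below uses the standard conjugate-symmetry identity instead, which may or may not match it. Given z = FFT(a + ib) for real a and b, A[k] = (z[k] + conj(z[n-k])) / 2 and B[k] = (z[k] - conj(z[n-k])) / (2i). The cpx type and the function name are made up for illustration.

```c
typedef struct { float re, im; } cpx;

/* z = DFT(a + i*b) for real sequences a and b of length n.
 * Writes A = DFT(a) into a_out and B = DFT(b) into b_out in O(n). */
void split_packed_dft(const cpx *z, int n, cpx *a_out, cpx *b_out)
{
    for (int k = 0; k < n; k++) {
        cpx zk = z[k];
        cpx zm = z[(n - k) % n];   /* mirrored bin; bin 0 mirrors to itself */

        /* (zk + conj(zm)) / 2 */
        a_out[k].re = 0.5f * (zk.re + zm.re);
        a_out[k].im = 0.5f * (zk.im - zm.im);

        /* (zk - conj(zm)) / (2i) */
        b_out[k].re = 0.5f * (zk.im + zm.im);
        b_out[k].im = 0.5f * (zm.re - zk.re);
    }
}
```

In the float4 layout described above, the same split would be applied once to the .xy pair (a, b) and once to the .zw pair (c, d).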
Kernel Source
const sampler_t fftSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
__kernel void FFT_Stockham(read_only image2d_t input, write_only image2d_t output, int fftSize, int size)
{
int x = get_global_id(0);
int y = get_global_id(1);
int b = floor(x / convert_float(fftSize)) * (fftSize / 2);
int offset = x % (fftSize / 2);
int x0 = b + offset;
int x1 = x0 + (size / 2);
float4 val0 = read_imagef(input, fftSampler, (int2)(x0, y));
float4 val1 = read_imagef(input, fftSampler, (int2)(x1, y));
float angle = -6.283185f * (convert_float(x) / convert_float(fftSize));
// TODO: Convert the two calculations below into lookups from a __constant buffer
float tA = native_cos(angle);
float tB = native_sin(angle);
float4 coeffs1 = (float4)(tA, tB, tA, tB);
float4 coeffs2 = (float4)(-tB, tA, -tB, tA);
float4 result = val0 + coeffs1 * val1.xxzz + coeffs2 * val1.yyww;
write_imagef(output, (int2)(x, y), result);
}
The host code simply invokes this kernel log2(256) times, ping-ponging the input and output textures.
Note: I tried removing the native_cos and native_sin to see if that impacted timing, but it doesn't seem to change things by very much. Not the factor I'm looking for, in any case.
Access pattern
Knowing that I am probably memory-bandwidth bound, here is the memory access pattern (per-row) for my radix-2 FFT.
X0 - element 1 to combine (read)
X1 - element 2 to combine (read)
X - element to write to (write)
Question
So my question is - can someone help me with/point me toward a higher-radix formulation for this algorithm? I ask because most FFTs are optimized for large cases and single real/complex valued sequences. Their kernel generators are also very case dependent and break down quickly when I try to muck with their internals.
Are there other options better than simply going to a radix-8 or 16 kernel?
Some of my constraints are - I have to use OpenCL (no cuFFT). I also cannot use clAmdFft from ACML for this purpose. It would be nice to also talk about CPU optimizations (this kernel SUCKS big time on the CPU) - but getting it to run in fewer iterations on the GPU is my main use-case.
Thanks in advance for reading through all this and trying to help!
I tried several versions, but the one with the best performance on CPU and GPU was a radix-16 kernel for my specific case.
Here is the kernel for reference. It was taken from Eric Bainville's (most excellent) website and used with full attribution.
// #define M_PI 3.14159265358979f
//Global size is x.Length/2, Scale = 1 for direct, 1/N to inverse (iFFT)
__kernel void ConjugateAndScale(__global float4* x, const float Scale)
{
int i = get_global_id(0);
float temp = Scale;
float4 t = (float4)(temp, -temp, temp, -temp);
x[i] *= t;
}
// Return a*EXP(-I*PI*1/2) = a*(-I)
float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
// Return a^2
float2 sqr_1(float2 a)
{ return (float2)(a.x*a.x-a.y*a.y,2.0f*a.x*a.y); }
// Return the 2x DFT2 of the four complex numbers in A
// If A=(a,b,c,d) then return (a',b',c',d') where (a',c')=DFT2(a,c)
// and (b',d')=DFT2(b,d).
float8 dft2_4(float8 a) { return (float8)(a.lo+a.hi,a.lo-a.hi); }
// Return the DFT of 4 complex numbers in A
float8 dft4_4(float8 a)
{
// 2x DFT2
float8 x = dft2_4(a);
// Shuffle, twiddle, and 2x DFT2
return dft2_4((float8)(x.lo.lo,x.hi.lo,x.lo.hi,mul_p1q2(x.hi.hi)));
}
// Complex product, multiply vectors of complex numbers
#define MUL_RE(a,b) (a.even*b.even - a.odd*b.odd)
#define MUL_IM(a,b) (a.even*b.odd + a.odd*b.even)
float2 mul_1(float2 a, float2 b)
{ float2 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_1_F4(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_2(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
// Return the DFT2 of the two complex numbers in vector A
float4 dft2_2(float4 a) { return (float4)(a.lo+a.hi,a.lo-a.hi); }
// Return cos(alpha)+I*sin(alpha) (3 variants)
float2 exp_alpha_1(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
//cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float2)(cs,sn);
}
// Return cos(alpha)+I*sin(alpha) (3 variants)
float4 exp_alpha_1_F4(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
// cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float4)(cs,sn,cs,sn);
}
// mul_p*q*(a) returns a*EXP(-I*PI*P/Q)
#define mul_p0q1(a) (a)
#define mul_p0q2 mul_p0q1
//float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
__constant float SQRT_1_2 = 0.707106781186548; // cos(Pi/4)
#define mul_p0q4 mul_p0q2
float2 mul_p1q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(a.x+a.y,-a.x+a.y); }
#define mul_p2q4 mul_p1q2
float2 mul_p3q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(-a.x+a.y,-a.x-a.y); }
__constant float COS_8 = 0.923879532511287; // cos(Pi/8)
__constant float SIN_8 = 0.382683432365089; // sin(Pi/8)
#define mul_p0q8 mul_p0q4
float2 mul_p1q8(float2 a) { return mul_1((float2)(COS_8,-SIN_8),a); }
#define mul_p2q8 mul_p1q4
float2 mul_p3q8(float2 a) { return mul_1((float2)(SIN_8,-COS_8),a); }
#define mul_p4q8 mul_p2q4
float2 mul_p5q8(float2 a) { return mul_1((float2)(-SIN_8,-COS_8),a); }
#define mul_p6q8 mul_p3q4
float2 mul_p7q8(float2 a) { return mul_1((float2)(-COS_8,-SIN_8),a); }
// Compute in-place DFT2 and twiddle
#define DFT2_TWIDDLE(a,b,t) { float2 tmp = t(a-b); a += b; b = tmp; }
// T = N/16 = number of threads.
// P is the length of input sub-sequences, 1,16,256,...,N/16.
__kernel void FFT_Radix16(__global const float4 * x, __global float4 * y, int pp)
{
int p = pp;
int t = get_global_size(0); // number of threads
int i = get_global_id(0); // current thread
////// y[i] = 2*x[i];
////// return;
int k = i & (p-1); // index in input sequence, in 0..P-1
// Inputs indices are I+{0,..,15}*T
x += i;
// Output indices are J+{0,..,15}*P, where
// J is I with four 0 bits inserted at bit log2(P)
y += ((i-k)<<4) + k;
// Load
float4 u[16];
for (int m=0;m<16;m++) u[m] = x[m*t];
// Twiddle, twiddling factors are exp(_I*PI*{0,..,15}*K/4P)
float alpha = -M_PI*(float)k/(float)(8*p);
for (int m=1;m<16;m++) u[m] = mul_1_F4(exp_alpha_1_F4(m * alpha), u[m]);
// 8x in-place DFT2 and twiddle (1)
DFT2_TWIDDLE(u[0].lo,u[8].lo,mul_p0q8);
DFT2_TWIDDLE(u[0].hi,u[8].hi,mul_p0q8);
DFT2_TWIDDLE(u[1].lo,u[9].lo,mul_p1q8);
DFT2_TWIDDLE(u[1].hi,u[9].hi,mul_p1q8);
DFT2_TWIDDLE(u[2].lo,u[10].lo,mul_p2q8);
DFT2_TWIDDLE(u[2].hi,u[10].hi,mul_p2q8);
DFT2_TWIDDLE(u[3].lo,u[11].lo,mul_p3q8);
DFT2_TWIDDLE(u[3].hi,u[11].hi,mul_p3q8);
DFT2_TWIDDLE(u[4].lo,u[12].lo,mul_p4q8);
DFT2_TWIDDLE(u[4].hi,u[12].hi,mul_p4q8);
DFT2_TWIDDLE(u[5].lo,u[13].lo,mul_p5q8);
DFT2_TWIDDLE(u[5].hi,u[13].hi,mul_p5q8);
DFT2_TWIDDLE(u[6].lo,u[14].lo,mul_p6q8);
DFT2_TWIDDLE(u[6].hi,u[14].hi,mul_p6q8);
DFT2_TWIDDLE(u[7].lo,u[15].lo,mul_p7q8);
DFT2_TWIDDLE(u[7].hi,u[15].hi,mul_p7q8);
// 8x in-place DFT2 and twiddle (2)
DFT2_TWIDDLE(u[0].lo,u[4].lo,mul_p0q4);
DFT2_TWIDDLE(u[0].hi,u[4].hi,mul_p0q4);
DFT2_TWIDDLE(u[1].lo,u[5].lo,mul_p1q4);
DFT2_TWIDDLE(u[1].hi,u[5].hi,mul_p1q4);
DFT2_TWIDDLE(u[2].lo,u[6].lo,mul_p2q4);
DFT2_TWIDDLE(u[2].hi,u[6].hi,mul_p2q4);
DFT2_TWIDDLE(u[3].lo,u[7].lo,mul_p3q4);
DFT2_TWIDDLE(u[3].hi,u[7].hi,mul_p3q4);
DFT2_TWIDDLE(u[8].lo,u[12].lo,mul_p0q4);
DFT2_TWIDDLE(u[8].hi,u[12].hi,mul_p0q4);
DFT2_TWIDDLE(u[9].lo,u[13].lo,mul_p1q4);
DFT2_TWIDDLE(u[9].hi,u[13].hi,mul_p1q4);
DFT2_TWIDDLE(u[10].lo,u[14].lo,mul_p2q4);
DFT2_TWIDDLE(u[10].hi,u[14].hi,mul_p2q4);
DFT2_TWIDDLE(u[11].lo,u[15].lo,mul_p3q4);
DFT2_TWIDDLE(u[11].hi,u[15].hi,mul_p3q4);
// 8x in-place DFT2 and twiddle (3)
DFT2_TWIDDLE(u[0].lo,u[2].lo,mul_p0q2);
DFT2_TWIDDLE(u[0].hi,u[2].hi,mul_p0q2);
DFT2_TWIDDLE(u[1].lo,u[3].lo,mul_p1q2);
DFT2_TWIDDLE(u[1].hi,u[3].hi,mul_p1q2);
DFT2_TWIDDLE(u[4].lo,u[6].lo,mul_p0q2);
DFT2_TWIDDLE(u[4].hi,u[6].hi,mul_p0q2);
DFT2_TWIDDLE(u[5].lo,u[7].lo,mul_p1q2);
DFT2_TWIDDLE(u[5].hi,u[7].hi,mul_p1q2);
DFT2_TWIDDLE(u[8].lo,u[10].lo,mul_p0q2);
DFT2_TWIDDLE(u[8].hi,u[10].hi,mul_p0q2);
DFT2_TWIDDLE(u[9].lo,u[11].lo,mul_p1q2);
DFT2_TWIDDLE(u[9].hi,u[11].hi,mul_p1q2);
DFT2_TWIDDLE(u[12].lo,u[14].lo,mul_p0q2);
DFT2_TWIDDLE(u[12].hi,u[14].hi,mul_p0q2);
DFT2_TWIDDLE(u[13].lo,u[15].lo,mul_p1q2);
DFT2_TWIDDLE(u[13].hi,u[15].hi,mul_p1q2);
// 8x DFT2 and store (reverse binary permutation)
y[0] = u[0] + u[1];
y[p] = u[8] + u[9];
y[2*p] = u[4] + u[5];
y[3*p] = u[12] + u[13];
y[4*p] = u[2] + u[3];
y[5*p] = u[10] + u[11];
y[6*p] = u[6] + u[7];
y[7*p] = u[14] + u[15];
y[8*p] = u[0] - u[1];
y[9*p] = u[8] - u[9];
y[10*p] = u[4] - u[5];
y[11*p] = u[12] - u[13];
y[12*p] = u[2] - u[3];
y[13*p] = u[10] - u[11];
y[14*p] = u[6] - u[7];
y[15*p] = u[14] - u[15];
}
Note that I have modified the kernel to perform the FFT of 2 complex-valued sequences at once instead of one. Also, since I only need the FFT of 256 elements at a time in a much larger sequence, I perform only 2 runs of this kernel, which leaves me with 256-length DFTs in the larger array.
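The index arithmetic of the kernel is easy to check on the host. The following C sketch copies the two lines that compute k and the output base, and counts how many radix-16 passes a transform length needs; both function names are my own, and n is assumed to be a power of 16.

```c
/* Base output index for thread i in a pass with sub-sequence length p
 * (p is 1, 16, 256, ...). This is i with four zero bits inserted at
 * bit log2(p), exactly as in the kernel: y += ((i-k)<<4) + k. */
int radix16_output_base(int i, int p)
{
    int k = i & (p - 1);          /* index inside the input sub-sequence */
    return ((i - k) << 4) + k;
}

/* Number of radix-16 passes needed for an n-point transform. */
int radix16_passes(int n)
{
    int passes = 0;
    for (int p = 1; p < n; p *= 16)
        passes++;
    return passes;
}
```

For 256-point transforms this gives two passes (p = 1, then p = 16), which matches the two kernel runs mentioned above.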
Here's some of the relevant host code as well.
var ev = new[] { new Cl.Event() };
var pEv = new[] { new Cl.Event() };
int fftSize = 1;
int iter = 0;
int n = distributionSize >> 5;
while (fftSize <= n)
{
Cl.SetKernelArg(fftKernel, 0, memA);
Cl.SetKernelArg(fftKernel, 1, memB);
Cl.SetKernelArg(fftKernel, 2, fftSize);
Cl.EnqueueNDRangeKernel(commandQueue, fftKernel, 1, null, globalWorkgroupSize, localWorkgroupSize,
(uint)(iter == 0 ? 0 : 1),
iter == 0 ? null : pEv,
out ev[0]).Check();
if (iter > 0)
pEv[0].Dispose();
Swap(ref ev, ref pEv);
Swap(ref memA, ref memB); // ping-pong
fftSize = fftSize << 4;
iter++;
Cl.Finish(commandQueue);
}
Swap(ref memA, ref memB);
Hope this helps someone!

OpenCL / try to understand Kernel Code

I am studying OpenCL code which simulates the N-body problem, from the following tutorial:
http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody-rev3.html
My main issue concerns this part of the kernel code:
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
        float4 p2 = pblock[j]; /* Read a cached particle position */
        float4 d = p2 - p;
        float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
        float f = p2.w*invr*invr*invr;
        a += f*d; /* Accumulate acceleration */
    }
    barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
}
I don't understand exactly what happens during execution: the kernel code is executed n times, where n is the number of work-items (which is also the number of threads), but in the part of the code above we use local memory for each work-group (there seem to be nb work-groups).
So, during execution, up to the first "barrier", do I fill the pblock array locally with the global values of pos_old?
Still up to the first barrier, will the pblock array of another work-group contain the same values as the arrays of the other work-groups, since jb=0 before the barrier?
It seems that this is a way for all the work-groups to share these arrays, but this is not totally clear to me.
Any help is welcome.
Can you post the entire kernel code please? I have to make assumptions about the params and private variables.
It looks like there are nt work-items in the group, and ti identifies the current work-item. When the loop executes, each item in the group copies only a single element. Usually this copy is from a global data source. The first barrier forces the work-item to wait until the other items have made their copies. This is necessary because every work-item in the group needs to read the data copied by every other work-item. The values should not be the same, because ti should be different for each work-item. (jb*nt would still equal zero for the first loop iteration, though.)
Here is the entire kernel code :
__kernel
void
nbody_sim(
__global float4* pos ,
__global float4* vel,
int numBodies,
float deltaTime,
float epsSqr,
__local float4* localPos,
__global float4* newPosition,
__global float4* newVelocity)
{
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// Number of tiles we need to iterate
unsigned int numTiles = numBodies / localSize;
// position of this work-item
float4 myPos = pos[gid];
float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
for(int i = 0; i < numTiles; ++i)
{
// load one tile into local memory
int idx = i * localSize + tid;
localPos[tid] = pos[idx];
// Synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);
// calculate acceleration effect due to each body
// a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
for(int j = 0; j < localSize; ++j)
{
// Calculate acceleartion caused by particle j on particle i
float4 r = localPos[j] - myPos;
float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
float invDist = 1.0f / sqrt(distSqr + epsSqr);
float invDistCube = invDist * invDist * invDist;
float s = localPos[j].w * invDistCube;
// accumulate effect of all particles
acc += s * r;
}
// Synchronize so that next tile can be loaded
barrier(CLK_LOCAL_MEM_FENCE);
}
float4 oldVel = vel[gid];
// updated position and velocity
float4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
newPos.w = myPos.w;
float4 newVel = oldVel + acc * deltaTime;
// write to global memory
newPosition[gid] = newPos;
newVelocity[gid] = newVel;
}
There are "numTiles" work-groups with "localSize" work-items in each work-group.
"gid" is the global index and "tid" is the local index.
Let's start at the first iteration of the loop "for(int i = 0; i < numTiles; ++i)", with "i=0":
Take for example:
numTiles = 4, localSize = 25 and numBodies = 100 = number of work-items.
Then, during execution, if I have gid = 80, then tid = 5, idx = 5 and the first assignment will be: localPos[5] = pos[5]
Now, if I take gid = 5, then tid = 5 and idx = 5, and I get the same assignment: localPos[5] = pos[5]
So, from what I understand, in the first iteration and after the first "barrier", each work-item sees the same local array "localPos", i.e. the sub-array of the first global block, which is "pos[0:24]".
Is this a good explanation of what happens?
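The index arithmetic in this example can be checked with a few lines of host-side C. This is a sketch under two assumptions that the kernel itself does not state: a one-dimensional NDRange with no global offset, so that tid = gid % localSize and the group number is gid / localSize. The function names are made up.

```c
/* Global index read by a work-item while loading tile `tile`;
 * mirrors `idx = i * localSize + tid` from the kernel. */
unsigned tile_load_index(unsigned gid, unsigned local_size, unsigned tile)
{
    unsigned tid = gid % local_size;
    return tile * local_size + tid;
}

/* Work-group that a global id belongs to. */
unsigned group_of(unsigned gid, unsigned local_size)
{
    return gid / local_size;
}
```

With localSize = 25, gid = 80 and gid = 5 both load pos[5] for tile 0, but they sit in groups 3 and 0. Since __local memory is allocated per work-group, each group fills its own copy of localPos with the same pos[0:24] values rather than sharing one array across groups.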

Resources