Calculate the middle of two unit-length 3D vectors without a square root?

Given two unit-length 3D vectors, is there a way to calculate a unit-length vector in between them without re-normalizing (more specifically, without a square root)?
Currently I'm just adding them both and normalizing, but for efficiency I thought there might be some better way.
(For the purpose of this question, ignore the case when both vectors are directly opposite.)

This is not an answer to the original question; rather, I am trying to resolve the discrepancy between two answers, and it wouldn't fit into a comment.
The trigonometric approach is 4x slower than your original version with the square-root function on my machine (Linux, Intel Core i5). Your mileage will vary.
The asm (""); is always a bad smell, as are its siblings volatile and (void) x.
Running a tight loop many, many times is a very unreliable way of benchmarking.
What to do instead?
Analyze the generated assembly code to see what the compiler actually did to your source code.
Use a profiler. I can recommend perf or Intel VTune.
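For example, with gcc and perf (the exact flags depend on your toolchain):

gcc -O3 -S -masm=intel -o normal.s normal.c    # dump the generated assembly
perf stat ./a.out                              # overall cycle/instruction counts
perf record ./a.out && perf report             # where the time is actually spent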
If you look at the assembly code of your micro-benchmark, you will see that the compiler is very smart: it figured out that v1 and v2 never change and eliminated as much work as it could at compile time. At runtime, no calls were made to sqrtf, acosf, or cosf. That explains why you did not see any difference between the two approaches.
Here is an edited version of your benchmark. I scrambled it a bit and guarded against division by zero with 1.0e-6f. (It doesn't change the conclusions.)
#include <stdio.h>
#include <math.h>

#ifdef USE_NORMALIZE
#warning "Using normalize"
void mid_v3_v3v3_slerp(float res[3], const float v1[3], const float v2[3])
{
    float m;
    float v[3] = { (v1[0] + v2[0]), (v1[1] + v2[1]), (v1[2] + v2[2]) };
    m = 1.0f / sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2] + 1.0e-6f);
    v[0] *= m;
    v[1] *= m;
    v[2] *= m;
    res[0] = v[0];
    res[1] = v[1];
    res[2] = v[2];
}
#else
#warning "Not using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
    const float dot_product = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    const float theta = acosf(dot_product);
    const float n = 1.0f / (2.0f * cosf(theta * 0.5f) + 1.0e-6f);
    v[0] = (v1[0] + v2[0]) * n;
    v[1] = (v1[1] + v2[1]) * n;
    v[2] = (v1[2] + v2[2]) * n;
}
#endif

int main(void)
{
    unsigned long long int i = 20000000;
    float v1[3] = {-0.8659117221832275, 0.4995948076248169, 0.024538060650229454};
    float v2[3] = {0.7000154256820679, 0.7031427621841431, -0.12477479875087738};
    float v[3] = { 0.0, 0.0, 0.0 };
    while (--i) {
        mid_v3_v3v3_slerp( v, v1, v2);
        mid_v3_v3v3_slerp(v1, v, v2);
        mid_v3_v3v3_slerp(v1, v2, v );
    }
    printf("done %f %f %f\n", v[0], v[1], v[2]);
    return 0;
}
I compiled it with gcc -ggdb3 -O3 -Wall -Wextra -fwhole-program -DUSE_NORMALIZE -march=native -static normal.c -lm and profiled the code with perf.
The trigonometric approach is 4x slower, and it is because of the expensive cosf and acosf functions.
I have tested the Intel C++ Compiler as well: icc -Ofast -Wall -Wextra -ip -xHost normal.c; the conclusion is the same, although gcc generates approximately 10% slower code (for -Ofast as well).
I wouldn't even try to implement an approximate sqrtf: It is already an intrinsic and chances are, your approximation will only be slower...
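To illustrate that point (my example, not part of the original post): on x86-64, gcc -O3 -fno-math-errno compiles the sqrtf call below to a single hardware sqrtss instruction, with no library call left to beat:

#include <math.h>

float v3_length(const float v[3])
{
    /* compiles to one sqrtss instruction on x86-64
       with gcc -O3 -fno-math-errno (no libm call) */
    return sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}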
Having said all this, I don't know the answer to the original question. I thought about it, and I also suspect that there might be another way that doesn't involve the square-root function.
Interesting question in theory; in practice, I doubt that getting rid of that square-root would make any difference in speed in your application.

First, find the angle between your two vectors. From the principle of scalar projection, we know that
|a| * cos(theta) = a . b_hat
where . is the dot product operator, |a| is the length of a, theta is the angle between a and b, and b_hat is the normalized form of b.
In your situation, a and b are already unit vectors, so this simplifies to:
cos(theta) = a . b
which we can rearrange to:
theta = acos(a . b)
Lay vectors A and B end to end, and complete the triangle by drawing a line from the start of the first vector to the end of the second. Since two sides are of equal length, we know the triangle is isosceles, so it's easy to determine all of the angles if you already know theta.
That closing line is the middle vector; call its length N. We can normalize it by dividing it by N.
From the law of sines, we know that
sin(theta/2)/1 = sin(180-theta)/N
Which we can rearrange to get
N = sin(180-theta) / sin(theta/2)
Note that you'll divide by zero when calculating N if A and B are equal, so it may be useful to check for that corner case before starting.
Summary:
dot_product = a.x * b.x + a.y * b.y + a.z * b.z
theta = acos(dot_product)
N = sin(180-theta) / sin(theta/2)
middle_vector = [(a.x + b.x) / N, (a.y + b.y) / N, (a.z + b.z) / N]
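As a sanity check, here is a direct C translation of that summary (my sketch, untested). The C library works in radians, so 180 degrees becomes pi; and since sin(pi - theta) = sin(theta) = 2*sin(theta/2)*cos(theta/2), N simplifies to 2*cos(theta/2), which is exactly the 2.0f * cosf(theta * 0.5f) term that appears in the benchmark code elsewhere on this page:

#include <math.h>

/* Unit-length vector halfway between unit vectors a and b.
 * N == 0 when a == -b, the case the question excludes. */
void mid_v3_v3v3(float mid[3], const float a[3], const float b[3])
{
    const float dot_product = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    const float theta = acosf(dot_product);
    const float N = 2.0f * cosf(0.5f * theta); /* == sinf(M_PI - theta) / sinf(0.5f * theta) */
    mid[0] = (a[0] + b[0]) / N;
    mid[1] = (a[1] + b[1]) / N;
    mid[2] = (a[2] + b[2]) / N;
}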

Based on the answers, I made some speed comparisons.
Edit: with this naive benchmark, GCC optimizes out the trigonometry, yielding approximately the same speeds for both methods. Read @Ali's post for a more complete explanation.
In summary, with a correct benchmark, re-normalizing is approximately 4x faster.
#include <stdio.h>
#include <math.h>

/* gcc mid_v3_v3v3_slerp.c -lm -O3 -o mid_v3_v3v3_slerp_a
 * gcc mid_v3_v3v3_slerp.c -lm -O3 -o mid_v3_v3v3_slerp_b -DUSE_NORMALIZE
 *
 * time ./mid_v3_v3v3_slerp_a
 * time ./mid_v3_v3v3_slerp_b
 */

#ifdef USE_NORMALIZE
#warning "Using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
    float m;
    v[0] = (v1[0] + v2[0]);
    v[1] = (v1[1] + v2[1]);
    v[2] = (v1[2] + v2[2]);
    m = 1.0f / sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] *= m;
    v[1] *= m;
    v[2] *= m;
}
#else
#warning "Not using normalize"
void mid_v3_v3v3_slerp(float v[3], const float v1[3], const float v2[3])
{
    const float dot_product = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
    const float theta = acosf(dot_product);
    const float n = 1.0f / (2.0f * cosf(theta * 0.5f));
    v[0] = (v1[0] + v2[0]) * n;
    v[1] = (v1[1] + v2[1]) * n;
    v[2] = (v1[2] + v2[2]) * n;
}
#endif

int main(void)
{
    unsigned long long int i = 10000000000;
    const float v1[3] = {-0.8659117221832275, 0.4995948076248169, 0.024538060650229454};
    const float v2[3] = {0.7000154256820679, 0.7031427621841431, -0.12477479875087738};
    float v[3];
    while (--i) {
        asm (""); /* prevent compiler from optimizing the loop away */
        mid_v3_v3v3_slerp(v, v1, v2);
    }
    printf("done %f %f %f\n", v[0], v[1], v[2]);
    return 0;
}

Related

Explicit FDM with CUDA [closed]

I am working to implement CUDA for the following code. The first version is written serially, and the second version uses CUDA. I am sure about the results of the serial version. I expect the second version, to which I have added CUDA functionality, to give me the same result, but it seems that the kernel function does not do anything and gives me the initial values of u and v. I know that due to my lack of experience the bug may be obvious, but I cannot figure it out. Also, please do not recommend using a flattened array, because the indexing is harder for me to understand in that form.
First version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
#include <chrono>
#include <omp.h>
using namespace std;

const int M = 1024;
const int N = 1024;
const double A = 1;
const double B = 3;
const double Du = 5 * pow(10, -5);
const double Dv = 5 * pow(10, -6);
const int Max_Itr = 1000;
const double h = 1.0 / static_cast<double>(M - 1);
const double delta_t = 0.0025;
const double s1 = (Du * delta_t) / pow(h, 2);
const double s2 = (Dv * delta_t) / pow(h, 2);

int main() {
    double** u = new double* [M];
    double** v = new double* [M];
    for (int i = 0; i < M; i++) {
        u[i] = new double[N];
        v[i] = new double[N];
    }
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < N; i++) {
            u[i][j] = 0.02;
            v[i][j] = 0.02;
        }
    }
    for (int k = 1; k < Max_Itr; k++) {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 1; j < M - 1; j++) {
                u[i][j] = ((1 - (4 * s1)) * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
                          (A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
                v[i][j] = ((1 - (4 * s2)) * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
                          - (delta_t * pow(u[i][j], 2) * v[i][j]);
            }
        }
    }
    cout << "u: " << u[512][512] << " v: " << v[512][512] << endl;
    return 0;
}
Second version:
#include <fstream>
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;

#define M 1024
#define N 1024

__global__ void my_kernel(double** v, double** u){
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    double A = 1;
    double B = 3;
    int Max_Itr = 1000;
    double delta_t = 0.0025;
    double Du = 5 * powf(10, -5);
    double Dv = 5 * powf(10, -6);
    double h = 1.0 / (M - 1);
    double s1 = (Du * delta_t) / powf(h, 2);
    double s2 = (Dv * delta_t) / powf(h, 2);
    for (int k = 1; k < Max_Itr; k++) {
        u[i][j] = ((1 - (4 * s1))
                   * u[i][j]) + (s1 * (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1])) +
                  (A * delta_t) + (delta_t * pow(u[i][j], 2) * v[i][j]) - (delta_t * (B + 1) * u[i][j]);
        v[i][j] = ((1 - (4 * s2))
                   * v[i][j]) + (s2 * (v[i + 1][j] + v[i - 1][j] + v[i][j + 1] + v[i][j - 1])) + (B * delta_t * u[i][j])
                  - (delta_t * pow(u[i][j], 2) * v[i][j]);
        __syncthreads();
    }
}

int main() {
    double** u = new double* [M];
    double** v = new double* [M];
    for (int i = 0; i < M; i++){
        u[i] = new double[N];
        v[i] = new double[N];
    }
    dim3 blocks(32,32);
    dim3 grids(M/32 + 1, N/32 + 1);
    for (int j = 0; j < M; j++) {
        for (int i = 0; i < N; i++) {
            u[i][j] = 0.02;
            v[i][j] = 0.02;
        }
    }
    double **u_d, **v_d;
    int d_size = N * M * sizeof(double);
    cudaMalloc(&u_d, d_size);
    cudaMalloc(&v_d, d_size);
    cudaMemcpy(u_d, u, d_size, cudaMemcpyHostToDevice);
    cudaMemcpy(v_d, v, d_size, cudaMemcpyHostToDevice);
    my_kernel<<<grids, blocks>>> (v_d, u_d);
    cudaDeviceSynchronize();
    cudaMemcpy(v, v_d, d_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(u, u_d, d_size, cudaMemcpyDeviceToHost);
    cout << "u: " << u[512][512] << " v: " << v[512][512] << endl;
    return 0;
}
What I expect from the second version is :
u: 0.2815 v: 1.7581
Your two-dimensional array - in the first version of the program - is implemented as an array of pointers, each of which points to a separately-allocated array of double values.
In your second version, you are using the same pointer-to-pointer-to-double type, but you're not allocating any space for the actual data, just for the array of pointers (and you're not copying any of the data to the GPU, just the pointers, which are useless to copy anyway, since they point to host-side memory).
What is most likely happening is that your kernel attempts to access memory at an invalid address, and its execution is aborted.
If you were to properly check for errors, as @njuffa noted, you would know that this is what happened.
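For completeness, a typical checking pattern looks something like this (a sketch, not from the original answer; it wraps the calls already present in the question's code):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

/* wrap every runtime call, and check kernel launches explicitly: */
CUDA_CHECK(cudaMemcpy(u_d, u, d_size, cudaMemcpyHostToDevice));
my_kernel<<<grids, blocks>>>(v_d, u_d);
CUDA_CHECK(cudaGetLastError());       /* launch/configuration errors      */
CUDA_CHECK(cudaDeviceSynchronize());  /* asynchronous execution errors    */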
Now, you could avoid having to make multiple memory allocations by using a single data area instead of separate allocations for each second-dimension 1D array; that is true for both the first and the second version of your program. It would not quite be array flattening. See an explanation of how to do this (C-language-style) on this page.
Note, however, that the double-dereferencing which you insist on performing in your kernel is likely to slow it down significantly.
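Here is a sketch of that single-data-area layout which still lets the kernel keep its u[i][j] syntax (illustrative names, error checking omitted; u_host_contiguous stands for the host data stored as one M*N block):

/* one contiguous block for the values, plus one small array of row pointers */
double *u_data_d;   /* M*N doubles on the device      */
double **u_rows_d;  /* M device pointers, one per row */
cudaMalloc(&u_data_d, M * N * sizeof(double));
cudaMalloc(&u_rows_d, M * sizeof(double *));

/* build the row-pointer table on the host; the pointers point into
   device memory, so the table itself is then copied to the device */
double **u_rows_h = new double *[M];
for (int i = 0; i < M; i++)
    u_rows_h[i] = u_data_d + i * N;
cudaMemcpy(u_rows_d, u_rows_h, M * sizeof(double *), cudaMemcpyHostToDevice);

/* the host values must also live in one contiguous block, copied in one go */
cudaMemcpy(u_data_d, u_host_contiguous, M * N * sizeof(double), cudaMemcpyHostToDevice);

/* the kernel can now be launched with u_rows_d and keep its u[i][j] accesses */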

OpenCL Matrix Multiplication Altera Example

I am very new to OpenCL and am going through the Altera OpenCL examples.
In their matrix multiplication example, they have used the concept of blocks, where the dimensions of the input matrices are multiples of the block size. Here's the code:
void matrixMult( // Input and output matrices
                 __global float *restrict C,
                 __global float *A,
                 __global float *B,
                 // Widths of matrices.
                 int A_width, int B_width)
{
    // Local storage for a block of input matrices A and B
    __local float A_local[BLOCK_SIZE][BLOCK_SIZE];
    __local float B_local[BLOCK_SIZE][BLOCK_SIZE];

    // Block index
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);

    // Local ID index (offset within a block)
    int local_x = get_local_id(0);
    int local_y = get_local_id(1);

    // Compute loop bounds
    int a_start = A_width * BLOCK_SIZE * block_y;
    int a_end = a_start + A_width - 1;
    int b_start = BLOCK_SIZE * block_x;

    float running_sum = 0.0f;
    for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
    {
        A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
        B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            running_sum += A_local[local_y][k] * B_local[local_x][k];
        }
    }

    // Store result in matrix C
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
}
Assume the block size is 2; then block_x and block_y are both 0, and local_x and local_y are both 0.
Then A_local[0][0] would be A[0] and B_local[0][0] would be B[0].
The sizes of A_local and B_local are 4 elements each.
In that case, how would A_local and B_local access the other elements of the block in that iteration?
Also, would separate threads/cores be assigned for each local_x and local_y?
There is definitely a barrier missing from your code sample. The outer for loop, as you have it, will only produce correct results if all work items execute instructions in lockstep, which would guarantee that local memory is populated before the for k loop.
Maybe this is the case for Altera and other FPGAs, but it is not correct for CPUs and GPUs.
You should add barrier(CLK_LOCAL_MEM_FENCE); if you are getting unexpected results, or if you want to be compatible with other types of hardware.
float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
    A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
    B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
    // Wait until the whole tile is loaded before anyone reads it
    barrier(CLK_LOCAL_MEM_FENCE);
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        running_sum += A_local[local_y][k] * B_local[local_x][k];
    }
    // A second barrier is needed here as well, so that no work item starts
    // overwriting the tile on the next iteration while others are still reading it
    barrier(CLK_LOCAL_MEM_FENCE);
}
A_local and B_local are both shared by all the work items of the work group, so all of their elements are loaded in parallel (by all the work items of the work group) at each step of the enclosing for loop.
Then each work item uses some of the loaded values (not necessarily the values it loaded itself) to do its share of the computation.
And finally, each work item stores its individual result into the global output matrix.
It is a classical tiled implementation of matrix-matrix multiplication. However, I'm really surprised not to see any call to a memory synchronisation function, such as work_group_barrier(CLK_LOCAL_MEM_FENCE), between the load of A_local and B_local and their use in the k loop... but I might very well have overlooked something here.
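To make the cooperative load concrete for the BLOCK_SIZE = 2 case asked about above (my illustration): a work group then contains 2 x 2 = 4 work items - so yes, each (local_x, local_y) pair is a separate work item - and in every step of the outer loop each of them stores one element into each tile:

work item (local_x=0, local_y=0) loads A_local[0][0] and B_local[0][0]
work item (local_x=1, local_y=0) loads A_local[0][1] and B_local[1][0]
work item (local_x=0, local_y=1) loads A_local[1][0] and B_local[0][1]
work item (local_x=1, local_y=1) loads A_local[1][1] and B_local[1][1]

(B_local is indexed [local_x][local_y], i.e. stored transposed.) After the barrier, all four elements of both tiles are visible to every work item in the group.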

Hough transform and OpenCL

I'm trying to implement the Hough transform for circles in OpenCL, but I've encountered a really weird problem. Every time I run the Hough kernel, I end up with a slightly different accumulator, even though the parameters are the same and the accumulator is always a freshly zeroed table (e.g. http://imgur.com/a/VcIw1). My kernel code is below:
#define BLOCK_LEN 256
__kernel void HoughCirclesKernel(
    __global int* A,
    __global int* imgData,
    __global int* _width,
    __global int* _height,
    __global int* r
    )
{
    __local int imgBuff[BLOCK_LEN];
    int localThreadIndex = get_local_id(0); //threadIdx.x
    int globalThreadIndex = get_local_id(0) + get_group_id(0) * BLOCK_LEN; //threadIdx.x + blockIdx.x * Block_Len
    int width = *_width; int height = *_height;
    int radius = *r;
    A[globalThreadIndex] = 0;
    barrier(CLK_GLOBAL_MEM_FENCE);

    if(globalThreadIndex < width*height)
    {
        imgBuff[localThreadIndex] = imgData[globalThreadIndex];
        barrier(CLK_LOCAL_MEM_FENCE);
        if(imgBuff[localThreadIndex] > 0)
        {
            float s1, c1;
            for(int i = 0; i < 180; i++)
            {
                s1 = sincos(i, &c1);
                int centerX = globalThreadIndex % width + radius * c1;
                int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
                if(centerX < width && centerY < height)
                    atomic_inc(A + centerX + centerY * width);
            }
        }
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}
Could this be the fault of how I am incrementing the accumulator?
if(globalThreadIndex < width*height)
{
    imgBuff[localThreadIndex] = imgData[globalThreadIndex];
    barrier(CLK_LOCAL_MEM_FENCE);
    ...
}
this is undefined behaviour, since there is a barrier inside a branch.
All streaming units in a compute unit must enter the same memory fence.
Try this:
if(globalThreadIndex < width*height)
{
    imgBuff[localThreadIndex] = imgData[globalThreadIndex];
    ...
}
barrier(CLK_LOCAL_MEM_FENCE);
Also, there could be another issue if you are using multiple devices:
get_local_id(0) + get_group_id(0)
Here, get_group_id(0) is the group id per device, and it starts from 0 on every device, just as get_global_id starts from zero too; so you should add proper offsets in the "ndrange" instruction when using multiple devices. Even though different devices can support the same floating-point accuracy requirements, one of them may give better accuracy than another, and so give slightly different results. If it is a single device, then you should try lowering the GPU frequency, as the card may have defects or side effects from an overclock.
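For illustration, the per-device offset can be passed via the global_work_offset argument of clEnqueueNDRangeKernel (a sketch with invented names; note that the kernel would then need to use get_global_id(0), which includes the offset, rather than the hand-computed get_local_id(0) + get_group_id(0) * BLOCK_LEN):

size_t offset[1] = { deviceIndex * itemsPerDevice };  /* this device's slice */
size_t global[1] = { itemsPerDevice };
clEnqueueNDRangeKernel(queue, kernel, 1,
                       offset,   /* global_work_offset */
                       global, NULL, 0, NULL, NULL);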
I have managed to solve my problem by finding and correcting three issues.
First of all, in the kernel code, the line:
int centerY = ((globalThreadIndex - centerX) / height) + radius * s1;
should be:
int centerY = (globalThreadIndex / width) + radius * s1;
The main change here was dividing by width, not height. This caused inaccuracy problems.
if(centerX < width && centerY < height)
The above condition was changed to:
if(centerX < width && centerX >= 0)
if(centerY < height && centerY >= 0)
As for the accumulator problem, first I will post the code I used to create the clBuffer (I am using the OpenCL.net library for C#):
int[] a = new int[width*height]; //image size
ErrorCode error;
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite, (IntPtr)(a.Length * sizeof(int)), out error);
CheckErr(error, "Cl.CreateBuffer");
The fix here was simple and pretty much self-explanatory:
int[] a = Enumerable.Repeat(0, width * height).ToArray();
ErrorCode error;
GCHandle accHandle = GCHandle.Alloc(a, GCHandleType.Pinned);
IntPtr accPtr = accHandle.AddrOfPinnedObject();
Mem cl_accumulator = (Mem)Cl.CreateBuffer(cl_context, MemFlags.ReadWrite | MemFlags.CopyHostPtr, (IntPtr)(a.Length * sizeof(int)), accPtr, out error);
CheckErr(error, "Cl.CreateBuffer");
I filled the accumulator table with zeros and then copied it to the device buffer each time I executed the kernel.
The above errors caused the accumulator to look different and a bit malformed on each kernel execution.

Higher radix (or better) formulation for Stockham FFT

Background
I've implemented this algorithm from Microsoft Research for a radix-2 FFT (Stockham auto sort) using OpenCL.
I use floating-point textures (256 cols x N rows) for input and output in the kernel, because I will need to sample at non-integral points and I thought it better to delegate that to the texture-sampling hardware. Note that my FFTs are always of 256-point sequences (every row of my texture). At this point, my N is 16384 or 32768, depending on the GPU I'm using and the maximum 2D texture size allowed.
I also need to perform the FFT of 4 real-valued sequences at once, so the kernel performs FFT(a, b, c, d) as FFT(a + ib, c + id), from which I can extract the 4 complex sequences later using an O(n) algorithm. I can elaborate on this if someone wishes - but I don't believe it falls within the scope of this question.
Kernel Source
const sampler_t fftSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

__kernel void FFT_Stockham(read_only image2d_t input, write_only image2d_t output, int fftSize, int size)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int b = floor(x / convert_float(fftSize)) * (fftSize / 2);
    int offset = x % (fftSize / 2);
    int x0 = b + offset;
    int x1 = x0 + (size / 2);

    float4 val0 = read_imagef(input, fftSampler, (int2)(x0, y));
    float4 val1 = read_imagef(input, fftSampler, (int2)(x1, y));

    float angle = -6.283185f * (convert_float(x) / convert_float(fftSize));
    // TODO: Convert the two calculations below into lookups from a __constant buffer
    float tA = native_cos(angle);
    float tB = native_sin(angle);

    float4 coeffs1 = (float4)(tA, tB, tA, tB);
    float4 coeffs2 = (float4)(-tB, tA, -tB, tA);
    float4 result = val0 + coeffs1 * val1.xxzz + coeffs2 * val1.yyww;

    write_imagef(output, (int2)(x, y), result);
}
The host code simply invokes this kernel log2(256) times, ping-ponging the input and output textures.
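For illustration, such a ping-pong driver loop might look roughly like this (a sketch with invented names, not the asker's actual host code; the exact fftSize progression depends on the kernel's indexing convention):

int fftSize = 2;
for (int pass = 0; pass < 8; ++pass) {  /* log2(256) = 8 radix-2 passes */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &texIn);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &texOut);
    clSetKernelArg(kernel, 2, sizeof(int), &fftSize);
    clSetKernelArg(kernel, 3, sizeof(int), &size);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
    cl_mem tmp = texIn; texIn = texOut; texOut = tmp;  /* ping-pong */
    fftSize <<= 1;
}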
Note: I tried removing the native_cos and native_sin to see if that impacted timing, but it doesn't seem to change things by very much. Not the factor I'm looking for, in any case.
Access pattern
Knowing that I am probably memory-bandwidth bound, here is the memory access pattern (per-row) for my radix-2 FFT.
X0 - element 1 to combine (read)
X1 - element 2 to combine (read)
X - element to write to (write)
Question
So my question is - can someone help me with, or point me toward, a higher-radix formulation of this algorithm? I ask because most FFT libraries are optimized for large cases and single real/complex-valued sequences. Their kernel generators are also very case-dependent and break down quickly when I try to muck with their internals.
Are there other options better than simply going to a radix-8 or 16 kernel?
Some of my constraints are: I have to use OpenCL (no cuFFT). I also cannot use clAmdFft from ACML for this purpose. It would be nice to also talk about CPU optimizations (this kernel SUCKS big time on the CPU) - but getting it to run in fewer iterations on the GPU is my main use case.
Thanks in advance for reading through all this and trying to help!
I tried several versions, but the one with the best performance on CPU and GPU was a radix-16 kernel for my specific case.
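(The arithmetic behind that choice: the number of passes for a 256-point transform is log_radix(256), so radix-2 needs 8 passes, radix-4 needs 4, and radix-16 needs only 2, since 16^2 = 256.)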
Here is the kernel for reference. It was taken from Eric Bainville's (most excellent) website and used with full attribution.
// #define M_PI 3.14159265358979f
//Global size is x.Length/2, Scale = 1 for direct, 1/N to inverse (iFFT)
__kernel void ConjugateAndScale(__global float4* x, const float Scale)
{
int i = get_global_id(0);
float temp = Scale;
float4 t = (float4)(temp, -temp, temp, -temp);
x[i] *= t;
}
// Return a*EXP(-I*PI*1/2) = a*(-I)
float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
// Return a^2
float2 sqr_1(float2 a)
{ return (float2)(a.x*a.x-a.y*a.y,2.0f*a.x*a.y); }
// Return the 2x DFT2 of the four complex numbers in A
// If A=(a,b,c,d) then return (a',b',c',d') where (a',c')=DFT2(a,c)
// and (b',d')=DFT2(b,d).
float8 dft2_4(float8 a) { return (float8)(a.lo+a.hi,a.lo-a.hi); }
// Return the DFT of 4 complex numbers in A
float8 dft4_4(float8 a)
{
// 2x DFT2
float8 x = dft2_4(a);
// Shuffle, twiddle, and 2x DFT2
return dft2_4((float8)(x.lo.lo,x.hi.lo,x.lo.hi,mul_p1q2(x.hi.hi)));
}
// Complex product, multiply vectors of complex numbers
#define MUL_RE(a,b) (a.even*b.even - a.odd*b.odd)
#define MUL_IM(a,b) (a.even*b.odd + a.odd*b.even)
float2 mul_1(float2 a, float2 b)
{ float2 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_1_F4(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_2(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
// Return the DFT2 of the two complex numbers in vector A
float4 dft2_2(float4 a) { return (float4)(a.lo+a.hi,a.lo-a.hi); }
// Return cos(alpha)+I*sin(alpha) (3 variants)
float2 exp_alpha_1(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
//cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float2)(cs,sn);
}
// Return cos(alpha)+I*sin(alpha) (3 variants)
float4 exp_alpha_1_F4(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
// cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float4)(cs,sn,cs,sn);
}
// mul_p*q*(a) returns a*EXP(-I*PI*P/Q)
#define mul_p0q1(a) (a)
#define mul_p0q2 mul_p0q1
//float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
__constant float SQRT_1_2 = 0.707106781186548; // cos(Pi/4)
#define mul_p0q4 mul_p0q2
float2 mul_p1q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(a.x+a.y,-a.x+a.y); }
#define mul_p2q4 mul_p1q2
float2 mul_p3q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(-a.x+a.y,-a.x-a.y); }
__constant float COS_8 = 0.923879532511287; // cos(Pi/8)
__constant float SIN_8 = 0.382683432365089; // sin(Pi/8)
#define mul_p0q8 mul_p0q4
float2 mul_p1q8(float2 a) { return mul_1((float2)(COS_8,-SIN_8),a); }
#define mul_p2q8 mul_p1q4
float2 mul_p3q8(float2 a) { return mul_1((float2)(SIN_8,-COS_8),a); }
#define mul_p4q8 mul_p2q4
float2 mul_p5q8(float2 a) { return mul_1((float2)(-SIN_8,-COS_8),a); }
#define mul_p6q8 mul_p3q4
float2 mul_p7q8(float2 a) { return mul_1((float2)(-COS_8,-SIN_8),a); }
// Compute in-place DFT2 and twiddle
#define DFT2_TWIDDLE(a,b,t) { float2 tmp = t(a-b); a += b; b = tmp; }
// T = N/16 = number of threads.
// P is the length of input sub-sequences, 1,16,256,...,N/16.
__kernel void FFT_Radix16(__global const float4 * x, __global float4 * y, int pp)
{
int p = pp;
int t = get_global_size(0); // number of threads
int i = get_global_id(0); // current thread
////// y[i] = 2*x[i];
////// return;
int k = i & (p-1); // index in input sequence, in 0..P-1
// Inputs indices are I+{0,..,15}*T
x += i;
// Output indices are J+{0,..,15}*P, where
// J is I with four 0 bits inserted at bit log2(P)
y += ((i-k)<<4) + k;
// Load
float4 u[16];
for (int m=0;m<16;m++) u[m] = x[m*t];
// Twiddle, twiddling factors are exp(_I*PI*{0,..,15}*K/4P)
float alpha = -M_PI*(float)k/(float)(8*p);
for (int m=1;m<16;m++) u[m] = mul_1_F4(exp_alpha_1_F4(m * alpha), u[m]);
// 8x in-place DFT2 and twiddle (1)
DFT2_TWIDDLE(u[0].lo,u[8].lo,mul_p0q8);
DFT2_TWIDDLE(u[0].hi,u[8].hi,mul_p0q8);
DFT2_TWIDDLE(u[1].lo,u[9].lo,mul_p1q8);
DFT2_TWIDDLE(u[1].hi,u[9].hi,mul_p1q8);
DFT2_TWIDDLE(u[2].lo,u[10].lo,mul_p2q8);
DFT2_TWIDDLE(u[2].hi,u[10].hi,mul_p2q8);
DFT2_TWIDDLE(u[3].lo,u[11].lo,mul_p3q8);
DFT2_TWIDDLE(u[3].hi,u[11].hi,mul_p3q8);
DFT2_TWIDDLE(u[4].lo,u[12].lo,mul_p4q8);
DFT2_TWIDDLE(u[4].hi,u[12].hi,mul_p4q8);
DFT2_TWIDDLE(u[5].lo,u[13].lo,mul_p5q8);
DFT2_TWIDDLE(u[5].hi,u[13].hi,mul_p5q8);
DFT2_TWIDDLE(u[6].lo,u[14].lo,mul_p6q8);
DFT2_TWIDDLE(u[6].hi,u[14].hi,mul_p6q8);
DFT2_TWIDDLE(u[7].lo,u[15].lo,mul_p7q8);
DFT2_TWIDDLE(u[7].hi,u[15].hi,mul_p7q8);
// 8x in-place DFT2 and twiddle (2)
DFT2_TWIDDLE(u[0].lo,u[4].lo,mul_p0q4);
DFT2_TWIDDLE(u[0].hi,u[4].hi,mul_p0q4);
DFT2_TWIDDLE(u[1].lo,u[5].lo,mul_p1q4);
DFT2_TWIDDLE(u[1].hi,u[5].hi,mul_p1q4);
DFT2_TWIDDLE(u[2].lo,u[6].lo,mul_p2q4);
DFT2_TWIDDLE(u[2].hi,u[6].hi,mul_p2q4);
DFT2_TWIDDLE(u[3].lo,u[7].lo,mul_p3q4);
DFT2_TWIDDLE(u[3].hi,u[7].hi,mul_p3q4);
DFT2_TWIDDLE(u[8].lo,u[12].lo,mul_p0q4);
DFT2_TWIDDLE(u[8].hi,u[12].hi,mul_p0q4);
DFT2_TWIDDLE(u[9].lo,u[13].lo,mul_p1q4);
DFT2_TWIDDLE(u[9].hi,u[13].hi,mul_p1q4);
DFT2_TWIDDLE(u[10].lo,u[14].lo,mul_p2q4);
DFT2_TWIDDLE(u[10].hi,u[14].hi,mul_p2q4);
DFT2_TWIDDLE(u[11].lo,u[15].lo,mul_p3q4);
DFT2_TWIDDLE(u[11].hi,u[15].hi,mul_p3q4);
// 8x in-place DFT2 and twiddle (3)
DFT2_TWIDDLE(u[0].lo,u[2].lo,mul_p0q2);
DFT2_TWIDDLE(u[0].hi,u[2].hi,mul_p0q2);
DFT2_TWIDDLE(u[1].lo,u[3].lo,mul_p1q2);
DFT2_TWIDDLE(u[1].hi,u[3].hi,mul_p1q2);
DFT2_TWIDDLE(u[4].lo,u[6].lo,mul_p0q2);
DFT2_TWIDDLE(u[4].hi,u[6].hi,mul_p0q2);
DFT2_TWIDDLE(u[5].lo,u[7].lo,mul_p1q2);
DFT2_TWIDDLE(u[5].hi,u[7].hi,mul_p1q2);
DFT2_TWIDDLE(u[8].lo,u[10].lo,mul_p0q2);
DFT2_TWIDDLE(u[8].hi,u[10].hi,mul_p0q2);
DFT2_TWIDDLE(u[9].lo,u[11].lo,mul_p1q2);
DFT2_TWIDDLE(u[9].hi,u[11].hi,mul_p1q2);
DFT2_TWIDDLE(u[12].lo,u[14].lo,mul_p0q2);
DFT2_TWIDDLE(u[12].hi,u[14].hi,mul_p0q2);
DFT2_TWIDDLE(u[13].lo,u[15].lo,mul_p1q2);
DFT2_TWIDDLE(u[13].hi,u[15].hi,mul_p1q2);
// 8x DFT2 and store (reverse binary permutation)
y[0] = u[0] + u[1];
y[p] = u[8] + u[9];
y[2*p] = u[4] + u[5];
y[3*p] = u[12] + u[13];
y[4*p] = u[2] + u[3];
y[5*p] = u[10] + u[11];
y[6*p] = u[6] + u[7];
y[7*p] = u[14] + u[15];
y[8*p] = u[0] - u[1];
y[9*p] = u[8] - u[9];
y[10*p] = u[4] - u[5];
y[11*p] = u[12] - u[13];
y[12*p] = u[2] - u[3];
y[13*p] = u[10] - u[11];
y[14*p] = u[6] - u[7];
y[15*p] = u[14] - u[15];
}
Note that I have modified the kernel to perform the FFT of 2 complex-valued sequences at once instead of one. Also, since I only need the FFT of 256 elements at a time in a much larger sequence, I perform only 2 runs of this kernel, which leaves me with 256-length DFTs in the larger array.
Here's some of the relevant host code as well.
var ev = new[] { new Cl.Event() };
var pEv = new[] { new Cl.Event() };
int fftSize = 1;
int iter = 0;
int n = distributionSize >> 5;
while (fftSize <= n)
{
    Cl.SetKernelArg(fftKernel, 0, memA);
    Cl.SetKernelArg(fftKernel, 1, memB);
    Cl.SetKernelArg(fftKernel, 2, fftSize);
    Cl.EnqueueNDRangeKernel(commandQueue, fftKernel, 1, null, globalWorkgroupSize, localWorkgroupSize,
                            (uint)(iter == 0 ? 0 : 1),
                            iter == 0 ? null : pEv,
                            out ev[0]).Check();
    if (iter > 0)
        pEv[0].Dispose();
    Swap(ref ev, ref pEv);
    Swap(ref memA, ref memB); // ping-pong
    fftSize = fftSize << 4;
    iter++;
    Cl.Finish(commandQueue);
}
Swap(ref memA, ref memB);
Hope this helps someone!

OpenCL traversal kernel - further optimization

Currently, I have an OpenCL kernel for ray traversal, shown below. I'd be glad of any pointers on optimizing this quite large kernel.
The thing is, I'm running this code with a SAH BVH, and I'd like to get performance similar to Timo Aila's traversals from his paper (Understanding the Efficiency of Ray Traversal on GPUs). Of course, his code uses a SplitBVH (which I might consider using in place of the SAH BVH, though in my opinion it has really slow build times). But I'm asking about traversal, not the BVH (and so far I've only worked with scenes where a SplitBVH wouldn't give much advantage over a SAH BVH).
First of all, here is what I have so far (standard while-while traversal kernel).
__constant sampler_t sampler = CLK_FILTER_NEAREST;

// Inline definition of horizontal max
inline float max4(float a, float b, float c, float d)
{
    return max(max(max(a, b), c), d);
}

// Inline definition of horizontal min
inline float min4(float a, float b, float c, float d)
{
    return min(min(min(a, b), c), d);
}

// Traversal kernel
__kernel void traverse( __read_only image2d_t nodes,
                        __global const float4* triangles,
                        __global const float4* rays,
                        __global float4* result,
                        const int num,
                        const int w,
                        const int h)
{
    // Ray index
    int idx = get_global_id(0);
    if(idx < num)
    {
        // Stack
        int todo[32];
        int todoOffset = 0;
        // Current node
        int nodeNum = 0;
        float tmin = 0.0f;
        float depth = 2e30f;
        // Fetch ray origin, direction and compute invdirection
        float4 origin = rays[2 * idx + 0];
        float4 direction = rays[2 * idx + 1];
        float4 invdir = native_recip(direction);
        float4 temp = (float4)(0.0f, 0.0f, 0.0f, 1.0f);

        // Traversal loop
        while(true)
        {
            // Fetch node information
            int2 nodeCoord = (int2)((nodeNum << 2) % w, (nodeNum << 2) / w);
            int4 specs = read_imagei(nodes, sampler, nodeCoord + (int2)(3, 0));

            // While node isn't leaf
            while(specs.z == 0)
            {
                // Fetch child bounding boxes
                float4 n0xy = read_imagef(nodes, sampler, nodeCoord);
                float4 n1xy = read_imagef(nodes, sampler, nodeCoord + (int2)(1, 0));
                float4 nz = read_imagef(nodes, sampler, nodeCoord + (int2)(2, 0));

                // Test ray against child bounding boxes
                float oodx = origin.x * invdir.x;
                float oody = origin.y * invdir.y;
                float oodz = origin.z * invdir.z;
                float c0lox = n0xy.x * invdir.x - oodx;
                float c0hix = n0xy.y * invdir.x - oodx;
                float c0loy = n0xy.z * invdir.y - oody;
                float c0hiy = n0xy.w * invdir.y - oody;
                float c0loz = nz.x * invdir.z - oodz;
                float c0hiz = nz.y * invdir.z - oodz;
                float c1loz = nz.z * invdir.z - oodz;
                float c1hiz = nz.w * invdir.z - oodz;
                float c0min = max4(min(c0lox, c0hix), min(c0loy, c0hiy), min(c0loz, c0hiz), tmin);
                float c0max = min4(max(c0lox, c0hix), max(c0loy, c0hiy), max(c0loz, c0hiz), depth);
                float c1lox = n1xy.x * invdir.x - oodx;
                float c1hix = n1xy.y * invdir.x - oodx;
                float c1loy = n1xy.z * invdir.y - oody;
                float c1hiy = n1xy.w * invdir.y - oody;
                float c1min = max4(min(c1lox, c1hix), min(c1loy, c1hiy), min(c1loz, c1hiz), tmin);
                float c1max = min4(max(c1lox, c1hix), max(c1loy, c1hiy), max(c1loz, c1hiz), depth);

                bool traverseChild0 = (c0max >= c0min);
                bool traverseChild1 = (c1max >= c1min);
                nodeNum = specs.x;
                int nodeAbove = specs.y;

                // We hit just one out of 2 childs
                if(traverseChild0 != traverseChild1)
                {
                    if(traverseChild1)
                    {
                        nodeNum = nodeAbove;
                    }
                }
                // We hit either both or none
                else
                {
                    // If we hit none, pop node from stack (or exit traversal, if stack is empty)
                    if (!traverseChild0)
                    {
                        if(todoOffset == 0)
                        {
                            break;
                        }
                        nodeNum = todo[--todoOffset];
                    }
                    // If we hit both
                    else
                    {
                        // Sort them (so nearest goes 1st, further 2nd)
                        if(c1min < c0min)
                        {
                            unsigned int tmp = nodeNum;
                            nodeNum = nodeAbove;
                            nodeAbove = tmp;
                        }
                        // Push further on stack
                        todo[todoOffset++] = nodeAbove;
                    }
                }

                // Fetch next node information
                nodeCoord = (int2)((nodeNum << 2) % w, (nodeNum << 2) / w);
                specs = read_imagei(nodes, sampler, nodeCoord + (int2)(3, 0));
            }

            // If node is leaf & has some primitives
            if(specs.z > 0)
            {
                // Loop through primitives & perform intersection with them (Woop triangles)
                for(int i = specs.x; i < specs.y; i++)
                {
                    // Fetch first point from global memory
                    float4 v0 = triangles[i * 4 + 0];
                    float o_z = v0.w - origin.x * v0.x - origin.y * v0.y - origin.z * v0.z;
                    float i_z = 1.0f / (direction.x * v0.x + direction.y * v0.y + direction.z * v0.z);
                    float t = o_z * i_z;
                    if(t > 0.0f && t < depth)
                    {
                        // Fetch second point from global memory
                        float4 v1 = triangles[i * 4 + 1];
                        float o_x = v1.w + origin.x * v1.x + origin.y * v1.y + origin.z * v1.z;
                        float d_x = direction.x * v1.x + direction.y * v1.y + direction.z * v1.z;
                        float u = o_x + t * d_x;
                        if(u >= 0.0f && u <= 1.0f)
                        {
                            // Fetch third point from global memory
                            float4 v2 = triangles[i * 4 + 2];
                            float o_y = v2.w + origin.x * v2.x + origin.y * v2.y + origin.z * v2.z;
                            float d_y = direction.x * v2.x + direction.y * v2.y + direction.z * v2.z;
                            float v = o_y + t * d_y;
                            if(v >= 0.0f && u + v <= 1.0f)
                            {
                                // We got successful hit, store the information
                                depth = t;
                                temp.x = u;
                                temp.y = v;
                                temp.z = t;
                                temp.w = as_float(i);
                            }
                        }
                    }
                }
            }

            // Pop node from stack (if empty, finish traversal)
            if(todoOffset == 0)
            {
                break;
            }
            nodeNum = todo[--todoOffset];
        }

        // Store the ray traversal result in global memory
        result[idx] = temp;
    }
}
The first question of the day is: how could one write Aila's Persistent while-while and Speculative while-while kernels in OpenCL?
Regarding Persistent while-while: do I get it right that I actually just start the kernel with a global work size equal to the local work size, and that both of these numbers should be equal to the warp/wavefront size of the GPU?
I get that with CUDA the persistent thread implementation looks like this:
do
{
    volatile int& jobIndexBase = nextJobArray[threadIndex.y];
    if(threadIndex.x == 0)
    {
        jobIndexBase = atomicAdd(&warpCounter, WARP_SIZE);
    }
    index = jobIndexBase + threadIndex.x;
    if(index >= totalJobs)
        return;
    /* Perform work for task numbered 'index' */
}
while(true);
What would the equivalent look like in OpenCL? I know I'll have to put some barriers in there; I also know that one should come right after the spot where I atomically add WARP_SIZE to warpCounter.
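A possible OpenCL sketch of that persistent-thread loop (my untested sketch with invented names, not an authoritative translation; it assumes the work-group size equals the wavefront size, as in the CUDA original, and that warpCounter lives in global memory):

__kernel void traverse_persistent(__global volatile int* warpCounter,
                                  const int totalJobs /* , ... ray data ... */)
{
    __local int jobIndexBase;
    const int localX = get_local_id(0);
    while(true)
    {
        // One work item per group grabs the next batch of jobs
        if(localX == 0)
            jobIndexBase = atomic_add(warpCounter, (int)get_local_size(0));
        barrier(CLK_LOCAL_MEM_FENCE);   // make the new base visible to the whole group

        const int index = jobIndexBase + localX;
        if(jobIndexBase >= totalJobs)
            return;                     // uniform over the group, so safe with the barriers
        if(index < totalJobs)
        {
            /* perform work for task numbered 'index' */
        }
        barrier(CLK_LOCAL_MEM_FENCE);   // don't fetch a new base while others still read the old one
    }
}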
Regarding Speculative traversal - well, I don't really have any idea how this should be implemented in OpenCL, so any hints are welcome. I also have no idea where to put the barriers (because putting them around a simulated __any results in a driver crash).
If you made it here, thanks for reading & any hints, answers, etc. are welcome!
An optimization you can do is to use vector variables and the fused multiply-add function to speed up your set-up math. As for the rest of the kernel, it is slow because it is branchy. If you can make assumptions about the signal data, you might be able to reduce the execution time by reducing the code branches. I have not checked the float4 swizzles (the .xxyy and .x .y .z .w after the float4 variables), so just check that.
float4 n0xy = read_imagef(nodes, sampler, nodeCoord);
float4 n1xy = read_imagef(nodes, sampler, nodeCoord + (int2)(1, 0));
float4 nz   = read_imagef(nodes, sampler, nodeCoord + (int2)(2, 0));

float4 oodf4 = -origin * invdir;
// c0xyf4 = (c0lox, c0hix, c0loy, c0hiy)
float4 c0xyf4 = fma(n0xy, invdir.xxyy, oodf4.xxyy);
// c0zc1z = (c0loz, c0hiz, c1loz, c1hiz)
float4 c0zc1z = fma(nz, (float4)(invdir.z), oodf4.zzzz);

float c0min = max4(min(c0xyf4.x, c0xyf4.y), min(c0xyf4.z, c0xyf4.w), min(c0zc1z.x, c0zc1z.y), tmin);
float c0max = min4(max(c0xyf4.x, c0xyf4.y), max(c0xyf4.z, c0xyf4.w), max(c0zc1z.x, c0zc1z.y), depth);

float4 c1xy = fma(n1xy, invdir.xxyy, oodf4.xxyy);
float c1min = max4(min(c1xy.x, c1xy.y), min(c1xy.z, c1xy.w), min(c0zc1z.z, c0zc1z.w), tmin);
float c1max = min4(max(c1xy.x, c1xy.y), max(c1xy.z, c1xy.w), max(c0zc1z.z, c0zc1z.w), depth);
