Weird output on recursive QuickSort - recursion

I'm trying to implement a recursive quicksort in C that does all swapping by using bitwise XOR operations. Here is what I've got so far:
//bitwise recursive quicksort
void quicksort(int *int_array, int p, int r){
    if(p < r){
        int q = part(int_array, p, r);
        quicksort(int_array, p, q-1);
        quicksort(int_array, q+1, r);
    }
}
//Partition
int part(int *int_array, int p, int r){
    int pivot = int_array[r];
    int i = p-1;
    int j;
    for(j = p; j <= r-1; j++){
        if(int_array[j] <= pivot){
            i++;
            int_array[i] = int_array[i] ^ int_array[j];
            int_array[j] = int_array[i] ^ int_array[j];
            int_array[i] = int_array[i] ^ int_array[j];
        }
    }
    int_array[i+1] = int_array[i+1] ^ int_array[r];
    int_array[r] = int_array[i+1] ^ int_array[r];
    int_array[i+1] = int_array[i+1] ^ int_array[r];
    return i+1;
}
When I run this code on an array of 20 ints, 19 out of 20 of them get changed to 0. Any idea why? I can't see anything wrong with the XOR swapping. Any help appreciated, thanks!

The XOR swap algorithm doesn't work when swapping an item with itself, because any number XORed with itself is 0; the trick relies on the two operands living in two distinct storage locations. So after the first XOR line you have already wiped the value.
XOR swap algorithm:
However, the algorithm fails if x and y use the same storage location, since the value stored in that location will be zeroed out by the first XOR instruction, and then remain zero; it will not be "swapped with itself". Note that this is not the same as if x and y have the same values. The trouble only comes when x and y use the same storage location, in which case their values must already be equal.
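To see the failure in isolation, here is a minimal standalone C sketch (separate from the question's code) showing that the three-XOR swap zeroes a value when both operands name the same array slot:
#include <stdio.h>

int main(void){
    int a[1] = {42};
    int i = 0, j = 0;       /* i == j, so a[i] and a[j] are the same location */
    a[i] = a[i] ^ a[j];     /* 42 ^ 42 == 0, the value is gone */
    a[j] = a[i] ^ a[j];     /* stays 0 */
    a[i] = a[i] ^ a[j];     /* stays 0 */
    printf("%d\n", a[0]);   /* prints 0, not 42 */
    return 0;
}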
You can just put a test around your swaps to make sure you never try swapping an element with itself:
int part(int *int_array, int p, int r){
    int pivot = int_array[r];
    int i = p-1;
    int j;
    for(j = p; j <= r-1; j++){
        if(int_array[j] <= pivot){
            i++;
            if(i != j) // avoid XORing item with itself
            {
                int_array[i] = int_array[i] ^ int_array[j];
                int_array[j] = int_array[i] ^ int_array[j];
                int_array[i] = int_array[i] ^ int_array[j];
            }
        }
    }
    if(i+1 != r) // avoid XORing item with itself
    {
        int_array[i+1] = int_array[i+1] ^ int_array[r];
        int_array[r] = int_array[i+1] ^ int_array[r];
        int_array[i+1] = int_array[i+1] ^ int_array[r];
    }
    return i+1;
}
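For completeness, a small test driver sketch (assuming the quicksort from the question and the corrected part above sit in the same file; the prototypes make the snippet order-independent, and the array contents are arbitrary):
#include <stdio.h>

int part(int *int_array, int p, int r);      /* forward declaration */
void quicksort(int *int_array, int p, int r);

int main(void){
    int data[8] = {9, 3, 7, 1, 8, 2, 7, 0};
    quicksort(data, 0, 7);
    for(int i = 0; i < 8; i++)
        printf("%d ", data[i]);              /* expected: 0 1 2 3 7 7 8 9 */
    printf("\n");
    return 0;
}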

Related

Trying to implement a maths equation in a method

Code a method called calcSeries that calculates and returns the value of y in the following series:
y = 1 + Σ (from i = 1 to n) of i^2 / (i*x)
where n and x are two input integers and y is the returned double value.
Example:
Input: x=8, n=4
So, the series is y=1+1/(1∗8)+2^2/(2∗8)+3^2/(3∗8)+4^2/(4∗8)
Output: 2.25
I am having trouble figuring out how to design the code. So far I have:
public double calcSeries(int n, int x){
    double y = 0.0;
    int a = 1;
    double b = Math.pow(n,2)/(n*x);
    for (int i = 1; i < (n + 1); i++) {
    }
    return y;
}
You've already written most of it. Just declare b inside the loop (its value depends on the current i), accumulate it into y, and initialize y to 1.0 for the leading term:
public double calcSeries(int n, int x){
    double y = 1.0;
    for (int i = 1; i < (n + 1); i++) {
        double b = Math.pow(i,2)/(i*x);
        y += b;
    }
    return y;
}
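Since i^2/(i*x) simplifies to i/x, the Math.pow call can also be dropped; a small equivalent variant (same result, assuming x != 0):
public double calcSeries(int n, int x){
    double y = 1.0;
    for (int i = 1; i <= n; i++) {
        y += (double) i / x;   // i*i/(i*x) == i/x
    }
    return y;
}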

Different results on GPU & CPU with more than one (8+) work items per group

I'm new to OpenCL. As my first project I tried to write code that checks intersections between many polylines and a single polygon.
I'm running the code on both the CPU and the GPU, and I get different results.
First I passed NULL as the local work size when calling clEnqueueNDRangeKernel:
clEnqueueNDRangeKernel(command_queue, kIntersect, 1, NULL, &global, NULL, 2, &evtCalcBounds, &evtKernel);
After trying many things I saw that if I pass 1 as the local size it works correctly and returns the same results on the CPU and the GPU:
size_t local = 1;
clEnqueueNDRangeKernel(command_queue, kIntersect, 1, NULL, &global, &local, 2, &evtCalcBounds, &evtKernel);
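As a side note, the maximum usable work-group size for a kernel can be queried before choosing local; a minimal sketch (assuming kIntersect and device_id are already created):
size_t max_wg_size = 0;
cl_int err = clGetKernelWorkGroupInfo(kIntersect, device_id,
                                      CL_KERNEL_WORK_GROUP_SIZE,
                                      sizeof(max_wg_size), &max_wg_size, NULL);
// local must not exceed max_wg_size, and global must be a multiple of local
// when an explicit local size is passed to clEnqueueNDRangeKernel (OpenCL 1.x).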
I played a bit more and found that the CPU returns wrong results when I run the kernel with a local size of 8 or more (for some reason).
I'm not using any local memory, just globals and privates.
I didn't add the code at first because I think it is irrelevant to the problem (note that with a work-group size of 1 it works fine), and it is long. If it is needed, I will try to simplify it.
The code flow goes like this:
I have the polyline coordinates stored in one big buffer and the single polygon in another. In addition, I provide another buffer with a single int that holds the current result count. All buffers are __global arguments.
In the kernel I simply check for intersection between all the lines of polyline[get_global_id(0)] and the lines of the polygon. If an intersection is found, I use atomic_inc on the result count. There is no reading and writing of the same buffer, no barriers or memory fences; the atomic_inc is the only thread-safe mechanism I'm using.
-- UPDATE --
Added my code:
I know that I could probably make better use of OpenCL built-in functions for some of the vector math, but for now I'm simply converting code from my old single-threaded CPU program to CL, so this is not my concern yet.
bool isPointInPolygon(float x, float y, __global float* polygon) {
bool blnInside = false;
uint length = convert_uint(polygon[4]);
int s = 5;
uint j = length - 1;
for (uint i = 0; i < length; j = i++) {
uint realIdx = s + i * 2;
uint realInvIdx = s + j * 2;
if (((polygon[realIdx + 1] > y) != (polygon[realInvIdx + 1] > y)) &&
(x < (polygon[realInvIdx] - polygon[realIdx]) * (y - polygon[realIdx + 1]) / (polygon[realInvIdx + 1] - polygon[realIdx + 1]) + polygon[realIdx]))
blnInside = !blnInside;
}
return blnInside;
}
bool isRectanglesIntersected(float p_dblMinX1, float p_dblMinY1,
float p_dblMaxX1, float p_dblMaxY1,
float p_dblMinX2, float p_dblMinY2,
float p_dblMaxX2, float p_dblMaxY2) {
bool blnResult = true;
if (p_dblMinX1 > p_dblMaxX2 ||
p_dblMaxX1 < p_dblMinX2 ||
p_dblMinY1 > p_dblMaxY2 ||
p_dblMaxY1 < p_dblMinY2) {
blnResult = false;
}
return blnResult;
}
bool isLinesIntersects(
double Ax, double Ay,
double Bx, double By,
double Cx, double Cy,
double Dx, double Dy) {
double distAB, theCos, theSin, newX, ABpos;
// Fail if either line is undefined.
if (Ax == Bx && Ay == By || Cx == Dx && Cy == Dy)
return false;
// (1) Translate the system so that point A is on the origin.
Bx -= Ax; By -= Ay;
Cx -= Ax; Cy -= Ay;
Dx -= Ax; Dy -= Ay;
// Discover the length of segment A-B.
distAB = sqrt(Bx*Bx + By*By);
// (2) Rotate the system so that point B is on the positive X axis.
theCos = Bx / distAB;
theSin = By / distAB;
newX = Cx*theCos + Cy*theSin;
Cy = Cy*theCos - Cx*theSin; Cx = newX;
newX = Dx*theCos + Dy*theSin;
Dy = Dy*theCos - Dx*theSin; Dx = newX;
// Fail if the lines are parallel.
return (Cy != Dy);
}
bool isPolygonInersectsPolyline(__global float* polygon, __global float* polylines, uint startIdx) {
uint polylineLength = convert_uint(polylines[startIdx]);
uint start = startIdx + 1;
float x1 = polylines[start];
float y1 = polylines[start + 1];
float x2;
float y2;
int polygonLength = convert_uint(polygon[4]);
int polygonLength2 = polygonLength * 2;
int startPolygonIdx = 5;
for (int currPolyineIdx = 0; currPolyineIdx < polylineLength - 1; currPolyineIdx++)
{
x2 = polylines[start + (currPolyineIdx*2) + 2];
y2 = polylines[start + (currPolyineIdx*2) + 3];
float polyX1 = polygon[0];
float polyY1 = polygon[1];
for (int currPolygonIdx = 0; currPolygonIdx < polygonLength; ++currPolygonIdx)
{
float polyX2 = polygon[startPolygonIdx + (currPolygonIdx * 2 + 2) % polygonLength2];
float polyY2 = polygon[startPolygonIdx + (currPolygonIdx * 2 + 3) % polygonLength2];
if (isLinesIntersects(x1, y1, x2, y2, polyX1, polyY1, polyX2, polyY2)) {
return true;
}
polyX1 = polyX2;
polyY1 = polyY2;
}
x1 = x2;
y1 = y2;
}
// No intersection found till now so we check containing
return isPointInPolygon(x1, y1, polygon);
}
__kernel void calcIntersections(__global float* polylines, // My flat points array - [pntCount, x,y,x,y,...., pntCount, x,y,... ]
__global float* pBounds, // The rectangle bounds of each polyline - set of 4 values [top, left, bottom, right....]
__global uint* pStarts, // The start index of each polyline in the polylines array
__global float* polygon, // The polygon i want to intersect with - first 4 items are the rectangle bounds [top, left, bottom, right, pntCount, x,y,x,y,x,y....]
__global float* output, // Result array for saving the intersections polylines indices
__global uint* resCount) // The result count
{
int i = get_global_id(0);
uint start = convert_uint(pStarts[i]);
if (isRectanglesIntersected(pBounds[i * 4], pBounds[i * 4 + 1], pBounds[i * 4 + 2], pBounds[i * 4 + 3],
polygon[0], polygon[1], polygon[2], polygon[3])) {
if (isPolygonInersectsPolyline(polygon, polylines, start)){
int oldVal = atomic_inc(resCount);
output[oldVal] = i;
}
}
}
Can anyone explain this to me?

sequence of numbers using recursion

I want to compute sequence of numbers like this:
n*(n-1) + n*(n-1)*(n-2) + n*(n-1)*(n-2)*(n-3) + ... + n*(n-1)*...*2*1
For example, for n = 5 the sum equals 320.
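(Written out for n = 5: 5*4 + 5*4*3 + 5*4*3*2 + 5*4*3*2*1 = 20 + 60 + 120 + 120 = 320.)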
I have a function which computes one term:
int fac(int n, int s)
{
    if (n > s)
        return n*fac(n - 1, s);
    return 1;
}
Recomputing the product for each summand is quite wasteful. Instead, I'd suggest reusing the shared partial products. If you reorder
n*(n-1) + n*(n-1)*(n-2) + n*(n-1)*(n-2)*(n-3) + n*(n-1)*(n-2)*(n-3)*...*1
you get
n*(n-1)*(n-2)*(n-3)*...*1 + n*(n-1)*(n-2)*(n-3) + n*(n-1)*(n-2) + n*(n-1)
Notice how you start with the product of 1..n, then you add the product of 1..n divided by 1, then you add the product divided by 1*2 etc.
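(For n = 5 that reordering reads 120 + 120/1 + 120/(1*2) + 120/(1*2*3) = 120 + 120 + 60 + 20 = 320.)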
I think a much more efficient definition of your function is (in Python):
def f(n):
    p = product(range(1, n+1))   # product() multiplies all elements, e.g. math.prod in Python 3.8+
    sum_ = p
    for i in range(1, n-1):
        p /= i
        sum_ += p
    return sum_
A recursive version of this definition is:
def f(n):
def go(sum_, i):
if i >= n-1:
return sum_
return sum_ + go(sum_ / i, i+1)
return go(product(range(1, n+1)), 1)
Last but not least, you can also define the function without any explicit recursion by using reduce to generate the list of summands (this is a more 'functional' -- as in functional programming -- style):
def f(n):
    # note: tuple unpacking in lambda parameters is Python 2-only syntax
    summands, _ = reduce(lambda (lst, p), i: (lst + [p], p / i),
                         range(1, n),
                         ([], product(range(1, n+1))))
    return sum(summands)
This style is very concise in functional programming languages such as Haskell; Haskell has a function called scanl which simplifies generating the summands so that the definition is just:
f n = sum $ scanl (/) (product [1..n]) [1..(n-2)]
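The same scanl idea is available in Python through itertools.accumulate; a minimal sketch of that variant (my wording, requires Python 3.8+ for math.prod and the initial= keyword):
from itertools import accumulate
from math import prod

def f(n):
    # running quotients mirror scanl (/): prod(1..n), then /1, /2, ..., /(n-2)
    return sum(accumulate(range(1, n-1), lambda p, i: p // i,
                          initial=prod(range(1, n+1))))

print(f(5))  # 320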
Something like this?
#include <stdio.h>

int fac(int n, int s)
{
    if (n >= s)
        return n * fac(n - 1, s);
    return 1;
}

int main(void)
{
    int sum = 0;
    int s = 4;
    int n = 5;
    while (s > 0)
    {
        sum += fac(n, s);
        s--;
    }
    printf("%d\n", sum); // 320
    return 0;
}
Loop-free version:
#include <cstdio>

int fac(int n, int s)
{
    if (n >= s)
        return n * fac(n - 1, s);
    return 1;
}

int compute(int n, int s, int sum = 0)
{
    if (s > 0)
        return compute(n, s - 1, sum + fac(n, s));
    return sum;
}

int main()
{
    std::printf("%d\n", compute(5, 4)); // 320
    return 0;
}
OK, there is not much to write. I would suggest the following methods if you want to solve this recursively. (Because of the recursive factorial the complexity is a mess and the runtime will increase drastically for big numbers!)
int func(int n){
    return func(n, 2);
}

int func(int n, int i){
    if (i <= n){
        // add the term n*(n-1)*...*(n-i+1), then recurse for the remaining terms
        return n*fac(n-1, n-i) + func(n, i + 1);
    } else return 0;
}

int fac(int i, int a){
    if (i > a){
        return i*fac(i-1, a);
    } else return 1;
}

Higher radix (or better) formulation for Stockham FFT

Background
I've implemented this algorithm from Microsoft Research for a radix-2 FFT (Stockham auto sort) using OpenCL.
I use floating point textures (256 cols X N rows) for input and output in the kernel, because I will need to sample at non-integral points and I thought it better to delegate that to the texture sampling hardware. Note that my FFTs are always of 256-point sequences (every row in my texture). At this point, my N is 16384 or 32768 depending on the GPU I'm using and the max 2D texture size allowed.
I also need to perform the FFT of 4 real-valued sequences at once, so the kernel performs the FFT(a, b, c, d) as FFT(a + ib, c + id) from which I can extract the 4 complex sequences out later using an O(n) algorithm. I can elaborate on this if someone wishes - but I don't believe it falls in the scope of this question.
Kernel Source
const sampler_t fftSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

__kernel void FFT_Stockham(read_only image2d_t input, write_only image2d_t output, int fftSize, int size)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int b = floor(x / convert_float(fftSize)) * (fftSize / 2);
    int offset = x % (fftSize / 2);
    int x0 = b + offset;
    int x1 = x0 + (size / 2);
    float4 val0 = read_imagef(input, fftSampler, (int2)(x0, y));
    float4 val1 = read_imagef(input, fftSampler, (int2)(x1, y));
    float angle = -6.283185f * (convert_float(x) / convert_float(fftSize));
    // TODO: Convert the two calculations below into lookups from a __constant buffer
    float tA = native_cos(angle);
    float tB = native_sin(angle);
    float4 coeffs1 = (float4)(tA, tB, tA, tB);
    float4 coeffs2 = (float4)(-tB, tA, -tB, tA);
    float4 result = val0 + coeffs1 * val1.xxzz + coeffs2 * val1.yyww;
    write_imagef(output, (int2)(x, y), result);
}
The host code simply invokes this kernel log2(256) times, ping-ponging the input and output textures.
Note: I tried removing the native_cos and native_sin to see if that impacted timing, but it doesn't seem to change things by very much. Not the factor I'm looking for, in any case.
Access pattern
Knowing that I am probably memory-bandwidth bound, here is the memory access pattern (per-row) for my radix-2 FFT.
X0 - element 1 to combine (read)
X1 - element 2 to combine (read)
X - element to write to (write)
Question
So my question is - can someone help me with/point me toward a higher-radix formulation for this algorithm? I ask because most FFTs are optimized for large cases and single real/complex valued sequences. Their kernel generators are also very case dependent and break down quickly when I try to muck with their internals.
Are there other options better than simply going to a radix-8 or 16 kernel?
Some of my constraints are - I have to use OpenCL (no cuFFT). I also cannot use clAmdFft from ACML for this purpose. It would be nice to also talk about CPU optimizations (this kernel SUCKS big time on the CPU) - but getting it to run in fewer iterations on the GPU is my main use-case.
Thanks in advance for reading through all this and trying to help!
I tried several versions, but the one with the best performance on CPU and GPU was a radix-16 kernel for my specific case.
Here is the kernel for reference. It was taken from Eric Bainville's (most excellent) website and used with full attribution.
// #define M_PI 3.14159265358979f
//Global size is x.Length/2, Scale = 1 for direct, 1/N to inverse (iFFT)
__kernel void ConjugateAndScale(__global float4* x, const float Scale)
{
int i = get_global_id(0);
float temp = Scale;
float4 t = (float4)(temp, -temp, temp, -temp);
x[i] *= t;
}
// Return a*EXP(-I*PI*1/2) = a*(-I)
float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
// Return a^2
float2 sqr_1(float2 a)
{ return (float2)(a.x*a.x-a.y*a.y,2.0f*a.x*a.y); }
// Return the 2x DFT2 of the four complex numbers in A
// If A=(a,b,c,d) then return (a',b',c',d') where (a',c')=DFT2(a,c)
// and (b',d')=DFT2(b,d).
float8 dft2_4(float8 a) { return (float8)(a.lo+a.hi,a.lo-a.hi); }
// Return the DFT of 4 complex numbers in A
float8 dft4_4(float8 a)
{
// 2x DFT2
float8 x = dft2_4(a);
// Shuffle, twiddle, and 2x DFT2
return dft2_4((float8)(x.lo.lo,x.hi.lo,x.lo.hi,mul_p1q2(x.hi.hi)));
}
// Complex product, multiply vectors of complex numbers
#define MUL_RE(a,b) (a.even*b.even - a.odd*b.odd)
#define MUL_IM(a,b) (a.even*b.odd + a.odd*b.even)
float2 mul_1(float2 a, float2 b)
{ float2 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_1_F4(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
float4 mul_2(float4 a, float4 b)
{ float4 x; x.even = MUL_RE(a,b); x.odd = MUL_IM(a,b); return x; }
// Return the DFT2 of the two complex numbers in vector A
float4 dft2_2(float4 a) { return (float4)(a.lo+a.hi,a.lo-a.hi); }
// Return cos(alpha)+I*sin(alpha) (3 variants)
float2 exp_alpha_1(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
//cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float2)(cs,sn);
}
// Return cos(alpha)+I*sin(alpha) (3 variants)
float4 exp_alpha_1_F4(float alpha)
{
float cs,sn;
// sn = sincos(alpha,&cs); // sincos
// cs = native_cos(alpha); sn = native_sin(alpha); // native sin+cos
cs = cos(alpha); sn = sin(alpha); // sin+cos
return (float4)(cs,sn,cs,sn);
}
// mul_p*q*(a) returns a*EXP(-I*PI*P/Q)
#define mul_p0q1(a) (a)
#define mul_p0q2 mul_p0q1
//float2 mul_p1q2(float2 a) { return (float2)(a.y,-a.x); }
__constant float SQRT_1_2 = 0.707106781186548; // cos(Pi/4)
#define mul_p0q4 mul_p0q2
float2 mul_p1q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(a.x+a.y,-a.x+a.y); }
#define mul_p2q4 mul_p1q2
float2 mul_p3q4(float2 a) { return (float2)(SQRT_1_2)*(float2)(-a.x+a.y,-a.x-a.y); }
__constant float COS_8 = 0.923879532511287; // cos(Pi/8)
__constant float SIN_8 = 0.382683432365089; // sin(Pi/8)
#define mul_p0q8 mul_p0q4
float2 mul_p1q8(float2 a) { return mul_1((float2)(COS_8,-SIN_8),a); }
#define mul_p2q8 mul_p1q4
float2 mul_p3q8(float2 a) { return mul_1((float2)(SIN_8,-COS_8),a); }
#define mul_p4q8 mul_p2q4
float2 mul_p5q8(float2 a) { return mul_1((float2)(-SIN_8,-COS_8),a); }
#define mul_p6q8 mul_p3q4
float2 mul_p7q8(float2 a) { return mul_1((float2)(-COS_8,-SIN_8),a); }
// Compute in-place DFT2 and twiddle
#define DFT2_TWIDDLE(a,b,t) { float2 tmp = t(a-b); a += b; b = tmp; }
// T = N/16 = number of threads.
// P is the length of input sub-sequences, 1,16,256,...,N/16.
__kernel void FFT_Radix16(__global const float4 * x, __global float4 * y, int pp)
{
int p = pp;
int t = get_global_size(0); // number of threads
int i = get_global_id(0); // current thread
////// y[i] = 2*x[i];
////// return;
int k = i & (p-1); // index in input sequence, in 0..P-1
// Inputs indices are I+{0,..,15}*T
x += i;
// Output indices are J+{0,..,15}*P, where
// J is I with four 0 bits inserted at bit log2(P)
y += ((i-k)<<4) + k;
// Load
float4 u[16];
for (int m=0;m<16;m++) u[m] = x[m*t];
// Twiddle, twiddling factors are exp(_I*PI*{0,..,15}*K/4P)
float alpha = -M_PI*(float)k/(float)(8*p);
for (int m=1;m<16;m++) u[m] = mul_1_F4(exp_alpha_1_F4(m * alpha), u[m]);
// 8x in-place DFT2 and twiddle (1)
DFT2_TWIDDLE(u[0].lo,u[8].lo,mul_p0q8);
DFT2_TWIDDLE(u[0].hi,u[8].hi,mul_p0q8);
DFT2_TWIDDLE(u[1].lo,u[9].lo,mul_p1q8);
DFT2_TWIDDLE(u[1].hi,u[9].hi,mul_p1q8);
DFT2_TWIDDLE(u[2].lo,u[10].lo,mul_p2q8);
DFT2_TWIDDLE(u[2].hi,u[10].hi,mul_p2q8);
DFT2_TWIDDLE(u[3].lo,u[11].lo,mul_p3q8);
DFT2_TWIDDLE(u[3].hi,u[11].hi,mul_p3q8);
DFT2_TWIDDLE(u[4].lo,u[12].lo,mul_p4q8);
DFT2_TWIDDLE(u[4].hi,u[12].hi,mul_p4q8);
DFT2_TWIDDLE(u[5].lo,u[13].lo,mul_p5q8);
DFT2_TWIDDLE(u[5].hi,u[13].hi,mul_p5q8);
DFT2_TWIDDLE(u[6].lo,u[14].lo,mul_p6q8);
DFT2_TWIDDLE(u[6].hi,u[14].hi,mul_p6q8);
DFT2_TWIDDLE(u[7].lo,u[15].lo,mul_p7q8);
DFT2_TWIDDLE(u[7].hi,u[15].hi,mul_p7q8);
// 8x in-place DFT2 and twiddle (2)
DFT2_TWIDDLE(u[0].lo,u[4].lo,mul_p0q4);
DFT2_TWIDDLE(u[0].hi,u[4].hi,mul_p0q4);
DFT2_TWIDDLE(u[1].lo,u[5].lo,mul_p1q4);
DFT2_TWIDDLE(u[1].hi,u[5].hi,mul_p1q4);
DFT2_TWIDDLE(u[2].lo,u[6].lo,mul_p2q4);
DFT2_TWIDDLE(u[2].hi,u[6].hi,mul_p2q4);
DFT2_TWIDDLE(u[3].lo,u[7].lo,mul_p3q4);
DFT2_TWIDDLE(u[3].hi,u[7].hi,mul_p3q4);
DFT2_TWIDDLE(u[8].lo,u[12].lo,mul_p0q4);
DFT2_TWIDDLE(u[8].hi,u[12].hi,mul_p0q4);
DFT2_TWIDDLE(u[9].lo,u[13].lo,mul_p1q4);
DFT2_TWIDDLE(u[9].hi,u[13].hi,mul_p1q4);
DFT2_TWIDDLE(u[10].lo,u[14].lo,mul_p2q4);
DFT2_TWIDDLE(u[10].hi,u[14].hi,mul_p2q4);
DFT2_TWIDDLE(u[11].lo,u[15].lo,mul_p3q4);
DFT2_TWIDDLE(u[11].hi,u[15].hi,mul_p3q4);
// 8x in-place DFT2 and twiddle (3)
DFT2_TWIDDLE(u[0].lo,u[2].lo,mul_p0q2);
DFT2_TWIDDLE(u[0].hi,u[2].hi,mul_p0q2);
DFT2_TWIDDLE(u[1].lo,u[3].lo,mul_p1q2);
DFT2_TWIDDLE(u[1].hi,u[3].hi,mul_p1q2);
DFT2_TWIDDLE(u[4].lo,u[6].lo,mul_p0q2);
DFT2_TWIDDLE(u[4].hi,u[6].hi,mul_p0q2);
DFT2_TWIDDLE(u[5].lo,u[7].lo,mul_p1q2);
DFT2_TWIDDLE(u[5].hi,u[7].hi,mul_p1q2);
DFT2_TWIDDLE(u[8].lo,u[10].lo,mul_p0q2);
DFT2_TWIDDLE(u[8].hi,u[10].hi,mul_p0q2);
DFT2_TWIDDLE(u[9].lo,u[11].lo,mul_p1q2);
DFT2_TWIDDLE(u[9].hi,u[11].hi,mul_p1q2);
DFT2_TWIDDLE(u[12].lo,u[14].lo,mul_p0q2);
DFT2_TWIDDLE(u[12].hi,u[14].hi,mul_p0q2);
DFT2_TWIDDLE(u[13].lo,u[15].lo,mul_p1q2);
DFT2_TWIDDLE(u[13].hi,u[15].hi,mul_p1q2);
// 8x DFT2 and store (reverse binary permutation)
y[0] = u[0] + u[1];
y[p] = u[8] + u[9];
y[2*p] = u[4] + u[5];
y[3*p] = u[12] + u[13];
y[4*p] = u[2] + u[3];
y[5*p] = u[10] + u[11];
y[6*p] = u[6] + u[7];
y[7*p] = u[14] + u[15];
y[8*p] = u[0] - u[1];
y[9*p] = u[8] - u[9];
y[10*p] = u[4] - u[5];
y[11*p] = u[12] - u[13];
y[12*p] = u[2] - u[3];
y[13*p] = u[10] - u[11];
y[14*p] = u[6] - u[7];
y[15*p] = u[14] - u[15];
}
Note that I have modified the kernel to perform the FFT of 2 complex-valued sequences at once instead of one. Also, since I only need the FFT of 256 elements at a time in a much larger sequence, I perform only 2 runs of this kernel, which leaves me with 256-length DFTs in the larger array.
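(Two radix-16 passes combine 16 * 16 = 256 inputs, which is exactly the per-row transform length needed here.)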
Here's some of the relevant host code as well.
var ev = new[] { new Cl.Event() };
var pEv = new[] { new Cl.Event() };
int fftSize = 1;
int iter = 0;
int n = distributionSize >> 5;
while (fftSize <= n)
{
Cl.SetKernelArg(fftKernel, 0, memA);
Cl.SetKernelArg(fftKernel, 1, memB);
Cl.SetKernelArg(fftKernel, 2, fftSize);
Cl.EnqueueNDRangeKernel(commandQueue, fftKernel, 1, null, globalWorkgroupSize, localWorkgroupSize,
(uint)(iter == 0 ? 0 : 1),
iter == 0 ? null : pEv,
out ev[0]).Check();
if (iter > 0)
pEv[0].Dispose();
Swap(ref ev, ref pEv);
Swap(ref memA, ref memB); // ping-pong
fftSize = fftSize << 4;
iter++;
Cl.Finish(commandQueue);
}
Swap(ref memA, ref memB);
Hope this helps someone!

Optimization of Fibonacci sequence generating algorithm

As we all know, the simplest algorithm to generate the Fibonacci sequence is the naive recursion:
int fib(int n) {
    if (n <= 0) return 0;
    else if (n == 1) return 1;
    return fib(n-1) + fib(n-2);
}
But this algorithm has some repetitive calculation. For example, if you calculate f(5), it will calculate f(4) and f(3). When you calculate f(4), it will again calculate both f(3) and f(2). Could someone give me a more time-efficient recursive algorithm?
I have read about some methods for calculating Fibonacci numbers with efficient time complexity; following are some of them.
Method 1 - Dynamic Programming
The optimal substructure here is commonly known, so I'll jump straight to the solution:
static int fib(int n)
{
    int f[] = new int[n+2]; // 1 extra to handle case, n = 0
    int i;
    f[0] = 0;
    f[1] = 1;
    for (i = 2; i <= n; i++)
    {
        f[i] = f[i-1] + f[i-2];
    }
    return f[n];
}
A space-optimized version of above can be done as follows -
static int fib(int n)
{
    int a = 0, b = 1, c;
    if (n == 0)
        return a;
    for (int i = 2; i <= n; i++)
    {
        c = a + b;
        a = b;
        b = c;
    }
    return b;
}
Method 2 - Using the power of the matrix {{1,1},{1,0}}
This is an O(n) method which relies on the fact that if we multiply the matrix M = {{1,1},{1,0}} by itself n times (in other words, calculate power(M, n)), then we get the (n+1)th Fibonacci number as the element at row and column (0, 0) of the resulting matrix.
The matrix representation gives the following closed expression for the Fibonacci numbers:
{{1,1},{1,0}}^n = {{F(n+1), F(n)}, {F(n), F(n-1)}}
static int fib(int n)
{
    int F[][] = new int[][]{{1,1},{1,0}};
    if (n == 0)
        return 0;
    power(F, n-1);
    return F[0][0];
}

/* multiplies 2 matrices F and M of size 2*2, and
   puts the multiplication result back into F[][] */
static void multiply(int F[][], int M[][])
{
    int x = F[0][0]*M[0][0] + F[0][1]*M[1][0];
    int y = F[0][0]*M[0][1] + F[0][1]*M[1][1];
    int z = F[1][0]*M[0][0] + F[1][1]*M[1][0];
    int w = F[1][0]*M[0][1] + F[1][1]*M[1][1];
    F[0][0] = x;
    F[0][1] = y;
    F[1][0] = z;
    F[1][1] = w;
}

/* function that raises F[][] to the power n and puts the
   result in F[][] */
static void power(int F[][], int n)
{
    int i;
    int M[][] = new int[][]{{1,1},{1,0}};
    // multiply F by M, n-1 times
    for (i = 2; i <= n; i++)
        multiply(F, M);
}
This can be optimized to work in O(log n) time: compute power(M, n) by recursive squaring instead of repeated multiplication.
static int fib(int n)
{
    int F[][] = new int[][]{{1,1},{1,0}};
    if (n == 0)
        return 0;
    power(F, n-1);
    return F[0][0];
}

static void multiply(int F[][], int M[][])
{
    int x = F[0][0]*M[0][0] + F[0][1]*M[1][0];
    int y = F[0][0]*M[0][1] + F[0][1]*M[1][1];
    int z = F[1][0]*M[0][0] + F[1][1]*M[1][0];
    int w = F[1][0]*M[0][1] + F[1][1]*M[1][1];
    F[0][0] = x;
    F[0][1] = y;
    F[1][0] = z;
    F[1][1] = w;
}

static void power(int F[][], int n)
{
    if (n == 0 || n == 1)
        return;
    int M[][] = new int[][]{{1,1},{1,0}};
    power(F, n/2);
    multiply(F, F);
    if (n % 2 != 0)
        multiply(F, M);
}
Method 3 (O(log n) Time)
Below is one more interesting recurrence formula that can be used to find the nth Fibonacci number in O(log n) time.
If n is even then k = n/2:
F(n) = [2*F(k-1) + F(k)]*F(k)
If n is odd then k = (n + 1)/2
F(n) = F(k)*F(k) + F(k-1)*F(k-1)
How does this formula work?
The formula can be derived from the above matrix equation.
{{1,1},{1,0}}^n = {{F(n+1), F(n)}, {F(n), F(n-1)}}
Taking determinants on both sides, we get
(-1)^n = F(n+1)*F(n-1) - F(n)^2
Moreover, since A^n * A^m = A^(n+m) for any square matrix A, the following identities can be derived (they are obtained from two different coefficients of the matrix product):
F(m)*F(n) + F(m-1)*F(n-1) = F(m+n-1)
Putting n = n+1:
F(m)*F(n+1) + F(m-1)*F(n) = F(m+n)
Putting m = n:
F(2n-1) = F(n)^2 + F(n-1)^2
F(2n) = (F(n-1) + F(n+1))*F(n) = (2*F(n-1) + F(n))*F(n) (Source: Wiki)
To get the formula to be proved, we simply need to do the following
If n is even, we can put k = n/2
If n is odd, we can put k = (n+1)/2
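For example, n = 10 is even, so k = 5 and F(10) = [2*F(4) + F(5)]*F(5) = (2*3 + 5)*5 = 55, which matches the sequence.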
// f[] is a memo table of size at least n+1, zero-initialized
// (declared here for illustration; pick a bound that covers your n)
static int f[] = new int[1000];

public static int fib(int n)
{
    if (n == 0)
        return 0;
    if (n == 1 || n == 2)
        return (f[n] = 1);
    // If fib(n) is already computed
    if (f[n] != 0)
        return f[n];
    int k = (n & 1) == 1 ? (n + 1) / 2 : n / 2;
    // Applying the formula above; n & 1 is 1 if n is odd, else 0
    f[n] = (n & 1) == 1 ? (fib(k) * fib(k) + fib(k - 1) * fib(k - 1))
                        : (2 * fib(k - 1) + fib(k)) * fib(k);
    return f[n];
}
Method 4 - Using a formula
In this method, we directly implement the formula for the nth term in the Fibonacci series. Time O(1) Space O(1)
F(n) = round({[(√5 + 1)/2]^n} / √5)
static int fib(int n) {
    double phi = (1 + Math.sqrt(5)) / 2;
    return (int) Math.round(Math.pow(phi, n) / Math.sqrt(5));
}
Reference: http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibFormula.html
Look here for an implementation in Erlang which uses the matrix-power formula {{1,1},{1,0}}^n. It shows nice, nearly linear runtime behavior because the M(n) part of the O(M(n) log n) bound dominates for big numbers. It calculates fib of one million in 2 s, where the result has 208988 digits. The trick is that you can compute the exponentiation in O(log n) multiplications using a (tail) recursive formula (tail meaning O(1) stack space with a proper compiler, or after rewriting to a loop):
% compute X^N
power(X, N) when is_integer(N), N >= 0 ->
    power(N, X, 1).

power(0, _, Acc) ->
    Acc;
power(N, X, Acc) ->
    if N rem 2 =:= 1 ->
           power(N - 1, X, Acc * X);
       true ->
           power(N div 2, X * X, Acc)
    end.
where you substitute matrices for X and Acc. X is initialized with the matrix {{1,1},{1,0}} and Acc with the identity matrix {{1,0},{0,1}}.
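A minimal sketch of that substitution (my own illustration, not the linked code), representing a 2x2 matrix {{a,b},{c,d}} as a flat tuple {A11,A12,A21,A22}:
% 2x2 matrix product on flattened tuples
mat_mul({A11,A12,A21,A22}, {B11,B12,B21,B22}) ->
    {A11*B11 + A12*B21, A11*B12 + A12*B22,
     A21*B11 + A22*B21, A21*B12 + A22*B22}.

% tail-recursive exponentiation by squaring; Acc starts as the identity matrix
mat_pow(_, 0, Acc) -> Acc;
mat_pow(M, N, Acc) when N rem 2 =:= 1 -> mat_pow(M, N - 1, mat_mul(Acc, M));
mat_pow(M, N, Acc) -> mat_pow(mat_mul(M, M), N div 2, Acc).

fib(0) -> 0;
fib(N) when N > 0 ->
    {_, F, _, _} = mat_pow({1,1,1,0}, N, {1,0,0,1}),   % {{1,1},{1,0}}^N, F(N) is an off-diagonal entry
    F.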
One simple way is to calculate it iteratively instead of recursively. This will calculate F(n) in linear time.
def fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = a+b, a
    return a
Hint: One way to achieve faster results is by using Binet's formula. Here is a way of doing it in Python:
from decimal import *

def fib(n):
    return int((Decimal(1.6180339)**Decimal(n) - Decimal(-0.6180339)**Decimal(n)) / Decimal(2.236067977))
You can save your results and reuse them:
public static long[] fibs;

public long fib(int n) {
    fibs = new long[n];
    return internalFib(n);
}

public long internalFib(int n) {
    if (n <= 2) return 1;
    fibs[n-1] = fibs[n-1] == 0 ? internalFib(n-1) : fibs[n-1];
    fibs[n-2] = fibs[n-2] == 0 ? internalFib(n-2) : fibs[n-2];
    return fibs[n-1] + fibs[n-2];
}
F(n) = (φ^n)/√5 and round to nearest integer, where φ is the golden ratio....
φ^n can be calculated in O(lg(n)) time hence F(n) can be calculated in O(lg(n)) time.
// D Programming Language
void vFibonacci ( const ulong X, const ulong Y, const int Limit ) {
    // Equivalent : if ( Limit != 10 ). Former ( Limit ^ 0xA ) is More Efficient However.
    if ( Limit ^ 0xA ) {
        write ( Y, " " ) ;
        vFibonacci ( Y, Y + X, Limit + 1 ) ;
    } ;
} ;

// Call As
// By Default the Limit is 10 Numbers
vFibonacci ( 0, 1, 0 ) ;
EDIT: I actually think Hynek Vychodil's answer is superior to mine, but I'm leaving this here just in case someone is looking for an alternate method.
I think the other methods are all valid, but not optimal. Using Binet's formula should give you the right answer in principle, but rounding to the closest integer will give some problems for large values of n. The other solutions will unnecessarily recalculate the values up to n every time you call the function, and so the function is not optimized for repeated calling.
In my opinion the best thing to do is to define a global array and then to add new values to the array IF needed. In Python:
import numpy

fibo = numpy.array([1, 1])
last_index = fibo.size

def fib(n):
    global fibo, last_index
    if (n > 0):
        if (n > last_index):
            for i in range(last_index+1, n+1):
                fibo = numpy.concatenate((fibo, numpy.array([fibo[i-2] + fibo[i-3]])))
            last_index = fibo.size
        return fibo[n-1]
    else:
        print "fib called for index less than 1"   # note: Python 2 print statement
        quit()
Naturally, if you need to call fib for n > 80 (approximately), then you will need arbitrary-precision integers (the fixed-size integers in the numpy array will overflow), which is easy to do in Python since its built-in ints are arbitrary precision.
This will execute faster, in O(n):
def fibo(n):
    a, b = 0, 1
    for i in range(n):
        if i == 0:
            print(i)
        elif i == 1:
            print(i)
        else:
            temp = a
            a = b
            b += temp
            print(b)

n = int(input())
fibo(n)
