Knapsack algorithm: strange behavior with pown() on the gpu

The OpenCL version running on the CPU produces correct results, whereas the GPU version gives slightly different values in a few places, and these ultimately make the final result incorrect. I have debugged the kernel with the Intel OpenCL SDK, where I get the right results. I haven't spotted any race condition or concurrent access to memory. The problem appeared after I introduced the pown function into the kernel (a single line of code).
void kernel knapsack(global int *input_f, global int *output_f, global uint *m_d, int cmax, int weightk, int pk, int maxelem, int i){
    int c = get_global_id(0)+cmax;
    if(get_global_id(0)<maxelem){
        if(input_f[c] < input_f[c - weightk] + pk){
            output_f[c] = input_f[c - weightk] + pk;
            m_d[c-1] = pown(2.0,i); // previous version: m_d[c-1] = 1;
        }
        else{
            output_f[c] = input_f[c];
        }
    }
}
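The purpose of pown is to compress the m_d buffer, which holds the outcomes: the decision for item i is stored as 2^i instead of in a separate row.
For example, three rows of 0/1 decisions (one row per item, i = 0, 1, 2)
1 0 1 0
0 1 0 1
1 0 0 0
are compressed column by column into a single row:
2^0+2^2, 2^1, 2^0, 2^1
On the GPU I get something like this instead:
2^0+2^2, 2^1, 2^0+2^2, 2^1
In the 3rd column pown is applied one more time than it should be. This is what gives me the "slightly" different result.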
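The packing in the example above can be sketched on the host in plain C. This is only an illustration, not the kernel from the question: the name pack_decisions and the row-major input layout are assumptions, and it uses an integer shift with a bitwise OR (1u << i) where the kernel calls pown(2.0, i).

```c
#include <stddef.h>

/* Hypothetical helper: packs n_items rows of 0/1 decisions (row-major,
 * n_cols entries per row) into one word per column, so that bit i of
 * packed[col] holds the decision for item i. Assumes n_items <= 32. */
void pack_decisions(const int *rows, size_t n_items, size_t n_cols,
                    unsigned int *packed)
{
    for (size_t col = 0; col < n_cols; ++col) {
        packed[col] = 0u;
    }
    for (size_t i = 0; i < n_items; ++i) {
        for (size_t col = 0; col < n_cols; ++col) {
            if (rows[i * n_cols + col] != 0) {
                packed[col] |= 1u << i;   /* integer equivalent of adding 2^i */
            }
        }
    }
}
```

For the rows {1,0,1,0}, {0,1,0,1}, {1,0,0,0} this yields 5, 2, 1, 2, i.e. 2^0+2^2, 2^1, 2^0, 2^1.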
Here you can find full code
This work is based on this article:
Solving knapsack problems on GPU by V. Boyer, D. El Baz, M. Elkihel
related work: Accelerating the knapsack problem on GPUs by Bharath Suri

Related

Bizarre output from Arduino

I'm trying to do something relatively simple with the Arduino (getting some lights to light up like a Simon Says game), but I'm getting some really bizarre outputs.
Because the output on the pins was so strange, I took those parts of the code out and printed the contents of the array that holds the sequence of lights (colors) to the serial monitor instead. It just really doesn't make sense to me.
What my code is supposed to do is append a random number from 1-4 onto colors[] and then read them back out onto the serial monitor.
bool lightValue[4] = { 0 };
// stores whether each LED is lit or not
int buttonPressed;
// 0 - no button pressed / nothing
// 1 - red
// 2 - yellow
// 3 - green
// 4 - blue
int colors[] = { 0 };
// probably not the best name for this variable but oh well
// stores all the colors displayed for the simon says
// refer above to see what each number means
int colorNum = 0;
// again not the best name, stores the number of colors displayed so far
// AKA the length of colors[]
int randomNum;
// will store a random number
// variables
void setup() {
randomSeed(analogRead(0));
Serial.begin(9600);
Serial.println();
Serial.println("PROGRAM START");
}
// pinModes. Lots of pinModes.
void loop() {
randomNum = random(1,5);
colors[colorNum] = randomNum;
Serial.println();
Serial.print(colorNum);
Serial.print(" ");
colorNum++;
// adds another random color onto the end of the color sequence
for (int i = 0; i < colorNum; i++) {
Serial.print(colors[i]);
delay(500);
}
}
Some examples of outputs I got:
0 3
0 1
2 13520
3 145202
4 1552024
5 16520241
6 175202414
7 1852024141
8 19520241414
9 1105202414141
10 11152024141414
0 1
2 13520
3 145202
4 1552024
5 16520241
6 175202414
7 1852024141
8 19520241414
9 1105202414141
10 11152024141414
colorNum, the main counter of this loop, skips over one for some reason. The first and second outputs do not match, the third item in the array is 520, and for some reason the second item is incrementing by 1 every step. Also, it stops at 10 for some reason.
The only thing I could chalk this inconsistent behavior up to is accessing some piece of memory that it shouldn't, but I can't figure out for the life of me where I horribly messed up.

int colors[] = { 0 };
defines an integer array with a single element 0.
Here colors[colorNum] = randomNum; you're assigning numbers to indices outside of that array for colorNum > 0. You shouldn't do that. This memory region is not reserved for colors!
So what stops your compiler from storing colorNum right after colors?
So when you assign a value to colors[1] you could very well change the value of colorNum. The same goes for your loop control variable i.
So the second value is incremented because you're incrementing colorNum, which is at the same memory location as colors[1].
The print for colorNum == 1 is missing because you assigned 5 to colors[2] which is at the same memory location as your loop control variable i. As 5 > colorNum the loop does not run.
I just did this on a 32bit C++ compiler:
int colors[] = {0};
int colorNum = 0;
int i = 0;
And the addresses printed:
colors[0] @ 0x7fff029a5ac4
colorNum @ 0x7fff029a5ac8
colors[1] @ 0x7fff029a5ac8
i @ 0x7fff029a5acc
colors[2] @ 0x7fff029a5acc
Note that colorNum is just 4 bytes after colors[0] which is the same address as colors[1]!
Anyway, you shouldn't just fill memory in an infinite loop in the first place.
You're on a microcontroller, where memory is a limited resource.
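One way to avoid writing past the end of the array is to reserve a fixed amount of storage and refuse to append once it is full. The following is a minimal sketch in plain C, not the original poster's code: MAX_COLORS, struct color_sequence and color_sequence_append are hypothetical names, and the capacity of 32 is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_COLORS 32  /* hypothetical capacity, chosen only for illustration */

/* Hypothetical fixed-capacity sequence: the storage is reserved up front,
 * so a write can never land on a neighbouring variable. */
struct color_sequence {
    int values[MAX_COLORS];
    size_t length;
};

/* Appends value if there is room; returns false (and changes nothing)
 * once the sequence is full. */
bool color_sequence_append(struct color_sequence *seq, int value)
{
    if (seq->length >= MAX_COLORS) {
        return false;
    }
    seq->values[seq->length] = value;
    seq->length++;
    return true;
}
```

In a loop like the one in the question, the return value would tell the caller when to stop adding colors (or to start a new game) instead of silently overwriting other variables.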

Logic of computing a^b, and is power a keyword?

I found the following code that is meant to compute a^b (Cracking the Coding Interview, Ch. VI Big O).
What's the logic of return a * power(a, b - 1);? Is it recursion of some sort?
Is power a keyword here or just pseudocode?
int power(int a, int b)
{
    if (b < 0) {
        return a; // error
    } else if (b == 0) {
        return 1;
    } else {
        return a * power(a, b - 1);
    }
}

power is just the name of the function.
Yes, this is recursion, because we are expressing the given problem in terms of a smaller problem of the same type.
Let a=2 and b=4, so we want to calculate power(2,4), the large (original) problem.
Now we represent this in terms of a smaller one,
i.e. 2*power(2,4-1), where power(2,3) is a smaller problem of the same type,
i.e. a*power(a,b-1).
The ifs at the start handle the base cases, i.e. when b goes below 1.

This is a recursive function. That is, the function is defined in terms of itself, with a base case that prevents the recursion from running indefinitely.
power is the name of the function.
For example, 4^3 is equal to 4 * 4^2. That is, 4 raised to the third power can be calculated by multiplying 4 and 4 raised to the second power. In the same way, 4^2 can be calculated as 4 * 4^1, and 4^1 as 4 * 4^0, where the base case of the recursion specifies that 4^0 = 1. Combining this together, 4^3 = 4 * 4^2 = 4 * 4 * 4^1 = 4 * 4 * 4 * 4^0 = 4 * 4 * 4 * 1 = 64.

power here is just the name of the function that is defined and NOT a keyword.
Now, let's consider that you want to find 2^10. You can write the same thing as 2*(2^9), as 2*2*(2^8), as 2*2*2*(2^7) and so on, down to 2*2*2*2*2*2*2*2*2*(2^1).
This is what a * power(a, b - 1) is doing in a recursive manner.
Here is a dry run of the code for finding 2^4:
The initial call to the function will be power(2,4); the complete call stack is shown below:
power(2,4) ---> returns a*power(2,3), i.e, 2*8=16
|
power(2,3) ---> returns a*power(2,2), i.e, 2*4=8
|
power(2,2) ---> returns a*power(2,1), i.e, 2*2=4
|
power(2,1) ---> returns a*power(2,0), i.e, 2*1=2
|
power(2,0) ---> returns 1 as b == 0
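To check a trace like this by running it, the function can be instrumented with a call counter. The sketch below keeps the recursion from the question unchanged; the name power_counted and the calls parameter are additions made purely for illustration.

```c
/* Same recursion as in the question, with a call counter added purely
 * for illustration (the counter is not part of the original code). */
int power_counted(int a, int b, int *calls)
{
    ++*calls;
    if (b < 0) {
        return a; /* error case, kept as in the question */
    } else if (b == 0) {
        return 1; /* base case: a^0 = 1 */
    }
    return a * power_counted(a, b - 1, calls);
}
```

Starting with calls set to 0, power_counted(2, 4, &calls) returns 16 after 5 calls, one for each of b = 4, 3, 2, 1, 0.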

Efficient method for imposing (some cases of) periodic boundary conditions on floats?

Some cases of periodic boundary conditions (PBC) can be imposed very efficiently on integers by simply doing:
myWrappedWithinPeriodicBoundary = myUIntValue & mask
This works when the boundary is the half open range [0, upperBound), where the (exclusive) upperBound is 2^exp so that
mask = (1 << exp) - 1
For example:
let pbcUpperBoundExp = 2 // so the periodic boundary will be [0, 4)
let mask = (1 << pbcUpperBoundExp) - 1
for x in -7 ... 7 { print(x & mask, terminator: " ") }
(in Swift) will print:
1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Question: Is there any (roughly similar) efficient method for imposing (some cases of) PBCs on floating point-numbers (32 or 64-bit IEEE-754)?

There are several reasonable approaches:
fmod(x,1)
modf(x,&dummy) — has the advantage of knowing its divisor statically, but in my testing comes from libc.so.6 even with -ffast-math
x-floor(x) (suggested by Jens in a comment) — supports negative inputs directly
Manual bit-twiddling direct implementation
Manual bit-twiddling implementation of floor
The first two preserve the sign of their input; you can add 1 if it's negative.
The two bit manipulations are very similar: you identify which significand bits correspond to the integer portion, and mask them (for the direct implementation) or the rest (to implement floor) off. The direct implementation can be completed either with a floating-point division or with a shift to reassemble the double manually; the former is 28% faster even given hardware CLZ. The floor implementation can immediately reconstitute a double: floor never changes the exponent of its argument unless it returns 0. About 20 lines of C are required.
The following timing is with double and gcc -O3, with timing loops over representative inputs into which the operative code was inlined.
fmod: 41.8 ns
modf: 19.6 ns
floor: 10.6 ns
With -ffast-math:
fmod: 26.2 ns
modf: 30.0 ns
floor: 21.9 ns
Bit manipulation:
direct: 18.0 ns
floor: 20.6 ns
The manual implementations are competitive, but the floor technique is the best. Oddly, two of the three library functions perform better without -ffast-math: that is, as a PLT function call than as an inlined builtin function.
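The x-floor(x) approach from the list above can be sketched in C as follows. This is a hedged illustration, not the code that was timed: wrap_unit and floor_by_truncation are hypothetical names, and floor is computed here with a truncating cast, which is an assumption that only holds for finite |x| < 2^63 (the standard floor() from <math.h> could be used in its place). The result is clamped because a tiny negative input such as -1e-20 would otherwise round to exactly 1.0.

```c
#include <float.h>

/* Hypothetical stand-in for floor(): truncate toward zero, then step down
 * by one when truncation rounded up. Assumes x is finite and
 * |x| < 2^63 so that the cast to long long is defined. */
static double floor_by_truncation(double x)
{
    double t = (double)(long long)x;
    return (t > x) ? t - 1.0 : t;
}

/* Wraps x into the half-open range [0, 1). */
double wrap_unit(double x)
{
    double frac = x - floor_by_truncation(x);
    /* Largest double below 1.0; x - floor(x) can round up to 1.0 for
     * very small negative x, so clamp to stay inside [0, 1). */
    const double below_one = 1.0 - DBL_EPSILON / 2.0;
    return (frac >= 1.0) ? below_one : frac;
}
```

For example, wrap_unit(1.25) gives 0.25 and wrap_unit(-0.25) gives 0.75, so negative inputs need no separate sign correction.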

I'm adding this answer to my own question since it describes the best solution I have found at the time of writing. It's in Swift 4.1 (it should be straightforward to translate into C) and it has been tested in various use cases:
extension BinaryFloatingPoint {
/// Returns the value after restricting it to the periodic boundary
/// condition [0, 1).
/// See https://forums.swift.org/t/why-no-fraction-in-floatingpoint/10337
@_transparent
func wrappedToUnitRange() -> Self {
let fract = self - self.rounded(.down)
// Have to clamp to just below 1 because very small negative values
// will otherwise return an out of range result of 1.0.
// Turns out this:
if fract >= 1.0 { return Self(1).nextDown } else { return fract }
// is faster than this:
//return min(fract, Self(1).nextDown)
}
@_transparent
func wrapped(to range: Range<Self>) -> Self {
let measure = range.upperBound - range.lowerBound
let recipMeasure = Self(1) / measure
let scaled = (self - range.lowerBound) * recipMeasure
return scaled.wrappedToUnitRange() * measure + range.lowerBound
}
@_transparent
func wrappedIteratively(to range: Range<Self>) -> Self {
var v = self
let measure = range.upperBound - range.lowerBound
while v >= range.upperBound { v = v - measure }
while v < range.lowerBound { v = v + measure }
return v
}
}
On my MacBook Pro with a 2 GHz Intel Core i7, a hundred million (probably inlined) calls to wrapped(to range:) on random (finite) Double values take 0.6 seconds, which is about 166 million calls per second (not multithreaded). Whether the range is statically known or not, or has bounds or a measure that is a power of two, etc., can make some difference, but not as much as one might have thought.
wrappedToUnitRange() takes about 0.2 seconds, meaning 500 million calls per second on my system.
Given the right scenario, wrappedIteratively(to range:) is as fast as wrappedToUnitRange().
The timings were made by comparing a baseline test (one that does not wrap a value, but still uses it to compute e.g. a simple xor checksum) to the same test where the value is wrapped. The difference in time between the two is what I have reported for the wrapping calls.
I have used Swift development toolchain 2018-02-21, compiling with -O -whole-module-optimization -static-stdlib -gnone. And care has been taken to make the tests relevant, ie preventing dead code removal, using true random input of different distributions etc. Writing the wrapping functions generically, like this extension on BinaryFloatingPoint, turned out to be optimized into equivalent code as if I had written separate specialized versions for eg Float and Double.
It would be interesting to see someone more skilled than me investigating this further (C or Swift or any other language doesn't matter).
EDIT:
For anyone interested, here are some versions for simd float2:
extension float2 {
@_transparent
func wrappedInUnitRange() -> float2 {
return simd.fract(self)
}
@_transparent
func wrappedToMinusOneToOne() -> float2 {
let scaled = (self + float2(1, 1)) * float2(0.5, 0.5)
let scaledFract = scaled - floor(scaled)
let wrapped = simd_muladd(scaledFract, float2(2, 2), float2(-1, -1))
// Note that we have to make sure the result is not out of bounds, like
// simd fract does:
let oneNextDown = Float(bitPattern:
0b0_01111110_11111111111111111111111)
let oneNextDownFloat2 = float2(oneNextDown, oneNextDown)
return simd.min(wrapped, oneNextDownFloat2)
}
@_transparent
func wrapped(toLowerBound lowerBound: float2,
upperBound: float2) -> float2
{
let measure = upperBound - lowerBound
let recipMeasure = simd_precise_recip(measure)
let scaled = (self - lowerBound) * recipMeasure
let scaledFract = scaled - floor(scaled)
// Note that we have to make sure the result is not out of bounds, like
// simd fract does:
let wrapped = simd_muladd(scaledFract, measure, lowerBound)
let maxX = upperBound.x.nextDown // For some reason, this won't be
let maxY = upperBound.y.nextDown // optimized even when upperBound is
// statically known, and there is no similar simd function available.
let maxValue = float2(maxX, maxY)
return simd.min(wrapped, maxValue)
}
}
I asked some related simd questions here, which might be of interest.
EDIT2:
As can be seen in the above Swift Forums thread:
// Note that tiny negative values like:
let x: Float = -1e-08
// May produce results outside the [0, 1) range:
let wrapped = x - floor(x)
print(wrapped < 1.0) // false
// which may result in out-of-bounds table accesses
// in common usage, so it's probably better to use:
let correctlyWrapped = simd_fract(x)
print(correctlyWrapped < 1.0) // true
I have since updated the code to account for this.

Do not understand result of OpenCL select statement

I have a simple kernel in OpenCL that has the following structure:
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    printf("input %d\n", (int)(input[i] != 0.0));
    output[i] = select((float)0.0, (float)1.0, (int)(input[i] != 0.0));
    //output[i] = select((float)0.0, (float)1.0, 1);
}
Equivalently this can be:
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    printf("input %d\n", (int)(input[i] != 0.0));
    output[i] = input[i] != 0.0 ? 1.0 : 0.0;
    //output[i] = 1 ? 1.0 : 0.0;
}
When I print to the command line, I see:
input 1
input 1
input 1
But the output array has all 0.0. However, if I uncomment the last line of the kernel and comment out the second-to-last-line (meaning if I use the scalar 1 in the select statement) then it works as expected and the output array has all 1.0. So what is the difference between these two lines that leads to two different results?

Here is the answer.
It's a quirk in OpenCL. The problem is that true/false values for scalars are 1/0 (as printf has shown you), but true/false values for vectors are -1/0, and this is also what select() expects in its last argument (more precisely, it expects the MSB to be set, which means any negative integer).
That said, I think the ternary operator on scalars should still work as expected; if it doesn't, I would consider it a bug.
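The difference between the two truth conventions can be illustrated in plain C. The sketch below is not OpenCL and says nothing about what a particular OpenCL implementation does; it only models the MSB rule as stated in the answer above, and select_by_msb is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical emulation of the rule described above: pick b when the
 * most significant bit of c is set, otherwise pick a. */
float select_by_msb(float a, float b, int32_t c)
{
    /* Convert to unsigned so that testing the top bit is well defined. */
    uint32_t bits = (uint32_t)c;
    return (bits & UINT32_C(0x80000000)) ? b : a;
}
```

Under this rule a condition value of 1 (scalar "true") selects a, because its top bit is clear, while -1 (vector "true") selects b.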

Parallel derivatives of multidimensional real data with FFTW

I would like to build a 2D MPI-parallel spectral differentiation code.
The following piece of code seems to work fine for the x-derivative, both in serial and in parallel:
alloc_local = fftw_mpi_local_size_2d(N0,N1,MPI_COMM_WORLD,&local_n0, &local_0_start);
fplan_u = fftw_mpi_plan_dft_2d(N0,N1,ptr_u,uhat,MPI_COMM_WORLD,FFTW_FORWARD,FFTW_ESTIMATE);
bplan_x = fftw_mpi_plan_dft_2d(N0,N1,uhat_x,ptr_ux,MPI_COMM_WORLD,FFTW_BACKWARD,FFTW_ESTIMATE);
ptr_u = fftw_alloc_real(alloc_local);
ptr_ux = fftw_alloc_real(alloc_local);
uhat = fftw_alloc_complex(alloc_local);
uhat_x = fftw_alloc_complex(alloc_local);
fftw_execute(fplan_u);
// first renormalize the transform...
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
uhat[j*N1+i] /= (double)(N1*local_n0);
// then compute the x-derivative
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1/2;i++)
uhat_x[j*N1+i] = I*pow(i,1)*uhat[j*N1+i]/(double)N0;
for (int j=0;j<local_n0;j++)
for (int i=N1/2;i<N1;i++)
uhat_x[j*N1+i] = -I*pow(N1-i,1)*uhat[j*N1+i]/(double)N0;
fftw_execute(bplan_x);
However, the following code for the y-derivative works well only in serial, and not in parallel.
bplan_y = fftw_mpi_plan_dft_2d(N0,N1,uhat_y,ptr_uy,MPI_COMM_WORLD,FFTW_BACKWARD,FFTW_ESTIMATE);
fftw_execute(fplan_u);
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
uhat[j*N1+i] /= (double)(N1*local_n0);
if (size == 1){ // in the serial case do this
for (int j=0;j<local_n0/2;j++)
for (int i=0;i<N1;i++)
uhat_y[j*N1+i] = I*pow(j,1)*uhat[j*N1+i]/(double)N0;
for (int j=local_n0/2;j<local_n0;j++)
for (unsigned int i=0;i<N1;i++)
uhat_y[j*N1+i] = -I*pow(N1-j,1)*uhat[j*N1+i]/(double)N0;
} else { // in the parallel case instead do this
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
if (rank <(size/2))
uhat_y[j*N1+i] = I*pow(j+local_n0*rank,1)*uhat[j*N1+i]/(double)N0;
else
uhat_y[j*N1+i] = -I*pow(N1-(j+local_n0*rank),1)*uhat[j*N1+i]/(double)N0;
}
fftw_execute(bplan_y);
where the conditional expression if(rank<(size/2)) is used for dealiasing purposes (because I am doing FFTs of real data). The dealiasing appears to be correctly coded in serial for the y derivative, and also in parallel for the x derivative.
I know that FFTW provides a series of functions specifically for this case (fftw_mpi_plan_dft_r2c and ..._c2r), but I would still like to use the complex-to-complex routines, because in the future I might consider complex data.
Of course, using fftw_mpi_plan_transpose to transpose the data, take an x-derivative and then transpose back works in parallel, but I would like to avoid this because in the future I plan to solve diffusive-like problems implicitly, e.g.
uhat[i*N0+j] = fhat[i*N0+j]/(pow((double)i,2)+pow((double)j,2));
and this would not be possible with the transpose plans.
Edit: strangely enough, if I run the above code with 3 MPI processes, e.g.:
my_rank: 0 alloc_local: 87552 local_n0: 171 local_0_start: 0
my_rank: 1 alloc_local: 87552 local_n0: 171 local_0_start: 171
my_rank: 2 alloc_local: 87040 local_n0: 170 local_0_start: 342
I get results that are still wrong, but look much closer to the correct solution than what I obtain with 2, 4, or 8 processes.
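For reference, the usual way to obtain a signed wavenumber is from the global index of an entry, which on each rank is local_0_start + j for local row j. The following is a minimal sketch under that assumption, not a confirmed fix for the problem above: fft_wavenumber and fft_row_wavenumber are hypothetical helper names, and the split at n/2 simply mirrors the loops in the question. Note that in the 3-process output above, rank 2 reports local_0_start 342, whereas local_n0*rank would give 340.

```c
#include <stddef.h>

/* Signed wavenumber for a global index into a length-n DFT output:
 * indices below n/2 map to k = index, the rest to k = index - n.
 * The split at n/2 mirrors the loops in the question. */
long fft_wavenumber(ptrdiff_t global_index, ptrdiff_t n)
{
    return (long)(global_index < n / 2 ? global_index : global_index - n);
}

/* Hypothetical helper: wavenumber of local row j on a rank whose slab
 * starts at global row local_0_start (the value reported by
 * fftw_mpi_local_size_2d). */
long fft_row_wavenumber(ptrdiff_t j, ptrdiff_t local_0_start, ptrdiff_t n0)
{
    return fft_wavenumber(local_0_start + j, n0);
}
```

With n0 = 512, the first row of a rank whose slab starts at 342 gets wavenumber -170, and the sign change happens at global row 256 regardless of which rank owns it.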
