Why are input and output array values interchanged in OpenCL when I use a pipe with two kernels (producer and consumer)?

Here is my kernel code.
Producer kernel, which writes the input data into the pipe (pipeWrite function):
__kernel
void pipeWrite(
    __global int *src,
    __write_only pipe int out_pipe)
{
    int gid = get_global_id(0);
    reserve_id_t res_id;
    res_id = reserve_write_pipe(out_pipe, 1);
    if(is_valid_reserve_id(res_id))
    {
        if(write_pipe(out_pipe, res_id, 0, &src[gid]) != 0)
        {
            return;
        }
        commit_write_pipe(out_pipe, res_id);
    }
}
Consumer kernel, which reads the data back from the pipe (pipeRead function):
__kernel
void pipeRead(
    __read_only pipe int in_pipe,
    __global int *dst)
{
    int gid = get_global_id(0);
    reserve_id_t res_id;
    res_id = reserve_read_pipe(in_pipe, 1);
    if(is_valid_reserve_id(res_id))
    {
        if(read_pipe(in_pipe, res_id, 0, &dst[gid]) != 0)
        {
            return;
        }
        commit_read_pipe(in_pipe, res_id);
    }
}
GPU info
Max compute units 11
SIMD per compute unit (AMD) 4
SIMD width (AMD) 32
SIMD instruction width (AMD) 1
Max clock frequency 1900MHz
Graphics IP (AMD) 10.1
Device Partition (core)
Max number of sub-devices 11
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Wavefront width (AMD) 32
However, the values are correct when I use a global size of 32 (or 64) with a local size of 32 (or 64),
which I guess is connected to the wavefront width, which is 32.
My question is: how can I get an output array whose values match the input values?

Related

Bizarre output from Arduino

I'm trying to do something relatively simple with the Arduino (getting some lights to light up like a Simon Says game), but I'm getting some really bizarre outputs.
I got some really bizarre output on the pins so I took those parts of the code out to see it on a serial monitor to see what the contents of the array that holds the sequence of lights (colors) are. It just really doesn't make sense to me.
What my code is supposed to do is append a random number from 1 to 4 onto colors[] and then read the values back out onto the serial monitor:
bool lightValue[4] = { 0 };
// stores whether each LED is lit or not
int buttonPressed;
// 0 - no button pressed / nothing
// 1 - red
// 2 - yellow
// 3 - green
// 4 - blue
int colors[] = { 0 };
// probably not the best name for this variable but oh well
// stores all the colors displayed for the simon says
// refer above to see what each number means
int colorNum = 0;
// again not the best name, stores the number of colors displayed so far
// AKA the length of colors[]
int randomNum;
// will store a random number
// variables
void setup() {
randomSeed(analogRead(0));
Serial.begin(9600);
Serial.println();
Serial.println("PROGRAM START");
}
// pinModes. Lots of pinModes.
void loop() {
randomNum = random(1,5);
colors[colorNum] = randomNum;
Serial.println();
Serial.print(colorNum);
Serial.print(" ");
colorNum++;
// adds another random color onto the end of the color sequence
for (int i = 0; i < colorNum; i++) {
Serial.print(colors[i]);
delay(500);
}
}
Some examples of outputs I got:
0 3
0 1
2 13520
3 145202
4 1552024
5 16520241
6 175202414
7 1852024141
8 19520241414
9 1105202414141
10 11152024141414
0 1
2 13520
3 145202
4 1552024
5 16520241
6 175202414
7 1852024141
8 19520241414
9 1105202414141
10 11152024141414
colorNum, the main increment of this loop for some reason skips over one. The first and second output do not match, the third item in the array is 520, and for some reason, the second item is incrementing by 1 every step. Also, it stops at 10 for some reason.
The only thing I could chalk this inconsistent behavior up to is accessing some piece of memory that it shouldn't, but I can't for the life of me figure out where I messed up so badly.
int colors[] = { 0 };
defines an integer array with a single element, 0.
With colors[colorNum] = randomNum; you are writing to indices outside of that array whenever colorNum > 0. You must not do that: this memory region is not reserved for colors!
Nothing stops your compiler from storing colorNum right after colors.
So when you assign a value to colors[1], you may well be changing the value of colorNum. The same goes for your loop control variable i.
So the second value is incremented because you're incrementing colorNum, which is at the same memory location as colors[1].
The print for colorNum == 1 is missing because you assigned 5 to colors[2], which is at the same memory location as your loop control variable i. As 5 > colorNum, the loop does not run.
I just did this on a 32bit C++ compiler:
int colors[] = {0};
int colorNum = 0;
int i = 0;
And the addresses printed:
colors[0] # 0x7fff029a5ac4
colorNum # 0x7fff029a5ac8
colors[1] # 0x7fff029a5ac8
i # 0x7fff029a5acc
colors[2] # 0x7fff029a5acc
Note that colorNum is just 4 bytes after colors[0], which is the same address as colors[1]!
Anyway, you shouldn't keep filling memory in an infinite loop in the first place.
You're on a microcontroller, where memory is a limited resource.
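One way to avoid the out-of-bounds writes described above is to reserve the storage up front and refuse to append once it is full. The following is only a minimal sketch in plain C++ so it can be tried off the board; the names kMaxColors, ColorSequence and appendColor are made up for illustration, and the limit of 32 is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical upper bound on the sequence length (an assumption, not from the
// original sketch); pick a value that fits the board's RAM.
constexpr std::size_t kMaxColors = 32;

struct ColorSequence {
    int colors[kMaxColors] = {0};  // storage is reserved for every slot up front
    std::size_t count = 0;         // number of valid entries in colors[]
};

// Appends a color (1-4) and returns true, or returns false when the array is full.
bool appendColor(ColorSequence &seq, int color) {
    assert(color >= 1 && color <= 4);
    if (seq.count >= kMaxColors) {
        return false;  // refuse to write past the end of colors[]
    }
    seq.colors[seq.count++] = color;
    return true;
}
```

In loop() you would call something like appendColor(seq, random(1, 5)) and stop growing the sequence (or end the game) when it returns false, instead of incrementing the index forever.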

OpenCL vstoren does not store vector in scalar array

I have the kernel as below.
My question is why is vstore8 not working? When the output is printed in the host code, it only returns 0s.
I put an "if(all(v == 0) == 1)" in the code to check whether the error was caused when I copy the values from int4* to int8 in v, but it was not that.
It seems like vstoren is doing nothing.
I am new to OpenCL so any help is appreciated.
__kernel void select_vec(__global int4 *input1,
__global int *input2,
__global int *output){
//copy values in input arrays to vectors
int i = get_global_id(0);
int4 vA = input1[i];
int4 vB = input1[i+1];
__private int8 v = (int8)(vA.s0, vA.s1, vA.s2, vA.s3, vB.s0, vB.s1, vB.s2, vB.s3);
__private int8 v1 = vload8(0, input2);
__private int8 v2 = vload8(1, input2);
int8 results;
if(any(v > 10) == 1){
//if there is any of the elements in v that are greater than 10
// copy the corresponding elements from v1 for elements greater than 10
// for elements less than or equal to 17, copy the corresponding elements from v2
results = select(v1, v2, v > 10);
}else{
//results is the combination of the first half of v1 and the first half of v2
results = (int8) (v1.lo, v2.lo);
}
/* for testing of the error is due to vstoren */
// results = (int8) (1);
//store results in output array
vstore8(results, i, output);
}
Do you mean int8 v1 = vload8(i+0, input2);, int8 v2 = vload8(i+1, input2); and vstore8(results, i, output);?
Currently you read from the same memory addresses in input2 (0-7 for v1 and 8-15 for v2) and write to the same memory address in output (0-7) with all threads. This is a race condition because depending on v and the last thread writing to output, you can get randomly different results. But if input2 starts with 0s in addresses 0-15 and output is initialized with all 0s, it will remain all 0s.
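To make the indexing concrete: vload8(offset, p) and vstore8(data, offset, p) address elements offset*8 through offset*8+7 of p. The host-side C++ sketch below only imitates that addressing rule; hostVload8 and hostVstore8 are hypothetical stand-ins written for illustration, not OpenCL functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for vload8(offset, p): reads p[offset*8 .. offset*8+7].
std::array<int, 8> hostVload8(std::size_t offset, const std::vector<int> &p) {
    assert((offset + 1) * 8 <= p.size());
    std::array<int, 8> v{};
    for (std::size_t k = 0; k < 8; ++k) {
        v[k] = p[offset * 8 + k];
    }
    return v;
}

// Host-side stand-in for vstore8(data, offset, p): writes p[offset*8 .. offset*8+7].
void hostVstore8(const std::array<int, 8> &data, std::size_t offset,
                 std::vector<int> &p) {
    assert((offset + 1) * 8 <= p.size());
    for (std::size_t k = 0; k < 8; ++k) {
        p[offset * 8 + k] = data[k];
    }
}
```

With a constant offset such as 0 or 1, every work-item touches the same eight elements; using an offset derived from the work-item index gives each work-item its own block of eight.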

Why does this binary math fail when adding 00000001, but work correctly otherwise?

I've tried everything I can think of and cannot seem to get the below binary math logic to work. Not sure why this is failing but probably indicates my misunderstanding of binary math or C. The ultimate intent is to store large integers (unsigned long) directly to an 8-bit FRAM memory module as 4-byte words so that a micro-controller (Arduino) can recover the values after a power failure. Thus the unsigned long has to be assembled from its four byte words parts as it's pulled from memory, and the arithmetic of assembling these word bytes is not working correctly.
In the snippet of code below, the long value is defined as four bytes A, B, C, and D (simulating being pulled from four 8-bit memory blocks), which get translated to decimal notation to be used as an unsigned long in the arrangement DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA. If A < 256 and B, C, D all == 0, the math works correctly. The math also works correctly for any values of B, C, and D if A == 0. But if B, C, or D > 0 and A == 1, the 1 value of A is not added during the arithmetic. A value of 2 works, but not a value of 1. Is there any reason for this? Or am I doing binary math wrong? Is this a known issue that needs a workaround?
// ---- FUNCTIONS
unsigned long fourByte_word_toDecimal(uint8_t byte0 = B00000000, uint8_t byte1 = B00000000, uint8_t byte2 = B00000000, uint8_t byte3 = B00000000){
return (byte0 + (byte1 * 256) + (byte2 * pow(256, 2)) + (byte3 * pow(256, 3)));
}
// ---- MAIN
void setup() {
Serial.begin(9600);
uint8_t addressAval = B00000001;
uint8_t addressBval = B00000001;
uint8_t addressCval = B00000001;
uint8_t addressDval = B00000001;
uint8_t addressValArray[4];
addressValArray[0] = addressAval;
addressValArray[1] = addressBval;
addressValArray[2] = addressCval;
addressValArray[3] = addressDval;
unsigned long decimalVal = fourByte_word_toDecimal(addressValArray[0], addressValArray[1], addressValArray[2], addressValArray[3]);
// Print out resulting decimal value
Serial.println(decimalVal);
}
In the code above, the binary value should result as 00000001000000010000000100000001, AKA a decimal value of 16843009. But the code evaluates the decimal value to 16843008. Changing the value of addressAval to 00000000 also evaluates (correctly) to 16843008, and changing addressAval to 00000010 also correctly evaluates to 16843010.
I'm stumped.
The problem is that you're using pow(). It returns a floating-point value, so the whole sum is calculated in floating point. On AVR-based Arduino boards double is a 32-bit binary32, which doesn't have enough precision to hold 16843009.
>>> numpy.float32(16843009)
16843008.0
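The fix is to use integers, specifically 65536 and 16777216UL.
Do not use pow() for this.
The usual way to do this is with the shift operator. Cast each byte to uint32_t before shifting, because on boards with a 16-bit int the promoted operand would otherwise overflow:
uint32_t result = (uint32_t)byte3 << 24 | (uint32_t)byte2 << 16 | (uint32_t)byte1 << 8 | (uint32_t)byte0;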
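The difference can be reproduced on a desktop compiler. This is a small sketch, assuming float is IEEE binary32 on the machine you run it on; fourBytesToU32 and fourBytesViaFloat are names invented here, and the second function only imitates what the pow()-based sum does when everything is single precision.

```cpp
#include <cstdint>

// Reassembles four bytes (byte0 = least significant) using integer shifts only.
std::uint32_t fourBytesToU32(std::uint8_t byte0, std::uint8_t byte1,
                             std::uint8_t byte2, std::uint8_t byte3) {
    return (static_cast<std::uint32_t>(byte3) << 24) |
           (static_cast<std::uint32_t>(byte2) << 16) |
           (static_cast<std::uint32_t>(byte1) << 8) |
           static_cast<std::uint32_t>(byte0);
}

// Imitates the original sum when it is carried out in 32-bit floating point.
// A binary32 value has a 24-bit significand, so odd numbers above 2^24 cannot
// be stored exactly and get rounded to a neighbouring even number.
std::uint32_t fourBytesViaFloat(std::uint8_t byte0, std::uint8_t byte1,
                                std::uint8_t byte2, std::uint8_t byte3) {
    float sum = byte0 + byte1 * 256.0f + byte2 * 65536.0f + byte3 * 16777216.0f;
    return static_cast<std::uint32_t>(sum);
}
```

For the bytes 1, 1, 1, 1 the shift version gives 16843009, while the float version gives 16843008, matching the behavior in the question. With byte0 set to 2 both agree on 16843010, because even values in that range are still representable.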

Knapsack algorithm: strange behavior with pown() on the gpu

The OpenCL version running on the CPU produces correct results, whereas the version running on the GPU gives slightly different values in a few places, which in the end makes the overall result incorrect. I have debugged with the Intel OCL SDK, where I get correct results. I haven't spotted any race condition or concurrent access to memory. The problem appeared after I introduced the pown function into the kernel (one line of code).
void kernel knapsack(global int *input_f, global int *output_f, global uint *m_d, int cmax, int weightk, int pk, int maxelem, int i){
int c = get_global_id(0)+cmax;
if(get_global_id(0)<maxelem){
if(input_f[c] < input_f[c - weightk] + pk){
output_f[c] = input_f[c - weightk] + pk;
m_d[c-1] = pown(2.0,i); //previous version: m_d[c-1] = 1;
}
else{
output_f[c] = input_f[c];
}
}
}
The purpose of pown is to compress the m_d buffer which holds the outcomes.
For example:
1 0 1 0
0 1 0 1   =>   2^0+2^2, 2^1, 2^0, 2^1
1 0 0 0
On the GPU I get something like this:
2^0+2^2, 2^1, 2^0+2^2, 2^1
In the 3rd column pown gets applied one extra time, when it is not supposed to be.
This gives me that "slightly" different result.
Here you can find full code
This work is based on this article:
Solving knapsack problems on GPU by V. Boyer, D. El Baz, M. Elkihel
related work: Accelerating the knapsack problem on GPUs by Bharath Suri
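For reference, the compression scheme from the example above (one word per column, with bit i set when row i holds a 1) can be written on the host with integer shifts and OR. This is only an illustrative sketch that uses 1 << i in place of pown(2.0, i); packDecisions is a made-up name, and it assumes at most 32 rows and rows of equal length. It is not taken from the linked code and does not claim to explain the GPU discrepancy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// decisions[i][c] is 1 when row (item) i is marked in column c.
// Returns one word per column with bit i set for every marked row i.
// Assumes all rows have the same length; rows beyond the 32nd are ignored.
std::vector<std::uint32_t> packDecisions(
        const std::vector<std::vector<int>> &decisions) {
    if (decisions.empty()) {
        return {};
    }
    std::vector<std::uint32_t> packed(decisions[0].size(), 0u);
    for (std::size_t i = 0; i < decisions.size() && i < 32; ++i) {
        for (std::size_t c = 0; c < packed.size(); ++c) {
            if (decisions[i][c] != 0) {
                packed[c] |= (std::uint32_t{1} << i);  // set bit i, keep earlier bits
            }
        }
    }
    return packed;
}
```

For the 3x4 matrix in the example this yields 5, 2, 1, 2, that is 2^0+2^2, 2^1, 2^0, 2^1.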

Issue with OpenCL stencil code

I have a problem with a 4-point stencil OpenCL code. The code runs fine, but I don't get the symmetric final 2D values that I expect.
I suspect the problem lies in how the values are updated in the kernel code. Here's the kernel code:
// kernel code
const char *source ="__kernel void line_compute(const double diagx, const double diagy,\
const double weightx, const double weighty, const int size_x,\
__global double* tab_new, __global double* r)\
{ int iy = get_global_id(0)+1;\
int ix = get_global_id(1)+1;\
double new_value, cell, cell_n, cell_s, cell_w, cell_e;\
double rk;\
cell_s = tab_new[(iy+1)*(size_x+2)+ix];\
cell_n = tab_new[(iy-1)*(size_x+2)+ix];\
cell_e = tab_new[iy*(size_x+2)+(ix+1)];\
cell_w = tab_new[iy*(size_x+2)+(ix-1)];\
cell = tab_new[iy*(size_x+2)+ix];\
new_value = weighty *( cell_n + cell_s + cell*diagy)+\
weightx *( cell_e + cell_w + cell*diagx);\
rk = cell - new_value;\
r[iy*(size_x+2)+ix] = rk *rk;\
barrier(CLK_GLOBAL_MEM_FENCE);\
tab_new[iy*(size_x+2)+ix] = new_value;\
}";
cell_s, cell_n, cell_e, cell_w represent the 4 neighbor values of the 2D stencil. I compute new_value and write it back after a "barrier(CLK_GLOBAL_MEM_FENCE)".
However, it seems there are conflicts between different work-items. How could I fix this?
The barrier(CLK_GLOBAL_MEM_FENCE) you use will not synchronize all work-items as intended. It only synchronizes the work-items within a single work-group.
Usually not all work-groups are executed at the same time, because they are scheduled on only a small number of physical cores, and global synchronization is not possible within a kernel.
The solution is to write the output to a different buffer, so that no work-item reads a cell that another work-item has already updated.
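As a rough host-side illustration of the two-buffer idea, here is a C++ sketch of one sweep that reads only from an old grid and writes only to a new one. The function name stencilStep and the sizeY parameter are assumptions made for this example; the formula and the (size_x+2) row stride with a one-cell border follow the kernel in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One sweep over the interior of a (sizeY+2) x (sizeX+2) grid with a border.
// Reads only from oldGrid and writes only to newGrid, so no cell ever sees a
// neighbour that was already updated during the same sweep.
void stencilStep(const std::vector<double> &oldGrid, std::vector<double> &newGrid,
                 std::size_t sizeX, std::size_t sizeY,
                 double diagx, double diagy, double weightx, double weighty) {
    const std::size_t stride = sizeX + 2;
    assert(oldGrid.size() == stride * (sizeY + 2));
    assert(newGrid.size() == oldGrid.size());
    for (std::size_t iy = 1; iy <= sizeY; ++iy) {
        for (std::size_t ix = 1; ix <= sizeX; ++ix) {
            const std::size_t idx = iy * stride + ix;
            const double cell = oldGrid[idx];
            const double cellN = oldGrid[idx - stride];
            const double cellS = oldGrid[idx + stride];
            const double cellW = oldGrid[idx - 1];
            const double cellE = oldGrid[idx + 1];
            newGrid[idx] = weighty * (cellN + cellS + cell * diagy) +
                           weightx * (cellE + cellW + cell * diagx);
        }
    }
}
```

In the OpenCL version the same effect would come from passing two separate buffers to the kernel (one read, one written) and swapping them on the host between kernel launches.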
