OpenCL vstoren does not store vector in scalar array

I have the kernel below.
My question is: why is vstore8 not working? When the output is printed in the host code, it only returns 0s.
I put an "if(all(v == 0) == 1)" in the code to check whether the error occurs when I copy the values from the two int4 vectors into the int8 v, but that is not the cause.
It seems like vstoren is doing nothing.
I am new to OpenCL, so any help is appreciated.
__kernel void select_vec(__global int4 *input1,
                         __global int *input2,
                         __global int *output){
    //copy values in input arrays to vectors
    int i = get_global_id(0);
    int4 vA = input1[i];
    int4 vB = input1[i+1];
    __private int8 v = (int8)(vA.s0, vA.s1, vA.s2, vA.s3, vB.s0, vB.s1, vB.s2, vB.s3);
    __private int8 v1 = vload8(0, input2);
    __private int8 v2 = vload8(1, input2);
    int8 results;
    if(any(v > 10) == 1){
        //if any of the elements in v is greater than 10:
        // where an element of v is greater than 10, copy the corresponding element from v2
        // where an element is less than or equal to 10, copy the corresponding element from v1
        results = select(v1, v2, v > 10);
    }else{
        //results is the combination of the first halves of v1 and v2
        results = (int8) (v1.lo, v2.lo);
    }
    /* for testing whether the error is due to vstoren */
    // results = (int8) (1);
    //store results in output array
    vstore8(results, i, output);
}

Do you mean int8 v1 = vload8(i+0, input2);, int8 v2 = vload8(i+1, input2); and vstore8(results, i, output);?
Currently, every work item reads from the same memory addresses in input2 (elements 0-7 for v1 and 8-15 for v2) and writes to the same memory addresses in output (elements 0-7). This is a race condition: depending on v and on which work item writes to output last, you can get randomly different results. But if input2 starts with 0s at addresses 0-15 and output is initialized to all 0s, the output stays all 0s.
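For reference, here is a minimal sketch of the kernel with that per-work-item indexing (this assumes each work item is meant to handle its own 8-element block of output; note that consecutive work items then read overlapping regions of input1 and input2, which may or may not be intended):
__kernel void select_vec(__global int4 *input1,
                         __global int *input2,
                         __global int *output){
    int i = get_global_id(0);
    int4 vA = input1[i];
    int4 vB = input1[i+1];
    int8 v = (int8)(vA, vB);          // pack the two int4 values into one int8
    int8 v1 = vload8(i, input2);      // elements 8*i .. 8*i+7 of input2
    int8 v2 = vload8(i+1, input2);    // elements 8*i+8 .. 8*i+15 of input2
    int8 results;
    if(any(v > 10)){
        // select() takes its second operand where the MSB of the mask is set (v > 10 gives -1 there)
        results = select(v1, v2, v > 10);
    }else{
        results = (int8)(v1.lo, v2.lo);
    }
    vstore8(results, i, output);      // each work item writes its own 8-element block
}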

Related

do not understand result of opencl select statement

I have a simple kernel in OpenCL that has the following structure:
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    printf("input %d\n", (int)(input[i] != 0.0));
    output[i] = select((float)0.0, (float)1.0, (int)(input[i] != 0.0));
    //output[i] = select((float)0.0, (float)1.0, 1);
}
Equivalently this can be:
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    printf("input %d\n", (int)(input[i] != 0.0));
    output[i] = input[i] != 0.0 ? 1.0 : 0.0;
    //output[i] = 1 ? 1.0 : 0.0;
}
When I print to the command line, I see:
input 1
input 1
input 1
But the output array is all 0.0. However, if I uncomment the last line of the kernel and comment out the second-to-last line (that is, if I use the scalar 1 in the select statement), then it works as expected and the output array is all 1.0. So what is the difference between these two lines that leads to two different results?
It's a quirk in OpenCL. The problem is that true/false values for scalars are 1/0 (as printf has shown you), but true/false values for vectors are -1/0, and the latter is also what select() expects in its last argument (more precisely, it expects the MSB to be set, which means any negative integer).
That said, I think the ternary operator on scalars should still work as expected; if it doesn't, I would consider it a bug.
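A minimal sketch of one workaround, assuming the MSB rule described above also applies to the scalar form on your implementation: negate the comparison result so that a true condition becomes -1 (MSB set) instead of 1.
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    // -(input[i] != 0.0) is 0 or -1, so the MSB is set exactly when the condition holds
    output[i] = select(0.0f, 1.0f, -(int)(input[i] != 0.0));
}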

Why does this binary math fail when adding 00000001, but work correctly otherwise?

I've tried everything I can think of and cannot get the binary math below to work. I'm not sure why it fails, but it probably indicates a misunderstanding of binary math or C on my part. The ultimate intent is to store large integers (unsigned long) directly to an 8-bit FRAM memory module as 4-byte words so that a micro-controller (Arduino) can recover the values after a power failure. Thus the unsigned long has to be reassembled from its four bytes as they are pulled from memory, and the arithmetic of assembling those bytes is not working correctly.
In the snippet of code below, the long value is defined as four bytes A, B, C, and D (simulating being pulled from four 8-bit memory blocks), which get combined into an unsigned long with the bit arrangement DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA. If A < 256 and B, C, D are all 0, the math works correctly. The math also works correctly for any values of B, C, and D if A == 0. But if B, C, or D > 0 and A == 1, the 1 from A is not added during the arithmetic. A value of 2 works, but not a value of 1. Is there any reason for this? Am I doing binary math wrong? Is this a known issue that needs a workaround?
// ---- FUNCTIONS
unsigned long fourByte_word_toDecimal(uint8_t byte0 = B00000000, uint8_t byte1 = B00000000, uint8_t byte2 = B00000000, uint8_t byte3 = B00000000){
    return (byte0 + (byte1 * 256) + (byte2 * pow(256, 2)) + (byte3 * pow(256, 3)));
}

// ---- MAIN
void setup() {
    Serial.begin(9600);

    uint8_t addressAval = B00000001;
    uint8_t addressBval = B00000001;
    uint8_t addressCval = B00000001;
    uint8_t addressDval = B00000001;

    uint8_t addressValArray[4];
    addressValArray[0] = addressAval;
    addressValArray[1] = addressBval;
    addressValArray[2] = addressCval;
    addressValArray[3] = addressDval;

    unsigned long decimalVal = fourByte_word_toDecimal(addressValArray[0], addressValArray[1], addressValArray[2], addressValArray[3]);

    // Print out resulting decimal value
    Serial.println(decimalVal);
}
In the code above, the binary value should come out as 00000001000000010000000100000001, i.e. a decimal value of 16843009. But the code evaluates the decimal value to 16843008. Changing addressAval to 00000000 also evaluates (correctly) to 16843008, and changing addressAval to 00000010 correctly evaluates to 16843010.
I'm stumped.
The problem is that you're using pow(). This causes the whole expression to be evaluated in floating point as a binary32 (on AVR-based Arduinos, double is only 32 bits wide), which doesn't have enough precision to hold 16843009.
>>> numpy.float32(16843009)
16843008.0
The fix is to use integers, specifically 65536 and 16777216UL.
Do not use pow() for this.
The usual way to do this is with the shift operator, casting each byte up to 32 bits before shifting (on AVR, plain int is only 16 bits, so shifting an unpromoted byte left by 24 would lose the bits entirely):
uint32_t result = (uint32_t(byte3) << 24) | (uint32_t(byte2) << 16) | (uint32_t(byte1) << 8) | byte0;
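Folding that fix back into the original helper gives something like this (just a sketch of the same shift-based approach):
unsigned long fourByte_word_toDecimal(uint8_t byte0 = B00000000, uint8_t byte1 = B00000000, uint8_t byte2 = B00000000, uint8_t byte3 = B00000000){
    // promote each byte to 32 bits before shifting so no bits are lost
    return ((unsigned long)byte3 << 24) | ((unsigned long)byte2 << 16) | ((unsigned long)byte1 << 8) | byte0;
}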

pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work item size

I am attempting to implement the dot_persist_kernel() shown here in Python using pyopencl, and I've been squashing numerous bugs along the way. But I've stumbled upon an issue that I can't crack:
self.program = cl.Program(self.ctx, code).build()
# code is a string with the code from the link given
a = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
b = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
c = 0.
mf = cl.mem_flags
c_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, 4)
MAX_COMPUTE_UNITS = cl.get_platforms()[0].get_devices()[0].max_compute_units
WORK_GROUPS_PER_CU = MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = a.size / WORK_GROUPS_PER_CU
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / 256
self.program.DotProduct(self.queue, a.shape, a.shape,
                        a.data, b.data, c_buf,
                        np.uint32(ELEMENTS_PER_GROUP),
                        np.uint32(ELEMENTS_PER_WORK_ITEM),
                        np.uint32(1028 * MAX_COMPUTE_UNITS))
Assuming an array of size 2^26, the constants will have values of:
MAX_COMPUTE_UNITS = 32 // from get_devices()[0].max_compute_units
WORK_GROUPS_PER_CU = 128 // MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = 524288 // 2^19
ELEMENTS_PER_WORK_ITEM = 2048 // 2^11
The kernel header looks like:
#define LOCAL_GROUP_XDIM 256
// Kernel for part 1 of dot product, version 3.
__kernel __attribute__((reqd_work_group_size(LOCAL_GROUP_XDIM, 1, 1)))
void dot_persist_kernel(
    __global const double * x,  // input vector
    __global const double * y,  // input vector
    __global double * r,        // result vector
    uint n_per_group,           // elements processed per group
    uint n_per_work_item,       // elements processed per work item
    uint n                      // input vector size
)
The error it gives is:
Traceback (most recent call last):
  File "GPUCompute.py", line 102, in <module>
    gpu = GPUCompute()
  File "GPUCompute.py", line 87, in __init__
    np.uint32(1028 * MAX_COMPUTE_UNITS))
  File "C:\Miniconda2\lib\site-packages\pyopencl\__init__.py", line 512, in kernel_call
    global_offset, wait_for, g_times_l=g_times_l)
pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work item size
I've tried shifting the numbers around a lot, to no avail. Ideas?
There were a few issues going on with the previous implementation, but this one is working:
WORK_GROUPS = cl.get_platforms()[0].get_devices()[0].max_compute_units * 4
ELEMENTS_PER_GROUP = np_a.size / WORK_GROUPS
LOCAL_GROUP_XDIM = 256
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / LOCAL_GROUP_XDIM
self.program = cl.Program(self.ctx, kernel).build()
self.program.DotProduct(
    self.queue, np_a.shape, (LOCAL_GROUP_XDIM,),  # kernel information
    cl_a, cl_b, cl_c,                             # data
    np.uint32(ELEMENTS_PER_GROUP),                # elements processed per group
    np.uint32(ELEMENTS_PER_WORK_ITEM),            # elements processed per work item
    np.uint32(np_a.size)                          # input vector size
)
It was the culmination of a few things, but the biggest factor was that the second and third arguments passed to DotProduct() are supposed to be tuples, not ints like I thought. The third argument is the local work size; the kernel requires reqd_work_group_size(LOCAL_GROUP_XDIM, 1, 1), so the local size has to be (LOCAL_GROUP_XDIM,), whereas the original call passed a.shape there, which is far larger than the device's maximum work-group size and triggers the "invalid work item size" error. :)

How can I use `vector <unsigned int*> vec;` properly

I am new to C++ and I want to use vector<unsigned int*> vec;
I tried this code:
vector<unsigned int*> vec;

unsigned int* tmpV = new unsigned int[4];
for(unsigned int i = 0; i < 4; i++){
    tmpV[i] = i;
}
vec.push_back(tmpV);

unsigned int* tmpV2 = vec.at(0);
cout << "A) tmpV2[1]: " << tmpV2[1] << endl;
cout << "vec.size(): " << vec.size() << endl;

for(unsigned int i = 0; i < 4; i++){
    tmpV[i] = i + 4;
}
vec.push_back(tmpV);

tmpV2 = vec.at(0);
cout << "vec.size(): " << vec.size() << endl;
cout << "B) tmpV2[1]: " << tmpV2[1] << endl;
The problem here is that I wanted A) and B) to print the same value, but it outputs:
A) tmpV2[1]: 1
B) tmpV2[1]: 5
I want to be able to handle different elements in this vector of pointers.
I roughly understand why this is happening, but I couldn't find a solution.
Keep in mind that I don't want to use vector<vector<unsigned int>>.
It is because both entries in vec point to the same array, and you modified the values in that array (through tmpV) between the two prints. If you print again after the second push_back, e.g.
cout << "B) tmpV2[1]: " << tmpV2[1] << endl;
cout << "B) vec[1][1]: " << vec[1][1] << endl;
both will show the same result.
What you have done so far is:
You have a vector of integer pointers and one heap-allocated array of four ints.
You filled that array with 0, 1, 2, 3 and pushed its pointer into the vector.
You made a temporary pointer tmpV2 point to the element at index 0 of the vector, i.e. to that same array, and printed the value at index 1 of the array (which was 1).
After that you overwrote all the values in the array (setting each element to i + 4) and pushed the same pointer into the vector a second time.
Now you are printing the value at index 1 of that array again, through the pointer stored at index 0 of the vector, and you see the updated value 5.
Both vector entries point to the same array; you printed a value, modified the array, and printed again after the modification. If you print both vec[1][1] and tmpV2[1] at the end, you will find they are the same.
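If the goal is for vec[0] and vec[1] to hold independent values, each push_back needs a pointer to its own allocation. A minimal sketch of that (keeping the raw-pointer approach from the question, so the arrays must also be deleted when done):
#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<unsigned int*> vec;

    // first element: its own array holding 0, 1, 2, 3
    unsigned int* first = new unsigned int[4];
    for(unsigned int i = 0; i < 4; i++) first[i] = i;
    vec.push_back(first);

    // second element: a separate array holding 4, 5, 6, 7
    unsigned int* second = new unsigned int[4];
    for(unsigned int i = 0; i < 4; i++) second[i] = i + 4;
    vec.push_back(second);

    cout << "A) vec[0][1]: " << vec[0][1] << endl;  // prints 1
    cout << "B) vec[1][1]: " << vec[1][1] << endl;  // prints 5, and vec[0][1] is still 1

    for(unsigned int* p : vec) delete[] p;          // free the heap allocations
    return 0;
}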

issue with OpenCL stencil code

I have a problem with a 4-point stencil OpenCL code. The code runs fine but I don't get the symmetric final 2D values that are expected.
I suspect it is a problem with how values are updated in the kernel code. Here's the kernel code:
// kernel code
const char *source = "__kernel void line_compute(const double diagx, const double diagy,\
    const double weightx, const double weighty, const int size_x,\
    __global double* tab_new, __global double* r)\
{\
    int iy = get_global_id(0)+1;\
    int ix = get_global_id(1)+1;\
    double new_value, cell, cell_n, cell_s, cell_w, cell_e;\
    double rk;\
    cell_s = tab_new[(iy+1)*(size_x+2)+ix];\
    cell_n = tab_new[(iy-1)*(size_x+2)+ix];\
    cell_e = tab_new[iy*(size_x+2)+(ix+1)];\
    cell_w = tab_new[iy*(size_x+2)+(ix-1)];\
    cell = tab_new[iy*(size_x+2)+ix];\
    new_value = weighty *( cell_n + cell_s + cell*diagy)+\
                weightx *( cell_e + cell_w + cell*diagx);\
    rk = cell - new_value;\
    r[iy*(size_x+2)+ix] = rk *rk;\
    barrier(CLK_GLOBAL_MEM_FENCE);\
    tab_new[iy*(size_x+2)+ix] = new_value;\
}";
cell_s, cell_n, cell_e, cell_w represent the 4 neighbouring values of the 2D stencil. I compute new_value and write it back after a barrier(CLK_GLOBAL_MEM_FENCE).
However, it seems there are conflicts between different work-items. How can I fix this?
The barrier(CLK_GLOBAL_MEM_FENCE) you use will not synchronize all work-items as intended. It only synchronizes accesses within a single work-group.
Usually not all work-groups execute at the same time, because they are scheduled onto a limited number of physical cores, and global synchronization is not possible within a kernel.
The solution is to write the output to a different buffer.
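A sketch of that approach (this assumes the host allocates a second buffer and swaps the two between iterations); the read-modify-write conflict disappears because tab_old is only read and tab_new is only written:
__kernel void line_compute(const double diagx, const double diagy,
                           const double weightx, const double weighty,
                           const int size_x,
                           __global const double* tab_old,  // read-only input grid
                           __global double* tab_new,        // output grid, written only
                           __global double* r)
{
    int iy = get_global_id(0)+1;
    int ix = get_global_id(1)+1;
    double cell_s = tab_old[(iy+1)*(size_x+2)+ix];
    double cell_n = tab_old[(iy-1)*(size_x+2)+ix];
    double cell_e = tab_old[iy*(size_x+2)+(ix+1)];
    double cell_w = tab_old[iy*(size_x+2)+(ix-1)];
    double cell   = tab_old[iy*(size_x+2)+ix];
    double new_value = weighty*(cell_n + cell_s + cell*diagy) +
                       weightx*(cell_e + cell_w + cell*diagx);
    double rk = cell - new_value;
    r[iy*(size_x+2)+ix] = rk*rk;
    tab_new[iy*(size_x+2)+ix] = new_value;   // no barrier needed: no work-item reads tab_new
}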
