Issue with OpenCL stencil code

I have a problem with a 4-point stencil OpenCL code. The code runs fine, but the final 2D values are not symmetric as expected.
I suspect the problem is in how values are updated in the kernel. Here's the kernel code:
// kernel code
const char *source = "__kernel void line_compute(const double diagx, const double diagy,\
    const double weightx, const double weighty, const int size_x,\
    __global double* tab_new, __global double* r)\
{\
    int iy = get_global_id(0)+1;\
    int ix = get_global_id(1)+1;\
    double new_value, cell, cell_n, cell_s, cell_w, cell_e;\
    double rk;\
    cell_s = tab_new[(iy+1)*(size_x+2)+ix];\
    cell_n = tab_new[(iy-1)*(size_x+2)+ix];\
    cell_e = tab_new[iy*(size_x+2)+(ix+1)];\
    cell_w = tab_new[iy*(size_x+2)+(ix-1)];\
    cell   = tab_new[iy*(size_x+2)+ix];\
    new_value = weighty*(cell_n + cell_s + cell*diagy) +\
                weightx*(cell_e + cell_w + cell*diagx);\
    rk = cell - new_value;\
    r[iy*(size_x+2)+ix] = rk*rk;\
    barrier(CLK_GLOBAL_MEM_FENCE);\
    tab_new[iy*(size_x+2)+ix] = new_value;\
}";
cell_s, cell_n, cell_e, and cell_w represent the 4 neighbor values of the 2D stencil. I compute new_value and write it back after a "barrier(CLK_GLOBAL_MEM_FENCE)".
However, it seems there are conflicts between different work-items. How can I fix this?

The barrier(CLK_GLOBAL_MEM_FENCE) you use will not synchronize all work-items as intended. It only synchronizes the work-items within one single work-group.
Usually not all work-groups execute at the same time, because they are scheduled onto only a small number of physical cores, and global synchronization is not possible within a kernel. Some work-items can therefore read tab_new after work-items in other work-groups have already overwritten it.
The solution is to write the output to a different buffer, as sketched below.
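A minimal sketch of that double-buffered approach, assuming the host swaps the two grid buffers between iterations (the name tab_old is illustrative, not from the original code):
__kernel void line_compute(const double diagx, const double diagy,
                           const double weightx, const double weighty,
                           const int size_x,
                           __global const double* tab_old, /* input grid, read-only */
                           __global double* tab_new,       /* output grid */
                           __global double* r)
{
    int iy = get_global_id(0)+1;
    int ix = get_global_id(1)+1;
    /* every read targets tab_old, so no work-item can observe a
       partially updated grid and no barrier is needed at all */
    double cell_s = tab_old[(iy+1)*(size_x+2)+ix];
    double cell_n = tab_old[(iy-1)*(size_x+2)+ix];
    double cell_e = tab_old[iy*(size_x+2)+(ix+1)];
    double cell_w = tab_old[iy*(size_x+2)+(ix-1)];
    double cell   = tab_old[iy*(size_x+2)+ix];
    double new_value = weighty*(cell_n + cell_s + cell*diagy) +
                       weightx*(cell_e + cell_w + cell*diagx);
    double rk = cell - new_value;
    r[iy*(size_x+2)+ix] = rk*rk;
    tab_new[iy*(size_x+2)+ix] = new_value;
}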

Related

Why, in OpenCL, when I use a pipe between two kernels (producer and consumer), are the input and output array values interchanged?

Here is my kernel code. The producer kernel writes the input data into the pipe using the pipeWrite function:
__kernel
void pipeWrite(
    __global int *src,
    __write_only pipe int out_pipe)
{
    int gid = get_global_id(0);
    reserve_id_t res_id;
    res_id = reserve_write_pipe(out_pipe, 1);
    if(is_valid_reserve_id(res_id))
    {
        if(write_pipe(out_pipe, res_id, 0, &src[gid]) != 0)
        {
            return;
        }
        commit_write_pipe(out_pipe, res_id);
    }
}
The consumer kernel reads the data from the pipe using the pipeRead function:
__kernel
void pipeRead(
    __read_only pipe int in_pipe,
    __global int *dst)
{
    int gid = get_global_id(0);
    reserve_id_t res_id;
    res_id = reserve_read_pipe(in_pipe, 1);
    if(is_valid_reserve_id(res_id))
    {
        if(read_pipe(in_pipe, res_id, 0, &dst[gid]) != 0)
        {
            return;
        }
        commit_read_pipe(in_pipe, res_id);
    }
}
GPU info
Max compute units 11
SIMD per compute unit (AMD) 4
SIMD width (AMD) 32
SIMD instruction width (AMD) 1
Max clock frequency 1900MHz
Graphics IP (AMD) 10.1
Device Partition (core)
Max number of sub-devices 11
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Wavefront width (AMD) 32
But the values are correct when I use a global size of 32 (or 64) with a local size of 32 (or 64), which I guess is connected with the wavefront width of 32.
My question is: how can I get an output array whose values match the input values?

How does "runif" function work internally in R?

I am trying to generate a set of uniformly distributed numbers in R. I know that we can use the function "runif" in R to do this, but I really want to understand how this function works internally, in the sense of how the code behind "runif" operates. So, in a nutshell, I want to create my own function that can do the same task as "runif".
Ultimately, runif calls a pseudorandom number generator. One of the simpler ones, Marsaglia's multiply-with-carry, is defined in C within the R code base and should be straightforward to emulate:
static unsigned int I1=1234, I2=5678;

void set_seed(unsigned int i1, unsigned int i2)
{
    I1 = i1; I2 = i2;
}

void get_seed(unsigned int *i1, unsigned int *i2)
{
    *i1 = I1; *i2 = I2;
}

double unif_rand(void)
{
    I1 = 36969*(I1 & 0177777) + (I1>>16);
    I2 = 18000*(I2 & 0177777) + (I2>>16);
    return ((I1 << 16)^(I2 & 0177777)) * 2.328306437080797e-10; /* in [0,1) */
}
So effectively this runs two multiply-with-carry generators: each step keeps the low 16 bits of the state (0177777 is octal for 0xFFFF), multiplies them by a constant, and adds the carry from the previous step (the high 16 bits). The two streams are then combined bitwise and scaled by the constant 2.328306437080797e-10, which is 1/(2^32 - 1), normalising the result into the [0, 1) range as a double-precision floating point number.
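As a quick check, here is a minimal driver for the snippet above, assuming the three functions are compiled in the same translation unit (1234 and 5678 are just the defaults from the static initialisers):
#include <stdio.h>

int main(void)
{
    set_seed(1234, 5678);            /* reset to the default state */
    for (int i = 0; i < 5; i++)
        printf("%f\n", unif_rand()); /* five draws in [0, 1) */
    return 0;
}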

Do not understand result of OpenCL select statement

I have a simple kernel in OpenCL that has the following structure:
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    printf("input %d\n", (int)(input[i] != 0.0));
    output[i] = select((float)0.0, (float)1.0, (int)(input[i] != 0.0));
    //output[i] = select((float)0.0, (float)1.0, 1);
}
Equivalently this can be:
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    printf("input %d\n", (int)(input[i] != 0.0));
    output[i] = input[i] != 0.0 ? 1.0 : 0.0;
    //output[i] = 1 ? 1.0 : 0.0;
}
When I print to the command line, I see:
input 1
input 1
input 1
But the output array has all 0.0. However, if I uncomment the last line of the kernel and comment out the second-to-last line (meaning if I use the scalar 1 in the select statement), then it works as expected and the output array has all 1.0. So what is the difference between these two lines that leads to two different results?
It's a quirk in OpenCL. The problem is that true/false values for scalars are 1/0 (as printf has shown you), but true/false values for vectors are -1/0, and the latter is also what select() expects in its last argument (more precisely, it expects the MSB to be set, which means any negative integer works).
Though I think the ternary operator on scalars should still work as expected; if it doesn't, I would consider it a bug.
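A workaround consistent with that rule is to negate the comparison result, so the condition is -1 when true and 0 when false (i.e. the MSB is set exactly when it should be). A sketch, assuming the same kernel signature as in the question:
kernel void simple_select(global double *input, global double *output) {
    size_t i = get_global_id(0);
    /* (input[i] != 0.0) yields 1 or 0; negation gives -1 or 0 */
    int cond = -(int)(input[i] != 0.0);
    output[i] = select(0.0f, 1.0f, cond);
}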

Why does this binary math fail when adding 00000001, but work correctly otherwise?

I've tried everything I can think of and cannot seem to get the binary math logic below to work. I'm not sure why it fails, but it probably indicates a misunderstanding of binary math or C on my part. The ultimate intent is to store large integers (unsigned long) directly to an 8-bit FRAM memory module as four-byte words, so that a micro-controller (Arduino) can recover the values after a power failure. The unsigned long thus has to be reassembled from its four byte-sized parts as it's pulled from memory, and the arithmetic of this reassembly is not working correctly.
In the snippet of code below, the long value is defined as four bytes A, B, C, and D (simulating being pulled from four 8-bit memory blocks), which get combined into an unsigned long in the arrangement DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA. If A < 256 and B, C, and D are all 0, the math works correctly. The math also works correctly for any values of B, C, and D if A == 0. But if B, C, or D > 0 and A == 1, the 1 from A is not added during the arithmetic. A value of 2 works, but not a value of 1. Is there any reason for this? Or am I doing binary math wrong? Is this a known issue that needs a workaround?
// ---- FUNCTIONS
unsigned long fourByte_word_toDecimal(uint8_t byte0 = B00000000, uint8_t byte1 = B00000000,
                                      uint8_t byte2 = B00000000, uint8_t byte3 = B00000000){
    return (byte0 + (byte1 * 256) + (byte2 * pow(256, 2)) + (byte3 * pow(256, 3)));
}
// ---- MAIN
void setup() {
    Serial.begin(9600);

    uint8_t addressAval = B00000001;
    uint8_t addressBval = B00000001;
    uint8_t addressCval = B00000001;
    uint8_t addressDval = B00000001;

    uint8_t addressValArray[4];
    addressValArray[0] = addressAval;
    addressValArray[1] = addressBval;
    addressValArray[2] = addressCval;
    addressValArray[3] = addressDval;

    unsigned long decimalVal = fourByte_word_toDecimal(addressValArray[0], addressValArray[1],
                                                       addressValArray[2], addressValArray[3]);

    // Print out resulting decimal value
    Serial.println(decimalVal);
}
In the code above, the binary value should result as 00000001000000010000000100000001, AKA a decimal value of 16843009. But the code evaluates the decimal value to 16843008. Changing the value of addressAval to 00000000 also evaluates (correctly) to 16843008, and changing addressAval to 00000010 also correctly evaluates to 16843010.
I'm stumped.
The problem is that you're using pow(). On AVR-based Arduinos, double is only 32 bits, so this causes everything to be calculated as a binary32, which doesn't have enough precision to hold 16843009:
>>> numpy.float32(16843009)
16843008.0
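The same rounding can be shown in C: 16843009 needs 25 significant bits, but a binary32 has only 24, so it rounds to the nearest representable neighbour:
#include <stdio.h>

int main(void)
{
    float f = 16843009.0f;  /* 0x01010101 is not representable in binary32 */
    printf("%.1f\n", f);    /* prints 16843008.0 */
    return 0;
}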
The fix is to use integer constants, specifically 256UL, 65536UL, and 16777216UL; the UL suffix matters, because a plain int on AVR is only 16 bits.
Do not use pow() for this.
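A sketch of the helper rewritten with those constants (default arguments omitted for brevity):
unsigned long fourByte_word_toDecimal(uint8_t byte0, uint8_t byte1, uint8_t byte2, uint8_t byte3){
    // the UL constants force the arithmetic into unsigned long, so nothing is rounded
    return byte0 + (byte1 * 256UL) + (byte2 * 65536UL) + (byte3 * 16777216UL);
}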
The usual way to do this, though, is with the shift operator. Each byte must be converted to a 32-bit type before it is shifted; otherwise byte3 << 24 is performed in 16-bit int arithmetic on AVR and the bits are lost:
uint32_t result = ((uint32_t)byte3 << 24) | ((uint32_t)byte2 << 16) | ((uint32_t)byte1 << 8) | byte0;

pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work item size

I am attempting to implement the dot_persist_kernel() shown here in Python using pyopencl, and I've been squashing numerous bugs along the way. But I've stumbled upon an issue that I can't crack:
self.program = cl.Program(self.ctx, code).build()
# code is a string with the code from the link given

a = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
b = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
c = 0.
mf = cl.mem_flags
c_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, 4)

MAX_COMPUTE_UNITS = cl.get_platforms()[0].get_devices()[0].max_compute_units
WORK_GROUPS_PER_CU = MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = a.size / WORK_GROUPS_PER_CU
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / 256

self.program.DotProduct(self.queue, a.shape, a.shape,
                        a.data, b.data, c_buf,
                        np.uint32(ELEMENTS_PER_GROUP),
                        np.uint32(ELEMENTS_PER_WORK_ITEM),
                        np.uint32(1028 * MAX_COMPUTE_UNITS))
Assuming an array of size 2^26, the constants will have values of:
MAX_COMPUTE_UNITS = 32 // from get_device()[0].max_compute_units
WORK_GROUPS_PER_CU = 128 // MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = 524288 // 2^19
ELEMENTS_PER_WORK_ITEM = 2048 // 2^11
The kernel header looks like:
#define LOCAL_GROUP_XDIM 256
// Kernel for part 1 of dot product, version 3.
__kernel __attribute__((reqd_work_group_size(LOCAL_GROUP_XDIM, 1, 1)))
void dot_persist_kernel(
__global const double * x, // input vector
__global const double * y, // input vector
__global double * r, // result vector
uint n_per_group, // elements processed per group
uint n_per_work_item, // elements processed per work item
uint n // input vector size
)
The error that it is giving is:
Traceback (most recent call last):
File "GPUCompute.py", line 102, in <module>
gpu = GPUCompute()
File "GPUCompute.py", line 87, in __init__
np.uint32(1028 * MAX_COMPUTE_UNITS))
File "C:\Miniconda2\lib\site-packages\pyopencl\__init__.py", line 512, in kernel_call
global_offset, wait_for, g_times_l=g_times_l)
pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work item size
I've tried shifting the numbers around a lot, to no avail. Ideas?
There were a few issues going on with the previous implementation, but this one is working:
WORK_GROUPS = cl.get_platforms()[0].get_devices()[0].max_compute_units * 4
ELEMENTS_PER_GROUP = np_a.size / WORK_GROUPS
LOCAL_GROUP_XDIM = 256
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / LOCAL_GROUP_XDIM

self.program = cl.Program(self.ctx, kernel).build()
self.program.DotProduct(
    self.queue, np_a.shape, (LOCAL_GROUP_XDIM,),  # kernel information
    cl_a, cl_b, cl_c,                             # data
    np.uint32(ELEMENTS_PER_GROUP),                # elements processed per group
    np.uint32(ELEMENTS_PER_WORK_ITEM),            # elements processed per work item
    np.uint32(np_a.size)                          # input vector size
)
It was the culmination of a few things, but the biggest factor was that the second and third arguments passed to DotProduct() (the global and local work sizes) are supposed to be tuples, not ints like I thought. In the original call the local size was a.shape, i.e. 2**20 work items per group, far beyond the 256 the kernel requires via reqd_work_group_size, which is what triggered the "invalid work item size" error. :)
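For reference, the limits that a local size must respect can be queried from the device and from the built kernel; a small sketch in plain C, assuming a device and kernel are already in hand:
#include <stdio.h>
#include <CL/cl.h>

void print_work_group_limits(cl_device_id dev, cl_kernel kern)
{
    size_t dev_max, kern_max;
    /* upper bound imposed by the device */
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(dev_max), &dev_max, NULL);
    /* upper bound for this particular kernel (resource usage may lower it) */
    clGetKernelWorkGroupInfo(kern, dev, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kern_max), &kern_max, NULL);
    printf("device max: %zu, kernel max: %zu\n", dev_max, kern_max);
}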
