Strange OpenCL behavior - opencl

I've run into some very strange OpenCL behavior. Below is a minimal code sample.
Starting from some seemingly random index (commonly divisible by 32), values are not written to the array if I add one extra operation beforehand (g_idata[ai] = g_idata[ai-1]). Also notable: I get the correct result if I
just read the value and write a literal instead (see SHOW_BUG), or
add if (ai >= n) g_idata[0] += 0; at the beginning (see the commented lines).
Tested on Intel and NVIDIA.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()

prg = cl.Program(ctx, """
__kernel void prescan(__global float *g_idata, const int n) {
    int thid = get_global_id(0);
    int ai = thid*2 + 1;

    // if the lines below are uncommented, the bug disappears
    //if (ai >= n){
    //    g_idata[0] += 0;
    //}

    bool SHOW_BUG = 1;
    // make a dummy operation
    if (SHOW_BUG)
        g_idata[ai] = g_idata[ai-1];
    else {
        g_idata[ai-1];       // dummy read
        g_idata[ai] = 3.14f; // constant write
    }

    barrier(CLK_GLOBAL_MEM_FENCE);

    // set 0,1,2,3,... as the result
    g_idata[thid] = thid;
}
""").build()

prescan_kernel = prg.prescan
prescan_kernel.set_scalar_arg_dtypes([None, np.int32])

def main():
    N = 512
    a_np = np.random.random((N,)).astype(np.float32)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)

    global_size = (512,)
    local_size = None
    prescan_kernel(queue, global_size, local_size, a_g, N)
    cl.enqueue_copy(queue, a_np, a_g)

    correct = np.array(range(N))
    #assert np.allclose(a_np, 3.14), np.where(3.14 != a_np)
    assert np.allclose(a_np, correct), np.where(correct != a_np)

if __name__ == '__main__':
    for i in range(25):
        main()

Several things in your code will, according to the OpenCL spec, produce undefined behavior.
These include:
Accessing out-of-range memory: with N work-items the kernel indexes up to 2*N-1, but the buffer only holds N elements.
Multiple work-items (threads) accessing the same index of the array, whether reading or writing.
Furthermore, barriers only synchronize work-items/threads within a work-group, so the barrier has no effect across work-groups in your code.
Undefined behavior can manifest differently on different platforms: sometimes it appears to work, sometimes it crashes the driver, and sometimes it takes down the OS. Please fix these problems first, and then describe whatever problem remains.
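As a minimal sketch (not the asker's intended prefix-scan; the kernel name prescan_fixed is my own), here is a variant that avoids all three issues: the host allocates 2*N floats so every index the kernel touches is in range, each work-item writes only elements derived from its own ID, and nothing relies on the global barrier:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

N = 512
# allocate 2*N floats so every index the kernel touches (up to 2*N-1) is in range
a_np = np.random.random(2 * N).astype(np.float32)
a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)

prg = cl.Program(ctx, """
__kernel void prescan_fixed(__global float *g_idata, const int n) {
    int thid = get_global_id(0);
    if (thid >= n)
        return;
    // each work-item reads g_idata[2*thid] and writes g_idata[2*thid+1],
    // so no two work-items touch the same element and no barrier is needed
    g_idata[2*thid + 1] = g_idata[2*thid];
}
""").build()

prg.prescan_fixed(queue, (N,), None, a_g, np.int32(N))
cl.enqueue_copy(queue, a_np, a_g)
assert np.allclose(a_np[1::2], a_np[0::2])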

Related

Use Comment to avoid OpenCL Error on NVIDIA

I wrote the following code to test on my NVIDIA and AMD GPUs:
kernel void computeLayerOutput_Rolled(
    global Layer* layers,
    global float* weights,
    global float* output,
    constant int* restrict netSpec,
    int layer)
{
    const int n = get_global_size(0);
    const int nodeNumber = get_global_id(0); // there will be an offset depending on the layer we are operating on
    int numberOfWeights;
    float t;

    //getPosition(i, netSpec, &layer, &nodeNumber);
    numberOfWeights = layers[layer].nodes[nodeNumber].numberOfWeights;

    //if (sizeof(Layer) > 60000) // this is the extra code added for NVIDIA
    //    exit(0);

    t = 0;
    for (unsigned int j = 0; j != numberOfWeights; ++j)
        t += threeD_access(weights, layer, nodeNumber, j, MAXSIZE, MAXSIZE) *
             twoD_access(output, layer-1, j, MAXSIZE);
    twoD_access(output, layer, nodeNumber, MAXSIZE) = sigmoid(t);
}
At the beginning, I did not include the code that checks the size of Layer; it worked on an AMD Kalindi GPU, but crashed with error code -36 on an NVIDIA Tesla C2075.
Since I had previously rewritten the struct type Layer and reduced its size considerably, I decided to check the size of Layer to determine whether the struct was defined correctly in the kernel code. So I added this code:
if (sizeof(Layer) > 60000)
    exit(0);
Then it works on NVIDIA. However, the strange thing is that when I comment this check out with //, just as in the code above, it still works. (I believe I do not need to run make clean && make when I change something in the kernel code, but I did it anyway.) Nevertheless, when I roll back to the version that does not contain this comment at all, it fails and error code -36 appears again. It really puzzles me; I think the two versions of my code are identical, aren't they?
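In case it helps, one way to inspect what the device compiler thinks sizeof(Layer) is, without aborting via exit(), is to have a tiny kernel report it back to the host and compare it against the host-side size. A hedged sketch follows; the Node/Layer definitions here are made up, since the real ones are not shown in the question:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# hypothetical struct definitions; substitute the real ones shared with the host
src = """
typedef struct {
    int numberOfWeights;
    float bias;
} Node;

typedef struct {
    Node nodes[64];
} Layer;

__kernel void report_sizes(__global int *out)
{
    out[0] = (int) sizeof(Node);
    out[1] = (int) sizeof(Layer);
}
"""

prg = cl.Program(ctx, src).build()
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, 2 * 4)
prg.report_sizes(queue, (1,), None, out_g)

out = np.empty(2, dtype=np.int32)
cl.enqueue_copy(queue, out, out_g)
print("device sizeof(Node)  =", out[0])
print("device sizeof(Layer) =", out[1])
# compare these with the host-side sizes to catch layout/padding mismatches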

Is the device address of a buffer the same for different kernels/programs in OpenCL

When passing buffers as arguments to OpenCL kernels, will the address of the buffer, as seen by the kernel code, remain the same for the same buffer?
I used the code below to check, and it seems that the addresses are indeed the same. However, I can't find anything in the standard that guarantees this.
import pyopencl as cl
import numpy as np

def main():
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_ONLY, 1000)
    buf2 = cl.Buffer(ctx, mf.READ_WRITE, 8)

    prg = cl.Program(ctx, """
    __kernel void
    get_addr(__global const int *in, __global long *out)
    {
        *out = (long)in;
    }
    """).build()
    knl = prg.get_addr
    knl.set_args(buf, buf2)
    cl.enqueue_task(queue, knl)
    b = np.empty([1], dtype=np.int64)
    cl.enqueue_copy(queue, b, buf2).wait()
    print(b[0])

    prg = cl.Program(ctx, """
    __kernel void
    get_addr(__global const int *in, __global long *out)
    {
        *out = (long)in;
    }
    """).build()
    knl = prg.get_addr
    knl.set_args(buf, buf2)
    cl.enqueue_task(queue, knl)
    b = np.empty([1], dtype=np.int64)
    cl.enqueue_copy(queue, b, buf2).wait()
    print(b[0])

if __name__ == '__main__':
    main()
The use case is that I am running a simulation in OpenCL that has many arrays of parameters. To avoid passing these arrays around as kernel arguments, I put their pointers into a struct and pass a pointer to that struct instead. Since this struct will be used many times (and by all work-items), I would prefer not to fill it in on every run of every kernel, and would like to know whether the pointers can change between different runs/work-items.
It is not guaranteed for OpenCL 1.x, which is why it is unsafe to store pointers in buffers. The runtime is allowed to move the allocation for each kernel launch. There is no guarantee that it will move it, and it is reasonable to expect that a buffer will not often need to move, so it isn't surprising that you see the result you do. If you allocate many more buffers and cycle through them to force the runtime to move them around, you will be more likely to see the issue.
For OpenCL 2.0 the shared virtual memory feature guarantees this by definition: the address couldn't be shared if it kept changing.
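A hedged sketch of that experiment (the kernel name probe_addr and the buffer sizes are mine, not from the question): allocate a burst of extra buffers between launches, run the same address-reporting kernel on one fixed buffer, and count how many distinct addresses you observe. On most OpenCL 1.x runtimes the count will stay at 1, which matches the observation above, but nothing in the spec forbids it from growing:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

prg = cl.Program(ctx, """
__kernel void probe_addr(__global const int *in, __global long *out)
{
    *out = (long)in;
}
""").build()

out_buf = cl.Buffer(ctx, mf.READ_WRITE, 8)
target = cl.Buffer(ctx, mf.READ_ONLY, 1 << 20)

seen = set()
for i in range(20):
    # churn allocations to give the runtime a reason to move things around
    junk = [cl.Buffer(ctx, mf.READ_WRITE, 1 << 20) for _ in range(64)]
    prg.probe_addr(queue, (1,), None, target, out_buf)
    b = np.empty(1, dtype=np.int64)
    cl.enqueue_copy(queue, b, out_buf).wait()
    seen.add(int(b[0]))
    del junk

print("distinct device addresses observed:", len(seen))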

Copying an Image using PyOpenCL

I've been having some trouble making a copy of an image using PyOpenCL. I wanted to try a plain copy first because I ultimately want to do other processing, but I'm not able to get even this basic task of accessing every pixel to work. Please help me find the error so I can make it work.
Here is the program
import pyopencl as cl
import numpy
import Image
import sys

img = Image.open(sys.argv[1])
img_arr = numpy.asarray(img).astype(numpy.uint8)
dim = img_arr.shape

host_arr = img_arr.reshape(-1)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_arr)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, host_arr.nbytes)

kernel_code = """
__kernel void copyImage(__global const uint8 *a, __global uint8 *c)
{
    int rowid = get_global_id(0);
    int colid = get_global_id(1);

    int ncols = %d;
    int npix = %d;  // number of channels: 3 for RGB, 4 for RGBA

    int index = rowid * ncols * npix + colid * npix;
    c[index + 0] = a[index + 0];
    c[index + 1] = a[index + 1];
    c[index + 2] = a[index + 2];
}
""" % (dim[1], dim[2])

prg = cl.Program(ctx, kernel_code).build()
prg.copyImage(queue, (dim[0], dim[1]), None, a_buf, dest_buf)

result = numpy.empty_like(host_arr)
cl.enqueue_copy(queue, result, dest_buf)

result_reshaped = result.reshape(dim)
img2 = Image.fromarray(result_reshaped, "RGB")
img2.save("new_image_gpu.bmp")
The output did not match the input image: the copy contains black lines that are not in the original (the input and output images are not reproduced here). I'm not able to make sense of why those black lines appear.
Please help me solve this bug. Thank you.
OK, so I've found a solution: I changed all uint8 to int, and removed the astype(numpy.uint8) from the numpy array. I don't know why; I just tried it and it worked. An explanation of why would be helpful. Also, does this mean it now uses much more memory?
It works, but I think it now takes much more memory. Any workaround that keeps uint8 would be helpful.
There is a mismatch between the datatypes you are using in Python and in OpenCL. In numpy, a uint8 is an 8-bit unsigned integer (which I presume is what you were after). In OpenCL, a uint8 is an 8-element vector of 32-bit unsigned integers. The correct datatype for an 8-bit unsigned integer in OpenCL is just uchar. So your astype(numpy.uint8) is fine, but it should be accompanied by __global const uchar* arrays in your OpenCL kernel.
If you are dealing with images, I would also recommend looking into OpenCL's dedicated image types, which can take advantage of the native support for handling images available in some hardware.
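A minimal sketch of the uchar fix described above (keeping astype(numpy.uint8) on the host and declaring the kernel arguments as uchar); the tiny random 4x5 RGB array here is just a stand-in for a real image:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# a small fake RGB image: 4 rows x 5 columns x 3 channels of uint8
img_arr = np.random.randint(0, 256, size=(4, 5, 3)).astype(np.uint8)
host_arr = img_arr.reshape(-1)

a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_arr)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, host_arr.nbytes)

# uchar in OpenCL C matches numpy.uint8: one byte per channel
kernel_code = """
__kernel void copyImage(__global const uchar *a, __global uchar *c)
{
    int rowid = get_global_id(0);
    int colid = get_global_id(1);
    int ncols = %d;
    int npix = %d;
    int index = rowid * ncols * npix + colid * npix;
    c[index + 0] = a[index + 0];
    c[index + 1] = a[index + 1];
    c[index + 2] = a[index + 2];
}
""" % (img_arr.shape[1], img_arr.shape[2])

prg = cl.Program(ctx, kernel_code).build()
prg.copyImage(queue, (img_arr.shape[0], img_arr.shape[1]), None, a_buf, dest_buf)

result = np.empty_like(host_arr)
cl.enqueue_copy(queue, result, dest_buf)
assert np.array_equal(result, host_arr)   # the copy now matches, with no extra memory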

Working around pyopencl array offset limitation

Is there a way to work around the limitation in PyOpenCL whereby:
array.data
fails with
pyopencl.array.ArrayHasOffsetError: The operation you are attempting does not yet support arrays that start at an offset from the beginning of their buffer.
I tried:
a.base_data[a.offset: a.offset + a.nbytes]
This seems to work sometimes, but other times I get:
pyopencl.LogicError: clCreateSubBuffer failed: invalid value
clCreateSubBuffer requires the offset (called the origin in this case) to be aligned, and origin + size must fall within the limits of the buffer:
CL_INVALID_VALUE is returned in errcode_ret if the region specified by
(origin, size) is out of bounds in buffer.
CL_MISALIGNED_SUB_BUFFER_OFFSET is returned in errcode_ret if there
are no devices in context associated with buffer for which the origin
value is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.
For the particular error you are seeing, it looks like either your program or pyopencl is miscalculating the size of the array after the offset. Even if you fixed this, you may still have problems if the original offset is not aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN.
Having said that, NVIDIA seems to depart from the spec and allow arbitrary offsets, so your mileage may vary depending on the hardware.
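As a quick check before relying on sub-buffers, you can query the required alignment for each device in the context (CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits); a small sketch:

import pyopencl as cl

ctx = cl.create_some_context()
for dev in ctx.devices:
    align_bits = dev.get_info(cl.device_info.MEM_BASE_ADDR_ALIGN)
    # sub-buffer origins must be a multiple of this many bytes on this device
    print(dev.name, "->", align_bits // 8, "bytes")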
If you're just looking to get a buffer that marks the start of the array data, to pass to a kernel, you don't have to worry about the size. Here's a function that gets a size-1 buffer that points to the start of the offset data:
def data_ptr(array):
    if array.offset:
        return array.base_data.get_sub_region(array.offset, 1)
    else:
        return array.data
You can use this to pass a pointer to the start of the offset data to a kernel. Here's an example where I want to set a sub-region clV of an array clA to the value 3; I use data_ptr to get a pointer to the start of clV's data.
import numpy as np
import pyopencl as cl
import pyopencl.array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

m, n = 5, 5
A = np.random.uniform(size=(m, n)).astype(np.float32)
clA = cl.array.Array(queue, A.shape, A.dtype)
clA.set(A)
clV = clA[1::2, 1::2]

def data_ptr(array):
    if array.offset:
        return array.base_data.get_sub_region(array.offset, 1)
    else:
        return array.data

source = """
__kernel void fn(long si, long sj, __global float *Y)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    Y[i*si + j*sj] = 3;
}
"""
kernel = cl.Program(ctx, source).build().fn

gsize = clV.shape
lsize = None
# element strides of the sub-array, as 64-bit ints to match the kernel's long arguments
estrides = (np.array(clV.strides) // clV.dtype.itemsize).astype(np.int64)
kernel(queue, gsize, lsize, estrides[0], estrides[1], data_ptr(clV))
print(clA.get())

Strange behaviour using local memory in OpenCL

I'm currently working on a project using OpenCL on an NVIDIA Tesla C1060 (driver version 195.17). However, I'm getting some strange behaviour I can't really explain. Here is the code that puzzles me (reduced for clarity and testing purposes):
kernel void TestKernel(global const int* groupOffsets, global float* result,
                       local int* tmpData, const int itemcount)
{
    unsigned int groupid = get_group_id(0);
    unsigned int globalsize = get_global_size(0);
    unsigned int groupcount = get_num_groups(0);

    for(unsigned int id = get_global_id(0); id < itemcount;
        id += globalsize, groupid += groupcount)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(get_local_id(0) == 0)
            tmpData[0] = groupOffsets[groupid];
        barrier(CLK_LOCAL_MEM_FENCE);
        int offset = tmpData[0];
        result[id] = (float) offset;
    }
}
This code should load the offset for each work-group into local memory, then read it back and write it into the corresponding output-vector entry. For most work-items this works, but in each work-group the work-items with local IDs 1 to 31 read an incorrect value.
My output vector (for a work-group size of 128) is as follows:
index 0: 0
index 1-31: 470400
index 32-127: 0
index 128: 640
index 129-159: 471040
index 160-255: 640
index 256: 1280
index 257-287: 471680
index 288-511: 1280
...
The output I expected would be:
index 0-127: 0
index 128-255: 640
index 256-511: 1280
...
The strange thing is: the problem only occurs when I use fewer than itemcount work-items (it works as expected when globalsize >= itemcount, i.e. when every work-item processes only one entry). So I'm guessing it has something to do with the loop.
Does anyone know what I'm doing wrong and how to fix it?
Update:
I found out that it seems to work if I change
if(get_local_id(0) == 0)
    tmpData[0] = groupOffsets[groupid];
to
if(get_local_id(0) < 32)
    tmpData[0] = groupOffsets[groupid];
This astonishes me even more, so while it might fix the problem, I don't feel comfortable fixing it this way (it might break again some other time).
Besides, I would rather avoid losing performance when running on GeForce 8xxx class hardware due to the additional (uncoalesced, on that hardware, as far as I understand) memory accesses.
So the question still remains.
Firstly, and importantly, you need to be careful that itemcount is a multiple of the local work size, to avoid divergence when executing the barrier. The OpenCL spec says:
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
You could implement this as follows:
unsigned int itemcountrounded = get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));
for(unsigned int id = get_global_id(0); id < itemcountrounded; id += globalsize, groupid += groupcount)
{
    // ...
    if (id < itemcount)
        result[id] = (float) offset;
}
You said the code was reduced for simplicity; what happens if you run exactly what you posted? I'm just wondering whether you need a barrier on global memory as well.
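For completeness, here is a hedged, self-contained pyopencl sketch of that rounded-loop fix; the buffer sizes, the work-group size of 128, and the groupOffsets contents are made up for illustration and are not taken from the question:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

itemcount = 2000            # deliberately not a multiple of the work-group size
local_size = 128
global_size = 512           # fewer work-items than items, so the loop runs several times

# one offset per "virtual" group the loop will visit
num_groups = (itemcount + local_size - 1) // local_size
offsets = (np.arange(num_groups) * 640).astype(np.int32)

prg = cl.Program(ctx, """
__kernel void TestKernel(__global const int* groupOffsets, __global float* result,
                         __local int* tmpData, const int itemcount)
{
    unsigned int groupid = get_group_id(0);
    unsigned int globalsize = get_global_size(0);
    unsigned int groupcount = get_num_groups(0);

    // round the trip count up so every work-item in a group reaches every barrier
    unsigned int itemcountrounded = get_local_size(0) *
        ((itemcount + get_local_size(0) - 1) / get_local_size(0));

    for (unsigned int id = get_global_id(0); id < itemcountrounded;
         id += globalsize, groupid += groupcount)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == 0)
            tmpData[0] = groupOffsets[groupid];
        barrier(CLK_LOCAL_MEM_FENCE);
        int offset = tmpData[0];
        if (id < itemcount)          // only the guarded work is skipped, never the barriers
            result[id] = (float) offset;
    }
}
""").build()

offsets_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=offsets)
result_g = cl.Buffer(ctx, mf.WRITE_ONLY, itemcount * 4)
prg.TestKernel(queue, (global_size,), (local_size,),
               offsets_g, result_g, cl.LocalMemory(4), np.int32(itemcount))

result = np.empty(itemcount, dtype=np.float32)
cl.enqueue_copy(queue, result, result_g)
print(result[0], result[128], result[256])   # expected: 0.0 640.0 1280.0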
