I have a kernel that runs on all platforms I have access to, except the AMD APP SDK 3.0 with an Intel CPU.
The platform is: OpenCL.Device(Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz) on AMD Accelerated Parallel Processing.
The MWE (sorry it's in Julia, but calls should be almost the same as in C):
using OpenCL
test_source = "
struct __attribute__((packed)) Test{
float3 f1;
int f2;
float f3;
};
__kernel void structest(struct Test a){}
"
device = first(cl.devices())
ctx = cl.Context(device)
prg = cl.Program(ctx, source = test_source)
queue = cl.CmdQueue(ctx)
cl.build!(prg)
structkernel = cl.Kernel(prg, "structest")
astruct = ((1f0, 2f0, 3f0, 0f0), Int32(0), 22f0)
sizeof(astruct)
# == 24 exactly the same as what sizeof(struct Test a) in the kernel returns
astruct_boxed = Ref(astruct)
cl.@check cl.api.clSetKernelArg(structkernel.id, cl.cl_uint(0), sizeof(astruct), astruct_boxed)
So I have confirmed that sizeof(astruct) on the host matches sizeof(struct Test) in the kernel, but I still get a CL_INVALID_ARG_SIZE error. Is this a bug or am I missing something?
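For reference, the equivalent host-side call in C would look roughly like the sketch below; the typedef is my own mirror of the kernel struct. Note that in OpenCL C a float3 has the size and alignment of a float4 (16 bytes), which is why the host tuple carries a fourth padding float; whether the device compiler honors __attribute__((packed)) for vector types is part of what is in question here.
#include <CL/cl.h>
/* Hedged mirror of the kernel's struct Test. */
typedef struct __attribute__((packed)) {
    cl_float f1[4]; /* stands in for float3: 4 floats, 16 bytes */
    cl_int   f2;
    cl_float f3;
} Test;             /* sizeof(Test) == 24, matching both sides */
Test a = {{1.0f, 2.0f, 3.0f, 0.0f}, 0, 22.0f};
cl_int err = clSetKernelArg(kernel, 0, sizeof(a), &a);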
Related
Here is my kernel code.
Producer kernel, writing input data into the pipe using the pipeWrite function:
__kernel
void pipeWrite(
    __global int *src,
    __write_only pipe int out_pipe)
{
    int gid = get_global_id(0);
    reserve_id_t res_id;
    res_id = reserve_write_pipe(out_pipe, 1);
    if(is_valid_reserve_id(res_id))
    {
        if(write_pipe(out_pipe, res_id, 0, &src[gid]) != 0)
        {
            return;
        }
        commit_write_pipe(out_pipe, res_id);
    }
}
Consumer kernel, reading input data from the pipe using the pipeRead function:
__kernel
void pipeRead(
    __read_only pipe int in_pipe,
    __global int *dst)
{
    int gid = get_global_id(0);
    reserve_id_t res_id;
    res_id = reserve_read_pipe(in_pipe, 1);
    if(is_valid_reserve_id(res_id))
    {
        if(read_pipe(in_pipe, res_id, 0, &dst[gid]) != 0)
        {
            return;
        }
        commit_read_pipe(in_pipe, res_id);
    }
}
GPU info
Max compute units 11
SIMD per compute unit (AMD) 4
SIMD width (AMD) 32
SIMD instruction width (AMD) 1
Max clock frequency 1900MHz
Graphics IP (AMD) 10.1
Device Partition (core)
Max number of sub-devices 11
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Wavefront width (AMD) 32
However, the values are correct only when I use a global size of 32 (or 64) and a local size of 32 (or 64), which I guess is connected with the wavefront width, which is 32.
My question is: how can I get an output array whose values match the input values?
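For context, the host side of such a producer/consumer pair would look roughly like the sketch below; the buffer and size names are my assumptions, not from the post, and clCreatePipe requires OpenCL 2.0. Note also that a pipe is a FIFO: nothing ties the packet read by work-item gid in pipeRead to the one written by the same gid in pipeWrite, which is consistent with the values matching only within a single wavefront.
/* Hedged host-side sketch (OpenCL 2.0); names are illustrative. */
cl_int err;
cl_mem pipe_obj = clCreatePipe(ctx, 0, sizeof(cl_int), GLOBAL_SIZE, NULL, &err);
clSetKernelArg(write_kernel, 0, sizeof(cl_mem), &src_buf);
clSetKernelArg(write_kernel, 1, sizeof(cl_mem), &pipe_obj);
clSetKernelArg(read_kernel, 0, sizeof(cl_mem), &pipe_obj);
clSetKernelArg(read_kernel, 1, sizeof(cl_mem), &dst_buf);
size_t gsize = GLOBAL_SIZE, lsize = LOCAL_SIZE;
clEnqueueNDRangeKernel(queue, write_kernel, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, read_kernel, 1, NULL, &gsize, &lsize, 0, NULL, NULL);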
Why is the image coming out black when saved? I am just beginning to learn OpenCL.
Without OpenCL, purely on the CPU, the loop iterates through the matrix and uses the rgb2gray average formula to store the values in the gray array.
I am using Windows and Python 3.8.
import pyopencl
import numpy as np
import imread
import matplotlib.pyplot as plt
ocl_platforms = (platform.name for platform in pyopencl.get_platforms())
print("\n".join(ocl_platforms))
# select platform
platform = pyopencl.get_platforms()[0]
# select device
device = platform.get_devices()[0]
# create context
ctx = pyopencl.Context(devices=[device])
img = imread.imread('gigapixel.jpg')
r = np.array(img[:, :, 0], dtype=np.float32)
g = np.array(img[:, :, 1], dtype=np.float32)
b = np.array(img[:, :, 2], dtype=np.float32)
gray = np.empty_like(r)
# without gpu
for i in range(r.shape[0]):
    for j in range(r.shape[1]):
        gray[i, j] = (r[i, j] + g[i, j] + b[i, j]) / 3
plt.imshow(gray)
plt.show()
# convert to uint8
gray = np.uint8(gray)
# save image
imread.imsave('gray_cpu.jpg', gray)
With the GPU, the rest of the code is:
gray = np.empty_like(r)
program_source = """
__kernel void rgb2gray(__global float *r, __global float *g, __global float *b,
                       __global float *gray) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    gray[i, j] = (r[i, j] + g[i, j] + b[i, j]) / 3;
}
"""
gpu_program_source = pyopencl.Program(ctx, program_source)
gpu_program = gpu_program_source.build()
program_kernel_names = gpu_program.get_info(pyopencl.program_info.KERNEL_NAMES)
print(program_kernel_names)
queue = pyopencl.CommandQueue(ctx)
r_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.READ_ONLY |
                        pyopencl.mem_flags.COPY_HOST_PTR, hostbuf=r)
g_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.READ_ONLY |
                        pyopencl.mem_flags.COPY_HOST_PTR, hostbuf=g)
b_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.READ_ONLY |
                        pyopencl.mem_flags.COPY_HOST_PTR, hostbuf=b)
gray_buf = pyopencl.Buffer(ctx, pyopencl.mem_flags.WRITE_ONLY, r.nbytes)
gpu_program.rgb2gray(queue, r.shape, None, r_buf, g_buf, b_buf, gray_buf)
pyopencl.enqueue_copy(queue, gray, gray_buf)
plt.imshow(gray)
plt.show()
gray = np.uint8(gray)
imread.imsave('gigapixel_gray.jpg', gray)
If you need to keep the array[x, y] notation, then try Numba instead of PyOpenCL. Numba compiles a Python function's bytecode to fast native code (and, with its CUDA backend, to GPU kernels) and understands NumPy-style indexing.
PyOpenCL is only a wrapper over the OpenCL API, so the kernel string you pass is compiled directly as OpenCL C. In C there is no array[i, j] indexing: inside the brackets the comma is the comma operator, so gray[i, j] silently means gray[j]. With flat 1-D buffers you need to linearize the index yourself:
int i = get_global_id(0);
int j = get_global_id(1);
int width = get_global_size(1); // number of columns
gray[i * width + j] = (r[i * width + j] + g[i * width + j] + b[i * width + j]) / 3;
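Put together, a corrected kernel would look like this (a sketch; deriving the width from get_global_size(1) assumes the kernel is launched with global size (rows, cols), as in the question):
// Corrected kernel: flat row-major indexing into 1-D buffers.
__kernel void rgb2gray(__global const float *r,
                       __global const float *g,
                       __global const float *b,
                       __global float *gray)
{
    int i = get_global_id(0);       // row
    int j = get_global_id(1);       // column
    int width = get_global_size(1); // number of columns
    int idx = i * width + j;
    gray[idx] = (r[idx] + g[idx] + b[idx]) / 3.0f;
}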
If you want to see the errors produced at any stage (kernel compilation, buffer copying, etc.), you need to catch the corresponding Python exceptions, because PyOpenCL maps OpenCL errors onto them. You can check them like this:
try:
    your_opencl_accelerated_function()
except Exception as exc:  # PyOpenCL raises e.g. pyopencl.LogicError
    print(exc)
I am attempting to implement in Python, using PyOpenCL, the dot_persist_kernel() shown here, and I've been squashing numerous bugs along the way. But I've stumbled upon an issue that I can't crack:
self.program = cl.Program(self.ctx, code).build()
# code is a string with the code from the link given
a = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
b = cl_array.to_device(self.queue, np.random.rand(2**20).astype(np.float32))
c = 0.
mf = cl.mem_flags
c_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, 4)
MAX_COMPUTE_UNITS = cl.get_platforms()[0].get_devices()[0].max_compute_units
WORK_GROUPS_PER_CU = MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = a.size / WORK_GROUPS_PER_CU
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / 256
self.program.DotProduct(self.queue, a.shape, a.shape,
                        a.data, b.data, c_buf,
                        np.uint32(ELEMENTS_PER_GROUP),
                        np.uint32(ELEMENTS_PER_WORK_ITEM),
                        np.uint32(1028 * MAX_COMPUTE_UNITS))
Assuming an array of size 2^26, the constants will have values of:
MAX_COMPUTE_UNITS = 32 // from get_device()[0].max_compute_units
WORK_GROUPS_PER_CU = 128 // MAX_COMPUTE_UNITS * 4
ELEMENTS_PER_GROUP = 524288 // 2^19
ELEMENTS_PER_WORK_ITEM = 2048 // 2^11
The kernel header looks like:
#define LOCAL_GROUP_XDIM 256
// Kernel for part 1 of dot product, version 3.
__kernel __attribute__((reqd_work_group_size(LOCAL_GROUP_XDIM, 1, 1)))
void dot_persist_kernel(
__global const double * x, // input vector
__global const double * y, // input vector
__global double * r, // result vector
uint n_per_group, // elements processed per group
uint n_per_work_item, // elements processed per work item
uint n // input vector size
)
The error that it is giving is:
Traceback (most recent call last):
  File "GPUCompute.py", line 102, in <module>
    gpu = GPUCompute()
  File "GPUCompute.py", line 87, in __init__
    np.uint32(1028 * MAX_COMPUTE_UNITS))
  File "C:\Miniconda2\lib\site-packages\pyopencl\__init__.py", line 512, in kernel_call
    global_offset, wait_for, g_times_l=g_times_l)
pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work item size
I've tried shifting the numbers around a lot, to no avail. Ideas?
There were a few issues going on with the previous implementation, but this one is working:
WORK_GROUPS = cl.get_platforms()[0].get_devices()[0].max_compute_units * 4
ELEMENTS_PER_GROUP = np_a.size / WORK_GROUPS
LOCAL_GROUP_XDIM = 256
ELEMENTS_PER_WORK_ITEM = ELEMENTS_PER_GROUP / LOCAL_GROUP_XDIM
self.program = cl.Program(self.ctx, kernel).build()
self.program.DotProduct(
    self.queue, np_a.shape, (LOCAL_GROUP_XDIM,),  # kernel information
    cl_a, cl_b, cl_c,                             # data
    np.uint32(ELEMENTS_PER_GROUP),                # elements processed per group
    np.uint32(ELEMENTS_PER_WORK_ITEM),            # elements processed per work item
    np.uint32(np_a.size)                          # input vector size
)
It was the culmination of a few things, but the biggest factor was that the second and third arguments passed to DotProduct() are supposed to be tuples--not ints, like I thought. :)
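For comparison, the same launch in C makes the relationship explicit (a sketch; the variable n is my stand-in for np_a.size): the global size must be a multiple of the local size, and reqd_work_group_size(LOCAL_GROUP_XDIM, 1, 1) forces the local size to be exactly 256.
/* Equivalent C launch for the 1-D kernel. */
size_t global_size = n;                /* np_a.size, a multiple of 256 */
size_t local_size  = LOCAL_GROUP_XDIM; /* 256, per reqd_work_group_size */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size,
                                    0, NULL, NULL);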
I have a problem with a 4-point stencil OpenCL code. The code runs fine, but I don't get the symmetric final 2D values that are expected.
I suspect it is a problem with how values are updated in the kernel code. Here's the kernel code:
// kernel code
const char *source ="__kernel void line_compute(const double diagx, const double diagy,\
const double weightx, const double weighty, const int size_x,\
__global double* tab_new, __global double* r)\
{ int iy = get_global_id(0)+1;\
int ix = get_global_id(1)+1;\
double new_value, cell, cell_n, cell_s, cell_w, cell_e;\
double rk;\
cell_s = tab_new[(iy+1)*(size_x+2)+ix];\
cell_n = tab_new[(iy-1)*(size_x+2)+ix];\
cell_e = tab_new[iy*(size_x+2)+(ix+1)];\
cell_w = tab_new[iy*(size_x+2)+(ix-1)];\
cell = tab_new[iy*(size_x+2)+ix];\
new_value = weighty *( cell_n + cell_s + cell*diagy)+\
weightx *( cell_e + cell_w + cell*diagx);\
rk = cell - new_value;\
r[iy*(size_x+2)+ix] = rk *rk;\
barrier(CLK_GLOBAL_MEM_FENCE);\
tab_new[iy*(size_x+2)+ix] = new_value;\
}";
cell_s, cell_n, cell_e, cell_w represent the 4 neighbor values of the 2D stencil. I compute new_value and write it back after a barrier(CLK_GLOBAL_MEM_FENCE).
However, it seems there are conflicts between different work-items. How can I fix this?
The barrier(CLK_GLOBAL_MEM_FENCE) you use will not synchronize all work-items as intended. It only synchronizes accesses within a single work-group.
Usually not all work-groups execute at the same time, because they are scheduled onto only a small number of physical cores, so global synchronization is not possible within a kernel.
The solution is to write the output to a different buffer.
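A sketch of that fix, keeping the original parameter names (tab_out is my addition; the host swaps the two buffers between iterations):
// Read from tab_new, write to tab_out: no work-item reads a cell
// that another work-item has already overwritten, so no global
// synchronization is needed. (Requires cl_khr_fp64, as does the
// original kernel.)
__kernel void line_compute(const double diagx, const double diagy,
                           const double weightx, const double weighty,
                           const int size_x,
                           __global const double* tab_new,
                           __global double* tab_out,
                           __global double* r)
{
    int iy = get_global_id(0) + 1;
    int ix = get_global_id(1) + 1;
    double cell_s = tab_new[(iy+1)*(size_x+2)+ix];
    double cell_n = tab_new[(iy-1)*(size_x+2)+ix];
    double cell_e = tab_new[iy*(size_x+2)+(ix+1)];
    double cell_w = tab_new[iy*(size_x+2)+(ix-1)];
    double cell   = tab_new[iy*(size_x+2)+ix];
    double new_value = weighty*(cell_n + cell_s + cell*diagy)
                     + weightx*(cell_e + cell_w + cell*diagx);
    double rk = cell - new_value;
    r[iy*(size_x+2)+ix] = rk*rk;
    tab_out[iy*(size_x+2)+ix] = new_value;
}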
I am trying to use OpenCL from Haskell. I wrote a simple program by converting a working one from C. It appears to work well, but when I assign a memory object to the kernel parameters, it fails with a CL_INVALID_MEM_OBJECT error. I don't know how to fix this, because I use the same calls as in the C program, and there it works.
programSource is the OpenCL code:
programSource :: String
programSource = "__kernel void duparray(__global float *in, __global float *out ){ int id = get_global_id(0); out[id] = 2*in[id]; }"
The initialization works until the call to clSetKernelArg, which fails with Just (ErrorCode (-38)):
-- test openCL
input <- newArray ([0,1,2,3,4] :: [CFloat])
Right mem_in <- clCreateBuffer myContext (memFlagsJoin [clMemReadOnly,clMemCopyHostPtr]) (4*5) (castPtr input)
Right mem_out <- clCreateBuffer myContext clMemWriteOnly (4*5) nullPtr
print (mem_in,mem_out)
Right program <- clCreateProgramWithSource myContext programSource
print program
err <- clBuildProgram program [myDeviceId] "" buildProgramCallback nullPtr
print err
Right kernel <- clCreateKernel program "duparray"
print kernel
kaErr0 <- clSetKernelArg kernel 0 (fromIntegral.sizeOf $ mem_in) (castPtr mem_in)
kaErr1 <- clSetKernelArg kernel 1 (fromIntegral.sizeOf $ mem_out) (castPtr mem_out)
print (kaErr0,kaErr1)
I'm using OpenCLRaw, with several modifications that I put at https://github.com/zhensydow/OpenCLRaw
I found that I need to pass the address of the mem buffer pointer, not the pointer itself. This is the right way to call clSetKernelArg:
dir_mem_in <- (malloc :: IO (Ptr Mem))
poke dir_mem_in mem_in
kaErr0 <- clSetKernelArg kernel 0 (fromIntegral.sizeOf $ mem_in) (castPtr dir_mem_in)
dir_mem_out <- (malloc :: IO (Ptr Mem))
poke dir_mem_out mem_out
kaErr1 <- clSetKernelArg kernel 1 (fromIntegral.sizeOf $ mem_out) (castPtr dir_mem_out)
print (kaErr0, kaErr1)
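This matches the usual C idiom, where clSetKernelArg receives the address of the cl_mem handle rather than the handle itself. A minimal sketch (mirroring the names of the Haskell snippet; error checking omitted):
/* In C: pass the size of the handle and a pointer to it. */
cl_int err;
cl_mem mem_in = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               4 * 5, input, &err);
cl_int kaErr0 = clSetKernelArg(kernel, 0, sizeof(cl_mem), &mem_in);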