Is it possible to run a 4-dimensional work item in pyopencl? - opencl

I have a pyopencl-based code that runs perfectly fine for 3-dimensional work groups, but when moving to 4-dimensional work groups it breaks down with the error:
pyopencl._cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_DIMENSION
Digging around, I found this answer to another question, which implies that OpenCL in fact allows higher-dimensional work groups.
So my question is whether it is possible to change this setting in pyopencl. From another answer elsewhere, I understand that pyopencl simply passes the dimensions through, but given the error I get, I think there must be some issue.
Here is a minimal sample code that replicates the error.
It works fine for the first kernel function but breaks down on the second one.
import pyopencl as cl
import numpy as np
context = cl.create_some_context()
queue = cl.CommandQueue(context)
kernel_code = """
__kernel void fun3d( __global double *output)
{
    size_t i = get_global_id(0);
    size_t j = get_global_id(1);
    size_t k = get_global_id(2);
    size_t I = get_global_size(0);
    size_t J = get_global_size(1);

    size_t idx = k*J*I + j*I + i;

    output[idx] = idx;
}

__kernel void fun4d( __global double *output)
{
    size_t i = get_global_id(0);
    size_t j = get_global_id(1);
    size_t k = get_global_id(2);
    size_t l = get_global_id(3);
    size_t I = get_global_size(0);
    size_t J = get_global_size(1);
    size_t K = get_global_size(2);

    size_t idx = l*K*J*I + k*J*I + j*I + i;

    output[idx] = idx;
}
"""
program = cl.Program(context, kernel_code).build()
I = 2
J = 3
K = 4
L = 5
output3d = np.zeros((I*J*K)).astype(np.float64)
cl_output3d = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, output3d.nbytes)
program.fun3d(queue, (I,J,K), None, cl_output3d)
cl.enqueue_copy(queue, output3d, cl_output3d)
queue.finish()
import code; code.interact(local=dict(globals(), **locals()))
# 4d attempt
output4d = np.zeros((I*J*K*L)).astype(np.float64)
cl_output4d = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, output4d.nbytes)
program.fun4d(queue, (I,J,K,L), None, cl_output4d)
cl.enqueue_copy(queue, output4d, cl_output4d)
queue.finish()

Trying to specify more dimensions than the implementation supports is not going to work.
The maximum number of supported dimensions can be queried via CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, or from a terminal, for example:
$ clinfo | grep dim
Max work item dimensions 3
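For completeness, the same limit can be queried from pyopencl directly, and the usual workaround is to fold the extra dimension into one of the three supported ones and split it again inside the kernel. A minimal sketch, reusing the names from the question (the fused index kl and the extra K kernel argument are additions, not part of the original code):

import numpy as np
import pyopencl as cl

context = cl.create_some_context()
print(context.devices[0].max_work_item_dimensions)  # typically 3

# Fold the 4th dimension L into the 3rd one and split it again in the kernel.
kernel_code = """
__kernel void fun4d(__global double *output, const int K)
{
    size_t i  = get_global_id(0);
    size_t j  = get_global_id(1);
    size_t kl = get_global_id(2);   // fused index: kl = l*K + k
    size_t k  = kl % K;
    size_t l  = kl / K;
    size_t I  = get_global_size(0);
    size_t J  = get_global_size(1);

    size_t idx = ((l*K + k)*J + j)*I + i;
    output[idx] = idx;
}
"""
queue = cl.CommandQueue(context)
program = cl.Program(context, kernel_code).build()

I, J, K, L = 2, 3, 4, 5
output4d = np.zeros(I*J*K*L, dtype=np.float64)
cl_output4d = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, output4d.nbytes)
program.fun4d(queue, (I, J, K*L), None, cl_output4d, np.int32(K))
cl.enqueue_copy(queue, output4d, cl_output4d)
queue.finish()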

Related

OpenCL how to change the memory address of cl_mem?

I want to do a sub-matrix multiplication. Say I have a function:
void MatMul(cl_mem A, cl_mem B, cl_mem C, int M, int K, int N)
where A is M*K, B is K*N, and C is M*N, and A, B, C are all row-major one-dimensional arrays passed from host memory (float *h_A, *h_B, *h_C) with the following function:
void ocl_push_array(cl_mem d_x, float *h_x, int n){
size_t data_size = sizeof(float)*n;
err = clEnqueueWriteBuffer(queue, d_x, CL_TRUE, 0, data_size, h_x, 0, NULL, NULL);
}
I want to ask:
if I want to do sub-matrix multiplication, say slicing A by row:
// cl_mem A, B, C;
for(int x=0; x<M; x+=16)
{
    cl_mem A_sub = (cl_mem)((float *)A+x*K);
    cl_mem C_sub = (cl_mem)((float *)C+x*N);
    if((M-x+1)>=16)
        MatMul(A_sub, B, C_sub, 16, K, N);
    else
        MatMul(A_sub, B, C_sub, M-x+1, K, N);
}
Is this the right code for the operation? I get a runtime error saying "CL_INVALID_MEM_OBJECT" (-38) when the arguments are assigned to the OpenCL kernel (clSetKernelArg).
The reason I want to do this is that I found the matrix multiplication produces wrong answers when my input matrices A and B become large.
My OpenCL kernel is:
#define BLOCK_SIZE 16
#define AS(i, j) As[j + i * BLOCK_SIZE]
#define BS(i, j) Bs[j + i * BLOCK_SIZE]
__kernel void
matrixMul(__global float* A, __global float* B, __global float* C,
          __local float* As, __local float* Bs, int uiWA, int uiWB)
{
    int bx = get_group_id(0);
    int by = get_group_id(1);
    int tx = get_local_id(0);
    int ty = get_local_id(1);
    int aBegin = uiWA * BLOCK_SIZE * by;
    int aEnd = aBegin + uiWA - 1;
    int aStep = BLOCK_SIZE;
    int bBegin = BLOCK_SIZE * bx;
    int bStep = BLOCK_SIZE * uiWB;
    float Csub = 0.0f;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        AS(ty, tx) = A[a + uiWA * ty + tx];
        BS(ty, tx) = B[b + uiWB * ty + tx];
        barrier(CLK_LOCAL_MEM_FENCE);

        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
and the size is:
#define BLOCK_SIZE 16

size_t localWorkSize[] = {BLOCK_SIZE, BLOCK_SIZE};
size_t globalWorkSize[] = {shrRoundUp(BLOCK_SIZE, N), shrRoundUp(BLOCK_SIZE, M)};

size_t shrRoundUp(int group_size, int global_size)
{
    int r = global_size % group_size;
    if(r == 0)
    {
        return global_size;
    } else
    {
        return global_size + group_size - r;
    }
}
The code is adapted from the Nvidia OpenCL matrix multiplication sample. My GPU is an Intel(R) HD Graphics 4600.
Thanks!
I don't think you can do this:
cl_mem A_sub = (cl_mem)((float *)A+x*K);
This is because cl_mem is an opaque object in OpenCL; it is actually a complex data structure rather than just a data pointer. It maintains information such as the pointer to the actual memory, a reference count for the object, memory properties, and so on. Different runtimes may even have different implementations of the cl_mem object. That's why you get the CL_INVALID_MEM_OBJECT error message.
What you can do to get the wanted data for the sub-matrix is one of the following:
1. Define two new cl_mem objects and use a separate kernel to do the copy work (see the sketch below).
2. Use the clEnqueueCopyBuffer function to copy the data from the host side.
3. Create the buffer with CL_MEM_ALLOC_HOST_PTR, then use clEnqueueMapBuffer to map the GPU memory to a host memory pointer, modify the memory content through the mapped pointer, and, when you finish, unmap the pointer to return the memory to the device memory domain.
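To illustrate option 1, here is a minimal pyopencl sketch (names and sizes are illustrative, not taken from the question; the same idea maps directly onto the C API with clCreateBuffer plus a small copy kernel): allocate a separate buffer for the row slice and fill it on the device, instead of doing pointer arithmetic on cl_mem.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

copy_src = """
__kernel void copy_rows(__global const float *src, __global float *dst,
                        const int row_offset, const int row_len)
{
    int i = get_global_id(0);                  // index into the slice
    dst[i] = src[row_offset * row_len + i];    // shift the read by whole rows
}
"""
prog = cl.Program(ctx, copy_src).build()

M, K = 64, 32
x, rows = 16, 16                               # copy rows x .. x+rows-1 of A
h_A = np.random.rand(M, K).astype(np.float32)

A     = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_A)
A_sub = cl.Buffer(ctx, mf.READ_WRITE, rows * K * 4)

prog.copy_rows(queue, (rows * K,), None, A, A_sub, np.int32(x), np.int32(K))
# A_sub can now be handed to MatMul in place of A.

Option 2 achieves the same thing with a single host-side call: clEnqueueCopyBuffer takes byte offsets, so the source offset would be x*K*sizeof(float).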

OpenCL: Inserting local atomic_inc to reduction kernel

I am trying to include a local atomic similar to that described by DarkZeros here within a working reduction kernel. The kernel finds a largest value within a set of points; the aim of the local atomic is to allow me to filter selected point_ids into an output array without any gaps.
At present, when I use the local atomic to increment additions to a local array, the kernel runs but produces the wrong overall highest point. If the atomic line is commented out, a correct result is returned.
What is going on here and how do I fix it?
Simplified kernel code:
__kernel void reduce(__global const float4* dataSet, __global const int* input, const unsigned int items, //points and index
                     __global int* output, __local float4* shared, const unsigned int n, //finding highest
                     __global int* filtered, __global const float2* tri_input, const unsigned int pass, //finding filtered
                     __global int* global_count //global count
                     ){
    //set everything up
    const unsigned int group_id = get_global_id(0) / get_local_size(0);
    const unsigned int local_id = get_local_id(0);
    const unsigned int group_size = items;
    const unsigned int group_stride = 2 * group_size;
    const int local_stride = group_stride * group_size;

    __local float4 *zeroIt = &shared[local_id];
    zeroIt->x = 0; zeroIt->y = 0; zeroIt->z = 0; zeroIt->w = 0;

    volatile __local int local_count_set_1;
    volatile __local int global_val_set_1;
    volatile __local int filter_local[64];
    if(local_id==0){
        local_count_set_1 = 0;
        global_val_set_1 = -1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int i = group_id * group_stride + local_id;
    while (i < n){
        //load up a pair of points using the index to locate them within a massive dataSet
        int ia = input[i];
        float4 a = dataSet[ia-1];
        int ib = input[i + group_size];
        float4 b = dataSet[ib-1];

        //on the first pass kernel increment a local count
        if(pass == 0){
            filter_local[atomic_inc(&local_count_set_1)] = 1; //including this line causes an erroneous highest point result
            //filter_local[local_id] = 1; //but including this line does not
            //atomic_inc(&local_count_set_1); //and neither does this one
        }

        //find the highest of the pair
        float4 result;
        if(a.z>b.z) result = a;
        else result = b;

        //load up the previous highest result locally
        float4 s = shared[local_id];
        //if the previous highest beat this, stick, else twist
        if(s.z>result.z){ result = s; }
        shared[local_id] = result;

        i += local_stride;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    if (group_size >= 512){
        if (local_id < 256) {
            __local float4 *a = &shared[local_id];
            __local float4 *b = &shared[local_id+256];
            if(b->z>a->z){ shared[local_id] = shared[local_id+256]; }
    }}

    //repeat barrier ops in increments down to group_size>=2 - this filters the highest result in shared
    //finally, return the filtered highest result of shared to the global level
    barrier(CLK_LOCAL_MEM_FENCE);
    if(local_id == 0){
        __local float4 *v = &shared[0];
        int send = v->w ;
        output[group_id] = send+1;
}}
[UPDATE]: When the atomic_inc line is included, the 'wrong' highest point result is always a point near the end of the test dataset. I'm guessing that this means the atomic_inc is affecting a later comparison, but I'm not sure exactly what or where yet.
[UPDATE]: Edited code to simplify/clarify/update with debugging tweaks. Still not working and it is driving me loopy.
Total face-palm moment. In the setup phase of the kernel there are the lines:
if(local_id==0){
    local_count_set_1 = 0;
    global_val_set_1 = -1;
}
barrier(CLK_LOCAL_MEM_FENCE);
When these are split and the reset of local_count_set_1 is moved inside the while loop, the error does not occur (presumably because, without the reset, the counter kept growing across loop iterations, so the atomic_inc indices ran past the end of filter_local[64] and overwrote adjacent local memory used by the reduction), i.e.:
if(local_id==0) global_val_set_1 = -1;
barrier(CLK_LOCAL_MEM_FENCE);
while (i < n){
    if(local_id==0) local_count_set_1 = 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    ....
    if(pass == 0){
        filter_local[atomic_inc(&local_count_set_1)] = 1;
    }
    ....
I'm hoping this fixes the issue // will update if not.
Aaaand that's a weekend I'll never get back.

substitutions for cl_khr_int64_base_atomics

I have an ATI FirePro V4800 graphics card, which does not support cl_khr_int64_base_atomics. I am trying to adapt the RadixSort algorithm to long integers. The algorithm uses atomic_inc, the 64-bit version of which is atom_inc, which I cannot use in the kernel. So my question is: is there a piece of code that performs the same function as 64-bit atomic_inc and that I can use instead? The relevant piece of kernel code is given below:
__kernel void histogram(__global uint* unsortedData,
                        __global uint* buckets,
                        uint shiftCount,
                        __local uint* sharedArray)
{
    size_t localId = get_local_id(0);
    size_t globalId = get_global_id(0);
    size_t groupId = get_group_id(0);
    size_t groupSize = get_local_size(0);
    uint numGroups = get_global_size(0) / get_local_size(0);

    // Initialize shared array to zero //
    sharedArray[localId] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Calculate thread-histograms //
    uint value = unsortedData[globalId];
    value = value >> shiftCount & 0xFFU;
    atomic_inc(sharedArray+value);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Copy calculated histogram bin to global memory //
    uint bucketPos = groupId * groupSize + localId ;
    //uint bucketPos = localId * numGroups + groupId ;
    buckets[bucketPos] = sharedArray[localId];
}
Any suggestions? Thank you.
Edit:
Another way to do the same thing is given on this blog: http://suhorukov.blogspot.in/2011/12/opencl-11-atomic-operations-on-floating.html. It gives a very generic implementation of atomic increment.
You could try something like this:
void atomInc64 (__local uint *counter)
{
    uint old, carry;

    old = atomic_inc (&counter [0]);
    carry = old == 0xFFFFFFFF;
    atomic_add (&counter [1], carry);
}
Where counter is an array of two 32-bit integers. While the two halves don't increment at exactly the same time, the total should be correct when the program completes.
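To show how this can be wired up, here is a hedged pyopencl sketch (the kernel and host names are illustrative, not from the question): each work-group keeps a two-word local counter, bumps it with atomInc64, and work-item 0 folds the two halves into a 64-bit value at the end.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

src = """
void atomInc64(__local uint *counter)
{
    uint old, carry;
    old = atomic_inc(&counter[0]);        // low word
    carry = old == 0xFFFFFFFF;            // did the low word wrap?
    atomic_add(&counter[1], carry);       // propagate into the high word
}

__kernel void count(__global ulong *total)
{
    __local uint counter[2];              // [0] = low 32 bits, [1] = high 32 bits
    if (get_local_id(0) == 0) { counter[0] = 0; counter[1] = 0; }
    barrier(CLK_LOCAL_MEM_FENCE);

    atomInc64(counter);                   // every work-item bumps the counter once

    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        total[get_group_id(0)] = ((ulong)counter[1] << 32) | counter[0];
}
"""

prog = cl.Program(ctx, src).build()
n_groups, local_size = 4, 64
out = np.zeros(n_groups, dtype=np.uint64)
out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, out.nbytes)
prog.count(queue, (n_groups * local_size,), (local_size,), out_buf)
cl.enqueue_copy(queue, out, out_buf)
print(out)  # each group should report local_size increments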

How to pass an array of vectors in pyOpenCL

I'm moving a simulation into pyOpenCL and can't get my data access to work. I'm trying to supply a 1D array of vectors (well, actually several, but the example I've included just uses one).
Currently, several vectors are copied over just fine, but then the data is simply not what I supplied.
I don't think I've posted here before, so apologies if any of the formatting/presentation is wrong. Also, I've just stripped out all the simulation code, so I realise this code is currently not actually doing anything, I just want to get the buffer passing correct.
Thanks in advance.
The kernel (kertest.py):
step1 = """
#pragma OPENCL EXTENSION cl_amd_printf: enable
#define X xdim
#define Y ydim

__kernel void k1(__global float3 *spins,
                 __local float3 *tile)
{
    ushort lid = 2 * get_local_id(0);
    ushort group = 2 * get_group_id(0);
    ushort num = get_num_groups(0);

    int lim = X*Y*3;
    for (ushort i = 0; i < lim; i++)
    {
        if (lid == 0 && group == 0)
        {
            printf("%f :: %d\\n", spins[i].x, i);
        }
    }
}"""
The code itself (gputest.py):
import kertest as k2D
import numpy as np
import pyopencl as cl
class GPU_MC2DSim():
    def __init__(self, x, y):
        self.x = x
        self.y = y
        if x >= y:
            self.xdim = int(self.x)
            self.ydim = int(self.y)
        else:
            self.xdim = int(self.y)
            self.ydim = int(self.x)
        if self.xdim % 2 != 0: self.xdim += 1
        if self.ydim % 2 != 0: self.ydim += 1
        self.M = np.ones((self.xdim*self.ydim, 3)).astype(np.float32)
        self.M[:, 1] += 1.0
        self.M[:, 2] += 2.0
        print self.M

    def simulate(self):
        ctx = cl.create_some_context()
        q = cl.CommandQueue(ctx)
        mf = cl.mem_flags

        #Pass buffer:
        M_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf = self.M)

        #Insert kernel parameters:
        params = {"xdim" : "%d" % (self.xdim),
                  "ydim" : "%d" % (self.ydim),
                  }
        for name in params:
            k2D.step1 = k2D.step1.replace(name, params[name])

        #Compile kernel:
        step1 = cl.Program(ctx, k2D.step1).build()
        locmem = cl.LocalMemory(self.xdim*4*4)
        step1.k1(q, ((self.xdim*self.ydim)/4,), (self.xdim/2,), M_buf, locmem).wait()
        return None

xdim = 4
ydim = 4

sim = GPU_MC2DSim(xdim, ydim)
sim.simulate()
Your code for copying the data to the device is just fine. However, your kernel has at least two problems:
float3 values are expected to be 16-byte aligned, as per OpenCL 1.2 Spec, 6.1.5:
For 3-component vector data types, the size of the data type is 4 * sizeof(component). This means that a 3-component vector data type will be aligned to a 4 * sizeof(component) boundary. The vload3 and vstore3 built-in functions can be used to read and write, respectively, 3-component vector data types from an array of packed scalar data type.
The values you upload to the device are not properly aligned for the kernel to read float3 values directly.
Your limit calculation int lim = X*Y*3; is slightly off. You are already trying to read from an array of float3, so the *3 is superfluous.
The solution to both problems is simple: as stated in the spec, you should use vload3 to load from an array of floats:
#pragma OPENCL EXTENSION cl_amd_printf: enable
#define X xdim
#define Y ydim

__kernel void k1(__global float *spins,
                 __local float3 *tile)
{
    ushort lid = 2 * get_local_id(0);
    ushort group = 2 * get_group_id(0);
    ushort num = get_num_groups(0);

    int lim = X*Y;
    for (ushort i = 0; i < lim; i++)
    {
        if (lid == 0 && group == 0)
        {
            float3 vec = vload3(i, spins);
            printf("(%f, %f, %f) :: %d\\n", vec.x, vec.y, vec.z, i);
        }
    }
}
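As an alternative to vload3 (not part of the answer above, just another common way to satisfy the alignment rule quoted from the spec), the host array can be padded to four components per spin so that every element starts on a 16-byte boundary and the kernel can keep indexing a float3 (or float4) pointer directly. A sketch of the host-side change:

import numpy as np

xdim, ydim = 4, 4
# Four components per spin instead of three; the last one is unused padding.
M = np.ones((xdim * ydim, 4), dtype=np.float32)
M[:, 1] += 1.0
M[:, 2] += 2.0
M[:, 3] = 0.0   # pad lane

# With this layout, "__global float3 *spins" in the kernel can read spins[i]
# directly, because each element now occupies 16 bytes.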

Can this OpenCL code be optimized?

I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B, and a constant c, return a 1xD vector r where r[i] = c * sum_over_j( v[j] * A[i][j] * B[i][j] ).
Below is what I have so far, but it runs freakishly slow. A version without summing that returns a DxD matrix is about ten times faster. It's called from PyOpenCL if that makes any difference.
Is anything done wrong? Could it be optimized?
#define D 1000
...
__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
{
    int y = get_global_id(1);
    float sum = 0;
    for(int k = 0; k < D; k++)
    {
        sum += vector[k] * matrix[(y*D) + k]
             * matrix2[(y*D) + k ];
    }
    result[y] = sum * factor;
}
Cheers!
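For reference, the computation being described is small enough to state in a couple of lines of NumPy on the host, which is handy for checking whichever kernel variant you end up with (this snippet is not part of the original post):

import numpy as np

D = 1000
c = 2.0
v = np.random.rand(D).astype(np.float32)
A = np.random.rand(D, D).astype(np.float32)
B = np.random.rand(D, D).astype(np.float32)

# r[i] = c * sum_j v[j] * A[i][j] * B[i][j]  ==  c * ((A*B) @ v)[i]
r_ref = c * (A * B).dot(v)

# compare against the buffer read back from the device, e.g.:
# assert np.allclose(result, r_ref, rtol=1e-4)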
Optimization #1: make vector __local.
My first pass at this got a decent improvement in performance. I noticed that each vector[k] is read a total of D times, so I copied it to __local memory. This is only possible because D is small enough to allow it. The kernel as you have it above suffers from a terrible ALU:fetch ratio of 0.08 on both the 5870 and the 6970 GPUs. Even the slower GPUs are still waiting on the memory access.
#define D 1000
__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
{
    int y = get_global_id(0);
    float sum = 0;

    __local float vectCopy[D];
    int ls = get_local_size(0);
    int lid = get_local_id(0);
    for(int i=0;i<D;i+=ls){
        if(i+lid < D) vectCopy[i+lid] = vector[i+lid]; // guard: D need not be a multiple of the local size
    }
    barrier(CLK_LOCAL_MEM_FENCE); // every work-item reads the whole copy, so the group must synchronize here

    for(int k = 0; k < D; k++)
    {
        sum += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k ];
    }
    result[y] = sum * factor;
}
With this change, APP profiler shows a new ALU:fetch ratio of 0.20 for the 5870 and 6970 GPUs. Average times changed from 1513 --> 1034 and 1261 --> 861 on the same cards. The low-end GPUs are now bound by ALU instead of fetch (greater than a 4:1 ratio).
Optimization #2: calculate each result[y] using an entire work group.
You would have to do this if D were much larger (100k+). The idea is to get the best memory access pattern by using the work group to compute a single element of the result at a time. I defined ls (local size) to be 64 here because it works on my hardware, as well as most vendors'. The work-group size you use from the host side will have to be 64 unless you change that definition. It needs to be a compile-time definition to create the sum[ls] storage as __local, and I don't like passing variable-sized __local vars into my kernels.
Results: 5870 ALU:fetch = 0.59:1, avg = 708; 6970 ALU:fetch = 0.72, avg = 590. According to APP profiler, this is about twice as fast as your original listing.
#define D 1000
#define ls 64
__kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
{
    __local float vectCopy[D];
    int lid = get_local_id(0);
    for(int i=0;i<D;i+=ls){
        if(i+lid < D) vectCopy[i+lid] = vector[i+lid]; // guard: D need not be a multiple of ls
    }
    barrier(CLK_LOCAL_MEM_FENCE); // the whole copy must be visible to every work-item

    int ng = get_num_groups(0);
    int gid = get_group_id(0);
    int y, k;
    __local float sum[ls];

    for(y = gid; y < D; y+=ng){
        sum[lid] = 0.0f; // clear this work-item's partial sum for the new row
        for(k = lid; k < D; k+=ls)
        {
            sum[lid] += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k ];
        }
        barrier(CLK_LOCAL_MEM_FENCE); // all partial sums must be written before work-item 0 reads them
        if(lid==0){
            result[y] = sum[0];
            for(k=1;k<ls;k++){
                result[y] += sum[k];
            }
            result[y] *= factor;
        }
        barrier(CLK_LOCAL_MEM_FENCE); // keep the group together before sum[] is reused
    }
}
EDIT: APP profiler = AMD APP KernelAnalyzer
