Workgroup Bound Check not working - opencl

In my OpenCL kernel I'm checking whether the global ID is inside the global problem size, but it is not working.
If the global problem size is divisible by the work-group size, everything is fine. If not, the kernel does nothing at all.
__kernel void move_points(const unsigned int points,
                          const unsigned int floors,
                          const unsigned int gridWidth,
                          const unsigned int gridHeight,
                          __global const GraphData *graph,
                          __global const float *pin_x,
                          __global const float *pin_y,
                          __global const float *pin_z,
                          __global float *pout_x,
                          __global float *pout_y,
                          __global float *pout_z,
                          __global clrngMrg31k3pHostStream *streams)
{
    int id = get_global_id(0);
    /* skip work-items that fall outside the problem size */
    if (id < points) {
        /* do kernel things... */
    }
}
Does anybody know where the problem is?
Thanks a lot. Robin.

If your global size is not divisible by your local size (work-group size), the kernel will not run at all, at least in OpenCL 1.x.
The clEnqueueNDRangeKernel() call (enqueueNDRangeKernel() in the C++ wrapper) returns CL_INVALID_WORK_GROUP_SIZE, as specified in its documentation.
If you want to follow the CUDA model, where some work-items may go unused, keep the check in the kernel (as you already have) and use a larger global size that is a multiple of your local size.
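Padding the global size as described comes down to one rounding step on the host. Here is a minimal sketch in plain C; round_up_global_size is a hypothetical helper name, not part of the OpenCL API, and it assumes a one-dimensional range.

```c
#include <assert.h>
#include <stddef.h>

/* Round global_size up to the next multiple of local_size.
   Hypothetical helper for illustration; not an OpenCL API function. */
size_t round_up_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    if (remainder == 0)
        return global_size;          /* already a multiple */
    return global_size + (local_size - remainder);
}
```

The result would be passed as the global work size when enqueueing the kernel, while the kernel's points argument keeps the real problem size, so the existing if (id < points) check masks off the padding work-items. For example, 1000 points with a local size of 64 would be launched with a global size of 1024.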

Related

Worse performance when using a bigger multiple work group size of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

I am writing a matrix multiplication program in OpenCL. To make use of local memory, I divide the original matrix into block_size * block_size sub-matrices, where block_size is the square root of the work-group size, and each sub-matrix is computed by one work group.
My GPU is an RX580. CL_KERNEL_WORK_GROUP_SIZE returns 256, whose square root is 16, while CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 64. After some profiling, I found that an 8*8 work group runs 4 times faster than a 16*16 one, which seems unintuitive to me. Is there an explanation for this?
EDIT: CodeXL screenshots for the 8*8 work group and the 16*16 work group.
I noticed something unusual. How can SALUBusy% exceed 100% in the 8*8 work group case?
EDIT2: Added my kernel code, which might be useful.
size_t getIndex(int row, int col, int width)
{
    return row*width+col;
}
__kernel void matrixMul(__global const float* restrict a,
                        __global const float* restrict b, //assume transposed
                        __local float* restrict a_local,
                        __local float* restrict b_local,
                        __global float* restrict result)
{
    const int row=get_global_id(0);
    const int col=get_global_id(1);
    const int limit=get_global_size(0);
    const int blockSize=get_local_size(0);
    const int gidRow=get_local_id(0);
    const int gidCol=get_local_id(1);
    float sum=0;
    for(int blockIndex=0; blockIndex < (limit/blockSize); ++blockIndex)
    {
        /*copy block -> local memory*/
        a_local[getIndex(gidRow, gidCol, blockSize)]=a[getIndex(row, col, limit)];
        b_local[getIndex(gidRow, gidCol, blockSize)]=b[getIndex(row, col, limit)];
        barrier(CLK_LOCAL_MEM_FENCE);
        /*block multiply*/
        __local float* restrict a_local_row=&a_local[getIndex(gidRow, 0, blockSize)];
        __local float* restrict b_local_row=&b_local[getIndex(gidCol, 0, blockSize)];
        for(int i=0; i<blockSize; ++i)
        {
            sum+= (*a_local_row) * (*b_local_row);
            ++a_local_row;
            ++b_local_row;
        }
    }
    result[getIndex(row, col, limit)]=sum;
}

OpenCL Local Memory Declaration

What is the difference between declaring local memory as follows:
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
const __global int* A,
const __global int* B,
__global char* C,
__local int* restrict block_a,
__local int* restrict block_b)
and declaring local memory inside the kernel
#define a_size 1024
#define b_size (1024 * 1024)
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
                   const __global int* A,
                   const __global int* B,
                   __global char* C) {
    __local int block_a[a_size];
    __local int block_b[b_size];
    ...
}
In both cases, every thread will update a single cell in the shared A and B arrays.
I understand that it's not possible to have "variable" length arrays in the kernel (hence the #define at the top of the second kernel), but is there any other difference? Is there any difference with regards to when the memory is freed?
In both cases, local memory exists for the lifetime of the work-group. The only difference, as you have noted, is that passing the local memory pointer as an argument allows the size of the buffer to be specified dynamically, rather than being a compile-time constant. Different work-groups will always use different local memory allocations.
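The second method is easier to port to CUDA, because __shared__ memory in CUDA (the equivalent of __local in OpenCL) cannot be passed in as a kernel argument the way the first case does. In CUDA, dynamically sized shared memory is instead declared as extern __shared__ inside the kernel and sized at launch.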

Can you pass const unsigned int4* to a kernel?

I have instructions to use:
__kernel void myKernel(__global const unsigned int4* data
But I get CL_INVALID_PROGRAM_EXECUTABLE whenever I try to build it. However, both of these build without error:
__kernel void myKernel(__global const int4* data
__kernel void myKernel(__global const unsigned int* data
"unsigned int" is a valid type, but "unsigned int4" is not. I think what you're looking for is "uint4". See section 6.1.2 of the specification ("Built-in Vector Data Types").

OpenCL structure declarations in different memory spaces

In OpenCL, what are the consequences of, and differences between, the following struct declarations? If they are illegal, why?
struct gr_array
{
int ndims;
__global m_integer* dim_size;
__global m_real* data;
};
typedef struct gr_array g_real_array;
struct lr_array
{
int ndims;
__local m_integer* dim_size;
__local m_real* data;
};
typedef struct lr_array l_real_array;
__ kernel temp(...){
__local g_real_array A;
g_real_array B;
__local l_real_array C;
l_real_array D;
}
My question is: where will the structures (and their members) be allocated? Who can access them? And is this good practice or not?
EDIT
How about this:
struct r_array
{
__local int ndims;
};
typedef struct r_array real_array;
__ kernel temp(...){
__local real_array A;
real_array B;
}
If a work-item modifies ndims in struct B, is the change visible to other work-items in the work-group?
I've rewritten your code as valid CL, or at least CL that will compile. Here:
typedef struct gr_array {
    int ndims;
    global int* dim_size;
    global float* data;
} g_float_array;
typedef struct lr_array {
    int ndims;
    local int* dim_size;
    local float* data;
} l_float_array;
kernel void temp() {
    local g_float_array A;
    g_float_array B;
    local l_float_array C;
    l_float_array D;
}
One by one, here's how this breaks down:
A is in local space. It's a struct that is composed of one int and two pointers. These pointers point to data in global space, but are themselves allocated in local space.
B is in private space; it's an automatic variable. It is composed of an int and two pointers that point to stuff in global memory.
C is in local space. It contains an int and two pointers to stuff in local space.
D, you can probably guess at this point. It's in private space, and contains an int and two pointers that point to stuff in local space.
I cannot say whether any of these is preferable for your problem, since you haven't described what you are trying to accomplish.
EDIT: I realized I didn't address the second part of your question -- who can access the structure fields.
Well, you can access the fields anywhere the variable is in scope. I'm guessing you thought that the fields you had marked as global in g_float_array were themselves in global space (and in local space for l_float_array). But they're just pointers to data in global (or local) space.
So, you'd use them like this:
kernel void temp(
    global float* data, global int* global_size,
    local float* data_local, local int* local_size,
    int num)
{
    local g_float_array A;
    g_float_array B;
    local l_float_array C;
    l_float_array D;
    A.ndims = B.ndims = C.ndims = D.ndims = num;
    A.data = B.data = data;
    A.dim_size = B.dim_size = global_size;
    C.data = D.data = data_local;
    C.dim_size = D.dim_size = local_size;
}
By the way -- if you're hacking CL on a Mac running Lion, you can compile .cl files using the "offline" CL compiler, which makes experimenting with this kind of stuff a bit easier. It's located here:
/System/Library/Frameworks/OpenCL.framework/Libraries/openclc
There is some sample code here.
It probably won't work, because current GPUs have separate memory spaces for OpenCL kernels and for the ordinary host program. You have to make explicit calls to transfer data between the two spaces, and this is often the bottleneck of the program (because the bandwidth of the PCI Express link to the graphics card is low compared with the GPU's own memory).

Problem reinterpreting parameters in OpenCL 1.0

Is it possible to reinterpret parameters that have been passed into an OpenCL kernel? For example, if I have an array of integers being passed in, but I want to interpret the integer at index 16 as a float (don't ask why!), then I would have thought this would work:
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = *( (__constant float*) &constArray[16] );
    im[0] = x;
}
However, I get a CL_INVALID_COMMAND_QUEUE error when I next try to use the command queue, implying that the above code has performed an illegal operation.
Any suggestions as to what is wrong with the above, and/or how to achieve the reinterpretation?
I have now tried:
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(0x3f800000);
    im[0] = x;
}
and this does indeed give a 1.0f in im[0]. However,
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(constArray[16]);
    im[0] = x;
}
always results in zero in im[0] regardless of what is in constArray[16].
Regards,
Mark.
OpenCL includes the as_type and as_typen family of operators for reinterpreting the bits of a value as another type of the same size. If I understand the question correctly, you should be able to do something like this:
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(constArray[16]);
    im[0] = x;
}
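To check on the host which float a given integer should turn into, the bit reinterpretation that as_float performs can be mimicked in plain C. This is only a sketch: host_as_float is a hypothetical helper name, and it assumes a 32-bit IEEE 754 float on the host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 32-bit integer as a float, without any
   numeric conversion. Hypothetical host-side helper for illustration. */
float host_as_float(int32_t bits)
{
    float value;
    assert(sizeof value == sizeof bits);  /* assumes a 32-bit float */
    memcpy(&value, &bits, sizeof value);  /* copy the raw bytes */
    return value;
}
```

Storing 0x3f800000 at index 16 of the host array and comparing im[0] against host_as_float(0x3f800000), which is 1.0f, is one way to tell whether the kernel is reading the element that was intended.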

Resources