Passing a kernel global buffer to a local function in OpenCL

Say we have a kernel function:
kernel void function(global const float* a, global const float* b, global float* c, int nElements)
{
...
c[gid] = a[gid] * b[gid];
}
but we want to break a large, complex kernel up into several smaller functions. How do I pass the global buffers to these smaller functions?
If I do the following I get an error of the form "implicit declaration of function 'cl_axpbyr' is invalid in OpenCL":
kernel void function(global const float* a, global const float* b, global float* c, int nElements)
{
...
cl_axpbyr(1.0f, a, c, nElements);
}
inline void cl_axpy(float alpha, global const float* x, global float* y, int nElements)
{
int gid = get_global_id(0);
if (gid >= nElements)
{
return;
}
y[gid] = alpha*x[gid] + y[gid];
}

First of all, you call:
cl_axpbyr(1.0f, a, c, nElements);
while your function is named:
inline void cl_axpy
You should call cl_axpy instead of cl_axpbyr.
Second, the OpenCL kernel language is just C, so if you define a function after the point where you call it, you have to declare it first. The following code compiles cleanly:
// This is the normal C style function declaration which must exist
inline void cl_axpy(float alpha, global const float* x, global float* y, int nElements);
kernel void function(global const float* a, global const float* b, global float* c, int nElements)
{
cl_axpy(1.0f, a, c, nElements);
}
inline void cl_axpy(float alpha, global const float* x, global float* y, int nElements)
{
int gid = get_global_id(0);
if (gid >= nElements)
{
return;
}
y[gid] = alpha*x[gid] + y[gid];
}
You could also just place the whole cl_axpy definition before your kernel definition. Both ways work fine.
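For completeness, here is a minimal sketch of that second option, using the same names as above, with the helper defined before the kernel so no forward declaration is needed:
// Definition comes first, so no forward declaration is required
inline void cl_axpy(float alpha, global const float* x, global float* y, int nElements)
{
    int gid = get_global_id(0);
    if (gid >= nElements)
    {
        return;
    }
    y[gid] = alpha*x[gid] + y[gid];
}

kernel void function(global const float* a, global const float* b, global float* c, int nElements)
{
    cl_axpy(1.0f, a, c, nElements);
}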

Related

pyopencl - how to use generic types?

I work interchangeably with 32-bit floats and 32-bit integers. I want two kernels that do exactly the same thing, but one is for integers and one is for floats. At first I thought I could use templates or something, but it does not seem to be possible to define two kernels with the same name but different argument types.
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
__kernel void arange(__global float *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
""").build()
Error:
<kernel>:8:15: error: conflicting types for 'arange'
__kernel void arange(__global float *res_g)
^
<kernel>:2:15: note: previous definition is here
__kernel void arange(__global int *res_g)
What is the most convenient way of doing this?
The #define directive can be used for that:
code = """
__kernel void arange(__global TYPE *res_g)
{
int gid = get_global_id(0);
res_g[gid] = gid;
}
"""
prg_int = cl.Program(ctx, code).build("-DTYPE=int")
prg_float = cl.Program(ctx, code).build("-DTYPE=float")

Worse performance when using a work group size that is a bigger multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

I am writing a matrix multiplication program in OpenCL. In order to utilize local memory, I divided the original matrix into block_size * block_size sub-matrices, where block_size equals the square root of the work group size; each sub-matrix is computed by one work group.
My GPU is an RX580, and CL_KERNEL_WORK_GROUP_SIZE returns 256, so the square root is 16, while CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 64. After some profiling, I found that setting block_size to 8*8 performs 4 times faster than setting it to 16*16, which seems unintuitive to me. Is there any explanation for this?
EDIT: CodeXL screenshots of the 8*8 work group case and the 16*16 work group case.
I noticed something unusual. How can SALUBusy% exceed 100% in the 8*8 work group case?
EDIT2: Added my kernel code, which might be useful:
size_t getIndex(int row, int col, int width)
{
return row*width+col;
}
__kernel void matrixMul(__global const float* restrict a,
__global const float* restrict b, //assume tranposed
__local float* restrict a_local,
__local float* restrict b_local,
__global float* restrict result)
{
const int row=get_global_id(0);
const int col=get_global_id(1);
const int limit=get_global_size(0);
const int blockSize=get_local_size(0);
const int gidRow=get_local_id(0);
const int gidCol=get_local_id(1);
float sum=0;
for(int blockIndex=0; blockIndex < (limit/blockSize); ++blockIndex)
{
/*copy block -> local memory*/
a_local[getIndex(gidRow, gidCol, blockSize)]=a[getIndex(row, col, limit)];
b_local[getIndex(gidRow, gidCol, blockSize)]=b[getIndex(row, col, limit)];
barrier(CLK_LOCAL_MEM_FENCE);
/*block multiply*/
__local float* restrict a_local_row=&a_local[getIndex(gidRow, 0, blockSize)];
__local float* restrict b_local_row=&b_local[getIndex(gidCol, 0, blockSize)];
for(int i=0; i<blockSize; ++i)
{
sum+= (*a_local_row) * (*b_local_row);
++a_local_row;
++b_local_row;
}
}
result[getIndex(row, col, limit)]=sum;
}

Where should I define a C function that will be called in C kernel code when using PyOpenCL

Since kernel code in PyOpenCL needs to be written in C, I have written a few functions that need to be called inside the kernel code. Where should I store these functions? And how do I pass a global variable to such a function?
In PyOpenCL my kernel code looks like this:
program = cl.Program(context, """
__kernel void Kernel_OVERLAP_BETWEEN_N_IP_GPU(__constant int *FBNs_array,__local int *Binary_IP, __local int *cc,__global const int *olp)
{
function1(int *x, int *y,__global const int *olp);
}
""").build()
Where should I write and store the function1 function? Should I define it in the kernel itself, or in some other file and provide a path? If I need to define it somewhere else and provide a path, please give me some details; I am completely new to C.
Thanks
Like in C, define it before the kernel:
program = cl.Program(context, """
void function1(int *x, int *y)
{
    //function1 code
}
__kernel void kernel_name()
{
    int x = 1;
    int y[1] = {10};
    function1(&x, y);
}""").build()
program = cl.Program(context, """
void function1(int *x, int *y, __global const int *cc)
{
    *x = 10;
}
__kernel void kernel_name(__global const int *cc)
{
    int x = 1;
    int y[1] = {10};
    function1(&x, y, cc); //now x == 10
}""").build()

OpenCL Local Memory Declaration

What is the difference between declaring local memory as follows:
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
const __global int* A,
const __global int* B,
__global char* C,
__local int* restrict block_a,
__local int* restrict block_b)
and declaring local memory inside the kernel:
#define a_size 1024
#define b_size 1024 * 1024
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
const __global int* A,
const __global int* B,
__global char* C) {
__local int block_a[a_size];
__local int block_b[b_size];
...
}
In both cases, every work-item updates a single cell in the shared A and B arrays.
I understand that it's not possible to have "variable" length arrays in the kernel (hence the #define at the top of the second kernel), but is there any other difference? Is there any difference with regard to when the memory is freed?
In both cases, local memory exists for the lifetime of the work-group. The only difference, as you have noted, is that passing the local memory pointer as an argument allows the size of the buffer to be specified dynamically, rather than being a compile-time constant. Different work-groups will always use different local memory allocations.
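As a point of reference, here is a minimal host-side sketch of how the size of those dynamically sized __local arguments is supplied (the argument indices match the first mmul signature above; the tile size is purely illustrative):
// Arguments 0-5 (Ndim, Mdim, Pdim, A, B, C) are set as usual.
// For __local arguments only a size is passed and the data pointer must be NULL.
size_t tile_elems = 16 * 16;                                  // illustrative tile size
clSetKernelArg(kernel, 6, tile_elems * sizeof(int), NULL);    // block_a
clSetKernelArg(kernel, 7, tile_elems * sizeof(int), NULL);    // block_b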
The second method is better if you want to port the code to CUDA, because __shared__ memory in CUDA (the equivalent of __local in OpenCL) cannot be declared as in the first case, i.e. as a kernel argument.

Problem reinterpreting parameters in OpenCL 1.0

Is it possible to reinterpret parameters that have been passed into an OpenCL kernel? For example, if I have an array of integers being passed in, but I want to interpret the integer at index 16 as a float (don't ask why!), then I would have thought this would work:
__kernel void Test(__global float* im, __constant int* constArray)
{
float x = *( (__constant float*) &constArray[16] );
im[0] = x;
}
However, I get a CL_INVALID_COMMAND_QUEUE error when I next try to use the command queue, implying that the above code has performed an illegal operation.
Any suggestions as to what is wrong with the above, and/or how to achieve the reinterpretation?
I have now tried:
__kernel void Test(__global float* im, __constant int* constArray)
{
float x = as_float(0x3f800000);
im[0] = x;
}
and this does indeed give a 1.0f in im[0]. However,
__kernel void Test(__global float* im, __constant int* constArray)
{
float x = as_float(constArray[16]);
im[0] = x;
}
always results in zero in im[0] regardless of what is in constArray[16].
Regards,
Mark.
OpenCL includes the as_typen family of operators for reinterpret-casting values from one type to another. If I am understanding the question correctly, you should be able to do something like:
__kernel void Test(__global float* im, __constant int* constArray)
{
float x = as_float(constArray[16]);
im[0] = x;
}
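The same family works in the other direction too. As a minimal sketch (a hypothetical kernel, not from the question), as_int reinterprets a float's bits as a 32-bit integer:
__kernel void TestReverse(__global int* out, __constant float* floatArray)
{
    // Reinterpret the bit pattern, no numeric conversion: 1.0f -> 0x3f800000
    out[0] = as_int(floatArray[16]);
}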
