Problem reinterpreting parameters in OpenCL 1.0

Is it possible to reinterpret parameters that have been passed into an OpenCL kernel? For example, if I have an array of integers being passed in, but I want to interpret the integer at index 16 as a float (don't ask why!), then I would have thought this would work:
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = *( (__constant float*) &constArray[16] );
    im[0] = x;
}
However, I get a CL_INVALID_COMMAND_QUEUE error the next time I try to use the command queue, implying that the code above has performed an illegal operation.
Any suggestions as to what is wrong with the above, and/or how to achieve the reinterpretation?
I have now tried:
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(0x3f800000);
    im[0] = x;
}
and this does indeed give a 1.0f in im[0]. However,
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(constArray[16]);
    im[0] = x;
}
always results in zero in im[0] regardless of what is in constArray[16].

OpenCL includes the as_typen family of operators for reinterpret-casting values from one type to another. If I am understanding the question correctly, you should be able to do something like:
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(constArray[16]);
    im[0] = x;
}
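
For reference, the as_typen family covers scalar and vector widths alike. Below is a minimal sketch (hypothetical kernel, standard OpenCL C built-ins) showing two other members of the family:
__kernel void ReinterpretDemo(__global uint* outBits, __global float4* outVec)
{
    /* as_uint reinterprets a float's 32-bit pattern as an unsigned int */
    outBits[0] = as_uint(1.0f);                   /* 0x3f800000 */
    /* as_float4 reinterprets an int4 as a float4, element for element */
    outVec[0] = as_float4((int4)(0x3f800000));    /* (1.0f, 1.0f, 1.0f, 1.0f) */
}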

Related

OpenCL sum `cl_khr_fp64` double values into a single number

From this question and this question I managed to compile a minimal example of summing a vector into a single double inside OpenCL 1.2.
/* https://suhorukov.blogspot.com/2011/12/opencl-11-atomic-operations-on-floating.html */
inline void AtomicAdd(volatile __global double *source, const double operand) {
union { unsigned int intVal; double floatVal; } prevVal, newVal;
do {
prevVal.floatVal = *source;
newVal.floatVal = prevVal.floatVal + operand;
} while( atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal );
}
void kernel cost_function(__constant double* inputs, __global double* outputs){
    int index = get_global_id(0);
    if(0 == index){ outputs[0] = 0.0; }
    barrier(CLK_GLOBAL_MEM_FENCE);
    AtomicAdd(&outputs[0], inputs[index]); /* (1) */
    //AtomicAdd(&outputs[0], 5.0); /* (2) */
}
However, this solution is incorrect: the result is always 0 when the buffer is read back. What might the problem be?
The code at /* (1) */ doesn't work, and neither does the code at /* (2) */, which is only there to test the logic independently of any inputs.
Is barrier(CLK_GLOBAL_MEM_FENCE); used correctly here to reset the output before any calculations are done on it?
According to the OpenCL 1.2 specs, single-precision floating point numbers are supported by atomic operations; is this AtomicAdd a feasible method of extending that support to double precision, or am I missing something?
The device I am testing with does, of course, support cl_khr_fp64.
Your AtomicAdd is incorrect. Namely, the two errors are:
In the union, intVal must be a 64-bit integer, not a 32-bit integer.
You must use the 64-bit atom_cmpxchg function, not the 32-bit atomic_cmpxchg function.
The correct implementation is:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
inline void AtomicAdd(volatile __global double *source, const double operand) {
    union { ulong u64; double f64; } prevVal, newVal;
    do {
        prevVal.f64 = *source;
        newVal.f64 = prevVal.f64 + operand;
    } while(atom_cmpxchg((volatile __global ulong*)source, prevVal.u64, newVal.u64) != prevVal.u64);
}
barrier(CLK_GLOBAL_MEM_FENCE); is used correctly here. Note that a barrier must be reached by every work item in the work group, so it must not sit in an if- or else-branch that can diverge within a work group, as illustrated below.
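A small sketch of the rule (hypothetical kernel and bound n):
__kernel void barrier_example(__global float* data, const uint n)
{
    const uint id = get_global_id(0);
    /* WRONG: if (id < n) diverges within a work group, some work items
       would never reach the barrier, which is undefined behavior: */
    // if (id < n) { barrier(CLK_GLOBAL_MEM_FENCE); }

    /* RIGHT: every work item reaches the barrier; branch afterwards. */
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (id < n) { data[id] += 1.0f; }
}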
UPDATE: According to STREAMHPC, the original implementation you used is not guaranteed to produce correct results. Here is an improved implementation:
void __attribute__((always_inline)) atomic_add_f(volatile global float* addr, const float val) {
    union {
        uint u32;
        float f32;
    } next, expected, current;
    current.f32 = *addr;
    do {
        next.f32 = (expected.f32 = current.f32) + val; // ...*val for atomic_mul_f()
        current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32);
    } while(current.u32 != expected.u32);
}
#ifdef cl_khr_int64_base_atomics
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
void __attribute__((always_inline)) atomic_add_d(volatile global double* addr, const double val) {
    union {
        ulong u64;
        double f64;
    } next, expected, current;
    current.f64 = *addr;
    do {
        next.f64 = (expected.f64 = current.f64) + val; // ...*val for atomic_mul_d()
        current.u64 = atom_cmpxchg((volatile global ulong*)addr, expected.u64, next.u64);
    } while(current.u64 != expected.u64);
}
#endif
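
For completeness, a sketch of how the original cost_function could use atomic_add_d. It assumes the host zeroes outputs[0] before launch (e.g. with clEnqueueFillBuffer), since a barrier only synchronizes work items within one work group and cannot order the reset against other work groups:
void kernel cost_function(__constant double* inputs, __global double* outputs)
{
    /* outputs[0] is assumed to be zeroed by the host before enqueue */
    const int index = get_global_id(0);
    atomic_add_d(&outputs[0], inputs[index]);
}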

Worse performance when using a work group size that is a bigger multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

I am writing a matrix multiplication program in OpenCL. To utilize local memory, I divided the original matrix into block_size * block_size sub-matrices, where block_size equals the square root of the work group size; each sub-matrix is computed by one work group.
My GPU is an RX 580. CL_KERNEL_WORK_GROUP_SIZE returns 256, whose square root is 16, while CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 64. After some profiling, I found that setting block_size to 8*8 performs 4 times faster than setting it to 16*16, which seems unintuitive to me. Any explanation for this?
EDIT: CodeXL screenshots of the 8*8 and 16*16 work group runs (images omitted).
I noticed something unusual: how can SALUBusy% exceed 100% in the 8*8 work group case?
EDIT 2: Adding my kernel code, which might be useful:
size_t getIndex(int row, int col, int width)
{
    return row*width + col;
}

__kernel void matrixMul(__global const float* restrict a,
                        __global const float* restrict b, //assume transposed
                        __local float* restrict a_local,
                        __local float* restrict b_local,
                        __global float* restrict result)
{
    const int row = get_global_id(0);
    const int col = get_global_id(1);
    const int limit = get_global_size(0);
    const int blockSize = get_local_size(0);
    const int gidRow = get_local_id(0);
    const int gidCol = get_local_id(1);
    float sum = 0;
    for(int blockIndex = 0; blockIndex < (limit/blockSize); ++blockIndex)
    {
        /* copy block -> local memory */
        a_local[getIndex(gidRow, gidCol, blockSize)] = a[getIndex(row, col, limit)];
        b_local[getIndex(gidRow, gidCol, blockSize)] = b[getIndex(row, col, limit)];
        barrier(CLK_LOCAL_MEM_FENCE);
        /* block multiply */
        __local float* restrict a_local_row = &a_local[getIndex(gidRow, 0, blockSize)];
        __local float* restrict b_local_row = &b_local[getIndex(gidCol, 0, blockSize)];
        for(int i = 0; i < blockSize; ++i)
        {
            sum += (*a_local_row) * (*b_local_row);
            ++a_local_row;
            ++b_local_row;
        }
    }
    result[getIndex(row, col, limit)] = sum;
}

Workgroup Bound Check not working

In my OpenCL kernel I'm checking whether the global_id is inside the global problem size, but it is not working.
If the global problem size is divisible by the work group size, everything is fine. If not, the kernel does nothing at all.
__kernel void move_points(const unsigned int points,
                          const unsigned int floors,
                          const unsigned int gridWidth,
                          const unsigned int gridHeight,
                          __global const GraphData *graph,
                          __global const float *pin_x,
                          __global const float *pin_y,
                          __global const float *pin_z,
                          __global float *pout_x,
                          __global float *pout_y,
                          __global float *pout_z,
                          __global clrngMrg31k3pHostStream *streams)
{
    int id = get_global_id(0);
    if (id < points) {
        /* do kernel things... */
    }
}
Does somebody know where the problem is?
If your global size is not divisible by your local size (work group size), the kernel will not run at all.
The enqueueNDRangeKernel() call will return CL_INVALID_WORK_GROUP_SIZE as an error, as specified here.
If you really want to follow the CUDA model, where you may have unused work items, then keep the check in the kernel (as you already have) and round the global size up to the next multiple of your local size, as sketched below.
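A minimal host-side sketch of that rounding (queue, kernel, and points are assumed to be set up already; the names are illustrative):
// Round the global size up to the next multiple of the local size;
// the in-kernel check `if (id < points)` discards the extra work items.
size_t local_size = 64;
size_t global_size = ((points + local_size - 1) / local_size) * local_size;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);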

Concurrent updates (x += a) to global memory in OpenCL

I'm doing the following in an OpenCL kernel (simplified example):
__kernel void step(const uint count, __global int *map, __global float *sum)
{
    const uint i = get_global_id(0);
    if(i < count) {
        sum[map[i]] += 12.34;
    }
}
Here, sum is some quantity I want to calculate (previously set to zero in another kernel) and map is a mapping from integers i to integers j, such that multiple i's can map to the same j.
(map could be in constant memory rather than global, but it seems the amount of constant memory on my GPU is incredibly limited)
Will this work? Is a "+=" implemented in an atomic way, or is there a chance of concurrent operations overwriting each other?
It will not work. When work items access memory written to by other work items, you need to explicitly resort to atomic operations; in this case, atomic_add.
Something like:
__kernel void step(const uint count, __global int *map, __global int *sum)
{
    const uint i = get_global_id(0);
    if(i < count) {
        atomic_add(&sum[map[i]], 1234); // fixed-point: 12.34 scaled by 100
    }
}
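Since atomic_add only operates on integers, the 12.34 is stored here as a scaled integer (1234 hundredths), and the host undoes the scaling after reading the result back. A sketch (buffer and queue names are illustrative):
// Read back one accumulator cell and undo the fixed-point scaling.
cl_int raw;
clEnqueueReadBuffer(queue, sum_buf, CL_TRUE, 0, sizeof(raw), &raw, 0, NULL, NULL);
float value = raw / 100.0f;
Alternatively, the compare-and-swap helpers (atomic_add_f / atomic_add_d) from the earlier answer give true floating-point accumulation.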

OpenCL Local Memory Declaration

What is the difference between declaring local memory as follows:
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
                   const __global int* A,
                   const __global int* B,
                   __global char* C,
                   __local int* restrict block_a,
                   __local int* restrict block_b)
and declaring local memory inside the kernel:
#define a_size 1024
#define b_size 1024 * 1024

__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
                   const __global int* A,
                   const __global int* B,
                   __global char* C) {
    __local int block_a[a_size];
    __local int block_b[b_size];
    ...
}
In both cases, all work items will update a single cell in the shared block_a and block_b arrays.
I understand that it's not possible to have "variable" length arrays in the kernel (hence the #define at the top of the second kernel), but is there any other difference? Is there any difference with regard to when the memory is freed?
In both cases, local memory exists for the lifetime of the work-group. The only difference, as you have noted, is that passing the local memory pointer as an argument allows the size of the buffer to be specified dynamically, rather than being a compile-time constant. Different work-groups will always use different local memory allocations.
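With the first form, the host supplies each __local buffer's size through clSetKernelArg, passing NULL as the value pointer. A minimal sketch (the sizes are illustrative; argument indices follow the signature above):
// Reserve local memory for block_a (arg 6) and block_b (arg 7);
// a NULL value with a nonzero size declares a __local allocation.
clSetKernelArg(kernel, 6, 1024 * sizeof(cl_int), NULL);
clSetKernelArg(kernel, 7, 1024 * sizeof(cl_int), NULL);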
The second method is better if you want to port the code to CUDA, because __shared__ memory in CUDA (the equivalent of __local in OpenCL) does not support being declared the way the first case does.
