Concurrent updates (x += a) to global memory in OpenCL

I'm doing the following in an OpenCL kernel (simplified example):
__kernel void step(const uint count, __global int *map, __global float *sum)
{
    const uint i = get_global_id(0);
    if (i < count) {
        sum[map[i]] += 12.34;
    }
}
Here, sum is some quantity I want to calculate (previously set to zero in another kernel) and map is a mapping from integers i to integers j, such that multiple i's can map to the same j.
(map could be in constant memory rather than global, but it seems the amount of constant memory on my GPU is incredibly limited)
Will this work? Is a "+=" implemented in an atomic way, or is there a chance of concurrent operations overwriting each other?

Will this work? Is a "+=" implemented in an atomic way, or is there a chance of concurrent operations overwriting each other?
It will not work. A plain += on global memory is a separate read, add, and write, so concurrent work-items touching the same location can overwrite each other's updates. When work-items access memory written to by other work-items, you need to explicitly resort to atomic operations; in this case, atomic_add.
Something like the following, with the sum kept as a scaled (fixed-point) integer, since the built-in atomic_add only operates on integer types:
__kernel void step(const uint count, __global int *map, __global int *sum)
{
    const uint i = get_global_id(0);
    if (i < count) {
        atomic_add(&sum[map[i]], 1234); // 12.34 scaled by 100
    }
}

Related

Worse performance when using a bigger multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE as the work-group size

I am writing a matrix multiplication program in OpenCL. To make use of local memory, I divided the original matrices into block_size * block_size sub-matrices, where block_size equals the square root of the work-group size, and each sub-matrix is computed within one work group.
My GPU is an RX 580. CL_KERNEL_WORK_GROUP_SIZE returns 256, so the square root is 16, while CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 64. After some profiling, I found that setting block_size to 8*8 performs 4 times faster than setting it to 16*16, which seems unintuitive to me. Is there any explanation for this?
EDIT: CodeXL screenshot
8*8 work group
16*16 work group
I noticed something unusual. How can SALUBusy% exceed 100% in the 8*8 work group case?
EDIT2: Add my kernel code, which might be useful
size_t getIndex(int row, int col, int width)
{
    return row * width + col;
}

__kernel void matrixMul(__global const float* restrict a,
                        __global const float* restrict b, // assume transposed
                        __local float* restrict a_local,
                        __local float* restrict b_local,
                        __global float* restrict result)
{
    const int row = get_global_id(0);
    const int col = get_global_id(1);
    const int limit = get_global_size(0);
    const int blockSize = get_local_size(0);
    const int gidRow = get_local_id(0);
    const int gidCol = get_local_id(1);
    float sum = 0;
    for (int blockIndex = 0; blockIndex < (limit / blockSize); ++blockIndex)
    {
        /* copy block -> local memory */
        a_local[getIndex(gidRow, gidCol, blockSize)] = a[getIndex(row, col, limit)];
        b_local[getIndex(gidRow, gidCol, blockSize)] = b[getIndex(row, col, limit)];
        barrier(CLK_LOCAL_MEM_FENCE);
        /* block multiply */
        __local float* restrict a_local_row = &a_local[getIndex(gidRow, 0, blockSize)];
        __local float* restrict b_local_row = &b_local[getIndex(gidCol, 0, blockSize)];
        for (int i = 0; i < blockSize; ++i)
        {
            sum += (*a_local_row) * (*b_local_row);
            ++a_local_row;
            ++b_local_row;
        }
    }
    result[getIndex(row, col, limit)] = sum;
}

OpenCL Local Memory Declaration

What is the difference between declaring local memory as follows:
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
                   const __global int* A,
                   const __global int* B,
                   __global char* C,
                   __local int* restrict block_a,
                   __local int* restrict block_b)
and declaring local memory inside the kernel
#define a_size 1024
#define b_size (1024 * 1024)
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
                   const __global int* A,
                   const __global int* B,
                   __global char* C) {
    __local int block_a[a_size];
    __local int block_b[b_size];
    ...
}
In both cases, all threads will update a single cell in the shared A and B arrays
I understand that it's not possible to have "variable" length arrays in the kernel (hence the #define at the top of the second kernel), but is there any other difference? Is there any difference with regards to when the memory is freed?
In both cases, local memory exists for the lifetime of the work-group. The only difference, as you have noted, is that passing the local memory pointer as an argument allows the size of the buffer to be specified dynamically at run time (by calling clSetKernelArg with the desired size in bytes and a NULL argument value), rather than being a compile-time constant. Different work-groups always use different local memory allocations.
The second method is better if you want to port the code to CUDA, because CUDA's __shared__ memory (the equivalent of __local in OpenCL) cannot be declared as a kernel parameter in the way the first case requires.

error CL_OUT_OF_RESOURCES while reading back data in host memory while using atomic function in opencl kernel

I am trying to use atomic functions in my OpenCL kernel. The multiple work-items I create all try to write to a single memory location in parallel, and I want them to execute that particular line of code serially. I have never used an atomic function before.
I found similar problems on many blogs and forums, and I am trying one suggested solution: using two functions, 'acquire' and 'release', for locking and unlocking a semaphore. I have included the necessary OpenCL extensions, which are all surely supported by my device (an NVIDIA GeForce GTX 630M).
My kernel execution configuration:
global_item_size2 = 8;
ret = clEnqueueNDRangeKernel(command_queue2, kernel2, 1, NULL, &global_item_size2, &local_item_size2, 0, NULL, NULL);
Here is my code: reducer.cl
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
typedef struct data
{
    double dattr[10];
    int d_id;
    int bestCent;
} Data;

typedef struct cent
{
    double cattr[5];
    int c_id;
} Cent;

void acquire(__global int* mutex)
{
    int occupied;
    do {
        occupied = atom_xchg(mutex, 1);
    } while (occupied > 0);
}

void release(__global int* mutex)
{
    atom_xchg(mutex, 0); // the previous value, which is returned, is ignored
}

__kernel void reducer(__global int *keyMobj, __global int *valueMobj,
                      __global Data *dataMobj, __global Cent *centMobj,
                      __global int *countMobj, __global double *sumMobj,
                      __global int *mutex)
{
    __local double sum[2][2];
    __local int cnt[2];
    int i = get_global_id(0);
    int n, j;
    if (i < 2)
        cnt[i] = countMobj[i];
    barrier(CLK_GLOBAL_MEM_FENCE);
    n = keyMobj[i];
    for (j = 0; j < 2; j++)
    {
        barrier(CLK_GLOBAL_MEM_FENCE);
        acquire(mutex);
        sum[n][j] += dataMobj[i].dattr[j];
        release(mutex);
    }
    if (i < 2)
    {
        for (j = 0; j < 2; j++)
        {
            sum[i][j] = sum[i][j] / countMobj[i];
            centMobj[i].cattr[j] = sum[i][j];
        }
    }
}
Unfortunately, the solution doesn't seem to work for me. When I read centMobj back into host memory, using
ret = clEnqueueReadBuffer(command_queue2, centMobj, CL_TRUE, 0, (sizeof(Cent) * 2), centNode, 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue2, sumMobj, CL_TRUE, 0, (sizeof(double) * 2 * 2), sum, 0, NULL, NULL);
both reads fail with error code -5 (CL_OUT_OF_RESOURCES).
I can't tell whether the problem is in my atomic function code or in reading the data back into host memory. If I am using the atomic functions incorrectly, please correct me.
Thank you in advance.
In OpenCL, synchronization between work items can be done only inside a work-group. Code trying to synchronize work-items across different work-groups may work in some very specific (and implementation/device dependent) cases, but will fail in the general case.
The solution is to either use atomics to serialize accesses to the same memory location (but without blocking any work item), or redesign the code differently.

OpenCL structure declarations in different memory spaces

In OpenCL, what are the consequences of, and the differences between, the following struct declarations? And if they are illegal, why?
struct gr_array
{
    int ndims;
    __global m_integer* dim_size;
    __global m_real* data;
};
typedef struct gr_array g_real_array;

struct lr_array
{
    int ndims;
    __local m_integer* dim_size;
    __local m_real* data;
};
typedef struct lr_array l_real_array;

__kernel void temp(...) {
    __local g_real_array A;
    g_real_array B;
    __local l_real_array C;
    l_real_array D;
}
My question is where will the structures be allocated (and the members)? who can access them? And is this a good practice or not?
EDIT
How about this?
struct r_array
{
    __local int ndims;
};
typedef struct r_array real_array;

__kernel void temp(...) {
    __local real_array A;
    real_array B;
}
if a work-item modifies ndims in struct B, is the change visible to other work-items in the work-group?
I've rewritten your code as valid CL, or at least CL that will compile. Here:
typedef struct gr_array {
    int ndims;
    global int* dim_size;
    global float* data;
} g_float_array;

typedef struct lr_array {
    int ndims;
    local int* dim_size;
    local float* data;
} l_float_array;

kernel void temp() {
    local g_float_array A;
    g_float_array B;
    local l_float_array C;
    l_float_array D;
}
One by one, here's how this breaks down:
A is in local space. It's a struct that is composed of one int and two pointers. These pointers point to data in global space, but are themselves allocated in local space.
B is in private space; it's an automatic variable. It is composed of an int and two pointers that point to stuff in global memory.
C is in local space. It contains an int and two pointers to stuff in local space.
D, you can probably guess at this point. It's in private space, and contains an int and two pointers that point to stuff in local space.
I cannot say whether either is preferable for your problem, since you haven't described what you are trying to accomplish.
EDIT: I realized I didn't address the second part of your question -- who can access the structure fields.
Well, you can access the fields anywhere the variable is in scope. I'm guessing you were thinking that the fields you had marked as global in g_float_array were themselves stored in global space (and in local space for l_float_array). But they are just pointers that point to data in global (or local) space.
So, you'd use them like this:
kernel void temp(
    global float* data, global int* global_size,
    local float* data_local, local int* local_size,
    int num)
{
    local g_float_array A;
    g_float_array B;
    local l_float_array C;
    l_float_array D;

    A.ndims = B.ndims = C.ndims = D.ndims = num;
    A.data = B.data = data;
    A.dim_size = B.dim_size = global_size;
    C.data = D.data = data_local;
    C.dim_size = D.dim_size = local_size;
}
By the way -- if you're hacking CL on a Mac running Lion, you can compile .cl files using the "offline" CL compiler, which makes experimenting with this kind of stuff a bit easier. It's located here:
/System/Library/Frameworks/OpenCL.framework/Libraries/openclc
There is some sample code here.
It probably won't work, because current GPUs have separate memory spaces for OpenCL kernels and for the ordinary program. You have to make explicit calls to transfer data between the two spaces, and that transfer is often the bottleneck of the program (because the bandwidth of the graphics card's PCI Express bus is comparatively low).

Problem reinterpreting parameters in OpenCL 1.0

Is it possible to reinterpret parameters that have been passed into an OpenCL kernel? For example, if I have an array of integers being passed in, but I want to interpret the integer at index 16 as a float (don't ask why!), then I would have thought this would work.
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = *( (__constant float*) &constArray[16] );
    im[0] = x;
}
However, I get a CL_INVALID_COMMAND_QUEUE error when I next try to use the command queue, implying that the above code has performed an illegal operation.
Any suggestions as to what is wrong with the above, and/or how to achieve the reinterpretation?
I have now tried:
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(0x3f800000);
    im[0] = x;
}
and this does indeed give a 1.0f in im[0]. However,
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(constArray[16]);
    im[0] = x;
}
always results in zero in im[0] regardless of what is in constArray[16].
Regards,
Mark.
OpenCL includes the as_typen family of operators for reinterpret casting of values from one type to another. If I am understanding the question, you should be able to do something like
__kernel void Test(__global float* im, __constant int* constArray)
{
    float x = as_float(constArray[16]);
    im[0] = x;
}
