Swap memory pointers atomically on CUDA

I have two pointers in memory and I want to swap them atomically, but atomic operations in CUDA support only integer types. Is there a way to do the following swap?
classA* a1 = malloc(...);
classA* a2 = malloc(...);

When writing device-side code...
While CUDA provides atomics, they can't cover multiple (possibly remote) memory locations at once.
To perform this swap, you will need to "protect" access to both values with something like a mutex, and have whoever wants to write to them hold the mutex for the duration of the critical section (as with C++'s host-side std::lock_guard). This can be built from CUDA's actual atomic facilities, e.g. compare-and-swap, and is the subject of this question:
Implementing a critical section in CUDA
A caveat to the above is mentioned by @RobertCrovella: if you can make do with, say, a pair of 32-bit offsets rather than 64-bit pointers, and you store them in a 64-bit-aligned struct, you can use compare-and-exchange on the whole struct to swap both offsets in a single atomic operation.
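The packed-offsets idea can be sketched on the host with C11 atomics. This is only an illustration under stated assumptions: the helper names (pack_offsets, first_offset, second_offset, swap_offsets) are hypothetical, and the two offsets are assumed to fit in 32 bits each. In device code the same loop would be written with atomicCAS on an unsigned long long.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: first offset in the low half, second in the high half. */
uint64_t pack_offsets(uint32_t first, uint32_t second)
{
    return ((uint64_t)second << 32) | first;
}

uint32_t first_offset(uint64_t word)  { return (uint32_t)word; }
uint32_t second_offset(uint64_t word) { return (uint32_t)(word >> 32); }

/* Swap the two offsets held in *pair as one atomic update. */
void swap_offsets(_Atomic uint64_t *pair)
{
    uint64_t old_word = atomic_load(pair);
    uint64_t new_word;
    do {
        new_word = pack_offsets(second_offset(old_word), first_offset(old_word));
        /* If another thread changed *pair, old_word is refreshed and we retry. */
    } while (!atomic_compare_exchange_weak(pair, &old_word, new_word));
}
```

Because both values live in one 64-bit word, no reader can ever observe a half-swapped pair.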
... but is it really device side code?
Your code actually doesn't look like something one would run on the device: memory allocation is usually (though not always) done from the host side before you launch your kernel and do the actual work. If you can make sure these alterations only happen on the host side (think CUDA events and callbacks), and that they will not interfere with device-side code, you can just use your plain vanilla C++ facilities for concurrent programming (like the lock_guard I mentioned above).

I managed to get the behaviour I needed; it is not an atomic swap, but it is still safe. The context was a monotonic linked list working on both the CPU and the GPU:
#include <cassert>
#include <cstdlib>

template<typename T>
union readablePointer
{
    T* ptr;
    unsigned long long int address;  // same bits as ptr, in a type atomicExch accepts
};

template<typename T>
struct LinkedList
{
    struct Node
    {
        T value;
        readablePointer<Node> previous;
    };

    Node start;
    Node end;
    int size;

    __host__ __device__ void initialize()
    {
        size = 0;
        start.previous.ptr = nullptr;
        end.previous.ptr = &start;
    }

    __host__ __device__ void push_back(T value)
    {
        // malloc is available in both host and device code
        Node* node = static_cast<Node*>(malloc(sizeof(Node)));
        readablePointer<Node> nodePtr;
        nodePtr.ptr = node;
        nodePtr.ptr->value = value;
#ifdef __CUDA_ARCH__
        // Device: publish the new tail and fetch the old one in a single atomic step
        nodePtr.ptr->previous.address = atomicExch(&end.previous.address, nodePtr.address);
#else
        nodePtr.ptr->previous.address = end.previous.address;
        end.previous.address = nodePtr.address;
#endif
        size += 1;  // note: a plain increment, not atomic
    }

    __host__ __device__ T pop_back()
    {
        assert(end.previous.ptr != &start);
        readablePointer<Node> lastNodePtr;
        lastNodePtr.ptr = nullptr;
#ifdef __CUDA_ARCH__
        lastNodePtr.address = atomicExch(&end.previous.address, end.previous.ptr->previous.address);
#else
        lastNodePtr.address = end.previous.address;
        end.previous.address = end.previous.ptr->previous.address;
#endif
        size -= 1;
        T toReturn = lastNodePtr.ptr->value;
        return toReturn;
    }

    __host__ __device__ void clear()
    {
        while (size > 0)
        {
            pop_back();
        }
    }
};


OpenCL channels dynamic indexing

I want to implement a systolic structure for matrix multiplication. My objective is to use a single kernel for every Processing Element so I will execute the same kernel from the host part multiple times.
To communicate between kernels I would like to use channels or pipes. The problem is that "channels extension does not support dynamic indexing into arrays of channel IDs". The number of kernels will depend on the size of the matrix so I will need some method to connect the channels to the corresponding kernels automatically.
Summarizing, I am looking for a method to create this functionality:
channel float c0[32];

__kernel void producer (__global float * data_in){
    for(int i=0; i<32; i++){
        write_channel_intel(c0[i], data_in[i]);
    }
}

__kernel void consumer (__global float * ret_buf){
    for(int i=0; i<32; i++){
        ret_buf[i] = read_channel_intel(c0[i]);
    }
}
Thanks in advance!
OpenCL channels (Intel FPGA extension) do not support "true" dynamic indexing, but in most cases you can work around this limitation with a switch statement or with #pragma unroll.
The switch approach is described in the Intel FPGA SDK for OpenCL Programming Guide:
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel int ch[WORKGROUP_SIZE];

__kernel void consumer() {
    int gid = get_global_id(0);
    int value;

    /* Every channel index is a compile-time constant inside its case. */
    switch (gid) {
        case 0: value = read_channel_intel(ch[0]); break;
        case 1: value = read_channel_intel(ch[1]); break;
        case 2: value = read_channel_intel(ch[2]); break;
        case 3: value = read_channel_intel(ch[3]); break;
        // ...
        case WORKGROUP_SIZE-1: value = read_channel_intel(ch[WORKGROUP_SIZE-1]); break;
    }
}
You can also use #pragma unroll if you have a loop over channels:
__kernel void consumer() {
    int values[WORKGROUP_SIZE];
    /* Full unrolling turns each ch[i] into an access with a constant index. */
    #pragma unroll
    for (int i = 0; i < WORKGROUP_SIZE; ++i) {
        values[i] = read_channel_intel(ch[i]);
    }
}
As far as I know, the maximum number of channels has to be known well before the program is compiled for the board, because an FPGA cannot be programmed the way other computing systems are, with resources allocated on the go. Once that maximum is known, put #pragma unroll before the loop that reads or writes the channels.

Pointers to stack

I am sorry that I cannot support my question with some code (I didn't understand how to structure it so it would be accepted here), but I'll try anyway.
If I understand correctly, a struct that refers to another struct of the same type has to do so through a pointer member. Can this pointer refer to space allocated on the stack (instead of the heap) without causing a segmentation fault, and how should it be declared?
Yes, you can use pointers to variables on the stack, but only while the function that owns that stack frame has not returned. For example, this will work:
#include <stdio.h>

typedef struct
{
    int a;
    float b;
} s;

void printStruct(const s *s)
{
    printf("a=%d, b=%f\n", s->a, s->b);
}

void test()
{
    s s;
    s.a = 12;
    s.b = 34.5f;
    printStruct(&s);   /* fine: test()'s stack frame is still alive here */
}
This will cause an error however, as the stack frame would have disappeared:
s *bad()
{
    s s;
    s.a = 12;
    s.b = 34.5f;
    return &s;   /* the frame holding s is gone once bad() returns */
}
EDIT: Well I say it will cause an error, but while calling that code with:
int main()
{
    test();
    s *s = bad();
    printStruct(s);
    return 0;
}
I get a warning during compilation:
s.c:27:5: warning: function returns address of local variable [enabled by default]
and the program appears to work fine:
$ ./s
a=12, b=34.500000
a=12, b=34.500000
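But it is, in fact, broken: dereferencing the returned pointer is undefined behaviour, and it only appears to work because nothing has overwritten that part of the stack yet.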
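Two common ways to avoid returning the address of a local can be sketched as follows. The names pair_t, fill_pair and make_pair are hypothetical and only serve the illustration: either the caller supplies the storage, or the struct is returned by value so that a copy leaves the function.

```c
typedef struct {
    int a;
    float b;
} pair_t;

/* The caller owns the storage; this function only fills it in. */
void fill_pair(pair_t *out)
{
    out->a = 12;
    out->b = 34.5f;
}

/* Returning by value copies the struct out before the frame disappears. */
pair_t make_pair(void)
{
    pair_t p;
    fill_pair(&p);   /* pointer to a local is fine while make_pair is running */
    return p;
}
```

In both variants no pointer outlives the stack frame of the variable it points to.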
You didn't say what language you are working in, so assuming C for now from the wording of your question... the following code is perfectly valid:
typedef struct str_t_tag {
int foo;
int bar;
struct str_t_tag *pNext;
} str_t;
str_t str1;
str_t str2;
str1.pNext = &str2;
In this example both str1 and str2 are on the stack, but this would also work if either or both were on the heap. The only thing you need to be careful of is that stack variables will be zapped when they go out of scope, so if you had dynamically allocated str1 and passed it back out of a function, you would not want str1->pNext to point to something that was on the stack within that function.
In other words, DON'T DO THIS:
#include <stdlib.h>

typedef struct str_t_tag {
    int foo;
    int bar;
    struct str_t_tag *pNext;
} str_t;

str_t *func(void)
{
    str_t *pStr1 = malloc(sizeof(*pStr1));
    str_t str2;
    pStr1->pNext = &str2;
    return pStr1; /* NO!! pStr1->pNext will point to invalid memory after this */
}
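For contrast, a version that is safe to return could allocate both structs on the heap. This is a minimal sketch with hypothetical names (node_t, make_chain, free_chain), not the only possible fix:

```c
#include <stdlib.h>

typedef struct node_tag {
    int foo;
    int bar;
    struct node_tag *pNext;
} node_t;

/* Both nodes live on the heap, so the links stay valid after the return. */
node_t *make_chain(void)
{
    node_t *first = malloc(sizeof *first);
    node_t *second = malloc(sizeof *second);
    if (first == NULL || second == NULL) {
        free(first);    /* free(NULL) is a no-op */
        free(second);
        return NULL;
    }
    first->foo = 1;
    first->bar = 0;
    first->pNext = second;
    second->foo = 2;
    second->bar = 0;
    second->pNext = NULL;
    return first;
}

/* The caller releases the whole chain when done. */
void free_chain(node_t *head)
{
    while (head != NULL) {
        node_t *next = head->pNext;
        free(head);
        head = next;
    }
}
```

The price is that the caller now owns the memory and has to call free_chain.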
Not sure if this is specifically a C/C++ question, but I'll give C/C++ code as an example anyway.
The only way you can declare it (with minor variations):
typedef struct abc
{
    struct abc *other;
} abc;
other can point to an object on the stack as follows:
abc a, b; // stack objects
b.other = &a;
This is not a question about scope, so I'll skip commenting on possible issues with doing the above.
If, however, you want to assign it to a dynamically created object, there's no way this object can be on the stack.
abc b;
b.other = malloc(sizeof(abc)); // on the heap

error CL_OUT_OF_RESOURCES while reading back data in host memory while using atomic function in opencl kernel

I am trying to use atomic functions in my OpenCL kernel. The multiple threads I create try to write to a single memory location in parallel, and I want them to execute that particular line of code serially. I have never used an atomic function before.
I found similar problems on many blogs and forums, and I am trying one solution: two separate functions, 'acquire' and 'release', that lock and unlock a semaphore. I have enabled the necessary OpenCL extensions, all of which are supported by my device (NVIDIA GeForce GTX 630M).
My kernel execution configuration:
global_item_size = 8;
ret = clEnqueueNDRangeKernel(command_queue2, kernel2, 1, NULL, &global_item_size2, &local_item_size2, 0, NULL, NULL);
Here is my code: reducer.cl
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
typedef struct data
{
    double dattr[10];
    int d_id;
    int bestCent;
} Data;

typedef struct cent
{
    double cattr[5];
    int c_id;
} Cent;

__global void acquire(__global int* mutex)
{
    int occupied;
    do {
        occupied = atom_xchg(mutex, 1);
    } while (occupied>0);
}

__global void release(__global int* mutex)
{
    atom_xchg(mutex, 0); //the previous value, which is returned, is ignored
}

__kernel void reducer(__global int *keyMobj, __global int *valueMobj,__global Data *dataMobj,__global Cent *centMobj,__global int *countMobj,__global double *sumMobj, __global int *mutex)
{
    __local double sum[2][2];
    __local int cnt[2];
    int i = get_global_id(0);
    int n,j;
    cnt[i] = countMobj[i];
    n = keyMobj[i];
    for(j=0; j<2; j++)
    {
        sum[n][j] += dataMobj[i].dattr[j];
    }
    for(j=0; j<2; j++)
    {
        sum[i][j] = sum[i][j]/countMobj[i];
        centMobj[i].cattr[j] = sum[i][j];
    }
}
Unfortunately, the solution doesn't seem to work for me. When I read centMobj back into host memory using
ret = clEnqueueReadBuffer(command_queue2, centMobj, CL_TRUE, 0, (sizeof(Cent) * 2), centNode, 0, NULL, NULL);
ret = clEnqueueReadBuffer(command_queue2, sumMobj, CL_TRUE, 0, (sizeof(double) * 2 * 2), sum, 0, NULL, NULL);
I get error code -5 (CL_OUT_OF_RESOURCES) for both centMobj and sumMobj.
I can't tell whether the problem is in my atomic function code or in reading the data back into host memory. If I am using the atomic functions incorrectly, please correct me.
Thank you in advance.
In OpenCL, synchronization between work items can be done only inside a work-group. Code trying to synchronize work-items across different work-groups may work in some very specific (and implementation/device dependent) cases, but will fail in the general case.
The solution is either to use atomics that serialize accesses to the same memory location (without blocking any work-item) or to redesign the algorithm.
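The first option, atomics that serialize access without ever blocking, can be sketched in host-side C11 as a compare-and-swap loop. This is an illustration only, not the asker's kernel: atomic_add_double and read_double are hypothetical helper names, and the double is stored as its 64-bit pattern so that it can be exchanged as an integer. A kernel would need the device's 64-bit compare-and-swap (the cl_khr_int64_base_atomics extension) for the same idea, assuming the device supports it.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Read the double whose bit pattern is stored in *cell. */
double read_double(_Atomic uint64_t *cell)
{
    uint64_t bits = atomic_load(cell);
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Lock-free "*cell += delta": retry until no other thread interfered. */
void atomic_add_double(_Atomic uint64_t *cell, double delta)
{
    uint64_t old_bits = atomic_load(cell);
    for (;;) {
        double old_value, new_value;
        uint64_t new_bits;
        memcpy(&old_value, &old_bits, sizeof old_value);
        new_value = old_value + delta;
        memcpy(&new_bits, &new_value, sizeof new_bits);
        /* On failure old_bits is refreshed with the current contents. */
        if (atomic_compare_exchange_weak(cell, &old_bits, new_bits))
            return;
    }
}
```

No thread ever holds a lock here: a thread that loses the race simply recomputes its sum from the fresh value and tries again.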

Program fails when trying to add a pointer to an array inside a function (C)

I cannot get this code to work properly. When I try to compile it, one of three things will happen: Either I'll get no errors, but when I run the program, it immediately locks up; or it'll compile fine, but says 'Segmentation fault' and exits when I run it; or it gives warnings when compiled:
"conflicting types for ‘addObjToTree’
previous implicit declaration of ‘addObjToTree’ was here"
but then says 'Segmentation fault' and exits when I try to run it.
I'm on Mac OS X 10.6 using gcc.
typedef struct itemPos {
    float x;
    float y;
} itemPos;

typedef struct gameObject {
    itemPos loc;
    int uid;
    int kind;
    int isEmpty;
} gameObject;

void addObjToTree (gameObject *targetObj, gameObject *destTree[]) {
    int i = 0;
    int stop = 1;
    while (stop) {
        if ((*destTree[i]).isEmpty == 0) {
            i++;
        }
        else if ((*destTree[i]).isEmpty == 1) {
            stop = 0;
        }
        if (stop == 0) {
            destTree[i] = targetObj;
        }
    }
}

void initFS_LA (gameObject *target, gameObject *tree[], itemPos destination) {
    addObjToTree(target, tree);
    (*target).uid = 12981;
    (*target).kind = 101;
    (*target).isEmpty = 0;
    (*target).maxHealth = 100;
    (*target).absMaxHealth = 200;
    (*target).curHealth = 100;
    (*target).skill = 1;
    (*target).isSolid = 1;
    (*target).factionID = 555;
    (*target).loc.x = destination.x;
    (*target).loc.y = destination.y;
}
#include "game-obj.h"
#include "internal-routines.h"
#include <stdio.h>

int main()
{
    gameObject abc;
    gameObject jkl;
    abc.kind = 101;
    abc.uid = 1000;
    itemPos aloc;
    aloc.x = 10;
    aloc.y = 15;
    gameObject *masterTree[3];
    masterTree[0] = &(abc);
    initFS_LA(&jkl, masterTree, aloc);
    return 0;
}
I don't understand why it doesn't work. I just want addObjToTree(...) to add a pointer to a gameObject in the next free space of masterTree, which is an array of pointers to gameObject structures. Even weirder, if I remove the line addObjToTree(target, tree); from initFS_LA(...), it works perfectly. I've already created a function that searches masterTree by uid, and that also works fine, even if I initialize a new gameObject with initFS_LA(...) (without the addObjToTree line). I've tried rearranging the functions within the header file, putting them into separate header files, prototyping them, rearranging the order of the #includes, and explicitly creating a pointer variable instead of using &jkl, but absolutely nothing works. Any ideas? I appreciate any help.
If I see this correctly, then you don't initialize elements 1 and 2 of the masterTree array anywhere. Then, your addObjToTree() function searches the - uninitialized - array for a free element.
Declaring a local variable like gameObject *masterTree[3]; in C does not zero-initialize the array. Add a memset(masterTree, 0, sizeof(masterTree)); to initialize it.
Note that you're declaring an array of pointers to structs here, not an array of structs, so you also need to adjust your addObjToTree() to check for a NULL pointer instead of isEmpty.
It would also be good practice to pass the length of that array to that function to avoid buffer overruns.
If you want an array of structs, then you need to declare it as gameObject masterTree[3]; and the parameter in your addObjToTree() becomes gameObject *tree.
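Keeping the array of pointers, the suggestions above (zero-initialize, test for NULL, pass the length) could look like the sketch below. The trimmed-down gameObject, the extra len parameter and the int return value are assumptions made for the illustration, not the asker's actual declarations.

```c
#include <stddef.h>

typedef struct gameObject {
    int uid;
    int kind;
} gameObject;

/* Store obj in the first free (NULL) slot.
   Returns the slot index, or -1 if the tree is full. */
int addObjToTree(gameObject *obj, gameObject *tree[], size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (tree[i] == NULL) {
            tree[i] = obj;
            return (int)i;
        }
    }
    return -1;
}
```

A caller would declare the array as gameObject *masterTree[3] = {0}; so that every slot starts out as NULL, and pass 3 as len.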

OpenCL structure declarations in different memory spaces

In OpenCL, what are the consequences of, and the differences between, the following struct declarations? And if they are illegal, why?
struct gr_array
{
    int ndims;
    __global m_integer* dim_size;
    __global m_real* data;
};
typedef struct gr_array g_real_array;

struct lr_array
{
    int ndims;
    __local m_integer* dim_size;
    __local m_real* data;
};
typedef struct lr_array l_real_array;

__kernel temp(...){
    __local g_real_array A;
    g_real_array B;
    __local l_real_array C;
    l_real_array D;
}
My question is: where will the structures (and their members) be allocated? Who can access them? And is this good practice or not?
How about this:
struct r_array
{
    __local int ndims;
};
typedef struct r_array real_array;

__kernel temp(...){
    __local real_array A;
    real_array B;
}
If a work-item modifies ndims in struct B, is the change visible to other work-items in the work-group?
I've rewritten your code as valid CL, or at least CL that will compile. Here:
typedef struct gr_array {
    int ndims;
    global int* dim_size;
    global float* data;
} g_float_array;

typedef struct lr_array {
    int ndims;
    local int* dim_size;
    local float* data;
} l_float_array;

kernel void temp() {
    local g_float_array A;
    g_float_array B;
    local l_float_array C;
    l_float_array D;
}
One by one, here's how this breaks down:
A is in local space. It's a struct that is composed of one int and two pointers. These pointers point to data in global space, but are themselves allocated in local space.
B is in private space; it's an automatic variable. It is composed of an int and two pointers that point to stuff in global memory.
C is in local space. It contains an int and two pointers to stuff in local space.
D, you can probably guess at this point. It's in private space, and contains an int and two pointers that point to stuff in local space.
I cannot say whether any of these is preferable for your problem, since you haven't described what you are trying to accomplish.
EDIT: I realized I didn't address the second part of your question -- who can access the structure fields.
Well, you can access the fields anywhere the variable is in scope. I'm guessing that you were thinking that the fields you had marked as global in g_float_array were in global space (and in local space for l_float_array). But they're just pointing to stuff in global (or local) space.
So, you'd use them like this:
kernel void temp(
    global float* data, global int* global_size,
    local float* data_local, local int* local_size,
    int num)
{
    local g_float_array A;
    g_float_array B;
    local l_float_array C;
    l_float_array D;

    A.ndims = B.ndims = C.ndims = D.ndims = num;
    A.data = B.data = data;
    A.dim_size = B.dim_size = global_size;
    C.data = D.data = data_local;
    C.dim_size = D.dim_size = local_size;
}
By the way -- if you're hacking CL on a Mac running Lion, you can compile .cl files using the "offline" CL compiler, which makes experimenting with this kind of stuff a bit easier. It's located here:
There is some sample code here.
It probably won't work, because current GPUs have different memory spaces for OpenCL kernels and for the ordinary program. You have to make explicit calls to transfer data between the two spaces, and this is often the bottleneck of the program (because the bandwidth of the PCI Express link to the graphics card is quite low).
