In OpenCL what will be the consequences and differences between the following struct declarations. And if they are illegal, why?
struct gr_array
{
int ndims;
__global m_integer* dim_size;
__global m_real* data;
};
typedef struct gr_array g_real_array;
struct lr_array
{
int ndims;
__local m_integer* dim_size;
__local m_real* data;
};
typedef struct lr_array l_real_array;
__ kernel temp(...){
__local g_real_array A;
g_real_array B;
__local l_real_array C;
l_real_array D;
}
My question is where will the structures be allocated (and the members)? who can access them? And is this a good practice or not?
EDIT
how about this
struct r_array
{
__local int ndims;
};
typedef struct r_array real_array;
__ kernel temp(...){
__local real_array A;
real_array B;
}
if a work-item modifies ndims in struct B, is the change visible to other work-items in the work-group?
I've rewritten your code as valid CL, or at least CL that will compile. Here:
typedef struct gr_array {
int ndims;
global int* dim_size;
global float* data;
} g_float_array;
typedef struct lr_array {
int ndims;
local int* dim_size;
local float* data;
} l_float_array;
kernel void temp() {
local g_float_array A;
g_float_array B;
local l_float_array C;
l_float_array D;
}
One by one, here's how this breaks down:
A is in local space. It's a struct that is composed of one int and two pointers. These pointers point to data in global space, but are themselves allocated in local space.
B is in private space; it's an automatic variable. It is composed of an int and two pointers that point to stuff in global memory.
C is in local space. It contains an int and two pointers to stuff in local space.
D, you can probably guess at this point. It's in private space, and contains an int and two pointers that point to stuff in local space.
I cannot say if either is preferable for your problem, since you haven't described what your are trying to accomplish.
EDIT: I realized I didn't address the second part of your question -- who can access the structure fields.
Well, you can access the fields anywhere the variable is in scope. I'm guessing that you were thinking that the fields you had marked as global in g_float_array were in global space (an local space for l_float_array). But they're just pointing to stuff in global (or local) space.
So, you'd use them like this:
kernel void temp(
global float* data, global int* global_size,
local float* data_local, local int* local_size,
int num)
{
local g_float_array A;
g_float_array B;
local l_float_array C;
l_float_array D;
A.ndims = B.ndims = C.ndims = D.ndims = num;
A.data = B.data = data;
A.dim_size = B.dim_size = global_size;
C.data = D.data = data_local;
C.dim_size = D.dim_size = local_size;
}
By the way -- if you're hacking CL on a Mac running Lion, you can compile .cl files using the "offline" CL compiler, which makes experimenting with this kind of stuff a bit easier. It's located here:
/System/Library/Frameworks/OpenCL.framework/Libraries/openclc
There is some sample code here.
It probably won't work, because the current GPU-s have different memory spaces for OpenCL kernels and for the ordinary program. You have to make explicit calls to transmit data between both spaces, and it is often the bottleneck of the program (because the bandwidth of PCI-X graphics card is quite low).
Related
I was trying to create a struct with an unknown size array. here's what I think, but it seems to have a problem because memcheck report an error.I'm not sure if this is legal in c++
struct information
{
std::string name;
int x=0;
int* score=new int[x];
}
I'm new to opencL and this question might look silly.
I have a kernel which takes two structures A and C. I want to copy contents of structure A to structure C.
Structure looks like below:
struct Block {
bool used;
int size;
intptr_t data[1];
};
__kernel void function(__global struct Block *A, __global struct Block *C) {
//Do something on A
//COPY A to C by memcpy alternative
}
Is there any function like memcpy which I can use inside kernel?. I'm using opencl in integrated GPU with zero copy.
Or Do I have to copy block by block to structure C?.
In your case you can simply assign the structures:
__kernel void function(__global struct Block *A, __global struct Block *C) {
//Do something on A
*C = *A;
}
It's same as in plain C, yet many programmers don't know they can assign structures and resort to memcpy.
I have two pointers in memory and I want to swap it atomically but atomic operation in CUDA support only int types. There is a way to do the following swap?
classA* a1 = malloc(...);
classA* a2 = malloc(...);
atomicSwap(a1,a2);
When writing device-side code...
While CUDA provides atomics, they can't cover multiple (possibly remote) memory locations at once.
To perform this swap, you will need to "protect" access to both these values with something like mutex, and have whoever wants to write values to them take a hold of the mutex for the duration of the critical section (like in C++'s host-side std::lock_guard). This can be done using CUDA's actual atomic facilities, e.g. compare-and-swap, and is the subject of this question:
Implementing a critical section in CUDA
A caveat to the above is mentioned by #RobertCrovella: If you can make do with, say, a pair of 32-bit offsets rather than a 64-bit pointer, then if you were to store them in a 64-bit aligned struct, you could use compare-and-exchange on the whole struct to implement an atomic swap of the whole struct.
... but is it really device side code?
Your code actually doesn't look like something one would run on the device: Memory allocation is usually (though not always) done from the host side before you launch your kernel and do actual work. If you could make sure these alterations only happen on the host side (think CUDA events and callbacks), and that device-side code will not be interfered with by them - you can just use your plain vanilla C++ facilities for concurrent programming (like lock_guard I mentioned above).
I managed to have the needed behaviour, it is not atomic swap but still safe. The context was a monotonic Linked List working both on CPU and GPU:
template<typename T>
union readablePointer
{
T* ptr;
unsigned long long int address;
};
template<typename T>
struct LinkedList
{
struct Node
{
T value;
readablePointer<Node> previous;
};
Node start;
Node end;
int size;
__host__ __device__ void initialize()
{
size = 0;
start.previous.ptr = nullptr;
end.previous.ptr = &start;
}
__host__ __device__ void push_back(T value)
{
Node* node = nullptr;
malloc(&node, sizeof(Node));
readablePointer<Node> nodePtr;
nodePtr.ptr = node;
nodePtr.ptr->value = value;
#ifdef __CUDA_ARCH__
nodePtr.ptr->previous.address = atomicExch(&end.previous.address, nodePtr.address);
atomicAdd(&size,1);
#else
nodePtr.ptr->previous.address = end.previous.address;
end.previous.address = nodePtr.address;
size += 1;
#endif
}
__host__ __device__ T pop_back()
{
assert(end.previous.ptr != &start);
readablePointer<Node> lastNodePtr;
lastNodePtr.ptr = nullptr;
#ifdef __CUDA_ARCH__
lastNodePtr.address = atomicExch(&end.previous.address,end.previous.ptr->previous.address);
atomicSub(&size,1);
#else
lastNodePtr.address = end.previous.address;
end.previous.address = end.previous.ptr->previous.address;
size -= 1;
#endif
T toReturn = lastNodePtr.ptr->value;
free(lastNodePtr.ptr);
return toReturn;
}
__host__ __device__ void clear()
{
while(size > 0)
{
pop_back();
}
}
};
I'm doing the following in an OpenCL kernel (simplified example):
__kernel void step(const uint count, __global int *map, __global float *sum)
{
const uint i = get_global_id(0);
if(i < count) {
sum[map[i]] += 12.34;
}
}
Here, sum is some quantity I want to calculate (previously set to zero in another kernel) and map is a mapping from integers i to integers j, such that multiple i's can map to the same j.
(map could be in constant memory rather than global, but it seems the amount of constant memory on my GPU is incredibly limited)
Will this work? Is a "+=" implemented in an atomic way, or is there a chance of concurrent operations overwriting each other?
Will this work? Is a "+=" implemented in an atomic way, or is there a chance of concurrent operations overwriting each other?
It will not work. When threads access memory written to by other threads, you need to explicitly resort to atomic operations. In this case, atomic_add.
Something like:
__kernel void step(const uint count, __global int *map, __global double *sum)
{
const uint i = get_global_id(0);
if(i < count) {
atomic_add(&sum[map[i]], 1234);
}
}
What is the difference between declaring local memory as follows:
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
const __global int* A,
const __global int* B,
__global char* C,
__local int* restrict block_a,
__local int* restrict block_b)
and declaring local memory inside the kernel
#define a_size 1024
#define b_size 1024 * 1024
__kernel void mmul(const int Ndim, const int Mdim, const int Pdim,
const __global int* A,
const __global int* B,
__global char* C) {
__local int block_a[a_size]
__local int block_b[b_size]
...
}
In both cases, all threads will update a single cell in the shared A and B arrays
I understand that it's not possible to have "variable" length arrays in the kernel (hence the #define at the top of the second kernel), but is there any other difference? Is there any difference with regards to when the memory is freed?
In both cases, local memory exists for the lifetime of the work-group. The only difference, as you have noted, is that passing the local memory pointer as an argument allows the size of the buffer to be specified dynamically, rather than being a compile-time constant. Different work-groups will always use different local memory allocations.
The second method is better if you want to port code to CUDA, because the __shared__ memory in CUDA (equivalent to __local in OpenCL) does not support to be declared like the first case.