Multidimensional array in OpenCL with no shared memory?

Hello, I am fairly new to OpenCL and have run into a problem when indexing my multidimensional arrays. From what I understand it is not possible to store a multidimensional array in global memory, but it is possible in local memory. However, when I try to access my 2D local array it always comes back as 0. I had a look at my GPU at http://www.notebookcheck.net/NVIDIA-GeForce-GT-635M.66964.0.html and found that it has 0 shared memory. Could this be the reason? What other limitations will 0 shared memory place on my programming experience?
I've posted a small, simple program that demonstrates the problem I'm facing.
The input is [1,2,3,4], and I would like to store it in my 2D array.
__kernel void copy2d(__global float *input, __global float *output)
{
    // the input is [1,2,3,4]
    int size = 2; // 2-by-2 matrix
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    __local float array2D[2][2];
    array2D[idx][idy] = input[idx * size + idy];
    output[0] = array2D[1][1]; // this always returns 0, but shouldn't it return 4?
}

__local float array2D[1][1];
is 1 element wide and 1 element high.
array2D[1][1]
is the second row and second column, which doesn't exist.
Even if the compiler accepts an oversized local allocation without an error, it spills to global memory and becomes as slow as VRAM bandwidth (if it doesn't fit in the local memory space).
Race condition:
output[0] = array2D[1][1];
Every work item tries to write to the same index (0). Add
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
if (idx == 0 && idy == 0)
before it so that only one work item writes to it; the barrier provides the synchronization needed before that write, as shown in the sketch below.
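A minimal corrected sketch, assuming the whole 2x2 range is launched as a single work group (local memory is per work group, and barrier() only synchronizes within one):

__kernel void copy2d(__global float *input, __global float *output)
{
    int size = 2; // 2-by-2 matrix
    int idx = get_global_id(0);
    int idy = get_global_id(1);

    __local float array2D[2][2];
    array2D[idx][idy] = input[idx * size + idy];

    // Wait until every work item in the group has stored its element.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Let exactly one work item write the result.
    if (idx == 0 && idy == 0)
        output[0] = array2D[1][1]; // 4 for the input [1,2,3,4]
}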

Related

Dynamic allocation in shared memory in OpenCL on Nvidia

I'm following the example here to create a variable-length local memory array.
The kernel signature is something like this:
__kernel void foo(__global float4 *ex_buffer,
                  int ex_int,
                  __local void *local_var)
Then I call clSetKernelArg for the local memory kernel argument as follows:
clSetKernelArg(*kern, 2, sizeof(char) * MaxSharedMem, NULL)
Where MaxSharedMem is set from querying CL_DEVICE_LOCAL_MEM_SIZE.
Then inside the kernel I split up the allocated local memory into several arrays and other data structures and use them as I see fit. All of this works fine with AMD (gpu and cpu) and Intel devices. However, on Nvidia, I get the error CL_INVALID_COMMAND_QUEUE when I enqueue this kernel and then run clFinish on the queue.
This is a simple kernel that generates the mentioned error (local work size is 32):
__kernel
void s_Kernel(const unsigned int N, __local void *shared_mem_block)
{
    const ushort thread_id = get_local_id(0);
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);
    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}
The kernel runs fine if I allocate the same arrays and data structures in local memory statically. Could somebody provide an explanation for this behavior, and/or workarounds?
For those interested, I finally received an explanation from Nvidia. When the chunk of shared memory is passed in via a void pointer, the actual alignment does not match the expected alignment for a pointer to double (8-byte aligned). The GPU device throws an exception due to the misalignment.
As one of the comments pointed out, a way to circumvent the problem is to have the kernel parameter be a pointer to something that the compiler would properly align to at least 8 bytes (double, ulong, etc).
Ideally, the compiler would take responsibility for any alignment issues specific to the device, but because there is an implicit pointer cast in the little kernel featured in my question, I think it gets confused.
Once the memory is 8-byte aligned, a cast to a pointer type that assumes a shorter alignment (e.g. ushort) works without issues. So, if you're chaining the memory allocation like I'm doing, and the pointers are to different types, make sure to have the pointer to the largest type in the kernel signature.
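For example, a minimal sketch of that workaround applied to the kernel above: make the parameter a pointer to the most strictly aligned type in the chain (double here) and derive the other pointers from it:

__kernel
void s_Kernel(const unsigned int N, __local double *shared_mem_block)
{
    const ushort thread_id = get_local_id(0);
    // The parameter type now guarantees 8-byte alignment; casting down to a
    // type with weaker alignment requirements (ushort) is then safe.
    __local double *foo = shared_mem_block;
    __local ushort *bar = (__local ushort *) &(foo[1000]);
    foo[thread_id] = 0.;
    bar[thread_id] = 0;
}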

OpenCL select/delete points from large array

I have an array of 2M+ points (planned to be increased to 20M in due course) that I am running calculations on via OpenCL. I'd like to delete any points that fall within a random triangle geometry.
How can I do this within an OpenCL kernel process?
I can already:
identify those points that fall outside the triangle (a simple point-in-polygon algorithm in the kernel)
pass their coordinates to a global output array.
But:
an OpenCL global output array cannot be variable in size, so I initialise it to match the input array of points
As a result, (0,0) points occur in the final output wherever a point falls within the triangle
The output array therefore does not result in any reduction per se.
Can the (0,0) points be deleted within the OpenCL context?
n.b. I am coding in openFrameworks, so the C++ implementation links to .cl files
Just an alternative for the case where most of the points satisfy the condition that triggers the atomic:
It is possible to have a local counter and a local atomic, and then merge that local count into the global value using atomic_add(), which returns the "previous" global value. That value is the base address where the work group copies its elements.
It should give a noticeable speed-up, since the work items synchronize locally and only once globally. The global copy can run in parallel since the address ranges never overlap.
For example:
__kernel void mykernel(__global MyType *global_out, __global int *global_count, __global MyType *global_in)
{
    int lid = get_local_id(0);
    int lws = get_local_size(0);
    int idx = get_global_id(0);
    __local int local_count;
    __local int global_val;

    // I am using a local container, but a local array of pointers to global is possible as well
    __local MyType local_out[WG_SIZE]; // ensure WG_SIZE is at least your work group size

    if (lid == 0) {
        local_count = 0;
        global_val = -1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Classify them
    if (global_in[idx] == ....)
        local_out[atomic_inc(&local_count)] = global_in[idx];
    barrier(CLK_LOCAL_MEM_FENCE);

    // If none matched, we are done
    if (local_count > 0) {
        // Only the first local ID does the atomic to global
        if (lid == 0)
            global_val = atomic_add(global_count, local_count);
        // Resync all the local workers here
        barrier(CLK_LOCAL_MEM_FENCE);
        // Copy all the data, strided across the work group
        for (int i = lid; i < local_count; i += lws)
            global_out[global_val + i] = local_out[i];
    }
}
NOTE: I didn't compile it, but it should more or less work.
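On the host side (not shown in the answer), global_count must be zeroed before launch and read back afterwards to learn how many points survived. A rough sketch, with placeholder buffer and variable names and no error checking:

int zero = 0;
clEnqueueWriteBuffer(queue, count_buf, CL_TRUE, 0, sizeof(int), &zero, 0, NULL, NULL);

clSetKernelArg(kern, 0, sizeof(cl_mem), &out_buf);
clSetKernelArg(kern, 1, sizeof(cl_mem), &count_buf);
clSetKernelArg(kern, 2, sizeof(cl_mem), &in_buf);
clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

int n_out = 0;
clEnqueueReadBuffer(queue, count_buf, CL_TRUE, 0, sizeof(int), &n_out, 0, NULL, NULL);
// Only the first n_out elements of out_buf are valid, and their order is not deterministic.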
If I understood your problem, you can do the following in your kernel: identify the points in the triangle, and then
if (element[idx].x != 0.0f || element[idx].y != 0.0f)
    output_array[atomic_inc(number_of_elems)] = element[idx];
Finally, the first number_of_elems entries of output_array on the host will contain your inner points.
I hope this helps you,
Best
There are alternatives, all working better or worse depending on what the data looks like. I put one below.
Deleting the identified points can also be done by registering them in a separate array per work group - you need to use the same atomic_inc as in Moises's answer (see my remark there about doing this at work-group level!). The end result is a list of start points and end points of the parts that don't need to be deleted. You can then have different threads copy those parts of the array. This is less effective if you have clusters of points that need to be deleted.

OpenCL ND-Range boundaries?

Consider a kernel which performs vector addition:
__kernel void vecAdd(__global double *a,
                     __global double *b,
                     __global double *c,
                     const unsigned int n)
{
    // Get our global thread ID
    int id = get_global_id(0);
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}
Is it really necessary to pass the size n to the function and do a check on the boundaries?
I have seen the same version without the check on n. Which one is correct?
More generally, I wonder what happens if the data size to process is different from the user-defined ND-Range.
Will the remaining, out-of-bounds data be processed or not?
If so, how is it processed?
If not, does that mean that the user has to consider boundaries when programming a kernel?
Does OpenCL specify any of that?
Thanks
The check against n is a good idea if you can't guarantee that the number of work items launched is exactly n. When you know you will only ever call the kernel with n work items, the check only takes up processing cycles, kernel size, and the instruction scheduler's attention.
Nothing will happen with the extra data you pass to the kernel, although if you never use that data, you did waste time copying it to the device.
I like to make a kernel's work group and global size independent of the total work to be done. I need to pass in n when this is the case.
For example:
__kernel void vecAdd(__global double *a, __global double *b, __global double *c, const unsigned int n)
{
    // Get our global thread ID and global size
    int gid = get_global_id(0);
    int gsize = get_global_size(0);
    // Check vs n using the for-loop condition
    for (int i = gid; i < n; i += gsize) {
        c[i] = a[i] + b[i];
    }
}
The example will take an arbitrary value for n, as well as any global size. Each work item processes elements strided by the global size, beginning at its own global id. The same idea works well with work groups too, sometimes outperforming the global version I have listed due to memory locality.
If you know the value of n to be constant, it is often better to hard-code it (as a #define at the top). This lets the compiler optimize for that specific value and eliminates the extra parameter. Examples of such kernels include: DFT/FFT processing, bitonic sorting at a given stage, and image processing using constant dimensions.
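For example, a minimal sketch of the hard-coded variant:

#define N 1024

__kernel void vecAdd(__global double *a, __global double *b, __global double *c)
{
    // N is a compile-time constant, so no size parameter is needed
    for (int i = get_global_id(0); i < N; i += get_global_size(0))
        c[i] = a[i] + b[i];
}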
This is typical when the host code specifies the workgroup size, because in OpenCL 1.x the global size must be a multiple of the work group size. So if your data size is 1000 and your workgroup size is 128 then the global size needs to be rounded up to 1024. Hence the check. In OpenCL 2.0 this requirement has been removed.
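A small host-side sketch of that rounding (variable names are placeholders):

size_t local_size = 128;
size_t global_size = ((n + local_size - 1) / local_size) * local_size; // round up to a multiple
// global_size may now exceed n, so the kernel must check id < n.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);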

OpenCL: __constant vs. __local?

Let's say I have a large array of values (still smaller than 64 kB) which is read very often in the kernel, but not written to. It can, however, change from outside. The array has two sets of values; let's call them left and right.
So the question is: is it faster to take the large array as __global and copy it into __local left and right arrays, or to take it as __constant and handle the accessing in the kernel? For example:
__kernel void f(__global float *large, __local float *left, __local float *right, __global float *x, __global float *y)
{
    for (int i = 0; i < size; i++) {
        left[i] = large[i];
        right[i] = large[i + offset];
    }
    ...
    x[idx] = foo * left[idx];
    y[idx] = bar * right[idx];
}
vs:
__kernel void f(__constant float *large, __global float *x, __global float *y)
{
    ...
    x[idx] = foo * large[idx];
    y[idx] = bar * large[idx + offset];
}
(The indexing is a bit more complicated, but can be done with macros, for instance.)
I read that constant memory lives in the global space, so should it be slower?
It will run on an Nvidia card.
First of all, in the second case you need some way of making the result available to your CPU. I am assuming you copy back to global space after the computation.
I think it depends on what you do in the kernel. For example, if your kernel computation is heavy (a lot of computations per work item), then the first option might pay off. Why?
You spend some time copying data from the global large array to the local left and right arrays - acceptable.
You do a lot of computation on the data in local space - OK.
You spend some time copying back from local left and right to global large - acceptable.
However, if your kernel is relatively light, i.e. each work item only does a few computations, then:
You do a few computations with the data in constant space, which most probably means you don't need to access it a lot.
You store intermediate results in local space.
You spend some time copying back from local space to global space - acceptable.
To sum it up: for heavy kernels the first option is better; for light kernels the second.
P.S. One more thing to note: if you have multiple kernels that work on large one after the other, then definitely go with the first option, because then you can keep the data in global memory and you don't have to copy it every time you launch a kernel.
EDIT: since you have said it is accessed very often, I think you should probably go with the first option.
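If you take the first option, note that the copy loop in the question's first kernel is executed redundantly by every work item; the usual pattern is a cooperative, strided copy followed by a barrier. A rough sketch, assuming a single work group covers the array and using placeholder values for foo, bar, size, and offset (the __local buffers are sized from the host via clSetKernelArg with a NULL value):

__kernel void f(__global const float *large, __local float *left, __local float *right,
                __global float *x, __global float *y, int size, int offset)
{
    const float foo = 2.0f, bar = 3.0f; // placeholder scalars from the question
    int lid = get_local_id(0);
    int lws = get_local_size(0);

    // Cooperative copy: each work item loads a strided subset, not the whole array.
    for (int i = lid; i < size; i += lws) {
        left[i]  = large[i];
        right[i] = large[i + offset];
    }
    barrier(CLK_LOCAL_MEM_FENCE); // make all writes visible before any reads

    int idx = get_global_id(0);
    x[idx] = foo * left[idx];
    y[idx] = bar * right[idx];
}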

How do I use local memory in OpenCL?

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.
For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:
__kernel void square(__global float *input,
                     __global float *output,
                     const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}
If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.
Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, would use local memory.
Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.
__kernel void square(__global float *input,
                     __global float *output,
                     __local float *temp,
                     const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // If the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read.
        output[gtid] = temp[ltid] * temp[ltid];
    }
}
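On the host side, the additional __local parameter is sized with clSetKernelArg by passing NULL as the value pointer; for example (assuming a work group size of 256):

size_t wg_size = 256; // must match the local work size used at launch
clSetKernelArg(kernel, 2, wg_size * sizeof(float), NULL); // sizes __local float *temp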
There is another way to do this if the size of the local memory is constant: without using a pointer in the kernel's parameter list, a local buffer can be declared within the kernel just by marking it __local:
__local float localBuffer[1024];
This removes code, because there are fewer clSetKernelArg calls.
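Applied to the squaring kernel, a sketch of the statically sized variant (the 1024 must be at least the work group size):

__kernel void square(__global float *input,
                     __global float *output,
                     const unsigned int count)
{
    __local float temp[1024]; // statically sized; no clSetKernelArg needed for it
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        output[gtid] = temp[ltid] * temp[ltid];
    }
}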
In OpenCL, local memory is meant to share data across all work items in a work group, and it usually requires a barrier call before the local memory data can be used (for example, when one work item wants to read local memory data written by other work items). Barriers are costly in hardware. Keep in mind that local memory should be used for repeated data reads/writes, and bank conflicts should be avoided as much as possible.
If you are not careful with local memory, you may sometimes end up with worse performance than using global memory.
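As a small illustration of the barrier point, a hedged sketch where each work item reads a neighbour's value from local memory (assuming the global size is a multiple of the work group size):

__kernel void neighbour_sum(__global const float *in, __global float *out, __local float *tile)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lws = get_local_size(0);

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE); // without this, the neighbour's write may not be visible yet

    int next = (lid + 1) % lws; // wrap around within the work group
    out[gid] = tile[lid] + tile[next];
}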
