I have an array of 2M+ points (planned to be increased to 20M in due course) that I am running calculations on via OpenCL. I'd like to delete any points that fall within a random triangle geometry.
How can I do this within an OpenCL kernel process?
I can already:
identify those points that fall outside the triangle (simple point in poly algorithm in the kernel)
pass their coordinates to a global output array.
But:
An OpenCL global output array cannot have a variable size, so I initialise it to the same size as the input array of points.
As a result, (0,0) points occur in the final output wherever a point falls within the triangle.
The output array is therefore not actually reduced in size.
Can the (0,0) points be deleted within the OpenCL context?
N.B. I am coding in openFrameworks, so the C++ implementation links to .cl files.
Just an alternative for the case where most of the points pass the condition that triggers the atomic:
You can keep a local counter and increment it with a local atomic. To merge that count into the global value, use atomic_add(), which returns the "previous" global value. Each work-group then copies its elements to the output starting at that offset.
This should give a noticeable speed-up, since the threads synchronise locally and touch the global counter only once per work-group. The global copy can run in parallel because the address ranges never overlap.
For example:
__kernel void mykernel(__global MyType * global_out, __global int * global_count, __global MyType * global_in){
    int lid = get_local_id(0);
    int lws = get_local_size(0);
    int idx = get_global_id(0);
    __local int local_count;
    __local int global_val;
    //I am using a local container, but a local array of pointers to global is possible as well
    __local MyType local_out[WG_SIZE]; //Ensure WG_SIZE is at least your work-group size
    if(lid==0){
        local_count = 0; global_val = -1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    //Classify them
    if(global_in[idx] == ....)
        local_out[atomic_inc(&local_count)] = global_in[idx]; //atomic_inc takes a pointer
    barrier(CLK_LOCAL_MEM_FENCE);
    //If nothing was selected, we are done
    if(local_count > 0){
        //Only the first local ID does the atomic to global
        if(lid == 0)
            global_val = atomic_add(global_count,local_count);
        //Resync all the local workers here
        barrier(CLK_LOCAL_MEM_FENCE);
        //Copy all the data, each work-item starting at its own local ID
        for(int i=lid; i<local_count; i+=lws)
            global_out[global_val+i] = local_out[i];
    }
}
NOTE: I haven't compiled it, but it should more or less work.
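The two-level reservation scheme can be checked on the host before moving it into a kernel. The following is a minimal sketch in plain C that runs the "work-groups" one after another; the predicate keep(), the function compact() and the value of WG_SIZE are assumptions made for illustration, not part of the original answer. The plain addition on global_count stands in for atomic_add(), whose return value is the start offset of the group's output range.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 4  /* hypothetical work-group size for this host-side sketch */

/* Hypothetical predicate: keep strictly positive values. */
static int keep(int v) { return v > 0; }

/* Sequentially mimics the kernel: each "work-group" compacts into a local
   buffer, reserves a contiguous output range with one add on the global
   counter, then copies its survivors there. Returns the total kept. */
size_t compact(const int *in, size_t n, int *out) {
    assert(in != NULL && out != NULL);
    size_t global_count = 0;
    for (size_t base = 0; base < n; base += WG_SIZE) {
        int local_out[WG_SIZE];
        size_t local_count = 0;
        /* Classification step: one iteration per work-item of the group. */
        for (size_t lid = 0; lid < WG_SIZE && base + lid < n; lid++)
            if (keep(in[base + lid]))
                local_out[local_count++] = in[base + lid];
        /* Reservation step: start is what atomic_add would return. */
        size_t start = global_count;
        global_count += local_count;
        /* Copy step: ranges of different groups never overlap. */
        for (size_t i = 0; i < local_count; i++)
            out[start + i] = local_out[i];
    }
    return global_count;
}
```

On a real device the groups finish in an unspecified order, so the output is compacted but not necessarily in input order, unlike in this sequential sketch.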
If I understood your problem, you can do:
--> In your kernel, you can identify the points in the triangle and:
if(element[idx]!=(0,0))
output_array[atomic_inc(number_of_elems)] = element[idx];
Finally, on the host, the first number_of_elems entries of output_array will hold your inner points.
I hope this helps you.
Best
There are alternatives, all working better or worse depending on what the data looks like. I put one below.
Deleting the identified points can also be done by registering them in a separate array per workgroup. You need to use the same atomic_inc as in Moises's answer (see my remark there about doing this at workgroup level!). The end result is a list of start-points and end-points of the parts that don't need to be deleted. You can then copy those parts of the array using different threads. This is less effective if you have clusters of points that need to be deleted.
I have a general question to improve my understanding. I have built a scenario for it that is as simple as possible.
Let's say:
I have a structure with 2 variables (x and y), and thousands of objects of this structure stored next to each other in an array in a buffer. The initial values of these structures differ, but later the same arithmetic operations are always applied to each of them. (So this is extremely well suited to the GPU, because each worker does exactly the same operation, only with different values, without branching.) Additionally, these structs are not needed on the CPU at all, so only at the very end of the program should all values be copied back to the CPU.
The operations on these structs are limited as well! Let's say we have 8 operations which can be applied:
x + y, store result in x
x + y, store result in y
x + x, store result in x
y + y, store result in y
x * y, store result in x
x * y, store result in y
x * x, store result in x
y * y, store result in y
When creating one kernel per operation, the kernel for operation 1 would look like the following:
__kernel void operation1(__global float *structArray)
{
// Get the index of the current element to be processed
int i = get_global_id(0) * 2;
// Do the operation
structArray[i] = structArray[i] + structArray[i + 1]; //this line will change for different operations (+, *, store to x, y)
}
When executing these kernels multiple times in some order like operation 1, 2, 2, 3, 1, 7, 3, 5..., each execution incurs at least one global memory read and one global memory write. But in theory, if each worker kept its structure (x and y values) in private memory, the execution would be faster by a factor of 50 or so.
Is it possible to do something like this?:
__private float x;
__private float y;
__kernel void operation1(void)
{
// Do the operation
x = x + y; //this line will change for different operations (+, *, store to x, y)
}
To do so, you first need to store the values, for example like the following:
__private float x;
__private float y;
__kernel void operationStore(__global float *structArray)
{
int i = get_global_id(0) * 2;
//store the x and y value from global to private memory
x = structArray[i];
y = structArray[i + 1];
}
And of course, at the very end of the program you need to store them back to global memory so they can later be transferred to the CPU again:
__private float x;
__private float y;
__kernel void operationStoreToGlobal(__global float *structArray)
{
int i = get_global_id(0) * 2;
//store the x and y value from private to global memory
structArray[i] = x;
structArray[i + 1] = y;
}
So my question:
Can I somehow keep values in private or maybe local memory across different kernel calls? If so, the only performance cost would be that of the program queue.
How many clock cycles does the program queue need to change from one kernel to another?
Is the time needed to change kernels specific to the kernel size? If so, does it depend on the number of operations within the kernel or on the number of buffer bindings (rebinding)?
Is there a rule of thumb for how many operations (counted in clock cycles) a kernel should have at least in order to be performant?
This is not possible. You cannot communicate data across kernels in "global variables" in private or local memory space. You need to use global kernel arguments to temporarily store results, and thus write the values to video memory temporarily and read from video memory in the next kernel.
The only memory space allowed for "global variables" is constant: With it, you can create large look-up tables for example. These are read-only. constant variables are cached in L2 whenever possible.
Potentially several thousand clock cycles. When you finish one kernel and start another, you have a global synchronization point. All instances of kernel 1 need to be finished before kernel 2 can start.
Yes. It depends on the global range, the local (work-group) range, and the number of operations (especially if-else branching, because one work-group can take significantly longer than another), but not on the number of kernel arguments / buffer bindings. The larger the global size, the longer the kernel takes, the smaller the relative time variations between work-groups, and the smaller the relative performance loss from the kernel change (synchronization point).
Better question: How large should the global range be for a kernel to be performant? Answer: Very large, like 100 times the CUDA core / stream processor count.
There are tricks to reduce the number of required global synchronization points. For example: If a kernel can combine multiple different tasks from different kernels, squash two kernels together into one.
Example here: lattice Boltzmann method, two-step swap versus one-step swap.
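Applied to the x/y scenario from the question, squashing kernels together could mean running a whole sequence of operations in one pass, so each pair is loaded once and stored once. The following is a minimal host-side sketch in plain C, where the outer loop stands in for the work items; the names apply_op and run_fused and the idea of passing the sequence as an array of operation numbers are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Applies one of the eight operations listed in the question to (x, y). */
static void apply_op(int op, float *x, float *y) {
    switch (op) {
    case 1: *x = *x + *y; break;
    case 2: *y = *x + *y; break;
    case 3: *x = *x + *x; break;
    case 4: *y = *y + *y; break;
    case 5: *x = *x * *y; break;
    case 6: *y = *x * *y; break;
    case 7: *x = *x * *x; break;
    case 8: *y = *y * *y; break;
    default: assert(0 && "unknown op");
    }
}

/* One pass per element: load once, run the whole op sequence on local
   copies, store once. structArray holds x,y pairs as in the question. */
void run_fused(float *structArray, size_t count, const int *ops, size_t nops) {
    for (size_t i = 0; i < count; i++) {
        float x = structArray[2 * i];
        float y = structArray[2 * i + 1];
        for (size_t k = 0; k < nops; k++)
            apply_op(ops[k], &x, &y);
        structArray[2 * i] = x;
        structArray[2 * i + 1] = y;
    }
}
```

Because every element follows the same operation sequence, all work items would take the same branch at each step; whether this beats separate kernel launches on a given device would still need measuring.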
Another common trick is to allocate a buffer twice in video memory. In even steps, read from A and write to B and in odd steps the other way around. Avoid reading from A and at the same time writing to other elements of A (introduces race-conditions).
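The A/B buffer alternation can be sketched on the host in plain C. The neighbour-averaging step below is a made-up example of an update that would race if done in place, and the names step and run_ping_pong are assumptions for illustration; in OpenCL the pointer swap would typically correspond to swapping the two buffer arguments between launches.

```c
#include <assert.h>
#include <stddef.h>

/* One step: each output element depends on its neighbours in the input,
   so updating in place would read values that were already overwritten. */
static void step(const float *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float left = src[i == 0 ? n - 1 : i - 1];
        float right = src[i == n - 1 ? 0 : i + 1];
        dst[i] = 0.5f * (left + right);
    }
}

/* Runs `steps` iterations, alternating the roles of a and b.
   Returns the buffer that holds the final result. */
float *run_ping_pong(float *a, float *b, size_t n, int steps) {
    assert(a != b);
    float *src = a, *dst = b;
    for (int s = 0; s < steps; s++) {
        step(src, dst, n);
        /* Swap roles: the freshly written buffer becomes the next input. */
        float *tmp = src; src = dst; dst = tmp;
    }
    return src;
}
```

After an even number of steps the result is back in the first buffer, after an odd number it is in the second, which is why the function returns the pointer rather than assuming one of them.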
Using MPI and C, I'm looking to distribute (scatter and gather) a 2D array of complex double values (i.e. every element in the 2D array is of type complex double, so it has a creal and a cimag component). If I use a regular declaration of a 2D array of size n-by-n:
double complex grid[n][n];
Everything works just fine, BUT my program will fail depending on the size of n, giving a "segmentation fault" error. Anything above, say, 256 will immediately spit out a "segmentation fault" error. This is the problem that I'm having and am failing miserably to figure out.
After browsing through numerous similar issues, I'm guessing my problem is that I'm overloading the stack memory (something I'm honestly not 100% in understanding), meaning that I need to dynamically allocate my 2D arrays using malloc or calloc.
However, in my understanding, allocating a 2D array that you can call like grid[n][n] won't work since the allocated memory is not necessarily aligned, meaning that MPI_Scatter fails.
double complex **alloc_2d_complex(int rows, int cols){
double complex *data = (complex double*) malloc(rows*cols*sizeof(complex double));
double complex **array = (complex double**) malloc(rows*sizeof(complex double*));
int i;
for (i = 0; i < rows; i++)
array[i] = &(data[cols*i]);
return array;
}
int main(int argc, char*argv[]){
double complex **grid;
grid = alloc_2d_complex(n,n);
/* Continue to initialize MPI and attempt Scatter... */
}
I've tried initializing a 2D array by this method and scatter does fail for me, giving the error "memcpy argument memory ranges overlap", since something in memory apparently doesn't line up right.
This means I must allocate everything in 1D arrays in row-major order, like:
grid[y][x] ==> grid[y*n + x]
I'm really, really trying to avoid this because I'm dealing with numerous transposed and untransposed matrices (which is hard enough to keep track of in [y][x] logic) and it's going to make things difficult to keep track of for my purpose, but fine, if it's what I have to do then let's get it over with. But this ALSO doesn't work with MPI_Scatter, giving me once again "memcpy" errors, which I am utterly dumbfounded by. Below is an example of how I'm trying to do everything using 1D arrays. Since I'm getting the same error for this and the 2D allocated array, maybe the 2D allocation will work and I'm just missing something here. I'm only using a number of processors, numProcs, that can evenly divide n.
int n = 128;
double complex *grid = malloc(n*n*sizeof(complex double));
/* ... Initialize MPI ... */
stepSize = (int) n/numProcs;
double complex *gridChunk = malloc(stepSize*n*sizeof(complex double));
/* ... Initialize grid[y*n+x] Values... */
MPI_Scatter(&grid, n*stepSize, MPI_C_DOUBLE_COMPLEX,
&gridChunk, n*stepSize, MPI_C_DOUBLE_COMPLEX,
0, MPI_COMM_WORLD);
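For the row-major mapping grid[y][x] ==> grid[y*n + x] described above, it may help to know that C99 allows a single contiguous heap block to be viewed through a pointer to rows of n elements, which keeps the [y][x] syntax. The following is a minimal sketch without any MPI calls; the function name make_grid and the fill values are assumptions for illustration, and it relies on variable-length array types, which are optional in C11.

```c
#include <assert.h>
#include <complex.h>
#include <stdlib.h>

/* Allocates an n-by-n grid as one contiguous block and fills it so that
   element (y, x) holds y + x*i. Returns the flat pointer, or NULL. */
double complex *make_grid(int n) {
    assert(n > 0);
    double complex *flat = malloc((size_t)n * (size_t)n * sizeof *flat);
    if (flat == NULL)
        return NULL;
    /* View the same block as rows of n elements:
       grid[y][x] and flat[y*n + x] name the same element. */
    double complex (*grid)[n] = (double complex (*)[n])flat;
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            grid[y][x] = y + x * I;
    return flat;
}
```

Since the storage is one block, a routine that expects a contiguous buffer would be given the pointer to the first element (flat here), not the address of the pointer variable.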
Hello, I am fairly new to OpenCL and have encountered a problem when trying to index my multidimensional arrays. From what I understand, it is not possible to store a multidimensional array in global memory, but it is possible in local memory. However, when I try to access my 2D local array it always comes back as 0. I had a look at my GPU at http://www.notebookcheck.net/NVIDIA-GeForce-GT-635M.66964.0.html and found out that it has 0 shared memory; could this be the reason? What other limitations will 0 shared memory place on my programming experience?
I've posted a small, simple program showing the problem that I'm facing.
The input is [1,2,3,4] and I would like to store this in my 2D array.
__kernel void kernel(__global float *input, __global float *output)
{//the input is [1,2,3,4];
int size=2;//2by2 matrix
int idx = get_global_id(0);
int idy = get_global_id(1);
__local float 2Darray[2][2];
2Darray[idx][idy]=input[idx*size+idy];
output[0]=2Darray[1][1];//this always returns 0, but should return 4 on the first output no?
}
__local float 2Darray[1][1];
is 1 element wide, 1 element high.
2Darray[1][1]
is the second row and second column, which doesn't exist.
Even if the compiler accepts the local array without an error, an array that doesn't fit in local memory spills to global memory and becomes as slow as VRAM bandwidth.
Race condition:
output[0]=2Darray[1][1];
Every work item tries to write to the same index (0). Add
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
if(idx==0 && idy==0)
before it so that only one thread writes to it. The guarded write still needs the synchronization instruction (the barrier) ahead of it.
Consider a kernel which performs vector addition:
__kernel void vecAdd(__global double *a,
__global double *b,
__global double *c,
const unsigned int n)
{
//Get our global thread ID
int id = get_global_id(0);
//Make sure we do not go out of bounds
if (id < n)
c[id] = a[id] + b[id];
}
Is it really necessary to pass the size n to the kernel and check the bounds?
I have seen the same version without the check on n. Which one is correct?
More generally, I wonder what happens if the size of the data to process differs from the user-defined NDRange.
Will the remaining, out-of-bounds data be processed or not?
If so, how is it processed?
If not, does that mean that the user has to consider boundaries when programming a kernel?
Does OpenCL specify any of that?
Thanks
The check against n is a good idea if you aren't certain to have a multiple of n work items. When you know you will only ever call the kernel with n work items, the check is only taking up processing cycles, kernel size, and the instruction scheduler's attention.
Nothing will happen with the extra data you pass to the kernel. Although if you don't use the data at some point, you did waste time copying it to the device.
I like to make a kernel's work group and global size independent of the total work to be done. I need to pass in 'n' when this is the case.
For example:
__kernel void vecAdd( __global double *a, __global double *b, __global double *c, const unsigned int n)
{
//Get our global thread ID and global size
int gid = get_global_id(0);
int gsize = get_global_size(0);
//check vs n using for-loop condition
for(int i=gid; i<n; i+= gsize){
c[i] = a[i] + b[i];
}
}
The example will take an arbitrary value for n, as well as any global size. Each work item processes every gsize-th element, beginning at its own global id. The same idea works well with work groups too, sometimes outperforming the global version I have listed due to memory locality.
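To see that the strided loop covers every element exactly once for any combination of n and global size, the kernel can be mimicked on the host. This is a minimal sketch in plain C in which the work items run one after another; the function names vec_add_item and vec_add are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-in for the kernel: work item `gid` out of `gsize`
   handles elements gid, gid + gsize, gid + 2*gsize, ... below n. */
static void vec_add_item(const double *a, const double *b, double *c,
                         size_t n, size_t gid, size_t gsize) {
    for (size_t i = gid; i < n; i += gsize)
        c[i] = a[i] + b[i];
}

/* Runs every work item in turn; on a device they would run concurrently. */
void vec_add(const double *a, const double *b, double *c,
             size_t n, size_t gsize) {
    assert(gsize > 0);
    for (size_t gid = 0; gid < gsize; gid++)
        vec_add_item(a, b, c, n, gid, gsize);
}
```

With n = 10 and a global size of 4, work item 0 handles elements 0, 4 and 8, work item 1 handles 1, 5 and 9, and work items 2 and 3 handle two elements each.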
If you know the value of n to be constant, it is often better to hard-code it (as a #define at the top). This lets the compiler optimize for that specific value and eliminates the extra parameter. Examples of such kernels include DFT/FFT processing, bitonic sorting at a given stage, and image processing using constant dimensions.
This is typical when the host code specifies the workgroup size, because in OpenCL 1.x the global size must be a multiple of the work group size. So if your data size is 1000 and your workgroup size is 128 then the global size needs to be rounded up to 1024. Hence the check. In OpenCL 2.0 this requirement has been removed.
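The rounding described above is a one-line integer computation on the host. A minimal sketch in C, where the helper name round_up_global_size is an assumption for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= data_size. */
size_t round_up_global_size(size_t data_size, size_t local_size) {
    assert(local_size > 0);
    return (data_size + local_size - 1) / local_size * local_size;
}
```

For a data size of 1000 and a work-group size of 128 this gives 1024, so the 24 extra work items are the ones the `id < n` check filters out.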
Let's say I have a large array of values (still smaller than 64 kB), which is read very often in the kernel, but not written to. It can, however, change from outside. The array has two sets of values; let's call them left and right.
So the question is: is it faster to get the large array as __global and copy it into __local left and __local right arrays, or to get it as __constant large and handle the accessing in the kernel? For example:
__kernel void f(__global large, __local left, __local right, __global x, __global y) {
for(int i = 0; i < size; i++) {
left[i] = large[i];
right[i] = large[i + offset];
}
...
x = foo * left[idx];
y = bar * right[idx];
}
vs:
__kernel void f(__constant large, __global x, __global y) {
...
x = foo * large[idx];
y = bar * large[idx * offset];
}
(The indexing is a bit more complicated, but can be made with macros, for instance)
I read that constant memory lives in the global space, so should it be slower?
It will run on an Nvidia card.
First of all, in the second case you need some way of making the result available to your CPU. I am assuming you copy it back to global space after the computation.
I think it depends on what you do in the kernel. For example, if your kernel is heavy (a lot of computation per thread), then the first option might pay off. Why?
You spend some time copying data from global large space to local spaces left and right - Acceptable
You do a lot of computation on the data on local space - OK
You spend some time copying back from local left and right to global large. - Acceptable.
However, if your kernel is relatively light, i.e. each thread does only a small amount of computation, then:
You do a few computations with data in constant space, which most probably means you don't need to access it a lot.
You store intermediate results in local space.
You spend some time copying back from local space to global space. - Acceptable.
To sum up: for large kernels the first option is better; for small kernels, the second.
P.S. One more thing to note: if you have multiple kernels that work on large one after the other, then definitely go with the first option, because then you can keep the data in global memory and you don't have to copy it every time you launch a kernel.
EDIT: Since you have said it is accessed very often, I think you should probably go with the first option.