Deleting some elements in an array in OpenCL kernel code - opencl

After compacting an array (copying the required elements from an input array into an output array) with a scan operation, the output array may be left with a contiguous run of unused slots after the required elements. Is there a way to free these unused slots in the OpenCL kernel code itself, without going back to the host just to delete them?
For example, I have an input array of 100 elements, some greater than 50 and some less than 50. I want to store the numbers greater than 50 in a different array and do further processing only on those elements. I don't know the size of this output array in advance, since I don't know how many numbers are actually greater than 50, so I declare it with 100 elements. After performing the scan, the output array holds all the elements greater than 50, but there may be a contiguous run of empty slots after them. How do we delete these slots? Is there a way of doing this in the kernel code itself, or do we have to go back to the host code?
How do we do further processing on such compacted arrays if we can't delete the remaining slots in the kernel code and don't want to go back to the host code?

There is no simple solution to your problem, I'm afraid.
What you might do is keep a counter of the elements in each array. Increment the counter first locally with atomic_inc() and then add the local total to a global counter with atomic_add().
This way, at the end of the kernel execution, the counter holds the total number of elements in each array.
You can also use the value returned by the atomic operation as an index into the array. This way you can write to the output without any "hole" in your array. However, you will probably lose some speed because of the heavy use of atomic operations.
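The idea of using the atomic counter as the write index can be modelled on the host. The following is only a sketch in plain C++, not kernel code: the loop stands in for the work items, std::atomic stands in for the global counter, and the function name compact_greater_than is made up for illustration. In an OpenCL kernel the corresponding call would be atomic_inc() on a global counter, which also returns the old value.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the kernel idea (hypothetical helper, not OpenCL code):
// every "work item" that keeps its element reserves the next free output
// slot with an atomic increment, so the kept elements end up packed at the
// front of `out` and the counter holds how many there are.
std::size_t compact_greater_than(const std::vector<int>& in, int threshold,
                                 std::vector<int>& out) {
    assert(out.size() >= in.size());  // worst case: every element is kept
    std::atomic<std::size_t> counter{0};

    // This loop stands in for the work items; gid plays the role of the
    // global ID. It runs sequentially here, so order happens to be preserved.
    for (std::size_t gid = 0; gid < in.size(); ++gid) {
        if (in[gid] > threshold) {
            // fetch_add returns the value before the increment,
            // which is exactly the slot this element may write to.
            const std::size_t slot = counter.fetch_add(1);
            out[slot] = in[gid];
        }
    }
    return counter.load();  // number of valid elements in `out`
}
```

The unused tail of the output array is never freed. The next processing step simply reads the returned count and ignores everything past it. When the work items really run in parallel, the order of the kept elements is not preserved.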


Parallel iteration over array with step size greater than 1

I'm working on a practice program for doing belief propagation stereo vision. The relevant aspect of that here is that I have a fairly long array representing every pixel in an image, and want to carry out an operation on every second entry in the array at each iteration of a for loop: first one half of the entries, and then at the next iteration the other half (this comes from an optimisation described by Felzenszwalb & Huttenlocher in their 2006 paper 'Efficient belief propagation for early vision'). So, you could see it as having an outer for loop which runs a number of times, and for each iteration of that loop I iterate over half of the entries in the array.
I would like to parallelise the operation of iterating over the array like this, since I believe it would be thread-safe to do so, and of course potentially faster. The operation involved updates values inside the data structures representing the neighbouring pixels, which are not themselves used in a given iteration of the outer loop. Originally I just iterated over the entire array in one go, which meant that it was fairly trivial to carry this out - all I needed to do was put .Parallel between Array and .iteri. Changing to operating on every second array entry is trickier, however.
To make the change from simply iterating over every entry, I switched from Array.iteri (fun i p -> ... to using for i in startIndex..2..(ArrayLength - 1) do, where startIndex is either 1 or 0 depending on which one I used last (controlled by toggling a boolean). This means, though, that I can't simply use the really nice .Parallel to make things run in parallel.
I haven't been able to find anything specific about how to implement a parallel for loop in .NET which has a step size greater than 1. The best I could find was a paragraph in an old MSDN document on parallel programming in .NET, but that paragraph only makes a vague statement about transforming an index inside a loop body. I do not understand what is meant there.
I looked at Parallel.For and Parallel.ForEach, as well as creating a custom partitioner, but none of those seemed to include options for changing the step size.
The other option that occurred to me was to use a sequence expression such as
let getOddOrEvenArrayEntries myarray oddOrEven =
    seq {
        // start at 1 for the odd entries, 0 for the even ones
        let startingIndex = if oddOrEven then 1 else 0
        for i in startingIndex..2..(Array.length myarray - 1) do
            yield (i, myarray.[i])
    }
and then using PSeq.iteri from ParallelSeq, but I'm not sure whether it will work correctly with .NET Core 2.2. (Note that, currently at least, I need to know the index of the given element in the array, as it is used as the index into another array during the processing).
How can I go about iterating over every second element of an array in parallel? I.e. iterating over an array using a step size greater than 1?
You could try PSeq.mapi which provides not only a sequence item as a parameter but also the index of an item.
Here's a small example:
let res =
    nums
    |> PSeq.mapi (fun index item -> if index % 2 = 0 then item else item + 1)
You can also have a look over this sampling snippet. Just be sure to substitute Seq with PSeq.
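As for "transforming an index inside a loop body": one reading is that the parallel loop runs over a dense counter k = 0, 1, 2, ... and the body computes the real array index as start + k * step. The arithmetic is language independent, so here is a sketch in C++. The names strided_count and for_each_strided are made up for illustration, and the inner loop is sequential. In .NET the same two lines of arithmetic would go inside the body passed to Parallel.For(0, count, ...).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// How many indices start, start+step, start+2*step, ... lie below length.
std::size_t strided_count(std::size_t start, std::size_t step,
                          std::size_t length) {
    assert(step >= 1);
    return start >= length ? 0 : (length - start + step - 1) / step;
}

// Visits every step-th element beginning at `start`. The loop counter k is
// dense (0..count-1), which is the shape a parallel-for expects; the real
// array index is recovered inside the body.
template <typename T, typename Body>
void for_each_strided(std::vector<T>& data, std::size_t start,
                      std::size_t step, Body body) {
    const std::size_t count = strided_count(start, step, data.size());
    for (std::size_t k = 0; k < count; ++k) {
        const std::size_t i = start + k * step;  // the index transformation
        body(i, data[i]);                        // body still sees the real index
    }
}
```

Because the body receives the real index i, it can still be used to index into a second array, as the question requires.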

Determine if there exists a number in the array occurring k times

I want to create a divide and conquer algorithm (O(n log n) runtime) to determine if there exists a number in an array that occurs k times. A constraint on this problem is that only an equality/inequality comparison method is defined on the objects of the array (i.e. can't use <, >).
So I have tried a number of approaches, including splitting the array into k pieces of (approximately) equal size. The approach is similar to finding the majority item in an array; however, in the majority case, when you split the array you know that one half must have a majority item if such an item exists. Any pointers or tips that could put me in the right direction?
EDIT: To clear up a little, I am wondering whether the problem of finding the majority item by splitting the array in half and using a recursive solution can be extended to other situations where k may be n/4 or n/5 etc.
Maybe I should have phrased the question using n/k instead.
This is impossible. As a simple example of why, consider a length-n array with all elements distinct and k=2. The only way to be sure no element appears twice is to compare every element against every other element, which takes n(n-1)/2 comparisons, i.e. on the order of n^2 time. Until you have performed all possible comparisons, you cannot rule out that some pair you didn't compare is actually equal.
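For reference, the quadratic approach that equality-only comparison forces in the worst case can be sketched as follows. This is an illustrative brute force, not a divide and conquer algorithm. The function name has_value_with_count is made up, and "occurs k times" is taken to mean "exactly k times", which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if some value occurs exactly k times in `items`.
// Only operator== is used on the elements, so the work is O(n^2)
// comparisons in the worst case (for example, all elements distinct).
template <typename T>
bool has_value_with_count(const std::vector<T>& items, std::size_t k) {
    assert(k >= 1);
    const std::size_t n = items.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Skip values already counted at an earlier position.
        bool seen_before = false;
        for (std::size_t j = 0; j < i; ++j) {
            if (items[j] == items[i]) { seen_before = true; break; }
        }
        if (seen_before) continue;

        // Count occurrences of items[i] from position i onwards.
        std::size_t count = 0;
        for (std::size_t j = i; j < n; ++j) {
            if (items[j] == items[i]) ++count;
        }
        if (count == k) return true;
    }
    return false;
}
```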

Append OpenCL result to list / Reduce solution room

I have an OpenCL kernel with multiple work items. Let's assume for discussion that I have a 2-D workspace with x*y elements working on an equally sized, but sparse, array of input elements. Few of these input elements produce a result that I want to keep; most don't. I want to enqueue another kernel that only takes the kept results as input.
Is it possible in OpenCL to append results to some kind of list to pass them as input to another Kernel or is there a better idea to reduce the volume of the solution space? Furthermore: Is this even a good question to ask with the programming model of OpenCL in mind?
What I would do if the amount of result data is a small percentage (i.e. 0-10%) is use local atomics and global atomics, with a global counter.
Data interface between kernel 1 <----> kernel 2:
int counter; //used by atomics to know where to write
data_type results[counter]; //used to store the results
Create a kernel function that does the operation on the data.
Work items that do produce a result:
Save the result to local memory, using local atomics on a local counter to ensure no data races occur.
Use work item 0 to save all the local results back to global memory using global atomics.
In the second kernel, work items with an ID lower than "counter" do work; the others just return.
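The two-level scheme above can be modelled on the host to make the data flow concrete. This is a sketch in plain C++, not OpenCL. ResultList and run_work_group are made-up names, a std::vector stands in for local memory, and "nonzero input" is an arbitrary stand-in for "this work item produced a result". Each group stages its results locally and then reserves a contiguous range of the global buffer with a single atomic add.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Models the interface between kernel 1 and kernel 2 (hypothetical type):
// a global counter plus a result buffer sized for the worst case.
struct ResultList {
    std::atomic<std::size_t> counter{0};  // next free slot in `results`
    std::vector<int> results;             // kept results, packed at the front
    explicit ResultList(std::size_t capacity) : results(capacity) {}
};

// Models one work-group of kernel 1.
void run_work_group(const int* input, std::size_t group_size,
                    ResultList& list) {
    std::vector<int> local(group_size);  // stands in for local memory
    std::size_t local_count = 0;         // a local atomic counter in a real kernel

    // Each "work item" with a result appends it to the local staging area.
    for (std::size_t lid = 0; lid < group_size; ++lid) {
        if (input[lid] != 0) {           // assumed meaning: this item has a result
            local[local_count++] = input[lid];
        }
    }

    // "Work item 0": one global atomic add reserves a range for the group.
    const std::size_t base = list.counter.fetch_add(local_count);
    assert(base + local_count <= list.results.size());
    for (std::size_t i = 0; i < local_count; ++i) {
        list.results[base + i] = local[i];
    }
}
```

After all groups have run, counter holds the number of valid entries, which is what the second kernel compares its work-item ID against.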

What is Vector data structure

I know Vector in C++ and Java; it's like a dynamic array, but I can't find any general definition of the Vector data structure. So what is a Vector? Is Vector a general data structure (like array, stack, queue, tree, ...) or is it just a data type that depends on the language?
The word "vector" as applied to computer science/programming is borrowed from math, which can make the use confusing (even your question could be on multiple subjects).
The simplest example of vectors in math is the number line, used to teach elementary math (especially to help visualize negative numbers, subtraction of negative numbers, addition of negative numbers, etc).
The vector is a distance and direction from a point. This is why it can confuse the discussion, because a vector data structure COULD be three points, X,Y,Z, in a structure used in 3D graphics engines, or a 2D point (just X,Y). In that context, the subtraction of two such points results in a vector - the vector describes how far and in what direction to travel from one of the source operands to the other.
This applies to storage, like stl vectors or Java vectors, in that storage is represented as a distance from an address (where a memory address is similar to a point in space, or on a number line).
The concept is related to arrays, because arrays could be the storage allocated for a vector, but I submit that the vector is a larger concept than the array. A vector must include the concept of distance from a starting point, and if you think of the beginning of an array as the starting point, the distance to the end of the array is its size.
So, the data structure representing a vector must include the size, whereas a raw array (as in C or C++) has no storage for the size; it's assumed from the way it's allocated. That is to say, if you dynamically allocate an array, there is no data structure storing the size of that array. The programmer must either know that size or store it in some integer or long.
The vector data structure (say, the design of a vector class) DOES need to store the size, so at a minimum, there would be a starting point (the base of an array, or some address in memory) and a distance from that point indicating size.
That's really "RAM" oriented, though, in description, because there's one more point not yet described which must be part of the data describing the vector - the notion of element size. If a vector represents bytes, and memory storage is typically measured in bytes, an address and a distance (or size) would represent a vector of bytes, but nothing else - and that's very machine-level thinking. A higher-level element, some structure, has its own size - say, the size of a float or double, or of a structure or class in C++. Whatever the element size is, storing N of them requires that the vector data structure have some knowledge of WHAT it's storing, and how large that thing is. This is why you'd think in terms of "a vector of strings" or "a vector of points". A vector must also store an element size.
So, a basic vector data structure must have:
An address (the starting point)
An element size (each thing it stores is X bytes long)
A number of elements stored (how many elements times element size is 'minimum' storage size).
One important "assumption" made in this simple 3 item list of entries in the vector data structure is that the address is allocated memory, which must be freed at some point, and is to be guarded against access beyond the end of the vector.
That means there's something missing. In order to make a vector class work, there is a recognizable difference between the number of ITEMS stored in the vector, and the amount of memory ALLOCATED for that storage. Typically, as you might realize from the use of vector from the STL, it may "know" it has room to store 10 items, but currently only has 2 of them.
So, a working vector class would ALSO have to store the amount of memory allocation. This would be how it could dynamically extend itself - it would now have sufficient information to expand storage automatically.
Thinking through just how you would make a vector class operate gives you the structure of data required to operate a vector class.
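Putting those pieces together, a bare-bones version might look like the sketch below. It is a toy for int elements only, so the element size is fixed at sizeof(int). The names IntVector, push_back, at and destroy, the initial capacity of 4 and the doubling growth are all illustrative assumptions, not how any particular library does it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// The three pieces of state discussed above. The element size is implied
// by the element type (int) rather than stored as a separate field.
struct IntVector {
    int* data = nullptr;       // address: start of the allocated storage
    std::size_t size = 0;      // number of elements currently stored
    std::size_t capacity = 0;  // number of elements the storage can hold
};

// Appends a value, growing the storage when it is full.
void push_back(IntVector& v, int value) {
    if (v.size == v.capacity) {
        // Assumed policy: start at 4 slots, then double each time.
        const std::size_t new_capacity = v.capacity == 0 ? 4 : v.capacity * 2;
        int* bigger = new int[new_capacity];
        std::copy(v.data, v.data + v.size, bigger);  // move old elements over
        delete[] v.data;                             // free the old storage
        v.data = bigger;
        v.capacity = new_capacity;
    }
    v.data[v.size++] = value;
}

// Bounds-checked read: guards against access beyond the stored elements.
int at(const IntVector& v, std::size_t index) {
    assert(index < v.size);
    return v.data[index];
}

// Releases the storage and resets the vector to empty.
void destroy(IntVector& v) {
    delete[] v.data;
    v = IntVector{};
}
```

After five calls to push_back the size is 5 but the capacity is 8, which is the "room for 10 items, but currently only 2" distinction described above.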
It's an array with dynamically allocated space. Every time you exceed this space, a new block of memory is allocated and the old array is copied to the new one. The old one is then freed.
Moreover, a vector usually allocates more memory than it needs, so it does not have to copy all the data every time a new element is added.
It may seem that lists are much better, but that's not necessarily so. If you do not change the size of your vector often, the computer's cache works much better with vectors than with lists, because vectors are contiguous in memory. The disadvantage comes when you have a large vector that you need to expand: then you have to accept copying a large amount of data to another place in memory.
What's more, you can add new data to the end and to the front of the vector. Because vectors are array-like, every time you add an element to the beginning of the vector, all the existing elements have to be shifted. Adding elements to the end of a vector is far more efficient. There's no such issue with linked lists.
A vector gives random access to its internally kept data, while lists, queues and stacks do not.
Vectors are the same as dynamic arrays, with the ability to resize themselves automatically when an element is inserted or deleted.
Vector elements are placed in contiguous storage so that they can be accessed and traversed using iterators.
In vectors, data is normally inserted at the end.

Is it ok to create big array of AVX/SSE values

I am parallelizing a certain dynamic programming problem using AVX2/SSE instructions.
In the main iteration of my calculation, I calculate a column in a matrix where each cell is a structure of AVX2 registers (__m256i). I use values from the previous matrix column as input values for calculating the current column. Columns can be big, so I keep an array of structures (on the stack), where each structure has two __m256i elements.
struct Cell {
    __m256i first;
    __m256i second;
};
And then I have an array like this: Cell prevColumn[N]. N will typically be a few hundred.
I know that __m256i basically represents an AVX2 register, so I am wondering how I should think about this array and how it behaves, since N is much larger than 16 (which is the number of AVX registers). Is it good practice to create such an array, or is there some better approach that I should use when storing a lot of __m256i values that are going to be reused very soon?
Also, is there any aligning I should be doing with these structures? I read a lot about aligning, but I am still not sure how and when to do it exactly.
It's better to structure your code to do everything it can with a value before moving on. Small buffers that fit in L1 cache aren't going to be too bad for performance, but don't do that unless you need to.
I think it's more typical to write your code with buffers of int [] type, rather than __m256i type, but I'm not sure. Either way works, and should get the compiler to generate efficient code. But the int [] way means less code has to be different for the SSE, AVX2, and AVX512 versions. And it might make it easier to examine things with a debugger, to have your data in an array with a type that will get the data formatted nicely.
As I understand it, the load/store intrinsics are partly there as a cast between __m256i and int [], since AVX doesn't fault on unaligned memory operands; it just slows down when an access crosses a cache-line boundary. Assigning to / from an array of __m256i should work fine, and generate load/store instructions where needed, otherwise generate vector instructions with memory source operands (for more compact code and fewer fused-domain uops).
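On the alignment question, here is a sketch of what the int [] layout could look like. It deliberately uses no intrinsics so that it compiles anywhere. CellInts, is_aligned_32, assert_column_aligned and column_bytes are made-up names. The alignas(32) matches the 32-byte width of a 256-bit register, which is the alignment that the aligned load intrinsic _mm256_load_si256 expects (the unaligned variant _mm256_loadu_si256 does not need it).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;  // eight 32-bit ints fill one 256-bit register

// Plain-int counterpart of the question's Cell (hypothetical type).
// alignas(32) makes every CellInts, and therefore both member arrays,
// start on a 32-byte boundary, including in stack arrays.
struct alignas(32) CellInts {
    std::int32_t first[kLanes];
    std::int32_t second[kLanes];
};

static_assert(sizeof(CellInts) == 64, "two 32-byte vectors per cell");
static_assert(alignof(CellInts) == 32, "cells start on 32-byte boundaries");

// True if the pointer is suitable for an aligned 256-bit load or store.
bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Debug-time check that every vector-sized member in a column is aligned.
void assert_column_aligned(const CellInts* column, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        assert(is_aligned_32(column[i].first));
        assert(is_aligned_32(column[i].second));
    }
}

// Memory footprint of a column of n cells, in bytes.
constexpr std::size_t column_bytes(std::size_t n) {
    return n * sizeof(CellInts);
}
```

For N = 300 a column takes 19200 bytes, which would fit in a 32 KiB L1 data cache (a common size, but an assumption about the target CPU). The array lives in memory like any other array. Values are loaded into registers only while they are being operated on, so N being larger than the register count is not a problem in itself.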
