I am parallelizing a certain dynamic programming problem using AVX2/SSE instructions.
In the main iteration of my calculation, I compute one column of a matrix in which each cell is a structure of AVX2 registers (__m256i). I use the values from the previous column as input for calculating the current column. Columns can be big, so I keep an array of structures (on the stack), where each structure has two __m256i elements.
Structure:
struct Cell {
__m256i first;
__m256i second;
};
And then I have an array like this: Cell prevColumn[N]. N will typically be a few hundred.
I know that __m256i basically represents an AVX2 register, so I am wondering how I should think about this array and how it behaves, since N is much larger than 16 (the number of AVX registers). Is it good practice to create such an array, or is there a better approach I should use when storing a lot of __m256i values that are going to be reused very soon?
Also, is there any alignment I should be doing with these structures? I have read a lot about alignment, but I am still not sure how and when to do it exactly.
It's better to structure your code to do everything it can with a value before moving on. Small buffers that fit in L1 cache aren't going to be too bad for performance, but don't do that unless you need to.
I think it's more typical to write your code with buffers of int[] type rather than __m256i type, but I'm not sure. Either way works, and should get the compiler to generate efficient code. But the int[] way means less code has to differ between the SSE, AVX2, and AVX512 versions. It might also make things easier to examine in a debugger, since the data sits in an array whose type gets formatted nicely.
As I understand it, the load/store intrinsics are partly there as a cast between __m256i and int[], since AVX doesn't fault on unaligned memory operands; it is only slower when an access crosses a cache-line boundary. Assigning to or from an array of __m256i should work fine and generate load/store instructions where needed, and otherwise generate vector instructions with memory source operands (for more compact code and fewer fused-domain uops).
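To make the int[] layout concrete, here is a minimal portable sketch with no intrinsics. The names CellInts, kLanes and computeColumn are hypothetical, and the recurrence is a placeholder rather than the asker's actual DP. It only illustrates a 32-byte-aligned cell of two 8-int halves and a loop that finishes each cell before moving on.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;  // 8 x 32-bit ints fill one 256-bit vector

// Hypothetical stand-in for the question's Cell, stored as plain ints.
// alignas(32) makes each 8-int half start on a 32-byte boundary.
struct alignas(32) CellInts {
    std::int32_t first[kLanes];
    std::int32_t second[kLanes];
};

static_assert(sizeof(CellInts) == 64, "two 256-bit halves, no padding");
static_assert(alignof(CellInts) == 32, "suitable for aligned 256-bit loads");

// Placeholder recurrence (an assumption, not the asker's): each cell combines
// the same row and the row above it from the previous column.
void computeColumn(const CellInts* prev, CellInts* cur, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const CellInts& up = prev[i == 0 ? 0 : i - 1];
        // Finish everything for cell i before touching cell i + 1.
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            cur[i].first[lane] = prev[i].first[lane] + up.first[lane];
            cur[i].second[lane] = std::max(prev[i].second[lane], up.second[lane]);
        }
    }
}
```

In an AVX2 build, each 8-int half would correspond to one __m256i, loaded with _mm256_load_si256 (aligned) or _mm256_loadu_si256 (unaligned). A few hundred such cells occupy a few tens of kilobytes, so whether both columns stay in L1 depends on N and on the cache size of the target CPU.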
I have an OpenCL kernel with multiple work items. Let's assume, for discussion, that I have a 2-D workspace with x*y elements working on an equally sized but sparse array of input elements. A few of these input elements produce a result that I want to keep; most don't. I want to enqueue another kernel that takes only the kept results as input.
Is it possible in OpenCL to append results to some kind of list to pass them as input to another Kernel or is there a better idea to reduce the volume of the solution space? Furthermore: Is this even a good question to ask with the programming model of OpenCL in mind?
If the amount of result data is a small percentage of the input (i.e., 0-10%), what I would do is use local atomics and global atomics, with a global counter.
Data interface between kernel 1 <----> Kernel 2:
int counter //used by atomics to know where to write
data_type results[counter]; //used to store the results
Kernel1:
Create a kernel function that does the operation on the data
Work items that do produce a result:
Save the result to local memory, and ensure no data races occur using local atomics in a local counter.
Use the work item 0 to save all the local results back to global memory using global atomics.
Kernel2:
Work items lower than "counter" do work, the others just return.
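The outline above can be emulated on the host in plain C++ to see the indexing logic. This is a sketch, not OpenCL code: kGroupSize, runGroup, kernel1 and kernel2 are hypothetical names, "nonzero input" stands in for "produces a result", and the groups run one after another here instead of concurrently. Each group collects its results in a local buffer via a local counter, then reserves a block of the global array with a single atomic add.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 64;  // hypothetical work-group size

// Emulates one work-group of Kernel1.
void runGroup(const std::vector<int>& input, std::size_t groupStart,
              std::vector<int>& results, std::atomic<int>& counter) {
    int localResults[kGroupSize];      // plays the role of local memory
    std::atomic<int> localCounter{0};  // local atomic counter
    const std::size_t end = std::min(groupStart + kGroupSize, input.size());

    for (std::size_t i = groupStart; i < end; ++i) {
        if (input[i] != 0) {  // assumption: nonzero means "keep this result"
            const int slot = localCounter.fetch_add(1);
            localResults[slot] = input[i];
        }
    }

    // "Work item 0": reserve a block in global memory with one atomic add,
    // then copy the local results into it.
    const int count = localCounter.load();
    const int base = counter.fetch_add(count);
    for (int k = 0; k < count; ++k) results[base + k] = localResults[k];
}

// Returns the final value of the global counter.
int kernel1(const std::vector<int>& input, std::vector<int>& results) {
    results.assign(input.size(), 0);  // worst case: every item produces a result
    std::atomic<int> counter{0};
    for (std::size_t start = 0; start < input.size(); start += kGroupSize)
        runGroup(input, start, results, counter);
    return counter.load();
}

// Kernel2: ids below the counter do work, the others just return.
void kernel2(std::vector<int>& results, int counter) {
    for (std::size_t id = 0; id < results.size(); ++id) {
        if (id >= static_cast<std::size_t>(counter)) continue;
        results[id] *= 2;  // placeholder for the real second-stage work
    }
}
```

On a real device the groups run concurrently, so the order of the blocks in the results array is not deterministic; only the count and the set of stored values are.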
So bit arrays and hash tables don't seem to inherently allow for a find-max type operation, but there are ways around it. I'm wondering if there's a way using the bit array alone without extra variables, pointers, or manipulating the start/end of the array, in some scenarios. For example...
I have integers {1,...,n} and a n-bit bit array. To keep a subset of the integers, I use the integer itself as the key in the bit array and set the bit to 1 if it is in the subset, or 0 if it is not.
For example, for integers {1,2,3,4} and subset {1,3}, the bit array would look like {1,0,1,0}.
It seems like there's no way to do this without somehow moving the bits around which leads me to believe the O(1) dream is dead and perhaps the bit array won't work. Is something like this possible in O(log n)?
Thanks
Finding the highest set bit on a bit array of length n is O(n). If you need better, then you'll need to choose another data structure, or keep a high-water mark along with your bitmap.
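A minimal sketch of the high-water-mark idea, assuming the integers are 1..n and using 0 to mean "empty". The class name BitSetWithMax is hypothetical. Insertion and max lookup are O(1); removing the current maximum falls back to a downward scan, which is O(n) in the worst case.

```cpp
#include <cstddef>
#include <vector>

// Bit array over {1, ..., n} that also tracks the largest member.
class BitSetWithMax {
public:
    explicit BitSetWithMax(std::size_t n) : bits_(n + 1, false), max_(0) {}

    void insert(std::size_t x) {
        bits_[x] = true;
        if (x > max_) max_ = x;  // O(1) update of the high-water mark
    }

    void erase(std::size_t x) {
        bits_[x] = false;
        if (x == max_) {
            // The maximum was removed: scan down to the next set bit.
            while (max_ > 0 && !bits_[max_]) --max_;
        }
    }

    bool contains(std::size_t x) const { return bits_[x]; }

    // Largest member, or 0 if the set is empty.
    std::size_t max() const { return max_; }

private:
    std::vector<bool> bits_;  // index 0 is unused
    std::size_t max_;
};
```

This does use one extra variable next to the bitmap, which is exactly the trade the answer describes.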
After compacting an array (putting the required elements from an input array into an output array) with a scan operation, there may be a contiguous run of empty slots left at the end of the compacted output array. Is there a way to free these empty slots in the OpenCL kernel code itself, without going back to the host just to delete them?
For example, I have an input array of 100 elements, some greater than 50 and some less than 50. I want to store the numbers greater than 50 in a different array and do further processing only on those elements. I don't know the size of this output array, since I don't know how many numbers are actually greater than 50, so I declare it with size 100. After performing a scan I get the output array with all the elements greater than 50, but there may be a contiguous run of empty slots after them. How do we delete these slots? Is there a way of doing this in the kernel code itself, or do we have to go back to the host code?
How do we deal with such compacted arrays in further processing if we can't delete the remaining slots in the kernel code itself and we also don't want to go back to the host code?
There is no simple solution to your problem I'm afraid.
What I think you might do, is to have a counter of the elements in each array. You can increment the counter first locally with atomic_inc() and then globally with atomic_add().
This way at the end of your kernel execution the total number of elements in each array will be present.
You can also use the value returned by the atomic operation as an index into the array. This way you can write to the output without any "holes" in your array. However, you will probably lose some speed because of the heavy use of atomic operations, I'm afraid.
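One way to read this answer is that the unused tail does not need to be deleted at all, as long as the element count travels with the array. The sketch below is sequential host-side C++, not kernel code, and the names compactGreaterThan and sumFirst are hypothetical. It performs the scan-based compaction from the question and returns the count, and the later stage reads only the first count elements.

```cpp
#include <cstddef>
#include <vector>

// Compacts the elements greater than `threshold` to the front of `output`
// and returns how many were kept. `output` keeps the worst-case size.
std::size_t compactGreaterThan(const std::vector<int>& input, int threshold,
                               std::vector<int>& output) {
    const std::size_t n = input.size();
    std::vector<std::size_t> offsets(n);

    // Exclusive scan over the keep flags: offsets[i] is the output slot
    // for element i if it is kept.
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        offsets[i] = running;
        if (input[i] > threshold) ++running;
    }

    output.assign(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        if (input[i] > threshold) output[offsets[i]] = input[i];
    }
    return running;  // total number of kept elements
}

// A later stage that ignores the unused tail by honouring the count.
long sumFirst(const std::vector<int>& data, std::size_t count) {
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) total += data[i];
    return total;
}
```

On the device the scan would be a parallel prefix sum, and the count could be left in a buffer for the next kernel to read.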
Suppose you want a list of arrays, each having the same size. Is it better performance-wise to use a 2D array:
integer, allocatable :: data(:,:)
or an array of derived types:
type test
integer, allocatable :: content(:)
end type
type(test), allocatable :: data(:)
Of course, for arrays of different sizes, we don't have a choice. But how is the memory managed in the two cases? Also, is one of them better coding practice?
Choose the implementation which minimises the conceptual distance that your mind has to leap between the problem in your head and the solution in your code. The force of this approach increases with age, both the age of your code (good conceptual design is a solid foundation for future development) and your own age (the less effort understanding your code demands the longer you'll remain mentally competent enough to understand it).
As to the non-opinion-determined part of your question concerning the way that the memory is managed ... My naive expectation is that most compilers will, under most circumstances, allocate contiguous memory for the first of your outlines, and may not for the second. But I don't care enough about this to check, and I do not think that you should either. I don't, by this, suggest that you should not be interested in what is going on under the hood, but rather that you should be more concerned with the matters referred to in the first paragraph.
In general, you want to use the simplest data structure that suits your problem. If a 2d rectangular array meets your needs - and for a huge number of scientific computing problems, problems for which Fortran is a good choice, it does - then that's the choice you want.
The 2d array will be contiguous in memory, which will normally make accessing it faster both due to caching and one fewer level of indirection; the 2d array will also allow you to do things like data = data * 2 or data = 0. which the array-of-array approach doesn't [Edited to add: though as IanH points out in comments you can create a defined type and defined operations on those types to allow this]. Those advantages are great enough that even when you have "ragged arrays", if the range of expected row lengths isn't that large, implementing it as a rectangular 2d array is sometimes a choice worth considering.
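The same contrast can be sketched in C++ as a rough analogue of the two Fortran declarations. Grid2D, Ragged and scale are hypothetical names; the flat storage is indexed column-major to mirror Fortran. The point is only that one contiguous block supports whole-array operations in a single pass, while the array-of-arrays version reaches every element through an extra level of indirection.

```cpp
#include <cstddef>
#include <vector>

// Rectangular 2D array in one contiguous block, column-major like Fortran.
struct Grid2D {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> cells;

    Grid2D(std::size_t r, std::size_t c) : rows(r), cols(c), cells(r * c, 0) {}

    int& at(std::size_t i, std::size_t j) { return cells[j * rows + i]; }

    // Whole-array operation, comparable to data = data * 2 in Fortran.
    void scale(int factor) {
        for (int& v : cells) v *= factor;
    }
};

// Array of arrays: every inner vector is its own heap allocation, so the
// data is generally not contiguous across inner vectors.
using Ragged = std::vector<std::vector<int>>;

void scale(Ragged& data, int factor) {
    for (auto& column : data) {
        for (int& v : column) v *= factor;
    }
}
```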
I have a struct defined as such:
typedef struct {
string mName;
vector<int> mParts;
} AGroup;
I'm storing instances of this struct in a vector. I need to write this to an HDF (v5) file. I guess I could loop through each instance to find the longest mName, and longest mParts, create a new, non-variable length, array to hold the information, and then write that array to the file.
Is that the best way to do it? It seems overly complex just to write some data.
Variable-length arrays and strings introduce overhead. But they also make sense, as they reflect your data structure more accurately. You could go with a compound datatype made of a variable-length C string and a variable-length array of integers.
If your strings and vectors all have a size close to the same upper bound, save yourself some time and trouble and use fixed-length strings and arrays.
It all depends on the size of your dataset. In terms of disk space, there is a tradeoff between the overhead of variable-length elements and the wasted space in fixed-size elements, and it is hard to estimate without trying. If your dataset is small, do whatever is more convenient for you. If it is large, choose what you favor most (avoiding wasted space, the semantics of your data, ease of programming, etc.) and optimize according to that criterion.
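To get a feel for the wasted space in the fixed-length option, the packing step can be sketched without any HDF5 calls. FixedLayout and packFixed are hypothetical names, and the fill value -1 is an assumption. The function finds the longest mName and mParts, pads every record to those lengths, and counts the padded integer slots. The resulting flat buffers are the kind of thing one would hand to a fixed-size dataset.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct AGroup {
    std::string mName;
    std::vector<int> mParts;
};

// Flat, fixed-length layout of all groups (hypothetical helper type).
struct FixedLayout {
    std::size_t nameLen = 0;     // length of the longest mName
    std::size_t partsLen = 0;    // length of the longest mParts
    std::vector<char> names;     // groups.size() * nameLen, NUL-padded
    std::vector<int> parts;      // groups.size() * partsLen, padded with fill
    std::size_t paddedInts = 0;  // int slots that hold only padding
};

FixedLayout packFixed(const std::vector<AGroup>& groups, int fill = -1) {
    FixedLayout out;
    for (const AGroup& g : groups) {
        out.nameLen = std::max(out.nameLen, g.mName.size());
        out.partsLen = std::max(out.partsLen, g.mParts.size());
    }

    out.names.assign(groups.size() * out.nameLen, '\0');
    out.parts.assign(groups.size() * out.partsLen, fill);

    for (std::size_t i = 0; i < groups.size(); ++i) {
        std::copy(groups[i].mName.begin(), groups[i].mName.end(),
                  out.names.begin() + i * out.nameLen);
        std::copy(groups[i].mParts.begin(), groups[i].mParts.end(),
                  out.parts.begin() + i * out.partsLen);
        out.paddedInts += out.partsLen - groups[i].mParts.size();
    }
    return out;
}
```

Comparing paddedInts with the total number of real integers gives a rough idea of whether the fixed-length layout wastes a little or a lot for a given dataset.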