C++ doubling index vector of unique ordered relationships - vector

I'm looking for an std style container, i.e. with iterators and such with a
structure along the lines of:
template <hashable T, hashable U> class relationships {
relationships(std::[vector or list]<std::pair<T,U>> list);
const std::pair<T,U>& operator [](const T index);
const std::pair<T,U>& operator [](const U index);
};
This is for a two-way mapping into an ordered list of pairs. Every value of both T and U is unique, both types are hashable, and the pairs of related T and U have a specific ordering that iteration should reproduce, i.e. the following loop
for (auto it : relationships) {
// do something with it
}
would be equivalent to
for (auto it : list) {
// do something with it
}
I also want efficient lookup, i.e. operator [] should perform like an std::unordered_map for both types.
Finally, I'm looking for solutions based on the Standard Library using C++14, and I DO NOT WANT TO USE BOOST.
I have previously seen how to implement a hash map using binary search trees; however, I am looking for insight into how to efficiently maintain the structure for two indexes plus ordered elements, or for an existing solution if one exists.
My current idea is something using nodes along the lines of
template <typename T, typename U> struct node {
std::pair<T, U> value; // actual value
// hashes used as sort keys in the binary trees
size_t hashT;
size_t hashU;
// linked list for ordering
node * prevL;
node * nextL;
// binary search tree for type T lookup
node * parentT;
node * prevT;
node * nextT;
// binary search tree for type U lookup
node * parentU;
node * prevU;
node * nextU;
};
However, that seems inefficient.
My other idea is to store a vector of values, which has order, and then two sorted index vectors of std::pair<size_t, size_t>, with first being the hash and second the index. However, how should I perform a binary search on the index vector and handle hash collisions? I believe this solution would be more memory efficient and similarly fast, but I am not sure about all the implementation details.
EDIT: I don't need fast insertions, just lookup and iteration; the mapping would be generated once and then used to find relationships.

Regarding performance, it all depends on the algorithm and the types T and U you are trying to use. If you build your data and then do not change it, a simple solution would be the following:
Use a vector<pair<T,U>> for constructing your data
duplicate this vector
sort one vector according to T, one according to U
use binary search for fast lookup, either in the first vector if looking by T, or in the second if looking by U
hide all this behind a construct/sort/access interface. You might not want to use operator[] since you are expected to look into your data structure only once sorted
Of course the solution is not perfect in the sense that you are copying data. However, remember that you will have no extra hidden allocation as you would with a hash map. For example, for T = U = int, I would expect it to use no more memory than a std::unordered_map, since each node of the map needs to store a pointer.
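One possible shape for this, as a minimal C++14 sketch rather than a finished container: instead of duplicating the pairs, it keeps them once in their original order and sorts two vectors of indices, one by T and one by U, which is a variation on the steps above. The class name relationships is borrowed from the question; find_first and find_second are hypothetical names, and the sketch assumes that both types provide operator< and that looking up a missing key may throw.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Sketch only: build once, then look up and iterate. No insertion support.
template <typename T, typename U>
class relationships {
public:
    using value_type = std::pair<T, U>;
    using const_iterator = typename std::vector<value_type>::const_iterator;

    explicit relationships(std::vector<value_type> list)
        : items_(std::move(list)), by_first_(items_.size()), by_second_(items_.size()) {
        for (std::size_t i = 0; i < items_.size(); ++i) by_first_[i] = by_second_[i] = i;
        // Sort the indices, not the pairs, so items_ keeps its original order.
        std::sort(by_first_.begin(), by_first_.end(),
                  [this](std::size_t a, std::size_t b) { return items_[a].first < items_[b].first; });
        std::sort(by_second_.begin(), by_second_.end(),
                  [this](std::size_t a, std::size_t b) { return items_[a].second < items_[b].second; });
    }

    // Iteration reproduces the order of the input list.
    const_iterator begin() const { return items_.begin(); }
    const_iterator end() const { return items_.end(); }

    // Binary search over the index sorted by T.
    const value_type& find_first(const T& key) const {
        auto it = std::lower_bound(by_first_.begin(), by_first_.end(), key,
                                   [this](std::size_t idx, const T& k) { return items_[idx].first < k; });
        if (it == by_first_.end() || key < items_[*it].first) throw std::out_of_range("no such T");
        return items_[*it];
    }

    // Binary search over the index sorted by U.
    const value_type& find_second(const U& key) const {
        auto it = std::lower_bound(by_second_.begin(), by_second_.end(), key,
                                   [this](std::size_t idx, const U& k) { return items_[idx].second < k; });
        if (it == by_second_.end() || key < items_[*it].second) throw std::out_of_range("no such U");
        return items_[*it];
    }

private:
    std::vector<value_type> items_;       // original order
    std::vector<std::size_t> by_first_;   // indices into items_, sorted by T
    std::vector<std::size_t> by_second_;  // indices into items_, sorted by U
};
```

Lookups here are O(log n) rather than the O(1) average of a hash map, and comparing keys directly sidesteps the hash-collision question from the original post. Separate function names are used instead of two operator[] overloads so that the sketch also works when T and U are the same type.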

Related

Usage of AVX512-CD

I am currently working with the KNL and am trying to understand the new opportunities of AVX512. Besides the extended register size, AVX512 comes with new instruction sets. Conflict detection seems promising.
The intrinsic
_mm512_conflict_epi32(...)
creates a vector register containing a conflict-free subset of the given source register.
The first appearance of a value results in a 0 at the corresponding position within the result vector. If the value has appeared before, the result element holds a zero-extended bit vector in which each set bit marks an earlier element with the same value.
So far so good! BUT I wonder how one can use this result for further aggregations or computations. I read that one could use it alongside a leading-zero count, but I don't think that is enough to determine the values of the subsets.
Does anyone know how one can utilize this result?
Sincerely
Now I understand that your question is how to utilize results from VPCONFLICTD/Q to build subsets for further aggregations or computations ...
Using your own example:
conflict_input =
[
00000001|00000001|00000001|00000001|
00000002|00000002|00000002|00000002|
00000002|00000002|00000001|00000001|
00000001|00000001|00000001|00000001
]
Applying VPCONFLICTD:
__m512i out = _mm512_conflict_epi32(in);
Now we get:
conflict_output =
[
00000000|00000001|00000003|00000007|
00000000|00000010|00000030|00000070|
000000f0|000001f0|0000000f|0000040f|
00000c0f|00001c0f|00003c0f|00007c0f
]
bit representation =
[
................|...............1|..............11|.............111|
................|...........1....|..........11....|.........111....|
........1111....|.......11111....|............1111|.....1......1111|
....11......1111|...111......1111|..1111......1111|.11111......1111
]
If you wish to get a mask marking the first appearance of each value:
const __m512i set1 = _mm512_set1_epi32(0xFFFFFFFF);
const __mmask16 mask = _mm512_testn_epi32_mask(out, set1);
Now you can do all the usual things with the __mmask16 (shown here with element 0 leftmost):
[1000100000000000]
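You can also compress it (set0 is assumed here to be an all-zero vector, which matches the zero fill in the output):
const __m512i set0 = _mm512_setzero_si512();
const __m512i out3 = _mm512_mask_compress_epi32(set0, mask, in);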
[00000001|00000002|00000000|00000000|
00000000|00000000|00000000|00000000|
00000000|00000000|00000000|00000000|
00000000|00000000|00000000|00000000]
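To check these outputs without AVX-512 hardware, here is a scalar C++ model of the three steps so far: the conflict detection, the first-occurrence mask, and the compress. It is only an illustration of the semantics. The function names conflict_epi32_model, first_occurrence_mask and compress_model are hypothetical, and real code would use the intrinsics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::uint32_t, 16>;

// Model of VPCONFLICTD: bit j of out[i] is set when j < i and in[j] == in[i].
Vec16 conflict_epi32_model(const Vec16& in) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i)
        for (std::size_t j = 0; j < i; ++j)
            if (in[j] == in[i]) out[i] |= (std::uint32_t{1} << j);
    return out;
}

// Model of testn against all-ones: bit i is set when conflicts[i] == 0,
// i.e. element i has no earlier duplicate.
std::uint16_t first_occurrence_mask(const Vec16& conflicts) {
    std::uint16_t mask = 0;
    for (std::size_t i = 0; i < 16; ++i)
        if (conflicts[i] == 0) mask = static_cast<std::uint16_t>(mask | (1u << i));
    return mask;
}

// Model of a masked compress with an all-zero source: selected elements are
// packed into the low positions, the rest stay zero.
Vec16 compress_model(std::uint16_t mask, const Vec16& in) {
    Vec16 out{};
    std::size_t k = 0;
    for (std::size_t i = 0; i < 16; ++i)
        if ((mask >> i) & 1u) out[k++] = in[i];
    return out;
}
```

For the example input, the model gives 0x40f for element 11 and sets mask bits 0 and 4, and the compressed vector starts with 1, 2 followed by zeros, which agrees with the listings above.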
There are lots of things you can do with the mask. However, I also noticed vplzcntd, which looks interesting, and I don't know where I can use it:
const __m512i out1 = _mm512_conflict_epi32(in);
const __m512i out2 = _mm512_lzcnt_epi32(out1);
output2 = [
00000020|0000001f|0000001e|0000001d|
00000020|0000001b|0000001a|00000019|
00000018|00000017|0000001c|00000015|
00000014|00000013|00000012|00000011
]
= [
..........1.....|...........11111|...........1111.|...........111.1|
..........1.....|...........11.11|...........11.1.|...........11..1|
...........11...|...........1.111|...........111..|...........1.1.1|
...........1.1..|...........1..11|...........1..1.|...........1...1
]
See also some AVX512 histogram links and info I dug up a while ago in this answer.
I think the basic idea is to scatter the conflict-free set of elements, then re-gather, re-process, and re-scatter the next conflict-free set of elements. Repeat until there are no more conflicts.
Note that the first appearance of a repeated index is a "conflict-free" element, according to vpconflictd, so a simple repeat loop makes forward progress.
Steps in this process:
Turn a vpconflictd result into a mask that you can use with a gather instruction: _mm512_testn_epi32_mask (as suggested by @veritas) against a vector of all-ones looks good for this, since you need to invert it. You can't just test it against itself.
Remove the already-done elements: vpcompressd is probably good for this. We can even fill up the "empty" spaces in our vector with new elements, so we don't re-run the gather / process / scatter loop with most of the elements masked.
For example, this might work as a histogram loop, if I'm doing this right:
// probably slow, since it assumes conflicts and has a long loop-carried dep chain
// TOTALLY untested.
__m512i all_ones = _mm512_set1_epi32(-1); // easy to gen on the fly (vpternlogd)
__m512i indices = _mm512_loadu_si512(p);
p += 16;
// pessimistic loop that assumes conflicts
while (p < endp) {
// unmasked gather, so it can run in parallel with conflict detection
__m512i v = _mm512_i32gather_epi32(indices, base, 4);
v = _mm512_sub_epi32(v, all_ones); // -= -1 to reuse the constant.
// scatter the no-conflict elements
__m512i conflicts = _mm512_conflict_epi32(indices);
__mmask16 knoconflict = _mm512_testn_epi32_mask(conflicts, all_ones);
_mm512_mask_i32scatter_epi32(base, knoconflict, indices, v, 4);
// if(knoconflict == 0xffff) { goto optimistic_loop; }
// keep the conflicting elements and merge in new indices to refill the vector
size_t done = _popcnt32(knoconflict);
p += done; // the elements that overlap will be replaced with the conflicts from last time
__m512i newidx = _mm512_loadu_si512(p);
// merge-mask into the bottom of the newly-loaded index vector
indices = _mm512_mask_compress_epi32(newidx, ~knoconflict, indices);
}
We end up needing the mask both ways (knoconflict and ~knoconflict). It might be best to use _mm512_test_epi32_mask(same,same) and avoid the need for a vector constant to testn against. That might shorten the loop-carried dependency chain from indices in mask_compress, by putting the inversion of the mask onto the scatter dependency chain. When there are no conflicts (including between iterations), the scatter is independent.
If conflicts are rare, it's probably better to branch on it. This branchless handling of conflicts is a bit like using cmov in a loop: it creates a long loop-carried dependency chain.
Branch prediction + speculative execution would break those chains, and allow multiple gathers / scatters to be in flight at once. (And avoid running popcnt / vpcompressd at all when there are no conflicts).
Also note that vpconflictd is slow-ish on Skylake-avx512 (but not on KNL). When you expect conflicts to be very rare, you might even use a fast any_conflicts() check that doesn't find out where they are before running the conflict-handling.
See Fallback implementation for conflict detection in AVX2 for a ymm AVX2 implementation, which should be faster than Skylake-AVX512's micro-coded vpconflictd ymm. Expanding it to 512b zmm vectors shouldn't be difficult (and might be even more efficient if you can take advantage of AVX512 masked-compare into mask to replace a boolean operation between two compare results). Maybe with AVX512 vpcmpud k0{k1}, zmm0, zmm1 with a NEQ predicate.

Removing any element from an associative array

I'd like to remove an(y) element from an associative array and process it.
Currently I'm using a RedBlackTree together with .removeAny(), but I don't need the data to be in any order. I could use .byKey() on the AA, but that always produces an array with all keys. I only need one at a time and will probably change the AA while processing every other element. Is there any other smart way to get exactly one key without (internally) traversing the whole data structure?
There is a workaround, which works as well as using .byKey():
auto anyKey(K, V)(inout ref V[K] aa)
{
foreach (K k, ref inout(V) v; aa)
return k;
assert(0, "Associative array hasn't any keys.");
}
For my needs, .byKey().front seems to be fast enough though. Not sure if the workaround is actually faster.

New to OCaml: How would I go about implementing Gaussian Elimination?

I'm new to OCaml, and I'd like to implement Gaussian Elimination as an exercise. I can easily do it with a stateful algorithm, meaning keep a matrix in memory and recursively operating on it by passing around a reference to it.
This statefulness, however, smacks of imperative programming. I know there are capabilities in OCaml to do this, but I'd like to ask if there is some clever functional way I haven't thought of first.
OCaml arrays are mutable, and it's hard to avoid treating them just like arrays in an imperative language.
Haskell has immutable arrays, but from my (limited) experience with Haskell, you end up switching to monadic, mutable arrays in most cases. Immutable arrays are probably amazing for certain specific purposes. I've always imagined you could write a beautiful implementation of dynamic programming in Haskell, where the dependencies among array entries are defined entirely by the expressions in them. The key is that you really only need to specify the contents of each array entry one time. I don't think Gaussian elimination follows this pattern, and so it seems it might not be a good fit for immutable arrays. It would be interesting to see how it works out, however.
You can use a Map to emulate a matrix. The key would be a pair of integers referencing the row and column. You'll want to use your own get x y function to ensure x < n and y < n though, instead of accessing the Map directly. (edit) You can use the compare function in Pervasives directly.
module OrderedPairs = struct
type t = int * int
let compare = Pervasives.compare
end
module Pairs = Map.Make (OrderedPairs)
let get_ n set x y =
assert( x < n && y < n );
Pairs.find (x,y) set
let set_ n set x y v =
assert( x < n && y < n );
Pairs.add (x,y) v set
Actually, having a general set of functions (get x y and set x y at a minimum), without specifying the implementation, would be an even better option. The functions can then be passed to the function, or be implemented in a module through a functor (a better solution, but having a set of functions that do just what you need would be a first step since you're new to OCaml). In this way you can use a Map, Array, Hashtbl, or a set of functions that access a file on the hard drive to implement the matrix if you wanted. This is the really important aspect of functional programming: you trust the interface rather than exploiting side effects, and you don't worry about the underlying implementation, since it's presumed to be pure.
The answers so far are using/emulating mutable data-types, but what does a functional approach look like?
To see, let's decompose the problem into some functional components:
Gaussian elimination involves a sequence of row operations, so it is useful first to define a function taking 2 rows and scaling factors, and returning the resultant row operation result.
The row operations we want should eliminate a variable (column) from a particular row, so let's define a function which takes a pair of rows and a column index and uses the previously defined row operation to return the modified row with that column entry zero.
Then we define two functions, one to convert a matrix into triangular form, and another to back-substitute a triangular matrix to the diagonal form (using the previously defined functions) by eliminating each column in turn. We could iterate or recurse over the columns, and the matrix could be defined as a list, vector or array of lists, vectors or arrays. The input is not changed, but a modified matrix is returned, so we can finally do:
let out_matrix = to_diagonal (to_triangular in_matrix);
What makes it functional is not whether the data-types (array or list) are mutable, but how they are used. This approach may not be particularly 'clever' or be the most efficient way to do Gaussian eliminations in OCaml, but using pure functions lets you express the algorithm cleanly.

Performing operations on CUDA matrices while reading from a global Point

Hey there,
I have a multidimensional mathematical function, which means that I pass an index to the C++ function to select which single mathematical function it should evaluate. E.g. let's say I have a mathematical function like this:
f = Vector(x^2*y^2 / y^2 / x^2*z^2)
I would implement it like that:
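double myFunc(int function_index)
{
switch(function_index)
{
case 1: // x^2 * y^2
return PNT[0]*PNT[0]*PNT[1]*PNT[1];
case 2: // y^2
return PNT[1]*PNT[1];
case 3: // x^2 * z^2
return PNT[0]*PNT[0]*PNT[2]*PNT[2];
default: // unknown index; avoids falling off the end without a return value
return 0.0;
}
}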
where PNT is defined globally like this: double PNT[ NUM_COORDINATES ]. Now I want to implement the derivatives of each function with respect to each coordinate, thus generating the derivative matrix (columns = coordinates; rows = single functions). I have already written my kernel, which works so far and which calls myFunc().
The Problem is: For calculating the derivative of the mathematical sub-function i concerning coordinate j, I would use in sequential mode (on CPUs e.g.) the following code (whereas this is simplified because usually you would decrease h until you reach a certain precision of your derivative):
f0 = myFunc(i);
PNT[ j ] += h;
derivative = (myFunc(i)-f0)/h;
PNT[ j ] -= h;
Now that I want to do this on the GPU in parallel, the question is what to do with PNT. I have to increase certain coordinates by h, calculate the value, and then decrease them again. How can I do this without 'disturbing' the other threads? I can't modify PNT because other threads need the 'original' point to modify their own coordinate.
The second idea I had was to save one modified point for each thread but I discarded this idea quite fast because when using some thousand threads in parallel, this is a quite bad and probably slow (perhaps not realizable at all because of memory limits) idea.
'FINAL' SOLUTION
So how I do it currently is the following, which adds the value 'add' at runtime (without storing it anywhere) via a preprocessor macro to the coordinate identified by coordinate_index.
#define X(n) ((coordinate_index == (n)) ? (PNT[(n)]+add) : PNT[(n)])
__device__ double myFunc(int function_index, int coordinate_index, double add)
{
//*// Example: f[i] = x[i]^3
return (X(function_index)*X(function_index)*X(function_index));
// */
}
That works quite nicely and fast. For a derivative matrix with 10000 functions and 10000 coordinates, it takes only about 0.5 seconds. PNT is defined either globally or as constant memory like __constant__ double PNT[ NUM_COORDINATES ];, depending on the preprocessor variable USE_CONST.
The line return (X(function_index)*X(function_index)*X(function_index)); is just an example in which every sub-function follows the same scheme; mathematically:
f = Vector(x0^3 / x1^3 / ... / xN^3)
NOW THE BIG PROBLEM ARISES:
myFunc is a mathematical function which the user should be able to implement as he likes to. E.g. he could also implement the following mathematical function:
f = Vector(x0^2*x1^2*...*xN^2 / x0^2*x1^2*...*xN^2 / ... / x0^2*x1^2*...*xN^2)
thus every function looks the same. As the programmer, you should only have to write the code once, independent of the implemented mathematical function. So when the above function is implemented in C++, it looks like the following:
__device__ double myFunc(int function_index, int coordinate_index, double add)
{
double ret = 1.0;
for(int i = 0; i < NUM_COORDINATES; i++)
ret *= X(i)*X(i);
return ret;
}
And now the memory accesses are very 'weird' and bad for performance because each thread needs access to each element of PNT twice. Surely, in such a case where each function looks the same, I could rewrite the complete algorithm which surrounds the calls to myFunc, but as I stated already: I don't want to code depending on the user-implemented function myFunc...
Could anybody come up with an idea how to solve this problem??
Thanks!
Rewinding back to the beginning and starting with a clean sheet, it seems you want to be able to do two things:
compute an arbitrary scalar valued function over an input array
approximate the partial derivative of an arbitrary scalar valued function over the input array using first order accurate finite differencing
While the function is scalar valued and arbitrary, it seems that there are, in fact, two clear forms which this function can take:
A scalar valued function with scalar arguments
A scalar valued function with vector arguments
You appear to have started with the first type of function and have put together code to deal with computing both the function and the approximate derivative, and are now wrestling with the problem of how to deal with the second case using the same code.
If this is a reasonable summary of the problem, then please indicate so in a comment and I will continue to expand it with some code samples and concepts. If it isn't, I will delete it in a few days.
In comments, I have been trying to suggest that conflating the first type of function with the second is not a good approach. The requirements for correctness in parallel execution, and the best way of extracting parallelism and performance on the GPU are very different. You would be better served by treating both types of functions separately in two different code frameworks with different usage models. When a given mathematical expression needs to be implemented, the "user" should make a basic classification as to whether that expression is like the model of the first type of function, or the second. The act of classification is what drives algorithmic selection in your code. This type of "classification by algorithm" is almost universal in well designed libraries - you can find it in C++ template libraries like Boost and the STL, and you can find it in legacy Fortran codes like the BLAS.

Create a Hash Table with two arrays

Does anyone know how to do this and what the pseudo code would look like?
As we all know, a hash table stores key-value pairs, and when a key is looked up, the function returns the value associated with that key. What I want to do is understand the underlying structure in creating that mapping function. For example, if we lived in a world where there were no previously defined functions except for arrays, how could we replicate the hash maps that we have today?
Actually, some of today's hash map implementations are indeed made out of arrays as you propose. Let me sketch how this works:
Hash Function
A hash function transforms your keys into an index for the first array (array K). A hash function such as MD5 or a simpler one, usually including a modulo operator, can be used for this.
Buckets
A simple array-based hash map implementation could use buckets to cope with collisions. Each element ('bucket') in array K itself contains an array (array P) of pairs. When adding or querying for an element, the hash function points you to the correct bucket in K, which contains your desired array P. You then iterate over the elements in P until you find a matching key, or you append a new element at the end of P.
Mapping keys to buckets using the Hash
You should make sure that the number of buckets (i.e. the size of K) is a power of 2, let's say 2^b. To find the correct bucket index for some key, compute Hash(key) but keep only b of its bits, for example the lowest b bits (Hash(key) mod 2^b). Interpreted as an integer, this is your index.
Rescaling
Computing the hash of a key and finding the right bucket is very quick. But once a bucket becomes fuller, you will have to iterate more and more items before you get to the right one. So it is important to have enough buckets to properly distribute the objects, or your Hashmap will become slow.
Because you generally don't know in advance how many objects you will want to store in the hash map, it is desirable to dynamically grow or shrink the map. You can keep a count of the number of objects stored, and once it goes over a certain threshold you recreate the entire structure, but this time with a larger or smaller size for array K. In this way some of the buckets in K that were very full will now have their elements divided among several buckets, so that performance will be better.
Alternatives
You may also use a two-dimensional array instead of an array-of-arrays, or you may exchange array P for a linked list. Furthermore, instead of keeping a total count of stored objects, you may simply choose to recreate (i.e. rescale) the hashmap once one of the buckets contains more than some configured number of items.
A variation of what you are asking is described as 'array hash table' in the Hash table Wikipedia entry.
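To make the description concrete, here is a minimal C++ sketch of the bucket scheme, with std::vector standing in for both array K and the per-bucket arrays P. The class name BucketMap, its member names, the fixed std::string to int mapping, and the growth threshold of two entries per bucket on average are all assumptions chosen for illustration; it only grows, never shrinks, and uses std::hash as the hash function.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal map from std::string to int built on arrays of buckets.
class BucketMap {
public:
    // Starts with 2^bits buckets, so the bucket count is always a power of two.
    explicit BucketMap(std::size_t bits = 3) : buckets_(std::size_t{1} << bits) {}

    void put(const std::string& key, int value) {
        auto& bucket = buckets_[index_of(key, buckets_.size())];
        for (auto& kv : bucket)
            if (kv.first == key) { kv.second = value; return; }  // overwrite existing key
        bucket.emplace_back(key, value);                          // append to array P
        if (++count_ > 2 * buckets_.size()) rescale();            // arbitrary threshold
    }

    // Returns a pointer to the stored value, or nullptr when the key is absent.
    const int* get(const std::string& key) const {
        const auto& bucket = buckets_[index_of(key, buckets_.size())];
        for (const auto& kv : bucket)
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    std::size_t bucket_count() const { return buckets_.size(); }

private:
    using Bucket = std::vector<std::pair<std::string, int>>;  // array P

    // Keeps the lowest b bits of the hash; n must be a power of two (2^b).
    static std::size_t index_of(const std::string& key, std::size_t n) {
        return std::hash<std::string>{}(key) & (n - 1);
    }

    // Doubles array K and redistributes every stored pair.
    void rescale() {
        std::vector<Bucket> bigger(buckets_.size() * 2);
        for (const auto& bucket : buckets_)
            for (const auto& kv : bucket)
                bigger[index_of(kv.first, bigger.size())].push_back(kv);
        buckets_.swap(bigger);
    }

    std::vector<Bucket> buckets_;  // array K
    std::size_t count_ = 0;        // number of stored pairs
};
```

A pointer return from get keeps the sketch short; a production container would more likely return an iterator or an optional value.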
Code
For code samples, take a look here.
Hope this helps.
Could you be more precise? Does one array contain the keys, the other one the values?
If so, here is an example in Java (though little here is specific to that language):
for (int i = 0; i < keysArray.length; i++) {
map.put(keysArray[i], valuesArray[i]);
}
Of course, you will have to instantiate your map object (if you are using Java, I suggest using a HashMap<Object, Object> instead of the obsolete Hashtable), and also check your arrays to avoid null objects and to make sure they have the same size.
Sample Explanation:
The source linked below basically does two things:
1. Map Representation
An array of X lists
Choosing X as a power of two (2^N) is bad. (2^N)-1, (2^N)+1, or a prime number is good.
Example:
List myhashmap [hash_table_size];
// an array of (short) lists
// if its long lists, then there are more collisions
NOTE: this is array of arrays, not two arrays (I can't see a possible generic hashmap, in a good way with just 2 arrays)
If you know Algorithms > Graph theory > Adjacency list, this looks exactly the same.
2. Hash function
And the hash function converts string (input) to a number (hash value), which is index of an array
initialize the hash value to first char (after converted to int)
for each further char, left shift 4 bits, then add char (after converted to int)
Example:
// unsigned, so that the shifts wrap around instead of overflowing a signed int
unsigned int hash = input[0];
for (int i = 1; i < input.length(); i++) {
hash = (hash << 4) + input[i];
}
hash = hash % list.size();
// list.size() here represents 1st dimension of (list of lists)
// that is 1st dimension size of our map representation from point #1
// which is hash_table_size
See at the first link:
int HTable::hash (char const * str) const
Source:
http://www.relisoft.com/book/lang/pointer/8hash.html
How does a hash table work?
Update
This is the Best source: http://algs4.cs.princeton.edu/34hash/
You mean like this?
The following is using Ruby's irb as an illustration:
cities = ["LA", "SF", "NY"]
=> ["LA", "SF", "NY"]
items = ["Big Mac", "Hot Fudge Sundae"]
=> ["Big Mac", "Hot Fudge Sundae"]
price = {}
=> {}
price[[cities[0], items[1]]] = 1.29
=> 1.29
price
=> {["LA", "Hot Fudge Sundae"]=>1.29}
price[[cities[0], items[0]]] = 2.49
=> 2.49
price[[cities[1], items[0]]] = 2.99
=> 2.99
price
=> {["LA", "Hot Fudge Sundae"]=>1.29, ["LA", "Big Mac"]=>2.49, ["SF", "Big Mac"]=>2.99}
price[["LA", "Big Mac"]]
=> 2.49
