Pushing element at back of vec in armadillo - vector

How can I push an element at the end of vector in vec of armadillo? I am performing adding and removing an element in a sorted list in a loop. This is very expensive thing. The way I am currently doing in case of removing an element from a vec x to vec x_curr as:
x_curr = x(find(x != element))
However its not trivial in case of adding an element in loop.
x_curr = x; x_curr << element; x_curr = sort(x_curr);
This not correct. In addition not very efficient. What would be most efficient way to do this in armadillo. Any other STL library solution. I am using this in Rcpp armadillo. I can perhaps sorting every loop. x_curr is used to store of indices of column of arma::mat i.e. I am going to use it as mat.col(x_curr).

I don't understand your question.
Armadillo is a math library, so it operates on vectors. If you do not know your size, you could allocate a guessed N elements and resize in the common 'times two' idiom as needed, and shrink at the end. If you know the size, well then you have no problem.
The STL has the so-called generic containers and algorithms, but it does not do linear algebra. You need to figure out what you need most, and plan your implementation accordingly.

I am not sure that I understood what you want to do,
but if you want to append an element at the end of your vector,
you can do it like this:
int sz = yourvector.size();
yourvector.resize(sz+1);
yourvector(sz) = element;

Related

Usage of AVX512-CD

I am currently working with the KNL and try to understand the new opportunities of AVX512. Besides the extended register side, AVX512 comes along with new instruction sets. The conflict detection seems to be promising.
The intrinsic
_mm512_conflict_epi32(...)
creates a vector register, containing a conflict free subset of the given source register:
As one can see, the first appearence of a value results in a 0 at the corresponding position within the result vector. If the value is present multiple times, the result register holds a zero-extended value.
So far so good! BUT I wonder how one can utilize this result for further aggregations or computations. I read that one could use it along side a leading zeros count, but I don't think that is should be enough to determine the values of the subsets.
Does anyone know how one can utilize this result?
Sincerely
Now I understand that your question is how to utilize results from VPCONFLICTD/Q to build subsets for further aggregations or computations ...
Using your own example:
conflict_input =
[
00000001|00000001|00000001|00000001|
00000002|00000002|00000002|00000002|
00000002|00000002|00000001|00000001|
00000001|00000001|00000001|00000001
]
Applying VPCONFLICTD:
__m512i out = _mm512_conflict_epi32(in);
Now we get:
conflict_output =
[
00000000|00000001|00000003|00000007|
00000000|00000010|00000030|00000070|
000000f0|000001f0|0000000f|0000040f|
00000c0f|00001c0f|00003c0f|00007c0f
]
bit representation =
[
................|...............1|..............11|.............111|
................|...........1....|..........11....|.........111....|
........1111....|.......11111....|............1111|.....1......1111|
....11......1111|...111......1111|..1111......1111|.11111......1111
]
If you wish to get a mask based on first appearance of non-repeating value
const __m512i set1 = _mm512_set1_epi32(0xFFFFFFFF);
const __mmask16 mask = _mm512_testn_epi32_mask(out, set1);
Now you can do all the usual stuff with the mmask16
[1000100000000000]
you can also compress it:
const __m512i out3 = _mm512_mask_compress_epi32(set0, mask, in);
[00000001|00000002|00000000|00000000|
00000000|00000000|00000000|00000000|
00000000|00000000|00000000|00000000|
00000000|00000000|00000000|00000000]
There are lots of things you can do with the mask; However, I noticed interestingly the vplzcntd and don't know where I can use it:
const __m512i out1 = _mm512_conflict_epi32(in);
const __m512i out2 = _mm512_lzcnt_epi32(out1);
output2 = [
00000020|0000001f|0000001e|0000001d|
00000020|0000001b|0000001a|00000019|
00000018|00000017|0000001c|00000015|
00000014|00000013|00000012|00000011
]
= [
..........1.....|...........11111|...........1111.|...........111.1|
..........1.....|...........11.11|...........11.1.|...........11..1|
...........11...|...........1.111|...........111..|...........1.1.1|
...........1.1..|...........1..11|...........1..1.|...........1...1
]
See also some AVX512 histogram links and info I dug up a while ago in this answer.
I think the basic idea is to scatter the conflict-free set of elements, then re-gather, re-process, and re-scatter the next conflict-free set of elements. Repeat until there are no more conflicts.
Note that the first appearance of a repeated index is a "conflict-free" element, according to vpconflictd, so a simple repeat loop makes forward progress.
Steps in this process:
Turn a vpconflictd result into a mask that you can use with a gather instruction: _mm512_testn_epi32_mask (as suggested by #veritas) against a vector of all-ones looks good for this, since you need to invert it. You can't just test it against itself.
Remove the already-done elements: vpcompressd is probably good for this. We can even fill up the "empty" spaces in our vector with new elements, so we don't re-run the gather / process / scatter loop with most of the elements masked.
For example, this might work as a histogram loop, if I'm doing this right:
// probably slow, since it assumes conflicts and has a long loop-carried dep chain
// TOTALLY untested.
__m512i all_ones = _mm512_set1_epi32(-1); // easy to gen on the fly (vpternlogd)
__m512i indices = _mm512_loadu_si512(p);
p += 16;
// pessimistic loop that assumes conflicts
while (p < endp) {
// unmasked gather, so it can run in parallel with conflict detection
__m512i v = _mm512_i32gather_epi32(indices, base, 4);
v = _mm512_sub_epi32(gather, all_ones); // -= -1 to reuse the constant.
// scatter the no-conflict elements
__m512i conflicts = _mm512_conflict_epi32(indices);
__mmask16 knoconflict = _mm512_testn_epi32_mask(conflicts, all_ones);
_mm512_mask_i32scatter_epi32(base, knoconflict, indices, v, 4);
// if(knoconflict == 0xffff) { goto optimistic_loop; }
// keep the conflicting elements and merge in new indices to refill the vector
size_t done = _popcnt32(knoconflict);
p += done; // the elements that overlap will be replaced with the conflicts from last time
__m512i newidx = _mm512_loadu_si512(p);
// merge-mask into the bottom of the newly-loaded index vector
indices = _mm512_mask_compress_epi32(newidx, ~knoconflict, indices);
}
We end up needing the mask both ways (knoconflict and ~knoconflict). It might be best to use _mm512_test_epi32_mask(same,same) and avoid the need for a vector constant to testn against. That might shorten the loop-carried dependency chain from indices in mask_compress, by putting the inversion of the mask onto the scatter dependency chain. When there are no conflicts (including between iterations), the scatter is independent.
If conflicts are rare, it's probably better to branch on it. This branchless handling of conflicts is a bit like using cmov in a loop: it creates a long loop-carried dependency chain.
Branch prediction + speculative execution would break those chains, and allow multiple gathers / scatters to be in flight at once. (And avoid running popcnt / vpcompressd at all when the are no conflicts).
Also note that vpconflictd is slow-ish on Skylake-avx512 (but not on KNL). When you expect conflicts to be very rare, you might even use a fast any_conflicts() check that doesn't find out where they are before running the conflict-handling.
See Fallback implementation for conflict detection in AVX2 for a ymm AVX2 implementation, which should be faster than Skylake-AVX512's micro-coded vpconflictd ymm. Expanding it to 512b zmm vectors shouldn't be difficult (and might be even more efficient if you can take advantage of AVX512 masked-compare into mask to replace a boolean operation between two compare results). Maybe with AVX512 vpcmpud k0{k1}, zmm0, zmm1 with a NEQ predicate.

fast apply_along_axis equivalent in Julia

Is there an equivalent to numpy's apply_along_axis() (or R's apply())in Julia? I've got a 3D array and I would like to apply a custom function to each pair of co-ordinates of dimensions 1 and 2. The results should be in a 2D array.
Obviously, I could do two nested for loops iterating over the first and second dimension and then reshape, but I'm worried about performance.
This Example produces the output I desire (I am aware this is slightly pointless for sum(). It's just a dummy here:
test = reshape(collect(1:250), 5, 10, 5)
a=[]
for(i in 1:5)
for(j in 1:10)
push!(a,sum(test[i,j,:]))
end
end
println(reshape(a, 5,10))
Any suggestions for a faster version?
Cheers
Julia has the mapslices function which should do exactly what you want. But keep in mind that Julia is different from other languages you might know: library functions are not necessarily faster than your own code, because they may be written to a level of generality higher than what you actually need, and in Julia loops are fast. So it's quite likely that just writing out the loops will be faster.
That said, a couple of tips:
Read the performance tips section of the manual. From that you'd learn to put everything in a function, and to not use untyped arrays like a = [].
The slice or sub function can avoid making a copy of the data.
How about
f = sum # your function here
Int[f(test[i, j, :]) for i in 1:5, j in 1:10]
The last line is a two-dimensional array comprehension.
The Int in front is to guarantee the type of the elements; this should not be necessary if the comprehension is inside a function.
Note that you should (almost) never use untyped (Any) arrays, like your a = [], since this will be slow. You can write a = Int[] instead to create an empty array of Ints.
EDIT: Note that in Julia, loops are fast. The need for creating functions like that in Python and R comes from the inherent slowness of loops in those languages. In Julia it's much more common to just write out the loop.

Comparing a c++ std::vector's elements with each other

I have a std::vector of double values. Now I need to know if two succeeding elements are within a certain distance in order to process them. I have sorted the vector previously with std::sort.
I have thought about a solution to determine these elements with std::find_if and a lambda expression (c++11), like this:
std::vector<std::vector<double>::iterator> foundElements;
std::vector<double>::iterator foundElement;
while (foundElement != gradients.end())
{
foundElement = std::find_if(gradients.begin(), gradients.end(),
[] (double grad)->bool{
return ...
});
foundElements.push_back(foundElement);
}
But what should the predicate actually return? The plan is that I use the vector of iterators to later modify the vector.
Am I on the right track with this approach or is it too complicated/impossible? What are other, propably more practical solutions?
EDIT: I think I will go after the std::adjacent_find function, as proposed by one answerer.
Read about std::adjacent_find.
Can you enhance the grammar of your question?
"I have a std::vector with different double values. Now I need to know if two succeeding ones (I have sorted the vector previously with std::sort) are within a certain distance to process them)."
Do you imply that each element of a vector of type double is a unique value? If that's the case, can it be reasonably inferred that your goal to find the distance between each of these elements?

Removing any element from an associative array

I'd like to remove an(y) element from an associative array and process it.
Currently I'm using a RedBlackTree together with .removeAny(), but I don't need the data to be in any order. I could use .byKey() on the AA, but that always produces an array with all keys. I only need one at a time and will probably change the AA while processing every other element. Is there any other smart way to get exactly one key without (internally) traversing the whole data structure?
There is a workaround, which works as well as using .byKeys():
auto anyKey(K, V)(inout ref V[K] aa)
{
foreach (K k, ref inout(V) v; aa)
return k;
assert(0, "Associative array hasn't any keys.");
}
For my needs, .byKeys().front seems to be fast enough though. Not sure if the workaround is actually faster.

New to OCaml: How would I go about implementing Gaussian Elimination?

I'm new to OCaml, and I'd like to implement Gaussian Elimination as an exercise. I can easily do it with a stateful algorithm, meaning keep a matrix in memory and recursively operating on it by passing around a reference to it.
This statefulness, however, smacks of imperative programming. I know there are capabilities in OCaml to do this, but I'd like to ask if there is some clever functional way I haven't thought of first.
OCaml arrays are mutable, and it's hard to avoid treating them just like arrays in an imperative language.
Haskell has immutable arrays, but from my (limited) experience with Haskell, you end up switching to monadic, mutable arrays in most cases. Immutable arrays are probably amazing for certain specific purposes. I've always imagined you could write a beautiful implementation of dynamic programming in Haskell, where the dependencies among array entries are defined entirely by the expressions in them. The key is that you really only need to specify the contents of each array entry one time. I don't think Gaussian elimination follows this pattern, and so it seems it might not be a good fit for immutable arrays. It would be interesting to see how it works out, however.
You can use a Map to emulate a matrix. The key would be a pair of integers referencing the row and column. You'll want to use your own get x y function to ensure x < n and y < n though, instead of accessing the Map directly. (edit) You can use the compare function in Pervasives directly.
module OrderedPairs = struct
type t = int * int
let compare = Pervasives.compare
end
module Pairs = Map.Make (OrderedPairs)
let get_ n set x y =
assert( x < n && y < n );
Pairs.find (x,y) set
let set_ n set x y v =
assert( x < n && y < n );
Pairs.add (x,y) set v
Actually, having a general set of functions (get x y and set x y at a minimum), without specifying the implementation, would be an even better option. The functions then can be passed to the function, or be implemented in a module through a functor (a better solution, but having a set of functions just doing what you need would be a first step since you're new to OCaml). In this way you can use a Map, Array, Hashtbl, or a set of functions to access a file on the hard-drive to implement the matrix if you wanted. This is the really important aspect of functional programming; that you trust the interface over exploiting the side-effects, and not worry about the underlying implementation --since it's presumed to be pure.
The answers so far are using/emulating mutable data-types, but what does a functional approach look like?
To see, let's decompose the problem into some functional components:
Gaussian elimination involves a sequence of row operations, so it is useful first to define a function taking 2 rows and scaling factors, and returning the resultant row operation result.
The row operations we want should eliminate a variable (column) from a particular row, so lets define a function which takes a pair of rows and a column index and uses the previously defined row operation to return the modified row with that column entry zero.
Then we define two functions, one to convert a matrix into triangular form, and another to back-substitute a triangular matrix to the diagonal form (using the previously defined functions) by eliminating each column in turn. We could iterate or recurse over the columns, and the matrix could be defined as a list, vector or array of lists, vectors or arrays. The input is not changed, but a modified matrix is returned, so we can finally do:
let out_matrix = to_diagonal (to_triangular in_matrix);
What makes it functional is not whether the data-types (array or list) are mutable, but how they they are used. This approach may not be particularly 'clever' or be the most efficient way to do Gaussian eliminations in OCaml, but using pure functions lets you express the algorithm cleanly.

Resources