I am currently working with the KNL and trying to understand the new opportunities of AVX512. Besides the wider registers, AVX512 comes with new instruction sets; conflict detection seems promising.
The intrinsic
_mm512_conflict_epi32(...)
creates a vector register containing, for each element of the given source register, its conflict information:
The first appearance of a value results in a 0 at the corresponding position within the result vector. If the value occurred earlier in the register, the result element holds a zero-extended bitmask marking those earlier positions.
So far so good! BUT I wonder how one can utilize this result for further aggregations or computations. I read that one could use it alongside a leading-zero count, but I don't see how that alone is enough to determine the values of the subsets.
Does anyone know how one can utilize this result?
Sincerely
Now I understand that your question is how to utilize results from VPCONFLICTD/Q to build subsets for further aggregations or computations ...
Using your own example:
conflict_input =
[
00000001|00000001|00000001|00000001|
00000002|00000002|00000002|00000002|
00000002|00000002|00000001|00000001|
00000001|00000001|00000001|00000001
]
Applying VPCONFLICTD:
__m512i out = _mm512_conflict_epi32(in);
Now we get:
conflict_output =
[
00000000|00000001|00000003|00000007|
00000000|00000010|00000030|00000070|
000000f0|000001f0|0000000f|0000040f|
00000c0f|00001c0f|00003c0f|00007c0f
]
bit representation =
[
................|...............1|..............11|.............111|
................|...........1....|..........11....|.........111....|
........1111....|.......11111....|............1111|.....1......1111|
....11......1111|...111......1111|..1111......1111|.11111......1111
]
If you wish to get a mask marking the first appearance of each distinct value:
const __m512i set1 = _mm512_set1_epi32(0xFFFFFFFF);
const __mmask16 mask = _mm512_testn_epi32_mask(out, set1);
Now you can do all the usual stuff with the mmask16
[1000100000000000]
you can also compress it:
const __m512i set0 = _mm512_setzero_si512(); // assumed: zeros for the masked-off lanes
const __m512i out3 = _mm512_mask_compress_epi32(set0, mask, in);
[00000001|00000002|00000000|00000000|
00000000|00000000|00000000|00000000|
00000000|00000000|00000000|00000000|
00000000|00000000|00000000|00000000]
There are lots of things you can do with the mask. However, I find vplzcntd interesting and don't yet know where to use it:
const __m512i out1 = _mm512_conflict_epi32(in);
const __m512i out2 = _mm512_lzcnt_epi32(out1);
output2 = [
00000020|0000001f|0000001e|0000001d|
00000020|0000001b|0000001a|00000019|
00000018|00000017|0000001c|00000015|
00000014|00000013|00000012|00000011
]
= [
..........1.....|...........11111|...........1111.|...........111.1|
..........1.....|...........11.11|...........11.1.|...........11..1|
...........11...|...........1.111|...........111..|...........1.1.1|
...........1.1..|...........1..11|...........1..1.|...........1...1
]
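A hedged sketch of one possible use: 31 - lzcnt(conflict_word) gives, per lane, the index of the highest set conflict bit, i.e. the closest preceding lane holding the same value, or -1 for first occurrences, which is the shape of permutation index that vpermd-based in-register conflict resolution wants:

#include <immintrin.h>

// Hedged sketch: per lane, the index of the closest preceding lane with the
// same value, or -1 if the lane is a first occurrence (lzcnt(0) == 32).
__m512i nearest_prev_duplicate(__m512i conflicts) {
    __m512i lz = _mm512_lzcnt_epi32(conflicts);
    return _mm512_sub_epi32(_mm512_set1_epi32(31), lz);
}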
See also some AVX512 histogram links and info I dug up a while ago in this answer.
I think the basic idea is to scatter the conflict-free set of elements, then re-gather, re-process, and re-scatter the next conflict-free set of elements. Repeat until there are no more conflicts.
Note that the first appearance of a repeated index is a "conflict-free" element, according to vpconflictd, so a simple repeat loop makes forward progress.
Steps in this process:
Turn a vpconflictd result into a mask that you can use with a gather instruction: _mm512_testn_epi32_mask (as suggested by @veritas) against a vector of all-ones looks good for this, since you need to invert it. You can't just test it against itself.
Remove the already-done elements: vpcompressd is probably good for this. We can even fill up the "empty" spaces in our vector with new elements, so we don't re-run the gather / process / scatter loop with most of the elements masked.
For example, this might work as a histogram loop, if I'm doing this right:
// probably slow, since it assumes conflicts and has a long loop-carried dep chain
// TOTALLY untested.
__m512i all_ones = _mm512_set1_epi32(-1);  // easy to gen on the fly (vpternlogd)
__m512i indices = _mm512_loadu_si512(p);
p += 16;

// pessimistic loop that assumes conflicts
while (p < endp) {
    // unmasked gather, so it can run in parallel with conflict detection
    __m512i v = _mm512_i32gather_epi32(indices, base, 4);
    v = _mm512_sub_epi32(v, all_ones);   // -= -1 to reuse the constant.

    // scatter the no-conflict elements
    __m512i conflicts = _mm512_conflict_epi32(indices);
    __mmask16 knoconflict = _mm512_testn_epi32_mask(conflicts, all_ones);
    _mm512_mask_i32scatter_epi32(base, knoconflict, indices, v, 4);
    // if(knoconflict == 0xffff) { goto optimistic_loop; }

    // keep the conflicting elements and merge in new indices to refill the vector
    size_t done = _popcnt32(knoconflict);
    p += done;   // the elements that overlap will be replaced with the conflicts from last time
    __m512i newidx = _mm512_loadu_si512(p);
    // merge-mask into the bottom of the newly-loaded index vector
    indices = _mm512_mask_compress_epi32(newidx, ~knoconflict, indices);
}
We end up needing the mask both ways (knoconflict and ~knoconflict). It might be best to use _mm512_test_epi32_mask(same,same) and avoid the need for a vector constant to testn against. That might shorten the loop-carried dependency chain from indices in mask_compress, by putting the inversion of the mask onto the scatter dependency chain. When there are no conflicts (including between iterations), the scatter is independent.
If conflicts are rare, it's probably better to branch on it. This branchless handling of conflicts is a bit like using cmov in a loop: it creates a long loop-carried dependency chain.
Branch prediction + speculative execution would break those chains, and allow multiple gathers / scatters to be in flight at once. (And avoid running popcnt / vpcompressd at all when there are no conflicts).
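For illustration, a hedged sketch of that branchy variant (handle_conflicts is a hypothetical slow-path helper; base/indices/v match the loop above):

#include <immintrin.h>

void handle_conflicts(int* base, __m512i idx, __m512i v, __mmask16 k);  // hypothetical

// Hedged sketch: scatter unmasked in the conflict-free common case, and only
// take the slow path when vpconflictd reports any conflict at all.
void scatter_or_fallback(int* base, __m512i indices, __m512i v) {
    __m512i conflicts = _mm512_conflict_epi32(indices);
    // test each conflict word against itself: bit set iff the word is nonzero
    __mmask16 kconflict = _mm512_test_epi32_mask(conflicts, conflicts);
    if (kconflict == 0) {
        _mm512_i32scatter_epi32(base, indices, v, 4);   // fast common case
    } else {
        handle_conflicts(base, indices, v, kconflict);  // hypothetical slow path
    }
}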
Also note that vpconflictd is slow-ish on Skylake-avx512 (but not on KNL). When you expect conflicts to be very rare, you might even use a fast any_conflicts() check that doesn't find out where they are before running the conflict-handling.
See Fallback implementation for conflict detection in AVX2 for a ymm AVX2 implementation, which should be faster than Skylake-AVX512's micro-coded vpconflictd ymm. Expanding it to 512b zmm vectors shouldn't be difficult (and might be even more efficient if you can take advantage of AVX512 masked-compare into mask to replace a boolean operation between two compare results). Maybe with AVX512 vpcmpud k0{k1}, zmm0, zmm1 with a NEQ predicate.
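For instance, a hedged zmm sketch of that rotate-and-compare check: circular distances 1..8 cover every pair of the 16 lanes, so eight valignd rotations and compares detect whether any duplicate exists without finding out where it is:

#include <immintrin.h>

// Hedged sketch: true if any two lanes of idx hold the same value.
// Rotating by 1..8 lanes compares every pair at least once.
static inline bool any_conflicts(__m512i idx) {
    __mmask16 k = 0;
    __m512i rot = idx;
    for (int d = 1; d <= 8; d++) {
        rot = _mm512_alignr_epi32(rot, rot, 1);   // rotate one more lane
        k |= _mm512_cmpeq_epi32_mask(idx, rot);
    }
    return k != 0;
}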
Related
I'm working on a practice program for doing belief propagation stereo vision. The relevant aspect here is that I have a fairly long array representing every pixel in an image, and want to carry out an operation on every second entry in the array at each iteration of a for loop - first one half of the entries, and then at the next iteration the other half (this comes from an optimisation described by Felzenszwalb & Huttenlocher in their 2006 paper 'Efficient belief propagation for early vision'). So, you could see it as having an outer for loop which runs a number of times, and for each iteration of that loop I iterate over half of the entries in the array.
I would like to parallelise the operation of iterating over the array like this, since I believe it would be thread-safe to do so, and of course potentially faster. The operation involved updates values inside the data structures representing the neighbouring pixels, which are not themselves used in a given iteration of the outer loop. Originally I just iterated over the entire array in one go, which meant that it was fairly trivial to carry this out - all I needed to do was put .Parallel between Array and .iteri. Changing to operating on every second array entry is trickier, however.
To make the change from simply iterating over every entry, I went from Array.iteri (fun i p -> ... to using for i in startIndex..2..(ArrayLength - 1) do, where startIndex is either 1 or 0 depending on which one I used last (controlled by toggling a boolean). This means, though, that I can't simply use the really nice .Parallel to make things run in parallel.
I haven't been able to find anything specific about how to implement a parallel for loop in .NET which has a step size greater than 1. The best I could find was a paragraph in an old MSDN document on parallel programming in .NET, but that paragraph only makes a vague statement about transforming an index inside a loop body. I do not understand what is meant there.
I looked at Parallel.For and Parallel.ForEach, as well as creating a custom partitioner, but none of those seemed to include options for changing the step size.
The other option that occurred to me was to use a sequence expression such as
let getOddOrEvenArrayEntries myarray oddOrEven =
    seq {
        let startingIndex =
            if oddOrEven then
                1
            else
                0
        for i in startingIndex..2..(Array.length myarray - 1) do
            yield (i, myarray.[i])
    }
and then using PSeq.iteri from ParallelSeq, but I'm not sure whether it will work correctly with .NET Core 2.2. (Note that, currently at least, I need to know the index of the given element in the array, as it is used as the index into another array during the processing).
How can I go about iterating over every second element of an array in parallel? I.e. iterating over an array using a step size greater than 1?
You could try PSeq.mapi, which provides not only the sequence item as a parameter but also its index.
Here's a small example
let res =
    nums
    |> PSeq.mapi (fun index item -> if index % 2 = 0 then item else item + 1)
You can also have a look at this sampling snippet. Just be sure to substitute Seq with PSeq.
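For completeness, the index transformation that the old MSDN paragraph alludes to is probably just this (a hedged sketch, not taken from the docs): run Parallel.For over a dense counter k and compute the strided index inside the loop body, so no step-size option is needed.

open System.Threading.Tasks

// Hedged sketch: apply body to every second entry of myarray in parallel,
// mapping the dense counter k to the strided index startIndex + 2*k.
// Pass startIndex = 0 or 1 to select which half of the entries to process.
let parallelIterStride2 startIndex (myarray: 'a[]) (body: int -> 'a -> unit) =
    let count = (myarray.Length - startIndex + 1) / 2
    Parallel.For(0, count, fun k ->
        let i = startIndex + 2 * k
        body i myarray.[i])
    |> ignore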
How can I push an element onto the end of an Armadillo vec? In a loop, I am adding and removing an element from a sorted list, which is very expensive. Currently, to remove an element from vec x into vec x_curr, I do:
x_curr = x(find(x != element))
However, adding an element in the loop is not as trivial:
x_curr = x; x_curr << element; x_curr = sort(x_curr);
This is not correct, and in addition not very efficient. What would be the most efficient way to do this in Armadillo, or with any STL solution? I am using this in RcppArmadillo. I could perhaps sort in every loop iteration. x_curr stores indices of columns of an arma::mat, i.e. I am going to use it as mat.col(x_curr).
I don't understand your question.
Armadillo is a math library, so it operates on vectors. If you do not know your size, you could allocate a guessed N elements and resize in the common 'times two' idiom as needed, and shrink at the end. If you know the size, well then you have no problem.
The STL has the so-called generic containers and algorithms, but it does not do linear algebra. You need to figure out what you need most, and plan your implementation accordingly.
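A minimal sketch of that 'times two' idiom with an Armadillo vector (hedged; the initial capacity and helper name are illustrative):

#include <armadillo>

// Hedged sketch: amortized O(1) appends by keeping a logical size n inside
// a capacity-doubling buffer, then trimming once at the end.
arma::vec grow_example(int how_many) {
    arma::vec buf(16);                // guessed initial capacity
    arma::uword n = 0;                // logical number of stored elements
    for (int i = 0; i < how_many; i++) {
        if (n == buf.n_elem) buf.resize(2 * buf.n_elem);  // grow, keeping contents
        buf(n++) = i;                 // the "push_back"
    }
    buf.resize(n);                    // shrink to the true size at the end
    return buf;
}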
I am not sure that I understood what you want to do,
but if you want to append an element at the end of your vector,
you can do it like this:
int sz = yourvector.size();
yourvector.resize(sz+1);
yourvector(sz) = element;
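If the vector must also stay sorted, a hedged sketch that splices the new element in at the position found by std::upper_bound, avoiding the full re-sort (Armadillo vectors expose STL-style begin()/end() iterators):

#include <armadillo>
#include <algorithm>

// Hedged sketch: return a copy of sorted vector x with 'element' inserted at
// its correct position - O(n) copying instead of an O(n log n) re-sort.
arma::vec insert_sorted(const arma::vec& x, double element) {
    arma::uword k = std::upper_bound(x.begin(), x.end(), element) - x.begin();
    arma::vec out(x.n_elem + 1);
    if (k > 0)        out.subvec(0, k - 1) = x.subvec(0, k - 1);  // head
    out(k) = element;                                             // new element
    if (k < x.n_elem) out.subvec(k + 1, x.n_elem) = x.subvec(k, x.n_elem - 1);  // tail
    return out;
}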
I've been watching this MSDN video with Brian Beckman and I'd like to better understand something he says:
Every imperative programmer goes through this phase of learning that
functions can be replaced with table lookups
Now, I'm a C# programmer who never went to university, so perhaps somewhere along the line I missed out on something everyone else learned to understand.
What does Brian mean by:
functions can be replaced with table lookups
Are there practical examples of this being done and does it apply to all functions? He gives the example of the sin function, which I can make sense of, but how do I make sense of this in more general terms?
Brian just showed that functions are data too. A function in general is just a mapping of one set to another: y = f(x) is a mapping of set {x} to set {y}: f: X -> Y. Tables are mappings as well: [x1, x2, ..., xn] -> [y1, y2, ..., yn].
If a function operates on a finite set (which is the case in programming), it can be replaced with a table that represents the mapping. As Brian mentioned, every imperative programmer goes through this phase of understanding that functions can be replaced with table lookups, purely for performance reasons.
But it doesn't mean that all functions easily can or should be replaced with tables; it only means that you theoretically can do that for every function. So the conclusion would be that functions are data, because tables are (in the context of programming, of course).
There is a lovely trick in Mathematica that creates a table as a side-effect of evaluating function-calls-as-rewrite-rules. Consider the classic slow-fibonacci
fib[1] = 1
fib[2] = 1
fib[n_] := fib[n-1] + fib[n-2]
The first two lines create table entries for the inputs 1 and 2. This is exactly the same as saying
fibTable = {};
fibTable[1] = 1;
fibTable[2] = 1;
in JavaScript. The third line of Mathematica says "please install a rewrite rule that will replace any occurrence of fib[n_], after substituting the pattern variable n_ with the actual argument of the occurrence, with fib[n-1] + fib[n-2]." The rewriter will iterate this procedure, and eventually produce the value of fib[n] after an exponential number of rewrites. This is just like the recursive function-call form that we get in JavaScript with
function fib(n) {
    var result = fibTable[n] || ( fib(n-1) + fib(n-2) );
    return result;
}
Notice it checks the table first for the two values we have explicitly stored before making the recursive calls. The Mathematica evaluator does this check automatically, because the order of presentation of the rules is important -- Mathematica checks the more specific rules first and the more general rules later. That's why Mathematica has two assignment forms, = and :=: the former is for specific rules whose right-hand sides can be evaluated at the time the rule is defined; the latter is for general rules whose right-hand sides must be evaluated when the rule is applied.
Now, in Mathematica, if we say
fib[4]
it gets rewritten to
fib[3] + fib[2]
then to
fib[2] + fib[1] + 1
then to
1 + 1 + 1
and finally to 3, which does not change on the next rewrite. You can imagine that if we say fib[35], we will generate enormous expressions, fill up memory, and melt the CPU. But the trick is to replace the final rewrite rule with the following:
fib[n_] := fib[n] = fib[n-1] + fib[n-2]
This says "please replace every occurrence of fib[n_] with an expression that will install a new specific rule for the value of fib[n] and also produce the value." This one runs much faster because it expands the rule-base -- the table of values! -- at run time.
We can do likewise in JavaScript
function fib(n) {
    var result = fibTable[n] || ( fib(n-1) + fib(n-2) );
    fibTable[n] = result;
    return result;
}
This runs MUCH faster than the prior definition of fib.
This is called "automemoization" [sic -- not "memorization" but "memoization" as in creating a memo for yourself].
Of course, in the real world, you must manage the sizes of the tables that get created. To inspect the tables in Mathematica, do
DownValues[fib]
To inspect them in JavaScript, do just
fibTable
in a REPL such as that supported by Node.JS.
In the context of functional programming, there is the concept of referential transparency. A function that is referentially transparent can be replaced with its value for any given argument (or set of arguments), without changing the behaviour of the program.
Referential Transparency
For example, consider a function F that takes 1 argument, n. F is referentially transparent, so F(n) can be replaced with the value of F evaluated at n. It makes no difference to the program.
In C#, this would look like:
public class Square
{
    public static int apply(int n)
    {
        return n * n;
    }

    public static void Main()
    {
        //Should print 4
        Console.WriteLine(Square.apply(2));
    }
}
(I'm not very familiar with C#, coming from a Java background, so you'll have to forgive me if this example isn't quite syntactically correct).
It's obvious here that the function apply cannot have any other value than 4 when called with an argument of 2, since it's just returning the square of its argument. The value of the function only depends on its argument, n; in other words, referential transparency.
I ask you, then, what the difference is between Console.WriteLine(Square.apply(2)) and Console.WriteLine(4). The answer is, there's no difference at all, for all intents and purposes. We could go through the entire program, replacing all instances of Square.apply(n) with the value returned by Square.apply(n), and the results would be exactly the same.
So what did Brian Beckman mean with his statement about replacing function calls with a table lookup? He was referring to this property of referentially transparent functions. If Square.apply(2) can be replaced with 4 with no impact on program behaviour, then why not just cache the value when the first call is made, and put it in a table indexed by the arguments to the function? A lookup table for values of Square.apply(n) would look somewhat like this:
n: 0 1 2 3 4 5 ...
Square.apply(n): 0 1 4 9 16 25 ...
And for any call to Square.apply(n), instead of calling the function, we can simply find the cached value for n in the table, and replace the function call with this value. It's fairly obvious that this will most likely bring about a large speed increase in the program.
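A hedged C# sketch of exactly that caching scheme (class and member names are illustrative): compute n * n once per distinct n, store it keyed by the argument, and answer later calls from the table:

using System.Collections.Generic;

public static class MemoSquare
{
    // The lookup table: argument n -> cached value of n * n.
    static readonly Dictionary<int, int> table = new Dictionary<int, int>();

    public static int Apply(int n)
    {
        if (!table.TryGetValue(n, out int result))
        {
            result = n * n;       // compute once...
            table[n] = result;    // ...then remember it for future calls
        }
        return result;
    }
}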
There's a conditional debugging flag I miss from Matlab: dbstop if infnan described here. If set, this condition will stop code execution when an Inf or NaN is encountered (IIRC, Matlab doesn't have NAs).
How might I achieve this in R in a more efficient manner than testing all objects after every assignment operation?
At the moment, the only ways I see to do this are via hacks like the following:
Manually insert a test after all places where these values might be encountered (e.g. a division, where division by 0 may occur). The testing would be to use is.finite(), described in this Q & A, on every element.
Use body() to modify the code to call a separate function, after each operation or possibly just each assignment, which tests all of the objects (and possibly all objects in all environments).
Modify R's source code (?!?)
Attempt to use tracemem to identify those variables that have changed, and check only these for bad values.
(New - see note 2) Use some kind of call handlers / callbacks to invoke a test function.
The 1st option is what I am doing at present. This is tedious, because I can't guarantee I've checked everything. The 2nd option will test everything, even if an object hasn't been updated. That is a massive waste of time. The 3rd option would involve modifying assignments of NA, NaN, and infinite values (+/- Inf), so that an error is produced. That seems like it's better left to R Core. The 4th option is like the 2nd - I'd need a call to a separate function listing all of the memory locations, just to ID those that have changed, and then check the values; I'm not even sure this will work for all objects, as a program may do an in-place modification, which seems like it would not invoke the duplicate function.
Is there a better approach that I'm missing? Maybe some clever tool by Mark Bravington, Luke Tierney, or something relatively basic - something akin to an options() parameter or a flag when compiling R?
Example code Here is some very simple example code to test with, incorporating the addTaskCallback function proposed by Josh O'Brien. The code isn't interrupted, but an error does occur in the first scenario, while no error occurs in the second case (i.e. badDiv(0,0,FALSE) doesn't abort). I'm still investigating callbacks, as this looks promising.
badDiv <- function(x, y, flag){
    z = x / y
    if(flag == TRUE){
        return(z)
    } else {
        return(FALSE)
    }
}
addTaskCallback(stopOnNaNs)
badDiv(0, 0, TRUE)
addTaskCallback(stopOnNaNs)
badDiv(0, 0, FALSE)
Note 1. I'd be satisfied with a solution for standard R operations, though a lot of my calculations involve objects used via data.table or bigmemory (i.e. disk-based memory mapped matrices). These appear to have somewhat different memory behaviors than standard matrix and data.frame operations.
Note 2. The callbacks idea seems a bit more promising, as this doesn't require me to write functions that mutate R code, e.g. via the body() idea.
Note 3. I don't know whether or not there is some simple way to test the presence of non-finite values, e.g. meta information about objects that indexes where NAs, Infs, etc. are stored in the object, or if these are stored in place. So far, I've tried Simon Urbanek's inspect package, and have not found a way to divine the presence of non-numeric values.
Follow-up: Simon Urbanek has pointed out in a comment that such information is not available as meta information for objects.
Note 4. I'm still testing the ideas presented. Also, as suggested by Simon, testing for the presence of non-finite values should be fastest in C/C++; that should surpass even compiled R code, but I'm open to anything. For large datasets, e.g. on the order of 10-50GB, this should be a substantial savings over copying the data. One may get further improvements via use of multiple cores, but that's a bit more advanced.
The idea sketched below (and its implementation) is very imperfect. I'm hesitant to even suggest it, but: (a) I think it's kind of interesting, even in all of its ugliness; and (b) I can think of situations where it would be useful. Given that it sounds like you are right now manually inserting a check after each computation, I'm hopeful that your situation is one of those.
Mine is a two-step hack. First, I define a function nanDetector() which is designed to detect NaNs in several of the object types that might be returned by your calculations. Then, I use addTaskCallback() to call nanDetector() on .Last.value after each top-level task/calculation is completed. When it finds an NaN in one of those returned values, it throws an error, which you can use to avoid any further computations.
Among its shortcomings:
If you do something like setting stop(error = recover), it's hard to tell where the error was triggered, since the error is always thrown from inside of stopOnNaNs().
When it throws an error, stopOnNaNs() is terminated before it can return TRUE. As a consequence, it is removed from the task list, and you'll need to reset it with addTaskCallback(stopOnNaNs) if you want to use it again. (See the 'Arguments' section of ?addTaskCallback for more details).
Without further ado, here it is:
# Sketch of a function that tests for NaNs in several types of objects
nanDetector <- function(X) {
    # To examine data frames
    if(is.data.frame(X)) {
        return(any(unlist(sapply(X, is.nan))))
    }
    # To examine vectors, matrices, or arrays
    if(is.numeric(X)) {
        return(any(is.nan(X)))
    }
    # To examine lists, including nested lists
    if(is.list(X)) {
        return(any(rapply(X, is.nan)))
    }
    return(FALSE)
}

# Set up the taskCallback
stopOnNaNs <- function(...) {
    if(nanDetector(.Last.value)) { stop("NaNs detected!\n") }
    return(TRUE)
}
addTaskCallback(stopOnNaNs)
# Try it out
j <- 1:100
y <- rnorm(99)
l <- list(a=1:4, b=list(j=1:4, k=NaN))
# Error in function (...) : NaNs detected!
# Subsequent time consuming code that could be avoided if the
# error thrown above is used to stop its evaluation.
I fear there is no such shortcut. In theory on unix there is SIGFPE that you could trap on, but in practice
there is no standard way to enable FP operations to trap it (even C99 doesn't include a provision for that) - it is highly system-specific (e.g. feenableexcept on Linux, fp_enable_all on AIX etc. - see the sketch after this list) or requires the use of assembler for your target CPU
FP operations are nowadays often done in vector units like SSE so you can't be even sure that FPU is involved and
R intercepts some operations on things like NaNs, NAs and handles them separately so they won't make it to the FP code
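For illustration, a hedged C sketch of the Linux/glibc case mentioned above (non-portable by nature, which is exactly the point):

/* Hedged sketch, Linux/glibc only: turn the listed FP exceptions into SIGFPE
   traps instead of letting them quietly produce Inf/NaN results. */
#define _GNU_SOURCE
#include <fenv.h>

void enable_fp_traps(void) {
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
}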
That said, you could hack yourself an R that will catch some exceptions for your platform and CPU if you tried hard enough (disable SSE etc.). It is not something we would consider building into R, but for a special purpose it may be doable.
However, it would still not catch NaN/NA operations unless you change R internal code. In addition, you would have to check every single package you are using since they may be using FP operations in their C code and may also handle NA/NaN separately.
If you are only worried about things like division by zero or over/underflows, the above will work and is probably the closest to something like a solution.
Just checking your results may not be very reliable, because you don't know whether a result is based on some intermediate NaN calculation that affected an aggregated value without the aggregate itself being NaN. If you are willing to discard such cases, then you could simply walk recursively through your result objects or the workspace. That should not be extremely inefficient, because you only need to worry about REALSXP and not anything else (unless you don't like NAs either - then you'd have more work).
This is an example code that could be used to traverse R object recursively:
static int do_isFinite(SEXP x) {
    /* recurse into generic vectors (lists) */
    if (TYPEOF(x) == VECSXP) {
        int n = LENGTH(x);
        for (int i = 0; i < n; i++)
            if (!do_isFinite(VECTOR_ELT(x, i))) return 0;
    }
    /* recurse into pairlists */
    if (TYPEOF(x) == LISTSXP) {
        while (x != R_NilValue) {
            if (!do_isFinite(CAR(x))) return 0;
            x = CDR(x);
        }
        return 1;
    }
    /* I wouldn't bother with attributes except for S4,
       where attributes are slots */
    if (IS_S4_OBJECT(x) && !do_isFinite(ATTRIB(x))) return 0;
    /* check reals */
    if (TYPEOF(x) == REALSXP) {
        int n = LENGTH(x);
        double *d = REAL(x);
        for (int i = 0; i < n; i++) if (!R_finite(d[i])) return 0;
    }
    return 1;
}

SEXP isFinite(SEXP x) { return ScalarLogical(do_isFinite(x)); }
# in R: .Call("isFinite", x)
Hey there,
I have a multidimensional mathematical function, meaning that I pass an index to the C++ function to select which single mathematical function to evaluate. E.g. let's say I have a mathematical function like this:
f = Vector(x^2*y^2 / y^2 / x^2*z^2)
I would implement it like that:
double myFunc(int function_index)
{
    switch(function_index)
    {
        case 1:
            return PNT[0]*PNT[0]*PNT[1]*PNT[1];
        case 2:
            return PNT[1]*PNT[1];
        case 3:
            return PNT[2]*PNT[2]*PNT[1]*PNT[1];
    }
}
whereas PNT is defined globally as double PNT[ NUM_COORDINATES ]. Now I want to implement the derivatives of each function for each coordinate, thus generating the derivative matrix (columns = coordinates; rows = single functions). I already wrote my kernel, which works so far and which calls myFunc().
The problem is: to calculate the derivative of the mathematical sub-function i with respect to coordinate j, I would use the following code in sequential mode (e.g. on a CPU; simplified, because usually you would decrease h until you reach a certain precision of your derivative):
f0 = myFunc(i);
PNT[ j ] += h;
derivative = (myFunc(i)-f0)/h;
PNT[ j ] -= h;
Now, as I want to do this on the GPU in parallel, the problem comes up: what to do with PNT? Since I have to increase a certain coordinate by h, calculate the value and then decrease it again, how do I do that without 'disturbing' the other threads? I can't modify PNT, because other threads need the 'original' point to modify their own coordinate.
The second idea I had was to save one modified point for each thread, but I discarded this idea quite fast: with some thousand threads in parallel it would be quite bad and probably slow, and perhaps not realizable at all because of memory limits.
'FINAL' SOLUTION
So what I currently do is the following, which adds the value add at runtime (without storing it anywhere), via a preprocessor macro, to the coordinate identified by coordinate_index:
#define X(n) ((coordinate_index == n) ? (PNT[n]+add) : PNT[n])

__device__ double myFunc(int function_index, int coordinate_index, double add)
{
    //*// Example: f[i] = x[i]^3
    return (X(function_index)*X(function_index)*X(function_index));
    // */
}
That works quite nicely and fast. When using a derivative matrix with 10000 functions and 10000 coordinates, it takes only about 0.5 s. PNT is defined either globally or in constant memory as __constant__ double PNT[ NUM_COORDINATES ];, depending on the preprocessor variable USE_CONST.
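Presumably (a hedged reconstruction from the description above; the original definition isn't shown), that USE_CONST switch looks like this:

// Hedged reconstruction: PNT lives either in constant memory (read-only,
// broadcast-friendly when all threads read the same element) or in plain
// device-global memory, selected by the preprocessor flag USE_CONST.
// NUM_COORDINATES is defined elsewhere, as in the post.
#ifdef USE_CONST
__constant__ double PNT[ NUM_COORDINATES ];
#else
__device__ double PNT[ NUM_COORDINATES ];
#endif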
The line return (X(function_index)*X(function_index)*X(function_index)); is just an example where every sub-function follows the same scheme; mathematically speaking:
f = Vector(x0^3 / x1^3 / ... / xN^3)
NOW THE BIG PROBLEM ARISES:
myFunc is a mathematical function which the user should be able to implement as they like. E.g. they could also implement the following mathematical function:
f = Vector(x0^2*x1^2*...*xN^2 / x0^2*x1^2*...*xN^2 / ... / x0^2*x1^2*...*xN^2)
with every function looking the same. As the programmer, you should only have to code this once, independent of the implemented mathematical function. So when the above function is implemented in C++, it looks like the following:
__device__ double myFunc(int function_index, int coordinate_index, double add)
{
    double ret = 1.0;
    for(int i = 0; i < NUM_COORDINATES; i++)
        ret *= X(i)*X(i);
    return ret;
}
And now the memory accesses are very 'weird' and bad for performance, because each thread needs access to each element of PNT twice. Surely, in such a case where each function looks the same, I could rewrite the complete algorithm that surrounds the calls to myFunc, but as I stated already: I don't want the code to depend on the user-implemented function myFunc...
Could anybody come up with an idea how to solve this problem??
Thanks!
Rewinding back to the beginning and starting with a clean sheet, it seems you want to be able to do two things:

1. compute an arbitrary scalar-valued function over an input array
2. approximate the partial derivative of an arbitrary scalar-valued function over the input array, using first-order accurate finite differencing
While the function is scalar valued and arbitrary, it seems that there are, in fact, two clear forms which this function can take:

1. A scalar valued function with scalar arguments
2. A scalar valued function with vector arguments
You appear to have started with the first type of function, and have put together code to deal with computing both the function and the approximate derivative, and are now wrestling with the problem of how to deal with the second case using the same code.
If this is a reasonable summary of the problem, then please indicate so in a comment and I will continue to expand it with some code samples and concepts. If it isn't, I will delete it in a few days.
In comments, I have been trying to suggest that conflating the first type of function with the second is not a good approach. The requirements for correctness in parallel execution, and the best way of extracting parallelism and performance on the GPU are very different. You would be better served by treating both types of functions separately in two different code frameworks with different usage models. When a given mathematical expression needs to be implemented, the "user" should make a basic classification as to whether that expression is like the model of the first type of function, or the second. The act of classification is what drives algorithmic selection in your code. This type of "classification by algorithm" is almost universal in well designed libraries - you can find it in C++ template libraries like Boost and the STL, and you can find it in legacy Fortran codes like the BLAS.
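As a hedged illustration of that "classification by algorithm" idea (the names are purely illustrative, echoing how the STL dispatches on iterator categories): the user tags each expression with its class, and the tag selects the execution strategy at compile time:

// Hedged sketch of tag dispatch: the classification chooses the algorithm.
struct scalar_args_tag {};   // type 1: scalar valued, scalar arguments
struct vector_args_tag {};   // type 2: scalar valued, vector arguments

template <typename F>
void evaluate(F f, scalar_args_tag) {
    // strategy suited to independent per-element functions,
    // e.g. one GPU thread per function evaluation
}

template <typename F>
void evaluate(F f, vector_args_tag) {
    // strategy suited to functions that read the whole input vector,
    // e.g. a cooperative per-block evaluation with staged inputs
}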