How can MPI be used efficiently in a portion of code? - mpi

My application involves run myfuns() in a serial time execution. It calls dothings(...), which has a object instances and others passing to it. This function involves a loop each does a breadth-first search and it is really time consuming. I have used OpenMP for the loop and it speed up just a little bit, not good enough. I am thinking of using MPI parallelism to get more processes to work on, but not sure how to use it in an efficient way for this portion of code embedded deep inside a sequential code.
void dothings (object obji...) {
std::vector retvec;
for (i = 0; i < somenumber; i++) {
/* this function involves breadth first search using std:queue */
retval = compute(obji, i);
retvec.push_back(retval);
}
}
/* myfuns() gets called in a sequential manner */
void myfuns() {
dothings(objectInstance,...)
}

Related

port SYCL/DPC++ code originally written for GPUs to FPGAs

I'm kinda new to the world of FPGAs and I'm trying to port some code written for GPUs to FPGAs, to compare the performances.
From my understanding, using parallel_for ain't a good practice (in fact it runs very slow), instead (I think) I should use a single_task and an unrolled for loop. I'm struggling to make it work properly though.
So, I have
q.submit([&](sycl::handler &h){
h.parallel_for<class Foo>(sycl::nd_range<1>(n_blocks * n_threads, n_threads),
[=](auto& it) {
some_kernel(it, <other params here ...> );
});
}).wait();
and my attempt is
q.submit([&](sycl::handler &h){
h.single_task<class Foo>(
#pragma unroll
for(int i = 0; i < n_blocks * n_threads; ++i)
some_kernel(...)
);
}).wait();
But I'm not sure how to adapt what I was previously doing with a sycl::item (for instance, how to use the loop index to replace the calls to the methods get_group, get_local_id? ).
Should I entirely change the design of the kernel ? In other word, is the "work_groups - work_group_size" approach not appropriate with FPGAs ?

Using OpenMP with GPU

Everyone good time of day!
I would like to ask the advice of the respected community about the use of GPU computing power instead of or together with the CPU.
I have a well-functioning program based on recursive search of all kinds of combinations of some events, paralleled using OpenMP to run on all available processor cores.
The pseudocode C++ is as follows:
// #includes
// function announcements
// declaring a global variable:
QVector<QVector<QVector<float>>> variant; // (or "std::vector")
int main() {
// reads data from file
// data are converted and analyzed
// the variant variable containing the current best result is filled in (here - by pre-analysis)
#pragma omp parallel shared(variant)
#pragma omp master
// occurs call a recursive algorithm of search all variants:
PEREBOR(Tabl_1, a, i_a, ..., reс_depth);
return 0;
}
void PEREBOR(QVector<QVector<uint8_t>> Tabl_1, QVector<A_struct> a, uint8_t i_a, ..., uint8_t reс_depth)
{
// looking for the boundaries of the first cycle for some reasons
for (int i = quantity; i < another_quantity; i++) {
// the Tabl_1 is processed and modified to determine the number of steps in the subsequent for cycle
for (int k = 0; k < the_quantity_just_found; k++) {
if the recursion depth is not 1, we go down further: {
// add descent to the next recursion level to the call stack:
#pragma omp task
PEREBOR(Tabl_1_COPY, a, i_a, ..., reс_depth-1);
}
else (if we went down to the lowest level): {
if (condition fulfilled) // condition check - READ variant variable
variant = it_is_equal_to_that_,_to_that...;
else
continue;
}
}
}
}
Unfortunately, I don't have a CPU with a thousand cores at my disposal, and without this, the algorithm works for a very long time. At the place where I work, I was advised to think about using a GPU to speed up calculations. I learned that OpenMP can work with video cards (and especially with NVidia), but OpenACC also does it well.
In this regard, my main question is whether it is possible to simply and, at the same time, effectively set the execution of a recursive algorithm on a GPU? Can this give a noticeable acceleration relative to the CPU? If so, maybe OpenACC will do better? And is it possible to give instructions to the video card through the "#pragma omp task", or are other instructions REQUIRED? And how would it be possible to combine calculations on the CPU and GPU?
Thank you so much for any help!
P.S. I apologize for my English, which is not my native language :)

Practical uses for Recursion

Are there any times where it is better to use recursion for a task instead of any other methods? By recursion I am referring to:
int factorial(int value)
{
if (value == 0)
{
return 1;
} else
{
return value * factorial(value - 1);
}
}
Well, there are a few reasons I can think of.
Recursion is often easier to understand than a purely iterative solution. For example, in the case of recursive-descent parsers.
In compilers with support for tail call optimization, there's no additional overhead to using recursion over iteration, and it often results in fewer lines of code (and, as a result, fewer bugs).
First of all your example doesn't make any sense.
The way you wrote it would just lead to an endless loop without any result ever.
A "real" function would more look like this:
int factorial(int value)
{
if (value == 0)
return 1;
else
return value * factorial(value - 1);
}
Of course you could accomplish the same thing with a loop (which might even be better, especially if the function call incurs the penalty of a stack frame). Usually, when people use recursion they do so because it's easier to read (for certain problem domains).

MPI irecv,isend communication between tasks

I am new to MPI, so sorry if this sounds stupid. I want a process to have an MPI_Irecv. If it has been called for a task, then it finds a result and sends the result back to the process that called it. How can I check if it has been actually assigned to a task? So that I can have an if{} in which that task takes place while the rest of the process continues with other stuff.
Code example:
for (i=0;i<size_of_Q;i++) {
MPI_Irecv( &shmeio, 1, mpi_point_C, root, 77, MPI_COMM_WORLD, &req );
//I want to put an if right here.
//If it's true process does task.
//Finds a number. then
MPI_Isend( &Bestcandidate, 1, mpi_point_C, root, 66, MPI_COMM_WORLD, &req );
//so that it can return the result.
//if it wasn't assigned a task it carries on with its other tasks.
} //(here is where for loop ends)
You might be confusing what MPI is supposed to do. MPI isn't really a tasking-based model as compared to some others (map reduce, some parts of OpenMP, etc.). MPI has historically focused on SPMD (single program multiple data) types of applications. That's not to say that MPI can't handle MPMD (there's an entire chapter in the standard about dynamic processes and most launchers can run different executables on different ranks.
With that in mind, when you start your job, you'll usually have all of the processes that you'll ever have (unless you're using dynamic processing like MPI_COMM_SPAWN). You probably used something like:
mpiexec -n 8 ./my_program arg1 arg2 arg3
Many times, if people are trying to emulate a tasking (or master/worker) model, they'll treat rank 0 as the special "master":
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
while (/* work not done */) {
/* Check if parts of work are done */
/* Send work to ranks without work */
}
} else {
while (/* work not done */ {
/* Get work from master */
/* Compute */
/* Send results to master */
}
}
Often, when waiting for the work, you'll do something like:
for (i = 1; i < num_process; i++) {
MPI_Irecv(&result[i], ..., &requests[i]);
}
This will set up the receives for each rank that will send you work. Then later, you can do something like:
MPI_Testany(num_processes - 1, requests, &index, &flag, statuses);
if (flag) {
/* Process work */
MPI_Send(work_data, ..., index, ...);
}
This will check to see if any of the requests (the handles used to track the status of a nonblocking operation) are completed and will then send new work to the worker that finished.
Obviously, all of this code is not copy/paste ready. You'll have to figure out how/if it applies to your work and adapt it accordingly.

Recursion in functional programming will not give high concurrency. Is it correct?

I am new to functional programming. Loops in imperative programming replaces recursion in FP. Another statement is FP gives high concurrency. The instructions being executed parallelly on multi-core/cpu systems as the data is immutable.
Whereas in recursion, steps cannot be executed parallelly due to a step execution is dependent on the previous steps result.
So, I am assuming that recursion in FP will not give high concurrency. Am I correct?
Sort of. You cannot get more execution parallelism than the data parallelism; this is Amdahl's law. However, you frequently have more data parallelism than is expressed in typical sequential algorithms, whether functional or imperative. Consider for example taking the scalar multiple of a vector: (note: this is some made-up algol-style language):1
function scalar_multiple(scalar c, vector v) {
vector v1;
for (int i = 0; i < length(v); i++) {
v1[i] = c * v[i];
}
return v1;
}
Obviously, this isn't going to run in parallel. The situation isn't improved if we re-write in a functional language, using recursion (you can think of this as Haskell):
scalar_multiple c [] = []
scalar_multiple c (x:xn) = c * x : scalar_multiple c xn
This is still a sequential algorithm!
However, you can notice that there is no data dependency --- you don't actually need the result of earlier / later multiplications to calculate later ones. So we have the potential for parallelization here. This can be accomplished in an imperative language:
function scalar_multiple(scalar c, vector v) {
vector v1;
parallel_for (int i in 0..length(v)-1) {
v1[i] = c * v[i];
}
return v1;
}
But this parallel_for is a dangerous construct. Consider a search function:
function first(predicate p, vector v) {
for (int i = 0; i < length(v); i++) {
if (p(v[i])) return i;
}
return -1;
}
If we try speeding this up by replacing for with parallel_for:
function first(predicate p, vector v) {
parallel_for (int i in 0..length(v)-1) {
if (p(v[i])) return i;
}
return -1;
}
Now we won't necessarily return the index of the first element to satisfy the condition, just an element that satisfies it. We broke the contract of the function by parallelizing it.
The obvious solution is 'don't allow return inside parallel_for. But there are lots of other dangerous constructs; in fact, you'll notice I had to abandon the C-style for loop because the increment-and-test pattern itself is dangerous in parallel languages. Consider:
function sequence(int n) {
vector v;
int c = 0;
parallel_for (int i = 0..n-1) {
v[i] = c++;
}
return v;
}
This is again a 'toy' example ("just use v[i] = i;!"), but it illustrates the point: this function initializes v in a random order, due to parallelism. So it turns out that the constructs that are 'safe' to use inside a construct like parallel_for are precisely the constructs that are allowed in purely-functional languages, which makes adding parallel constructs to those languages 'safer' than adding them to imperative languages.
1 This is just a very simple example; of course, real parallelism involves finding bigger chunks of work to parallize than this!
Not sure, if I understand you right, but it generally depends on what you want to accomplish.
One recursion alone cannot execute its subcalls parallel. But you CAN have 2 recursions working on the same dataset. i.e. processing an array from left AND from right simultaneosly trough two concurrent running recursive functions. Those (two) functions can then (theretically) run parallel.
In detail it does not matter if you have a recursive function or a function with a loop inside as long as there is a function who can run on its own. So in respect to your question:
No, a recursive function per definition does not give you any concurrency.
Loops are replaced by higher-order functions more frequently than by direct recursion. Recursion is sort of a catch-all measure in functional programming for when higher-order functions don't already exist for what you need to do.
For example, if you want to run the same calculation on all elements of a list, you use a map, which is highly parallelizable. Finding which elements meet certain criteria is a filter, also highly parallelizable.
Some algorithms just plain require the result of the previous iteration in order to proceed. Those are the ones that tend to require a recursive function, and you're right, they are not generally easy to make highly concurrent.

Resources