How to check completion of non-blocking reduce? - mpi

It is not clear to me how to properly use non-blocking collectives in MPI, particularly MPI_Ireduce() in this case:
Say I want to collect a sum on the root rank:
int local_cnt;
int total_cnt;
MPI_Request request;
MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
/* now I want to check if the reduce is finished */
if (rank == 0) {
int flag = 0;
MPI_Status status;
MPI_Test(&request, &flag, &status);
if (flag) {
/* reduce is finished? */
}
}
Is this the correct way to check whether a non-blocking reduce is done? My confusion comes from two aspects. First, can or should only the root process check for completion using MPI_Test(), since the result is meaningful only to the root? Second, since MPI_Test() is a local operation, how can it know that the operation is complete? Completion does require all processes to finish, right?

You must check for completion on all participating ranks, not just the root.
From a user perspective, you need to know about the completion of the communication because you must not do anything with the memory provided to a non-blocking operation until it completes. I.e. if you send a local scope variable like local_cnt, you cannot write to it or leave its scope before you have confirmed that the operation is complete.
One option to ensure completion is calling MPI_Test until it eventually returns flag==true. Use this only if you can do something useful between the calls to MPI_Test:
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    int flag;
    do {
        // perform some useful computation
        MPI_Status status;
        MPI_Test(&request, &flag, &status);
    } while (!flag);
}
Do not call MPI_Test in a loop if you have nothing useful to do in between calls. Instead use MPI_Wait, which blocks until completion.
{
    int local_cnt;
    int total_cnt;
    // fill local_cnt on all ranks
    MPI_Request request;
    MPI_Ireduce(&local_cnt, &total_cnt, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD, &request);
    // perform some useful computation
    MPI_Status status;
    MPI_Wait(&request, &status);
}
Remember, if you have no useful computation at all, and don't need to be non-blocking for deadlock reasons, use a blocking communication in the first place. If you have multiple ongoing non-blocking communications, there are MPI_Waitany, MPI_Waitsome, MPI_Waitall and their Test variants.

Zulan brilliantly answered the first part of your question.
MPI_Reduce() returns when:
- the send buffer can be overwritten, on a non-root rank
- the result is available, on the root rank (which implies all the ranks have completed)
So there is no way for a non-root rank to know whether the root rank has completed. If you do need this information, then you need to manually add an MPI_Barrier(). That being said, you generally do not require this information, and if you believe you do need it, there might be something wrong with your app.
This remains true if you use non-blocking collectives: when the MPI_Wait() corresponding to an MPI_Ireduce() completes on a non-root rank, that simply means the send buffer can be overwritten.

Related

broadcasting structs without knowing attributes

I'm trying to use MPI_Bcast to share an instance of cudaIpcMemHandle_t, but I cannot figure out how to create the corresponding MPI_Datatype needed for Bcast. I do not know the underlying structure of the CUDA type, hence methods like this one don't seem to work. Am I missing something?
Following up on the useful comment, I have a solution that seems to work, though it has only been tested in a toy program:
// Broadcast the memory handle
cudaIpcMemHandle_t memHand[1];
if (rank == 0) {
    // set the handle content, e.g. call cudaIpcGetMemHandle
}
MPI_Barrier(MPI_COMM_WORLD);
// share the size of the handle with the other ranks
int hand_size[1];
if (rank == 0)
    hand_size[0] = sizeof(memHand[0]);
MPI_Bcast(&hand_size[0], 1, MPI_INT, 0, MPI_COMM_WORLD);
// broadcast the handle as a byte array
char memHand_C[hand_size[0]];
if (rank == 0)
    memcpy(memHand_C, &memHand[0], hand_size[0]);
MPI_Bcast(memHand_C, hand_size[0], MPI_BYTE, 0, MPI_COMM_WORLD);
if (rank > 0)
    memcpy(&memHand[0], memHand_C, hand_size[0]);
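The byte-array idea used above can be shown in isolation, without CUDA or MPI. This is a minimal sketch: opaque_handle_t is a hypothetical stand-in for an opaque library type whose fields are unknown, and handle_to_bytes / handle_from_bytes are illustrative helpers, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an opaque type: we only rely on sizeof(). */
typedef struct {
    unsigned char reserved[64];
} opaque_handle_t;

/* Copy the handle's object representation into a byte buffer.
   Returns the number of bytes written, or 0 if the buffer is too small. */
size_t handle_to_bytes(const opaque_handle_t *h, unsigned char *buf, size_t buf_len)
{
    assert(h != NULL && buf != NULL);
    if (buf_len < sizeof *h)
        return 0;
    memcpy(buf, h, sizeof *h);
    return sizeof *h;
}

/* Rebuild a handle from received bytes.
   Returns 1 on success, 0 if fewer than sizeof(opaque_handle_t) bytes were given. */
int handle_from_bytes(opaque_handle_t *h, const unsigned char *buf, size_t buf_len)
{
    assert(h != NULL && buf != NULL);
    if (buf_len < sizeof *h)
        return 0;
    memcpy(h, buf, sizeof *h);
    return 1;
}
```

Because the size is a compile-time constant that is the same in every process running the same binary, the buffer produced here is what would be passed to MPI_Bcast with count sizeof(opaque_handle_t) and datatype MPI_BYTE. Copying bytes like this is only meaningful between processes that share the same layout for the type.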

Using MPI_Scatter with 3 processes

I am new to MPI, and my question is how the root (for example rank 0) initializes all its values (in the array) before the other processes receive their i'th value from the root.
For example:
In the root I initialize: arr[0]=20, arr[1]=90, arr[2]=80.
My question is: if I have, for example, process number 2 that starts a little bit before the root process, can MPI_Scatter send an incorrect value instead of 80?
How can I ensure that the root initializes all its memory before the others use Scatter?
Thank you!
The MPI standard specifies that
If comm is an intracommunicator, the outcome is as if the root executed n send operations, MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...), and each process executed a receive, MPI_Recv(recvbuf, recvcount, recvtype, i, ...).
This means that each non-root process waits until its recvcount elements have been transmitted. Such a routine is called blocking (the process waits until its part of the communication is complete).
You as the programmer are responsible for ensuring that the data being sent is correct from the time you call any communication routine until the send buffer is available again (in this case, until MPI_Scatter returns). In an MPI-only program, this is as simple as placing the initialization code before the call to MPI_Scatter, as each process executes the program sequentially.
The following is an example based on the document's Example 5.11:
MPI_Comm comm = MPI_COMM_WORLD;
int grank, gsize;
int *sendbuf = NULL, *rbuf;
int root;
MPI_Comm_rank(comm, &grank);
MPI_Comm_size(comm, &gsize);
root = 0;
if (grank == root) {
    sendbuf = (int *)malloc(gsize * 100 * sizeof(int));
    // Initialize sendbuf. None of its values are valid at this point.
    for (int i = 0; i < gsize * 100; i++)
        sendbuf[i] = i;
}
rbuf = (int *)malloc(100 * sizeof(int));
// Distribute sendbuf data
// At the root process, all sendbuf values are valid
// In non-root processes, sendbuf argument is ignored.
MPI_Scatter(sendbuf, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);
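The "as if" rule quoted from the standard can be modelled in plain C to show which elements each rank ends up with. This is a sketch, not an MPI call: scatter_chunk_for_rank is a hypothetical helper, and it assumes an int buffer so that the extent is sizeof(int).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plain C model of the scatter rule for an int buffer: rank i receives the
   sendcount elements that start at sendbuf + i * sendcount.
   Hypothetical helper for illustration only. */
void scatter_chunk_for_rank(const int *sendbuf, int sendcount, int nranks,
                            int rank, int *recvbuf)
{
    assert(sendbuf != NULL && recvbuf != NULL);
    assert(sendcount >= 0 && nranks > 0);
    assert(rank >= 0 && rank < nranks);

    memcpy(recvbuf,
           sendbuf + (size_t)rank * (size_t)sendcount,
           (size_t)sendcount * sizeof(int));
}
```

With arr = {20, 90, 80}, sendcount 1 and 3 ranks, the chunk for rank 2 is the single element 80. The values depend only on what the root's buffer holds when the root makes the call, not on which rank reached the call first.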
MPI_Scatter() is a collective operation, so the MPI library does take care of everything, and the outcome of a collective operation does not depend on which rank made the call earlier than another.
In this specific case, a non-root rank will block (at least) until the root rank calls MPI_Scatter().
This is no different from an MPI_Send() / MPI_Recv() pair.
MPI_Recv() blocks if it is called before the remote peer has called MPI_Send() with a matching message.

What is the right way to "notify" processors without blocking?

Suppose I have a very large array of things and I have to do some operation on all these things.
If the operation fails for one element, I want to stop the work on the whole array (this work is distributed across a number of processors).
I want to achieve this while keeping the number of sent/received messages to a minimum.
Also, I don't want to block processors if there is no need to.
How can I do it using MPI?
This seems to be a common question with no easy answer. Both other answers have scalability issues. The ring-communication approach has linear communication cost, while in the one-sided MPI_Win solution, a single process will be hammered with memory requests from all workers. This may be fine for a low number of ranks, but poses issues when increasing the rank count.
Non-blocking collectives can provide a more scalable solution. The main idea is to post an MPI_Ibarrier on all ranks except one designated root. This root listens for point-to-point stop messages via MPI_Irecv and joins the MPI_Ibarrier once it receives one, which completes the barrier on all ranks.
The tricky part is that there are four different cases, {root, non-root} x {found, not-found}, that need to be handled. It can also happen that multiple ranks send a stop message, requiring an unknown number of matching receives on the root. That can be solved with an additional reduction that counts the number of ranks that sent a stop request.
Here is an example of how this can look in C:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

const int iter_max = 10000;
const int difficulty = 20000;

int find_stuff()
{
    int num_iters = rand() % iter_max;
    for (int i = 0; i < num_iters; i++) {
        if (rand() % difficulty == 0) {
            return 1;
        }
    }
    return 0;
}

const int stop_tag = 42;
const int root = 0;

int forward_stop(MPI_Request* root_recv_stop, MPI_Request* all_recv_stop, int found_count)
{
    int flag;
    MPI_Status status;
    if (found_count == 0) {
        MPI_Test(root_recv_stop, &flag, &status);
    } else {
        // If we find something on the root, we actually wait until we receive our own message.
        MPI_Wait(root_recv_stop, &status);
        flag = 1;
    }
    if (flag) {
        printf("Forwarding stop signal from %d\n", status.MPI_SOURCE);
        MPI_Ibarrier(MPI_COMM_WORLD, all_recv_stop);
        MPI_Wait(all_recv_stop, MPI_STATUS_IGNORE);
        // We must post some additional receives if multiple ranks found something at the same time
        MPI_Reduce(MPI_IN_PLACE, &found_count, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
        for (found_count--; found_count > 0; found_count--) {
            MPI_Recv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &status);
            printf("Additional stop from: %d\n", status.MPI_SOURCE);
        }
        return 1;
    }
    return 0;
}

int main()
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank);
    MPI_Request root_recv_stop;
    MPI_Request all_recv_stop;
    if (rank == root) {
        MPI_Irecv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &root_recv_stop);
    } else {
        // You may want to use an extra communicator here, to avoid messing with other barriers
        MPI_Ibarrier(MPI_COMM_WORLD, &all_recv_stop);
    }
    while (1) {
        int found = find_stuff();
        if (found) {
            printf("Rank %d found something.\n", rank);
            // Note: We cannot post this as blocking, otherwise there is a deadlock with the reduce
            MPI_Request req;
            MPI_Isend(NULL, 0, MPI_CHAR, root, stop_tag, MPI_COMM_WORLD, &req);
            if (rank != root) {
                // We know that we are going to receive our own stop signal.
                // This avoids running another useless iteration
                MPI_Wait(&all_recv_stop, MPI_STATUS_IGNORE);
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                break;
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        if (rank == root) {
            if (forward_stop(&root_recv_stop, &all_recv_stop, found)) {
                break;
            }
        } else {
            int stop_signal;
            MPI_Test(&all_recv_stop, &stop_signal, MPI_STATUS_IGNORE);
            if (stop_signal) {
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                printf("Rank %d stopping after receiving signal.\n", rank);
                break;
            }
        }
    }
    MPI_Finalize();
}
While this is not the simplest code, it should:
Introduce no additional blocking
Scale with the implementation of a barrier (usually O(log N))
Have a worst-case latency, from one rank finding something to all ranks stopping, of 2 * loop time (+ 1 p2p + 1 barrier + 1 reduction).
If many/all ranks find a solution at the same time, it still works but may be less efficient.
A possible strategy to derive this global stop condition in a non-blocking fashion is to rely on MPI_Test.
Scenario
Consider that each process posts an asynchronous receive of type MPI_INT from its left rank with a given tag, so that the processes form a ring. Then start your computation. If a rank encounters the stop condition, it sends its own rank to its right rank. Meanwhile, each rank uses MPI_Test during the computation to check whether its MPI_Irecv has completed. If it has, the rank enters a branch in which it first waits for the message and then propagates the received rank to the right, unless the right rank is equal to the payload of the message (so as not to loop).
Once this is done, all processes should be in the branch, ready to trigger an arbitrary recovery operation.
Complexity
The topology retained is a ring because it minimizes the number of messages (at most n-1); however, it increases the propagation time. Other topologies could be used, with more messages but lower spatial complexity, for example a tree with an n.ln(n) complexity.
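The ring arithmetic behind this trade-off can be sketched in plain C. The helpers below are hypothetical and make no MPI calls. For comparison, doubling_rounds counts the rounds needed when the number of informed ranks doubles each round, which is an assumption made for this illustration rather than something the answer specifies.

```c
#include <assert.h>

/* Left neighbour of `rank` on a ring of `size` ranks. */
int ring_left(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}

/* Right neighbour of `rank` on a ring of `size` ranks. */
int ring_right(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Hops a stop message needs to travel rightwards from `origin` to `rank`. */
int ring_hops(int origin, int rank, int size)
{
    assert(size > 0 && origin >= 0 && origin < size);
    assert(rank >= 0 && rank < size);
    return (rank - origin + size) % size;
}

/* Rounds until all `size` ranks are informed if the informed set doubles
   every round (assumed scheme, for comparison with the ring only). */
int doubling_rounds(int size)
{
    assert(size > 0);
    int rounds = 0;
    for (int informed = 1; informed < size; informed *= 2)
        rounds++;
    return rounds;
}
```

On a ring of 8 ranks, the rank just to the left of the originator is 7 hops away, whereas the doubling scheme reaches everyone in 3 rounds.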
Implementation
Something like this.
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int left_rank = (rank == 0) ? (size - 1) : (rank - 1);
int right_rank = (rank + 1) % size;
int stop_cond_rank;
MPI_Request stop_cond_request;
int stop_cond = 0;
/* Post the receive once, before the loop, so that no request is leaked */
MPI_Irecv(&stop_cond_rank, 1, MPI_INT, left_rank, 123, MPI_COMM_WORLD, &stop_cond_request);
while (1)
{
    /* Compute Here and set stop condition accordingly */
    if (stop_cond)
    {
        /* Cancel the left recv and complete the cancelled request */
        MPI_Cancel(&stop_cond_request);
        MPI_Wait(&stop_cond_request, MPI_STATUS_IGNORE);
        if (rank != right_rank)
            MPI_Send(&rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }
    int did_recv = 0;
    MPI_Test(&stop_cond_request, &did_recv, MPI_STATUS_IGNORE);
    if (did_recv)
    {
        stop_cond = 1;
        MPI_Wait(&stop_cond_request, MPI_STATUS_IGNORE);
        if (right_rank != stop_cond_rank)
            MPI_Send(&stop_cond_rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }
}
if (stop_cond)
{
    /* Handle the stop condition */
}
else
{
    /* Cleanup */
    MPI_Cancel(&stop_cond_request);
    MPI_Wait(&stop_cond_request, MPI_STATUS_IGNORE);
}
That is a question I've asked myself a few times without finding any completely satisfactory answer... The only thing I thought of (besides MPI_Abort(), which does it but is a bit extreme) is to create an MPI_Win storing a flag that will be raised by whichever process faces the problem, and checked by all processes regularly to see if they can go on processing. This is done using non-blocking calls, the same way as described in this answer.
The main weaknesses of this are:
This depends on the processes willingly checking the status of the flag. There is no immediate interruption of their work to notify them.
The frequency of this checking must be adjusted by hand. You have to find the trade-off between the time you waste processing data while there's no need to because something happened elsewhere, and the time it takes to check the status...
In the end, what we would need is a way of defining a callback action triggered by an MPI call such as MPI_Abort() (basically replacing the abort action by something else). I don't think this exists, but maybe I overlooked it.
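The checking-frequency trade-off mentioned above can be put into numbers with a small plain C sketch. It assumes a simplified model in which the flag is checked after iterations check_every, 2*check_every, and so on; wasted_iterations and checks_performed are hypothetical helpers, not part of MPI.

```c
#include <assert.h>

/* Iteration at which the first check at or after `raised_at` happens. */
static int first_check_at_or_after(int raised_at, int check_every)
{
    return ((raised_at + check_every - 1) / check_every) * check_every;
}

/* Iterations of useless work done between the flag being raised at
   iteration `raised_at` and the check that finally notices it. */
int wasted_iterations(int raised_at, int check_every)
{
    assert(raised_at >= 1 && check_every >= 1);
    return first_check_at_or_after(raised_at, check_every) - raised_at;
}

/* Number of flag checks performed, including the one that sees the flag. */
int checks_performed(int raised_at, int check_every)
{
    assert(raised_at >= 1 && check_every >= 1);
    return first_check_at_or_after(raised_at, check_every) / check_every;
}
```

In this model, a flag raised at iteration 7 and checked every 5 iterations costs 3 wasted iterations and 2 checks, while checking every iteration wastes nothing but costs 7 checks.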

MPI irecv,isend communication between tasks

I am new to MPI, so sorry if this sounds stupid. I want a process to have an MPI_Irecv. If it has been called for a task, then it finds a result and sends the result back to the process that called it. How can I check if it has been actually assigned to a task? So that I can have an if{} in which that task takes place while the rest of the process continues with other stuff.
Code example:
for (i=0;i<size_of_Q;i++) {
MPI_Irecv( &shmeio, 1, mpi_point_C, root, 77, MPI_COMM_WORLD, &req );
//I want to put an if right here.
//If it's true process does task.
//Finds a number. then
MPI_Isend( &Bestcandidate, 1, mpi_point_C, root, 66, MPI_COMM_WORLD, &req );
//so that it can return the result.
//if it wasn't assigned a task it carries on with its other tasks.
} //(here is where for loop ends)
You might be confusing what MPI is supposed to do. MPI isn't really a tasking-based model as compared to some others (map reduce, some parts of OpenMP, etc.). MPI has historically focused on SPMD (single program multiple data) types of applications. That's not to say that MPI can't handle MPMD (there's an entire chapter in the standard about dynamic processes, and most launchers can run different executables on different ranks).
With that in mind, when you start your job, you'll usually have all of the processes that you'll ever have (unless you're using dynamic processing like MPI_COMM_SPAWN). You probably used something like:
mpiexec -n 8 ./my_program arg1 arg2 arg3
Many times, if people are trying to emulate a tasking (or master/worker) model, they'll treat rank 0 as the special "master":
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
    while (/* work not done */) {
        /* Check if parts of work are done */
        /* Send work to ranks without work */
    }
} else {
    while (/* work not done */) {
        /* Get work from master */
        /* Compute */
        /* Send results to master */
    }
}
Often, when waiting for the work, you'll do something like:
for (i = 1; i < num_processes; i++) {
    /* requests[0] belongs to rank 1, requests[1] to rank 2, and so on */
    MPI_Irecv(&result[i], ..., &requests[i - 1]);
}
This will set up the receives for each rank that will send you work. Then later, you can do something like:
MPI_Testany(num_processes - 1, requests, &index, &flag, &status);
if (flag) {
    /* Process work */
    /* index refers to the requests array, so the worker's rank is index + 1 */
    MPI_Send(work_data, ..., index + 1, ...);
}
This will check to see if any of the requests (the handles used to track the status of a nonblocking operation) are completed and will then send new work to the worker that finished.
Obviously, all of this code is not copy/paste ready. You'll have to figure out how/if it applies to your work and adapt it accordingly.

MPI Barrier C++

I want to use MPI (MPICH2) on Windows. I write this command:
MPI_Barrier(MPI_COMM_WORLD);
I expect it to block all processors until all group members have called it, but that does not happen. Here is a schematic of my code:
int a;
if(myrank == RootProc)
    a = 4;
MPI_Barrier(MPI_COMM_WORLD);
cout << "My Rank = " << myrank << "\ta = " << a << endl;
(With 2 processors:) The root processor (0) acts correctly, but the processor with rank 1 doesn't know the a variable, so it displays -858993460 instead of 4.
Can anyone help me?
Regards
You're only assigning a in process 0. MPI doesn't share memory, so if you want the a in process 1 to get the value of 4, you need to call MPI_Send from process 0 and MPI_Recv from process 1.
Variable a is not initialized - it is possible that is why it displays that number. In MPI, variable a is duplicated between the processes - so there are two values for a, one of which is uninitialized. You want to write:
int a = 4;
if (myrank == RootProc)
...
Or, alternatively, do an MPI_Send in the root (id 0), and an MPI_Recv in the slave (id 1) so the value in the root is also set in the slave.
Note: that code triggers a small alarm in my head, so I need to check something and I'll edit this with more info. Until then though, the uninitialized value is most certainly a problem for you.
Ok I've checked the facts - your code was not properly indented and I missed the missing {}. The barrier looks fine now, although the snippet you posted does not do too much, and is not a very good example of a barrier because the slave enters it directly, whereas the root will set the value of the variable to 4 and then enter it. To test that it actually works, you probably want some sort of a sleep mechanism in one of the processes - that will yield (hope it's the correct term) the other process as well, preventing it from printing the cout until the sleep is over.
Blocking is not enough; you have to send data to the other processes (memory is not shared between processes).
To share data across ALL processes use:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
so in your case:
MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
Here you send one integer pointed to by &a from process 0 to all the others.
// MPI_Bcast is the sender on the root process and the receiver on non-root processes
You can also send some data to a specific process with:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
and then receive it with:
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Resources