Deadlock with MPI

I'm experimenting with MPI and was wondering if this code could cause a deadlock.
MPI_Comm_rank(comm, &my_rank);
if (my_rank == 0) {
    MPI_Send(sendbuf, count, MPI_INT, 1, tag, comm);
    MPI_Recv(recvbuf, count, MPI_INT, 1, tag, comm, &status);
} else if (my_rank == 1) {
    MPI_Send(sendbuf, count, MPI_INT, 0, tag, comm);
    MPI_Recv(recvbuf, count, MPI_INT, 0, tag, comm, &status);
}

MPI_Send may or may not block. It blocks until the sender can reuse the send buffer. Some implementations return to the caller as soon as the buffer has been handed off to a lower communication layer; others return only when there is a matching MPI_Recv() at the other end. So whether this program deadlocks or not depends on your MPI implementation.
Because this program behaves differently across MPI implementations, you may want to rewrite it so that no deadlock is possible:
MPI_Comm_rank(comm, &my_rank);
if (my_rank == 0) {
    MPI_Send(sendbuf, count, MPI_INT, 1, tag, comm);
    MPI_Recv(recvbuf, count, MPI_INT, 1, tag, comm, &status);
} else if (my_rank == 1) {
    MPI_Recv(recvbuf, count, MPI_INT, 0, tag, comm, &status);
    MPI_Send(sendbuf, count, MPI_INT, 0, tag, comm);
}
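Another way to keep the code symmetric (not mentioned in the original answer, so take this as a sketch) is the combined MPI_Sendrecv call, which performs the send and the receive in one operation and cannot deadlock against a matching MPI_Sendrecv on the peer:
/* Sketch using MPI_Sendrecv; sendbuf, recvbuf, count, tag, comm and status
   are assumed to be the same variables as in the snippets above. */
MPI_Comm_rank(comm, &my_rank);
if (my_rank == 0 || my_rank == 1) {
    int peer = (my_rank == 0) ? 1 : 0;
    MPI_Sendrecv(sendbuf, count, MPI_INT, peer, tag,
                 recvbuf, count, MPI_INT, peer, tag,
                 comm, &status);
}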
Always be aware that every MPI_Send() must have a matching MPI_Recv(), and that the pair must be "parallel" in time. For example, the following may end in deadlock because the paired send/recv calls are not aligned in time; they cross each other:
RANK 0                          RANK 1
----------                      ----------
MPI_Send() ---.      .--- MPI_Send()       |
               \    /                      |
                \  /                       |
                 \/                        |  TIME
                 /\                        |
                /  \                       |
MPI_Recv() <---'    '---> MPI_Recv()       v
These processes, on the other hand, won't end in deadlock, provided of course that there really are two processes with ranks 0 and 1 in the same communicator.
RANK 0                          RANK 1
----------                      ----------
MPI_Send() ------------------>  MPI_Recv()  |
                                            |  TIME
MPI_Recv() <------------------  MPI_Send()  v
The fixed program above may still fail if the size of the communicator comm does not allow rank 1 (i.e. it contains only rank 0). In that case the if-else never takes the else branch, so no process will be listening for the message and the MPI_Send() in rank 0 will deadlock.
If you need to keep your current communication layout, then you may prefer to use MPI_Isend() or MPI_Issend() for non-blocking sends instead, thus avoiding deadlock.
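For example, keeping the original layout but making the sends non-blocking could look roughly like this (a sketch; the variables are the same as above, plus an MPI_Request):
MPI_Request req;
MPI_Comm_rank(comm, &my_rank);
if (my_rank == 0) {
    MPI_Isend(sendbuf, count, MPI_INT, 1, tag, comm, &req);
    MPI_Recv (recvbuf, count, MPI_INT, 1, tag, comm, &status);
    MPI_Wait (&req, MPI_STATUS_IGNORE);
} else if (my_rank == 1) {
    MPI_Isend(sendbuf, count, MPI_INT, 0, tag, comm, &req);
    MPI_Recv (recvbuf, count, MPI_INT, 0, tag, comm, &status);
    MPI_Wait (&req, MPI_STATUS_IGNORE);
}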

The post by @mcleod_ideafix is very good. I want to add a couple more things about non-blocking MPI calls.
The way most MPI implementations work is that they copy the data out of the user buffer into some other place. It might be a buffer internal to the implementation, or it might be something better on the right kind of network. When that data has been copied out of the user buffer and the buffer can be reused by the application, the MPI_SEND call returns. This may happen before the matching MPI_RECV is called, or it may not. The larger the data you are sending, the more likely it is that your message will block until the MPI_RECV call is made.
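A side note not in the original answer: if you want to check whether your code depends on this buffering, you can temporarily replace MPI_Send with MPI_Ssend, which only completes once the matching receive has started, so a potential deadlock becomes deterministic:
/* Debugging sketch: with synchronous sends, the questionable exchange
   deadlocks every time instead of only on some implementations or message sizes. */
if (my_rank == 0) {
    MPI_Ssend(sendbuf, count, MPI_INT, 1, tag, comm);
    MPI_Recv (recvbuf, count, MPI_INT, 1, tag, comm, &status);
} else if (my_rank == 1) {
    MPI_Ssend(sendbuf, count, MPI_INT, 0, tag, comm);
    MPI_Recv (recvbuf, count, MPI_INT, 0, tag, comm, &status);
}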
The best way to avoid this is to use the non-blocking calls MPI_IRECV and MPI_ISEND. This way you can post your MPI_IRECV first, then make your call to MPI_ISEND. This avoids extra copies when the messages arrive (because the buffer to hold them is already available via the MPI_IRECV), which makes things faster, and it avoids the deadlock situation. So now your code would look like this:
MPI_Request requests[2];   /* one request per non-blocking call */
MPI_Status  statuses[2];

MPI_Comm_rank(comm, &my_rank);
if (my_rank == 0) {
    MPI_Irecv(recvbuf, count, MPI_INT, 1, tag, comm, &requests[0]);
    MPI_Isend(sendbuf, count, MPI_INT, 1, tag, comm, &requests[1]);
} else if (my_rank == 1) {
    MPI_Irecv(recvbuf, count, MPI_INT, 0, tag, comm, &requests[0]);
    MPI_Isend(sendbuf, count, MPI_INT, 0, tag, comm, &requests[1]);
}
MPI_Waitall(2, requests, statuses);

As mcleod_ideafix explained, your code can result in a deadlock. There are two possible solutions: one rearranges the execution order, the other uses asynchronous (non-blocking) send/recv calls.
Here's the solution with asynchronous calls:
if (rank == 0) {
    MPI_Isend(..., 1, tag, MPI_COMM_WORLD, &req);
    MPI_Recv(..., 1, tag, MPI_COMM_WORLD, &status);
    MPI_Wait(&req, &status);
} else if (rank == 1) {
    MPI_Recv(..., 0, tag, MPI_COMM_WORLD, &status);
    MPI_Send(..., 0, tag, MPI_COMM_WORLD);
}

Related

Could MPI_Bcast lead to the issue of data uncertainty?

If different processes broadcast different values to the other processes in the group of a certain communicator, what would happen?
Take the following program run by two processes as an example,
int rank, size;
int x;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
{
    x = 0;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank == 1)
{
    x = 1;
    MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
cout << "Process " << rank << "'s value is: " << x << endl;
MPI_Finalize();
I think there are different possibilities for the value printed at the end of the program. If process 0 runs faster than process 1, it broadcasts its value before process 1 does, so process 1 would already hold process 0's value when it starts broadcasting its own, and both would print 0. But if process 0 runs slower than process 1, then process 0 would end up with process 1's value, which is 1. Does what I described actually happen?
I think you don't understand the MPI_Bcast function well. MPI_Bcast is a collective communication operation, in which every process belonging to the communicator has to take part. For MPI_Bcast to work, not only the process that sends the data to be broadcast, but also the processes that receive it, must make the same call, so that the data is broadcast among all participating processes.
In your given program, especially this part:
if (rank == 0)
{
    x = 0;
    MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
}
else if (rank == 1)
{
    x = 1;
    MPI_Bcast(&x, 1, MPI_INT, 1, MPI_COMM_WORLD);
}
I think you meant to let the process whose rank is 0 (process 0) broadcast its value of x to the other processes, but in your code, because of the if-else, only process 0 calls that MPI_Bcast. So what do the other processes do? The process whose rank is 1 (process 1) does not call the same MPI_Bcast that process 0 calls; it calls a different MPI_Bcast to broadcast its own value of x (the root argument differs between the two calls). Thus, since only process 0 makes its MPI_Bcast call, it effectively broadcasts the value of x only to itself, and the value of x stored in the other processes is not affected at all. The same applies to process 1. As a result, the printed value of x for each process is the same as the value it was initially assigned, and there is no data-uncertainty issue as you feared.
MPI_Bcast is used primarily so that rank 0 [root] can calculate values and broadcast them so everybody starts with the same values.
Here is a valid usage:
int x;
// not all ranks may get the same time value ...
srand(time(NULL));
// so we must get the value once ...
if (rank == 0)
    x = rand();
// and broadcast it here ...
MPI_Bcast(&x,1,MPI_INT,0,MPI_COMM_WORLD);
Notice the difference from your usage: the same MPI_Bcast call is made by all ranks. The root does a send and the others do a receive.
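For comparison, here is a minimal sketch (mine, not the answerer's) of how the question's program looks when rank 0 is meant to broadcast its x to everyone: every rank makes the same MPI_Bcast call, with the same root.
int x = 0;
if (rank == 0)
    x = 42;                                     /* only the root sets the value */
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* called by every rank, root = 0 */
printf("Process %d's value is: %d\n", rank, x); /* every rank now prints 42 */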

What is the right way to "notify" processors without blocking?

Suppose I have a very large array of things and I have to do some operation on all these things.
In case the operation fails for one element, I want to stop the work across the whole array [the work is distributed across a number of processors].
I want to achieve this while keeping the number of sent/received messages to a minimum.
Also, I don't want to block processors if there is no need to.
How can I do it using MPI?
This seems to be a common question with no easy answer. Both other answers have scalability issues. The ring-communication approach has linear communication cost, while in the one-sided MPI_Win solution a single process is hammered with memory requests from all workers. This may be fine for a low number of ranks, but it poses issues as the rank count increases.
Non-blocking collectives can provide a more scalable solution. The main idea is to post an MPI_Ibarrier on all ranks except one designated root. That root listens for point-to-point stop messages via MPI_Irecv and completes the MPI_Ibarrier once it receives one.
The tricky part is that there are four different cases, {root, non-root} x {found, not-found}, that need to be handled. Also, it can happen that multiple ranks send a stop message, requiring an unknown number of matching receives on the root. That can be solved with an additional reduction that counts the number of ranks that sent a stop request.
Here is an example of how this can look in C:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

const int iter_max = 10000;
const int difficulty = 20000;

int find_stuff()
{
    int num_iters = rand() % iter_max;
    for (int i = 0; i < num_iters; i++) {
        if (rand() % difficulty == 0) {
            return 1;
        }
    }
    return 0;
}

const int stop_tag = 42;
const int root = 0;

int forward_stop(MPI_Request* root_recv_stop, MPI_Request* all_recv_stop, int found_count)
{
    int flag;
    MPI_Status status;
    if (found_count == 0) {
        MPI_Test(root_recv_stop, &flag, &status);
    } else {
        // If we find something on the root, we actually wait until we receive our own message.
        MPI_Wait(root_recv_stop, &status);
        flag = 1;
    }
    if (flag) {
        printf("Forwarding stop signal from %d\n", status.MPI_SOURCE);
        MPI_Ibarrier(MPI_COMM_WORLD, all_recv_stop);
        MPI_Wait(all_recv_stop, MPI_STATUS_IGNORE);
        // We must post some additional receives if multiple ranks found something at the same time
        MPI_Reduce(MPI_IN_PLACE, &found_count, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
        for (found_count--; found_count > 0; found_count--) {
            MPI_Recv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &status);
            printf("Additional stop from: %d\n", status.MPI_SOURCE);
        }
        return 1;
    }
    return 0;
}

int main()
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    srand(rank);

    MPI_Request root_recv_stop;
    MPI_Request all_recv_stop;
    if (rank == root) {
        MPI_Irecv(NULL, 0, MPI_CHAR, MPI_ANY_SOURCE, stop_tag, MPI_COMM_WORLD, &root_recv_stop);
    } else {
        // You may want to use an extra communicator here, to avoid messing with other barriers
        MPI_Ibarrier(MPI_COMM_WORLD, &all_recv_stop);
    }

    while (1) {
        int found = find_stuff();
        if (found) {
            printf("Rank %d found something.\n", rank);
            // Note: We cannot post this as blocking, otherwise there is a deadlock with the reduce
            MPI_Request req;
            MPI_Isend(NULL, 0, MPI_CHAR, root, stop_tag, MPI_COMM_WORLD, &req);
            if (rank != root) {
                // We know that we are going to receive our own stop signal.
                // This avoids running another useless iteration
                MPI_Wait(&all_recv_stop, MPI_STATUS_IGNORE);
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                break;
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        if (rank == root) {
            if (forward_stop(&root_recv_stop, &all_recv_stop, found)) {
                break;
            }
        } else {
            int stop_signal;
            MPI_Test(&all_recv_stop, &stop_signal, MPI_STATUS_IGNORE);
            if (stop_signal) {
                MPI_Reduce(&found, NULL, 1, MPI_INT, MPI_SUM, root, MPI_COMM_WORLD);
                printf("Rank %d stopping after receiving signal.\n", rank);
                break;
            }
        }
    }

    MPI_Finalize();
}
While this is not the simplest code, it should:
Introduce no additional blocking
Scale with the implementation of a barrier (usually O(log N))
Have a worst-case latency, from one rank finding something to all ranks stopping, of 2 * loop time (+ 1 p2p + 1 barrier + 1 reduction).
If many or all ranks find a solution at the same time, it still works, but may be less efficient.
A possible strategy to derive this global stop condition in a non-blocking fashion is to rely on MPI_Test.
Scenario
Consider that each process posts an asynchronous receive of type MPI_INT from its left rank, with a given tag, so as to build a ring, and then starts its computation. If a rank encounters the stop condition, it sends its own rank to its right rank. In the meantime, each rank uses MPI_Test during the computation to check whether the MPI_Irecv has completed. If it has, the rank enters a branch where it first waits on the message and then transitively propagates the received rank to the right, unless the right rank is equal to the payload of the message (so as not to loop).
Once this is done, all processes should be in that branch, ready to trigger an arbitrary recovery operation.
Complexity
The ring topology was chosen because it minimizes the number of messages, at most (n-1), at the cost of a longer propagation time. Other topologies could be chosen that use more messages but propagate the stop condition faster, for example a tree with n·ln(n) message complexity.
Implementation
Something like this:
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

int left_rank  = (rank == 0) ? (size - 1) : (rank - 1);
int right_rank = (rank == (size - 1)) ? 0 : (rank + 1);

int stop_cond_rank;
MPI_Request stop_cond_request;
int stop_cond = 0;

/* Post the receive from the left neighbour once, before starting to compute
   (posting it inside the loop would leak one request per iteration) */
MPI_Irecv(&stop_cond_rank, 1, MPI_INT, left_rank, 123, MPI_COMM_WORLD, &stop_cond_request);

while (1)
{
    /* Compute here and set stop_cond accordingly */

    if (stop_cond)
    {
        /* Cancel the left recv */
        MPI_Cancel(&stop_cond_request);
        if (rank != right_rank)
            MPI_Send(&rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }

    int did_recv = 0;
    MPI_Test(&stop_cond_request, &did_recv, MPI_STATUS_IGNORE);
    if (did_recv)
    {
        stop_cond = 1;
        MPI_Wait(&stop_cond_request, MPI_STATUS_IGNORE);
        if (right_rank != stop_cond_rank)
            MPI_Send(&stop_cond_rank, 1, MPI_INT, right_rank, 123, MPI_COMM_WORLD);
        break;
    }
}

if (stop_cond)
{
    /* Handle the stop condition */
}
else
{
    /* Cleanup */
    MPI_Cancel(&stop_cond_request);
}
That is a question I have asked myself a few times without finding any completely satisfactory answer... The only thing I could think of (besides MPI_Abort(), which does the job but is a bit extreme) is to create an MPI_Win storing a flag that is raised by whichever process hits the problem, and checked regularly by all processes to see whether they can continue. The checking is done using non-blocking calls, the same way as described in this answer.
The main weaknesses of this are:
It depends on the processes willingly checking the status of the flag. There is no immediate interruption of their work to notify them.
The frequency of this checking must be tuned by hand. You have to find a trade-off between the time wasted processing data when something has already gone wrong elsewhere and the overhead of checking the status...
In the end, what we would need is a way of defining a callback action triggered by an MPI call such as MPI_Abort() (basically replacing the abort action with something else). I don't think this exists, but maybe I overlooked it.
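A rough sketch of that flag-in-a-window idea (my own illustration, not part of the original answer; it assumes the flag lives on rank 0 and every rank polls it between chunks of work):
int stop_flag = 0;
MPI_Win win;
/* every rank exposes one int, but only rank 0's copy is actually used */
MPI_Win_create(&stop_flag, sizeof(int), sizeof(int), MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);

/* a rank that hits the problem raises the flag on rank 0 */
int one = 1;
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Put(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
MPI_Win_unlock(0, win);

/* every rank polls the flag between chunks of work */
int flag_copy = 0;
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
MPI_Get(&flag_copy, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
MPI_Win_unlock(0, win);
if (flag_copy) {
    /* stop processing */
}

MPI_Win_free(&win);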

MPI message received in different communicator

It was my understanding that MPI communicators restrict the scope of communication, such
that messages sent from one communicator should never be received in a different one.
However, the program inlined below appears to contradict this.
I understand that the MPI_Send call returns before a matching receive is posted because of the internal buffering it does under the hood (as opposed to MPI_Ssend). I also understand that MPI_Comm_free doesn't destroy the communicator right away, but merely marks it for deallocation and waits for any pending operations to finish. I suppose that my unmatched send operation will be forever pending, but then I wonder how come the same object (integer value) is reused for the second communicator!?
Is this normal behaviour, a bug in the MPI library implementation, or is it that my program is just incorrect?
Any suggestions are much appreciated!
LATER EDIT: posted follow-up question
#include "stdio.h"
#include "unistd.h"
#include "mpi.h"
int main(int argc, char* argv[]) {
int rank, size;
MPI_Group group;
MPI_Comm my_comm;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_group(MPI_COMM_WORLD, &group);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 1) {
int msg = 123;
MPI_Send(&msg, 1, MPI_INT, 0, 0, my_comm);
printf("rank 1: message sent\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
sleep(2);
MPI_Comm_create(MPI_COMM_WORLD, group, &my_comm);
if (rank == 0) printf("created communicator %d\n", my_comm);
if (rank == 0) {
int msg;
MPI_Recv(&msg, 1, MPI_INT, 1, 0, my_comm, MPI_STATUS_IGNORE);
printf("rank 0: message received\n");
}
sleep(1);
if (rank == 0) printf("freeing communicator %d\n", my_comm);
MPI_Comm_free(&my_comm);
MPI_Finalize();
return 0;
}
outputs:
created communicator -2080374784
rank 1: message sent
freeing communicator -2080374784
created communicator -2080374784
rank 0: message received
freeing communicator -2080374784
The number you're seeing is simply a handle for the communicator. It's safe to reuse the handle since you've freed it.
As to why you're able to send the message, look at how you're creating the communicator. When you use MPI_Comm_group, you get the group containing the ranks associated with the specified communicator. In this case you get all of the ranks, since you asked for the group of MPI_COMM_WORLD. Then you use MPI_Comm_create to create a communicator based on a group of ranks, and you pass the same group you just got, which contains all of the ranks. So your new communicator has all of the ranks from MPI_COMM_WORLD.
If you want your communicator to contain only a subset of ranks, you'll need to use a different function (or several functions) to build the desired group(s). I'd recommend reading through Chapter 6 of the MPI Standard; it includes all of the functions you'll need. Pick what you need to build the communicator you want.
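For illustration (a sketch of mine, not from the answer above), the simplest way to get a communicator that really restricts scope to a subset of ranks is MPI_Comm_split; building a smaller group with MPI_Group_incl and passing it to MPI_Comm_create works as well:
/* Split MPI_COMM_WORLD into even-ranked and odd-ranked halves. */
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

MPI_Comm half_comm;
int color = world_rank % 2;   /* 0 = even ranks, 1 = odd ranks */
MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &half_comm);

int half_rank, half_size;
MPI_Comm_rank(half_comm, &half_rank);
MPI_Comm_size(half_comm, &half_size);
/* Messages sent on half_comm can only match receives posted on half_comm
   by members of the same half. */
MPI_Comm_free(&half_comm);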

How to improve this program so that the master also does work

In this MPI program, only the slave nodes do work. How can I modify it so that the master works too? Having the master do work as well should improve the performance of the system.
int main(int argc, char* argv[])
{
    int A, B, C, N, slaveid, recvid, root, rank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /*-------------------------- master ------------------------------*/
    if (rank == 0) {
        N = 10;
        for (slaveid = 1; slaveid < size; slaveid++) {
            MPI_Send(&N, 1, MPI_INT, slaveid, 1, MPI_COMM_WORLD);
        }
        for (recvid = 1; recvid < size; recvid++) {
            MPI_Recv(&A, 1, MPI_INT, recvid, 2, MPI_COMM_WORLD, &status);
            printf(" My id = %d and i send = %d\n", recvid, A);
        }
    }

    /*-------------------------- slave ------------------------------*/
    if (rank > 0) {
        MPI_Recv(&B, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        C = B * 3;
        MPI_Send(&C, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
Within the block delimited by
if(rank == 0){
}
insert, at the appropriate location, the line
work_like_a_slave(argument1, argument2,...)
The appropriate location is probably between the loop that sends messages and the loop that receives messages so that the master isn't entirely idle while the slaves toil.
Whether this has a measurable impact on performance depends on a number of factors your question doesn't give enough information about to make a good guess, such as: how many slaves there are, and therefore how busy the master is sending and receiving messages; how much work each process does compared with the messaging it does; and so on.
Be prepared, if the numbers work against you, for any measurable impact to be negative, that is, for pressing the master into service to actually slow down your computation.
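As a rough sketch (mine, not the answerer's) of what work_like_a_slave could amount to in this particular program, where the slave work is just C = B*3, the master can compute its own share between the send loop and the receive loop:
if (rank == 0) {
    N = 10;
    for (slaveid = 1; slaveid < size; slaveid++) {
        MPI_Send(&N, 1, MPI_INT, slaveid, 1, MPI_COMM_WORLD);
    }

    /* the master does the same kind of work as a slave while the others compute */
    int master_result = N * 3;
    printf(" My id = 0 and my result = %d\n", master_result);

    for (recvid = 1; recvid < size; recvid++) {
        MPI_Recv(&A, 1, MPI_INT, recvid, 2, MPI_COMM_WORLD, &status);
        printf(" My id = %d and i send = %d\n", recvid, A);
    }
}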

MPI in C and pointer transmissions

I am having a new little problem;
I have a little pointer called:
int *a;
Now, somewhere inside my main method, I allocate some space for it using the following lines and assign a value:
a = (int *) malloc(sizeof(int));
*a=5;
..and then I attempt to transmit it (say to process 1):
MPI_Bsend(a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
On the other end, if I try to receive that pointer
int *b;
MPI_Recv(b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("This is what I received: %d \n", *b);
I get an error about the buffer!
However if instead of declaring 'b' a pointer I do the following:
int b;
MPI_Recv(&b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
printf("This is what I received: %d \n", b);
...all seems to be good! Could someone help me figure out what's happening, and how to do this using only a pointer?
Thanks in advance!
The meaning of the line
MPI_Bsend(a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
is the following: "a is a place in memory where I have 1 integer. Send it."
In the code you posted above, this is absolutely true: a does point to an integer, and so it is sent. This is why you can receive it using your second method, since the meaning of the line
MPI_Recv(&b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
is "receive 1 integer, and store it at &b. b is a regular int, so it's all good. In the first receive, you're trying to receive an integer into an int* variable there is no allocated memory that b is pointing to, so Recv has nowhere to write to. However, I should point out:
NEVER pass a pointer's contents to another process in MPI
MPI processes cannot read each others' memory, and virtual addressing makes one process' pointer completely meaningless to another.
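As a rough illustration of sending the pointed-to data rather than the pointer (my sketch, assuming rank has already been obtained with MPI_Comm_rank), here is how a dynamically allocated buffer of several ints travels between two ranks:
/* Send the contents a pointer refers to, never the pointer value itself. */
#define N 4
int *buf = malloc(N * sizeof(int));   /* each rank owns its own memory */
if (rank == 0) {
    for (int i = 0; i < N; i++) buf[i] = i * 10;
    MPI_Send(buf, N, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* sends the 4 ints buf points to */
} else if (rank == 1) {
    MPI_Recv(buf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("received %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
}
free(buf);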
This problem is related to handling pointers and allocating memory; it's not an MPI-specific issue.
In your second variant, int b automatically allocates memory for one integer, and by passing &b you pass a pointer to an allocated memory segment. In your first variant, memory is allocated for the pointer itself, but NOT for the memory it points to. Thus, when you pass in the pointer, MPI tries to write to non-allocated memory, which causes the error.
It would work this way though:
int *b = (int *) malloc(sizeof(int));
MPI_Recv(b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
The error you are getting is that you are copying the result from MPI_Recv into some memory *b that you don't own and that isn't initialised.
I'm not an expert on MPI, but surely you can't transfer a pointer (i.e. a memory address) to a process that could be running on another machine!
