I'm completely stuck. This code
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if(myrank==0) i=1;
if(myrank==1) i=0;
MPI_Send(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD);
MPI_Recv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&status);
does work when run on two processes. Why is there no deadlock?
The same thing happens with the non-blocking version:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if(myrank==0) i=1;
if(myrank==1) i=0;
MPI_Isend(sendbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&req);
MPI_Wait(&req, &status);
MPI_Irecv(recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD,&req);
MPI_Wait(&req, &status);
Logically the program should block (and deadlock), but it does not. Why?
How do I force MPI to block?
MPI has two kinds of point-to-point operations: blocking and non-blocking. This might be a bit confusing initially, especially when it comes to sending messages, but the blocking behaviour does not relate to the physical data transfer. Rather, it relates to the period during which MPI might still access the buffer (either the send or the receive buffer) and during which the application therefore must not modify its content (for send operations) or might still read old garbage from it (for receive operations).
When you make a blocking call to MPI, it only returns once it has finished using the send or the receive buffer. With receive operations this means that the message has been received and stored in the buffer, but with send operations things are more complex. There are several modes for the send operation:
buffered -- the data isn't sent immediately, but rather copied to a local user-supplied buffer, and the call returns right after that. The actual message transfer happens at some point in the future when MPI decides to do so. There is a special MPI call, MPI_Bsend, which always behaves this way, and there are calls for attaching and detaching the user-supplied buffer (MPI_Buffer_attach / MPI_Buffer_detach); see the sketch after this list.
synchronous -- MPI waits until a matching receive operation is posted and the data transfer is in progress. It returns once the entire message is in transit and the send buffer is free to be reused. There is a special MPI call, MPI_Ssend, which always behaves this way.
ready -- MPI tries to send the message, and it only succeeds if the receive operation has already been posted. The idea is to skip the handshake between the ranks, which may reduce latency, but it is unspecified what exactly happens if the receiver is not ready. There is a special call for that, MPI_Rsend, but it is advised not to use it unless you really know what you are doing.
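Here is a minimal sketch of the buffered mode mentioned above, assuming exactly two ranks, with the attach buffer sized via MPI_Pack_size plus MPI_BSEND_OVERHEAD:
/* Minimal sketch of a buffered-mode exchange: attach a user buffer, send
 * through it with MPI_Bsend, then detach the buffer. Assumes exactly two
 * ranks; error handling is omitted for brevity. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int myrank, other, data, recvd, bufsize;
    char *bsend_buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    other = 1 - myrank;                 /* the peer rank */
    data = myrank;

    /* Size the attached buffer: the packed size of one int plus the
     * per-message overhead required by the standard. */
    MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &bufsize);
    bufsize += MPI_BSEND_OVERHEAD;
    bsend_buf = malloc(bufsize);
    MPI_Buffer_attach(bsend_buf, bufsize);

    /* MPI_Bsend returns as soon as the message is copied into the attached
     * buffer, so both ranks may send before either of them receives. */
    MPI_Bsend(&data, 1, MPI_INT, other, 99, MPI_COMM_WORLD);
    MPI_Recv(&recvd, 1, MPI_INT, other, 99, MPI_COMM_WORLD, &status);

    /* Detaching blocks until all buffered messages have been delivered. */
    MPI_Buffer_detach(&bsend_buf, &bufsize);
    free(bsend_buf);

    MPI_Finalize();
    return 0;
}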
MPI_Send invokes what is known as the standard send mode, which may behave like either the synchronous or the buffered mode, with the latter using a buffer supplied by the MPI library rather than the user-supplied one. The actual details are left to the MPI implementation and hence differ wildly.
It is most often the case that small messages are buffered while larger messages are sent synchronously, which is what you observe here, but one should never rely on this behaviour, and the definition of "small" varies with the implementation and the network. A correct MPI program will not deadlock if all MPI_Send calls are replaced with MPI_Ssend, which means you must never assume that small messages are buffered. Conversely, a correct program should not expect MPI_Send to be synchronous for larger messages and rely on that for synchronisation between the ranks, i.e., replacing MPI_Send with MPI_Bsend and providing a large enough buffer should not alter the program behaviour.
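As an illustration, here is a minimal sketch of an exchange that stays correct even when every send is synchronous, assuming exactly two ranks:
/* Sketch: the two ranks post their send and receive in opposite order,
 * so each send always finds a matching receive, even with MPI_Ssend. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int myrank, other, sendbuf, recvbuf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    other = 1 - myrank;
    sendbuf = myrank;

    if (myrank == 0) {
        /* Rank 0 sends first, then receives. */
        MPI_Ssend(&sendbuf, 1, MPI_INT, other, 99, MPI_COMM_WORLD);
        MPI_Recv(&recvbuf, 1, MPI_INT, other, 99, MPI_COMM_WORLD, &status);
    } else {
        /* Rank 1 receives first, then sends. */
        MPI_Recv(&recvbuf, 1, MPI_INT, other, 99, MPI_COMM_WORLD, &status);
        MPI_Ssend(&sendbuf, 1, MPI_INT, other, 99, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}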
There is a very simple solution that always works and frees you from having to remember not to rely on any assumptions -- simply use MPI_Sendrecv. It is a combined send-and-receive operation that never deadlocks, except when the send or the receive operation (or both) is unmatched. With send-receive, your code would look like this:
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank==0) i=1;
if (myrank==1) i=0;
MPI_Sendrecv(sendbuf, 1, MPI_INT, i, 99,
             recvbuf, 1, MPI_INT, i, 99, MPI_COMM_WORLD, &status);
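If you want to reuse a single buffer for both directions, there is also MPI_Sendrecv_replace, which overwrites the send buffer with the received data. A minimal complete sketch, assuming exactly two ranks:
/* Sketch: exchange one int with MPI_Sendrecv_replace, which reuses a
 * single buffer for both the outgoing and the incoming message. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myrank, i, buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    i = 1 - myrank;        /* the other rank */
    buf = myrank;          /* value to send; replaced by the received value */

    MPI_Sendrecv_replace(&buf, 1, MPI_INT,
                         i, 99,        /* destination and send tag */
                         i, 99,        /* source and receive tag */
                         MPI_COMM_WORLD, &status);

    printf("Rank %d received %d\n", myrank, buf);

    MPI_Finalize();
    return 0;
}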
Usually, one would have to define a new datatype and register it with MPI in order to use it. I am wondering about using protobuf to serialize an object and sending it over MPI as a byte stream. I have two questions:
(1) do you foresee any problem with this approach?
(2) do I need to send length information through a separate MPI_Send(), or can I probe and use MPI_Get_count(&status, MPI_BYTE, &count)?
An example would be:
// sender
MyObj myobj;
...
size_t size = myobj.ByteSizeLong();
void *buf = malloc(size);
myobj.SerializePartialToArray(buf, size);
MPI_Isend(buf, size, MPI_BYTE, ... )
...
// receiver
MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
if (flag) {
    MPI_Get_count(&status, MPI_BYTE, &size);
    MPI_Recv(buf, size, MPI_BYTE, ... , &status);
    MyObject obj;
    obj.ParseFromArray(buf, size);
    ...
}
Generally you can do that. Your code sketch also looks fine (except for the omitted buf allocation on the receiver side). As Gilles points out, make sure to use status.MPI_SOURCE and status.MPI_TAG for the actual MPI_Recv, not the MPI_ANY_* wildcards.
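For reference, a minimal runnable sketch of that receive pattern, with a plain byte payload standing in for the protobuf-serialized object (names and sizes are illustrative):
/* Sketch (assuming two ranks): rank 1 sends a variable-length byte message;
 * rank 0 probes it, sizes the buffer with MPI_Get_count, and then receives
 * exactly the message it probed by using status.MPI_SOURCE and
 * status.MPI_TAG instead of the wildcards again. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        char payload[] = "serialized-object-bytes";   /* stand-in for the protobuf data */
        MPI_Send(payload, (int)sizeof payload, MPI_BYTE, 0, 42, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Status status;
        int size;
        char *buf;

        /* Blocking probe: waits until a message is available and fills in
         * its envelope (source, tag, size) without receiving it. */
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_BYTE, &size);

        buf = malloc(size);                 /* the allocation the question omitted */

        /* Receive the exact message that was probed. */
        MPI_Recv(buf, size, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("Received %d bytes: %s\n", size, buf);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}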
However, there are some performance limitations.
Protobuf isn't very fast, particularly due to en-/decoding. How much that matters depends on your performance expectations; if you run on a high-performance network, assume a significant impact. Here are some basic benchmarks.
Not knowing the message size ahead of time, and thus always posting the receive after the send, also has performance implications. It means the actual transmission will likely start later, which may or may not have an impact on the sender's side since you are using non-blocking sends. There can also be cases where you run into practical limits on the number of unexpected messages. That is not a general correctness issue, but it might require some configuration tuning.
If you go ahead with your approach, remember to do some performance analysis on the implementation. Use an MPI-aware performance analysis tool to make sure your approach doesn't introduce critical bottlenecks.
The application I'm working on involves multiple processes all working on similar tasks, and only occasionally sharing information. I have a working implementation using openMPI, but I'm having issues with messages sometimes being received much later than they are sent.
At the moment, the main loop of each process begins by processing any waiting messages, then does a whole bunch of computation, then sends the results to each of the other processes using MPI_Isend. Something like:
while problem is unsolved:
    bool flag;
    MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
    while flag:
        MPI_Recv(&message, ...);
        // update local information based on message contents
        MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);

    result = doComputation(); // between 2s and 1m

    MPI_Request request;
    for dest in other_processes:
        MPI_Isend(result, ..., dest, &request);
        MPI_Wait(&request, MPI_STATUS_IGNORE); // Doesn't seem to make any difference
This works OK, but the following problem often occurs: Process X sends a message, but the next time Process Y probes, it finds no message. Often it is only one or two loops (and many seconds) later that Process Y gets the message sent by Process X. The messages always get through eventually, but Process Y shouldn't have to wait until the second or third time it probes to actually receive the message.
I don't have a solid understanding of how MPI works, but from reading other questions I think the problem has something to do with MPI not having a chance to progress the message, because in my program the MPI functions are only called very occasionally, rather than within a tight loop. Trying to give MPI some extra time to progress, I added a few dummy calls to Iprobe:
bool flag;
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
while flag:
    MPI_Recv(&message, ...);
    // update local information based on message contents
    MPI_Iprobe(MPI_ANY_SOURCE, ..., &flag, MPI_STATUS_IGNORE);
And it works - any sent messages are always received the very next time a process probes for them.
But this is ugly and I suspect it may give different results on different platforms. So is there an alternative way to give MPI some time to complete message progression before probing? I don't want to use a blocking receive without probing, because Process Y should be able to continue its computation when there are no messages waiting.
Thanks.
I know that MPI_SENDRECV allows one to overcome the problem of deadlocks (which arises when we use the classic MPI_SEND and MPI_RECV functions).
I would like to know if MPI_SENDRECV(sent_to_process_1, receive_from_process_0) is equivalent to:
MPI_ISEND(sent_to_process_1, request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request1)
MPI_WAIT(request2)
with asynchronous MPI_ISEND and MPI_IRECV functions?
From what I have seen, MPI_ISEND and MPI_IRECV create a fork (i.e. 2 processes). So if I follow this logic, the first call of MPI_ISEND generates 2 processes. One does the communication and the other calls MPI_IRECV, which forks itself into 2 processes.
But once the communication of the first MPI_ISEND is finished, does the second process call MPI_IRECV again? With this logic, the above equivalence doesn't seem to be valid...
Maybe I should change to this:
MPI_ISEND(sent_to_process_1, request1)
MPI_WAIT(request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request2)
But I think that this could also create deadlocks.
Could anyone give me another solution using MPI_ISEND, MPI_IRECV and MPI_WAIT that obtains the same behaviour as MPI_SENDRECV?
There are some dangerous lines of thought in the question and the other answers. When you start a non-blocking MPI operation, the MPI library doesn't create a new process/thread/etc. You're thinking of something more like a parallel region in OpenMP, I believe, where new threads/tasks are created to do some work.
In MPI, starting a non-blocking operation is like telling the MPI library that you have some things that you'd like to get done whenever MPI gets a chance to do them. There are lots of equally valid options for when they are actually completed:
It could be that they all get done later when you call a blocking completion function (like MPI_WAIT or MPI_WAITALL). These functions guarantee that when the blocking completion call is done, all of the requests that you passed in as arguments are finished (in your case, the MPI_ISEND and the MPI_IRECV). Regardless of when the operations actually take place (see next few bullets), you as an application can't consider them done until they are actually marked as completed by a function like MPI_WAIT or MPI_TEST.
The operations could get done "in the background" during another MPI operation. For instance, if you do something like the code below:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
The MPI_ISEND and the MPI_IRECV would probably actually do the data transfer in the background during the MPI_BARRIER. This is because, as an application, you transfer "control" to the MPI library during the MPI_BARRIER call, which lets the library make progress on any ongoing MPI operations it wants. Most likely, when the MPI_BARRIER completes, so have most other operations that were started before it.
Some MPI libraries allow you to specify that you want a "progress thread". This tells the MPI library to start up another thread (note that a thread is not a process) in the background that will actually do the MPI operations for you while your application continues in the main thread.
Remember that all of these in the end require that you actually call MPI_WAIT or MPI_TEST or some similar function to ensure that your operation is actually complete, but none of them spawn new threads or processes to do the work for you when you call your non-blocking functions. Those calls really just act as if you stick the operations on a list of things to do (which, in reality, is how most MPI libraries implement them).
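As a small illustration of completing a request with MPI_TEST while doing other work (do_local_work is just a placeholder, and exactly two ranks are assumed):
/* Sketch: start a non-blocking send, then poll it with MPI_Test between
 * chunks of local work. The MPI_Test calls also give the MPI library a
 * chance to make progress on the outstanding operation. */
#include <mpi.h>

static void do_local_work(void)
{
    /* placeholder: some computation that does not touch the send buffer */
}

int main(int argc, char **argv)
{
    int rank, data, done = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    data = rank;

    if (rank == 0) {
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        while (!done) {
            do_local_work();                          /* overlap computation */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* non-blocking completion check */
        }
    } else if (rank == 1) {
        int recvd;
        MPI_Recv(&recvd, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}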
The best way to think of how MPI_SENDRECV is implemented is to do two non-blocking calls with one completion function:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
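Spelled out for a two-rank exchange like the one in your question, a minimal runnable sketch of that pattern could look like this (buffer contents are illustrative):
/* Sketch: the send-receive pattern built from non-blocking calls. Both
 * operations are started before either is waited on, so neither rank can
 * deadlock regardless of how the sends are handled. Assumes two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myrank, other, sendbuf, recvbuf;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    other = 1 - myrank;
    sendbuf = myrank;

    MPI_Isend(&sendbuf, 1, MPI_INT, other, 99, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 99, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    printf("Rank %d received %d\n", myrank, recvbuf);

    MPI_Finalize();
    return 0;
}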
How I usually do this on node i communicating with node i+1:
mpi_isend(send_to_process_iPlus1, requests(1))
mpi_irecv(recv_from_process_iPlus1, requests(2))
...
mpi_waitall(2, requests)
You can see how ordering your calls this way with non-blocking communication allows you (during the ... above) to perform any computation that does not rely on the send/recv buffers while the communication proceeds. Overlapping computation with communication is often crucial for maximizing performance.
mpi_sendrecv, on the other hand (while avoiding any deadlock issues), is still a blocking operation. Thus, your program must remain in that routine during the entire send/recv process.
Final points: you can initialize more than 2 requests and wait on all of them in the same way as with 2 requests, using the structure above. For instance, it's quite easy to start communication with node i-1 as well and wait on all 4 of the requests; see the sketch below. Using mpi_sendrecv you must always have a paired send and receive; what if you only want to send?
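In C, a minimal sketch of that four-request exchange with both neighbours might look as follows; ranks at the ends of the line use MPI_PROC_NULL, which turns the corresponding operations into no-ops:
/* Sketch: exchange one int with both neighbours (rank-1 and rank+1) using
 * four non-blocking calls and a single MPI_Waitall. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, left, right;
    int send_left, send_right, recv_left, recv_right;
    MPI_Request reqs[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
    send_left = send_right = rank;

    MPI_Irecv(&recv_left,  1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_right, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&send_left,  1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&send_right, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* ... computation that does not touch the four buffers goes here ... */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}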
I have a routine named transfer(int) which calls MPI routines. In my main program, transfer() is called twice.
... // do some work
transfer(1); // first transfer
... // do some work again
transfer(2); // second transfer
The transfer(int) function looks like this
... // do some work
MPI_Barrier(MPI_COMM_WORLD);
t0 = clock();
for (int k = 0; k < mpisize; k++) {
    MPI_Irecv((void*)rbuffer[k], rsize[k], MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k);
    MPI_Isend((void*)sbuffer[k], ssize[k], MPI_BYTE, k, 0, MPI_COMM_WORLD, reqs + 2*k + 1);
}
MPI_Waitall(2*mpisize, reqs, status);
MPI_Barrier(MPI_COMM_WORLD);
if (mpirank==0) cerr << "Transfer took "<< (double)(clock()-t0)/CLOCKS_PER_SEC << " secs" << endl;
Note that I only measure the communication time, excluding the pre-processing.
For transfer(1), all send and receive buffers have size 0 for each k. So essentially, there's no communication going on. Yet, the transfer took 1.95 seconds.
For transfer(2), each processor has to send/receive about 20KB to/from every other processor. Yet, the whole transfer took only 0.07 seconds.
I ran the experiment many times with 1024 processors and the measurements are consistent. Can you explain this phenomenon or what could possibly be wrong?
Thanks.
You could use an MPI performance analysis tool such as Vampir or Scalasca to better understand what is happening:
Are all communications slow, or only a few while the rest wait at the barrier?
What is the influence of the barrier?
The actual answer highly depends on your system and MPI implementation. Anycorn's comment that zero-size messages still require communication, and that the first communication can have additional overhead, is a good starting point for investigation. So another question you should try to answer is:
How does a second zero-size message behave?
Also, MPI implementations can handle messages of different sizes in fundamentally different ways, e.g. by using unexpected-message buffers, but that again is implementation- and system-dependent.
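If you suspect one-off setup costs, a rough way to test that is to repeat the zero-size exchange several times and time each round separately, for example with MPI_Wtime (a sketch under that assumption):
/* Sketch: time several rounds of an all-to-all exchange of zero-size
 * messages. If only the first round is slow, the cost is one-off setup
 * (e.g. connection establishment), not the zero-size messages themselves. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, round, k;
    char dummy_s, dummy_r;              /* zero-size messages need no real payload */
    MPI_Request *reqs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    reqs = malloc(2 * size * sizeof(MPI_Request));

    for (round = 0; round < 5; round++) {
        double t0;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();               /* wall-clock time, unlike clock() */

        for (k = 0; k < size; k++) {
            MPI_Irecv(&dummy_r, 0, MPI_BYTE, k, 0, MPI_COMM_WORLD, &reqs[2 * k]);
            MPI_Isend(&dummy_s, 0, MPI_BYTE, k, 0, MPI_COMM_WORLD, &reqs[2 * k + 1]);
        }
        MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("round %d: %.6f s\n", round, MPI_Wtime() - t0);
    }

    free(reqs);
    MPI_Finalize();
    return 0;
}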
I am looking at the code here, which I am working through for practice.
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.html
I am confused about the part shown here.
MPI::COMM_WORLD.Reduce(&mypi, &pi, 1, MPI::DOUBLE, MPI::SUM, 0);
if (rank == 0)
cout << "pi is approximately " << pi
<< ", Error is " << fabs(pi - PI25DT)
<< endl;
My question is: does the MPI reduce function know when all the other processes (in this case the programs with ranks 1-3) have finished and that its result is complete?
All collective communication calls (Reduce, Gather, Scatter, etc) are blocking.
@g.inozemtsev is correct. The MPI collective calls -- including those in Open MPI -- are "blocking" in the MPI sense of the word, meaning that you can use the buffer when the call returns. In an operation like MPI_REDUCE, it means that the root process will have the answer in its buffer when it returns. Further, it means that non-root processes in an MPI_REDUCE can safely overwrite their buffer when MPI_REDUCE returns (which usually means that their part in the reduction is complete).
However, note that as mentioned above, the return from a collective operation such as an MPI_REDUCE in one process has no bearing on the return of the same collective operation in a peer process. The only exception to this rule is MPI_BARRIER, because barrier is defined as an explicit synchronization, whereas all the other MPI-2.2 collective operations do not necessarily need to explicitly synchronize.
As a concrete example, say that all non-root processes call MPI_REDUCE at time X. The root finally calls MPI_REDUCE at time X+N (for this example, assume N is large). Depending on the implementation, the non-root processes may return much earlier than X+N or they may block until X+N(+M). The MPI standard is intentionally vague on this point to allow MPI implementations to do what they want / need (which may also be dictated by resource consumption/availability).
Hence, @g.inozemtsev's point of "You cannot rely on synchronization" (except with MPI_BARRIER) is correct.
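To observe this, you can delay the root before the reduction and time how long each rank spends inside the call; a small sketch (the 5-second POSIX sleep and the outcome are illustrative and implementation-dependent):
/* Sketch: delay the root before the reduction and time how long each rank
 * spends inside MPI_Reduce. Depending on the implementation, non-root ranks
 * may return long before the root has even entered the call. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, val, sum;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = rank;

    if (rank == 0)
        sleep(5);                       /* the root arrives late at the reduction */

    t0 = MPI_Wtime();
    MPI_Reduce(&val, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* A non-root rank that reports much less than 5 s here returned before
     * the root joined: the collective implied no synchronization. */
    printf("rank %d spent %.3f s in MPI_Reduce\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}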