I don't completely understand how MPI's nonblocking communication and rendezvous protocol are supposed to interact.
First, consider this (simplified) code, which can block when the rendezvous protocol is used (assume we have 2 processes):
if (rank == 0) {
    MPI_Send(big_message, count, MPI_BYTE, /* dest */ 1, 0, MPI_COMM_WORLD);
    MPI_Recv(recv_buf, count, MPI_BYTE, /* source */ 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
    MPI_Send(big_message, count, MPI_BYTE, /* dest */ 0, 0, MPI_COMM_WORLD);
    MPI_Recv(recv_buf, count, MPI_BYTE, /* source */ 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
This can obviously deadlock when the message is too big to fit in the library's internal buffers, as the MPI_Send in both processes would then wait for a matching receive to be posted before either process reaches its MPI_Recv.
On my system, I have found the following modification to work:
MPI_Request request;
if (rank == 0) {
    MPI_Isend(big_message, count, MPI_BYTE, /* dest */ 1, 0, MPI_COMM_WORLD, &request);
    MPI_Recv(recv_buf, count, MPI_BYTE, /* source */ 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&request, MPI_STATUS_IGNORE);
} else {
    MPI_Isend(big_message, count, MPI_BYTE, /* dest */ 0, 0, MPI_COMM_WORLD, &request);
    MPI_Recv(recv_buf, count, MPI_BYTE, /* source */ 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&request, MPI_STATUS_IGNORE);
}
We use nonblocking communication for sending the message. Would my solution be correct on every implementation of MPI? I have read that implementations are not required to initiate any form of communication when MPI_Isend is called, and may perform it only upon the call to MPI_Wait. Would such an implementation break my code? My understanding is that in such circumstances MPI_Isend is basically a no-op and, in my code, both processes would wait in MPI_Recv for a send that never comes.
If my code is non-portable, is there a way of using nonblocking communication to fix it?
Saying that all communication can happen at the wait call is a simplistic view, and it is probably only accurate when all of the communication is nonblocking. Taken strictly, it would mean that your code deadlocks on the blocking receives, because the sends would happen after them. That does not happen.
For your case, section 3.7.4 of the standard says that "[...] a call to MPI_Wait that completes a send will eventually return if a matching receive has been started [... some notes...]". So, yes, your code is correct.
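For what it's worth, the fully nonblocking version of the exchange sidesteps the ordering question entirely, since neither call can block before the final wait. A minimal sketch (count and the buffer names are placeholders):

MPI_Request reqs[2];
int peer = 1 - rank;   /* the other of the two processes */

/* Start both transfers; neither call blocks. */
MPI_Isend(big_message, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(recv_buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

/* The library must make progress here, so this cannot deadlock. */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);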
I know that MPI_SENDRECV allows one to overcome the problem of deadlocks (that arises when we use the classic MPI_SEND and MPI_RECV functions).
I would like to know if MPI_SENDRECV(sent_to_process_1, receive_from_process_0) is equivalent to:
MPI_ISEND(sent_to_process_1, request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request1)
MPI_WAIT(request2)
with the nonblocking MPI_ISEND and MPI_IRECV functions?
From what I have seen, MPI_ISEND and MPI_IRECV create a fork (i.e. 2 processes). So if I follow this logic, the first call of MPI_ISEND generates 2 processes. One does the communication and the other calls MPI_IRECV, which forks itself into 2 processes.
But once the communication of the first MPI_ISEND is finished, does the second process call MPI_IRECV again? Following this logic, the equivalence above doesn't seem to hold...
Maybe I should change to this:
MPI_ISEND(sent_to_process_1, request1)
MPI_WAIT(request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request2)
But I think that this could also create deadlocks.
Could anyone give me another solution using MPI_ISEND, MPI_IRECV and MPI_WAIT that obtains the same behaviour as MPI_SENDRECV?
There are some dangerous lines of thought in the question and other answers. When you start a non-blocking MPI operation, the MPI library doesn't create a new process, thread, etc. You're thinking of something more like a parallel region in OpenMP, I believe, where new threads/tasks are created to do some work.
In MPI, starting a non-blocking operation is like telling the MPI library that you have some things that you'd like to get done whenever MPI gets a chance to do them. There are lots of equally valid options for when they are actually completed:
It could be that they all get done later when you call a blocking completion function (like MPI_WAIT or MPI_WAITALL). These functions guarantee that when the blocking completion call is done, all of the requests that you passed in as arguments are finished (in your case, the MPI_ISEND and the MPI_IRECV). Regardless of when the operations actually take place (see next few bullets), you as an application can't consider them done until they are actually marked as completed by a function like MPI_WAIT or MPI_TEST.
The operations could get done "in the background" during another MPI operation. For instance, if you do something like the code below:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
The MPI_ISEND and the MPI_IRECV would probably actually do the data transfers in the background during the MPI_BARRIER. This is because, as an application, you transfer "control" of your application to the MPI library during the MPI_BARRIER call. This lets the library make progress on any ongoing MPI operation that it wants. Most likely, by the time the MPI_BARRIER is complete, most other outstanding operations are complete too.
Some MPI libraries allow you to specify that you want a "progress thread". This tells the MPI library to start up another thread (note that thread != process) in the background that will actually do the MPI operations for you while your application continues in the main thread.
Remember that all of these in the end require that you actually call MPI_WAIT or MPI_TEST or some other function like it to ensure that your operation is actually complete, but none of these nonblocking functions spawn new threads or processes to do the work for you. Those calls really just put the operations on a list of things to do (which, in reality, is how most MPI libraries implement them).
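As an illustration of that "list of things to do" model, here is a sketch that overlaps computation with completion polling; do_some_work() and the buffer arguments are hypothetical placeholders:

MPI_Request req;
int done = 0;

MPI_Isend(buf, count, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);

/* Each MPI_Test call hands control to the library so it can make progress. */
while (!done) {
    do_some_work();   /* hypothetical computation not touching buf */
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
}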
The best way to think of how MPI_SENDRECV is implemented is as two non-blocking calls with one completion function:
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
How I usually do this on node i communicating with node i+1:
mpi_isend(send_to_process_iPlus1, requests(1))
mpi_irecv(recv_from_process_iPlus1, requests(2))
...
mpi_waitall(2, requests)
You can see how ordering your commands this way with non-blocking communication allows you (during the ... above) to perform any computation that does not rely on the send/recv buffers while the communication proceeds. Overlapping computation with communication is often crucial for maximizing performance.
mpi_sendrecv, on the other hand (while avoiding any deadlock issues), is still a blocking operation. Thus, your program must remain in that routine for the entire duration of the send/recv.
Final points: you can post more than 2 requests and wait on all of them the same way, using the same structure as for 2 requests. For instance, it's quite easy to start communication with node i-1 as well and wait on all 4 of the requests; see the sketch below. Using mpi_sendrecv you must always have a paired send and receive; what if you only want to send?
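A C sketch of that four-request pattern, assuming interior ranks (the buffer names and count are placeholders; edge ranks would use MPI_PROC_NULL or a conditional):

MPI_Request reqs[4];

/* Post everything first, then wait once. */
MPI_Isend(send_right, count, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(recv_right, count, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(send_left, count, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &reqs[2]);
MPI_Irecv(recv_left, count, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &reqs[3]);

/* ... computation that touches none of the four buffers ... */

MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);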
As we know, there is a thing called an MPI send buffer used during a send action.
And for the following code:
MPI_Isend(data, ..., &req);
...
MPI_Wait(&req, &status);
Is it safe to use data between MPI_Isend and MPI_Wait?
That is, will MPI_Isend use data as its internal send buffer?
And furthermore, if I don't use data anymore, can I tell MPI to use data as the send buffer rather than waste time copying it?
BTW, I've heard of MPI_Bsend, but I don't think it could save memory and time in this case.
MPI provides two kinds of operations: blocking and non-blocking. The difference between the two is when it is safe to reuse the data buffer passed to the MPI function.
When a blocking call like MPI_Send returns, the buffer is no longer needed by the MPI library and can be safely reused. On the other hand, non-blocking calls only initiate the corresponding operation and let it continue asynchronously. Only after a successful call to a routine like MPI_Wait, or after a positive test result from MPI_Test, can one safely reuse the buffer.
As for how the library utilises the user buffer, that is very implementation-specific. Shorter messages are usually copied to internal (for the MPI library) buffers for performance reasons. Longer messages are usually directly read from the user buffer and sent to the network, therefore the buffer will be in use by MPI until the whole message has been sent.
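A sketch of what that rule means in practice (the buffer name and arguments are placeholders):

MPI_Request req;

MPI_Isend(data, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

/* NOT safe here: for a large (rendezvous) message the library may still
 * be reading data, so modifying it is undefined per the standard.      */

MPI_Wait(&req, MPI_STATUS_IGNORE);

/* Safe: the send is complete and data can be reused. */
data[0] = 42.0;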
It is absolutely not safe to reuse data between MPI_Isend and MPI_Wait.
Between MPI_Isend and MPI_Wait you simply don't know when data can be reused. Only after MPI_Wait returns can you be sure that data has been sent and reuse it.
If you don't use data anymore, you should still call MPI_Wait at the end of your program so the request completes.
In a spatially decomposed 2D domain, I need to send particles to the 8 neighbors. I know how many I'm sending but not how many I'll receive from these neighbors.
I had implemented a code with MPI_Send(), MPI_Probe() and MPI_Recv() but I realized that it caused deadlocks whenever the message was too big.
I decided to go for non-blocking communications, but then I can't figure out in what order MPI_Isend, MPI_Irecv and MPI_Iprobe should be called. I definitely need to know how big my receiving buffer should be before actually calling MPI_Irecv, so I'm tempted by the order MPI_Isend(), then MPI_Iprobe(), then MPI_Irecv(). The problem is that MPI_Iprobe() always returns a flag equal to false and I get stuck in the while loop. As far as I understand, there is no obligation for MPI to actually complete the send before the call to MPI_Wait(), so I understand that MPI_Iprobe might never return true. But if so, how does one receive a message of unknown size with non-blocking MPI point-to-point communication?
You don't have to make all 3 operations non-blocking. You can use an MPI_ISEND with a regular MPI_PROBE and/or MPI_RECV. It sounds like that might be a better option for you.
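A sketch of that combination for a single neighbor, assuming the usual mpi.h/stdlib.h includes (the particle buffers, counts and tag are placeholders):

MPI_Request sreq;
MPI_Status status;
int incoming;

/* Nonblocking send: returns immediately, so every rank reaches the probe. */
MPI_Isend(send_particles, nsend, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &sreq);

/* Blocking probe: returns once a matching message is on the way,
 * and lets us size the receive buffer before receiving.          */
MPI_Probe(neighbor, 0, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &incoming);

double *recv_particles = malloc(incoming * sizeof(double));
MPI_Recv(recv_particles, incoming, MPI_DOUBLE, neighbor, 0,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Wait(&sreq, MPI_STATUS_IGNORE);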
I need clarification of my understanding of isend and issend as given in Send Types.
My understanding is that isend will return once the send buffer is free, i.e. when all the data has been released. issend, on the other hand, returns only when it receives an ack from the receiver about getting (or not getting) the entire data. Is this all there is to it?
Both MPI_Isend() and MPI_Issend() return immediately, but in both cases you can't use the send buffer immediately.
Think of the difference between MPI_Send() and MPI_Ssend():
MPI_Send() can be buffered, or it can be synchronous if the message is too large to be buffered locally; in the latter case it waits until sending the data to the corresponding receive operation is complete.
MPI_Ssend() is always synchronous: it always waits until sending the data to the corresponding receive operation is complete.
The inner workings of the corresponding "I"-operations are very similar, except that neither blocks (both return immediately). The difference lies only in when the MPI library signals to the user program that the send buffer can be reused, that is, when MPI_Wait() returns or MPI_Test() returns true - the so-called send-complete operation of the non-blocking send:
With MPI_Isend() this can happen either when the data has been copied locally into a buffer owned by the MPI library (if the message is below the "synchronous threshold"), or when the data has actually been moved to the sibling task: the send-complete operation can be local, in the case where the underlying send operation is buffered.
With MPI_Issend() MPI never buffers the data locally, and the "buffer-free condition" is signalled only after the data has actually been transferred (and probably ack'ed at a low level): the send-complete operation is non-local.
The MPI standard document is quite pedantic on these aspects. See section 3.7 Nonblocking Communication.
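An illustrative sketch of the difference in completion semantics (the buffer and arguments are placeholders):

MPI_Request req;

/* May complete as soon as the library has buffered the message locally. */
MPI_Isend(buf, count, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buffer reusable; receiver may not have matched yet */

/* Completes only once the matching receive has been started. */
MPI_Issend(buf, count, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buffer reusable AND the receive has matched */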
Correct. Obviously both of those will only be true when the request that you get back from the call to MPI_ISEND or MPI_ISSEND is completed via a MPI_WAIT* or MPI_TEST* function.
I am seeing that a small number of messages written to a non-blocking TCP socket using write(2) are not seen on the source interface and are also not received by the destination.
What could be the problem? Is there any way that the application can detect this and retry?
while (len > 0) {
    res = write(c->sock_fd, tcp_buf, len);
    if (res < 0) {
        switch (errno) {
        case EAGAIN:
        case EINTR:
            <handle case>
            break;
        default:
            <close connection>
        }
    } else {
        len -= res;
    }
}
A non-blocking write(2) means that, whatever the difficulties, the call will return. The proper way to detect what happened is to inspect the return value of the function.
If it returns -1, check errno. A value of EAGAIN means the write did not happen and you have to do it again.
It could also return a short write (i.e. a value less than the size of the buffer you passed it), in which case you'll probably want to retry the missing part.
If this is happening on short-lived sockets, also read The ultimate SO_LINGER page, or: why is my tcp not reliable. It explains a particular problem regarding the closing part of a transmission.
when we naively use TCP to just send the data we need to transmit, it often fails to do what we want - with the final kilobytes or sometimes megabytes of data transmitted never arriving.
and the conclusion is:
The best advice is to send length information, and to have the remote program actively acknowledge that all data was received.
It also describes a hack for Linux.
write() returns the number of bytes written; this might be less than the number of bytes you passed in, and can even be 0! Make sure you check this and retransmit whatever was dropped (due to not enough buffer space on the NIC, or whatever).
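Putting those rules together, a retry loop might look like this sketch (error handling abbreviated; a real program would wait with poll()/select() on EAGAIN instead of spinning):

const char *p = tcp_buf;
size_t remaining = len;

while (remaining > 0) {
    ssize_t res = write(c->sock_fd, p, remaining);
    if (res < 0) {
        if (errno == EINTR || errno == EAGAIN)
            continue;            /* interrupted or would block: retry */
        break;                   /* real error: close the connection  */
    }
    /* Short write: advance past the bytes that were accepted. */
    p += res;
    remaining -= res;
}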
You also want to read up on the TCP_NODELAY option and the nature of the TCP send buffer.
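For reference, disabling Nagle's algorithm with TCP_NODELAY looks like this, assuming the usual socket headers (netinet/in.h, netinet/tcp.h):

int one = 1;
if (setsockopt(c->sock_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
    perror("setsockopt(TCP_NODELAY)");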