I got slightly mixed up about the concepts of synchronous and asynchronous in the context of blocking and non-blocking operations (in Open MPI), starting from here:
link 1
: MPI_Isend is not necessarily asynchronous (so it can be synchronous?)
link 2
: The MPI_Isend() and MPI_Irecv() are the ASYNCHRONOUS communication primitives of MPI.
I have already gone through the previous sync/async/blocking/non-blocking questions on Stack Overflow (asynchronous vs non-blocking), but they were of no help to me.
As far as I know:
Immediate (MPI_Isend): the method returns and the next line executes -> non-blocking
Standard/non-immediate (MPI_Send): for large messages it blocks until the transfer completes
A synchronous operation blocks (http://www.cs.unc.edu/~dewan/242/s07/notes/ipc/node9.html)
An asynchronous operation is non-blocking (http://www.cs.unc.edu/~dewan/242/s07/notes/ipc/node9.html)
So how and why can MPI_Isend be blocking (link 1) as well as non-blocking (link 2)?
i.e. what is meant by an asynchronous vs. a synchronous MPI_Isend here?
Similar confusion arises regarding MPI_Ssend and MPI_Issend, since the S in MPI_Ssend means synchronous (or blocking), and:
MPI_Ssend: synchronous send; blocks until the data is received by the remote process and the acknowledgement is received by the sender,
MPI_Issend: means immediate synchronous send,
and Immediate means non-blocking.
So how can MPI_Issend be synchronous and yet return immediately?
I guess more clarity is needed on asynchronous and synchronous in the context of blocking and non-blocking Open MPI communication.
A practical example or analogy in this regard would be very useful.
There is a distinction between when MPI function calls return (blocking vs. non-blocking) and when the corresponding operation completes (standard, synchronous, buffered, ready-mode).
Non-blocking calls (MPI_I...) return immediately, whether or not the operation has completed; the operation continues in the background, i.e. asynchronously. Blocking calls do not return until the operation has completed. A non-blocking operation is represented by a request handle, which can be used to perform a blocking wait (MPI_WAIT) or a non-blocking test (MPI_TEST) for completion.
Completion of an operation means that the supplied data buffer is no longer being accessed by MPI and it could therefore be reused. Send buffers become free for reuse either after the message has been put in its entirety on the network (including the case where part of the message might still be buffered by the network equipment and/or the driver), or has been buffered somewhere by the MPI implementation. The buffered case does not require that the receiver has posted a matching receive operation and hence is not synchronising - the receive could be posted at a much later time. The blocking synchronous send MPI_SSEND does not return unless the receiver has posted a receive operation, thus it synchronises both ranks. The non-blocking synchronous send MPI_ISSEND returns immediately but the asynchronous (background) operation won't complete unless the receiver has posted a matching receive.
A blocking operation is equivalent to a non-blocking one immediately followed by a wait. For example:
MPI_Ssend(buf, len, MPI_TYPE, dest, tag, MPI_COMM_WORLD);
is equivalent to:
MPI_Request req;
MPI_Status status;
MPI_Issend(buf, len, MPI_TYPE, dest, tag, MPI_COMM_WORLD, &req);
MPI_Wait(&req, &status);
The standard send (MPI_SEND / MPI_ISEND) completes once the message has been constructed and the data buffer provided as its first argument might be reused. There is no synchronisation guarantee - the message might get buffered locally or remotely. With most implementations, there is usually some size threshold: messages up to that size get buffered while longer messages are sent synchronously. The threshold is implementation-dependent.
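The implementation-dependent threshold mentioned above has a practical consequence: code that relies on MPI_SEND buffering small messages can deadlock once messages grow past the eager threshold. A minimal sketch (assumes an MPI environment, run with 2 ranks; the message length is an arbitrary value chosen to likely exceed the threshold):

```c
/* Both ranks send first, then receive. Small messages are buffered
 * eagerly and this "works"; above the eager threshold both MPI_Send
 * calls wait for a matching receive that is never posted -> deadlock. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 22;   /* likely above any eager threshold */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(n * sizeof(double));
    int peer = 1 - rank;     /* rank 0 <-> rank 1 */

    /* Unsafe pattern: correctness depends on internal buffering */
    MPI_Send(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Replacing the Send/Recv pair with MPI_Sendrecv, or with MPI_Isend + MPI_Irecv followed by MPI_Waitall, removes the dependence on buffering.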
Buffered sends always buffer the messages into a user-supplied intermediate buffer, essentially performing a more complex memory copy operation. The difference between the blocking (MPI_BSEND) and the non-blocking version (MPI_IBSEND) is that the former does not return before all the message data has been buffered.
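A minimal fragment illustrating the buffered mode (assumes an MPI environment; dest and tag are placeholders as in the earlier example):

```c
int data[1000];
int bufsize = sizeof(data) + MPI_BSEND_OVERHEAD;
void *mpibuf = malloc(bufsize);

MPI_Buffer_attach(mpibuf, bufsize);
MPI_Bsend(data, 1000, MPI_INT, dest, tag, MPI_COMM_WORLD); /* completes after the local copy */
MPI_Buffer_detach(&mpibuf, &bufsize); /* blocks until all buffered messages have been sent */
free(mpibuf);
```

Because MPI_Bsend completes as soon as the message has been copied into the attached buffer, it never synchronises with the receiver; running out of attached buffer space is an error rather than a reason to block.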
The ready send is a very special kind of operation. It only completes successfully if the destination rank has already posted the receive operation by the time the sender makes the send call. It might reduce the communication latency by eliminating the need to perform some kind of a handshake.
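Since ready mode is only correct when the receive is already posted, the sender typically needs some out-of-band guarantee. One possible pattern, sketched under the assumption that the receiver signals readiness with a zero-byte message (this signalling is an application convention, not part of the ready-send API):

```c
/* receiver (rank 1): post the receive, then signal readiness */
MPI_Irecv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
MPI_Send(NULL, 0, MPI_BYTE, 0, 1, MPI_COMM_WORLD);   /* "I'm ready" */
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* sender (rank 0): wait for the signal, then ready-send */
MPI_Recv(NULL, 0, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Rsend(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); /* receive is guaranteed posted */
```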
Related
According to the documentation, MPI_Ssend and MPI_Issend are a blocking and a non-blocking send operation, respectively, both synchronous. The MPI specification says that a synchronous send completes when the receiver has started to receive the message, and that after that it is safe to update the send buffer:
The functions MPI_WAIT and MPI_TEST are used to complete a nonblocking communication. The completion of a send operation indicates that the sender is now free to update the locations in the send buffer (the send operation itself leaves the content of the send buffer unchanged). It does not indicate that the message has been received, rather, it may have been buffered by the communication subsystem. However, if a synchronous mode send was used, the completion of the send operation indicates that a matching receive was initiated, and that the message will eventually be received by this matching receive.
Bearing in mind that a synchronous send is considered complete when it has merely started to be received, I am not sure about the following:
Is it possible that only a part of the data has been read from the send buffer at the moment when MPI_Ssend or MPI_Issend signals send completion? For example, the first N bytes have been sent and received while the next M bytes are still being sent.
How can the caller safely modify the data before the whole message is received? Does it mean that the data is necessarily copied to a system buffer? As far as I understand, the MPI standard permits the use of a system buffer but does not require it. Moreover, from here I read that MPI_Issend() doesn't ever buffer data locally.
MPI_Ssend() (or the MPI_Wait() related to MPI_Issend()) returns when :
the receiver has started to receive the message
and the send buffer can be reused
The second condition is met if the message has been fully received, or if the MPI library has buffered the data locally.
I did not read anywhere that the MPI standard prohibits data buffering.
From the standard, MPI 3.1, chapter 3.4, page 37:
A send that uses the synchronous mode can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but it also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. If both sends and receives are blocking operations then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication. A send executed in this mode is non-local.
In the MPI standard, section 3.4 (page 37): http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
synchronous send completion means:
1. the send buffer can be reused, and
2. the receiver has started to receive data.
The standard says "has started" instead of "has completed", so I have a question about this: Imagine a case:
The sender calls MPI_Ssend, then a receiver is matched and has started to receive data. At this point the send is complete and returns. As the MPI standard says, the send buffer can now be reused, so the sender modifies some data in the send buffer. At the same time, the receiver is receiving the data very slowly (e.g. the network is very bad), so how can we guarantee that the data finally received by the receiver is the same as the original data stored in the sender's send buffer?
Ssend is synchronous. That means that Ssend cannot return before the corresponding Recv has been called.
Ssend is blocking. That means that the function returns only when it is safe to touch the send buffer.
Synchronous and blocking are two different things; I know it can be confusing.
Most implementations of Send work as follows (MPICH, Open MPI, Cray MPI):
For small messages, the send buffer is copied to memory reserved for MPI. As soon as the copy is done, the send returns.
For large messages, no copy is done, so the send returns only once the entire send buffer has been put on the network (which cannot happen before the Recv has been called, to avoid overloading the network memory).
So MPI_Send is: blocking, asynchronous for small messages, synchronous for large ones.
Ssend works as follows:
As soon as the Recv has started AND the send buffer has either been copied or is fully on the network, the Ssend returns.
Ssend should be avoided as much as possible, as it slows down the communication (because the network needs to tell the sender that the recv has started).
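The decoupling of call return and operation completion can be observed directly with the non-blocking synchronous send. A minimal fragment (assumes an MPI environment):

```c
/* MPI_Issend returns immediately, but the request does not complete
 * until the matching receive has been started on the other rank. */
MPI_Request req;
int flag;

MPI_Issend(buf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req); /* returns at once */
MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
/* flag stays 0 for as long as the receiver has not posted a matching
 * receive; once it has, a later MPI_Test (or MPI_Wait) completes. */
```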
Suppose my MPI process is waiting for a very big message, and I am waiting for it with MPI_Probe. Is it correct to suppose the MPI_Probe call will return as soon as the process receives the first notice of the message from the network (like a header with the size or something like that)?
I.e., will it return much faster than if I was waiting for the message with MPI_Recv, because it wouldn't need to receive the full message?
The standard is fairly silent on this matter (MPI-3.0, section 3.8.1), but does offer this:
The MPI implementation of MPI_PROBE and MPI_IPROBE needs to guarantee progress: if a call to MPI_PROBE has been issued by a process, and a send that matches the probe has been initiated by some process, then the call to MPI_PROBE will return, unless the message is received by another concurrent receive operation (that is executed by another thread at the probing process).
Since both MPI_PROBE and MPI_RECV will engage the progress engine, I would doubt there is much difference between the two functions, aside from a memory copy. By engaging the progress engine, it's likely the message will be received (internally) by the MPI implementation. The last step of copying it into the user's buffer can be avoided in MPI_PROBE.
If you are worried about performance, then avoiding MPI_ANY_SOURCE and MPI_ANY_TAG if possible will help most implementations (certainly MPICH) take a faster path.
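The common use of MPI_Probe is not to return early but to learn the size of an incoming message before allocating for it. A minimal fragment (assumes an MPI environment; src and tag are placeholders):

```c
/* Probe for a pending message of unknown size, allocate, then receive. */
MPI_Status status;
int count;

MPI_Probe(src, tag, MPI_COMM_WORLD, &status);   /* blocks until a matching message is pending */
MPI_Get_count(&status, MPI_INT, &count);        /* how many MPI_INTs it carries */

int *buf = malloc(count * sizeof(int));
MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```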
I have a program where there is a master/slave setup, and I have some functions implemented for the master which sends different kinds of data to the slaves. Some functions send to individual slaves, but some broadcast information to all the slaves via MPI_Bcast.
I want to have only one receive function in the slaves, so I want to know if I can probe for a message and know if it was broadcasted or sent as a normal blocking message, since there are different method to receive what was broadcasted and what was sent normally.
No, you can't decide whether to call Bcast or Recv on the basis of a probe call.
An MPI_Bcast call is a collective operation -- all MPI tasks must participate. As a result, these are not like point-to-point communications; they make use of the fact that all processes are involved to make higher-order optimizations.
Because the collective operations imply so much synchronization, it just doesn't make sense to allow other tasks to check to see whether they should start participating in a collective; it's something which has to be built into the logic of a program.
The root process's role in a broadcast is not like a Send; it can't, in general, just call MPI_Bcast and then proceed. The implementation will almost certainly block until some number of other processes have participated in the broadcast; and
The other processes' role in a broadcast is not like receiving a message; in general they will be both receiving and sending information. So participating in a broadcast is different from making a simple Recv call.
So Probe won't work; the documentation for MPI_Probe is fairly clear that it returns information about what would happen upon the next MPI_Recv, and Recv is a different operation than Bcast.
You may be able to get some of what you want in MPI 3.0, which is being finalized now and allows for non-blocking collectives -- e.g., MPI_Ibcast. In that case you could start the broadcast and call MPI_Test to check on the status of the request. However, even here, everyone would need to call MPI_Ibcast first; this just allows easier interleaving of collective and point-to-point communication.
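A minimal sketch of that non-blocking collective pattern (assumes MPI 3.0 and an MPI environment; do_some_local_work is a hypothetical placeholder):

```c
/* Every rank starts the broadcast, then overlaps local work while
 * testing for completion. All ranks must still call MPI_Ibcast -
 * the broadcast cannot be discovered via probing. */
MPI_Request req;
int flag = 0;

MPI_Ibcast(data, count, MPI_INT, root, MPI_COMM_WORLD, &req);
while (!flag) {
    do_some_local_work();                   /* hypothetical helper */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
}
```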
I was wondering how TCP/IP communication is implemented in Unix. When you do a send over a socket, does the TCP-level work (assembling packets, checksums, etc.) get executed in the same execution context as the calling code?
Or, what seems more likely, is a message sent to some other daemon process responsible for TCP communication? That process would then take the message and perform the requested work of copying memory buffers, assembling packets, etc., so the calling code resumes execution right away and the TCP work is done in parallel. Is this correct?
Details would be appreciated. Thanks!
The TCP/IP stack is part of your kernel. What happens is that you call a helper method which prepares a "kernel trap". This is a special kind of exception which puts the CPU into a mode with more privileges ("kernel mode"). Inside of the trap, the kernel examines the parameters of the exception. One of them is the number of the function to call.
When the function is called, it copies the data into a kernel buffer and prepares everything for the data to be processed. Then it returns from the trap, the CPU restores registers and its original mode and execution of your code resumes.
Some kernel thread will pick up the copy of the data and use the network driver to send it out, do all the error handling, etc.
So, yes, after copying the necessary data, your code resumes and the actual data transfer happens in parallel.
Note that this is for TCP packets. The TCP protocol does all the error handling and handshaking for you, so you can give it all the data and it will know what to do. If there is a problem with the connection, you'll notice only after a while since the TCP protocol can handle short network outages by itself. That means you'll have "sent" some data already before you'll get an error. That means you will get the error code for the first packet only after the Nth call to send() or when you try to close the connection (the close() will hang until the receiver has acknowledged all packets).
The UDP protocol doesn't buffer. When the call returns, the packet is on its way. But it's "fire and forget", so you only know that the driver has put it on the wire. If you want to know whether it has arrived somewhere, you must figure out a way to achieve that yourself. The usual approach is to have the receiver send an ack UDP packet back (which also might get lost).
No - there is no parallel execution. It is true that the execution context when you're making a system call is not the same as your usual execution context. When you make a system call, such as for sending a packet over the network, you must switch into the kernel's context - the kernel's own memory map and stack, instead of the virtual memory you get inside your process.
But there are no daemon processes magically dispatching your call. The rest of the execution of your program has to wait for the system call to finish and return whatever values it will return. This is why you can count on return values being available right away when you return from the system call - values like the number of bytes actually read from the socket or written to a file.
I tried to find a nice explanation for how the context switch to kernel space works. Here's a nice in-depth one that even focuses on architecture-specific implementation:
http://www.ibm.com/developerworks/linux/library/l-system-calls/