MPI standard 3: when is a synchronous send complete?

In the MPI Standard, Section 3.4 (page 37), completion of a synchronous send means:
1. the send buffer can be reused
2. the receiver has started to receive data.
The standard says "has started" rather than "has completed", so I have a question. Imagine this case:
The sender calls MPI_Ssend, then a receive is matched and has started to receive data. At this point, the send is complete and the call returns. As the MPI standard says, the send buffer can be reused, so the sender modifies some data in it. Meanwhile, the receiver is receiving data very slowly (e.g. the network is very bad). How can we guarantee that the data finally received is the same as the original data stored in the sender's send buffer?

Ssend is synchronous: it cannot return before the matching Recv has been called.
Ssend is blocking: the function returns only when it is safe to touch the send buffer.
Synchronous and blocking are two different things; I know it can be confusing.
Most implementations of Send (MPICH, Open MPI, Cray MPI) work as follows:
For small messages, the send buffer is copied into memory reserved for MPI. As soon as the copy is done, the Send returns.
For large messages, no copy is made, so the Send returns once the entire send buffer has been sent to the network (which cannot happen before the Recv has been called, to avoid overloading the network memory).
So MPI_Send is blocking, asynchronous for small messages, and synchronous for large ones.
An Ssend works as follows:
As soon as the Recv has started AND the send buffer is either copied or fully in the network, the Ssend returns.
Ssend should be avoided as much as possible, because it slows down communication (the network needs to tell the sender that the Recv has started).
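The two completion conditions can be illustrated with a toy model in plain Python threads. This is only a sketch of the semantics, not MPI and not how any MPI library is implemented; the class `ToySyncChannel` and its methods are hypothetical names invented for the illustration. The point is that `ssend` does not return until the receive has started and the payload no longer lives only in the caller's buffer, so a later overwrite cannot corrupt what the receiver gets.

```python
import threading


class ToySyncChannel:
    """Toy single-message channel modelling synchronous-send semantics.

    Hypothetical illustration only; this is not MPI.
    """

    def __init__(self):
        self._recv_started = threading.Event()
        self._data_ready = threading.Event()
        self._staged = None

    def ssend(self, buf):
        # Condition 1: a matching receive has started.
        self._recv_started.wait()
        # Condition 2: the payload is out of the caller's buffer.
        # Here we snapshot it; a real library might instead finish
        # pushing the whole message to the network before returning.
        self._staged = bytes(buf)
        self._data_ready.set()
        # Only now does the call return, so the caller may reuse buf.

    def recv(self):
        self._recv_started.set()
        self._data_ready.wait()
        return self._staged


def demo():
    chan = ToySyncChannel()
    buf = bytearray(b"original")
    result = {}
    receiver = threading.Thread(target=lambda: result.update(data=chan.recv()))
    receiver.start()
    chan.ssend(buf)
    buf[:] = b"modified"  # safe: ssend has already returned
    receiver.join()
    return result["data"], bytes(buf)
```

In this model the receiver always sees the original bytes, whether the overwrite happens immediately after `ssend` returns or much later.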


Does MPI_Ssend/MPI_Issend use a system buffer?

According to the documentation, MPI_Ssend and MPI_Issend are, respectively, a blocking and a non-blocking send operation, both synchronous. The MPI specification says that a synchronous send completes when the receiver has started to receive the message, and that after that it is safe to update the send buffer:
The functions MPI_WAIT and MPI_TEST are used to complete a nonblocking
communication. The completion of a send operation indicates that the
sender is now free to update the locations in the send buffer (the
send operation itself leaves the content of the send buffer
unchanged). It does not indicate that the message has been received,
rather, it may have been buffered by the communication subsystem.
However, if a synchronous mode send was used, the completion of the
send operation indicates that a matching receive was initiated, and
that the message will eventually be received by this matching receive.
Bearing in mind that a synchronous send is considered complete when the message has only just started to be received, I am not sure about the following:
Is it possible that only part of the data has been read from the send buffer at the moment when MPI_Ssend or MPI_Issend signals send completion? For example, the first N bytes have been sent and received while the next M bytes are still being sent.
How can it be safe for the caller to modify the data before the whole message has been received? Does it mean that the data is necessarily copied to a system buffer? As far as I understand, the MPI standard permits the use of a system buffer but does not require it. Moreover, from here I read that MPI_Issend() never buffers data locally.
MPI_Ssend() (or the MPI_Wait() for an MPI_Issend()) returns when:
the receiver has started to receive the message
and the send buffer can be reused.
The second condition is met if the message has been fully received, or if the MPI library has buffered the data locally.
I have not read anywhere that the MPI standard prohibits data buffering.
From the standard, MPI 3.1, chapter 3.4, page 37:
A send that uses the synchronous mode can be started whether or not a matching
receive was posted. However, the send will complete successfully only if a matching receive is
posted, and the receive operation has started to receive the message sent by the synchronous
send. Thus, the completion of a synchronous send not only indicates that the send buffer
can be reused, but it also indicates that the receiver has reached a certain point in its
execution, namely that it has started executing the matching receive. If both sends and
receives are blocking operations then the use of the synchronous mode provides synchronous
communication semantics: a communication does not complete at either end before both
processes rendezvous at the communication. A send executed in this mode is non-local.

peer-to-peer epoll clients and deadlock

Suppose a peer-to-peer program uses epoll to perform asynchronous TCP reads from and writes to multiple peers. Naturally, this means that every file descriptor is set to nonblocking to allow epoll_wait to be called and for multiple sockets to be checked.
However, there is a potential issue. Suppose there are two peers: A and B. A tries to write a message to B, but B is congested or something and so the call to write returns -1 with errno set to EAGAIN. At this point, A goes to sleep on the call to epoll_wait.
But note that B is already stuck on its own call to epoll_wait. If B is never notified about A's failed attempt to send it a message, then B will never wake up and try to perform a read on A's socket, and the entire thing will deadlock. So my question is: is B guaranteed to be notified that A is attempting to send it a message, even if A gives up on the original write call and goes to sleep?
Even if the answer to the above is "yes", is it possible for a system like this to deadlock indefinitely because of application-layer desynchronization? i.e. A tries to write to B but fails, so it goes to sleep. Then B wakes up and tries to read from A, but fails because A went to sleep. etc.
Any protocol that had a possible state where both sides are permitted to wait for the other side to read before they read would be a fundamentally broken protocol. For peer-to-peer protocols, typically each end is prohibited from delaying reads just because it cannot write.
On the implementation side, typically every call to epoll_wait (or the equivalent way you discover ready I/O) checks for input on all descriptors the program is using. Reading is never deferred unless the application has unprocessed data that it has already read and it stops deferring as soon as that data is processed. Waiting for network activity before reading is generally a very bad idea.
This is why typical protocol-neutral TCP proxies use two processes or two threads. You can't just read from A and then go do a blocking write to B because you don't know if B is required to read before it writes.
This is also why calling recv with MSG_WAITALL is almost always a bad idea. The other end might be waiting for you to receive the bytes it has already sent before it sends any more. No protocol can allow one side to wait for all the bytes to be sent before reading any of them if it also allows the other side to wait until some bytes have been read before sending the rest of them!
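The interaction between a full send buffer and the peer's readiness can be sketched as follows. This is a minimal illustration under stated assumptions: it uses Python's `select.select` and a local `socket.socketpair()` in place of epoll and a real TCP connection, and the helper names `fill_until_blocked`, `drain`, and `demo` are made up for the example. It shows that when the writer hits EAGAIN, the other end has data waiting and is reported readable, and that once the other end reads, the writer becomes writable again.

```python
import select
import socket


def fill_until_blocked(sock, chunk=b"x" * 4096):
    """Write to a non-blocking socket until the kernel buffer is full."""
    sent = 0
    while True:
        try:
            sent += sock.send(chunk)
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK
            return sent


def drain(sock, nbytes):
    """Read up to nbytes from a blocking socket; return the count read."""
    remaining = nbytes
    while remaining > 0:
        data = sock.recv(min(65536, remaining))
        if not data:
            break
        remaining -= len(data)
    return nbytes - remaining


def demo():
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        sent = fill_until_blocked(a)  # A can no longer write
        # B has pending input, so a readiness check on B reports it.
        b_readable = bool(select.select([b], [], [], 0)[0])
        drained = drain(b, sent)      # B reads what A managed to send
        # With the data consumed, A is reported writable again.
        a_writable = bool(select.select([], [a], [], 1.0)[1])
        return sent, drained, b_readable, a_writable
    finally:
        a.close()
        b.close()
```

The sketch only works because B keeps reading regardless of whether it can write, which is the rule stated above.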

Some questions about MPI send modes

I'm trying to understand the specifics of the MPI send modes (send, bsend, ssend, rsend) and I have the following questions:
MPI_Send uses some buffer if no matching MPI_Recv/MPI_Irecv has been posted and the message is not too big to fit in that buffer (otherwise, MPI_Send waits for the matching recv). I know this is true (the situation is described here: Deadlock with MPI).
MPI_Bsend uses a buffer (the one set by the MPI_Buffer_attach function) only when no matching recv has been posted. Is that true?
Is the buffer for MPI_Bsend the same as the buffer for MPI_Send?
MPI_Ssend never uses a buffer. Is that true? Or does MPI_Ssend behave like MPI_Send (a buffer is used if the message size does not exceed the buffer size)?
If the answer to any of my questions is "not true", could you give me a detailed answer with explanations?
The precise behavior of MPI_Send depends on the implementation. In addition, some implementations allow the user to tune the threshold size.
Check MPI's Send Modes for some detailed information. If you want to make sure your program is portable to other MPI implementations, refer to MPI standard (section 3.4: Communication Modes). For the standard mode (MPI_Send), here's what the standard says (as of MPI 3.1).
The send call described in Section 3.2.1 uses the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.
Thus, a send in standard mode can be started whether or not a matching receive has
been posted. It may complete before a matching receive is posted. The standard mode send
is non-local: successful completion of the send operation may depend on the occurrence of
a matching receive.
Your main misconception is thinking that MPI_Send uses buffering if MPI_Recv has not been called by the receiver process. Actually, it usually depends on the message size, regardless of whether the matching receive has been called.
If buffering is used, the user's send buffer is released after the data is copied to a temporary buffer. The program can then continue its execution whether or not the corresponding receive has been issued.
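The size-dependent behavior described above can be sketched as a toy model in Python threads. It is not MPI and does not reflect any particular implementation; `ToyStandardChannel`, its `eager_limit` parameter, and the value 16 are all assumptions made up for the illustration. A small message is copied and the send returns at once, while a large message makes the send wait until a receive has been posted.

```python
import threading


class ToyStandardChannel:
    """Toy model of a standard-mode send with a size threshold.

    Hypothetical illustration only; this is not MPI.
    """

    def __init__(self, eager_limit=16):
        self.eager_limit = eager_limit  # assumed threshold, in bytes
        self._recv_posted = threading.Event()
        self._have_message = threading.Event()
        self._message = None

    def send(self, buf):
        if len(buf) <= self.eager_limit:
            # Small message: copy into a temporary buffer and return,
            # whether or not a receive has been posted.
            self._message = bytes(buf)
            self._have_message.set()
            return "buffered"
        # Large message: wait for the matching receive first.
        self._recv_posted.wait()
        self._message = bytes(buf)
        self._have_message.set()
        return "rendezvous"

    def recv(self):
        self._recv_posted.set()
        self._have_message.wait()
        return self._message


def demo():
    small = ToyStandardChannel()
    mode = small.send(b"small")  # returns with no receive posted
    return mode, small.recv()
```

In this model, calling `send` with a message larger than `eager_limit` from the only thread would block forever, which mirrors the deadlock that can occur when two processes both send large messages before receiving.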

WSAECONNABORTED when using recv for the second time

I am writing a 2D multiplayer game consisting of two applications, a console server and a windowed client. So far, the server has an FD_SET which is filled with connected clients, a list of my game object pointers and some other things.
In main(), I start listening on a socket and create three threads. The first accepts incoming connections and places them in the FD_SET. The second processes the objects' location, velocity and acceleration and flags them (if needed) as ones that have to be updated on the client. The third uses the send() function to send update info for every object (iterating through the list of object pointers). Such a packet consists of an operation code, the packet size and the actual data.
On the client I parse it by reading the first 5 bytes (the opcode and packet size), which are received correctly, but when I want to read the remaining part of the packet (since I now know its size), I get a WSAECONNABORTED (error code 10053). I've read about this error, but can't see why it occurs in my application. Any help would be appreciated.
The error means the system closed the socket. This could be because it detected that the client disconnected, or because it was sending more data than you were reading.
A parser for network protocols typically needs a lot of work to make it robust, and you can't tell how much data you will get in a single read(). For example, you may get more than your operation code and packet size in the first chunk you read, or you might get less (e.g. only the operation code). Double check this isn't happening in your failure case.
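One common way to cope with short reads is to loop until exactly the expected number of bytes has arrived. The sketch below is in Python rather than the Winsock C API the question uses, and the header layout (a 1-byte opcode followed by a 4-byte big-endian payload size, 5 bytes in total) is an assumption chosen to match the "first 5 bytes" mentioned in the question. The names `recv_exact` and `recv_packet` are hypothetical helpers.

```python
import socket
import struct

# Assumed layout: 1-byte opcode + 4-byte payload size, network byte order.
HEADER = struct.Struct("!BI")


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-packet")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_packet(sock):
    """Read one header, then exactly as many payload bytes as it announces."""
    opcode, size = HEADER.unpack(recv_exact(sock, HEADER.size))
    return opcode, recv_exact(sock, size)
```

Because each call asks for no more than the bytes still missing, the parser never consumes part of the next packet, and it keeps reading when a header or payload arrives in several pieces.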

unix network process

I was wondering how TCP/IP communication is implemented in Unix. When you do a send over the socket, does the TCP-level work (assembling packets, CRC, etc.) get executed in the same execution context as the calling code?
Or, what seems more likely, a message is sent to some other daemon process responsible for tcp communication? This process then takes the message and performs the requested work of copying memory buffers and assembling packets etc.? So, the calling code resumes execution right away and tcp work is done in parallel? Is this correct?
Details would be appreciated. Thanks!
The TCP/IP stack is part of your kernel. What happens is that you call a helper method which prepares a "kernel trap". This is a special kind of exception which puts the CPU into a mode with more privileges ("kernel mode"). Inside of the trap, the kernel examines the parameters of the exception. One of them is the number of the function to call.
When the function is called, it copies the data into a kernel buffer and prepares everything for the data to be processed. Then it returns from the trap, the CPU restores registers and its original mode and execution of your code resumes.
Some kernel thread will pick up the copy of the data and use the network driver to send it out, do all the error handling, etc.
So, yes, after copying the necessary data, your code resumes and the actual data transfer happens in parallel.
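The copy into a kernel buffer can be observed from user space. The following is a small sketch, assuming a local `socket.socketpair()` as a stand-in for a TCP connection; the function name `send_then_overwrite` is made up for the example. The sender overwrites its buffer right after the send call returns and before anyone has called recv, yet the receiver still gets the original bytes.

```python
import socket


def send_then_overwrite():
    a, b = socket.socketpair()
    try:
        buf = bytearray(b"original")
        a.sendall(buf)        # returns once the kernel holds its own copy
        buf[:] = b"modified"  # nobody has called recv() yet
        received = b.recv(len(buf))
        return received, bytes(buf)
    finally:
        a.close()
        b.close()
```

The receiver sees `b"original"` because the send call had already copied the data out of the user buffer before returning.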
Note that this is for TCP packets. The TCP protocol does all the error handling and handshaking for you, so you can give it all the data and it will know what to do. If there is a problem with the connection, you'll notice only after a while, since the TCP protocol can handle short network outages by itself. That means you'll have "sent" some data already before you get an error: you will get the error code for the first packet only after the Nth call to send() or when you try to close the connection (the close() will hang until the receiver has acknowledged all packets).
The UDP protocol doesn't buffer. When the call returns, the packet is on its way. But it's "fire and forget", so you only know that the driver has put it on the wire. If you want to know whether it has arrived somewhere, you must figure out a way to achieve that yourself. The usual approach is to have the receiver send an ack UDP packet back (which also might get lost).
No - there is no parallel execution. It is true that the execution context when you're making a system call is not the same as your usual execution context. When you make a system call, such as for sending a packet over the network, you must switch into the kernel's context - the kernel's own memory map and stack, instead of the virtual memory you get inside your process.
But there are no daemon processes magically dispatching your call. The rest of the execution of your program has to wait for the system call to finish and return whatever values it will return. This is why you can count on return values being available right away when you return from the system call - values like the number of bytes actually read from the socket or written to a file.
I tried to find a nice explanation for how the context switch to kernel space works. Here's a nice in-depth one that even focuses on architecture-specific implementation:
